Saturday, 25 September 2010

Evaluation of Scheme in Python

Those who have gone through SICP (Structure and Interpretation of Computer Programs) will have come across the implementation of a metacircular evaluator (an evaluator written in the same language that it evaluates) for Scheme in the fourth chapter.


Motivated by that, and drawing ideas from the wizard book, an evaluator for Scheme has been created in Python, so that Scheme statements can be evaluated in Python.

The creation of a metacircular evaluator for Scheme consists principally of two steps (a rough Python sketch of the cycle follows the list):

1) Eval: To evaluate a combination (a compound expression other than a special form), evaluate the subexpressions and then apply the value of the operator subexpression to the values of the operand subexpressions.
2) Apply: To apply a compound procedure to a set of arguments, evaluate the body of the procedure in a new environment.
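
As a rough illustration of this eval/apply cycle (a minimal sketch only; the names 'seval' and 'sapply' are hypothetical and are not the project's actual functions), expressions can be parsed into nested Python lists and reduced like this:

import operator

# Hypothetical sketch of the eval/apply cycle over parsed Scheme lists.
# "( * ( + 1 2 ) 3 )" would parse to ['*', ['+', 1, 2], 3].

def seval(expr, env):
    if isinstance(expr, (int, float)):
        return expr                                  # numbers are self-evaluating
    if isinstance(expr, str):
        return env[expr]                             # variable (or operator) lookup
    op = seval(expr[0], env)                         # evaluate the operator subexpression
    args = [seval(sub, env) for sub in expr[1:]]     # evaluate the operand subexpressions
    return sapply(op, args)

def sapply(proc, args):
    return proc(*args)                               # apply the operator to the operands

env = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '>': operator.gt}

print seval(['*', ['+', 1, 2], 3], env)              # prints 9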



Implementation of the Evaluator in Python:

Before getting into the creation and implementation details of the evaluator, please read this.
Now that you have an idea of the concept of an environment, let's get into further details of the creation of the evaluator.

The first step is basically to parse the user input and put it in a form that can be evaluated. After parsing, the user input is passed to the function gravity(). This function performs the task of classifying the user input and routing it to the appropriate functions; it is similar to 'eval' in the Scheme metacircular evaluator implemented in SICP.
According to the user input, the appropriate functions are invoked.
The different functions implemented in the evaluator are described below.

evaluate():
This function can be compared with the 'apply' procedure implemented in SICP. The function checks for the different operators and their corresponding operands. The operator is applied to the operands and the result is returned.

opdet():
The function is called from 'evaluate()'; it takes in a list and returns the operator that has to be applied to the operands.



If a simple expression is provided just to perform basic arithmetic and logical operations, only these two functions are invoked after the gravity() function, and we obtain the result. The situation, obviously, is a little different when relatively complex tasks are given (such as the processing of an 'if' statement, a function, an assignment operation, a recursive function and so on).


Processing of an 'if' statement involves invoking many more functions.

if_begin_eval():
This function detects the presence of any variables involved in the 'if' statement and fetches the values of such variables from the environment.

if_eval() :
This function is used to evaluate the 'predicate' of the 'if' statement and returns the result of the 'if' statement. It invokes the 'evaluate()' function and processes the entire 'if' statement according to the value returned by 'evaluate()'.
If a simple 'if' statement
( if ( > 4 5 ) 4 5 ) is input:
if_eval() passes the predicate part ( > 4 5 ) to 'evaluate()' and processes the if statement according to the returned value. Here the value returned by 'evaluate()' would be 0, since the condition ( 4 > 5 ) is not true. The value returned by 'if_eval()' would therefore be 5.

if_cons_alt():
The function evaluates the consequent and the alternative of the 'if' condition. A separate function for this process is especially helpful when the if statement consists of larger expressions and values.
The function returns a 2-element list of the final values obtained after evaluating the consequent and the alternative.
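
To make the control flow concrete, here is a minimal sketch of how a parsed 'if' form could be reduced. The helper 'eval_if' is hypothetical (it builds on the 'seval' sketch given earlier, not on the project's actual if_eval()):

# Evaluating a parsed form ['if', predicate, consequent, alternative].
# Here the predicate evaluates to False; the author's evaluate() returns 0
# instead, but both are falsy in Python.

def eval_if(expr, env):
    pred, consequent, alternative = expr[1], expr[2], expr[3]
    if seval(pred, env):                  # evaluate the predicate first
        return seval(consequent, env)     # true: take the consequent
    return seval(alternative, env)        # false: take the alternative

# ( if ( > 4 5 ) 4 5 ) parses to ['if', ['>', 4, 5], 4, 5] and yields 5.
print eval_if(['if', ['>', 4, 5], 4, 5], env)    # prints 5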



Evaluation of procedures takes us down a different path. We need functions that can extract, evaluate and execute a function. We also need a function to test whether the input procedure is recursive or non-recursive.

exec_call():
Checks whether the function is a simple (non-recursive) function of the sort:
( define ( lambda ( square ( x ) ( * x x ) ) ) )
or a recursive one of the sort:
( define ( lambda ( fact ( x ) ( if ( = x 1 ) 1 ( * x ( fact ( - x 1 ) ) ) ) ) ) )
and executes the procedure if it is non-recursive.

extract_funcbody():
Extracts the body of the function and passes it to evaluate_funcbody().

evaluate_funcbody():
Fetches the values of the variables in the function body and returns the function body to extract_funcbody(). This function is invoked only when the procedure is non-recursive.

exec_func_body():
Fetches the values of the variables in the function body. This function is invoked only when the input procedure is recursive.



Use of the module ops.
You will see a module named 'ops' being imported. This module consists of certain functions that can be performed on lists. The module is imported so as to get a feel of the way programming is done in Scheme, where any operation on lists can be performed using 5 basic functions. An attempt has been made to stick to combinations of these basic operations rather than using the full functionality of Python.

The source code of the project can be obtained here.

Concept and creation of Environment

One of the most important things you need to grasp while doing a project on interpreters and evaluators is the idea of an environment. Within the program, the environment is a collection of names (whether of variables or functions) and their corresponding values. While using such names you need an environment where you can store the values of those names so that you can fetch them for further use. While implementing the evaluator for Scheme in Python, the concept of an environment was implemented using Python's associative arrays: 3 dictionaries were used. Let's go through a few details of the creation and use of the environments.

Consider writing the scheme statements:

scheme>>>( define n 7 )
scheme>>>( define s 6 )
scheme>>>( if ( > n s ) n s )

In our 'if' statement we have not specified exact values; instead we have used variables. The question, then, is where we retrieve the values of those variables from.
The answer is simple: here the variables are global variables and are accessible from anywhere within the program. Hence we create a dictionary which has 'n' and 's' as its keys and their values as the values of the dictionary. This dictionary is created when a 'define' statement (not followed by a parenthesis) is input.

1) dict = {'n': '7', 's': '6'}

Obviously we cannot use this dictionary for all our purposes. We also need a dictionary where we can store variables and values that are local to a function. We have created another dictionary for this very purpose.

Consider the following:
scheme>>>( define ( sq ( lambda ( x ) ( * x x ) ) ) -----> statement 1
scheme>>>( sq 4 ) -----> statement 2

When the second statement is executed, a dictionary 'local_dict' is created in which the value 4 is associated with the variable 'x'.

2)local_dict={'x':'4'}

We use the 3rd dictionary for associating a function with its body. When 'statement 1' above is input, a dictionary 'func_dict' is created which contains the function name as its key and the body as its value.

3) func_dict = {'sq': '( lambda ( x ) ( * x x ) )'}

We have the functions 'create_env()' and 'create_localenv()' for creating the first and third dictionaries mentioned above. The second dictionary is created within the function 'exec_call()', where the processing of each function starts. Hence it is appropriate that the dictionary needed only for the local use of a function is created within exec_call().
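
Putting the three dictionaries together, here is a minimal sketch of how a name lookup could walk the environments (the helper 'lookup' and the dictionary names are illustrative, not the project's code):

global_dict = {'n': '7', 's': '6'}                    # from ( define n 7 ) and ( define s 6 )
func_dict   = {'sq': '( lambda ( x ) ( * x x ) )'}    # function name -> function body
local_dict  = {'x': '4'}                              # created inside exec_call() for ( sq 4 )

def lookup(name):
    # Check the local environment first, then fall back to the global one.
    if name in local_dict:
        return local_dict[name]
    if name in global_dict:
        return global_dict[name]
    raise NameError("unbound variable: %s" % name)

print lookup('x')    # '4'
print lookup('n')    # '7'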

Thursday, 23 September 2010

Analyzing the Tinypy Virtual Machine

The major task with tinypy is to analyze the working of its virtual machine. Analyzing and understanding the source code is, obviously, the major challenge: the code itself is not well arranged, and the lack of proper documentation makes the challenge even bigger. Proper use of the GNU debugger, 'gdb', and the indexing tool, ctags, makes the process of analysis of the VM a lot simpler. Perhaps the best way to understand the working of the VM is to use it to execute the byte code of a simple Python program and observe the actual flow of control. Here a sample program 'sample.py' is used:

def add(a,b):
    c = a + b
    return(c)
print(add(1,9))



Note that the use of the parentheses for built-in functions too is a part of the syntax in TinyPy. Obtaining sample.tpc is explained here.

cc vmmain.c -lm
./a.out sample.tpc


gives you the output:
10

Understanding the flow of control for this program would give an idea of the working of a segment of the virtual machine.
Each time the VM is invoked, a new instance of the VM is created:
tp_vm *tp = tp_init(argc, argv);

The key point to be understood in the analysis of the VM is that each object in tinypy is stored as a union in the C API.
Let's get into the details: tp_obj is tinypy's object representation.

typedef union tp_obj {
    int type;
    tp_number_ number;
    struct { int type; int *data; } gci;
    tp_string_ string;
    tp_dict_ dict;
    tp_list_ list;
    tp_fnc_ fnc;
    tp_data_ data;
} tp_obj;

The union tp_obj and its fields are shown. The field 'type' of the union indicates the type of the object: a type value of 1 indicates a number, type = 2 indicates a string, and type = 4 indicates a list. In our sample program, the objective is to add two numbers 'a' and 'b' and print the output. A function in the virtual machine, 'tp_add', does the job of adding the two arguments passed to it. The arguments can be numbers, strings or lists. Let's analyze the function 'tp_add'.

tp_obj tp_add(TP,tp_obj a,tp_obj b)

The function definition of 'tp_add' takes three arguments:
• TP -> the VM instance
• tp_obj a -> the union variable representing the object 'a' in the sample program
• tp_obj b -> the union variable representing the object 'b' in the sample program

Within the function tp_add, the types of the two objects are checked and the appropriate operation is performed. In our example, two numbers are to be added. As mentioned above, 'a' and 'b' are stored as unions, and the union contains fields for types such as numbers, strings, lists etc. We access the tp_number_ field of the union 'a' as 'a.number'. 'number' is a variable of the structure tp_number_, which contains a field 'val' that stores the exact value of the number; therefore a.number.val gives you the actual value. Analyzing the addition of strings is interesting as well: 'a.string' gives you the field that represents the string object. The union tp_obj contains a field of type tp_string_, a structure which includes the pointer that points to the address where the string is stored. The structure of tp_string_:

typedef struct tp_string_ {
    int type;
    struct tp_string_ *info;
    char const *val;
    int len;
} tp_string_;


An idea about the manipulation of lists would do no harm to our primary objective, the study of the working of the VM. Consider two simple lists in python :
a = [1,2,3] , b = [4,5,6]

Again we start from tp_add. The VM uses a function 'tp_extend' which returns a union containing the new (extended) list. You will see that the value of the field 'type' of the returned union is 4, which indicates a list. Access the field 'list' as 'a.list'. 'list' is a structure variable that includes a pointer *val pointing to another structure, '_tp_list':

typedef struct _tp_list {
    int gci;
    tp_obj *items;
    int len;
    int alloc;
} _tp_list;


The pointer 'items', as you can see, points to tp_obj, and dereferencing it gives you a tp_obj union. This union contains a single element of the list. Starting from the function tp_add:

*(r.list.val -> items)

would give you the union mentioned above. Accessing the field 'val' of the field 'number' (a variable of the structure tp_number_) of that union gives you the element of the list, when that element is a number. Now, the obvious question is how we obtain the next element of the list. For this you need to have some idea about the storage of the unions.

r.list.val -> items

gives you the address of the current union. Suppose the address is '0x96ca048'. To obtain the union which contains the next element of the list, you add the size taken by each union (sizeof(tp_obj)) to the address of the current union:

address of next union = address of current union + sizeof(union)

For example:

0x96ca048 + 0x10

would give you the union that stores the next element of the list; the size of the union here is 0x10, i.e. 16 bytes.


The call path of tp_add can be traced using the 'bt' (backtrace) command in gdb.

An Overview of TinyPy

TinyPython is a minimalist implementation of Python in 64K of code. Its parser and byte-code compiler are written in TinyPython itself. It is also fully bootstrapped in the sense that TinyPython first converts a Python script (.py) into a special TinyPy byte-code format (.tpc), and this generated byte-code is then passed to a subset of the TinyPython source code called the Virtual Machine, where the actual execution takes place.

One can even extend the idea: if the VM is compiled into a low-level format suited to a particular microcontroller, the VM can reside inside that chip, and .tpc files can be downloaded into the chip as its input.

As outlined above, TinyPy comprises two phases: byte-code generation, and execution of the byte-code in the TinyPy Virtual Machine. Of these, the first phase will not be covered here.

Building Up
The TinyPython source code used was downloaded from here. Initially, the file listing will look like the following figure.



You will find that the 'build' folder is empty. The 'doc', 'examples' and 'modules' folders may or may not contain documents, Python scripts and batteries (modules) respectively, depending on the downloaded package. LICENSE.txt and README.txt are self-explanatory names. CHANGES.txt contains a record of all the changes that the author of TinyPy thought of making in the source code at some point. ROADMAP.txt gives a brief description of the features of TinyPy and an idea about the future developments to be implemented. 'setup.py' contains the initial code to build TinyPy from scratch. At the terminal, type as follows:
python setup.py linux

It is implied that you need a Python interpreter available on your system. The 'linux' option specifies that the code is to be compiled so as to work in a Linux environment. After running this command, a new executable 'tinypy' will appear in the 'build' folder as shown.



To fully bootstrap and test TinyPy, give the command as,
python setup.py linux boot

Now, in the 'tinypy' folder shown in the figure above, two new executables, 'tinypy' and 'vm', will appear; these are the TinyPy parser and Virtual Machine respectively. It can be noticed that all the .pyc files for the corresponding .py files have been generated by the Python interpreter. In addition, some test scripts (named Temp) will be invoked too. The most interesting thing is the presence of new .tpc files for some of the initial .py files. The general usage and some available options are listed below.

python setup.py command [option] [module]
• 64k - build a 64k version of the TinyPy source code
• blob - build a single tinypy.c and tinypy.h

The tinypy folder will now have the following contents. Of these, the 'py2bc.py' script is used to convert a user-generated Python script into its corresponding .tpc file. The format is:
tinypy_path py2bc.py sample.py sample.tpc

Here, tinypy_path is the path (relative to the current position) of either the tinypy executable in the 'build' folder or the one in the 'tinypy' folder. 'sample.py' is the name of the user script, and 'sample.tpc' is the name given to the byte-code-converted file. Or you can simply give it as:
python py2bc.py sample.py sample.tpc

Finally, the generated byte-code (.tpc) is passed into the VM for execution. Assuming the current directory is the 'tinypy' folder, it is done as:
vm sample.tpc

Or logically,
gcc vmmain.c -lm
./a.out sample.tpc

The 'vmmain.c' file will be present in the 'tinypy' folder. It contains the main function of the VM, which runs and pulls in all the other files necessary for the VM. It is necessary to link the math library too, hence the option '-lm'. And now the output is obtained and displayed. For a better picture, the files actually needed for the VM are:



Writing and compiling the code accounts for only half of the process. The other half is debugging and understanding the flow of control within the source code. To do that, make use of the GNU debugging tool, 'gdb'.
gcc -g vmmain.c -lm
gdb ./a.out

Inside gdb, you can set a breakpoint on any function. Then run the process on the byte-code you need. Here 'sample.tpc' is used as an example.

(gdb) run sample.tpc OR r sample.tpc

Another essential tool is 'ctags'. After it is installed, go to the 'tinypy' folder and build the tags file as follows:

ctags *.c *.h

You can see that a new file named 'tags' is now available. Now, when you are inside a .c or a .h file, you can use 'ctrl + ]' and 'ctrl + T' to jump back and forth between cross references spanning different files.

Wednesday, 22 September 2010

Find the Endianity of your Processor:

Your processor can be of two types:
1) Little endian: the low-order byte of the number is stored in memory at the lowest address. For example, a two-byte short int would be stored in memory as:
Base address + 0 : Byte 1
Base address + 1 : Byte 0


2) Big endian: the high-order byte of the number is stored in memory at the lowest address. The short int above would be stored as:
Base address + 0 : Byte 0
Base address + 1 : Byte 1


where Byte 0 and Byte 1 are the first and second bytes of the short integer.

You can verify the endianity of your processor by writing a short and sweet C program as follows:

#include <stdio.h>

int main(void)
{
    unsigned char *p;

    short int i = 0x1234;

    /* point a one-byte-wide pointer at the first byte of i */
    p = (unsigned char *)&i;

    printf("%x\n", *p);    /* prints the byte stored at the lowest address */

    return 0;
}

The result you obtain would show you the endianity of your processor.

Let's give a thought to what actually happens.

0x1234 is a hexadecimal number and is stored in a variable 'i' of type short int (which occupies 2 bytes of memory).

The address of the variable 'i' is stored in a pointer 'p' of type 'unsigned char *' (i.e. 'p' points to an object of type 'unsigned char').

Dereferencing the pointer p results in the access of just a single byte: the byte at the lowest address of the two. This shows that accessing a memory location through a pointer depends purely on the type of the pointer, and not on the type of the actual value stored at the address being accessed. Note that the type of the pointer and the type of the value actually stored at the location differ here.

You obtain the value stored in the lowest-addressed byte of the two bytes allocated for the short integer by using a pointer of type 'unsigned char' (size = 1 byte). The value stored at that byte indicates the endianity.

From our sample program, and from what we know about little and big endian, an output of 34 indicates that your processor is little endian, and an output of 12 indicates that it is big endian.
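
If you would rather check from Python, the standard library reports the same information directly; a small sketch (Python 2 syntax, matching the rest of this blog):

import sys
import struct

print sys.byteorder                  # 'little' or 'big'

# The same one-byte trick as the C program: pack 0x1234 in native
# byte order and inspect the byte at the lowest address.
low_byte = struct.pack('=h', 0x1234)[0]
print '%x' % ord(low_byte)           # 34 on little endian, 12 on big endian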

Just to mention: 'Intel x86' processors are one common example of little endian processors, whereas the 'Motorola 6800' is big endian.

Tuesday, 14 September 2010

Crash your python Virtual machine!!!

Causing the Python virtual machine to crash is not what I intended, but it just happened. Since it did happen, I would like to explain what occurred and the reason for it.

While trying out memoization, I tried to increase the maximum recursion limit, which was set to 1000 on my system.
To change your maximum recursion limit, what you need to do is:
import sys
sys.setrecursionlimit( MY LIMIT )


This changes the maximum number of times that your code segment can recurse.
Everything worked fine until I did this:
sys.setrecursionlimit(46765)

Now my recursion limit is set to 46765. I tried to run my Python code, and it produced a segmentation fault, something you wouldn't normally associate with Python code. This tells you that the Python virtual machine itself has crashed.
Let's try to find out the reason behind that:

Recursion consumes the stack space of the system. Setting the recursion limit to 46765 means that the amount of stack that can now be consumed is much greater. We know that the standard Python virtual machine is written in C. As you increase your recursion limit, a point is reached where there is not enough stack memory for the C back end to execute your Python code; this results in a segmentation fault in C that is propagated back to Python and displayed in your terminal.
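
A minimal sketch that reproduces the crash (the exact limit at which the interpreter dies depends on your platform's stack size, so treat these numbers as illustrative):

import sys
sys.setrecursionlimit(46765)

def recurse(n):
    # Each call consumes a C stack frame inside the interpreter.
    if n == 0:
        return 0
    return 1 + recurse(n - 1)

# Deep enough recursion overflows the C stack and segfaults
# instead of raising a clean RuntimeError.
print recurse(40000)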

Memoization: Run your programs faster.

Memoization is an optimization technique to speed up programs: function calls are not required to repeat calculations of previously calculated results. The processing time of the program is reduced while the memory consumed increases; memoized functions are optimized for speed at the cost of higher memory use.

Let's go through an analysis of memoization. Consider the code segment for finding the nth term of the Fibonacci series:


def fib(n):
    if n == 0:
        return 0
    if n == 1:
        return 1
    else:
        n = fib(n-1) + fib(n-2)
        return n


You could find out the time taken to run the code :
$ time python filename.py


You will then find that the running time of your program grows exponentially, roughly O(exp(n)). Try to find the 40th term of the Fibonacci series: this code segment will make you wait a considerable amount of time before you see your result in the terminal. Trying to find the 50th term will force you to interrupt the process manually, because you cannot wait any longer. Therefore we need a method by which we can find the higher terms of the series much faster.

Here's exactly where memoization comes in. Memoization is not any kind of magic: what it does is cache results that have been calculated earlier and store them in a lookup table, so that they don't have to be calculated again. We make use of an associative array (a dictionary) as the lookup table here. The following code segment shows the memoized version.

table = {0: 0, 1: 1}

def fib1(n):
    if n not in table:
        table[n] = fib1(n-1) + fib1(n-2)
    return table[n]


Here we create an associative array with the first two numbers of the Fibonacci series as its entries. Upon each function call, the lookup table ('table') is checked for the existence of the term. If the term is not present in the associative array, it is calculated and stored there. In both cases the resulting term is returned. Now see for yourself the pace at which the results are processed.
Try to find the 100th term of the series:
$ time python myfilename.py
You would see the 100th term of the Fibonacci series in about 12/1000th of a second. Try the 1000th and 10000th terms (provided that you have a high enough recursion limit). Did you ever think it could be this fast?
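
The same caching idea can be packaged as a reusable decorator; here is a minimal sketch (on Python 3.2 and later, the standard library ships functools.lru_cache for exactly this purpose):

def memoize(func):
    cache = {}                        # the lookup table, keyed by argument
    def wrapper(n):
        if n not in cache:
            cache[n] = func(n)        # compute once, remember forever
        return cache[n]
    return wrapper

@memoize
def fib2(n):
    if n < 2:
        return n
    return fib2(n-1) + fib2(n-2)

print fib2(100)    # 354224848179261915075, almost instantly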

Sunday, 5 September 2010

Intermezzo:Coding style

When writing long programs, it is important that, as a programmer, you write your code in the best possible structure so that it is readable to others. Most languages can be written in different styles, and one may be better than another. Choosing an appropriate coding style is important so that reading your code is not an itch to the eyes.
PEP 8 has emerged as the style guide that most Python code adheres to. The most important points to be followed are:
1) Use 4-space indentation and no tabs.

2) Use blank lines to separate functions and classes, and large blocks of code within functions.

3) Use comments where possible.

4) Use docstrings in classes and functions.

5) Use spaces around operators and after commas, e.g. 'c = a + b'.

6) Name your classes and functions consistently. Always use 'self' as the name of the first argument of a method in a class.

7) Wrap lines so that they don't exceed 79 characters.

Try using all of the particulars listed above to make your code eye-pleasing and, obviously, easy to understand.

The import search path in Python

As a Python programmer, I feel it necessary to mention the library search path in Python. Python looks in several places on your system when you try to import a module.
Precisely, it searches all the directories listed in sys.path. Just try and see the directories in sys.path:
>>> import sys
>>> sys.path
This returns a list of the directories where Python does a 'look in' to import your module. The list may vary from system to system and depends on the Python version that you use. For me, sys.path is:

['', '/usr/lib/python2.5', '/usr/lib/python2.5/plat-linux2', '/usr/lib/python2.5/lib-tk', '/usr/lib/python2.5/lib-dynload', '/usr/local/lib/python2.5/site-packages', '/usr/lib/python2.5/site-packages', '/usr/lib/python2.5/site-packages/Numeric', '/usr/lib/python2.5/site-packages/PIL', '/usr/lib/python2.5/site-packages/gst-0.10', '/var/lib/python-support/python2.5', '/usr/lib/python2.5/site-packages/gtk-2.0', '/var/lib/python-support/python2.5/gtk-2.0']

A few points about sys.path:

Importing the sys module makes all the attributes and functions of the module available to you.

sys.path, as I mentioned, gives you the current search path.

You can add a new directory to the Python path by appending to the list 'sys.path':
sys.path.append('/my_directory/my_setup')

View sys.path again and you will find the directory mentioned above in the new search path.

Functions in Python.

Just like most other languages, Python has functions. Let's go through some of the aspects of functions in Python.

def my_function(var1, var2):
This is how you declare a function in Python: the keyword def, the function name, and the arguments separated by commas within parentheses. It is important to understand that functions in Python specify no return type. Each and every function returns something: either an explicit value, or else 'None', the Python null value. The arguments don't specify a type either; the type is made out from the values that are passed in.
Python is:
a dynamically typed language: types are discovered at run time;
a strongly typed language: one particular type cannot be converted to another type without explicit type conversion.

Higher-order functions in Python: functions that operate on other functions (functions that can take other functions as their arguments).
E.g. filter(my_function, list_1) is a built-in higher-order function which takes the function 'my_function' as one of its arguments.

Functions in Python are first-class.
Consider:
def my_function2(var1):
We have defined a function 'my_function2'. Now we can do this:
var = my_function2
Try printing the value of 'var': what is printed is the function object 'my_function2' (shown with its address), because assigning a function name to a variable makes the variable refer to the same function object.
Now we can invoke the function as:
var(my_parameter)
Just check out the glob module in Python (/usr/lib/python2.5/glob.py) for a good example of the first-class nature of functions in Python.
Function composition is perhaps the best way to illustrate that functions are first-class in Python: functions are passed in as arguments to other functions.

def compose(f, g):
    def h(x):
        return f(g(x))
    return h

The way variables are used in functions is also important. A good knowledge of the use of local and global variables is required to avoid any kind of confusion while using variables in Python.

Local and Global variables in python..

Let's see the working of global and local variables in Python with the help of a few small examples. What you need to understand is that variables assigned inside a function are local to that function (they cannot be accessed from outside the function), whereas variables declared outside the function are global to it: they can be accessed within the function.

>>>a=1
>>>def func():
... print a
... a=10
...
>>>func()

would produce an error as:
UnboundLocalError: local variable 'a' referenced before assignment.
This is because the local variable 'a' is used in the 'print' statement before it is assigned a value.
However ,

>>>a=1
>>>def func():
... global a
... print a
... a=10
...
>>>func()

wouldn't produce an error of that sort.
Note the use of the keyword 'global'. This keyword indicates that the variable 'a' being printed is a global variable. The Python interpreter understands that there is a variable 'a' that has been assigned the value 1 outside the function, and hence prints the value '1'.

Now do the following:

>>>a=1
>>>def func():
... global a
... print a
... a=10
... print a
>>>func()

The result would be:
1
10
Now try and print the value of 'a'.
The value printed would be:
'10'
This is because the value of the variable 'a' has been changed to 10 by the function
'func' because 'a' has been declared as a global variable in the function 'func'.

Now do :
>>>def func():
... print b
... b=10
...
>>>print b

would result in an error:
NameError: name 'b' is not defined

because no variable 'b' has been assigned a value outside the function. The variable 'b' used in 'func' is local to that function and cannot be accessed from outside it.

Version control Systems...

Version control is a system that records changes to your files and allows you to view specific versions of a file or set of files. A VCS is used so that you don't screw up your files, or your entire project, as a result of silly, careless mistakes.
Many people do their own version control by copying their files to another directory. This is a quite simple and easy approach, if you are careful enough not to do something disastrous and lose all your stuff later. Keeping in mind that such things can happen (after all, we are all human), we use version control systems.

Centralized VCS and Distributed VCS
Centralized VCS
A CVCS has a single server that contains the different versions of the files, and clients check in to that server. The major advantage of this system is that administrators have fine-grained control over what is happening, but it has serious drawbacks as well:
A single point of storage; a crash would be fatal.
Clients can pull anything from the repository, but they need commit access to push something into it. This might lead to a client getting frustrated if he/she has not been granted commit access even after working for a long period of time. There might also be political issues related to administration.

Distributed VCS
In a DVCS, you can push data into a repository with the same ease with which you pull from it. All the repositories are at the same level, and therefore we can collaborate with any of the repositories. Examples of DVCS are Git, Mercurial, Bazaar, etc.

Friday, 3 September 2010

Shebang lines...

Shebang (also known as 'hashbang') refers to the characters #! when they are the first two characters (at the very beginning of the first line) of your script. The line that contains the shebang is the shebang line of your script.
Now, what is the shebang meant to do?
In a Unix-like operating system, the program loader (the part of the operating system that is responsible for loading programs) finds these two characters (#!), identifies the file as a script, and executes it with the interpreter specified by the remainder of that first line.
For example, consider writing a shebang line for a Python script.
First and foremost, you need to find the location of the interpreter for that particular type of script:
$ which python
/usr/bin/python

Therefore the shebang line should be set as:
#! /usr/bin/python
Specifying the shebang line (and making the script executable with 'chmod +x') lets you execute your script as:
./scriptname.py instead of doing 'python scriptname.py'
You could try out the use of shebang lines for any of the following types of scripts:
'werbel', 'tinypy', 'perl'... and many more.
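
For instance, a two-line script (the file name hello.py is just an example) would be written and run like this:

#! /usr/bin/python
# hello.py - minimal script demonstrating the shebang line
print "hello, shebang"

$ chmod +x hello.py
$ ./hello.py
hello, shebang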

Wednesday, 1 September 2010

Optimizing your Python code.

Compare Python with some of the other languages that are familiar to you. The obvious difference you will catch is that Python code for the same program is much shorter than code written in some of your favourite languages like C, Java etc.
Understanding the essence of Python programming will help you write shorter, more interesting code that is also easier to understand. Let's have a look at some of the core points to follow so that your Python programs become short and effective.

First and foremost, you need to understand that Python has a huge library, with functions for almost every activity you would need frequently. The only thing you need to do is identify whether a function exists that handles your case, and then call it. There are lots of built-in functions we can call; other functions residing in certain modules can be invoked after the module containing them has been imported.
As an example, consider the following:
map(lambda x: x*x, range(1,100))

'map' is a built-in higher-order function that takes a function and one or more sequences and returns a list; 'range' is another built-in which returns a list containing all the elements from the starting element up to, but not including, the ending element.

This particular call returns a list of the squares of all numbers from 1 up to 99 in a single line of code. Compare this to the C code needed to find the squares of this many numbers; certainly, a single line wouldn't do that for you.

Python has a large collection of such built-ins, which includes functions like 'filter', 'reduce', 'hex', 'cmp' and many more.

Python has many more functions defined in different modules. To use those functions, you just need to import the modules and invoke them. Some of the most commonly used ones are:
Module 're' for pattern matching. Contains powerful functions like 'findall', 'search', 'match' etc.
Module 'os' for operating system services. Includes functions 'mkdir', 'chdir', 'path.join', 'path.abspath' etc.
Module 'shutil' for shell utilities like copying files.

Module 'urllib' for dealing with URLs.

Module 'sys': sys.argv[index] returns the argument given on the command line at position 'index'.

Identifying the correct functions present in these and other modules results in much shorter code.

A proper knowledge of the use of sequences and variables also helps in our purpose of decreasing the code size.

Consider the code to swap the values of two variables. In other languages you would write:
temp = a;
a = b;
b = temp;

Once again, in python all you need is a single line code :

a,b=b,a

The logic behind what happens should be understood. Consider the right-hand side of the expression: the variables are packed into a tuple. On the left-hand side, the tuple is unpacked and the values are assigned to the variables. Thus the values are swapped in a single line, as opposed to the three lines taken in other languages.

Use of generator expressions is another way to reduce your code size.
Consider:
xvec=[10,20,30]
yvec=[30,40,50]

The expression sum(x*y for x,y in zip(xvec,yvec)) gives you the dot product of xvec and yvec in a single step, where 'sum' is another built-in function and 'zip' is a built-in that pairs up the elements of the two lists. Think of writing a C program for this instead, and the amount of code it would take.
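
Put together as a runnable snippet (same variable names as above):

xvec = [10, 20, 30]
yvec = [30, 40, 50]

# zip pairs the elements, the generator expression multiplies each pair,
# and sum adds the products without building an intermediate list.
print sum(x*y for x, y in zip(xvec, yvec))    # 10*30 + 20*40 + 30*50 = 2600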


There are many more such tricks that decrease code size considerably. Slicing of lists and usage of the appropriate sequences are all important in producing short and elegant code. Obviously, the fact remains that the way to understand the methodology behind decreasing code size is to keep on coding, understanding the definitions of the functions and identifying where to use them. But again, compressing your code too heavily might make it hard to digest for others, especially beginners trying to understand a Python program. Therefore it is important to develop the skill of maintaining a balance between decreasing the code size and keeping your code readable. This can be achieved only through rigorous coding.

Garbage collection in Python.

In C, once allocated memory becomes garbage, that chunk of memory is wasted; nothing more can be done with it.
Consider
int *a = malloc(10 * sizeof(int));
Some amount of memory is allocated, and 'a' points to it. Now if we do a = some_other_value, the memory that was allocated becomes garbage, because 'a' no longer uses it; it points to something else now. The memory is wasted and is no longer usable.
The bugs that cause such loss of memory are hard to detect as well, simply because they cannot be seen. It is only after the particular code has executed a number of times that you actually get to see the effects of the bug; by that time your system may have crashed.

In Python, the interpreter is smart enough to do garbage collection: memory which is no longer used is reclaimed by the interpreter rather than wasted. Garbage collection is done using a method called reference counting. Understand reference counting as follows:
Consider a = [1,2,3]
'a' points to a particular block of memory. Associated with that block of memory there is a count that indicates the number of references pointing to the same block. So now the count is 1.

Now if we do

b = a

b points to the same block of memory, and the count becomes 2.

Now consider doing

b = 0

The count on the block of memory where b was pointing earlier decreases by 1; the count is now 1. Do

a = 0

and the count on that block becomes 0. Now the Python interpreter can hand that chunk of memory to something else as it sees fit. The significant thing to understand is that the memory is not wasted, as it would be in C.
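
You can watch the count change from Python itself: sys.getrefcount reports an object's reference count (always one higher than you might expect, because the argument passed to getrefcount is itself a temporary reference):

import sys

a = [1, 2, 3]
print sys.getrefcount(a)    # 2: the name 'a' plus the temporary argument

b = a                       # a second name points to the same block
print sys.getrefcount(a)    # 3

b = 0                       # drop one reference again
print sys.getrefcount(a)    # 2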

Garbage collection keeps track of the references that point to each block of memory and reclaims unused memory accordingly, so memory is not needlessly wasted. But there are obvious disadvantages to the process of garbage collection.

One drawback, as you would expect, is the matter of speed, which decreases.
But the more significant drawback is the non-deterministic behavior of the interpreter.
Consider the two statements:
i = 0
j = 0
You would expect the statement j = 0 to be executed immediately after the statement i = 0 has been executed. But as a result of the process of garbage collection, this is not guaranteed. There might be a certain delay in executing the second statement if the interpreter decides to do some garbage collection after the first statement has been executed.

Therefore it is clear that it is not best to use Python over languages like C when it comes to real-time systems.