Advanced Python users probably already know generators and generator comprehensions. This article is meant for people who are just getting started with Python and want to speed up some of their tasks.
I'll start with a short introduction to list comprehensions. A list comprehension is a nifty bit of syntactic sugar that makes creating lists a lot easier.
Take this list for example:
lst = []
for x in xrange(10):
    lst.append(x)
That is an awful lot of code for creating a simple list. We can do better:
lst = [x for x in xrange(10)]
Now that is a whole lot better. Easier to read, and far shorter to write. You can make list comprehensions as nifty as you like. Take this for example:
lst = [check(x) for x in xrange(10) if x % 2]
This creates a list of the results of check() for every odd number below ten. You could even go berserk and start nesting list comprehensions, but that would just be indulging your inner Perl coder:
lst = [x for x in [y for y in xrange(10)]]
Or build a list of every combination by chaining several for clauses together:
lst = [(x,y,z) for x in xrange(10) for y in xrange(10) for z in xrange(10)]
Which generates a nice list of a thousand items.
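To make the filtered and multi-for forms above concrete, here is a small sketch. The article never defines check(), so the version below (it simply squares its argument) is a hypothetical stand-in, and the code uses range instead of xrange so that it also runs on Python 3.

```python
def check(x):
    # Hypothetical stand-in: the article does not define check().
    return x * x

# Filtered comprehension: x % 2 is 1 (truthy) for odd numbers only.
odd_checked = [check(x) for x in range(10) if x % 2]

# The same thing written as an explicit loop.
odd_checked_loop = []
for x in range(10):
    if x % 2:
        odd_checked_loop.append(check(x))

# Several for clauses behave like nested loops, leftmost clause outermost.
triples = [(x, y, z) for x in range(10) for y in range(10) for z in range(10)]

print(odd_checked)     # [1, 9, 25, 49, 81]
print(len(triples))    # 1000
print(triples[:3])     # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```

The last line shows the ordering: z changes fastest because its for clause is the innermost one.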
If you create a list comprehension like this, Python evaluates the expression for every item right away and keeps the whole resulting list in memory.
Now let’s go on to a real world test case:
I have a text file containing numbers, with a random number of numbers per line, and the numbers are of random size. Can you quickly give me the sum of all the numbers? The file has to be read from stdin.
Sounds hard? Well I’ll show you the code straight away:
#!/usr/bin/env python
import sys
data = [[int(i) for i in line.split()] for line in sys.stdin]
total = 0
for line in data:
    total += sum(line)
print 'Grand total %d' % total
As you can see, I use a nested list comprehension. The outer comprehension reads every line from stdin, and the inner one turns each line into a list of the numbers on that line.
The output of the list comprehension would look like this:
[[1,2,3], [4,5,6]]
For my input, a 17 MB text file that I generated, the total comes to 17677692470. I ran this script ten times and it took 4.571 seconds on average, which is not too bad considering it has to dig through 17 MB of text spread over 322456 lines.
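To check the approach on something small, here is a sketch that feeds the same comprehension a hand-made list of lines in place of sys.stdin (iterating over a file yields its lines in the same way). The sample lines are invented for illustration, and print is written with parentheses so the sketch runs on both Python 2 and Python 3.

```python
# Stand-in for sys.stdin: any iterable of text lines works the same way.
sample_lines = ["1 2 3\n", "4 5 6\n", "\n", "  7   8\n"]

# The outer comprehension walks the lines; the inner one converts each
# line's whitespace-separated fields to integers.
data = [[int(i) for i in line.split()] for line in sample_lines]

total = 0
for line in data:
    total += sum(line)

print(data)                        # [[1, 2, 3], [4, 5, 6], [], [7, 8]]
print('Grand total %d' % total)    # Grand total 36
```

Calling split() without arguments copes with blank lines and repeated spaces, which is why the empty line simply becomes an empty list.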
However, we can make that faster: the run time can be cut by 35.33%.
Python also has generators and generator comprehensions (officially called generator expressions). This is not the article to go into depth about generators and the specifics of generator comprehensions; read the two PEPs I linked to get a feel for them. I'll just show you the edited code, which uses a generator comprehension to do the same trick.
#!/usr/bin/env python
import sys
data = ([int(i) for i in line.split()] for line in sys.stdin)
total = 0
for line in data:
    total += sum(line)
print 'Grand total %d' % total
That's all there is to it: I've changed the outer [ and ] to ( and ), and the comprehension has become a generator. Generators are lazy, so the next value is not calculated until the generator's .next() method is called. As a result, Python no longer reads the whole of stdin into memory; it reads a line only when one is needed.
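The laziness is easy to see with a small experiment. In the sketch below, parse() and the calls log are hypothetical helpers made up for the demonstration; they only record when work actually happens. It uses the built-in next(), which works on Python 2.6 and later as well as Python 3, instead of calling the .next() method directly.

```python
calls = []

def parse(line):
    # Record each call so we can see when the work actually happens.
    calls.append(line)
    return [int(i) for i in line.split()]

lines = ["1 2 3", "4 5 6", "7 8 9"]

eager = [parse(line) for line in lines]    # runs parse() three times right now
calls_after_list = len(calls)              # 3

del calls[:]                               # reset the log
lazy = (parse(line) for line in lines)     # parse() has not run yet
calls_after_creation = len(calls)          # 0

first = next(lazy)                         # parses only the first line
calls_after_first = len(calls)             # 1

rest = list(lazy)                          # draining it parses the remainder
calls_after_drain = len(calls)             # 3
```

A generator can only be consumed once: after the list(lazy) call it is exhausted, and iterating over it again yields nothing.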
The effect on this file is already dramatic: it takes 2.956 seconds on average to parse the same input file with the generator.
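For anyone who wants to check the 35.33% figure, the arithmetic on the two averages reported in this article can be sketched as follows. The variable names are mine; only the two timings come from the measurements above.

```python
list_seconds = 4.571         # average reported for the list comprehension
generator_seconds = 2.956    # average reported for the generator

saved = list_seconds - generator_seconds
reduction_percent = saved / list_seconds * 100
speedup = list_seconds / generator_seconds

print('%.2f%% less time' % reduction_percent)    # 35.33% less time
print('%.2fx as fast' % speedup)                 # 1.55x as fast
```

So the percentage describes the time saved relative to the list version; expressed as a ratio, the generator version is roughly one and a half times as fast on this file.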
However, if you were working with even larger input files (let's say 2 GB), you might not be able to do this with lists at all. Do you want to put 2 GB in memory? I don't think so.
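A rough way to see the memory difference is to compare the size of a list with the size of a generator that produces the same values. This is only a sketch: sys.getsizeof measures the container itself and not the integers it refers to, so it understates the real cost of the list. It also uses range so that it runs on Python 3; on Python 2, range itself builds a list, and xrange would be the lazy choice.

```python
import sys

n = 1000000

as_list = [x for x in range(n)]    # every element exists right away
as_gen = (x for x in range(n))     # only the generator's state exists

# getsizeof reports the container itself, not the integers it refers to,
# so the real footprint of the list is larger still.
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print(list_bytes > gen_bytes)         # True

# Both can be summed; the generator just produces its values one at a time.
print(sum(as_list) == sum(as_gen))    # True
```

The size of the generator stays the same however large n gets, while the list grows with every element.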
Read up on iterators if you work with this kind of stuff a lot, especially if you parse log files.
Feedback on the code in this article is welcome in the comments; I'll check them!