Tuesday, February 23, 2016

Functional Python (UPDATE)

I had some very interesting responses to my previous post.


One concern several people raised was that my solution didn't scale to large files, because it read the entire file into memory all at once. That's certainly a valid point.
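
Roughly speaking, the issue is with patterns like the following (a simplified sketch of the memory-hungry approach, not the exact code from that post):

from functools import reduce

def wc_naive(filename):
    # readlines() pulls the whole file into a list before anything is counted,
    # so memory use grows with the size of the file
    with open(filename, 'r') as f:
        lines = f.readlines()
    return (len(lines),
            sum(len(line.split()) for line in lines),
            sum(len(line) for line in lines))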

It took me a while to come up with an alternative implementation, but I think I've got it. The trick is using reduce() to sum the three columns (lines, words, characters) in parallel.


from functools import reduce

def wc_func(filename):
    '''
    Description:
    1. use a generator to read the file and form
       a (1, word_count, char_count) tuple from each line
    2. use reduce to calculate a running total of each column
    '''
    with open(filename, 'r') as f:
        data = ((1, len(line.split()), len(line)) for line in f)
        return reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), data)

print(wc_func('bigdata.txt'))
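
To see the element-wise addition in isolation, here's a minimal standalone sketch with a couple of hypothetical sample lines standing in for a file:

from functools import reduce

# hypothetical sample lines standing in for an open file object
lines = ["hello world\n", "foo bar baz\n"]

# each line becomes a (1, word_count, char_count) tuple
rows = ((1, len(line.split()), len(line)) for line in lines)

# reduce() adds the tuples element-wise, giving (lines, words, chars)
print(reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), rows))
# (2, 5, 24)

Since the generator expression yields one tuple per line and reduce() only keeps the running total, memory use stays constant regardless of file size.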


What about performance?

It depends. Under standard CPython, performance is painfully slow (basically unusable). But PyPy tells a very different story: with a 1.7 GB test file, performance is actually a bit better than the native Linux "wc" command.


18:54 ~$ which wc
/usr/bin/wc
18:54 ~$ which pypy
/usr/local/bin/pypy
18:54 ~$ pypy --version
Python 2.7.9 (9c4588d731b7, Mar 23 2015, 16:30:30)
[PyPy 2.5.1 with GCC 4.6.3]
18:54 ~$ ls -lh bigdata.txt
-rw-rw-r-- 1 mpoulin registered_users 1.7G Feb 22 23:48 bigdata.txt
18:54 ~$ time pypy wc_func.py ; time wc bigdata.txt
(100000000, 400000000, 1800000000)

real    0m39.695s
user    0m30.376s
sys     0m1.051s

100000000 400000000 1800000000 bigdata.txt

real    0m40.433s
user    0m31.336s
sys     0m0.533s
18:56 ~$

