One concern a few readers raised was that my solution didn't scale to large files, since it read the entire file into memory at once. That's a valid point.
It took me a while to come up with an alternate implementation, but I think I've got it. The trick is using reduce() to keep running totals of all three counts in a single pass over the file.
from functools import reduce

def wc_func(filename):
    '''
    Description:
    1. use a generator to read the file and form
       a (1, word_count, char_count) tuple from each line
    2. use reduce to calculate a running total of each column
    '''
    with open(filename, 'r') as f:
        data = ((1, len(line.split()), len(line)) for line in f)
        return reduce(lambda x, y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]), data)

print(wc_func('bigdata.txt'))
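To make the reduce() step concrete, here is a minimal sketch with made-up sample tuples (not from any real file): each step adds two tuples element-wise, so the line, word, and character totals are carried along together.

from functools import reduce

# Each tuple is (line_count, word_count, char_count) for one line.
sample = [(1, 5, 27), (1, 3, 14), (1, 7, 41)]

totals = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), sample)
print(totals)  # (3, 15, 82)

# Passing an initial value of (0, 0, 0) would also cover the empty-file case,
# where reduce() with no initializer raises a TypeError.
totals = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), [], (0, 0, 0))
print(totals)  # (0, 0, 0)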
What about performance?
It depends. Under standard CPython, performance is painfully slow (basically unusable). PyPy tells a very different story, though: with a 1.7 GB test file, it's actually a bit faster than the native Linux "wc" command.
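If you want to time it yourself, here is one simple way (just a sketch, assuming wc_func as defined above and a local test file; this isn't necessarily how the numbers above were measured):

import time

# wc_func as defined earlier in this post; 'bigdata.txt' is a placeholder path.
start = time.perf_counter()
result = wc_func('bigdata.txt')
elapsed = time.perf_counter() - start

print(result)
print('elapsed: {:.2f} s'.format(elapsed))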