One concern a few readers raised was that my solution didn't scale to large files, since it read the entire file into memory at once. That's a valid point.
It took me a while to come up with an alternate implementation, but I think I've got it. The trick is using reduce() to keep running totals of all three counts in a single pass over the file.
from functools import reduce

def wc_func(filename):
    '''
    Description:
    1. use a generator to read the file and form
       a (1, word_count, char_count) tuple from each line
    2. use reduce to calculate a running total of each column
    '''
    with open(filename, 'r') as f:
        data = ((1, len(line.split()), len(line)) for line in f)
        return reduce(lambda x, y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]), data)

print(wc_func('bigdata.txt'))
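To make the reduce() step concrete, here is a minimal sketch with made-up sample tuples (not from any real file): each step adds two tuples element-wise, so the line, word, and character totals are carried along together.

from functools import reduce

# Each tuple is (line_count, word_count, char_count) for one line.
sample = [(1, 5, 27), (1, 3, 14), (1, 7, 41)]

totals = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), sample)
print(totals)  # (3, 15, 82)

# Passing an initial value of (0, 0, 0) would also cover the empty-file case,
# where reduce() with no initializer raises a TypeError.
totals = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), [], (0, 0, 0))
print(totals)  # (0, 0, 0)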
What about performance?
It depends. Under standard CPython, performance is painfully slow (basically unusable). PyPy tells a very different story, though: with a 1.7 GB test file, it's actually a bit faster than the native Linux "wc" command.
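If you want to time it yourself, here is one simple way (just a sketch, assuming wc_func as defined above and a local test file; this isn't necessarily how the numbers above were measured):

import time

# wc_func as defined earlier in this post; 'bigdata.txt' is a placeholder path.
start = time.perf_counter()
result = wc_func('bigdata.txt')
elapsed = time.perf_counter() - start

print(result)
print('elapsed: {:.2f} s'.format(elapsed))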