Tuesday, February 23, 2016

Functional Python (UPDATE)

I had some very interesting responses to my previous post.


One concern some people had was the fact that my solution didn't scale for large files because I was reading the entire file into memory all at once. That's certainly a valid point.

It took me a while to come up with an alternate implementation, but I think I've got it. The trick is using reduce() to sum the 3 columns in parallel.


from functools import reduce

def wc_func(filename):

    '''
    Description:

    1. use a generator to read the file and form

       a (1, word_count, char_count) tuple from each line

    2. use reduce to calculate a running total of each column

    '''

    with open(filename, 'r') as f:
        data = ((1, len(line.split()), len(line)) for line in f)
        return reduce (lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]), data)

print (wc_func('bigdata.txt'))


What about performance?

It depends. Using the standard CPython, the performance is painfully slow (basically unusable). But using pypy tells a very different story. With a 1.7Gb test file, performance is actually a bit better than the native "wc" Linux command. 


18:54 ~$ which wc /usr/bin/wc   18:54 ~$ which pypy /usr/local/bin/pypy   18:54 ~$ pypy --version Python 2.7.9 (9c4588d731b7, Mar 23 2015, 16:30:30) [PyPy 2.5.1 with GCC 4.6.3]   18:54 ~$ ls -lh bigdata.txt -rw-rw-r-- 1 mpoulin registered_users 1.7G Feb 22 23:48 bigdata.txt   18:54 ~$ time pypy wc_func.py ; time wc bigdata.txt (100000000, 400000000, 1800000000) real 0m39.695s user 0m30.376s sys 0m1.051s 100000000 400000000 1800000000 bigdata.txt real 0m40.433s user 0m31.336s sys 0m0.533s 18:56 ~$


Monday, February 22, 2016

Functional Python

Python allows you to write code in a variety of styles: procedural, object oriented, and functional. What is functional programming and why should you care?

What is functional programming?

Functional programming is a declarative style of programming that tries to minimize side effects and the use of variables. You might already be doing functional programming without realizing it.
  • Have you ever written an SQL query for a database? 
  • Have you ever piped a series of commands in a Linux shell?
If so, you've done functional programming.

A Python example

Linux has a "wc" command that counts lines, words, and characters in a text file. Let's compare two versions, one written in a procedural style and the other in a functional style.

Procedural wc

def wc_1(filename):
   '''
   Description:

   1. initialize some counter variables
   2. loop through a file line by line
   3. for each line, increment the counters
   '''

   line_count = 0
   word_count = 0
   char_count = 0

   with open(filename, 'r') as f:
       for line in f:
           line_count += 1
           char_count += len(line)
           word_count += len(line.split())

   return (line_count, word_count, char_count)


Functional wc

def wc_2(filename):
    '''
    Description:

    1. read the file into a list
    2. get the total number of lines
    3. sum the number of characters in each line
    4. sum the number of words in each line
    '''
    
    with open(filename, 'r') as f:
        lines = f.readlines()

    line_count = len(lines)
    char_count = sum(len(line) for line in lines)
    word_count = sum(len(line.split()) for line in lines)

    return (line_count, word_count, char_count)

Comparing the two versions

In the procedural version
  • you have to create 3 separate counter variables
  • you have to initialize them correctly
  • you have to update the counts for each line in the file
In the functional version
  • there are no counter variables
  • you let Python's sum() function tally the totals
I prefer the functional version. I can say "sum the number of characters in each line" or "sum the number of words in each line" and Python knows how to do that. And I know that Python won't make any stupid mistakes like forgetting to initialize a variable or increasing the count by the wrong amount.

Performance

The profiler shows us that performance is about the same for both versions. So with this particular input file there's no performance penalty for reading the entire file into memory.


prun wc_1('python_kwic.txt') 6046 function calls in 0.020 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.011 0.011 0.020 0.020 wc.py:21(wc_1) 2014 0.004 0.000 0.004 0.000 {method 'split' of 'str' objects} 4028 0.004 0.000 0.004 0.000 {len} 1 0.001 0.001 0.001 0.001 {open} 1 0.000 0.000 0.020 0.020 <string>:1(<module>) 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}


prun wc_2('python_kwic.txt') 6050 function calls in 0.017 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.008 0.008 0.017 0.017 wc.py:1(wc_2) 4029 0.004 0.000 0.004 0.000 {len} 2014 0.004 0.000 0.004 0.000 {method 'split' of 'str' objects} 1 0.001 0.001 0.001 0.001 {method 'readlines' of 'file' objects} 1 0.001 0.001 0.001 0.001 {open} 2 0.000 0.000 0.000 0.000 {sum} 1 0.000 0.000 0.017 0.017 <string>:1(<module>) 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}


Additional Resources


Functional Programming HOWTO -- https://docs.python.org/2/howto/functional.html

Functional Programming in Python -- http://www.oreilly.com/programming/free/functional-programming-python.csp

Mastering Python Lists -- http://amzn.com/B01BIHNWI0

Thursday, February 18, 2016

Mastering Python Lists (Kindle version)


How to Get Rid of Software Bugs Once and For All

If you suffer from confusing, hard to understand code… if you spend hours tracking down bugs… if you keep missing your project deadlines… then this message is just for you. Here’s why…

I can show you how to use Python’s built in list functions to write code that is faster, easier to understand, and bug-free.

And you need to realize, there is a cost to not dealing with this…

If You Ignore It, It Just Gets Worse

What most people do when facing a looming deadline is start pulling all-nighters. But for most people, that doesn’t work.

  • you start making stupid mistakes
  • you deliver code that’s barely been tested (or not tested at all)
  • your end users are unhappy with your slow, buggy code


And what happens if you just do nothing? If you just keep doing what you’ve been doing? You develop a reputation for poor quality and no one wants to deal with your inferior code.

How I Learned To Unleash The Power Of Python

I’ve got an answer that works. Here’s the story:

Over the last 30 years I’ve programmed in most of the popular languages. I slowly realized that each language has its own “super power” that makes it unique. C has special syntax for working with pointers. Java is designed around classes and objects. C++ has a special template notation. SQL is built on relational algebra and data sets. My AHA! moment came when I realized Python also has a “super power” : working with lists..

It Worked For Me, And It Will Work For You

Here’s what it did for me… By using Python’s built in functions, I was writing less code. My programs started to get smaller. I was finally able to test them and PROVE they were correct.

Finally, It’s Your Turn

When you order Mastering Python Lists today, you’ll get the keys to…

  • eliminating long, hard to understand blocks of code
  • speeding up your programs by a factor of 2X, 3X, or more
  • delivering reliable software that “just works”


And you are 100% safe to try this out. That’s all I’m suggesting. Just try it for 7 days to see if it works for you. If it does, you’ll be delighted - and I think that’s exactly what’s about to happen. If for some reason you’re not delighted with writing clean, reliable code that your boss or clients will love, then just let Amazon know - and you’ll get all your money back.

It’s Decision Time

You have a choice to make: Do what you’ve been doing (or worse, do nothing at all). You know where that will lead. Is that really where you want to go? Take a new action, and get a new result. Finally start writing the kind of quality code you know you are capable of.

Which do you really want for yourself? Buy today and have Mastering Python Lists delivered instantly to your Kindle device or PC.

Wednesday, February 17, 2016

Slice assignment

Slice assignment is a feature of Python that allows you to make bulk updates to a list.

The FizzBuzz test is described here: http://c2.com/cgi/wiki?FizzBuzzTest

"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."
 I'll implement a solution using slice assignment to demonstrate how it works.

Solution

Start with a list of numbers.

fizzbuzz = list(range(101)) 

Replace every multiple of 3 with the word "Fizz'. The length of the slice tells us how many replacement values we need,

fizzbuzz[3::3] = ['Fizz'] * len(fizzbuzz[3::3])

Do the same for the multiples of 5 and 15.

fizzbuzz[5::5] = ['Buzz'] * len(fizzbuzz[5::5])
fizzbuzz[15::15] = ['FizzBuzz'] * len(fizzbuzz[15::15])

Now print the results (skipping the first value of 0)

for i in fizzbuzz[1:]:
    print(i)



You can paste this code into Python Tutor if you want to step through it line by line.

Other solutions

Compare this solutions to others you might have seen. For example

for n in range(1, 101):
      line = ""
      if n%3 == 0:
          line = line + "Fizz"
      if n%5 == 0:
          line = line + "Buzz"

      if line:
          print line
      else:
          print n

Which one do you prefer? Why?

What if I told you the slice based solution runs 6X faster? Does that change your opinion?


You can find more tips and tricks in my book Mastering Python Lists.

Tuesday, February 16, 2016

Solving Project Euler Problem #1

A procedural solution

Drew Davis presents  a very nice solution to this problem at his blog
http://rdrewd.blogspot.com/2016/02/project-euler-problem-1-in-python.html?m=1

The final code he ends up with is

sum=0
for num in range(1000):
    if (num % 3)==0 or (num % 5) == 0:
        sum += num
print(sum)



There's nothing wrong with this solution, but it's very procedural and it requires examining every number from 1 to 1000.

A set based solution

There's another way to approach this.  We're being asked to sum the multiples of 3 and 5. Python's range function can give us those numbers directly without looping and without using the % operator.

multiples_of_3 = range(3, 1000, 3)
multiples_of_5 = range(5, 1000, 5)

There will be some duplication because any number that is a multiple of both 3 and 5 (like 15) will appear in both lists. The fix for this is to convert each list to a set and combine the sets with the "union" operator. This will eliminate any duplicates.

multiples = set(multiples_of_3) | set(multiples_of_5)

Now the only thing left is to print the sum.

print sum(multiples)

What about speed?

Here are both implementations and the timing results. The set based approach is about 3X faster.



def euler1a():
    multiples_of_3 = range(3,1000,3)
    multiples_of_5 = range(5,1000,5)
    multiples = set(multiples_of_3) | set(multiples_of_5)
    return sum(multiples)


def euler1b():
    sum=0
    for num in range(1000):
        if (num % 3)==0 or (num % 5) == 0:
            sum += num
    return sum




timeit euler1a()
10000 loops, best of 3: 30.4 µs per loop
timeit euler1b()
10000 loops, best of 3: 108 µs per loop


Monday, February 1, 2016

A List Oriented Approach To The Fizz Buzz Test

What is the FizzBuzz Test?


"Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”."
http://c2.com/cgi/wiki?FizzBuzzTest

Here is a classic solution.

for n in range(1, 101):
     line = ""
     if n%3 == 0:
         line = line + "Fizz"
     if n%5 == 0:
         line = line + "Buzz"

     if line:
         print line
     else:
         print n

Can we do better?

The for-loop solution gives the correct answer, but it's completely untestable. There's no way to check the intermediate results.

Here is a list-based solution that will allow us to check the results to any level of confidence we desire.

(NOTE: I'm only using the numbers 1..15 to keep the examples short.)

First, we generate 2 lists of words. 

fizz = ['Fizz' if i % 3 == 0 else '' for i in range(1,16)]
buzz = ['Buzz' if i % 5 == 0 else '' for i in range(1,16)]

You can check your work


list(enumerate(fizz,1))

[(1, ''),
 (2, ''),
 (3, 'Fizz'),
 (4, ''),
 (5, ''),
 (6, 'Fizz'),
 (7, ''),
 (8, ''),
 (9, 'Fizz'),
 (10, ''),
 (11, ''),
 (12, 'Fizz'),
 (13, ''),
 (14, ''),
 (15, 'Fizz')]


list(enumerate(buzz,1))
[(1, ''),
 (2, ''),
 (3, ''),
 (4, ''),
 (5, 'Buzz'),
 (6, ''),
 (7, ''),
 (8, ''),
 (9, ''),
 (10, 'Buzz'),
 (11, ''),
 (12, ''),
 (13, ''),
 (14, ''),
 (15, 'Buzz')]


Then we zip them and concatenate the strings to produce the output text.


fizzbuzz = [f+b for f,b in zip(fizz,buzz)]
['',
 '',
 'Fizz',
 '',
 'Buzz',
 'Fizz',
 '',
 '',
 'Fizz',
 'Buzz',
 '',
 'Fizz',
 '',
 '',
 'FizzBuzz']

Enumerating this list gets us the final answer.

list(enumerate(fizzbuzz,1))
[(1, ''),
 (2, ''),
 (3, 'Fizz'),
 (4, ''),
 (5, 'Buzz'),
 (6, 'Fizz'),
 (7, ''),
 (8, ''),
 (9, 'Fizz'),
 (10, 'Buzz'),
 (11, ''),
 (12, 'Fizz'),
 (13, ''),
 (14, ''),
 (15, 'FizzBuzz')]

Now we just have to decide whether to print the text or the number.

Putting it all together


# processing

fizz = ['Fizz' if i % 3 == 0 else '' for i in range(1,16)]
buzz = ['Buzz' if i % 5 == 0 else '' for i in range(1,16)]
fizzbuzz = [f+b for f,b in zip(fizz,buzz)]
solution  = [s if s else i for i,s in enumerate(fizzbuzz,1)]

# output

for x in solution:
    print x 


This solution is 4 lines shorter than the original for-loop solution, but I think there are two other advantages that are more important:

Advantage #1: clean separation between the calculations and output. It's easier to focus on the logic of solving the problem. The print statement is almost an afterthought.

Advantage #2: we've captured the intermediate calculations and we can check them. For example, what if you made this mistake in your logic (getting the modulo test backwards)

fizz = ['Fizz' if i % 3 else '' for i in range(1,16)]

It would be easy to see that the logic is backwards

[(1, 'Fizz'),
 (2, 'Fizz'),
 (3, ''),
 (4, 'Fizz'),
 (5, 'Fizz'),
 (6, ''),
 (7, 'Fizz'),
 (8, 'Fizz'),
 (9, ''),
 (10, 'Fizz'),
 (11, 'Fizz'),
 (12, ''),
 (13, 'Fizz'),
 (14, 'Fizz'),
 (15, '')]

That's a lot harder to see when the calculation is buried inside a for-loop.

A direct solution

There is a direct solution which is shorter and faster than anything else. It can be done with

  • no looping
  • no if tests
  • no use of the % operator

Start with a list of numbers.

fizzbuzz = range(16)

Replace every multiple of 3 with 'Fizz'.

fizzbuzz[3::3] = ['Fizz' for i in fizzbuzz[3::3]]

Replace every multiple of 5 with 'Buzz'.

fizzbuzz[5::5] = ['Buzz' for i in fizzbuzz[5::5]]

Replace every multiple of 15 with 'FizzBuzz'.

fizzbuzz[15::15] = ['FizzBuzz' for i in fizzbuzz[15::15]]

for i in fizzbuzz[1:]  :  print i


I have a lot more tips and tricks in my book Mastering Python Lists.