Roman on Software Engineering: Python profiling with cProfile

In this post I'd like to venture outside of the Agile territory for a change. There's still of course enough left to be said on that subject, but a bit of variety never hurt any blog.

So, I'll jump to "something completely different", i.e. Python profiling. It makes sense to start small, so today I'll go over the basics: i.e. take a simple programming task, implement a non-optimised solution, use profiler to find it, fix, and then move on to grander things.

The task for today is simple: take a text file as an input and count the number of sentences in it.

Attempt 1

Here's my first stab at the task. It's obviously non-optimised (as we want to show off profiling), but it also happens to be a bit incorrect.

import re, sys

SENTENCE_DELIMITER_PATTERN = r'[.!?]'

def countSentences(fileName):
   inputFile = open(fileName, "r")
   result = 0

   for line in inputFile:
      splitRes = re.split(SENTENCE_DELIMITER_PATTERN, line.strip())
      result += len(splitRes) - 1

   return result

if __name__ == "__main__":
   sentencesCount = countSentences(sys.argv[1])
   print sentencesCount

An input like that ?! Or maybe like that!! Or even like that...could easily throw it off. I could pull out a proper regex that gets most sentences right, but our focus today is optimisation, so I hope that the great regex spirit in the sky will forgive me for now.
(And yes, the error handling is non-existent, and output is not very descriptive etc etc)

In any case, running it over a sample input below gives the right answer of "6", so at least we do not have basic defects and can move on:

Test sentence 1.
Test sentence
2 ? Test
sentence 3!
Test sentence 4. Test sentence
5. Finished.

Now, let's unleash this upon War and Peace and do some timing:

    roman@localhost:~/blog$ time python countSentences.py WarAndPeace.txt
    37866
    real    0m0.810s
    user    0m0.388s
    sys     0m0.004s

Result looks about right, Count Leo Tolstoy could sure crack out a few sentences in his prime.
But the timing isn't too good, is it? Yes, I did not use top spec laptop for this, but still, we can do far, far better. Let's add some profiling.

Attempt 1 - Profiled

To profile, we need to do just this one change:

sentencesCount = countSentences(sys.argv[1])

with

   import cProfile
   cProfile.run('sentencesCount = countSentences(sys.argv[1])')

Disclaimer: There are more elegant ways of introducing profiling which do not involve source changes. This was the the quickest way to zoom in.

These are the results:
         390168 function calls in 3.206 seconds

   Ordered by: standard name

   ncalls tottime percall cumtime percall filename:lineno(function)
        1    0.000    0.000    3.206    3.206 <string>:1(<module>)
        1    0.838    0.838    3.206    3.206 countSentences.py:5(countSentences)
    65007    0.703    0.000    2.012    0.000 re.py:164(split)
    65007    0.551    0.000    0.753    0.000 re.py:226(_compile)
        2    0.000    0.000    0.003    0.002 sre_compile.py:178(_compile_charset)
        2    0.000    0.000    0.003    0.002 sre_compile.py:207(_optimize_charset)
        6    0.000    0.000    0.000    0.000 sre_compile.py:24(_identityfunction)
        2    0.003    0.002    0.003    0.002 sre_compile.py:258(_mk_bitmap)
        1    0.000    0.000    0.003    0.003 sre_compile.py:32(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:359(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:472(isstring)
        1    0.000    0.000    0.003    0.003 sre_compile.py:478(_code)
        1    0.000    0.000    0.004    0.004 sre_compile.py:493(compile)
        1    0.000    0.000    0.000    0.000 sre_parse.py:138(append)
        1    0.000    0.000    0.000    0.000 sre_parse.py:140(getwidth)
        1    0.000    0.000    0.000    0.000 sre_parse.py:178(__init__)
        8    0.000    0.000    0.000    0.000 sre_parse.py:182(__next)
        5    0.000    0.000    0.000    0.000 sre_parse.py:195(match)
        7    0.000    0.000    0.000    0.000 sre_parse.py:201(get)
        1    0.000    0.000    0.000    0.000 sre_parse.py:301(_parse_sub)
        1    0.000    0.000    0.000    0.000 sre_parse.py:379(_parse)
        1    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
        1    0.000    0.000    0.000    0.000 sre_parse.py:675(parse)
        1    0.000    0.000    0.000    0.000 sre_parse.py:90(__init__)
        1    0.000    0.000    0.000    0.000 {_sre.compile}
        3    0.000    0.000    0.000    0.000 {isinstance}
    65030    0.158    0.000    0.158    0.000 {len}
       41    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
    65007    0.198    0.000    0.198    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
    65007    0.556    0.000    0.556    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
    65007    0.198    0.000    0.198    0.000 {method 'strip' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {min}
        1    0.000    0.000    0.000    0.000 {open}
        3    0.000    0.000    0.000    0.000 {ord}
So what does this all mean?

The tottime column shows time spent in a function not counting further function calls. The cumtime column does the same including function calls.
For example, we spent 0.556 seconds in regex split calls; as it does not have any profiled invocations, both columns show the same value. On the other hand, our entry point, countSentences, has 0.838 as tottime and 3.206 as cumtime. It means that we spent most of the time in the various regex calls, and about 25% was spent on file iteration and other primitive calls.

ncalls is straightforward: it shows the number of times a particular function got called, and it is our first hint in optimising - we compiled the same regex 65007 times!

    65007    0.551    0.000    0.753    0.000 re.py:226(_compile)

Let's try and get rid of that.

Attempt 2

import re, sys

SENTENCE_DELIMITER_REGEX = re.compile(r'[.!?]')

def countSentences(fileName):
   inputFile = open(fileName, "r")
   result = 0

   for line in inputFile:
      splitRes = SENTENCE_DELIMITER_REGEX.split(line.strip())
      result += len(splitRes) - 1

   return result

if __name__ == "__main__":
   sentencesCount = countSentences(sys.argv[1])
   print sentencesCount

The main difference is that we are pre-compiling the regex this time.
Let's do a bit of timing, and see how far it took us:

   roman@localhost:~/blog$ time python countSentencesRoundTwo.py WarAndPeace.txt
   37866

   real    0m0.511s
   user    0m0.228s
   sys     0m0.016s

Not bad! We went from 0.81 sec down to 0.51 sec - almost 40% speed-up. We also have the same result, which is always good to verify when optimising.
We can still do better though; let's crank up the profiler once more (the technique is the same as in the first round):

         195025 function calls in 1.607 seconds

   Ordered by: standard name

   ncalls tottime percall cumtime percall filename:lineno(function)
        1    0.000    0.000    1.607    1.607 <string>:1(<module>)
        1    0.763    0.763    1.607    1.607 countSentencesRoundTwo.py:5(countSentences)
    65007    0.147    0.000    0.147    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    65007    0.507    0.000    0.507    0.000 {method 'split' of '_sre.SRE_Pattern' objects}
    65007    0.189    0.000    0.189    0.000 {method 'strip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
An immediate observation - there are far fewer lines: regex compilation was dominating our profiling result. Of course, we still have one regex object created outside of the profiling call, however considering that each compilation took below a millisecond (percall column in the first attempt), we can safely ignore it.

So, where do we go from here? Split and strip take their time - maybe is it possible to get rid of them?
Strip is definitely easy; just removing it should not affect the result.
Split is less obvious, but come think of it: we do not need the actual sentences - we just need to count them. This means that we can iterate over the input and count, rather than split it - this should also save the {len} call.
Another idea - why do we process line-by-line? There is nothing in our code that forces that: we could easily switch to chunk-based processing, which would spare some of the parsing.

With all of that, we come to

Attempt 3

import re, sys

SENTENCE_DELIMITER_REGEX = re.compile(r'[.!?]')
CHUNK_SIZE = 32 * 1024

def countSentences(fileName):   
   inputFile = open(fileName, "r")
   result = 0

   counter = 0
   while True:
      inputChunk = inputFile.read(CHUNK_SIZE)
      if len(inputChunk) == 0:
         break

      for _ in SENTENCE_DELIMITER_REGEX.finditer(inputChunk):
         result += 1
   return result

if __name__ == "__main__":
   sentencesCount = countSentences(sys.argv[1])
   print sentencesCount

A few items worthy of note:

We are using finditer to go over each sentence separator. As we're not interested in the actual sentences, we use an unnamed variable (underscore as per common convention).
We read chunk-by-chunk and break out of the loop as soon as the read comes empty.
CHUNK_SIZE is 32 KB. I cheated here a bit, and tuned it behind the scenes; of course it does not take a great effort to automate it and run comparative tests with various chunk sizes.

How does this perform then?

   roman@localhost:~/blog$ time python countSentencesRoundThree.py WarAndPeace.txt
   37866

   real    0m0.248s
   user    0m0.112s
   sys     0m0.000s

Almost a twofold speedup compared to the second attempt, and four times faster than our starting point!

Can we make it even faster?
Possibly - for example, by replacing the regex with bespoke string matching. However, that would be too much cheating: if you recall, at the start of the post I conceded the point that we are very far from matching sentences accurately. If I wanted to do it properly, we need regexes, so by removing them we'd stray too far from our original intent. Tolstoy would not approve.

Let's have one final glance at profiling:

          303 function calls in 0.212 seconds

   Ordered by: standard name

   ncalls tottime percall cumtime percall filename:lineno(function)
        1    0.000    0.000    0.212    0.212 <string>:1(<module>)
        1    0.175    0.175    0.212    0.212 countSentencesRoundThree.py:6(countSentences)
      100    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       99    0.000    0.000    0.000    0.000 {method 'finditer' of '_sre.SRE_Pattern' objects}
      100    0.037    0.000    0.037    0.000 {method 'read' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {open}

Not too much here, is there? It's still possible to make a few conclusions though - for example, that increasing the chunk size can't save more than 37 msec.
However, far more important is what is not there.
For example, looping over the iterator must be doing the heavy lifting, so where is it? Here we come to idiosyncrasies of how Python profiling works; it does not play that well with iterators and generators (amongst other things, which I hope to come to in due time). As a side note, it is possible though to gather profiling data for the hidden .next() iterator calls, but I'll skim over that to save a bit of space for now.

Summary

This all might seem like basic stuff, especially to those who have been doing Python development for a while. Yet, it is surprising how many times I've seen in the past all the inefficiencies covered in this post, and more. Partially this is because many of us make the journey into Python from other languages, which means that even senior developers may fall into a language-specific trap.

More specifically, it shows that profiling is not that hard, at least when we deal with serial algorithms, and that a combination of human+profiler can go a long way.

In later posts, I'd like to take this further and explore profiling of concurrent and asynchronous software with Twisted.

Acknowledgment

I've used SyntaxHighlighter for the coding examples - impressive and very easy to set up.

Resources

For more related resources, see:

- https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Profiling_Code
- http://developerblog.redhat.com/2014/02/17/profiling-python-programs
- https://docs.python.org/2/library/profile.html

Roman on Software Engineering

Saturday, 31 January 2015

Python profiling with cProfile - Part I: Intro

Attempt 1

Attempt 1 - Profiled

Attempt 2

Attempt 3

Summary

1 comment :

Blog Archive