Sunday 15 March 2015

Python optimisation - Part IV: Another look at concurrency

The last optimisation post left a few threads hanging.
Yes, there was parallelism, but the performance gain was, shall we say, underwhelming. Back then I alluded to the I/O element of the task, but did not show what happens if we get rid of it.
Also, I moved the test bed to a different machine, but never said what would have happened had we stayed on the single-core one.

So, while this post might be a bit light on code and heavy on numbers, it will add more background on parallelism.


What happens when multiple threads run on a single core?

Let's take the multi-threaded example from last time and run it on a single-core machine.
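A side note: if you don't have a single-core machine handy, you can approximate one on Linux by pinning the process to a single core; taskset does exactly that (this is an aside on my part, not how the numbers below were produced):

time taskset -c 0 python pythonWithoutGIL.py urls.txt words.txt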

time python pythonWithoutGIL.py urls.txt words.txt

real    0m9.721s
user    0m8.004s
sys     0m0.216s


And now let's remove the magic GIL-release line from the C++ code (in CPython terms, the Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macro pair), recompile, and run again:

time python pythonWithGIL.py urls.txt words.txt

real    0m9.158s
user    0m7.268s
sys     0m0.208s


The locking solution is actually faster: about 10% less CPU time, and over half a second of wall clock! When you think about it, this is not surprising: the single core can execute only one thread at a time anyway, while having multiple runnable threads increases the number of context switches. The OS actively spends CPU cycles on preempting threads; with the GIL held, it rarely gets that opportunity.
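As another aside, if you want to see those context switches with your own eyes, GNU time's verbose mode reports voluntary and involuntary context switch counts (I did not capture these for this post):

/usr/bin/time -v python pythonWithoutGIL.py urls.txt words.txt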

The possibility of having too much concurrency occasionally takes people by surprise; even fairly experienced developers sometimes hardcode a thread pool size rather than derive the concurrency level from the number of available cores, as the sketch below does.
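A minimal sketch of the dynamic alternative, assuming Twisted as in our examples:

from multiprocessing import cpu_count
from twisted.internet import reactor

# Size the pool used by threads.deferToThread to match the core count,
# rather than hardcoding a number.
reactor.suggestThreadPoolSize(cpu_count())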

What performance gain do we get from parallelism when applied to pure processing tasks?

Let's take the networking element out. I'll slightly amend our long-running example to fetch the scanned content from the local filesystem. Here's the updated Python wrapper:
from twisted.internet import reactor, defer, threads
import sys
import contentMatchPattern

def stopReactor(ignore_res):
   # Fires once every file has been scanned; shuts the event loop down.
   reactor.stop()

def printResults(filename, matchingPatterns):
   print ': '.join([filename, ','.join(matchingPatterns)])

def scanFile(filename, pattern_regex):
   # Runs in a worker thread: read the file, delegate matching to the C++ extension.
   pageContent = open(filename).read()
   matchingPatterns = contentMatchPattern.matchPatterns(pageContent, pattern_regex)
   printResults(filename, matchingPatterns)

def parallelScan(filenames, patterns):
   # Longest patterns first, so the alternation prefers the longest match.
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = '(' + '|'.join(patterns) + ')'

   deferreds = []
   for filename in filenames:
      # Dispatch each file to the reactor's thread pool.
      d = threads.deferToThread(scanFile, filename, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelScan(filenames, patterns)
   reactor.run()
Very similar to what we had before; the only difference is that we take filenames rather than URLs as input. The content is exactly the same: I've pre-downloaded the Alexa top sites. In fact, everything else is exactly the same too; the C++ module consumes content as a string, so it couldn't care less whether that string arrived from the filesystem or the network.
Now, we will do a few runs on the same 6-core system from the previous post while increasing the set of words we search for (which further emphasizes the CPU-bound part of this exercise).

There will be three programs: Python-only, the C++ extension with the GIL released, and the C++ extension with the GIL left in place.
For the Python-only program, we will recompile the regexes for each of the input files, the same way the C++ extension does. This levels the playing field and makes sure that the skew from long regex compilation (especially with larger sets of search words) does not affect us.

I've already shown the Python wrapper for the C++ extension above; the Python-only code is below:
from twisted.internet import reactor, defer, threads
import sys, re, copy

def stopReactor(ignore_res):
   reactor.stop()

def printResults(filename, matchingPatterns):
   print ': '.join([filename, ','.join(matchingPatterns)])

def scanFile(filename, patterns):
   pageContent = open(filename).read()
   matchingPatterns = set()
   # Deliberately recompile the regex for every file, mirroring the C++ extension.
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))
   
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(filename, matchingPatterns)

def parallelScan(filenames, patterns):
   deferreds = []
   for filename in filenames:
      # Each thread gets its own copy of the list, since scanFile sorts it in place.
      d = threads.deferToThread(scanFile, filename, copy.deepcopy(patterns))
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelScan(filenames, patterns)
   reactor.run()
As promised, we are recompiling the regexes for each file.
So, without further ado, here are the results:
Pattern set size   Python only   Python with C++ (GIL locked)   Python with C++ (GIL released)
500                1.44s         1.19s                          1.15s
4000               4.97s         2.69s                          1.63s
8000               8.82s         4.99s                          2.36s
16000              16.65s        10.23s                         4.48s


This puts things nicely into perspective:

Observation #1: As soon as we hit sizeable sets of patterns, the execution time starts growing linearly, and it does not matter whether we use C++ or Python for our heavy-duty matching.
This is as expected: our running time is driven by the complexity of the regex, and the regex grows linearly with the pattern set size. A quick local experiment can confirm this (see the sketch below).
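Here is a minimal sketch of such an experiment; sample.html is a hypothetical stand-in for any page you have lying around, and the absolute numbers will of course vary by machine:

import re, time

text = open('sample.html').read()
for size in (500, 4000, 8000, 16000):
   # Build an alternation of 'size' synthetic words and time a full scan.
   patterns = ['word%d' % i for i in range(size)]
   regex = re.compile('|'.join(patterns))
   start = time.time()
   regex.findall(text)
   print size, time.time() - start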

Observation #2: Releasing the GIL on multi-core machines is paramount. Yes, we arrived at the same conclusion last time, but back then the gain was 14%. Not any longer: even comparing apples to apples (or C++ to C++), we get a speed-up factor of more than 2.
Here you might ask: "why not 6?" Indeed, aren't there six cores? Well, we use inputs of various sizes, and the longest ones form the critical path. Had I run all the tests on exactly the same input, we would have seen an even higher speed-up factor.

Observation #3: C++ provides value irrespective of parallelism. Yes, you might say that we hobbled Python by recompiling the regex each time. But then, we could equally say that it is C++ that was hampered; with some effort we could have cached the compiled regex in static memory (a sketch of the Python-side equivalent follows below).
Going back to the cold, hard data: we get a 70% speed-up from C++ even before the GIL is gone.
Before the flames start shooting in my direction: I'm not saying that Python is always slower than C++. Understanding how performance compares between the two in various cases is important, and the answer cannot be given in one syllable or one paragraph. It is definitely worth exploring further.
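For completeness, here is what such caching could look like on the Python side; a minimal sketch, with getPatternRegex being a hypothetical helper rather than part of the code above:

import re

_regexCache = {}

def getPatternRegex(patterns):
   # Compile the alternation once per unique pattern set and reuse it,
   # instead of recompiling inside every scanFile call.
   key = '|'.join(sorted(patterns, key = len, reverse = True))
   if key not in _regexCache:
      _regexCache[key] = re.compile(key)
   return _regexCache[key]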

Time to wrap things up

It was a long road from basic sentence counting to parallel, asynchronous performance tuning. We met Python's GIL, navigated around it, and looked at cases where parallel execution does more harm than good.
There's always more to say about Python/C++ performance comparison, so the wider topic is very likely to remain on this blog for a while.
