Saturday 28 February 2015

Python optimisation - Part III: Parallelism and C++ extensions

In this post, I'm continuing the series on Python optimisation.

The task at hand is, and was, to take a set of URLs as input, download them, and match their content against a set of patterns.

We started with the most naive serial implementation, then gradually enhanced it using asynchronous I/O with Twisted and by compiling the patterns into a single regular expression.

Today, I'd like to look at further performance improvements by taking our example into the land of parallelism and multi-core execution.

Before we kick off - two small adjustments to the test set and environment.
  • The test bed is moved to a 6-core PC; the single-core laptop used for the previous optimisations is left behind.
  • The set of URLs has been increased from four to six to utilise all of the cores in parallel.
To ensure we keep ourselves honest, I've re-run exactly the same code from the previous post on the new platform (and using the expanded URL set) with the following results.

Serial fetch and scan: 7.95 sec
Parallel fetch: 5.66 sec
Regex optimisation: 3.50 sec

Note: You might correctly point out that these numbers depend heavily on network latency. I've averaged several runs to derive them and checked the variance; if not precise, they should be in the right ballpark.

This will be our baseline. Now, with the introductions behind us, let's start covering new ground.



Deferring to Python threads


Whenever the spectre of parallelism appears, many people naturally see the words "thread" and "process" entering their conversations. 

Of course, modern parallelism is far from being restricted to OS constructs, but Python is a classical language in this respect, and we are going to be using those words too.

If you have any mileage with Python and multi-core optimisation, you probably already raise both hands in favour of processes. The creator of Python definitely does. I'm a bit less categorical: multiple processes certainly use the cores, but there are situations where the lack of convenient shared memory and the cost of IPC start to weigh things down.


However, I'm jumping ahead. The best way to understand why something might not work is to actually try it, so let's give threads a go.

Moreover, we're in luck, since Twisted provides a convenient mechanism called deferToThread. It does what it says on the tin: it takes a function and executes it in a thread from Twisted's pool, firing the resulting Deferred's callbacks back in the reactor thread.
from twisted.internet import reactor, defer, threads
from twisted.web import client
import sys, re

def stopReactor(ignore_res):
   reactor.stop()

def printResults(url, matchingPatterns):
   print ': '.join([url, ','.join(matchingPatterns)])

def matchPatterns(pageContent, url, pattern_regex):
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(url, matchingPatterns)

def gotPage(pageContent, url, pattern_regex):
   return threads.deferToThread(matchPatterns, pageContent, url, pattern_regex)

def parallelFetchAndScan(urls, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()


This is not far removed from the last optimisation, with two key differences:

- gotPage makes use of thread deferral. It's as simple as handing the function and its parameters to the Twisted API.
- In matchPatterns, we now print the results directly as we get them, rather than saving them in a static data collection. Leaving things as they were would be a recipe for race conditions, and avoiding shared writable state looks like a good idea (one safe alternative is sketched below).
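If we did want to gather all the results in one place, a race-free option is to return them from the deferred call and let DeferredList collect them in the reactor thread. Here's a minimal sketch of that alternative (printAllResults is a name of my choosing, not part of the code above):

def matchPatterns(pageContent, url, pattern_regex):
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))
   return url, matchingPatterns

def printAllResults(results):
   # DeferredList hands us a list of (success, value) pairs, and this
   # callback runs in the reactor thread - no locking required.
   for success, value in results:
      if success:
         url, matchingPatterns = value
         printResults(url, matchingPatterns)
   reactor.stop()

# ...and in parallelFetchAndScan, instead of stopReactor:
# defer.DeferredList(deferreds).addCallback(printAllResults)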

Before we look at the performance (yes, a bit of suspense), let's take an inside look at Twisted's thread management.
I've sneaked the following line of code (plus an import threading) into matchPatterns to confirm that we are indeed in the realm of multi-threading:
print 'I am in thread: %s' % threading.currentThread().name
These are the lines it generated (of course, they would vary from run to run): 

I am in thread: PoolThread-twisted.internet.reactor-3
I am in thread: PoolThread-twisted.internet.reactor-5
I am in thread: PoolThread-twisted.internet.reactor-4

I am in thread: PoolThread-twisted.internet.reactor-1
I am in thread: PoolThread-twisted.internet.reactor-3
I am in thread: PoolThread-twisted.internet.reactor-2

Interesting. Twisted uses a thread pool instead of spawning a new thread each time, and reuses threads as needed (note that thread 3 appears twice: we managed to process one URL in full while another was still downloading).
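As an aside, the pool size is tunable. If the URL set grew much larger, we could hint at a bigger pool via Twisted's reactor.suggestThreadPoolSize (the value below is arbitrary, purely for illustration):

from twisted.internet import reactor

# Ask the reactor's thread pool to grow up to 20 worker threads.
reactor.suggestThreadPoolSize(20)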

OK, this is all fine and good, but what do the numbers look like?

time python parallelFetchThread.py urls.txt words.txt

real    0m3.470s
user    0m1.388s
sys     0m1.278s


 
Looks like all those good people at Intel and in the Twisted community have worked for no gain! We got no benefit from the threads or the extra cores, and I can finally mention the dreaded GIL, also known as the Global Interpreter Lock.
It is the reason our performance has not shifted: there never was any parallelism, since the lock was held whenever any instance of matchPatterns was running. This is why people usually spawn processes rather than threads in Python.
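For completeness, here is roughly what the process-based route could look like with the standard multiprocessing module. This is only a sketch under names of my own choosing (matchPatternsWorker, jobs); note how each page's content has to be pickled across the process boundary - the IPC cost mentioned earlier:

from multiprocessing import Pool
import re

def matchPatternsWorker(job):
   # Runs in a separate worker process; pageContent travels over IPC.
   pageContent, url, patterns = job
   pattern_regex = re.compile('|'.join(patterns))
   return url, set(m.group(0) for m in pattern_regex.finditer(pageContent))

# pool = Pool()   # one worker per core by default
# for url, matchingPatterns in pool.map(matchPatternsWorker, jobs):
#    printResults(url, matchingPatterns)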


However, let's assume, for the sake of argument, that we stubbornly want to stick with threads. Maybe we have a large pool of shared configuration data, or a memory-hungry cache, that makes multi-processing a challenge.


Is there any way of escaping the GIL while staying on threads? Of course, the question is semi-rhetorical, since if the answer were "no", this post would have been of little value.


Escaping the GIL via C++ extensions


The answer we're going to shoot for is escaping the GIL by reimplementing matchPatterns in C++.

Since compiled code does not touch any interpreter state, we can release the global lock and (hopefully) use the six cores as we were meant to.
Of course, C is also an option, but C++ makes for more concise and maintainable code, and I haven't seen many resources on the Web that deal specifically with Python-to-C++ extensions.

Here's the master plan:

  • Replace matchPatterns in our Python module with an invocation of the new C++ extension we're going to create. Since we cannot pass compiled regexes across, we won't pre-compile; instead, we'll pass the regex in its string form.
  • Create a wrapper C interface which accepts the page content and the regex string, and returns the list of found patterns.
  • Re-implement matchPatterns using C++11 with Boost::Regex.
  • Compile and install the new extension using the Python distutils module.
  • Profit!
Let's start with the easier bit, which is the slightly cut-down Python code.

from twisted.internet import reactor, defer, threads
from twisted.web import client
import sys
import contentMatchPattern

def stopReactor(ignore_res):
   reactor.stop()

def printResults(url, matchingPatterns):
   print ': '.join([url, ','.join(matchingPatterns)])

def matchPatterns(pageContent, url, pattern_regex):
   matchingPatterns = contentMatchPattern.matchPatterns(pageContent, pattern_regex)
   printResults(url, matchingPatterns)

def gotPage(pageContent, url, pattern_regex):
   return threads.deferToThread(matchPatterns, pageContent, url, pattern_regex)

def parallelFetchAndScan(urls, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = '(' + '|'.join(patterns) + ')'

   deferreds = []
   for url in urls:
      d = client.getPage(url).addCallback(gotPage, url, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as urlFile:
      urls = urlFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelFetchAndScan(urls, patterns)
   reactor.run()

contentMatchPattern is the upcoming extension I've talked so much about. The other minor difference is in parallelFetchAndScan, where we no longer compile the regex, and where we wrap the alternation in parentheses so that the whole expression forms a capture group (the Boost-based C++ code below extracts match group 1).
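To make that concrete, here is what the handed-over string looks like for a toy pattern set (the length-first sort ensures that longer alternatives win over their prefixes):

>>> patterns = ['foo', 'foobar', 'bar']
>>> patterns.sort(key = lambda x: len(x), reverse = True)
>>> '(' + '|'.join(patterns) + ')'
'(foobar|foo|bar)'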

Now, let's have a look at the wrapper C code:

#include <Python.h>

// TBD - matchPatternsImpl

static PyObject *pythonExt_matchPatterns(PyObject *self, PyObject *args)
 {
 const char *content, *pattern_regex;
 if (!PyArg_ParseTuple(args, "ss", &content, &pattern_regex))
    return NULL;

 set<string> matchedPatterns = matchPatternsImpl(content, pattern_regex);

 PyObject *list = PyList_New(matchedPatterns.size());
 size_t listPos = 0;
 for (auto pattern : matchedPatterns)
  {
  PyList_SetItem(list, listPos, Py_BuildValue("s", pattern.c_str()));
  ++listPos;
  }
 return list;
 }
  
extern "C"
{
 
static PyMethodDef module_methods[] = {
   { "matchPatterns", (PyCFunction)pythonExt_matchPatterns, METH_VARARGS, NULL },
   { NULL, NULL, 0, NULL }
};   
 
void initcontentMatchPattern(void)
 {
    Py_InitModule3("contentMatchPattern", module_methods,
                   "Match patterns in text");
 }

}

Just don't try compiling this yet, as we haven't finished. A couple of C++11 bits have been carried along, but the main interface is implemented via module_methods, init<extension name>, and the various PyList* and PyArg* calls (note the error check on PyArg_ParseTuple: returning NULL signals an exception back to Python).
You can either use it as a mini-template for your own extensions, or look at the more detailed Python/C API documentation.

Let's augment that with the C++ implementation of the actual algorithm:
#include <Python.h>

#include <iostream>
#include <set>
#include <string>
#include <boost/regex.hpp>
using namespace std;

set<string> matchPatternsImpl(const string &content, const string &pattern_regex)
 {
 set<string> ret;
 boost::regex regex(pattern_regex);
 boost::match_results<std::string::const_iterator> what;
 boost::match_flag_type flags = boost::match_default;

 auto searchIt = content.cbegin();
 while (boost::regex_search(searchIt, content.cend(), what, regex, flags))
  {
  ret.insert(string(what[1].first, what[1].second));
  searchIt = what[1].second;
  }
 return ret;
 }

static PyObject *pythonExt_matchPatterns(PyObject *self, PyObject *args)
 {
 const char *content, *pattern_regex;
 if (!PyArg_ParseTuple(args, "ss", &content, &pattern_regex))
    return NULL;

 set<string> matchedPatterns;
  {
  // Release the GIL only while we're in pure C++ land; the Python C API
  // calls above and below must run with the lock held.
  PySaveThread threadStateGuard;
  matchedPatterns = matchPatternsImpl(content, pattern_regex);
  }

 PyObject *list = PyList_New(matchedPatterns.size());
 size_t listPos = 0;
 for (auto pattern : matchedPatterns)
  {
  PyList_SetItem(list, listPos, Py_BuildValue("s", pattern.c_str()));
  ++listPos;
  }
 return list;
 }
  
extern "C"
{
 
static PyMethodDef module_methods[] = {
   { "matchPatterns", (PyCFunction)pythonExt_matchPatterns, METH_VARARGS, NULL },
   { NULL, NULL, 0, NULL }
};   
 
void initcontentMatchPattern(void)
 {
    Py_InitModule3("contentMatchPattern", module_methods,
                   "Match patterns in text");
 }

}

Your mileage may vary here, but even if C++ is not entirely your thing, the general gist should come across: we compile a regex, iterate over the content, collect the matches into a set, convert it to a Python list, and return that.

I could not resist slipping in a couple of C++11 constructs. For those of you who have a C++ background but are not fully up to speed with the latest and greatest, they are auto type deduction (auto searchIt = content.cbegin()) and range-based for loops (for (auto pattern : matchedPatterns)).

The one new feature I did not use was std::regex. The reason is simple: my compiler's implementation of it was snail-like in comparison to Boost, so I stuck with the old guns.

There could have been more, but this is not a primer on C++11, nor do I claim full expertise in it. (Though it might still get more stage presence on this blog.)

Now, if you looked closely at the code, one declaration should raise suspicions. Where did this PySaveThread come from?
Yep, guilty as charged - this is my own class, and I haven't defined it yet. Here it is:
class PySaveThread
 {
 public:
  PySaveThread()
   {
   // Release the GIL, stashing this thread's interpreter state.
   _threadState = PyEval_SaveThread();
   }
  ~PySaveThread()
   {
   // Re-acquire the GIL and restore the saved thread state.
   PyEval_RestoreThread(_threadState);
   }
 private:
  PyThreadState *_threadState;
 };

This is why we shook the dust off the C++ compiler. The PyEval_SaveThread() and PyEval_RestoreThread() calls release the GIL and lock it again, saving and restoring the thread state in the process.
Without them, the GIL would stay held throughout our extension.

Note that I'm using the RAII (Resource Acquisition Is Initialization) idiom. This way, we make sure that PyEval_RestoreThread is called even if the C++ code throws an exception.


Note: You might argue at this point that Python has become secondary, and we might as well write the whole thing in C++. Well, yes and no. Python is a concise language, and even this small example would balloon in C++. Moreover, consider the task at hand as part of a larger framework which includes less performance-conscious code that can be written quickly in Python.


All that remains now is building and installing. Here is the mini Python script, extInstall.py, that I'm using for that:

from distutils.core import setup
from distutils.extension import Extension

setup(name='contentMatchPattern',
      ext_modules = [Extension('contentMatchPattern', 
                     sources = ['contentMatchPattern.cpp'], 
                     extra_compile_args = ['-std=c++0x'], 
                     libraries = ['boost_regex'])])
This makes use of the Python distutils package. Note the extra_compile_args and libraries parameters which take care of compiling with C++11 support and linking with Boost::Regex.
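As a side note, if you'd rather keep the module next to the sources while experimenting, distutils can also build it in place:

python extInstall.py build_ext --inplace

This drops contentMatchPattern.so into the current directory, where the interpreter can import it directly.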

Now, we can install the extension by running python extInstall.py install.
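Once it's in place, a quick interactive sanity check might look like this (std::set keeps its contents sorted, hence the alphabetical order):

>>> import contentMatchPattern
>>> contentMatchPattern.matchPatterns('foo and bar and foo again', '(foo|bar)')
['bar', 'foo']

Phew! Finally, we can run and time this: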


time python parallelFetchThreadCpp.py urls.txt words.txt

real    0m3.000s
user    0m1.871s
sys     0m1.232s

A 14% saving. It's not zero, but is that all? Well, for big applications, 14% is no laughing matter - it can still mean a sizeable CapEx reduction.
Another angle is the example itself; since my aim was to exemplify both asynchronous I/O optimisation and parallelism, it has a strong networking element. We still do quite a bit of URL fetching, and there's only so much we can save on the processing side.

Just to show what we have gained by escaping the GIL, let's take the magic PySaveThread guard out, recompile, and run again:

real    0m3.900s
user    0m1.872s
sys     0m1.278s

So, did we gain from escaping the GIL? We did. Moreover, the C++ implementation with the GIL held is slower than the Python one, which might be a bit counter-intuitive. There are two reasons for that:


  1. Python's re module is implemented with highly-optimised C code; it is at least at parity with the Boost::Regex implementation in my environment.
  2. If you recall, we no longer pre-compile the regex once for all URLs, so we ended up recompiling it six times - once per page.
This also shows that, in a like-for-like comparison, the gain from releasing the GIL is greater than 14%: 3.9 versus 3.0 seconds is roughly a 23% saving.

So, it has been quite a journey. We started with a naive, serial implementation, and gradually introduced asynchronous programming with Twisted, optimised matching with regexes, and parallelism with C++ extensions.
We went all the way down from 7.95 seconds of execution time to 3 seconds.
Each of these stages could merit its own series of posts/articles/books, but others have already written them. The goal here was to show various techniques working in practice on a single example, and let the combination of your interest and Google do the rest.

This example has served its purpose, but the series is not at an end just yet. In the next post, I'd like to further explore the benefits of removing the GIL on pure processing tasks.

2 comments:

  1. Did I get it right that in Python we cannot get parallelism?

  2. Not exactly. You can get it through multi-processing, and you can get it through multi-threading if your threads spend most of their time in C/C++ extensions.

    But yes, my experience has been that you need to jump through a few hoops (as per post above).
