Sunday 28 June 2015

Cython and regular expressions

Recap

I'm returning yet again to the long-suffering text matching example from a couple of months back. The goal was to take a text document and a set of patterns, and find which of the patterns appear in the document. We went through a variety of techniques: Python micro-optimisations, regexes, C++ extensions, Twisted, reactors and improved string matching algorithms. My intent this time around is to try out the improvements Cython can bring.

In case you're unfamiliar with Cython, here's a two-sentence introduction: its aim is to preserve the ease of development inherent to Python while extending it with static type definitions and easy C/C++ bindings. Since dynamic typing comes at a great cost, it's perfectly suited for optimising hotspots in Python code, striking a balance between productivity and performance.
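
To make that concrete, here's a toy function (not part of our example) where the static type declarations let Cython compile the loop down to plain C:

def sumOfSquares(int n):
   # cdef-typed variables live as C ints/longs, so the loop below
   # never touches Python objects
   cdef long long total = 0
   cdef int i
   for i in range(n):
      total += i * i
   return total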

As before, the end result turned out to be surprising (at least to me), but what matters is the journey. Let's embark on it.

Firstly, I'll shake the dust off the pure Python example:
from twisted.internet import reactor, defer, threads
import sys, re

def stopReactor(ignore_res):
   reactor.stop()

def printResults(filename, matchingPatterns):
   print ': '.join([filename, ','.join(matchingPatterns)])

def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(filename, matchingPatterns)

def parallelScan(filenames, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for filename in filenames:
      d = threads.deferToThread(scanFile, filename, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelScan(filenames, patterns)
   reactor.run()

This was just a copy/paste to save you the bother of clicking through to an older post. I still preserved the reactor and multi-threading, but they won't be playing a major role today.

We know that it's scanFile where we spend most of the time, and this is the hotspot that needs the Cython treatment. Since the matching is done via regexes, we cannot in good faith claim that we're optimising without also switching the regex library; here, I'm going to use regex.h.
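
If you'd like to confirm the hotspot yourself, a rough standalone profiling pass along these lines does the job (google.html and shortWords.txt are the inputs used later in this post):

import cProfile, pstats, re

pageContent = open('google.html').read()
pattern_regex = re.compile('|'.join(open('shortWords.txt').read().split()))

# profile one full scan and show the five most expensive calls
cProfile.run('set(m.group(0) for m in pattern_regex.finditer(pageContent))',
             'matching.prof')
pstats.Stats('matching.prof').sort_stats('cumulative').print_stats(5)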

Writing Cython code is not a common skill, and we're not dealing with Hello, World here, so as the narrator I have two choices: jump to the end result and go for an autopsy, or build it up piece by piece. Taking the mantra that it's the journey that matters, I'll go for the second option.

Cython

Going for the easy bit, here's the updated scanFile function:
import contentMatchPatternCython

def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   # pattern_regex is now the raw pattern bytes ('|'.join(patterns)),
   # not a compiled re object
   matchingPatterns = contentMatchPatternCython.matchPatterns(pageContent, pattern_regex)
   printResults(filename, matchingPatterns)

Nothing glamorous here: we just defer the matching to the Cython extension that we're about to build.
Let's tick the build checkbox too:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(name='contentMatchPatternCython',
      cmdclass={'build_ext': build_ext},
      ext_modules = [Extension('contentMatchPatternCython', 
                     sources = ['contentMatchPatternCython.pyx'])])

Still nothing exciting, as this is just a small bootstrapping script that will build the Cython extension.
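
Building it is the standard incantation, which drops the compiled module next to the sources:

$ python setup.py build_ext --inplace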

Time to take a deep breath, and look at how we write the actual algorithm.
This is the skeleton:

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   return matchingPatterns
Baby steps here. We just return an empty set to satisfy the function signature; however, the code already differs from vanilla Python by declaring types for pageContent, regex and matchingPatterns.
Why? Pre-declaring variable types lets Cython generate optimised storage and access instead of going through dynamic lookups. In this specific case, there is no requirement to support Unicode, hence the bytes type definition.
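
As an aside, if you're ever unsure whether your type declarations actually help, Cython's annotate mode points out the lines that still fall back to Python object operations:

$ cython -a contentMatchPatternCython.pyx

The generated contentMatchPatternCython.html highlights Python interaction in yellow; the whiter the report, the closer the code is to plain C.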

Now, let's import regex.h functions:
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 
Note that we only import what we need later on, and that the nogil keyword tells Cython these functions are safe to call without holding the GIL. (It does not release the GIL by itself; that still requires an explicit with nogil block.)
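
The keyword alone buys nothing, though: the calls must then happen inside a with nogil block, which rules out touching Python objects there. As a hypothetical sketch of where this could lead, an inner loop that merely counts matches can run entirely without the GIL (building the result set cannot, which is why this version doesn't attempt it):

cdef int countMatches(const char* content, regex_t* regex_obj) nogil:
   cdef regmatch_t regmatch_obj[1]
   cdef int count = 0
   cdef int pos = 0
   # pure C types throughout, so the GIL can stay released
   while regexec(regex_obj, content + pos, 1, regmatch_obj, 0) == 0:
      if regmatch_obj[0].rm_eo == 0:   # guard against spinning on empty matches
         break
      pos += regmatch_obj[0].rm_eo
      count += 1
   return count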

Ok, so we have the interface and the required C library functions imported into Cython. All that's left is the algorithm, and tying it all together:
cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   cdef regex_t regex_obj
   cdef regmatch_t regmatch_obj[1]
   cdef int regex_res = 0
   cdef int current_str_pos = 0

   # compile once, then rescan from just past each match until nothing is found
   if regcomp(&regex_obj, regex, REG_EXTENDED) != 0:
      raise ValueError('Failed to compile the regular expression')
   regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
   while regex_res == 0:
      # rm_so/rm_eo are offsets into the slice we passed, hence the re-basing
      matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
      current_str_pos += regmatch_obj[0].rm_eo
      regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

   regfree(&regex_obj)
   return matchingPatterns

The interesting stuff happens after the variable declarations, where the usual regex compilation and execution take place. The code is not one-to-one with pure Python, since the underlying C library has no notion of generators, and we have to advance through the content by string slicing (a cousin once removed of pointer arithmetic).
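
Once the extension is built, it behaves like any other module, so a quick smoke test is easy (POSIX regexec is leftmost-longest, so both tokens should be found):

import contentMatchPatternCython

# expect set(['ab', 'abc']) on this input
print contentMatchPatternCython.matchPatterns(b'ab abc', b'abc|ab')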

Cython is relatively new to me as well, and even though I wrote the code above, it caused a certain degree of cognitive dissonance: a bit akin to seeing a platypus for the first time.


The code looks like C, but semicolons are nowhere to be found, while there are Python indentations and weird brackets around strings. If you've been developing in both languages for a while, seeing them unified in matrimony within a single function is a new experience. 
Anyhow, time to cut the nostalgia, stop coding and start measuring! At this point, my intent was to try this on a single <file, pattern> pair, and then move on to further Cython tweaks to release the GIL around the entire regex matching loop. Reality, however, turned out to be different...

Performance

From here, and until the end of the post, the inputs are fixed: we use a single haystack (17KB source of google.html) and a fixed set of needles (first 2000 tokens of /usr/share/dict/words).
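
For anyone wishing to reproduce the setup, something along these lines generates the inputs (file names as used in the runs below):

$ wget -O google.html http://www.google.com
$ echo google.html > shortFilenamesList.txt
$ head -n 2000 /usr/share/dict/words > shortWords.txt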

Making sure to run a few iterations and take the fastest one, here's the timing of the Python-only example:

$ time python pythonOnly.py shortFilenamesList.txt shortWords.txt

real    0m0.868s
user    0m0.546s
sys     0m0.312s

All right, so how does Cython compare? Drum roll please...

$ time python pythonWithCython.py shortFilenamesList.txt shortWords.txt

real    0m2.396s
user    0m2.012s
sys     0m0.374s

Ouch - 2.7 times slower! Of course, such a difference has to be explained. In my mind, there were two possible sources: the different regex library, or Cython itself. The former is far more likely, but we should never make hasty assumptions when performance tuning. Hence, the natural next step was to write the same in C.
I'll omit the C code for brevity, as it mirrors the Cython code with the semicolons and pointer arithmetic thrown in. Let's jump to the result:

$ time ./a.exe

real    0m1.547s
user    0m1.513s
sys     0m0.031s

No joy - it's still way slower than Python. Time for

Conclusions

So, we know by now that it's the difference in regex libraries that matters; no low-level optimisation in Cython or C can overcome that. In hindsight, there was little basis to suppose that Cython would matter on this specific task, as Python's re implementation is also native C.

However, this was not time wasted, and here's why:

  1. This is yet another tangible example of why blanket statements such as "language X is slow, and Y is fast" are untrue. In this case, Python came out faster; in another it may well be slower.
  2. Thanks to user @nhahtdh on StackOverflow, I gained insight into the difference in performance: regex.h implements a backtracking engine and does not optimise for non-backtracking patterns, which is almost certainly something that Python's re does. (A quick way to probe this yourself follows the list.)
  3. Good Cython practice. There is now a reasonable template in place for importing a C library into Cython, linking with it, building and so on. (There are of course plenty of other resources that show the same, but it's always helpful to work through it yourself.)
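
On the second point, if you want a feel for the numbers yourself, a small probe like this times the Python side of the comparison; feeding the same content and pattern to matchPatterns from our extension gives the regex.h counterpart:

import re, timeit

words = open('shortWords.txt').read().split()
content = open('google.html').read()
pattern = re.compile('|'.join(sorted(words, key=len, reverse=True)))

# ten runs of the full scan; lower is better
print timeit.timeit(lambda: set(m.group(0) for m in pattern.finditer(content)),
                    number=10)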

What next?

Coming back to the task at hand: can we truly say that Python's re engine is simply better? My answer is a definite 'no' since engines shine on different patterns. Ours has been a long sequence of OR expressions, and it is just one possibility among many.

So, there are many options for taking it further: try other matching patterns, compare with C++'s boost::regex, or go for examples that are less dependent on supporting libraries. While Cython has not been of much help in optimising our venerable string matching example, it still has a future on this blog.
