Roman on Software Engineering: Regex engine and parallel string matching

The last post on Cython ended up with a surprising revelation on regexes - it turned out that Python matching of patterns OR-ed within a single regex was orders of magnitude faster than its regex.h cousin.

However, there were a few why-s, if-s and but-s left:

What is so different between the two engines? I alluded to backtracking specifics unearthed in a StackOverflow post, but that explanation wasn't truly satisfactory, as there was no other public source that referred to the same disparity.
Is this unique to regex.h? What about boost::regex or std::regex?
Does the size of the matched string set matter? Remember, I've fixed the set of needles to be searched in the haystack, but could it be that one engine performs better on smaller/larger sets?

These are exactly the questions I'd like to answer today, and here's how we are going to proceed:

Prepare three extensions that implement pattern matching: one for boost::regex, one for C++11 regex, and one for regex.h.
Run a set of tests for 2, 20, 200 and 2000 patterns which go over the same set of inputs.
Compare, deduce and write a blog post.

All of these shall be unveiled shortly, with an important correction that had to be performed between steps b) and c). However, I'm jumping ahead: let's step back and look at step a. As usual, I'll share the sources, but for compile/build/embed specifics will refer to previous posts.

Here's the boost::regex one:

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <boost/regex.hpp>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   boost::regex regex(pattern_regex);
   boost::match_results<std::string::const_iterator> what;
   boost::match_flag_type flags = boost::match_default;

   auto searchIt = content.cbegin();
   while (boost::regex_search(searchIt, content.cend(), what, regex, flags))
      {
      tempSet.emplace(what[1].first, what[1].second);
      searchIt = what[1].second;
      }
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }

BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

Now, it all looks as same old, but in fact, there are a couple of novelties. First, note the usage of Boost::Python. Rather than doing complicated marshalling of C++ data structures to Python and back via Python.h, we use a nice shortcut on lines 28-32. Who said that writing C++ extensions for Python was hard?

To follow up on that, we're not returning a primitive type here, but rather a Python list. This is where boost::python::list does its little magic by defining a C++ type that can be automatically coerced to Pythonic structures.
On lines 22-25 we take our C++ set and create such a list while using C++11 move semantics to avoid unnecessary copy constructors. Another minor landmark is the usage of emplace which avoids additional copies and constructs a string directly in the container. As always, my advice is to read and experiment with those if you're serious about C++11.

However, let's get back to business. We have the Boost example sorted, so let's move on to std::regex. Here, there is a bit of an anti-climax, as it is almost a carbon copy of the code above, which is entirely unsurprising - std::regex was modelled after Boost. To get the code. just replace boost:: with std:: in the right places.

All we have remaining is regex.h, which is a throwback to my previous post. The main difference though is that we're back to conventional weapons; Cython for sure looked unorthodox, but we want to do a like-for-like comparison, so this will be a pure C++ wrapper.

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <regex.h>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   regex_t regex_obj;
   regcomp(&regex_obj, pattern_regex.c_str(), REG_EXTENDED);
   
   size_t current_str_pos(0);
   regmatch_t regmatch_obj[1];
   int regex_res = regexec(&regex_obj, content.c_str(), 1, regmatch_obj, 0);

   while (regex_res == 0)
      {
      tempSet.emplace(content.begin() + current_str_pos + regmatch_obj[0].rm_so, content.begin() + current_str_pos + regmatch_obj[0].rm_eo);
      current_str_pos += regmatch_obj[0].rm_eo;
      regex_res = regexec(&regex_obj, content.c_str() + current_str_pos, 1, regmatch_obj, 0);
      }
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }
   
BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

The Boost::Python wrapper is one and the same, and the main difference is with the regcomp, and regexec.

Right, we have step a) complete, so it's time to run and compare.

Searched string set size	Python	regex.h	boost::regex	std::regex
2	100	140	151	120
20	110	400	440	410
200	130	2310	2372	2355
2000	248	31270	31263	31278

Note: All measurements are in milliseconds, and the platform is Python 2.7.9, Cygwin/Windows, 4-core 2.10 Ghz CPU

Erm - what's going on here? It's reasonable that Python would perform similarly to the Boost regex engine, but having two orders of magnitude is entirely unexpected and counter-intuitive. This points to a defect at the user's end (i.e. my code) rather than a radical difference between the libraries. After a bit of head scratching, the problem was found - this line:

boost::match_flag_type flags = boost::match_default;

needed to be replaced with this:

boost::match_flag_type flags = boost::match_default | boost::match_any;

We have to find any match, and not necessarily scan for all of them; this is taken care of by the outer loop.
After changing the one line, we get a saner set of results:

Searched string set size	Python	boost::regex
2	100	90
20	110	100
200	130	130
2000	248	270

How much of a difference does one line make! Python terminates the regex search as soon as it finds a match, while the other libraries press on. This is why regex.h was non-performant: it simply does not have such an option (std::regex does, with very similar metrics to boost).

Here are my takeaways from this exercise:

Boost::Python is a very convenient way of exposing C++ APIs. Unless you do highly customized parameters management or do not have access to Boost, it should be preferred to manual Python_ function calls.
Whenever you hit a non-performant library, blame yourself first.
Whenever doing parallel string matching via regexes, always look for match_any parameter, or variant thereof.
Avoid regex.h, unless the powers to be force development in pure C.

Roman on Software Engineering

Saturday, 11 July 2015

Regex engine and parallel string matching

No comments :

Post a Comment

Blog Archive