Roman on Software Engineering: July 2015

Introduction

As I was doing the detour into various programming topics, something was nagging me a bit in the background, something that was left unsaid in the previous series on Scrum.
After a short while, I realised that many of the recommendations in those series were underpinned by the invisible hand of time management, and this is the one meta-subject that haven't got the attention it deserves yet - at least on these pages.

For me, effective time management is a preliminary condition to success. Technical leaders have their attention drawn in many directions in many different ways, and many times within each day and each hour. Moreover, the more they lead the higher the frequency. At some point, we face a non-exclusive choice:

Work long hours
Ignore some of the ~~detractors~~ pleas for help
Actively decide how much time each particular subject merits

For years, I've been actively employing (3), though of course (1) is also unavoidable when the situation demands it. In a separate post, I'll bring up a few examples and guidelines on how one can quickly prioritise and allocate the right time slots, but first let's circle back: how all of this related to the subject of this post, namely "meetings"?

I can answer thus: meetings is whenever someone else decides for you how long you should spend on X. Hence, time management and meetings are deeply intertwined.

Now, before there is another step further, let me add a big disclaimer: I'm not saying that meetings are bad, and all of us should permanently sit in our rooms/cubicles/open spaces and lonely type at the keyboard.
The art consists of organising efficient and effective meetings on stuff that matters and do not make them longer than they should be. After all, if you spent an hour in a meeting that helped neither human nor beast, you'll have either to ignore an hour-worth of something else, something that matters, or go home one hour later.

Anyhow, let's be a bit more positive: here's the list of

Three meeting types that work

The brainstorm

Few attendees: definitely less than 10, but ideally less than 5. Everyone participates, and new decisions get formulated.

Examples include a production incident war room, or interactive technical design.

The review

Also fewer than 10 attendees, with one main presenter and a preliminary material distribution. There are all kinds of reviews: deployment, design, planning, project etc. You can also add an internal product demo in the same bucket.

The update

This meetings tend to have a big number of attendees: anything from 5 up to 50000. Their purpose is to convey information, and not kick off discussions, or take new decisions.

Company-wide quarterly update, Scrum stand-up, product training, architecture overview all belong to this group.

So far, so good, so what?

Ok, so this all beyond obvious, so why even bother listing it? As with many other phenomena, the trick is not realising what works, but what doesn't. Our enemy in this particular post are inefficient meetings, and those exist in the big galactic void between those three constellations. In not so many words, they happen whenever someone does not have a clear picture of what type of meeting they are going to call in.

The organiser might go for a brainstorm with twenty people, where only two speak, and eighteen wish for earth to open underneath their feet and swallow them whole. They might call in a review where nobody knows in advance what this is about and end up spending eons discussing mundane topics. They might go as far as call in an update with 50 people who don't care about what they are going to be updated on.

Taking out a page from a diagnostician's book, let's see how we recognize an inefficient meeting once it's upon us, and how we prevent those from occurring in the first place.

How to recognize an inefficient meeting?

This section seems a bit redundant - criteria such as "am I bored out of my mind?" and "am I answering my e-mail while it's going along?" come to mind. But they are not precise: if you and the other guy are engaged and are chatting away, while the rest of the room is doing stress testing for Facebook, the meeting is still ineffective.

My criterion has been surveying the room from time to time: if 20%+ of the attendees are mentally elsewhere, then the meeting isn't going great. You can tweak the number up to derive more adjectives: for example, with 80%+ it can be safely called "waste of time and money".

This occasional visual inspection is not a meaningless exercise. It is important lesson and feedback, since some meetings are not inherently bad; it is just that the speaker needs to do a better job of engaging. One example I covered at length in the past was the status update meeting.

However some meetings cannot be made whole even with the most eloquent presenter; and it's possible to detect those in advance.

How to recognize an inefficient meeting before it starts?

Here, the meeting types come to the fore again:

The brainstorm

The most common fallacy in my book is lack previous material or preliminary discussion.

I had the pleasure of receiving generic invites in the past; for example, "Design widget A", or "How do we make our customers happy?".
The cause is noble, but then what helps everyone to be productive? What if one guy might have been thinking about widget A all of his wakeful hours, and everyone else would not consider widget A until you served it with watercress on a plate?

This means that rather than having a brainstorm, we are going to have an update, while we won't really know whether there are other alternatives to design the widget, and/or whether we've got the right one.

Organiser's job is making sure that all the invitees:

Care about widget A
Have something to say about widget A
Know what others think about widget A

Brainstorming is much closer to playing chess than playing poker. The more we know which pieces/cards everyone else holds, the better we can choose what topics we spend time discussing, and the better we can build on each other's ideas.

In not so many words, brainstorms should not happen unless people were encouraged to think, contemplate and feed back offline first, so that we know exactly which points we are going to solve.

Note: The one exception to that rule are war-room discussions, where there's no time for niceties, but these should be rare in a healthy organisation.

Of course, there are many other reasons why these meetings might not go well - more vocal people dominating the proceedings, political agenda etc. These are already covered to a great extent elsewhere, and are more about running rather than organising a meeting.

The review

Here, as with the brainstorm, more often than not people come to the meeting from different starting points.
For example, let's say that you're reporting on your team's progress with creating widget A. Two managers/architects in the room collaborated on said widget, two more know just what it's supposed to do, and two more got tacked on to the meeting and are vaguely aware that said widget exists.

Do you cater for the last group, and explain in minute details what this widget is for and how it came to be? Then 5.5 guys will be wasting their time (the 0.5 is you).
Do you cater for the first group and do technical update only on what happened in the last 2 weeks? Then only three people will be actually interested and get to ask meaningful questions.
Even with the best of intent, it's hard to make a great meeting out of this.

Another typical fallacy is people being undecided between review and brainstorm. I.e. they do a presentation, ask open questions, and let others have a discussion. That doesn't tend to succeed for two reasons:

Too many people in the meeting: brainstorms should have a small group, while reviews spread themselves out.
Lack of preparation: as per above, brainstorms need more thinking done in advance.

In short, this kind of meeting works when all attendees know the subject, care about it to a reasonable degree, and already have a few questions in mind. The best way of getting there is pre-distributing the slides, answering basic questions offline if need be, and allowing attendants to be ready with the hard questions for the meeting itself.

The update

Updates have the most attendees, and tend to live or die by the skills of their presenter. Thus, they are a bit of a special case - it's harder to tell from a meeting invite how they are going to pan out. The only realistic warning sign are the words "discuss", "decide" or their siblings - they imply that the organiser is going for an open forum in a meeting least suited for that purpose.

What to do when an inefficient meeting lands in your calendar?

Rejecting is easy: Outlook has a convenient button for that very purpose. However, this is neither safe nor very helpful - especially if done without an accompanying note.
Explaining why this meeting might not work is less fraught with political peril, though also requires caution.

For example, it is legitimate to ask for a crisp agenda definition in a brainstorm, but demanding slides in advance for a review might be construed as unnecessary pressure. Here, we are both feet within the minefield of office politics, and it has to be tread very carefully.

The important point though is that it's you who is going to go home extra three hours later, and it's you whose priorities were forced by the meeting invite. If you firmly believe that your presence there will not provide any value, then it's best to state that and bail out using the best political tactics at your disposal - or help the organiser do a better job.
Obviously, you'll also be spawning meetings yourself; here it's best to keep to the saying: "Do unto others as you would have them do unto you".

The last post on Cython ended up with a surprising revelation on regexes - it turned out that Python matching of patterns OR-ed within a single regex was orders of magnitude faster than its regex.h cousin.

However, there were a few why-s, if-s and but-s left:

What is so different between the two engines? I alluded to backtracking specifics unearthed in a StackOverflow post, but that explanation wasn't truly satisfactory, as there was no other public source that referred to the same disparity.
Is this unique to regex.h? What about boost::regex or std::regex?
Does the size of the matched string set matter? Remember, I've fixed the set of needles to be searched in the haystack, but could it be that one engine performs better on smaller/larger sets?

These are exactly the questions I'd like to answer today, and here's how we are going to proceed:

Prepare three extensions that implement pattern matching: one for boost::regex, one for C++11 regex, and one for regex.h.
Run a set of tests for 2, 20, 200 and 2000 patterns which go over the same set of inputs.
Compare, deduce and write a blog post.

All of these shall be unveiled shortly, with an important correction that had to be performed between steps b) and c). However, I'm jumping ahead: let's step back and look at step a. As usual, I'll share the sources, but for compile/build/embed specifics will refer to previous posts.

Here's the boost::regex one:

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <boost/regex.hpp>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   boost::regex regex(pattern_regex);
   boost::match_results<std::string::const_iterator> what;
   boost::match_flag_type flags = boost::match_default;

   auto searchIt = content.cbegin();
   while (boost::regex_search(searchIt, content.cend(), what, regex, flags))
      {
      tempSet.emplace(what[1].first, what[1].second);
      searchIt = what[1].second;
      }
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }

BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

Now, it all looks as same old, but in fact, there are a couple of novelties. First, note the usage of Boost::Python. Rather than doing complicated marshalling of C++ data structures to Python and back via Python.h, we use a nice shortcut on lines 28-32. Who said that writing C++ extensions for Python was hard?

To follow up on that, we're not returning a primitive type here, but rather a Python list. This is where boost::python::list does its little magic by defining a C++ type that can be automatically coerced to Pythonic structures.
On lines 22-25 we take our C++ set and create such a list while using C++11 move semantics to avoid unnecessary copy constructors. Another minor landmark is the usage of emplace which avoids additional copies and constructs a string directly in the container. As always, my advice is to read and experiment with those if you're serious about C++11.

However, let's get back to business. We have the Boost example sorted, so let's move on to std::regex. Here, there is a bit of an anti-climax, as it is almost a carbon copy of the code above, which is entirely unsurprising - std::regex was modelled after Boost. To get the code. just replace boost:: with std:: in the right places.

All we have remaining is regex.h, which is a throwback to my previous post. The main difference though is that we're back to conventional weapons; Cython for sure looked unorthodox, but we want to do a like-for-like comparison, so this will be a pure C++ wrapper.

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <regex.h>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   regex_t regex_obj;
   regcomp(&regex_obj, pattern_regex.c_str(), REG_EXTENDED);
   
   size_t current_str_pos(0);
   regmatch_t regmatch_obj[1];
   int regex_res = regexec(&regex_obj, content.c_str(), 1, regmatch_obj, 0);

   while (regex_res == 0)
      {
      tempSet.emplace(content.begin() + current_str_pos + regmatch_obj[0].rm_so, content.begin() + current_str_pos + regmatch_obj[0].rm_eo);
      current_str_pos += regmatch_obj[0].rm_eo;
      regex_res = regexec(&regex_obj, content.c_str() + current_str_pos, 1, regmatch_obj, 0);
      }
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }
   
BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

The Boost::Python wrapper is one and the same, and the main difference is with the regcomp, and regexec.

Right, we have step a) complete, so it's time to run and compare.

Searched string set size	Python	regex.h	boost::regex	std::regex
2	100	140	151	120
20	110	400	440	410
200	130	2310	2372	2355
2000	248	31270	31263	31278

Note: All measurements are in milliseconds, and the platform is Python 2.7.9, Cygwin/Windows, 4-core 2.10 Ghz CPU

Erm - what's going on here? It's reasonable that Python would perform similarly to the Boost regex engine, but having two orders of magnitude is entirely unexpected and counter-intuitive. This points to a defect at the user's end (i.e. my code) rather than a radical difference between the libraries. After a bit of head scratching, the problem was found - this line:

boost::match_flag_type flags = boost::match_default;

needed to be replaced with this:

boost::match_flag_type flags = boost::match_default | boost::match_any;

We have to find any match, and not necessarily scan for all of them; this is taken care of by the outer loop.
After changing the one line, we get a saner set of results:

Searched string set size	Python	boost::regex
2	100	90
20	110	100
200	130	130
2000	248	270

How much of a difference does one line make! Python terminates the regex search as soon as it finds a match, while the other libraries press on. This is why regex.h was non-performant: it simply does not have such an option (std::regex does, with very similar metrics to boost).

Here are my takeaways from this exercise:

Boost::Python is a very convenient way of exposing C++ APIs. Unless you do highly customized parameters management or do not have access to Boost, it should be preferred to manual Python_ function calls.
Whenever you hit a non-performant library, blame yourself first.
Whenever doing parallel string matching via regexes, always look for match_any parameter, or variant thereof.
Avoid regex.h, unless the powers to be force development in pure C.

Roman on Software Engineering

Sunday 26 July 2015

Taming meeting invites

Introduction

Three meeting types that work

The brainstorm

The review

The update

So far, so good, so what?

How to recognize an inefficient meeting?

How to recognize an inefficient meeting before it starts?

The brainstorm

The review

The update

What to do when an inefficient meeting lands in your calendar?

Saturday 11 July 2015

Regex engine and parallel string matching

Blog Archive