Roman on Software Engineering: 2015

Sunday, 6 September 2015

Time Management

Last time around I spoke about taming meeting invites, and in passing mentioned that this technique is one of the cornerstones upon which effective time management rests.

Time management is a bit like playing a musical instrument: many people try it, few persevere with it, and even fewer get it right.
However, this is yet another case where mastery pays off. As a superficial example, let's take two imaginary technical leads: each gets, say, 20 daily e-mails/instant messages/mailing posts with requests for help. Due to efficient schedule organisation, one manages to respond to each and every one in the same day; the other responds on average within a couple of days.

What happens is a positive feedback loop. People, seeing that the first guy provides more immediate help, start flocking to him. More importantly, he gains more visibility, and more popularity points. And, eventually, when the time comes to move up the ladder, he will have the edge over the other one, even if the latter is as good on pure professional merit.

Here's a bit of visualisation:

Of course, it's ~~a little bit~~ idealistic, and no deity has cancelled office politics or the pure element of luck, but it's undeniable that strong time management skills are an asset for both the individual and their company.
Since you're hopefully convinced by now that the topic is not negligible, we can take a look at a few helpful practices.

Time as currency

When we buy items (soft drinks, cars, houses, curiously shaped lava lamps), we always attach an intrinsic value to them: i.e. how much we're willing to spend on them. However, dollars/pounds/euros/yuan (insert your monetary unit of choice) are not the only currency we hold. In today's hectic IT environment, another important and finite asset we've got is time.

Let me stress and underline the word finite here. Spend an hour on an obscure design question, and that will be an hour you cannot spend on helping others with their challenges, brushing up on new technologies, or spending with your family or on a hobby.

Now, with material objects, we usually know how much we're willing to spend. I'm happy to invest a thousand pounds in a high-end guitar, and would not part with more than a pound for a bicycle (since I don't ride them). For another person, it will be completely the other way around: both approaches are individual and perfectly valid.

Importantly, this analogy still holds true if we replace pounds (or your currency of choice) with minutes, hours and days.

Defining a set of responsibilities for a new micro-service is something that, as an architect, I'd be willing to spend several hours on. Conversely, as a manager, I'd be unwilling to examine an interface in microscopic detail, and will restrict my time there to 30 minutes at most; the goal will be just to ensure that I understand it at a high level, and can tie it to business goals.

Same task - different intrinsic values.

Just to avoid portraying managers as superficial - there are plenty of opposite examples. An engineer does not have to spend more than 15 minutes drafting a daily update e-mail. A manager can and should spend more than that, as their role demands accurate and unequivocal messaging.

All in all, there are little mental price tags attached to all that we do; we just normally don't notice them. Being oblivious to those little tags can easily cause us to over- or under-spend; it is as if we were pulling random wads of cash from the wallet when paying over the counter in a supermarket. Understanding how many minutes we're willing to pay for X, and staying true to that, is a cornerstone technique for efficient time management.

Context switches

Despite what philosophers and humanists may say, people do resemble machines in more ways than one, and a prime example of that is the context switch. You're probably familiar with the topic, but let me mention it anyway - context switches occur whenever we move from one unfinished task to another. Neither humans nor machines perform any useful work while they swap contexts; at worst, if we constantly switch back and forth between tasks rather than completing each one in turn, we're thrashing in a state of unproductive semi-panic.

Of course, it's not hard to arrive at the solution: stop thrashing and do your work in a strict order. Or, as one Russian writer pointed out - "if you want to be happy, be so".

The problem is that things don't always go our way, and we cannot always complete all our work sequentially. If a developer starts writing a module that would take 8 hours to complete, they ~~can't~~ should not ignore the world around them for an entire day, and not deal with important code reviews, cries for help etc.

The trick is quick prioritisation. When a new request comes in, I tend to make a quick priority call among the following response times:

Drop everything and answer immediately. E.g. a widespread outage, an onsite sales engineer needing help, or an immediate call for action from my boss.
30 minutes. For example, someone else in the company is stuck with their (non-critical) task, and could be unblocked with my help.
2-4 hours. Typical examples are: a general request for information on a non-urgent customer case, or a code review.
Next working day. Planned document review, request for a weekly report etc.

The idea is that as long as the environment is not extremely chaotic, and most incoming requests do not fall into the first bucket, I can at least manage my interruptions, and reach a certain point where it's easier to switch to something else.

But, things can sure fall in the second and third bucket a lot, and quick and dirty prioritisation does not bring full salvation.

Forecasting interruptions

If your role does not include a lot of coordination, i.e. your title does not have the magic words Manager, Lead or Architect in it, then you can safely stop reading here. In many ways, you can count your blessings: people and priorities don't pull in different directions, and you get to create something new without being constantly tapped on the shoulder. So, if you're still with me (either by virtue of your role, or due to sheer curiosity), let's say that you get a lot - and I mean, a lot - of interruptions that demand a response within the same day or sooner.

Here, the challenge of being able to get on with the day-to-day job is still pretty much there. Yes, we just mitigated it a bit by doing a quick pre-filter, but mitigating does not mean solving. It is still very easy to get bogged down in a constant reactive cycle, where you just respond to requests rather than generate new value.

Yet there is one little factor that gives IT professionals an edge in solving this, and it distinguishes software professionals from, say, football goalkeepers. The latter do not get any pattern in how often balls come their way. It might be once every 5 minutes, a cluster of ten balls towards the end of the game, or never. With software professionals, there's always a pattern. Normally, emails and questions come mostly during the first hour of the working day, towards the end of the day, and whenever any time zones we work with kick in.
If we zoom out for a moment, we can also observe a seasonal flow: the end of the fiscal year tends to be way more active than mid-July.

Understanding the organisation and its industry brings in more insight into when it's safer to plunge into long, focused, endeavours, and when it's better to deal with superficial tasks.
With all of that, I tend to pre-plan time slots which are open for involved, uninterruptible engagements; for example, it might be 10 AM to 12 noon, and 2-3 PM on a typical mid-week, mid-fiscal quarter day. Both slots avoid the time needed to answer morning and last night's questions, and also the late afternoon when other time zones become active.

All that builds into day pre-planning. Before the day starts, I tend to know what meetings are scheduled, when I need to tend the goalposts (that goalkeeper analogy keeps on working!), and when I can switch into uninterruptible tasks, such as long-term planning or architecture design, without great risk of forsaking something important.

Summary

Altogether, there's no magic bullet. Yes, there are plenty of books about 7, 10, or 1024 habits of highly effective people, and no, neither you nor I will become highly efficient overnight by reading them, or indeed, this blog post.
A large part of being efficient stems from good knowledge of your company, industry, people around you, and your own strengths and weaknesses, including the ability to organise the workday, as well as to concentrate and switch from one topic to another. Moreover, time management employed by a CEO might not be right for a developer, or a mid-level manager.

As with the other posts, my goal was not to proclaim a world cure, but to share the little tips and tricks that worked for me, with the hope that they might help a few people in their careers and work/life balance.

P.S. Interestingly enough, it took over a month to write a blog on time management. That's life: it just loves irony.

Sunday, 26 July 2015

Taming meeting invites

Introduction

As I was doing the detour into various programming topics, something was nagging me a bit in the background, something that was left unsaid in the previous series on Scrum.
After a short while, I realised that many of the recommendations in that series were underpinned by the invisible hand of time management, and this is the one meta-subject that hasn't got the attention it deserves yet - at least on these pages.

For me, effective time management is a precondition for success. Technical leaders have their attention drawn in many directions in many different ways, and many times within each day and each hour. Moreover, the more they lead, the higher the frequency. At some point, we face a non-exclusive choice:

1. Work long hours
2. Ignore some of the ~~detractors~~ pleas for help
3. Actively decide how much time each particular subject merits

For years, I've been actively employing (3), though of course (1) is also unavoidable when the situation demands it. In a separate post, I'll bring up a few examples and guidelines on how one can quickly prioritise and allocate the right time slots, but first let's circle back: how is all of this related to the subject of this post, namely "meetings"?

I can answer thus: a meeting is whenever someone else decides for you how long you should spend on X. Hence, time management and meetings are deeply intertwined.

Now, before we take another step, let me add a big disclaimer: I'm not saying that meetings are bad, and that all of us should permanently sit in our rooms/cubicles/open spaces and type away at the keyboard in solitude.
The art consists of organising efficient and effective meetings on stuff that matters, and not making them longer than they should be. After all, if you spent an hour in a meeting that helped neither human nor beast, you'll have either to ignore an hour's worth of something else, something that matters, or go home one hour later.

Anyhow, let's be a bit more positive: here's the list of

Three meeting types that work

The brainstorm

Few attendees: definitely fewer than 10, but ideally fewer than 5. Everyone participates, and new decisions get formulated.

Examples include a production incident war room, or interactive technical design.

The review

Also fewer than 10 attendees, with one main presenter and a preliminary material distribution. There are all kinds of reviews: deployment, design, planning, project etc. You can also add an internal product demo in the same bucket.

The update

These meetings tend to have a large number of attendees: anything from 5 up to 50000. Their purpose is to convey information, not to kick off discussions or take new decisions.

Company-wide quarterly update, Scrum stand-up, product training, architecture overview all belong to this group.

So far, so good, so what?

Ok, so this is all beyond obvious, so why even bother listing it? As with many other phenomena, the trick is not realising what works, but what doesn't. Our enemy in this particular post is the inefficient meeting, and those exist in the big galactic void between those three constellations. In not so many words, they happen whenever someone does not have a clear picture of what type of meeting they are going to call.

The organiser might go for a brainstorm with twenty people, where only two speak, and eighteen wish for earth to open underneath their feet and swallow them whole. They might call in a review where nobody knows in advance what this is about and end up spending eons discussing mundane topics. They might go as far as call in an update with 50 people who don't care about what they are going to be updated on.

Taking out a page from a diagnostician's book, let's see how we recognize an inefficient meeting once it's upon us, and how we prevent those from occurring in the first place.

How to recognize an inefficient meeting?

This section seems a bit redundant - criteria such as "am I bored out of my mind?" and "am I answering my e-mail while it's going along?" come to mind. But they are not precise: if you and the other guy are engaged and are chatting away, while the rest of the room is doing stress testing for Facebook, the meeting is still ineffective.

My criterion has been surveying the room from time to time: if 20%+ of the attendees are mentally elsewhere, then the meeting isn't going great. You can tweak the number up to derive more adjectives: for example, with 80%+ it can be safely called "waste of time and money".

This occasional visual inspection is not a meaningless exercise. It is an important lesson and a source of feedback, since some meetings are not inherently bad; it is just that the speaker needs to do a better job of engaging. One example I covered at length in the past was the status update meeting.

However, some meetings cannot be made whole even by the most eloquent presenter, and it's possible to detect those in advance.

How to recognize an inefficient meeting before it starts?

Here, the meeting types come to the fore again:

The brainstorm

The most common fallacy in my book is a lack of previous material or preliminary discussion.

I had the pleasure of receiving generic invites in the past; for example, "Design widget A", or "How do we make our customers happy?".
The cause is noble, but then what helps everyone to be productive? What if one guy has been thinking about widget A all of his waking hours, and everyone else would not consider widget A until you served it with watercress on a plate?

This means that rather than having a brainstorm, we are going to have an update, while we won't really know whether there are other alternatives to design the widget, and/or whether we've got the right one.

The organiser's job is making sure that all the invitees:

Care about widget A
Have something to say about widget A
Know what others think about widget A

Brainstorming is much closer to playing chess than playing poker. The more we know which pieces/cards everyone else holds, the better we can choose what topics we spend time discussing, and the better we can build on each other's ideas.

In not so many words, brainstorms should not happen unless people were encouraged to think, contemplate and feed back offline first, so that we know exactly which points we are going to solve.

Note: The one exception to that rule is the war-room discussion, where there's no time for niceties, but these should be rare in a healthy organisation.

Of course, there are many other reasons why these meetings might not go well - more vocal people dominating the proceedings, political agenda etc. These are already covered to a great extent elsewhere, and are more about running rather than organising a meeting.

The review

Here, as with the brainstorm, more often than not people come to the meeting from different starting points.
For example, let's say that you're reporting on your team's progress with creating widget A. Two managers/architects in the room collaborated on said widget, two more know just what it's supposed to do, and two more got tacked on to the meeting and are vaguely aware that said widget exists.

Do you cater for the last group, and explain in minute detail what this widget is for and how it came to be? Then 5.5 guys will be wasting their time (the 0.5 is you).
Do you cater for the first group and do a technical update only on what happened in the last 2 weeks? Then only three people will be actually interested and get to ask meaningful questions.
Even with the best of intent, it's hard to make a great meeting out of this.

Another typical fallacy is people being undecided between review and brainstorm. I.e. they do a presentation, ask open questions, and let others have a discussion. That doesn't tend to succeed for two reasons:

Too many people in the meeting: brainstorms should have a small group, while reviews spread themselves out.
Lack of preparation: as per above, brainstorms need more thinking done in advance.

In short, this kind of meeting works when all attendees know the subject, care about it to a reasonable degree, and already have a few questions in mind. The best way of getting there is pre-distributing the slides, answering basic questions offline if need be, and allowing attendees to be ready with the hard questions for the meeting itself.

The update

Updates have the most attendees, and tend to live or die by the skills of their presenter. Thus, they are a bit of a special case - it's harder to tell from a meeting invite how they are going to pan out. The only realistic warning signs are the words "discuss", "decide" or their siblings - they imply that the organiser is going for an open forum in the meeting type least suited for that purpose.

What to do when an inefficient meeting lands in your calendar?

Rejecting is easy: Outlook has a convenient button for that very purpose. However, this is neither safe nor very helpful - especially if done without an accompanying note.
Explaining why this meeting might not work is less fraught with political peril, though also requires caution.

For example, it is legitimate to ask for a crisp agenda definition in a brainstorm, but demanding slides in advance for a review might be construed as unnecessary pressure. Here, we are with both feet in the minefield of office politics, and it has to be trodden very carefully.

The important point though is that it's you who is going to go home three hours later, and it's you whose priorities were forced by the meeting invite. If you firmly believe that your presence there will not provide any value, then it's best to state that and bail out using the best political tactics at your disposal - or help the organiser do a better job.
Obviously, you'll also be spawning meetings yourself; here it's best to keep to the saying: "Do unto others as you would have them do unto you".

Saturday, 11 July 2015

Regex engine and parallel string matching

The last post on Cython ended up with a surprising revelation on regexes - it turned out that Python matching of patterns OR-ed within a single regex was orders of magnitude faster than its regex.h cousin.

However, there were a few why-s, if-s and but-s left:

What is so different between the two engines? I alluded to backtracking specifics unearthed in a StackOverflow post, but that explanation wasn't truly satisfactory, as there was no other public source that referred to the same disparity.
Is this unique to regex.h? What about boost::regex or std::regex?
Does the size of the matched string set matter? Remember, I've fixed the set of needles to be searched in the haystack, but could it be that one engine performs better on smaller/larger sets?

These are exactly the questions I'd like to answer today, and here's how we are going to proceed:

a) Prepare three extensions that implement pattern matching: one for boost::regex, one for C++11 regex, and one for regex.h.
b) Run a set of tests for 2, 20, 200 and 2000 patterns which go over the same set of inputs.
c) Compare, deduce and write a blog post.
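For orientation, step b) can be sketched on the Python side alone. This is a minimal harness under stated assumptions: it targets Python 3, the needles and haystack are synthetic stand-ins rather than the inputs used in this post, and build_regex and time_matching are hypothetical helper names. The re.escape call is an extra safety measure that the original script does not use.

```python
import re
import time

def build_regex(patterns):
    # Longest needles first, so a needle is tried before any of its own prefixes
    ordered = sorted(patterns, key=len, reverse=True)
    return re.compile('|'.join(re.escape(p) for p in ordered))

def time_matching(haystack, patterns, repeats=3):
    """Return (fastest wall-clock seconds, set of needles found in haystack)."""
    regex = build_regex(patterns)
    best = float('inf')
    found = set()
    for _ in range(repeats):
        start = time.perf_counter()
        found = {m.group(0) for m in regex.finditer(haystack)}
        best = min(best, time.perf_counter() - start)
    return best, found

if __name__ == "__main__":
    # Synthetic data, for illustration only
    needles = ['word%d' % i for i in range(2000)]
    haystack = ' '.join('word%d filler' % i for i in range(0, 4000, 7))
    for size in (2, 20, 200, 2000):
        elapsed, found = time_matching(haystack, needles[:size])
        print('%5d patterns: %.3f ms, %d distinct matches'
              % (size, elapsed * 1000, len(found)))
```

The same loop over set sizes can drive any of the C++ extensions, as long as they accept the haystack and the OR-ed pattern as two strings.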

All of these shall be unveiled shortly, with an important correction that had to be performed between steps b) and c). However, I'm jumping ahead: let's step back and look at step a. As usual, I'll share the sources, but for compile/build/embed specifics will refer to previous posts.

Here's the boost::regex one:

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <boost/regex.hpp>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   boost::regex regex(pattern_regex);
   boost::match_results<std::string::const_iterator> what;
   boost::match_flag_type flags = boost::match_default;

   auto searchIt = content.cbegin();
   while (boost::regex_search(searchIt, content.cend(), what, regex, flags))
      {
      tempSet.emplace(what[1].first, what[1].second);
      searchIt = what[1].second;
      }
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }

BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

Now, it all looks like the same old, but in fact there are a couple of novelties. First, note the usage of Boost::Python. Rather than doing complicated marshalling of C++ data structures to Python and back via Python.h, we use a nice shortcut on lines 28-32 of the listing (the BOOST_PYTHON_MODULE block). Who said that writing C++ extensions for Python was hard?

To follow up on that, we're not returning a primitive type here, but rather a Python list. This is where boost::python::list does its little magic by defining a C++ type that can be automatically coerced to Pythonic structures.
On lines 22-25 of the listing (the loop that fills ret) we take our C++ set and create such a list while using C++11 move semantics to avoid unnecessary copies. Another minor landmark is the usage of emplace, which avoids additional copies and constructs a string directly in the container. As always, my advice is to read and experiment with those if you're serious about C++11.

However, let's get back to business. We have the Boost example sorted, so let's move on to std::regex. Here, there is a bit of an anti-climax, as it is almost a carbon copy of the code above, which is entirely unsurprising - std::regex was modelled after Boost. To get the code, just replace boost:: with std:: in the right places.

All we have remaining is regex.h, which is a throwback to my previous post. The main difference though is that we're back to conventional weapons; Cython for sure looked unorthodox, but we want to do a like-for-like comparison, so this will be a pure C++ wrapper.

#include <boost/python.hpp>
#include <iostream>
#include <set>
#include <regex.h>

using namespace std;

boost::python::list matchPatternsImpl(const string &content, const string &pattern_regex)
   {
   set<string> tempSet;
   regex_t regex_obj;
   // regcomp returns non-zero if the pattern fails to compile; give up with an empty list
   if (regcomp(&regex_obj, pattern_regex.c_str(), REG_EXTENDED) != 0)
      return boost::python::list();
   
   size_t current_str_pos(0);
   regmatch_t regmatch_obj[1];
   int regex_res = regexec(&regex_obj, content.c_str(), 1, regmatch_obj, 0);

   // rm_so/rm_eo are relative to the pointer passed to regexec, hence the running offset
   while (regex_res == 0)
      {
      tempSet.emplace(content.begin() + current_str_pos + regmatch_obj[0].rm_so, content.begin() + current_str_pos + regmatch_obj[0].rm_eo);
      current_str_pos += regmatch_obj[0].rm_eo;
      regex_res = regexec(&regex_obj, content.c_str() + current_str_pos, 1, regmatch_obj, 0);
      }

   // Release the memory allocated by regcomp
   regfree(&regex_obj);
  
   boost::python::list ret;
   for (auto item : tempSet)
      ret.append(std::move(item));
   return ret;
   }
   
BOOST_PYTHON_MODULE(contentMatchPattern)
{
    using namespace boost::python;
    def("matchPatternsImpl", matchPatternsImpl);
}

The Boost::Python wrapper is one and the same; the main difference lies in the regcomp and regexec calls.

Right, we have step a) complete, so it's time to run and compare.

Searched string set size	Python	regex.h	boost::regex	std::regex
2	100	140	151	120
20	110	400	440	410
200	130	2310	2372	2355
2000	248	31270	31263	31278

Note: All measurements are in milliseconds, and the platform is Python 2.7.9, Cygwin/Windows, 4-core 2.10 GHz CPU.
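To put the table in perspective, the slowdown of each library relative to Python can be worked out directly from the numbers above. The snippet assumes Python 3 (true division); slowdown_vs_python is a hypothetical helper name, and the figures are copied from the table rather than measured again.

```python
# Timings in milliseconds, copied from the table above
timings = {
    2:    {'Python': 100, 'regex.h': 140,   'boost::regex': 151,   'std::regex': 120},
    20:   {'Python': 110, 'regex.h': 400,   'boost::regex': 440,   'std::regex': 410},
    200:  {'Python': 130, 'regex.h': 2310,  'boost::regex': 2372,  'std::regex': 2355},
    2000: {'Python': 248, 'regex.h': 31270, 'boost::regex': 31263, 'std::regex': 31278},
}

def slowdown_vs_python(row):
    """Divide every library's timing by the Python timing of the same row."""
    base = row['Python']
    return {name: round(ms / base, 1) for name, ms in row.items() if name != 'Python'}

for size in sorted(timings):
    print(size, slowdown_vs_python(timings[size]))
```

The ratio grows with the size of the pattern set: under 2x for two patterns, and over 100x for 2000 patterns.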

Erm - what's going on here? It would be reasonable for Python to perform similarly to the Boost regex engine, but a gap of two orders of magnitude is entirely unexpected and counter-intuitive. This points to a defect at the user's end (i.e. my code) rather than a radical difference between the libraries. After a bit of head scratching, the problem was found - this line:

boost::match_flag_type flags = boost::match_default;

needed to be replaced with this:

boost::match_flag_type flags = boost::match_default | boost::match_any;

We have to find any match, and not necessarily scan for all of them; this is taken care of by the outer loop.
After changing the one line, we get a saner set of results:

Searched string set size	Python	boost::regex
2	100	90
20	110	100
200	130	130
2000	248	270

What a difference one line makes! Python terminates the regex search as soon as it finds a match, while the other libraries press on. This is why regex.h was non-performant: it simply does not have such an option (std::regex does, with very similar metrics to Boost).
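The two policies are easy to see on a toy input. Python's re tries the alternatives left to right and accepts the first one that matches, whereas POSIX specifies that regexec reports the longest of the leftmost matches. The sketch below emulates the second policy for plain literal alternatives only; leftmost_longest is a hypothetical helper written for illustration, and it says nothing about how either engine is implemented internally.

```python
import re

# Python stops at the first alternative that matches at the leftmost position
first_wins = re.search('a|ab', 'ab').group(0)   # 'a', even though 'ab' also matches

def leftmost_longest(alternatives, text):
    """Emulate a 'best match' policy for literal alternatives:
    at the leftmost matching position, try them all and keep the longest."""
    for pos in range(len(text)):
        candidates = [alt for alt in alternatives if text.startswith(alt, pos)]
        if candidates:
            return max(candidates, key=len)
    return None

print(first_wins)                            # a
print(leftmost_longest(['a', 'ab'], 'ab'))   # ab
```

With 2000 alternatives, having to consider every candidate at a position instead of stopping at the first hit is a plausible source of extra work, which fits the measurements above.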

Here are my takeaways from this exercise:

Boost::Python is a very convenient way of exposing C++ APIs. Unless you do highly customized parameter management or do not have access to Boost, it should be preferred to manual calls into the Python C API (Python.h).
Whenever you hit a non-performant library, blame yourself first.
Whenever doing parallel string matching via regexes, always look for a match_any parameter, or a variant thereof.
Avoid regex.h, unless the powers that be force development in pure C.

Sunday, 28 June 2015

Cython and regular expressions

Recap

I'm returning yet again to the long suffering text matching example from a couple of months back. The goal is/was to take a text document, a set of patterns, and see which of these patterns could be found. We went through a variety of techniques: Python micro-optimisations, regexes, C++ extensions, Twisted, reactors and improved string matching algorithms. My intent this time around is to try improvements brought out by Cython.

In case you're unfamiliar with Cython, I'll do a two-sentence introduction: its aim is to preserve the ease of development inherent to Python while extending it with static type definitions and easy C/C++ bindings. Considering that dynamic typing comes at a great cost, it's perfectly suited for optimising hotspots in Python code and striking a balance between productivity and performance.

As before, the end result turned out to be surprising (at least to me), but what matters is the journey. Let's embark on it.

Firstly, I'll shake the dust off the pure Python example:

from twisted.internet import reactor, defer, threads
import sys, re

def stopReactor(ignore_res):
   reactor.stop()

def printResults(filename, matchingPatterns):
   print ': '.join([filename, ','.join(matchingPatterns)])

def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   matchingPatterns = set()
   for matchObj in pattern_regex.finditer(pageContent):
      matchingPatterns.add(matchObj.group(0))

   printResults(filename, matchingPatterns)

def parallelScan(filenames, patterns):
   patterns.sort(key = lambda x: len(x), reverse = True)
   pattern_regex = re.compile('|'.join(patterns))

   deferreds = []
   for filename in filenames:
      d = threads.deferToThread(scanFile, filename, pattern_regex)
      deferreds.append(d)

   defer.DeferredList(deferreds).addCallback(stopReactor)

if __name__ == "__main__":
   with open(sys.argv[1]) as filenamesListFile:
      filenames = filenamesListFile.read().split()
   with open(sys.argv[2]) as patternsFile:
      patterns = patternsFile.read().split()

   parallelScan(filenames, patterns)
   reactor.run()

This was just a copy/paste to save you the bother of clicking through to an older post. I still preserved the reactor and multi-threading, but they won't be playing a major role today.
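One line in parallelScan deserves a closer look: the patterns are sorted longest-first before being OR-ed into a single regex. The sketch below shows why, using Python 3 and no Twisted; find_needles and its longest_first switch are hypothetical names introduced only for this illustration.

```python
import re

def find_needles(page_content, patterns, longest_first=True):
    """Return the set of patterns found in page_content via one OR-ed regex."""
    if longest_first:
        # Same ordering as parallelScan: longer needles are tried first
        patterns = sorted(patterns, key=len, reverse=True)
    pattern_regex = re.compile('|'.join(patterns))
    return {match.group(0) for match in pattern_regex.finditer(page_content)}

# 'foo' is a prefix of 'foobar', so the order of alternatives decides which one is reported
print(find_needles('foobar', ['foo', 'foobar'], longest_first=False))  # {'foo'}
print(find_needles('foobar', ['foo', 'foobar'], longest_first=True))   # {'foobar'}
```

Since Python accepts the first alternative that matches, an unsorted list lets a short needle shadow any longer needle that starts with it.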

We know that it's scanFile where we spend most of the time, and this is the hotspot that needs optimisation treatment with Cython. Since the matching is done via regexes, we cannot in good faith claim that we're optimising without also switching their library; here, I'm going to use regex.h.

Writing Cython code is not a common skill, and we're not dealing with Hello, World here, so as the narrator, I have two choices: jump to the end result, and go for an autopsy, or build it piece by piece. Taking the mantra that it's the journey that matters, I'll go for the second option.

Cython

Going for the easy bit, here's the updated scanFile function:

def scanFile(filename, pattern_regex):
   pageContent = open(filename).read()
   matchingPatterns = contentMatchPatternCython.matchPatterns(pageContent, pattern_regex)
   printResults(filename, matchingPatterns)

Nothing glamorous here, we just defer matching to the Cython extension that will be built.
Let's tick the build checkbox too:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(name='contentMatchPatternCython',
      cmdclass={'build_ext': build_ext},
      ext_modules = [Extension('contentMatchPatternCython', 
                     sources = ['contentMatchPatternCython.pyx'])])

Still nothing exciting, as this is just a small bootstrapping script that will build the Cython extension.

Time to take a deep breath, and look at how we write the actual algorithm.
This is the skeleton:

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   return matchingPatterns

Baby steps here. We just return an empty set to satisfy the function signature; however, it already differs from vanilla Python code by defining types for pageContent, regex and matchingPatterns.
Why? We'd like to use Cython's memory and performance optimisations by pre-defining the variable types which will get us optimised variable storage and access. In this specific case, there is no requirement to support unicode, hence the bytes type definition.

Now, let's import regex.h functions:

cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg)

Note that we only import what we need later on, and that the magic nogil keyword on the extern block tells Cython that the imported functions are safe to call without holding the GIL. The GIL itself is only released inside an explicit with nogil: block.

Ok, so we have the interface and the required C library functions imported into Cython. All that's left is the algorithm, and tying it all together:

cdef extern from "regex.h" nogil:
    ctypedef struct regmatch_t:
       int rm_so
       int rm_eo
    ctypedef struct regex_t:
       pass
    int REG_EXTENDED
    int regcomp(regex_t* preg, const char* regex, int cflags)
    int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags)
    void regfree(regex_t* preg) 

def matchPatterns(bytes pageContent, bytes regex):
   cdef set matchingPatterns = set()
   cdef regex_t regex_obj
   cdef regmatch_t regmatch_obj[1]
   cdef int regex_res = 0
   cdef int current_str_pos = 0
   
   regcomp(&regex_obj, regex, REG_EXTENDED)
   regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)
   while regex_res == 0:
      matchingPatterns.add(pageContent[current_str_pos + regmatch_obj[0].rm_so: current_str_pos + regmatch_obj[0].rm_eo])
      current_str_pos += regmatch_obj[0].rm_eo
      regex_res = regexec(&regex_obj, pageContent[current_str_pos:], 1, regmatch_obj, 0)

   regfree(&regex_obj)
   return matchingPatterns

The interesting stuff happens at lines 19-24, where the usual regex compilation and execution take place. The code is not a one-to-one translation of pure Python, since the underlying C library has no notion of generators, and we have to do string slicing (which is a cousin once removed of pointer arithmetic).
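For readers who find the offset bookkeeping hard to follow, the same slicing loop can be sketched in pure Python. This is an illustration rather than the post's original Python-only script: the function name match_patterns is hypothetical, and the guard against empty matches is an addition that the Cython version above does not have.

```python
import re

def match_patterns(page_content: bytes, pattern: bytes) -> set:
    """Collect all matches by repeatedly searching the remaining slice."""
    compiled = re.compile(pattern)
    matching = set()
    pos = 0
    while pos <= len(page_content):
        # Offsets from search() are relative to the slice, like rm_so/rm_eo.
        found = compiled.search(page_content[pos:])
        if found is None:
            break
        start, end = found.span()
        matching.add(page_content[pos + start:pos + end])
        # Step past the match; move at least one byte so an empty match
        # cannot stall the loop (an extra guard, not in the Cython version).
        pos += max(end, 1)
    return matching

result = match_patterns(b"foo bar foo baz", b"foo|baz")
```

In both versions the match offsets are relative to the current slice, so the running position has to be added back before cutting the match out of the full content.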

Cython is relatively new to me as well, and even though I've written the code above, it caused a certain degree of cognitive dissonance: a bit akin to someone seeing a platypus for the first time.

The code looks like C, but semicolons are nowhere to be found, while there are Python indentations and weird brackets around strings. If you've been developing in both languages for a while, seeing them unified in matrimony within a single function is a new experience.

Anyhow, time to cut the nostalgia, stop coding and start measuring! At this point, my intent was to try this on a single <file, pattern> pair, and move on to doing further Cython tweaks to ensure we can release the GIL on the entire regex matching loop. Reality, however, turned out to be different...

Performance

From here, and until the end of the post, the inputs are fixed: we use a single haystack (17KB source of google.html) and a fixed set of needles (first 2000 tokens of /usr/share/dict/words).

Making sure to run a few iterations and take the fastest one, here's the timing using the Python-only example:
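The "fastest of several runs" idea can also be applied in-process with the standard timeit module. Note that this is a different technique from the shell time command used in this post, and the haystack and pattern below are hypothetical stand-ins rather than the real inputs:

```python
import re
import timeit

# Hypothetical stand-ins for the real haystack and needles.
haystack = b"lorem ipsum dolor sit amet " * 200
pattern = re.compile(b"ipsum|amet|missing")

def run_once():
    """One measured unit of work: collect the distinct matches."""
    return {m.group() for m in pattern.finditer(haystack)}

# Repeat the measurement and keep the minimum: the fastest run is the
# one least disturbed by other activity on the machine.
timings = timeit.repeat(run_once, repeat=5, number=10)
best = min(timings)
print("best of 5: %.6f s per 10 calls" % best)
```

Unlike time, this excludes interpreter start-up and file reading, so the two kinds of numbers are not directly comparable.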

$ time python pythonOnly.py shortFilenamesList.txt shortWords.txt

real 0m0.868s
user 0m0.546s
sys 0m0.312s

All right, so how does Cython compare? Drum roll please...

$ time python pythonWithCython.py shortFilenamesList.txt shortWords.txt

real 0m2.396s
user 0m2.012s
sys 0m0.374s

Ouch - roughly 2.8 times slower! Of course, such a difference has to be explained. In my mind, there are two possible sources: a different regex library or Cython itself. The former is far more likely, but we should never make hasty assumptions when performance tuning. Hence, the natural next step was to write the same in C.

I'll omit the C code for brevity, as it mirrors the Cython code with semicolons and pointer arithmetic thrown in. Let's jump to the result:

$ time ./a.exe

real 0m1.547s
user 0m1.513s
sys 0m0.031s

No joy - it's still way slower than Python. Time for

Conclusions

So, we know by now that it's the difference in the regex library that matters; no low-level optimisations in Cython or C can overcome that. In hindsight, there was little basis to suppose that Cython would matter on this specific task, as Python's re implementation is also native C.

However, this was not time wasted, and here's why:

This is yet another tangible example of why blanket statements such as "Language X is slow, and Y is fast" are untrue. In this case, Python came out faster; in another it may well be slower.
Thanks to user @nhahtdh on StackOverflow, I gained insight into the difference in performance. regex.h implements a backtracking engine, and does not optimise non-backtracking regexes, which is almost certainly something that Python re does.
Good Cython practice. There is a reasonable template in place to import C libraries into Cython, link with them, build etc. (There are of course plenty of other resources that show the same, but it's always helpful to work through it step by step yourself.)

What next?

Coming back to the task at hand: can we truly say that Python's re engine is simply better? My answer is a definite 'no' since engines shine on different patterns. Ours has been a long sequence of OR expressions, and it is just one possibility among many.
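To make "a long sequence of OR expressions" concrete, here is one way such a pattern could be assembled in Python. This is a sketch under assumptions: the word list is made up, and whether the original scripts escaped their tokens is not something this post states.

```python
import re

# Hypothetical needles; the post takes its tokens from /usr/share/dict/words.
words = [b"apple", b"banana", b"c++"]

# Escape each token so characters like '+' are matched literally, then
# join them into one alternation: apple|banana|c\+\+
pattern = b"|".join(re.escape(word) for word in words)
compiled = re.compile(pattern)

found = {m.group() for m in compiled.finditer(b"banana split with c++ and apple pie")}
```

With 2000 tokens the alternation becomes very long, which is exactly the kind of pattern where engines can differ widely in how they handle the branches.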

So, there are many options to take it further: try other matching patterns, compare with C++ boost::regex, or go for examples that are less dependent on supporting libraries. While Cython has not been of much help with optimising our venerable string matching example, it still has a future in this blog.

Monday, 15 June 2015

Interviews: what not to say or do

My previous couple of posts on hiring were for the interviewer's sake: what to ask, what to tell, and in general, how to land that perfect candidate.

However, what about the interviewees? They are the ones that withstand a barrage of questions, and have far more at stake: shouldn't we give them a bit of advice too?

My recent experience runs towards the cushy part, i.e. the interviewer, but after talking to dozens of potential colleagues, I've acquired a mini-collection of behaviours that either put me off a particular candidate or vice versa. This is precisely the collection I'm going to share below.

As for the order: a while back I was taught that in each monologue it's good to cover the negatives first and finish on positives. So, let's start with

What not to do on interviews

Don't go on a tangent. There's a great British comedy talk show called QI. In that show, one gains points by bringing up quite interesting - hence the name - facts that might or might not be related to the question asked (hint: it's usually the latter).

Well, a job interview is exactly the opposite of that. Talking about B while being asked about A is a bad idea, and in case you're unsure why, I'll elaborate. Firstly, it gives me, the interviewer, a clear indication that the guy on the other end knows nothing about A. Secondly, it leads me to deduce that on this subject they might only know about B. Thirdly, it points to a potential communication problem.
It may well sound trite and obvious, but this is by far the most frequent violation I've encountered.

A while back, I hired senior C++ developers, and one of the more advanced questions was:

When would you not define a virtual destructor?

I took great pains to underline the word not when asking the question. Nevertheless, many people chose to explain when one would have defined a virtual d'tor, which is of course a far simpler question. A large subset answered the same inverted question again after I gave them another chance.
Needless to say, they would have been in a far better situation by just admitting ignorance.

Don't guess. Speaking of admitting ignorance - guessing an answer is never the path to fame and glory. If you honestly say that you're unfamiliar with the answer, then it perhaps highlights a technical gap, but not a personal one. Guessing at an interview, however, shows that the candidate might do the same when working with others and writing production code, and let's face it - none of us are omniscient.

One might object that you have a chance of guessing correctly, and thus avoid the double whammy. Well, unless we have a software developer and an actor all rolled into a single package, the lack of confidence in the answer will be still audible and noticeable. Note that there are shades of grey here, and it's perfectly valid to reason about the possible answer, e.g. "I'm not sure how database X implements durability, but based on database Y, I'd guess they use a temporary log buffer".

Again, it happened to me a number of times. In one interview, I asked about the difference between HTTP/1.0 and 1.1, and after a short pause got a seemingly confident and completely incorrect answer about the latter introducing load balancing. In my mind, the interview ended right there, although of course we ran it to completion.

In short, "Don't know" is not a rude or dirty phrase. It is much better than giving the right answer to the wrong question or guessing.

Don't project overconfidence. Continuing the previous thought: it's crucial to have a realistic assessment of your skills.

Picture two candidates: both are able to write basic Python code, but neither goes far into intermediate concepts such as generators, property methods or mixin classes. One describes his Python skills as moderate, while the other emphasises Python experience on the CV and declares full mastery. Which of the candidates would you progress with?

This exact situation happened to me, and obviously the second candidate got filtered out early on. The frustrating thing here is that his skill and salary expectations were good enough for the role, which was fairly junior. He simply overestimated himself, and that means that he would have been doing the same at work, and that is a recipe for interpersonal issues.

Basically, try and grade yourself ahead of the interview on each of the keywords in the CV. Go to sites such as StackOverflow and try answering questions on your favourite keywords. Does your self-assessment stack up?

Don't be emotional. This is a very big no-no. It's great to express a gamut of emotions when auditioning for a role. It's also fine to be expressive during important keynotes and motivation speeches. Not so good on job interviews.

From my point of view: if a person is unbalanced even in such an artificial and special setting as an interview, what will happen when they start working with us? Would they be a living incarnation of a jack-in-the-box?
Of course, it might be that they are simply nervous, and in day-to-day interactions, they'll be an oasis of calm and reliability. They just might. But, this interview is all I have to go by, and employing a person erroneously is a costly and emotionally charged mistake.

Don't speed up. It's not the most common fallacy, but it did happen to me on a number of occasions. The interviewee treats the end of my question as a starter gun in a 100 meter dash and proceeds to fire out sentences at a serious rate of knots.

I can (sort of) follow the thought pattern here: "I'm not quite sure what he is looking for, so let's do a scatter shot and see if one of the answers hits the target".

Unfortunately, the main result achieved by this method is a mild headache at the other end of the wire (or room). Also, even if one of the answers is correct, it's the wrong ones that are going to matter.

However, let's switch subjects a little bit. Apart from knowing the answers to the technical questions and avoiding the pitfalls above,

What to do before interviews?

Read up on the company and its products. It always makes a good impression when the candidate knows more than just the title page of the website. It does not take long to Google the company, and become aware of their business model and latest happenings. It also kills two birds with one stone: it shows the interviewer that you are a diligent person, and helps with figuring out whether it's the right place for you.

Again, must sound very obvious, and again, not many people do this - especially software engineers. I guess sometimes the mindset is: "I'm going to develop code and get paid, their business model is not my business."
There are more than enough reasons why this is not the right thought pattern, but nevertheless - even the first bird, i.e. showing diligence, should be convincing enough.

At one call, I had a candidate succinctly describe to me all of our recent developments, with highlights from the latest Gartner report thrown in. It gave him a +50 karma points boost for the rest of the chat.

Read up on the interviewer. Fifteen years ago, the average online presence was minimal. Unless you were lucky enough to be quizzed by a public figure, all you had to go by ahead of time was the interviewer's name.
Today, this excuse does not hold any longer. We've got LinkedIn, blogs, forums - plenty of input to figure out what your counterpart knows and cares about. For example, if you're interviewing with me (yes, we have roles open!) - finding this blog might help.
Moreover, if that person is also your hiring manager, then you get a chance to understand compatibility and yet again, show personal diligence.

Get good questions ready. Don't think that the interview is finished whenever the technical questions end and you get the virtual or physical microphone. It is still going on, and the questions you ask cast a shadow (or light of glory) on you as a person.

Here, I had the whole range: from silence, to vocalised "10 questions to ask on interview" articles.
Silence is practically the worst option, since it shows detachment and lack of care. Asking stock questions, such as "What do you like best about working in your company?" does not do any harm, but does not do much good either. Asking stock questions gets you stock answers, and does not show personal preparation. It is a bit like the worn-out "Name-your-three-best-worst-qualities" question that some employers still stubbornly hang on to.

As with the other bullet points, it's best to ask questions pertinent to the company, role and what you heard so far. For example,

You do your development in C++; do you go for the latest standards, and how extensively do you use external libraries?

You mentioned that this role has a customer element to it: can you describe typical customer interactions I'm likely to have?

How often do you release software? Can you describe a typical release cycle?

As you are based on third-party IaaS systems, is there any special testing process that you follow, and how do you get a development environment similar to the target deployment?

Can you describe the technical career progression in your company?

These questions show that you've been attentive, wish to understand the position in depth, and that you have previous experience with similar processes.

Be ready to back up your CV. This is of course the biggie. Before an important interview, try re-reading your CV, and see if you have any stale skills that need a refresher. Obviously, it does not mean that you stand a chance of getting away with techniques you've never used, and inserted in the resume to gain attention. But, if you genuinely used a specific technology for a few years in the past, then brushing up won't do any harm, and may avoid the embarrassment of stalling on basic questions.

I was once interviewing a person who was mostly brought in due to his networking stack knowledge: SSL, HTTP, TCP/IP, DNS - he had them all. The most frustrating thing was not even that he could not answer entry level questions on most of them. It was that it was evident that he knew them at some point, since snippets of the right terms were popping up while he was searching for the answers. However, there were also plenty of holes in other topics, and hiring based on a hunch is simply too much of a risk, including for the candidate, who had active employment at the time. We had to pass, while I'm sure (or hope) that if he had done a half-day refresher ahead of our meeting, the decision could have been different.

Wrapping up

If you followed me all the way here, you're probably regretting the 240 seconds wasted. So much obvious advice, which is already spread across countless blogs and articles.
If that's the case, you are correct - few people will find genuinely new information here. The hardest part about changing behaviour is not reading about it, but doing it, and my task at hand was giving a few examples and sharing personal experience to stress why it's so important.

Landing the right job is a crucial moment in our lives. It's disappointing, yet understandable, not to get it because we simply lack the skills or experience, or even because we encountered the interviewers on a bad day.
But, losing an opportunity due to not presenting yourself in the best light possible is more than disappointing; it means that we lost a chance to invest a few hours to make the next few years better.
