A performance regression


So, another day at the customer site, another issue to fix[1]:

Issue 234. Over the past three months, the performance of [PRODUCT] has regressed. In December, at commit 919074d, [TESTCASE] ran in 21 seconds, but now it takes 44 seconds.

The first step is to check that the problem is reproducible (it is), and the next is to run a git bisect to see if there was a bad commit:

$ git bisect start HEAD 919074d
Bisecting: 192 revisions left to test after this (roughly 7 steps)

That’s enough bisection steps to justify some automation:

#!/bin/sh

make --quiet clean
make --quiet --jobs
# GNU time prints %e (elapsed seconds) on its standard error, so run the test
# in a subshell and capture that stream while discarding the test's own output.
T=$(/usr/bin/time -f %e sh -c 'tests/TESTCASE >/dev/null 2>&1' 2>&1)
T=${T%.*}                   # truncate to whole seconds for the integer tests
[ "$T" -lt 25 ] && exit 0   # good
[ "$T" -gt 35 ] && exit 1   # bad
exit 125                    # ambiguous: tell git bisect run to skip this commit

and then:

$ git bisect run ./bisect-script
running ./bisect-script
Bisecting: 192 revisions left to test after this (roughly 7 steps)
[...]
running ./bisect-script
9719514a17cc28319f934f57ae28bb44c1d51016 is the first bad commit
commit 9719514a17cc28319f934f57ae28bb44c1d51016
Author: Gareth Rees <gdr@garethrees.org>
Date:   Fri Jan 16 14:02:00 2015 +0000

    Don't busy-wait in the server. Fixes #123.

bisect run success

So, it was all my fault! But hang on a moment, there’s no way that this particular commit could be to blame for a general performance regression, because it was a performance improvement.

[PRODUCT] consists of a group of processes that communicate with a server process via a named pipe.

Let’s have a quick rundown of the architecture. The ‘server’ is an auxiliary process that provides miscellaneous services to [PRODUCT], which connects to it via a named pipe. [PRODUCT]’s various subprocesses write commands to the pipe; the server reads these commands and does whatever is needful. The server had originally been implemented as a busy-wait loop, repeatedly trying to read from its end of the pipe. This busy-wait loop caused the server to consume 100% of a CPU (as confirmed by ps -o %cpu). This was clearly no good, so in mid-January I rewrote the server to select on the pipe instead. This means that the server no longer hogs the CPU, so surely the change could only result in a performance improvement? And yet here we are. So let’s try some experiments.
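
To make that concrete, here’s a toy sketch of the command-pipe arrangement. This is not [PRODUCT]’s actual code (the real server is not a shell script), and the pipe name and commands are made up, but the shape is the same: subprocesses write commands down a named pipe, and the server sits blocked waiting for them rather than spinning.

#!/bin/sh
# Toy server for illustration only: the pipe name and the commands
# are hypothetical.
PIPE=/tmp/product-server-pipe
[ -p "$PIPE" ] || mkfifo "$PIPE"

while :; do
    # Opening the FIFO blocks until a writer connects, and read blocks until
    # a command arrives, so the loop uses no CPU while idle, unlike the
    # original busy-wait version, which pinned a whole core.
    while read -r cmd args; do
        case "$cmd" in
            log)  printf 'server: %s\n' "$args" ;;
            quit) rm -f "$PIPE"; exit 0 ;;
            *)    printf 'server: unknown command %s\n' "$cmd" ;;
        esac
    done < "$PIPE"
done

A subprocess then sends a command by writing a line to the pipe, for example echo "log hello" > /tmp/product-server-pipe.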

It’s hard for me to convey the completeness of my bafflement at this point: you’ll just have to do your best to imagine it. How can one simple change cause a program to run twice as fast on one machine and twice as slow on another?

Maybe the Linux version has something to do with it: the machine with the performance regression is running CentOS 5, whose kernel is version 2.6.18, whereas the Ubuntu 14.04 machine where the test case runs fast has kernel version 3.13.0. So could it be something to do with the scheduler? Linux 2.6.18 uses the ‘O(1) scheduler’, but from 2.6.23 onwards the default has been the ‘Completely Fair Scheduler’. It’s hard to imagine exactly how a difference between schedulers could cause the observed effects, but the following theory is somewhat plausible:

  1. [PRODUCT]’s subprocesses regularly write to the named pipe.
  2. Writing to a named pipe makes readers on the pipe runnable.
  3. If the reader was running anyway on another CPU (as in the busy-wait case), there’s nothing for the scheduler to do.
  4. But if the reader was blocked on a select, then the kernel may well schedule it now.
  5. This scheduling decision might result in the writer process yielding.

However, it’s dangerous to get carried away by a plausible theory. Step 5 surely can’t be right: scheduling the reader would only cause the writer to yield if the machine were starved for CPU time, and that doesn’t seem to be the case: the machine was nearly idle, with a load average far below the number of CPUs. So it’s time for some more experiments.

Well, that certainly sinks my theory! If there are no writes on the named pipe, then there can’t be any scheduling decisions based on those writes. It’s beginning to look as though the common factor is that the test case runs fast if the server is hogging 100% of a CPU, regardless of whether there is any communication between the two. So maybe it doesn’t matter whether the CPU-bound process is the server?

So my new theory is that I am seeing the effect of power management. That is, I believe that a CPU-bound process causes the system to power up more CPU resources, and other programs benefit from this. A quick way to test this theory is to run a program that’s completely unrelated to [PRODUCT] but which also runs many subprocesses (let’s use the configure step from a Python source installation, which happens to be handy), and see if it benefits from a concurrent CPU-bound process:

$ time ./configure > /dev/null
real    0m18.358s
user    0m11.157s
sys     0m7.447s
$ (while :; do :; done) &
[1] 30219
$ time ./configure > /dev/null
real    0m13.889s
user    0m8.249s
sys     0m5.629s

So this is a general phenomenon and nothing to do with [PRODUCT]. It seems that what is happening is that the CPU-bound process causes one of the CPUs on the Intel Core i7 to be boosted to full speed (aka Intel Turbo Boost):

$ grep MHz /proc/cpuinfo
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
$ (while :; do :; done) &
[1] 22222
$ grep MHz /proc/cpuinfo
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 1200.000
cpu MHz     : 2801.000

and then this benefits other processes when they get scheduled onto the boosted CPU. This explains why the performance results were so different on the Ubuntu virtual machine in experiment 2: the virtual machine was configured to use a single CPU, so the CPU-bound server process competed with [PRODUCT] for the resource.

So in the end there was nothing wrong with [PRODUCT]! It was the performance measurement that was wrong: for an accurate comparison, Intel Turbo Boost has to be disabled for the duration of the test, as described at kernel.org.
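
For the record, here’s a hedged sketch of how that might be done on a reasonably modern kernel. Which sysfs knob is present depends on the cpufreq driver in use, and a kernel as old as CentOS 5’s 2.6.18 may have neither, in which case the boost has to be turned off in the BIOS instead.

# With the intel_pstate driver:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# With the acpi-cpufreq driver:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost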


  [1] This is based on a true story, but details have been changed.