So, another day at the customer site, another issue to fix:1
Issue 234. Over the past three months, the performance of [PRODUCT] has regressed. In December, at commit 919074d, [TESTCASE] ran in 21 seconds, but now it takes 44 seconds.
The first step is to check that the problem is reproducible (it is), and the next is to run a
git bisect to see if there was a bad commit:
$ git bisect start HEAD 919074d Bisecting: 192 revisions left to test after this (roughly 7 steps)
That’s enough bisection steps to justify some automation:
#!/bin/sh make --quiet clean make --quiet --jobs T=$(/usr/bin/time -f %e tests/TESTCASE >/dev/null 2>&1) [ "$T" -lt 25 ] && exit 0 # good [ "$T" -gt 35 ] && exit 1 # bad exit -1 # ambiguous
$ git bisect run ./bisect-script running ./bisect-script Bisecting: 192 revisions left to test after this (roughly 7 steps) [...] running ./bisect-script 9719514a17cc28319f934f57ae28bb44c1d51016 is the first bad commit commit 9719514a17cc28319f934f57ae28bb44c1d51016 Author: Gareth Rees <firstname.lastname@example.org> Date: Fri Jan 16 14:02:00 2015 +0000 Don't busy-wait in the server. Fixes #123. bisect run success
So, it was all my fault! But hang on a moment, there’s no way that this particular commit could be to blame for a general performance regression, because it was a performance improvement.
Let’s have a quick rundown of the architecture. The ‘server’ is an auxiliary process that provides miscellaneous services to [PRODUCT], which connects to it via a named pipe. [PRODUCT]’s various subprocesses write commands to the pipe; the server reads these commands and does whatever is needful. The server had originally been implemented as a busy-wait loop, repeatedly trying to read from its end of the pipe. This busy-wait loop caused the server to consume 100% of a CPU (as confirmed by
ps -o %cpu). This was clearly no good, so in mid-January I rewrote the server to
select on the pipe instead. This means that server no longer hogs the CPU, and so surely it could only result in a performance improvement? And yet here we are. So let’s try some experiments.
Experiment 1: checkout the head revision, remove the
select from the server (leaving a busy-wait loop), and compare the performance. Result: 40 seconds with the
select, 20 seconds with the busy-wait loop.
Experiment 2: see if the problem is repeatable on a different machine. I booted up an Ubuntu 14.04 virtual machine and re-ran experiment 1. Result: 26 seconds with the
select, 52 seconds with the busy-wait loop.
It’s hard for me to convey the completeness of my bafflement at this point: you’ll just have to do your best to imagine it. How can one simple change cause a program to run twice as fast on one machine and twice as slow on another?
Maybe the Linux version has something to do with it: the machine with the performance regression is running CentOS 5, which runs Linux kernel version 2.6.18, whereas Ubuntu 14.04 runs kernel version 3.13.0. So could it be something to do with the scheduler? Linux 2.6.18 uses the ‘O(1) scheduler’, but from 2.6.23 the default was the ‘Completely Fair Scheduler’. It’s hard to imagine exactly how a difference between schedulers could cause the observed effects, but the following theory is somewhat plausible:
select, then the kernel may well schedule it now.
However, it’s dangerous to be carried away by a plausible theory. Step 5 surely can’t be right: the scheduling of the reader would surely only cause the writer to yield if the machine were starved for CPU time, and that doesn’t seem to be the case: the load average was only a few percent. So it’s time for some more experiments.
Experiment 3: turn off all communication between [PRODUCT] and the server (no writes or reads on the named pipe), and re-run experiment 1. Result: 40 seconds with the
select, 20 seconds with the busy-wait loop.
Experiment 4: don’t run the server at all, just [PRODUCT]. Result: the test case still takes 40 seconds.
Well, that certainly sinks my theory! If there are no writes on the named pipe, then there can’t be any scheduling decisions based on those writes. It’s beginning to look as though the common factor is that the test case runs fast if the server is hogging 100% of a CPU, regardless of whether there is any communication between the two. So maybe it doesn’t matter whether the CPU-bound process is the server?
Experiment 5: don’t run the server at all, but do run another process that hogs the CPU. Use
(while :; do :; done) & to run a CPU-bound shell process. Result: the test case takes 20 seconds.
So my new theory is that I am seeing the effect of power management. That is, I believe that a CPU-bound process causes the system to power up more CPU resources, and other programs benefit from this. A quick way to test this theory is to run a program that’s completely unrelated to [PRODUCT] but which also runs many subprocesses (let’s use the
configure step from a Python source installation, which happens to be handy), and see if it benefits from a concurrent CPU-bound process:
$ time ./configure > /dev/null real 0m18.358s user 0m11.157s sys 0m7.447s $ (while :; do :; done) &  30219 $ time ./configure > /dev/null real 0m13.889s user 0m8.249s sys 0m5.629s
So this is a general phenomenon and nothing to do with [PRODUCT]. It seems that what is happening is that the CPU-bound process causes one of the CPUs on the Intel Core i7 to be boosted to full speed (aka Intel Turbo Boost):
$ grep MHz /proc/cpuinfo cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 $ (while :; do :; done) &  22222 $ grep MHz /proc/cpuinfo cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 1200.000 cpu MHz : 2801.000
and then this benefits other processes when they get scheduled onto the boosted CPU. This explains why the performance results were so different on the Ubuntu virtual machine in experiment 2: the virtual machine was configured to use a single CPU, so the CPU-bound server process competed with [PRODUCT] for the resource.
So in the end there was nothing wrong with the product! It was the performance measurement test that was wrong. For accurate comparison, Intel Turbo Boost has to be disabled for the duration of the test, as described at kernel.org.
↩ This is based on a true story, but details have been changed.