Toyota’s ‘five whys’ process is well known to engineers: it’s one of the basic problem-solving tools in modern engineering management systems like Six Sigma and Total Quality Management. Here’s how Taiichi Ohno introduces the process in Toyota Production System: Beyond Large-Scale Production (1978):
When confronted with a problem, have you ever stopped and asked why five times?[1] It is difficult to do even though it sounds easy. For example, suppose a machine stopped functioning:
- Why did the machine stop?
There was an overload and the fuse blew.
- Why was there an overload?
The bearing was not sufficiently lubricated.
- Why was it not lubricated sufficiently?
The lubrication pump was not pumping sufficiently.
- Why was it not pumping sufficiently?
The shaft of the pump was worn and rattling.
- Why was the shaft worn out?
There was no strainer attached and metal scrap got in.
Repeating why five times, like this, can help uncover the root problem and correct it. If this procedure were not carried through, one might simply replace the fuse or the pump shaft. In that case, the problem would recur within a few months.
To tell the truth, the Toyota production system has been built on the practice and evolution of this scientific approach. By asking why five times and answering it each time, we can get to the real cause of the problem, which is often hidden behind more obvious symptoms.
So it’s an important technique if you have a defect. But in software, it’s an important technique even if you don’t have a defect. It’s easy to change the behaviour of a program (that’s what makes software ‘soft’), and so although part of one’s time as a programmer is spent fixing broken programs, much more of it is spent changing working programs. This is a risky activity: if a program is working and you change it then there’s a chance that you’ll break it because you didn’t understand all the implications of your change. This is where the ‘five whys’ technique comes in. Abstractly, the process proceeds along these lines:
Let’s take a concrete example from the Memory Pool System. Suppose you are looking at finalization and wondering whether it would make sense to run an object’s finalizer promptly: that is, as soon as it is established that there are no strong references keeping the object alive. But before you start changing things, you should ask:
If you’re going to carry out this technique on a working system, it’s handy if the system is traceable—that is, you can trace the links from the code to its design, and from the design to the requirements, and from the requirements to the users. If these links are too hard to find, then you’ll probably give up on trying to figure out the real reason for the behaviour, and just use your skill and judgement. So let’s look at an example where the system failed to be traceable and we initially made a poor decision.
When we were porting the MPS to OS X, we chose to use the native Mach threads interface instead of the POSIX threads emulation layer. We chose Mach because you can easily suspend and resume a target thread by calling the thread_suspend() and thread_resume() functions, whereas in POSIX, suspending a thread is a complex operation, requiring co-operation from the target thread. The implementation we use is for the target thread to install a handler for signal S1, and in the handler call sigsuspend() to wait for signal S2. Then you can send S1 to suspend it and S2 to resume. This requires all threads to be registered with the MPS by calling mps_thread_reg(), so that (among other reasons) on systems using POSIX threads the handler can be installed.
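The S1/S2 handshake described above can be sketched in C roughly as follows. This is a minimal illustration and not the MPS implementation: it assumes a Linux-like POSIX system, uses SIGUSR1 and SIGUSR2 to stand in for S1 and S2, and adds a semaphore so that the suspender knows when the target has actually stopped. Every function name here (install_handlers, suspend_thread, resume_thread, suspend_resume_demo) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define SIG_SUSPEND SIGUSR1 /* assumption: stands in for "S1" */
#define SIG_RESUME  SIGUSR2 /* assumption: stands in for "S2" */

static sem_t parked;           /* posted by a target once it has stopped */
static atomic_long work_done;  /* demo only: progress made by the worker */
static atomic_int stop_worker; /* demo only: tells the worker to exit */

/* The S2 handler does nothing; it exists only to end sigsuspend(). */
static void on_resume(int sig) { (void)sig; }

/* The S1 handler parks the thread until S2 arrives. */
static void on_suspend(int sig)
{
    int saved_errno = errno;
    sigset_t mask;
    (void)sig;
    sigfillset(&mask);
    sigdelset(&mask, SIG_RESUME); /* while parked, let only S2 through */
    sem_post(&parked);            /* tell the suspender we have stopped */
    sigsuspend(&mask);            /* sleep here until S2 is delivered */
    errno = saved_errno;
}

/* Install both handlers; returns 0 on success. */
int install_handlers(void)
{
    struct sigaction sa;
    if (sem_init(&parked, 0, 0) != 0) return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_resume;
    if (sigaction(SIG_RESUME, &sa, NULL) != 0) return -1;
    /* Block S2 inside the S1 handler, so that a resume sent early stays
       pending until sigsuspend() atomically unblocks it. */
    sigaddset(&sa.sa_mask, SIG_RESUME);
    sa.sa_handler = on_suspend;
    return sigaction(SIG_SUSPEND, &sa, NULL);
}

/* Send S1 and wait until the target reports that it has parked. */
int suspend_thread(pthread_t t)
{
    if (pthread_kill(t, SIG_SUSPEND) != 0) return -1;
    while (sem_wait(&parked) != 0)
        if (errno != EINTR) return -1;
    return 0;
}

/* Send S2 to wake the target out of sigsuspend(). */
int resume_thread(pthread_t t)
{
    return pthread_kill(t, SIG_RESUME) == 0 ? 0 : -1;
}

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_worker))
        atomic_fetch_add(&work_done, 1);
    return NULL;
}

/* Demo: returns 1 if the worker made no progress while suspended. */
int suspend_resume_demo(void)
{
    struct timespec pause = { 0, 20 * 1000 * 1000 }; /* 20 ms */
    pthread_t t;
    long before, after;
    if (install_handlers() != 0) return 0;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return 0;
    if (suspend_thread(t) != 0) return 0;
    before = atomic_load(&work_done);
    nanosleep(&pause, NULL);
    after = atomic_load(&work_done);
    if (resume_thread(t) != 0) return 0;
    atomic_store(&stop_worker, 1);
    pthread_join(t, NULL);
    return before == after;
}
```

Even in this reduced form, the target has to take part in its own suspension, which is the co-operation that the Mach interface avoids.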
All looked good with the Mach threads implementation, except that one of the test cases was failing. Let’s apply the ‘five whys’ technique:
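- Why is the test case failing?
Because mps_thread_reg is being called twice on the same thread.
- Why is it being called twice?
Because the test case does so deliberately:

    /* register the thread twice, just to make sure it works */
    die(mps_thread_reg(&thread, (mps_arena_t)arena), "thread_reg");
    die(mps_thread_reg(&thread2, (mps_arena_t)arena), "thread2_reg");

- Why does the test case do that?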
This one was tricky. As far as we knew, thread registration is reliable and doesn’t need to be done twice. We looked at the history for clues: the comment had been there since the first version of the test was written in February 2000. The commit message said:
    new unit
    change.dylan.kinglet.160270
    MT version of amcss PosixThreads only
We could have asked the committer, but after so many years it seemed unlikely that he would remember why he had written the test case that way. The second line of the commit names a change (a related group of tasks) for the DylanWorks product release codenamed kinglet. We looked up the request that generated the change, and it said:
    request.dylan.160270
    MMDW is not available for Linux
    Dylan want a new platform: Linux (RedHat 5.2), Intel, GCC version 126.96.36.199
    Needs a real implementation of protection, virtual memory arena, threads, stack scanner
There was nothing here explaining the mysterious test, so we were faced with a choice between implementing multiple registration on OS X (but not knowing why we were doing it), or changing the test case to remove the multiple registration test (but not knowing why it was there in the first place, and so possibly failing to detect some bug). The former was tricky, so we implemented the latter.
However, I wasn’t completely satisfied. It had clearly been quite a lot of work, back in 2000, to support threads being registered multiple times on POSIX threads platforms, because during thread suspension it was necessary to deduplicate the list of registered threads to avoid signalling them twice. No-one would have gone to such effort unless there had been a very good reason for it. So from time to time I had a poke around the project archives in search of relevant facts. And finally I dug up a nugget of gold:[2]
    request.dylan.160252
    Stack scanning is broken with multiple threads calling in to Dylan

    A C program could create a number of threads, some of which call into Dylan. When a thread (other than the ‘main thread’, whose stack is registered when the Dylan DLL is loaded) calls into Dylan for the first time, Dylan registers it and creates a stack-scanning root for it. [...]

    Note that this problem must be solved in MPS, and cannot be fixed by Dylan arranging to deregister the thread when returning control from Dylan code back to C. The reason is that control might be transferred by a mechanism other than a return (e.g. a longjmp).
So now we can answer the ‘whys’:
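- Why is the test case failing?
Because mps_thread_reg is being called twice on the same thread.
- Why does the MPS need to support that?
Because the runtime that registered a thread cannot always deregister it: control might leave the runtime by a mechanism other than a return, such as longjmp(). The situation in which this might happen is one where the foreign code calls into the MPS-using language runtime, which calls back out into foreign code, which raises an exception.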
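To see why a non-local exit leads to double registration, here is a small sketch in C. It assumes a hypothetical runtime that registers the calling thread on entry and deregisters it on exit; none of these names come from the MPS or from Dylan, and a counter stands in for the list of registered threads.

```c
#include <setjmp.h>

static jmp_buf foreign_handler; /* the foreign program's "exception handler" */
static int registrations;       /* stands in for the list of registered threads */

static void runtime_register(void)   { registrations++; }
static void runtime_deregister(void) { registrations--; }

/* Entry point of the hypothetical language runtime. */
static void runtime_call(void (*callback)(void))
{
    runtime_register();   /* register the calling thread on entry */
    callback();           /* the runtime calls back out into foreign code */
    runtime_deregister(); /* skipped entirely if the callback longjmps */
}

/* Foreign callbacks: one raises an "exception", one returns normally. */
static void raise_exception(void) { longjmp(foreign_handler, 1); }
static void return_normally(void) { }

/* Call into the runtime `calls` times from the same thread and report how
   many registrations are left outstanding afterwards. */
int registrations_after(int calls, int escape)
{
    volatile int i;
    registrations = 0;
    for (i = 0; i < calls; i++) {
        if (setjmp(foreign_handler) == 0)
            runtime_call(escape ? raise_exception : return_normally);
        /* otherwise control arrived here via longjmp, bypassing
           runtime_deregister() */
    }
    return registrations;
}
```

With normal returns every registration is paired with a deregistration, so registrations_after(2, 0) returns 0. When the callback escapes with longjmp(), the same thread ends up registered once per call, so registrations_after(2, 1) returns 2.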
I documented the requirement in the thread manager design, so that future developers won’t have to go through the same archive search. And now I suppose that I’m going to have to implement multiple thread registration on OS X.
This requirement, now discovered, allows us to answer other questions about the proper behaviour of the MPS thread manager. For example, does the MPS need to cope with dead threads, or should it rely on the client program to deregister them? The requirement to be able to support embedding of MPS-using code (such as a language runtime) in a program that does not co-operate with the MPS means that a program cannot reliably deregister threads before they die, and so the MPS has to make a best effort to handle dead threads.
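Supporting multiple registration on POSIX threads platforms meant deduplicating the list of registered threads before signalling them. As a rough sketch of that step only (not the MPS code; unique_threads is a hypothetical name, and the quadratic scan assumes that the number of registered threads is small):

```c
#include <pthread.h>
#include <stddef.h>

/* Copy each distinct thread from `registered` into `unique` (which must have
   room for `count` entries) and return how many were copied. */
size_t unique_threads(const pthread_t *registered, size_t count,
                      pthread_t *unique)
{
    size_t n_unique = 0;
    for (size_t i = 0; i < count; i++) {
        int seen = 0;
        /* pthread_t is opaque, so compare with pthread_equal(), not == */
        for (size_t j = 0; j < n_unique; j++) {
            if (pthread_equal(registered[i], unique[j])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            unique[n_unique++] = registered[i];
    }
    return n_unique;
}
```

A suspender would then signal only the threads in `unique`, so that a thread registered twice receives the suspend signal once.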
[1] Obviously the number ‘five’ is arbitrary: the idea is to keep asking why until you are confident that you understand enough about the problem and its causes. In Ohno’s example, you could continue (why was there no strainer attached?) and maybe go on to find defects in the maintenance and inspection procedures.
[2] Note the value of keeping all project documentation for the lifetime of the product: this request is from 1999, from a long-defunct issue-tracking system at a company that no longer exists, but without it, we wouldn’t have known why the code was written the way it was.