Five whys

Toyota’s ‘five whys’ process is well known to engineers: it’s one of the basic problem-solving tools in modern engineering management systems like Six Sigma and Total Quality Management. Here’s how Taiichi Ohno introduces the process in Toyota Production System: Beyond Large-Scale Production (1978):

When confronted with a problem, have you ever stopped and asked why five times?[1] It is difficult to do even though it sounds easy. For example, suppose a machine stopped functioning:

  1. Why did the machine stop?
    There was an overload and the fuse blew.
  2. Why was there an overload?
    The bearing was not sufficiently lubricated.
  3. Why was it not lubricated sufficiently?
    The lubrication pump was not pumping sufficiently.
  4. Why was it not pumping sufficiently?
    The shaft of the pump was worn and rattling.
  5. Why was the shaft worn out?
    There was no strainer attached and metal scrap got in.

Repeating why five times, like this, can help uncover the root problem and correct it. If this procedure were not carried through, one might simply replace the fuse or the pump shaft. In that case, the problem would recur within a few months.

To tell the truth, the Toyota production system has been built on the practice and evolution of this scientific approach. By asking why five times and answering it each time, we can get to the real cause of the problem, which is often hidden behind more obvious symptoms.

So it’s an important technique if you have a defect. But in software, it’s an important technique even if you don’t have a defect. It’s easy to change the behaviour of a program (that’s what makes software ‘soft’), and so although part of one’s time as a programmer is spent fixing broken programs, much more of it is spent changing working programs. This is a risky activity: if a program is working and you change it then there’s a chance that you’ll break it because you didn’t understand all the implications of your change. This is where the ‘five whys’ technique comes in. Abstractly, the process proceeds along these lines:

  1. Why does the program have this behaviour?
    Because the code implements it that way.
  2. Why is the code implemented like that?
    Because the design specifies it.
  3. Why does the design specify that feature?
    Because it's the best way to meet a requirement.
  4. Why does the program need to meet that requirement?
    Because a user needs it.
  5. Why does the user need it?
    To meet their goal.

Let’s take a concrete example from the Memory Pool System. Suppose you are looking at finalization and wondering whether it would make sense to run an object’s finalizer promptly: that is, as soon as it is established that there are no strong references keeping the object alive. But before you start changing things, you should ask:

  1. Why doesn’t the MPS finalize objects promptly?
    Because it puts finalizable objects on a message queue. The internal reference from the message keeps the object alive until the client program discards the message.
  2. Why does the MPS put finalizable objects on a queue?
    Because the design says that finalizer code must run synchronously with respect to the client program, and prompt finalization would run inside the collector, that is, asynchronously.
  3. Why does the design specify that?
    Because asynchronous finalizers are hard to write reliably: if the finalizer needs to access other objects than the one being finalized then there are race conditions accessing those objects, but if the finalizer takes a lock then there is a risk of deadlock. See Hans-J. Boehm (2002), “Destructors, Finalizers, and Synchronization”.
  4. Why do finalizers need to be easy to write?
    Because otherwise the development cost of integrating with the MPS would be too high.
  5. Why does the cost of integrating with the MPS need to be low?
    Because otherwise users will go elsewhere to meet their memory management needs.

If you’re going to carry out this technique on a working system, it’s handy if the system is traceable—that is, you can trace the links from the code to its design, and from the design to the requirements, and from the requirements to the users. If these links are too hard to find, then you’ll probably give up on trying to figure out the real reason for the behaviour, and just use your skill and judgement. So let’s look at an example where the system failed to be traceable and we initially made a poor decision.

When we were porting the MPS to OS X, we chose to use the native Mach threads interface instead of the POSIX threads emulation layer. We chose Mach because you can easily suspend and resume a target thread by calling the thread_suspend() and thread_resume() functions, whereas in POSIX, suspending a thread is a complex operation, requiring co-operation from the target thread. The implementation we use is for the target thread to install a handler for signal S1, and in the handler call sigsuspend() to wait for signal S2. Then you can send S1 to suspend it and S2 to resume. This requires all threads to be registered with the MPS by calling mps_thread_reg(), so that (among other reasons) on systems using POSIX threads the handler can be installed.
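
The signal-based scheme is easier to follow in code. What follows is only a minimal sketch of the idea, not the MPS implementation: mapping S1 and S2 onto SIGUSR1 and SIGUSR2, the names install_handlers() and suspend_demo(), and the single global flag (which limits the sketch to one target thread) are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

/* Assumption: S1 and S2 are mapped onto SIGUSR1 and SIGUSR2. */
#define SIG_SUSPEND SIGUSR1
#define SIG_RESUME  SIGUSR2

static atomic_int parked;   /* 1 while the target sits in on_suspend() */
static atomic_long ticks;   /* work counter bumped by the demo thread */
static atomic_int stop;     /* tells the demo thread to exit */

/* S2 only has to interrupt sigsuspend(), so its handler is empty. */
static void on_resume(int sig) { (void)sig; }

/* S1 handler: runs on the target thread and parks it until S2 arrives. */
static void on_suspend(int sig)
{
    sigset_t mask;
    (void)sig;
    sigfillset(&mask);
    sigdelset(&mask, SIG_RESUME);   /* let S2, and only S2, through */
    atomic_store(&parked, 1);
    sigsuspend(&mask);              /* returns after on_resume() has run */
    atomic_store(&parked, 0);
}

/* Install both handlers; signal dispositions are process-wide. */
static int install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_resume;
    if (sigaction(SIG_RESUME, &sa, NULL) != 0) return -1;
    /* Hold S2 back while on_suspend() runs, so that it cannot be
       delivered and lost before sigsuspend() starts waiting. */
    sigaddset(&sa.sa_mask, SIG_RESUME);
    sa.sa_handler = on_suspend;
    return sigaction(SIG_SUSPEND, &sa, NULL);
}

/* Demo target thread: counts until told to stop. */
static void *spin(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) atomic_fetch_add(&ticks, 1);
    return NULL;
}

/* Returns 1 if the counter stood still while the thread was parked. */
static int suspend_demo(void)
{
    struct timespec pause = { 0, 20 * 1000 * 1000 };   /* 20 ms */
    pthread_t t;
    long before, after;
    int rc;

    atomic_store(&stop, 0);
    rc = install_handlers();
    assert(rc == 0);
    rc = pthread_create(&t, NULL, spin, NULL);
    assert(rc == 0);

    pthread_kill(t, SIG_SUSPEND);                  /* send S1 */
    while (!atomic_load(&parked)) sched_yield();   /* wait until parked */
    before = atomic_load(&ticks);
    nanosleep(&pause, NULL);
    after = atomic_load(&ticks);

    pthread_kill(t, SIG_RESUME);                   /* send S2 */
    while (atomic_load(&parked)) sched_yield();    /* wait until running */

    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    return before == after;
}
```

Calling suspend_demo() from a program built with the platform's threads option (for example cc -pthread) should return 1. The point to notice is that the target thread suspends itself, inside its own signal handler, which is why the handler has to be installed beforehand.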

All looked good with the Mach threads implementation, except that one of the test cases was failing. Let’s apply the ‘five whys’ technique:

  1. Why is the test case failing?
    Because mps_thread_reg is being called twice on the same thread.
  2. Why is the test case registering the thread twice?
    It does so “just to make sure it works”. Here’s the code:
    /* register the thread twice, just to make sure it works */
    die(mps_thread_reg(&thread, (mps_arena_t)arena), "thread_reg");
    die(mps_thread_reg(&thread2, (mps_arena_t)arena), "thread2_reg");
    
  3. Why does this test need to register the thread twice “just to make sure it works”?

This one was tricky. As far as we knew, thread registration is reliable and doesn’t need to be done twice. We looked at the history for clues: the comment had been there since the first version of the test was written in February 2000. The commit message said:

new unit 
change.dylan.kinglet.160270
MT version of amcss
PosixThreads only

We could have asked the committer, but after so many years it seemed unlikely that he would remember why he had written the test case that way. The second line of the commit names a change (a related group of tasks) for the DylanWorks product release codenamed kinglet. We looked up the request that generated the change, and it said:

                         request.dylan.160270

MMDW is not available for Linux

Dylan want a new platform: Linux (RedHat 5.2), Intel, GCC version 2.7.2.3
Needs a real implementation of protection, virtual memory arena, threads,
stack scanner

There was nothing here explaining the mysterious test, so we were faced with a choice between implementing multiple registration on OS X (but not knowing why we were doing it), or changing the test case to remove the multiple registration test (but not knowing why it was there in the first place, and so possibly failing to detect some bug). The former was tricky, so we implemented the latter.

However, I wasn’t completely satisfied. It had clearly been quite a lot of work, back in 2000, to support threads being registered multiple times on POSIX threads platforms, because during thread suspension it was necessary to deduplicate the list of registered threads to avoid signalling them twice. No-one would have gone to such effort unless there had been a very good reason for it. So from time to time I had a poke around the project archives in search of relevant facts. And finally I dug up a nugget of gold:[2]

                         request.dylan.160252

Stack scanning is broken with multiple threads calling in to Dylan

A C program could create a number of threads, some of which call into
Dylan. When a thread (other than the ‘main thread’, whose stack is
registered when the Dylan DLL is loaded) calls into Dylan for the
first time, Dylan registers it and creates a stack-scanning root for
it.

[...]

Note that this problem must be solved in MPS, and cannot be fixed by
Dylan arranging to deregister the thread when returning control from
Dylan code back to C. The reason is that control might be transferred
by a mechanism other than a return (e.g. a longjmp).

So now we can answer the ‘whys’:

  1. Why is the test case failing?
    Because mps_thread_reg is being called twice on the same thread.
  2. Why is the test case registering the thread twice?
    Because it must be possible to register a thread multiple times with the MPS, and the test case checks this.
  3. Why does it need to be possible to register a thread multiple times?
    Because the language runtime may be embedded in a program that calls into it from multiple threads.
  4. Why can’t the language runtime deregister the thread before returning to the caller?
    Because control might be transferred to the caller by a nonlocal mechanism such as an exception or longjmp(). The situation in which this might happen is one where the foreign code calls into the MPS-using language runtime, which calls back out into foreign code, which raises an exception.
  5. Why does the MPS need to support this feature?
    Because it’s needed by DylanWorks (now Open Dylan).
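
The fourth answer is easy to reproduce in miniature. In the sketch below, the hypothetical runtime_entry() stands in for a language runtime's entry point that registers the calling thread on the way in and intends to deregister it on the way out. The registrations counter is a stand-in for the MPS's bookkeeping, not a real call to mps_thread_reg(), and all the names are invented for the example.

```c
#include <assert.h>
#include <setjmp.h>

static int registrations;       /* stand-in for the MPS's view of this thread */
static jmp_buf foreign_frame;   /* resume point owned by the foreign caller */

/* Foreign callback that returns normally. */
static void returning_callback(void) { }

/* Foreign callback that leaves by a nonlocal jump instead. */
static void jumping_callback(void) { longjmp(foreign_frame, 1); }

/* Hypothetical runtime entry point: register, run, deregister. */
static void runtime_entry(void (*callback)(void))
{
    ++registrations;   /* register the thread on entry */
    callback();        /* call back out into foreign code */
    --registrations;   /* deregister: never reached if callback jumps */
}

/* Foreign code calls into the runtime `calls` times on one thread;
   returns the number of registrations left behind. */
static int call_into_runtime(int calls, void (*callback)(void))
{
    volatile int i;    /* volatile: must survive the longjmp */
    assert(calls >= 0);
    registrations = 0;
    for (i = 0; i < calls; ++i) {
        if (setjmp(foreign_frame) == 0)
            runtime_entry(callback);
        /* otherwise control arrived here via longjmp */
    }
    return registrations;
}
```

With returning_callback, call_into_runtime(3, ...) leaves no registrations behind. With jumping_callback it leaves three, all for the same thread, which is the situation the MPS is required to tolerate.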

I documented the requirement in the thread manager design, so that future developers won’t have to go through the same archive search. And now I suppose that I’m going to have to implement multiple thread registration on OS X.
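
Whatever form that takes, it has to cope with the same thread appearing more than once among the registrations, as the POSIX threads implementation did by deduplicating before signalling. Here is a minimal sketch of that step, assuming the registered threads are available as a plain array of pthread_t. The helper unique_threads() is hypothetical, and the data structures the MPS really uses are not shown here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: compact `threads` in place so that each thread
   appears once, keeping first occurrences in order, and return the new
   count. Quadratic, which is fine for a handful of threads. */
static size_t unique_threads(pthread_t *threads, size_t n)
{
    size_t kept = 0;
    assert(threads != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        int seen = 0;
        for (size_t j = 0; j < kept; ++j) {
            /* pthread_t is opaque, so compare with pthread_equal() */
            if (pthread_equal(threads[i], threads[j])) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            threads[kept++] = threads[i];
    }
    return kept;
}
```

A suspension pass would then signal only the first unique_threads(list, n) entries, so a thread registered twice is still signalled once.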

This requirement, now discovered, allows us to answer other questions about the proper behaviour of the MPS thread manager. For example, does the MPS need to cope with dead threads, or should it rely on the client program to deregister them? The requirement to be able to support embedding of MPS-using code (such as a language runtime) in a program that does not co-operate with the MPS means that a program cannot reliably deregister threads before they die, and so the MPS has to make a best effort to handle dead threads.


  1.  Obviously the number ‘five’ is arbitrary: the idea is to keep asking why until you are confident that you understand enough about the problem and its causes. In Ohno’s example, you could continue (why was there no strainer attached?) and maybe go on to find defects in the maintenance and inspection procedures.

  2.  Note the value of keeping all project documentation for the lifetime of the product: this request is from 1999, from a long-defunct issue-tracking system at a company that no longer exists, but without it, we wouldn’t have known why the code was written the way it was.