I have been working on a project recently where there has been a lot of tricky debugging to do. The code I’m working on has to live in the same process as third-party code that is not under my control, and which calls into my code unpredictably, often from multiple threads or signal handlers. This is an application environment that’s particularly prone to rare crashes and deadlocks that somehow elude my own thorough test procedures but crop up nonetheless on the machines of potential customers during product evaluations.[1] I used to find these kinds of bugs intimidating and nerve-wracking, but (at least on this project and for the moment) I seem to be a match for them.
The process of debugging is (or ought to be) one of calmly, methodically and steadily gathering data that narrows the location of the problem until it can no longer hide and is forced to reveal itself. In my current project the first step is to acquire the log and (in the case of a crash) the backtrace from the customer and analyze them for the sequence of events leading up to the crash or deadlock. This identifies important features of the environment (which programs were running? are they multi-threaded? do they handle signals?) and usually pinpoints a suspect area of code which can be inspected for race conditions and deadlocks. Inspecting the code usually leads to a hypothesis or two about the cause; but even if it doesn’t, I develop a test case that thoroughly exercises the problem area. With some experimentation and a lot of thought I’ve so far always found it possible to reproduce the problem under controlled conditions, whereupon it can be analyzed in the debugger.
Writing this is of course a form of hubris: no doubt I will return to work on Monday to find in my inbox a customer report describing a bug that I lack the skill to track down and fix. This hasn’t yet happened in my career, though there have been a few occasions where I have been in doubt as to the outcome, one of which I described in my post ‘The last lousy bug’. Programmers like to call these accounts “war stories”—joking, of course, but joking seriously. It is stressful to know that something is wrong in your code, but not to know what it is or how to find it, and to have it hanging over your head day after day as you fail to make progress towards solving it.
This nightmare scenario is the driver behind Ellen Ullman’s 2003 novel The Bug. Set in Silicon Valley in the mid-1980s, the novel follows two employees of database startup company ‘Telligentsia’: Ethan Levin, failed academic turned hotshot programmer, and Roberta Walton, English graduate turned software tester. Roberta is the tester who finds bug UI-1017: a lockup in the user interface when a dropdown menu is displaying and the mouse has just been moved outside the menu. Ethan is the programmer to whom the bug is assigned. It ought to be a simple fix, well within his capabilities, but it isn’t. He can’t reproduce it and can’t find it by inspection or analysis. All he can do is scrawl “CANNOT REPRODUCE”[2] on the bug report and hope it never comes back. But the bug does come back, and at the most inconvenient moments, when Telligentsia’s salespeople are demonstrating the software to potential customers. The venture capitalists funding Telligentsia are getting nervous, and when they are nervous they respond by putting pressure on managers, and managers under pressure pass that on to the programmers. And so week after week, as Ethan fails to make progress on UI-1017, the looks of disapproval at the status meeting become darker. The bug takes over Ethan’s life: he works late into the night, stepping through the code again and again in a futile attempt to catch the bug in the act, not noticing that in the grip of his obsession with the code he is risking everything of importance: his marriage, his mental health, and his life.
It’s a plot of Aristotelian simplicity: the protagonist with a character flaw who suffers a reversal of fortune and eventually comes to a devastating realisation of what he has done.[3] But it’s also psychologically insightful about what it’s like to work as a programmer, and, allowing for a little simplification for a general audience and a little exaggeration for dramatic effect, convincing in its portrayal of aspects of the software industry and the debugging process.
Ullman captures the way in which concentration on a programming task can utterly take over one’s life. Programmers call this experience ‘flow’[4] and aspire to it because in the right circumstances it leads to high productivity and high quality of work. But the downside of flow is the loss of perspective: to let the task take over your brain you have to block out other thoughts and stimuli, but this includes thinking about whether the task is valuable. This is dramatised in The Bug by a scene in which Ethan brings a parachute into work to drape around his desk in an attempt to exclude all distractions. But we can see that really he is desperate to avoid thinking about his failed marriage, and reassessing his life. Eventually Ethan comes to the same realization, but for him it is far, far too late.
The novel also captures the way in which job performance can become too closely tied to our sense of self-worth, and the way in which employers use this to psychologically bully and exploit employees. By setting impossible tasks and brutal deadlines an unscrupulous employer can gradually destroy an employee’s confidence and self-worth: exactly the qualities that the employee would need to realise what was going on and to challenge the bullying or walk away from it.
There’s an indictment of the corrosive corporate culture at Telligentsia. Everyone is struggling to be one-up against their colleagues: Ethan arrogantly snubs his colleagues when he is on schedule and they are not; then when the tables are turned and the “days opened” field on UI-1017 rolls over into three digits, his colleagues take the opportunity to put in a few kicks in return. No-one is supportive or collegiate; the project planning seems to be of the ‘death march’ variety.[5] Management seem to have no idea how to manage: when things are going wrong all they know how to do is bully and threaten their subordinates. In the situation described in the novel a competent manager would have rotated all the programmers to work on UI-1017 instead of sticking Ethan with it week after week; and if that failed, brought in expertise, from external consultants if necessary, to assist the stuck team by proposing new debugging strategies.
Something that isn’t emphasized in the novel until close to the end—because it’s not clear to the protagonists—but which is clear to any programmer reading it, is that Ethan really isn’t very good at debugging. Working late, night after night, stepping through the code in the debugger, may be a good way of shutting out thoughts, but it’s not a good way of finding a bug. When debugging, you make progress by gathering information: time spent repeating a task that’s failed many times before is just wasted. The circumstances in which UI-1017 tends to be produced (that is, when the program is operated by testers, trainers, and salespeople) mean that it’s hard to get good debugging information out of the lockup. A running theme is that no-one who encounters the bug ever remembers to make a core dump: they just panic and reboot and all clues are lost. So Ethan needs to instrument the program so that it is more likely that the next time the bug is produced, information about it will be preserved. There should be a monitor process that watches for a UI lockup and sends a SIGQUIT. There should be a disk log of all the menu and mouse events (probably needing to be overwritten circularly, given the disk capacities of the mid-1980s) so that the sequence of events leading up to the lockup will be captured. And there should be some kind of automated stress testing tool delivering random mouse motions and menu clicks to the program. Any of these diagnostic approaches would surely have caught the bug. If Ethan didn’t know to do these things, or arrogantly thought that he didn’t need to because he could solve it by his own efforts, then it was the responsibility of management to set him right, or if they didn’t know themselves, then to bring in people who did. A few days of a consultant’s time could have saved months of wasted effort. But would it have saved Ethan? 
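To make two of those suggestions concrete, here is a minimal sketch in Python (obviously not what a mid-1980s team would have used) of a circularly overwritten disk log of UI events and a watchdog check that sends SIGQUIT when the UI stops updating a heartbeat file. The record layout, the heartbeat-file convention, and every name in it (`CircularEventLog`, `read_events`, `check_heartbeat`) are my own assumptions for illustration; nothing like this is described in the novel.

```python
import os
import signal
import struct
import time

# Hypothetical fixed-size record: sequence number, timestamp, event kind, x, y.
RECORD = struct.Struct("<QdBhh")


class CircularEventLog:
    """Fixed-size on-disk log; the newest event overwrites the oldest slot."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        self.seq = 0
        self.f = open(path, "w+b")
        # Preallocate every slot; a sequence number of 0 marks an unused slot.
        self.f.write(b"\0" * (RECORD.size * capacity))
        self.f.flush()

    def append(self, kind, x, y, now=None):
        self.seq += 1
        slot = (self.seq - 1) % self.capacity
        stamp = time.time() if now is None else now
        self.f.seek(slot * RECORD.size)
        self.f.write(RECORD.pack(self.seq, stamp, kind, x, y))
        # Push the record to disk so it survives a lockup followed by a reboot.
        self.f.flush()
        os.fsync(self.f.fileno())


def read_events(path):
    """Return the surviving records as tuples, oldest first."""
    with open(path, "rb") as f:
        data = f.read()
    records = [
        RECORD.unpack_from(data, offset)
        for offset in range(0, len(data) - RECORD.size + 1, RECORD.size)
    ]
    # Drop unused slots, then sort by sequence number (the first field).
    return sorted(r for r in records if r[0] != 0)


def check_heartbeat(pid, heartbeat_path, timeout, now=None, kill=os.kill):
    """Send SIGQUIT to pid if the heartbeat file is older than timeout seconds.

    Assumes the UI touches heartbeat_path on each pass through its event loop.
    signal.SIGQUIT exists only on Unix-like systems. Returns True if a signal
    was sent. The kill parameter exists so the check can be tested without
    signalling a real process.
    """
    now = time.time() if now is None else now
    if now - os.path.getmtime(heartbeat_path) <= timeout:
        return False
    kill(pid, signal.SIGQUIT)
    return True
```

In use, the UI's event loop would call `append` for each menu and mouse event and touch the heartbeat file, while a separate monitor process would call `check_heartbeat` every few seconds. On Unix the default action for SIGQUIT is to terminate the process with a core dump (provided core dumps are enabled), so the dump and the last few hundred events would both be waiting on disk, even if the tester's first instinct is to reboot.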
If the tattered remnants of his fragile ego depended on being better at programming than everyone around him, then could he have survived being shown up? The software industry is so huge that no-one can be an expert on everything, so logically speaking it shouldn’t be shameful to admit this, but nevertheless the last straw for Ethan is when Roberta finally does manage to acquire the core dump he’s been asking for:
Blind, stupid, incompetent—that’s what he was. He had the dump and now he was expected to fix the bug, but how could he tell anyone that it was useless to him? Machine guts, exposed, but it was no good, no help. He knew this would happen someday: twelve years of faking it, and one day, an illiterate pretending to read, he’d get caught. […] the idea of going to Thorne with his core dump, somehow getting what he needed without getting caught out as a functional illiterate—impossible.
You can see from the review how tempting it was for me to analyze the situation in the novel from the point of view of a software consultant! But that’s a tribute to Ullman’s convincing portrait of the bad side of the software development industry and what it can do to people who spend their working lives there.[6]
[1] This is not as spooky as it might seem: each customer has a unique combination of environment and third-party code, so a new customer is particularly likely to present our software with an environment that’s outside the envelope that’s well tested by other customers.
[2] A phrase with an ironic double meaning in the novel.
[3] In Aristotle’s terminology, hamartia, peripeteia, and anagnorisis respectively.
[4] A term coined by psychologist Mihály Csíkszentmihályi: see for example his TED talk on the subject.
[5] Wikipedia: “The knowledge of the doomed nature of the project weighs heavily on the psyche of its participants, as if they are helplessly watching themselves and their coworkers being forced to torture themselves and march toward death.”
[6] Abigail Nussbaum reviewed the novel and complained about the portrait of the industry: “the more she stresses the nastiness and resentment that exist between Ethan and his colleagues, the total lack of support, or anything resembling congeniality, the he experiences, the way he is allowed to go crazy right in front of his colleagues and no one notices or cares so long as he’s trying to track down the bug, the more it felt to me is if rather than capturing my industry, Ullman was grossly exaggerating some of the more extreme stereotypes about it.” But I think it may just be Nussbaum’s good fortune not to have worked at a company like Telligentsia that’s in deep financial trouble. When money runs short, everyone is under stress and tempers flare. The people at the top have no idea how to turn things around, everyone knows this but no-one admits it, and it becomes a test of character to see who can preserve their humanity as the company and their job come crashing down around them.