The last lousy bug

,

Inspired by David Jones’ “My Favourite Bugs”, here’s one of my own war stories.

Console video games are demanding pieces of software to write: in addition to being entertaining, they must be playable by six-year-olds and self-explanatory to adults. They must be able to gracefully handle corrupted save files. They must never run out of memory, be unresponsive to user input, or take longer than 16 ms to draw the next frame. Oh, and you typically only get one chance to get it right: when the game has to go out to manufacture on cartridge or DVD there’s no chance to fix things with a version 1.1.

But there are compensations. The ratio of testers to programmers is high, and nothing gives you confidence in the quality of your work as much as knowing that thirty people have been trying to make it break for the last week and they have only found one way to make it crash.

In the early days of testing, the bugs pile up rapidly, and it’s a scramble to keep the number of open reports down to a manageable level, but enough of the bugs are easy to fix that you can do it. The most exhilarating experience is when you’re working with testers who are in a time zone halfway around the world. You arrive at work in the morning to find a pile of bugs has arrived overnight, you work hard all day, and ship out a new release in the early evening, just in time for the testers to start verifying your fixes and finding new bugs.

After this has been going on for a few weeks, the low-hanging fruit have been plucked, and a handful of serious bugs have bubbled to the surface. These are the bugs that you haven’t been able to reproduce: the intermittent errors and mysterious crashes. And then eventually you come down to the last lousy bug, the one that you’ve been putting off in the hope that it would go away by itself, and you have to roll up your sleeves and get your hands dirty.

Bug 142 was especially mysterious because it happened after the program quit running. On this particular console, returning from the game to main menu was implemented by rebooting: a small region of memory survives the reboot and contains instructions for the boot loader telling it which application to run next. (This might sound strange, but it makes a lot of sense from a robustness point of view: applications aren’t required to co-exist, can’t cause each other to crash, and always start from a known machine configuration.) So when the player requests a quit, the game saves its state, and calls the Reboot function, which writes the necessary instructions to the protected region of memory, and then reboots the console. Except that our testers were reporting that on rare occasions our game was quitting but then crashing in the boot loader (which reported that it had been passed a bogus argument).

I always try to bear in mind Richard Stallman’s advice from Debugging with GDB:

Often people who encounter a bug spend a lot of time investigating which changes to the input file will make the bug go away and which changes will not affect it. [That is, trying to find the “envelope” of the bug.—GDR]

This is often time consuming and not very useful, because the way we will find the bug is by running a single example under the debugger with breakpoints, not by pure deduction from a series of examples. We recommend that you save your time for something else.

The trouble was that I couldn’t get the bug to manifest itself in the debugger. The testers were running the game on consumer hardware, so there was no way to attach a debugger when they provoked the bug. And whenever I stepped through the relevant sequence of events the correct arguments were always passed to the boot loader. So all I had to go on was the envelope of the bug, and the only salient fact about the envelope was that the bug only manifested itself in the RELEASE build of the game (with all debugging information and assertions stripped); never in the DEBUG build. (Which should have suggested that the problem might have something to do with memory layout, though I didn’t realize this until later.) I checked the documentation for the shutdown sequence and everything looked OK. We faded the screen to black, waited for the graphics pipeline to empty, faded out the music, waited for any pending save operation to finish, shut down any libraries that required it, and finally called Reboot.

So all I could do was to prepare a special RELEASE build with extra diagnostic support, and work on the other bugs, hoping that eventually something would turn up. But the pile of outstanding bugs grew smaller and smaller, and eventually it was clear that bug 142 would be the last lousy bug in the game.

The testers kept at it, using the special diagnostic build, and eventually one of them struck gold by getting our game to crash before it quit (rather than after sending bogus arguments to the boot loader), which caused our fatal error handler to run, which drew the stack trace you can see in the screenshot at the right. Knowing what I now know about the cause of the bug, the window of opportunity for this crash was very small, and I dread to think about how many times they had to provoke the bug before they got this result. Testers, I salute your dedication!

On the PowerPC, a DSI (Data Storage Interrupt) exception “occurs when a data memory access cannot be performed”—it’s the PowerPC equivalent of a segmentation fault or memory protection failure. Consulting the application map, it was possible to decode the stack trace as follows (everything from main up to menuTick is our code, and then Reboot and above are library code):

PCFunction
800E9D5Cstrlen
800C8AF0PrepareBootArgs
800C8DE8Reboot
80097B54menuTick
8009FD90mainTick
800A0518main
800041A4_init

We didn’t have the source code for the library functions, but we did have a debugger, so armed with the stack trace we spent a morning stepping through the sequence of events, reading the assembler, and reverse-engineering the library. It became clear that Reboot carried out the following steps:

  1. Allocate a 4 KiB buffer starting at 0x81280000.
  2. Use this buffer to assemble the command-line arguments for the boot loader.
  3. Copy these command-line arguments to the protected area of memory that will survive the reboot. (This is where strlen gets called, and where the DSI exception sometimes occurred.)
  4. Reboot.

The writers of Reboot didn’t worry about whether the memory at 0x81280000 would be in use by the game, because the function never returns. And if it never returns, it can’t possibly conflict with other uses of the memory. Or can it? We looked at our memory map, and in the RELEASE build 0x81280000 was right in the middle of one of our external frame buffers.

(For readers unfamiliar with video game programming: an external frame buffer is a region of memory which contains a single video frame in a format suitable for sending to the display. Typically you have two external frame buffers: one is being displayed and the other is receiving (or waiting to receive) a copy of the next frame from the GPU. The GPU has direct access to main memory, so the copy happens asynchronously with respect to the CPU. When the next frame has been copied, the game waits for a safe moment (known as a ‘retrace’ or ‘vsync’ after the vertical retrace or synchronization on analog televisions) and then swaps the roles of the two external frame buffers, so that the second is now displayed and the first is now waiting for a copy of the next frame.)

So even though we had programmed a wait for the graphics pipeline to empty before shutting down the game, there might still be a frame being worked on by the GPU, and if the timing were just right, then it would copy its last frame to the external frame buffer while Reboot was assembling its command line in that same region of memory, corrupting it. (This would also explain why the crash only happened in the RELEASE build: in the DEBUG build the memory layout was different and the external framebuffers did not overlap the critical region.)

It was easy to test this theory in the debugger: by putting a breakpoint in different places in Reboot we could arrange for the copied framebuffer to cause both of the observed effects (the crash in strlen and the corrupted arguments to the boot loader), and several other kinds of crash and corruption.

So after all that, it was completely trivial to fix the bug. All we had to do was wait for the GPU to finish copying out the last frame (not just for the pipeline to clear) before calling Reboot.