The ping that wouldn’t die

It was Friday afternoon at the customer site[1] and I had completed my tasks for the week. I had implemented the new feature and its test cases, written a design justifying the implementation, and drafted a section of the user manual explaining how to use it, so now I was looking for something else to do. In any large program, there are always things that need fixing, so I poked about in the depths of the issue tracker to see if there were any interesting problems that could be appropriated from the other developers.[2] The unluckily numbered and long-unsolved issue 13 caught my eye:

Issue 13: Can't interrupt a command with control-C. I couldn’t remember the address of our web site, so I typed ping www at the application’s command line. Then I tried to stop the ping by typing control-C but nothing happened.

A well-written issue is always pleasing to read. The problem was clearly stated, the steps to reproduce were given, the expected behaviour was described, and the actual behaviour contrasted, all in a couple of sentences.[3] I started up the application, opened its command line, and found that the problem was 100% reproducible. So this ought to be easy, right?[4]

The first thing to do was to fully understand the architecture of the system. I knew that it looked something like this, but what was in the cloud?

The UI sends input to (and receives output from) an unknown group of processes, which eventually sends input to (and receives output from) the ping process.

In this diagram, the box labelled ‘UI’ represents the application’s user interface, which implemented a terminal interface containing the buggy command line. This was a component of the system that I had not worked on; all I knew was that it was written in Java, and I’m not much of a Java programmer. But how hard can it be? I fired up Eclipse and stepped through the sequence of events that resulted from typing control-C at the command line. It became apparent that the application used java.lang.ProcessBuilder to launch a subprocess, and when the user typed control-C, a byte with value 3 was written to the standard input of that process.

The subprocess in question was a custom program named xpty. As its name implied, it implemented a pseudo-terminal (presumably to work around Java’s lack of pseudo-terminal support). I found the source code, and it was just about the simplest possible terminal implementation you could write. It started by calling forkpty; then the child exec’d bash -i, and the parent went into a loop using select, copying standard input to the pty’s input and the pty’s output to standard output. So the architecture was as follows:

The UI sends input to (and receives output from) xpty, which sends input to (and receives output from) bash, which sends input to (and receives output from) ping.

and it was easy to verify this using ps:

$ ps xf -o pid,pgid,comm | grep -A3 java
 10063 10063      \_ java
 40067 40067          \_ xpty
 40068 40068              \_ bash
 40082 40068                  \_ ping

So, how far along this chain did the control-C get? I attached to the xpty process in GDB, put a breakpoint on the copy function, and then typed control-C into the command line in the user interface. Sure enough:

$ sudo gdb
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
...
(gdb) attach 40067
Attaching to process 40067
Reading symbols from /home/gdr/repo/xpty/xpty...done.
Reading symbols from /lib/x86_64-linux-gnu/libutil.so.1...
...
0x00007fc132779803 in __select_nocancel ()
    at ../sysdeps/unix/syscall-template.S:81
81	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) b xpty.c:25
Breakpoint 1 at 0x400955: file xpty.c, line 25.
(gdb) c
Continuing.

Breakpoint 1, copy (in_fd=0, out_fd=3) at xpty.c:25
25	        ret = write(out_fd, buf, bytes);
(gdb) p buf[0]
$1 = 3 '\003'

So the control-C was written to file descriptor 3 (the master side of the pseudo-terminal), where it should have caused the kernel to send SIGINT to the terminal’s foreground process group (in this case, process group 40068, consisting of the bash and ping processes). Did that happen? The most reliable way to check was to use strace. Here’s a trace of the bash process showing the signal being received:

$ sudo strace -p 40068
Process 40068 attached
wait4(-1, 0x7fffd7d21bb8, WSTOPPED|WCONTINUED, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER} ---
rt_sigreturn()                          = -1 EINTR (Interrupted system call)
wait4(-1, 

In fact I had already established that SIGINT was being delivered to Bash. While experimenting with the behaviour of the command line, I had typed:

$ while read X; do echo $X; done
hello
hello
world
world
^C
$

It seemed that as long as the foreground process group consisted only of shell builtins, it received the SIGINT and was interrupted. So the bug only showed up if the foreground process group contained other processes like ping. And that was all I had time for.

A few days later I found myself with a spare half hour, and back I went to the command line that wouldn’t be interrupted. I had shown that the control-C was being delivered to the pseudo-terminal, and that this resulted in a SIGINT being sent to Bash. So the next question was, was the SIGINT being sent to the ping process too?

It was a bit inconvenient to use ping as the test case because it spammed the strace output. But the bug applied to any non-builtin process, not just ping, so this time I set up the test case using cat instead:

$ ps xfww -o pid,pgid,comm | grep -A3 java
 17904 17904      \_ java
 18227 18227          \_ xpty
 18228 18228              \_ bash
 18240 18228                  \_ cat

I ran strace on the cat process, and typed control-C:

$ sudo strace -p 18240
Process 18240 attached
read(0, 0x206b000, 65536)               = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGINT {si_signo=SIGINT, si_code=SI_USER} ---
read(0, 

So the signal was delivered but nothing happened. What on Earth? Surely cat doesn’t ignore SIGINT? And if it did, then this would show up as a sigaction record in the output of strace cat, and it didn’t. So I was stumped.

I was busy for the rest of the week, but eventually the release was tested and shipped, and so I had another spare moment. I was sure I was getting close, but something was eluding me. Maybe the problem was that I didn’t know enough about the details of signal handling in Bash, so I looked up the Signals chapter of the Bash manual. And here it says:

Non-builtin commands started by Bash have signal handlers set to the values inherited by the shell from its parent.

I had forgotten, or maybe I had only ever dimly known, that signal dispositions are inherited from the parent process, and that an ignored signal stays ignored even across exec. As it says in the POSIX exec specification:

Signals set to the default action (SIG_DFL) in the calling process image shall be set to the default action in the new process image. Except for SIGCHLD, signals set to be ignored (SIG_IGN) by the calling process image shall be set to be ignored by the new process image. Signals set to be caught by the calling process image shall be set to the default action in the new process image.

So the Java runtime for the application’s user interface had set SIGINT to be ignored, and this was inherited by the xpty process, and so by bash, and finally by cat or ping or whatever was run in the shell. I opened up xpty.c and typed:

/* Restore default behaviour of SIGINT. See issue #13. */
signal(SIGINT, SIG_DFL);

  1.  This is based on a true story, but many details have been changed.

  2.  It occurs to me, writing this up now, that this behaviour might be seen by some people as a bit rude. The way I see it, there are always many more issues to work on than there is developer time available, so if I use my spare time to fix an issue assigned to you, then you’ll have time to work on some other issue instead. (But if you disagree, please comment.)

  3.  Note the serendipity involved in this report: if the tester had remembered the host or nslookup commands, he wouldn’t have needed to kill the command with control-C, and so the issue would not have been discovered.

  4.  There’s discussion of this article at Reddit.