Software inspection and the Heartbleed bug

Since 2005, whenever a train crashes in the UK, a professional body of investigators—the Rail Accident Investigation Branch—has been tasked with determining the cause of the incident and making recommendations to reduce the likelihood, or mitigate the severity, of similar events occurring in the future. There are similar branches tasked with investigating air and marine accidents.

There is nothing like this for computer security incidents. Millions of sysadmin-hours have been spent dealing with the fallout of the Heartbleed bug, but what are the lessons we should learn? In the absence of any kind of Software Failure Investigation Branch to collect evidence and make recommendations, we have to do our best to figure this out for ourselves.[1]

The first thing that struck me when I read the commit that introduced the bug was, this would never have passed inspection if it had been submitted to the Memory Pool System![2]

Software inspection is not always fully understood. It’s not just a matter of reviewing the code and looking for bugs (though that is valuable): the most significant benefits of inspection come from analyzing the causes of bugs, and improving processes and standards so that bugs are less likely to be introduced again. Finding bugs is difficult and expensive; writing code in a style that avoids introducing bugs in the first place, or makes bugs easy to detect, is much cheaper.

An important part of this feedback process is the production of rules that code has to obey. Rules speed up development and inspection because you don’t have to make a decision in each case. Decisions require analysis, and the analysis might be fragile. For example, faced with a decision about whether to check the result of malloc(), you might carry out an analysis like, “this program never allocates more than such-and-such, and on all the platforms we support there’s plenty of virtual memory and so malloc() never has trouble providing what we want, and so there’s no need to check the result.” But this analysis is fragile: in future the program may change so that it allocates more memory, or the program may be ported to an embedded platform where memory is limited. So you can avoid the cost of carrying out the analysis and the cost of possible future failures by checking the result. Perhaps the first time you do this you will grumble about having to figure out what to do in the exceptional case, but since you are a decent engineer, you’ll figure out a robust and re-usable approach, so by the tenth time you allocate some memory, adding the check is trivial.
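
For example, a re-usable approach might look something like the following sketch (the helper name and the fail-fast policy are illustrative, not prescribed):

    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate memory, checking the result (sketch). The policy here is to
     * fail fast; other policies (retrying, freeing caches, returning an
     * error code for the caller to handle) can be wrapped up just as
     * easily, and once a policy exists, adding the check at each
     * allocation site is trivial. */
    static void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "failed to allocate %zu bytes\n", size);
            abort();
        }
        return p;
    }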

Rules should also be used to guide the inspection, and transgressions identified and scrutinized. Sometimes there’s no way to meet your requirements without breaking a rule, but these cases need careful justification.

So let’s imagine we started the inspection process with only the following very generic rules in force (these are rules we use at Ravenbrook: they are based on rules from Tom Gilb and Dorothy Graham’s book Software Inspection). In these rules, “document” means “thing being inspected”—it’s usually source code, but also specifications, designs, user manuals, procedures, or anything else whose correctness is important enough to need inspecting.

rule.clear Documents must be clear to the intended readership.

rule.sources Documents shall refer to sources which justify them.

rule.consistent Documents must be consistent with each other and with themselves.

rule.assume Code in one module may not make assumptions about the inner workings of other modules.

Here’s the inspection report, based on applying these rules to every line of the code being inspected:

int
dtls1_process_heartbeat(SSL *s)
  1. [rule.clear] There’s no specification for this function. We can guess from the name that it sends a response to a heartbeat request, but what are the requirements on the input and what exactly is it required to do? Without a spec, how can anyone check that this code is correctly implementing its spec? This suggests a new rule:

    rule.spec Every function must have a specification to enable its implementation to be checked.

    Maybe this version of the rule will turn out to be too strict: perhaps some functions can be checked without a spec. But rules don’t have to be perfect to be valuable, so this will do for now and we can revise it later.
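
    As an illustration, the kind of specification that rule.spec asks for might look something like this (the wording is invented for this article, not taken from OpenSSL):

    /* dtls1_process_heartbeat -- process a received heartbeat record.
     *
     * s is an established SSL connection whose current record
     * (s->s3->rrec) is a heartbeat record as defined by RFC 6520.
     *
     * If the record is a well-formed heartbeat request, send a heartbeat
     * response echoing its payload. If it is a heartbeat response that
     * matches an outstanding request, clear the pending request.
     * Silently discard anything else.
     *
     * Return 0 on success, or a negative value if sending the response
     * fails. */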

    {
    unsigned char *p = &s->s3->rrec.data[0], *pl;
  2. [rule.assume] How does this code know that these dereferences are legitimate? How does it know that s, s->s3, and s->s3->rrec.data are all non-NULL? Even if these values are all non-NULL, how does the code know that the structures pointed to are correct? I would expect to see something like this:

    SSL3_STATE *s3;
    SSL3_RECORD *rrec;
    unsigned char *p;
    
    assert(ssl_check(s));
    s3 = s->s3;
    assert(ssl3_state_check(s3));
    rrec = &s3->rrec;
    assert(ssl3_record_check(rrec));
    p = rrec->data;
    

    to check that all the arguments are valid (given suitable definitions of the *_check() functions; a sketch of one such check follows below).

  3. [rule.assume] There should be a check on the record type:

    assert(rrec->type == TLS1_RT_HEARTBEAT);
    
  4. [rule.assume] There should be a check that rrec has actually been filled in by a received packet. I stared at the code for ssl3_get_record() but I couldn’t figure out what the right condition would be. The code sets s->rstate = SSL_ST_READ_HEADER once the record has been read, but this is also true in the initial state.
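
    Returning to the *_check() functions suggested in item 2, a minimal sketch of one of them might look like this (not OpenSSL code; it assumes OpenSSL’s headers for the SSL3_RECORD type and the SSL3_RT_MAX_PLAIN_LENGTH constant):

    /* Check only the fields of the record that this function relies on;
     * a real ssl3_record_check() would validate more of the structure. */
    static int ssl3_record_check(const SSL3_RECORD *rrec)
    {
        return rrec != NULL
            && rrec->data != NULL
            && rrec->length <= SSL3_RT_MAX_PLAIN_LENGTH;
    }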

    unsigned short hbtype;
    unsigned int payload;
  5. [rule.clear] The payload variable contains the size of a data structure in bytes, so it should have type size_t, both to make that fact clear to the reader and, in the general case, to ease porting to platforms where size_t is different from unsigned int.

    unsigned int padding = 16; /* Use minimum padding */
  6. [rule.clear, rule.sources] This would be easier to understand if there were a reference to §4 of RFC 6520 which explains, “The sender of a HeartbeatMessage MUST use a random padding of at least 16 bytes.”

    /* Read type and payload length first */
    hbtype = *p++;
  7. [rule.assume] How does the code know that this read from p is legitimate? There needs to be some check on rrec->length. This suggests the rule:

    rule.length Every deserialization of data from a buffer must be protected by a check on the length of the buffer.

    which we can then apply in the remainder of the inspection.
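
    For illustration, a read protected in the way rule.length demands might look something like this sketch (invented for this article, not OpenSSL code): a helper that refuses to decode a 16-bit value unless the buffer has at least two bytes remaining.

    #include <stddef.h>  /* for size_t */

    /* Decode a big-endian 16-bit value, advancing the pointer and
     * decrementing the remaining length, but only if there is enough
     * data left in the buffer. Returns 1 on success, 0 on failure. */
    static int read_uint16(const unsigned char **p, size_t *remaining,
                           unsigned int *out)
    {
        if (*remaining < 2)
            return 0;               /* not enough data: fail, don't read */
        *out = ((unsigned int)(*p)[0] << 8) | (unsigned int)(*p)[1];
        *p += 2;
        *remaining -= 2;
        return 1;
    }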

    n2s(p, payload);
  8. [rule.length] How does the code know that this read is legitimate?

  9. [rule.clear] It is not obvious that n2s(p, payload) has the side-effect of updating the values of p and payload, which makes the code harder to read and analyze. (Similarly for s2n.)

  10. [rule.assume] There is no check on the value of payload. You might imagine that n2s() decodes two bytes and so can only return values between 0 and 2^16 − 1, but it’s implemented like this:

    #define n2s(c,s) ((s=(((unsigned int)(c[0]))<< 8)| \
                         (((unsigned int)(c[1]))    )),c+=2)
    

    and so if (due to a programming error) c pointed to elements wider than a byte, then it could return a number greater than 2^16 − 1.

  11. [rule.assume, rule.sources] There is no check that payload satisfies the condition from RFC 6520, “The total length of a HeartbeatMessage MUST NOT exceed 2^14 or max_fragment_length when negotiated as defined in [RFC6066].”

    pl = p;

    if (s->msg_callback)
  12. [rule.clear] It is poor design to rely on an exceptional value of a data type (here, NULL is an exceptional value of the msg_callback function type). This means every time you access the value you have to remember to check it for the exceptional case. And if you forget, then the omission does not stand out clearly to the reader.

    An alternative approach would be to initialize s->msg_callback with a function that does nothing but check its arguments. Then you could always call the callback without having to remember to check it. This suggests the rule:

    rule.exceptional Don’t represent exceptional situations using an exceptional value of a variable that is used for some other purpose.
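
    For illustration, a sketch of that alternative (the function name is invented, and the parameter list is modelled on the call to s->msg_callback below):

    /* A do-nothing message callback. If s->msg_callback were initialized
     * to this rather than to NULL, it could always be called without a
     * preceding NULL check. A real version might also assert the
     * validity of its arguments. */
    static void msg_callback_null(int write_p, int version, int content_type,
                                  const void *buf, size_t len,
                                  SSL *ssl, void *arg)
    {
        (void)write_p; (void)version; (void)content_type;
        (void)buf; (void)len; (void)ssl; (void)arg;
    }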

        s->msg_callback(0, s->version, TLS1_RT_HEARTBEAT,
            &s->s3->rrec.data[0], s->s3->rrec.length,
            s, s->msg_callback_arg);

    if (hbtype == TLS1_HB_REQUEST)
        {
        unsigned char *buffer, *bp;
        int r;

        /* Allocate memory for the response, size is 1 byte
         * message type, plus 2 bytes payload length, plus
         * payload, plus padding
         */
        buffer = OPENSSL_malloc(1 + 2 + payload + padding);
  13. [rule.assume] How does the code know that OPENSSL_malloc() succeeds here? There needs to be a check on the return value.

  14. [rule.exceptional] This failure follows from the use of an exceptional value (NULL as the allocated buffer) to represent an exceptional situation (out of memory).

        bp = buffer;

        /* Enter response type, length and copy payload */
        *bp++ = TLS1_HB_RESPONSE;
        s2n(payload, bp);
        memcpy(bp, pl, payload);
  15. [rule.length] How does the code know that this copy is legitimate?

        /* Random padding */
        RAND_pseudo_bytes(p, padding);
  16. Why are these random bytes being written to the source buffer? Surely they should be written to the destination buffer at bp + payload?

        r = dtls1_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);
  17. [rule.consistent] This is the second time the buffer length has been computed (and there’s a third computation below). This should be stored in a variable to ensure that it’s consistent between the three uses.
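
    For example (a sketch, using the names from the surrounding code), the length could be computed once and the same variable used for the allocation, the write, and the callback:

    /* 1 byte type + 2 bytes payload length + payload + padding
     * (a size in bytes, so size_t per item 5) */
    size_t write_length = 1 + 2 + payload + padding;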

        if (r >= 0 && s->msg_callback)
            s->msg_callback(1, s->version, TLS1_RT_HEARTBEAT,
                buffer, 3 + payload + padding,
                s, s->msg_callback_arg);

        OPENSSL_free(buffer);

        if (r < 0)
            return r;
        }
    else if (hbtype == TLS1_HB_RESPONSE)
        {
        unsigned int seq;

        /* We only send sequence numbers (2 bytes unsigned int),
         * and 16 random bytes, so we just try to read the
         * sequence number */
        n2s(pl, seq);
  18. [rule.length] How does the code know that this read is legitimate?

        if (payload == 18 && seq == s->tlsext_hb_seq)
  19. [rule.clear] The purpose of this line (and the magic number 18) needs an explanation. Presumably this code aims to recognize a response to a heartbeat request sent by this SSL implementation. There needs to be a comment and a cross-reference to the code that sends the corresponding request.

  20. [rule.consistent, rule.assume] What if the code for sending the request needed to be changed? The number 18 should be computed in some way that’s consistent between the two functions, so that there’s no risk of the two becoming inconsistent.

  21. [rule.assume] Shouldn’t there be a check on s->tlsext_hb_pending too?

            {
            dtls1_stop_timer(s);
            s->tlsext_hb_seq++;
            s->tlsext_hb_pending = 0;
            }
        }
  22. [rule.clear] What about other values of hbtype? I guess the intention is to silently discard incorrect messages, but it would be nice to have a comment confirming this.

    return 0;
    }

Item 15 in this inspection report is the cause of the Heartbleed bug,[3] but you should note that the inspection report does not actually identify the bug. All that I was able to observe is that the code assumes, without checking, that it can read 3 + payload bytes from s->s3->rrec.data, and so the function relies for its correctness on its callers setting up the data structure correctly. For all I know, the callers could all have checked the buffer size and then there would be no bug. To have actually discovered the Heartbleed bug, I’d have had to analyze the consequences of every issue encountered here, which is not realistic. (With hindsight, it’s straightforward to do the analysis for this one item and so find the bug, but you have to bear in mind that the vast majority of issues found in inspection don’t indicate serious bugs.)

The advantage of the inspection process is that we don’t have to find the bug to fix it! The inspection process identifies the broken rule (the missing check) and remedial action must be taken, whether or not the broken rule indicates a bug.
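
For concreteness, the remedial action for item 15 might look something like the following sketch (a check of the kind rule.length demands, inserted before the copy; RFC 6520 requires a HeartbeatMessage whose payload_length is too large to be discarded silently):

    /* The record must be long enough to contain the 1-byte type, the
     * 2-byte payload length, the payload itself, and the minimum
     * padding; otherwise discard the message silently. */
    if (1 + 2 + payload + padding > s->s3->rrec.length)
        return 0;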

Item 16 also indicates a security bug. The purpose of cryptographic padding is to make known-plaintext and chosen-plaintext attacks more difficult, so by writing the padding bytes to the wrong buffer, the program is losing some protection against these attacks. The author spotted this bug after a couple of months and fixed it in a commit in February 2012.

There are some common objections to software inspection:

  1. The checking code that will have to be introduced to meet these rules will slow the program so much that it will fail to meet its requirements. This is just premature optimization. How do you know this is the case until you try? If you have a program that you are confident is correct (because it systematically checks result codes and asserts the correctness of its data structures) then it is usually a straightforward engineering task to make it run faster (by compiling out the small proportion of checks and assertions that lie on the critical path, by using Multics-style structure marking, or by other techniques; there is a sketch after this list of objections).[4] But if you have a fast program that you are not confident is correct, it is a much harder task to establish that confidence.

  2. It would be a vast amount of work to apply these rules to the body of legacy code, and change is always risky: in trying to add checks and assertions wholesale, it would be easy to introduce bugs. That’s right, but it doesn’t prevent you from applying the inspection procedure to new or updated code.

  3. No programmer would accept their code being subject to petty rules. If the rules are petty—if they are not providing benefit to justify the cost of complying—then they should be revised or discarded.

  4. Where can we get the time or resources to do this inspection? This is the most salient objection. The promise of software inspection is that it saves time overall, by preventing bugs from being introduced into the code, and so saving substantial time later in development. But getting it off the ground requires some investment based on the promise of future payback. If in doubt, measure! Track the cost of dealing with defects; introduce inspection in one area where reliability is critical; track the cost of the inspection process; after sufficient time has passed, compare the cost of dealing with defects in inspected and uninspected software. Is there a saving? Does the saving pay for the inspection?
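
Returning to the first objection, one way to engineer the trade-off between checking and speed is to keep assertions on everywhere and compile them out only on the few paths that measurement shows to be critical. A sketch (the macro names are invented for illustration, not taken from any particular code base):

    #include <assert.h>

    /* Assertions on measured hot paths can be compiled out in a
     * performance build; all other assertions stay on. */
    #if defined(DISABLE_HOT_PATH_CHECKS)
    #define ASSERT_HOT(cond) ((void)0)
    #else
    #define ASSERT_HOT(cond) assert(cond)
    #endif

    #define ASSERT_COLD(cond) assert(cond)   /* always checked */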


  1.  Obviously everyone brings their own perspective to this analysis, and so, to take an example, John Regehr (a researcher in static analysis) suggests that improvements to static analysis are the way to find bugs like Heartbleed. It would be easy to mock this tendency (“why the Heartbleed bug shows that you need to buy our product”) but technology and methodology recommendations are always going to come from people who are invested in them, so if you ignored all recommendations that were the least bit self-serving, you’d have to ignore everything, at least until the Software Failure Investigation Branch is formed.

  2.   The Memory Pool System is an incremental, thread-safe, generational garbage collector originally developed at Harlequin in Cambridge, and now maintained and developed by Ravenbrook.

    Security engineering is the hardest sub-discipline of computer programming, because of the nature of the adversaries: when writing ordinary programs, your adversaries are the complexity of the task, the stupidity of the computer, and your own human frailties and mistakes; but when writing programs that have to be secure, your adversaries are other programmers, some of whom are cleverer, more motivated and better funded than you are (and just one of them has to find something that you missed). But if writing secure programs is hard, incremental thread-safe garbage collection is also hard: the task of the Memory Pool System is to maintain a conservative approximation to the set of live objects and to move objects around memory to improve locality and avoid fragmentation, at the same time as client threads are busy reading and writing all over that same memory. Any small failure to ensure the consistency of data structures can eventually lead to disaster as live objects are overwritten or collected.

    The techniques described in this article have been invaluable in ensuring that the Memory Pool System has a very low rate of serious defects.

  3.  See Troy Hunt and Ted Unangst for analysis of the bug.

  4.  In particular, if you are skilled enough to develop an SSL implementation, then surely you can’t claim to be unable to engineer the trade-off between checking and speed?