My wife works as a management accountant. Her work often involves pulling large volumes of financial data from the company accounting database and analyzing the results in Microsoft Excel. Occasionally she rings me up with a problem: she has a spreadsheet with thousands, or tens of thousands, of rows, and a task that she doesn’t know how to automate.
I’ve never used Microsoft Excel myself, but I have some skill at programming, so by reading the manual carefully I can figure out how to write the formula she needs. However, lots of things that ought to be easy seems pretty hard to do in Excel, and even once I’ve figured out how to do them, I usually can’t figure out how to make my solution readable, comprehensible, and maintainable.
Let’s take an example. Suppose we have a table of contracts like this:
|… etc …|
A straightforward query on a table like this might be, “how much business did we sign with Foo Ltd in the 2005–2006 financial year?” This kind of query is the bread and butter of SQL: it’s straightforward to write and clear and simple to read and understand:1
SELECT SUM(value) FROM contracts WHERE customer = 'Foo Ltd' AND signed >= '20050401' AND signed < '20060401'
In Microsoft Excel it’s not so simple to write, nor is the resulting code particularly easy to read (even if we format it nicely, which is not exactly easy to do in Excel):
=SUMPRODUCT(($D$2:$D$99) * ($A$2:$A$99 = "Foo Ltd") * ($B$2:$B$99 >= "20050401") * ($B$2:$B$99 < "20060401"))
Some explanations for readers who may not be familiar with Excel: a dollar sign specifies an absolute cell reference, for example
$A$2 always refers to cell A2 regardless of which cell this formula is copied into, but
A2 will change when the formula is copied or moved. For example, if this formula started out in cell E2 and then is copied to cell F3, an
A2 reference will change to
$A$2 will remain unchanged.
SUMPRODUCT is a function that multiplies a bunch of cell ranges and sums the resulting products. (When there are two ranges, this is like the dot product of two vectors.) A range of cells together with an operation refers to the result of applying the operation. Since the result of a logical operation like
<= is 0 or 1, the effect of multiplying a range of values by a bunch of ranges with conditions is to select just those values where all the conditions apply. In Excel, it’s possible to name columns or ranges of cells, so with the appropriate names, this query wouldn’t look too bad:
=SUMPRODUCT(value * (company = "Foo Ltd") * (signed >= "20050401") * (signed < "20060401"))
SUMPRODUCT corresponds to a logical and operation. As far as I know, there’s no Excel function that works like
SUMPRODUCT but has the effect of taking the logical or of its conditions. So to express some queries using
SUMPRODUCT you might need to apply De Morgan’s laws, which would further reduce the legibility of the formula. And some other queries are not expressible at all using this approach.
So is there a better way? Well, Excel 2003 has a bunch of “database functions” which allow you treat a range of cells like a database table and do stuff to those rows that match some conditions. Sound good, right? But you have to specify those conditions in cells somewhere in your spreadsheet. For our example above we could use
DSUM and write our query as:
=DSUM($A$1:$D$99, "value", $A$100:$D$101)
$A$1:$D$99 is the “table” we want to query,
"value" is the field we want to sum, and
$A$100:$C$101 is the location of our query conditions, which should look like this:
|101||="=Foo Ltd"||>= "20050401"||< "20060401"|
Each row of the query (in this case, only one row) corresponds to a conjunction: all the conditions have to be true. So note that since we have two conditions on the signing date, we need two columns headed by “signed”. Note also the funny syntax in cell A101 that we need to use to prevent Excel from thinking that the cell contains a formula.
If there are multiple rows in the query specification then a record will be included in the
DSUM if any of the query rows match. This allows you to express any query in disjunctive normal form. Well, it’s a theorem of Boolean logic that any Boolean expression can be converted to disjunctive normal form, but I wouldn’t like to have to do it every time. So this method is rubbish: you have to express your query in disjunctive normal form, you have to put it in some other part of your spreadsheet, not in your formula where it belongs, and it’s still ugly and difficult to read.
Now let’s introduce some more complexity. Suppose we have another spreadsheet with the revenue for each contract by quarter (quarters are labelled with the month they start with):
|… etc …|
Now we want to know, “how much revenue did we make from Foo Ltd in financial year 2005–2006?” This involves cross-referencing the two tables. If we were working with databases, we’d be stuck here, because the data is not in database normal form. But at database design time, we would have noticed this and put the revenue table into normal form, like this:
|… etc …|
Now we can use a join to get the result we want:
SELECT SUM(revenue.revenue) FROM contracts, revenue WHERE contracts.company = 'Foo Ltd' AND contracts.contract = revenue.contract AND revenue.quarter >= '200504' AND revenue.quarter < '200604'
Of course, a normal form database is not very nice for a human to read, compared with a spreadsheet. But it’s simple to get the data out in spreadsheet form, and indeed that’s probably where the spreadsheet came from in the first place.
So how do we approach this problem in Excel? Well, I don’t know a way to do this in a single formula, so I’m going to have to store some intermediate results. Let’s add two more columns to the revenue spreadsheet, one for the company name, and one for the total 2005–2006 revenue.
|… etc …|
Column G can be computed using
VLOOKUP in a formula like this (for cell G2):
=VLOOKUP($A2, contracts!$A$2:$B$99, 2, FALSE)
A2!) We might be tempted to use a simple sum for column H (for example,
=SUM($B2:$E2) in cell H2) but to be properly general we should really write something like this (in cell H2):
=SUMPRODUCT(($B$1:$F$1 >= "200504") * ($B$1:$F$1 < "200604") * ($B2:$F2))
And finally we can compute the sum we want:
=SUMPRODUCT(($A$1:$A$99 = "Foo Ltd") * ($H$1:$H$99))
This is ghastly. This kind of operation (joining two tables based on a foreign key) ought to be really simple, but in Excel we struggled with naming cells, awkward syntax for constant and variable cell references that’s easy to get wrong, Boolean normal forms, and the tedium of having to re-copy our formulas into all the cells we want to compute every time we make an edit to the code—and still failed to write code that’s easy to read, understand, or maintain.
So what’s the alternative? (Other than ditching Excel for something better, of course.) Maybe Visual Basic is the right way to go? Perhaps some Excel expert can present some advice.
A tongue-in-cheek theory about product marketing is that it may be better to have a product that’s hard to use than a product that “just works”. There are two reasons: first, users of the product that’s hard to use will have to get together and share solutions, create user groups, have tutorials and training sessions, and so on, thus creating a sense of community and lots of free publicity and word of mouth; and second, users will get a lot of pleasure from accomplishing the simplest things.
Perhaps the latter is one of the reasons why Excel is so popular: you can get a genuine feeling of accomplishment from merely having figured out how to add up a bunch of numbers.
↩ In practice I’d probably use some system facility for representing and comparing dates, but for simplicity of exposition I’m avoiding this complication.