Python has no precedence grammar


I was writing a minifier for Python 2 when I noticed1 an interesting feature: you can’t express the Python grammar in the form of an operator precedence grammar.

Background: an operator precedence grammar is one in which every pair of operators have a precedence relation, which allows the parser to decide whether to shift or reduce by looking only at the next operator. (See operator precedence parser.) A feature of operator precedence grammars is that you can put the operators in a precedence list, with operators higher in the list binding more tightly than operators below.

However, this can’t be done in Python, because tuples and generator expressions are incomparable: neither has higher precedence than the other. In the context of an assignment you must parenthesize a generator expression but you can write a tuple without parentheses:

a = 1, 2, 3                   # Assign tuple to a.
a = (x**2 for x in range(5))  # Assign generator to a.
a = x**2 for x in range(5)    # Oops: syntax error.

While in the context of a function call you must parenthesize a tuple, but you can write a (single) generator expression without extra parentheses:

f(x**2 for x in range(5))     # Call f with generator.
f((1, 2, 3))                  # Call f with tuple.
f(1, 2, 3)                    # Oops: three arguments, not tuple.

This inconsistency manifests itself in the grammar having two different productions for tuples and generators. Normally a generator expression is parsed by the testlist_comp production:

testlist_comp: test ( comp_for | (',' test)* [','] )

But in the context of a function call containing a single generator expression it can also be parsed via the argument production:

argument: test [comp_for] | test '=' test

Similarly, a tuple can be parsed via the testlist production:

testlist: test (',' test)* [',']

but inside (non-function call) parentheses it gets parsed by the testlist_comp production (see above).

I guess I understand the pragmatic reason for designing it like this: it is very common to call a function with a generator expression as its argument, while it is uncommon to call a function with a tuple expression: normally you’d arrange just to call the function with multiple arguments. However, it does mean extra cases in the parser and in anything that tries to generate source code while minimizing parentheses.

I tried Dan McDougall’s minifier but it has a few bugs:

  1. Its docstring detection is broken: if I run it on:

    def foo():
        'a' if True else 'd'
    def bar():

    then I get the output

    def foo():
     if True else 'd'

    where a syntax error has been introduced into foo, and bar has simply disappeared!

  2. It deletes blank lines in triple-quoted strings.

  3. It adds a spurious pass to a function whose body is on the same line as its declaration. If I run it on:

    def baz(): return 1

    then I get the ill-formed output

    def baz():return 1 pass

And some missing features:

  1. It doesn’t know that it can squeeze the spaces out of 'a' if 'b' and 'c' else 'd'.

  2. It doesn’t know that operators don’t need a space when they follow a number, so that 1 if 2 in 3 and 4 is 5 else 6 can be turned into 1if 2in 3and 4is 5 else 6. (I have to say this surprised me when I discovered it.)

  3. It doesn’t know that it can move up a suite of code onto the same line as the statement that introduces the suite, as long as the suite itself contains no sub-suites.

  4. It doesn’t know anything about the grammar (it works on the basis of tokens only), so it can’t remove unnecessary parentheses. For example, ((a - b) + (c - (d ** (e ** f)))) ought to be minified to a-b+(c-d**e**f).2 Unnecessary parentheses often result when a multi-line parenthesized expression is joined up onto a single line, so it’s important to handle this case.

  5. It doesn’t remove trailing commas from tuple, dictionary, and list literals.

  6. It doesn’t change except E as e: to except E,e:.

  7. It doesn’t change pass to 0.

  8. It can’t rename variables and functions.

The last of these is fair enough: Python’s use of introspection and duck typing means that automatic renaming cannot be done reliably without user oversight. But still it would be nice to be able to ask for a “best effort” automated approach.

Here’s some example output from my minifier. What does this function do?

def b(c):
 if c not in a:d=c//2;e=c%2;f=e*2-1;a[c]=b(d+1)**2+f*b(d+e-1)**2
 return a[c]

And how about this one?

import random
def a(b,c):
 if b%2==0:return False
 while d%2==0:d//=2;e+=1
 for f in xrange(c):
  if h==1or h==b-1:continue
  for i in xrange(e-1):
   if h==1:return False
   if h==b-1:break
  if h!=b-1:return False
 return True

  1.  Actually, I struggled for some time trying to figure out why my code (which uses a table of operator precedences to deduce when expressions need to be parenthesized) wasn’t working. Whenever I fixed it for tuples, generator expressions would break, and vice versa.

  2.  Not a-b+c-d**e**f: even though the removal of the last parenthesis would be a valid transformation if all these variables were numbers, it’s not correct in general, as it would result in a different order of operations, which might make a difference if c is an object that implements the __sub__ method, for example.