Type inference for Python

,

1. Introduction

[Python] is a dynamically typed programming language with good support for “interface based programming”.

“Interface based programming” is a style of programming in which arguments to functions need not have a particular type, but must match a specified interface. (Interfaces are an explicit feature of the type system in Java; they are called “signatures” in ML; and are important parts of technologies like COM [Pattison 1999-01] and CORBA.)

Python’s built-in functions are good examples. Consider the function len [van Rossum 2001-12-21, 2.1]. This has the following interface:

For example, we might define a class all of whose instances have length 1:

class length_one:
    def __len__(self):
        return 1

and then len(length_one()) is 1.

This interface based approach in Python allows progammers to develop their own implementations of lists, strings, dictionaries, files and other native types, and use them wherever a native object would have been legal. See [van Rossum 2001-12-21, 3.3].

Interface based programming extends to modules. For example, the Python Database API [DBAPI 2.0] specifies an interface to a relational database. A program that uses the interface can be ported between the database backends that implement the interface [DBAPI Modules].

The Python language itself does not know about interfaces: interfaces are specified by documenting them. However, we can consider every Python function to specify an interface: namely, the set of objects that can be passed to it.

For example, consider the function successor:

def successor(n):
    return n + 1

This can be considered to offer the following interface:

This paper will formalize this notion of interfaces and set out a type system for expressing these interfaces.

2. A type system for Python

2.1. Caveat

The type system described in this paper is not sufficiently general to be able to give a type to all Python programs or even all Python programs that never have run-time type errors. However, it is conservative in the sense that it never gives a type to a program that can fail with a run-time type error, and when it does fail it is usually because the programmer has used a “trick” of the language.

This means that it can be a useful tool for checking that a program adheres to a type discipline. There’s more detail about this in section 4.

2.2. Types

Given types T1, T2, etc, we can build more types by applying the following two type constructors:

Here m1, m2, etc are attribute or method names, namely non-empty strings. Object types are “open” if they are not fully known (that is, they may have more methods than the specified ones); and “closed” if all methods are known. The distinction is not always important and I will write object types without the superscript to mean either open or closed.

I will write meth(T) for the set of methods of an object type. This means that:

meth({ m1: T1, ..., mn: Tn }0) ⊇ { m1, ..., mn }
meth({ m1: T1, ..., mn: Tn }1) = { m1, ..., mn }

In addition, there are some predefined names for types and type constructors:

None =      { }1
Number =    { __add__: Number → Number,
              __sub__: Number → Number,
              __mul__: Number → Number, ... }1
List(T) =   { __len__: () → Number,
              __getitem__: Number → T,
	      __setitem__: (Number, T) → None, ... }1
Dict(T,U) = { __len__: () → Number,
              __getitem__: T → U,
	      __setitem__: (T, U) → None, ... }1

Note that these definitions have no foundation. Types in Python are general graphs.

2.3. Subtypes

The subtype relation T <: U means that any object of type T can be used in a place where an object of type U was required, or more simply, that an object of type T implements the interface U.

The subtype relation satisfies the following rules (all except (Transitive) can be applied in reverse):

T <: T (Reflexive)
 
T <: U   U <: V
T <: V
(Transitive)
 
T <: U   U <: T
T = U
(Antisymmetric)
 
T → U <: V → W
V <: T   U <: W
(Function)
 
{ mi: Ti | i ∈ J ∪ I } <: { mj: Uj | j ∈ J }
Tj <: Uj (j ∈ J)
(Object)
 
( T1, T2, ..., Tn ) <: ( U1, U2, ..., Un )
Ti <: Ui (1 ≤ i ≤ n)
(Tuple)

2.4. Union and intersection of types

The union of two types is the set of objects of either type. The intersection of two types is the set of objects that match both types. The union and intersection operations obey these rules:

T <: V   U <: V
T ∪ U <: V
(Union)
 
T <: T ∪ U (Union-L)
 
U <: T ∪ U (Union-R)
 
T <: U   T <: V
T <: U ∩ V
(Intersection)
 
T ∩ U <: T (Intersection-L)
 
T ∩ U <: U (Intersection-R)

These rules can be derived from the rules above:

 
(T → U) ∪ (V → U) <: (T ∩ V) → U (∪-Function)
 
(T → U) ∩ (T → V) <: T → (U ∩ V) (∩-Function)
 
(T1, T2, ..., Tn) ∪ (U1, U2, ..., Un) <: (T1 ∪ U1, T2 ∪ U2, ..., Tn ∪ Un) (∪-Tuple)
 
(T1 ∩ U1, T2 ∩ U2, ..., Tn ∩ Un) <: (T1, T2, ..., Tn) ∩ (U1, U2, ..., Un) (∩-Tuple)
 
{ mi: Ti | i ∈ I } ∪ { mj: Uj | j ∈ J } <: { mi: Ti ∪ Ui | i ∈ I ∩ J } (∪-Object)
 
{ mi: Ti | i ∈ I } ∩ { mj: Uj | j ∈ J } <: { mi: Ti ∩ Ui | i ∈ I ∪ J } (∩-Object)

3. Type inference

I suspect that there’s no type inference algorithm for this system. Type inference for types without foundations and intersection types is known to be hard.

However, that doesn’t mean we won’t be able to get useful results from a procedure that attempts to deduce a type in this system, because we may be able to prove that a program can’t have a type, which means either that the program may fail with a run-time type error, or else that the programmer has used a trick of the Python language not modelled by the type system. In either case, the result is useful.

The basic idea behind type inference is to assign a type to each expression in the program. Initially these types are placeholders. Then we apply rules for the language constructs that allow us to place constraints on the types.

3.1. Examples

These examples are informal, because I haven’t yet explained how to assign types to pieces of Python code. But they should be clear.

3.1.1. Example: the type of successor

def successor(n):
    return n + 1

Let successor: A and n: B. Then I deduce:

  1. A = B → C from line 1.
  2. B = { __add__: D → E }0 and Number <: D from the + on line 2.
  3. E <: C from the return on line 2.
  4. Number → C <: D → E by rule (Function).
  5. { __add__: Number → C }0 <: B by rule (Object).
  6. A <: { __add__: Number → C }0 → C by rule (Function).

This type expresses all the facts I noted about successor in section 1 so I can’t do any better than this.

3.1.2. Example: no type

len(1)

Let len: A and len(1): B. I deduce:

  1. A <: Number → B from the function application.
  2. A = { __len__: () → Number }0 → Number by the definition of len.
  3. __len__ ∈ meth(Number) by rule (Object).
  4. But this is false: __len__ ∉ meth(Number) (note that this is crucially dependent on Number being a closed object type; if it were open then the code would typecheck).

3.1.3. Example: an intersection

len(a)
a + 1

Let a: A. Then I deduce:

  1. A <: { __len__: () → Number } from line 1.
  2. A = { __add__: B → C }0 and Number <: B from line 2.
  3. A <: { __len__: () → Number } ∩ { __add__: Number → B } by rule (Intersection).
  4. A <: { __len__: () → Number, __add__: Number → B } by rule (∩-Object).

3.1.4. Example: a union

a + 1
a + ""

Let a: A. Then I deduce:

  1. A = { __add__: B → C }0 and Number <: B from line 1.
  2. String <: B from line 2.
  3. Number ∪ String <: B by rule (Union).
  4. A <: { __add__: (Number ∪ String) → C } by rules (Function) and (Object).

4. Limitations

The informal examples of type inference assume that each site has a single type (implements a single interface). But Python doesn’t enforce that: a programmer can use one variable with several meanings. For example, the following correct code doesn’t have a type in this system:

a = 1
a = []
len(a)

because we deduce

  1. Number <: A from line 1.
  2. A <: { __len__: () → Number } from line 3.
  3. Number <: { __len__: () → Number } by rule (Transitive).
  4. But this is impossible since Number has no method __len__.

The type system can’t deduce all the information it needs to. For example, the following correct code doesn’t have a type in this system:

a = lambda x: x
b = a(1)
c = a([])
len(c)

because we deduce

  1. D <: E and D → E <: A from line 1.
  2. Number <: D from line 2.
  3. E <: C from line 3.
  4. Number <: C by rule (Transitive).
  5. C <: { __len__: () → Number } from line 4.
  6. Number <: { __len__: () → Number } by rule (Transitive).
  7. But this is impossible since Number has no method __len__.

Here we didn’t deduce enough information about A: it’s really something like ∀T: T → T. But type inference is undecidable for many systems involving second-order types [Well 1996-08-05] so I didn’t want to include them in my system. And for similar reasons, this algorithm can’t deal with programs that switch on the types of objects.

A. References

[Aycock 1999-11-23] “Aggressive type inference”; John Aycock; Department of Computer Science, University of Victoria, Canada; 1999-11-23.
[DBAPI 2.0] “Python Database API Specification 2.0”.
[DBAPI Modules] “Python Database Modules”.
[Jeske 1999-11-21] “Python Type-Inference based LINT (pylint.py)”; David Jeske; 1999-11-21.
[Jeske 1999-12-05] “Re: Python Type-Inference based LINT (pylint.py)”; David Jeske; 1999-12-05.
[Pattison 1999-01] “Understanding Interface-based Programming”; Ted Pattison Microsoft; 1999-01.
[Python] “The Python programming language”.
[Well 1996-08-05] “Typability and type checking in the second-order λ calculus are equivalent and undecidable”; J B Wells; Boston University; 1996-08-05.
[van Rossum 2001-12-21] “Python Reference Manual (Python 2.2)”; Guide van Rossum; 2001-12-21.

B. Document History

2002-02-26 GDR Created.