Bogus foreign keys in Django

,

This is a problem that had me puzzled for a while, but has a really simple solution, so I’m writing it up here in case it might be useful to other developers who find themselves in this slightly bizarre situation.

1. The background

There’s quite a lot of background to get through before I can introduce the problem. Imagine that you are developing a web site using the Django framework. The underlying database has been in production for some years, so that the ideal mode of operation (where you describe your data model to Django and then it creates the appropriate database tables) is not available. Instead you have to reverse-engineer the database and translate it into a form that Django understands. So you have a look at the database and it looks like this:1

mysql> select * from user limit 3;
+--------+------------------+----------------+
| userid | name             | (other fields) |
+--------+------------------+----------------+
|      1 | Alarico De Luca  | ...            |
|      2 | Blanka Zielinska | ...            |
|      3 | Cristin Åberg    | ...            |
+--------+------------------+----------------+
3 rows in set (0.00 sec)

mysql> select * from task limit 3;
+--------+--------+----------------+
| taskid | userid | (other fields) |
+--------+--------+----------------+
|      1 | 3      | ...            |
|      2 | 1      | ...            |
|      3 | 2      | ...            |
+--------+--------+----------------+
3 rows in set (0.00 sec)

So there are tasks and users, and each task is assigned to a user. So let’s translate that into a couple of Django models:

from django.db import models

class User(models.Model):
    id = models.IntegerField(primary_key=True, db_column='userid')
    name = models.CharField()

class Task(models.Model):
    id = models.IntegerField(primary_key=True, db_column='taskid')
    user = models.ForeignKey(User, null=True, db_column='userid')

This all looks good from the Django shell:

>>> from myapp.models import *
>>> User.objects.values('name')[0]
{'name': u'Alarico De Luca'}
>>> Task.objects.get(id = 1).user.name
u'Cristin Åberg'

So you write a template to present some query results in a table:

<table>
  <thead>
    <tr>
      <td>#</td>
      <td>Assigned to</td>
    </tr>
  </thead>
  <tbody>
    {% for task in tasks %}
      <tr>
        <td>{{ task.id }}</td>
        <td>{{ task.user.name }}</td>
      </tr>
    {% endfor %}
  </tbody>
</table>

But when you run some sample queries, the Django server falls over with the following error while expanding {{ task.user.name }} in the template:

Traceback (most recent call last):
  [...]
  File "[...]/django/db/models/fields/__init__.py", line 537, in get_prep_value
    return int(value)
ValueError: invalid literal for int() with base 10: 'John Smith'

What’s going on here? You dig into the backtrace and discover that the offending task is number 123.

mysql> select * from task where taskid=123;
+--------+------------+----------------+
| taskid | userid     | (other fields) |
+--------+------------+----------------+
|    123 | John Smith | ...            |
+--------+------------+----------------+
1 row in set (0.00 sec)

This explains why Django is trying to convert a string into an integer when trying to get the user for this task. The userid field is a string field! This was in fact deducible from the way the numbers in this field were left-aligned in the output of the query select * from task, but that was easy to miss. Or it would have been obvious if you had done the sensible thing and looked at the table schema as well as the first few rows:

mysql> describe task;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| taskid   | int(11)      | NO   | PRI | NULL    | auto_increment |
| userid   | varchar(100) | NO   |     | NULL    |                |
| ...      | ...          | ...  | ... | ...     | ...            |
+----------+--------------+------+-----+---------+----------------+
11 rows in set (0.38 sec)

You can guess how this might have happened: perhaps originally there was no user table: instead, the user’s name was written directly into a field in the task table. Later, the user table was added, and instead of adding another field to the task table containing a foreign key on the user table, the foreign key was placed in the existing userid field. MySQL is quite happy to join a string field against an integer field (performing an implicit conversion), so it would have been easy for an inexpert or hurried database administrator to make the change this way. Now of course, with several applications deployed against this database, it is not worth the work to fix the schema.

2. The problem

Right, so now you know that userid is a string field, but how do you fix the Django model? In particular, if you make this field a models.CharField, how do you get the associated user? Here’s a sequence of plans that you might try:

  1. Leave the field as a models.ForeignKey field, but add a user() method that catches the ValueError:

    class Task(models.Model):
        id = models.IntegerField(primary_key=True, db_column='taskid')
        userid = models.ForeignKey(User, null=True)
    
        def user(self):
            try:
                return self.userid
            except ValueError:
                return None
    

    This works, but how do you get at the user name in the exceptional cases where it’s the name (rather than the foreign key) that’s stored in the record’s userid field?

  2. Looking at the implementation of foreign key fields in Django (see related.py), it seems that you can get the actual value of a foreign key field F (rather than the related object) like this:

    getattr(instance, type(instance).F.field.attname)
    

    So you could add a user_name() method like this:

    class Task(models.Model):
        id = models.IntegerField(primary_key=True, db_column='taskid')
        userid = models.ForeignKey(User, null=True)
    
        def user(self):
            try:
                return self.userid
            except ValueError:
                return None
    
        def user_name(self):
            try:
                return self.userid.name
            except ValueError:
                return getattr(self, Task.userid.field.attname)
    

    The trouble with this is that field.attname isn’t documented, so it’s not supported. This area of the Django implementation changed significantly between 1.3 and 1.5 (though without breaking field.attname, as it happens). So this is risky: you would prefer to have a solution that is supported.

  3. Represent the userid field in Django as a models.CharField (so that you can get the name in a supported way), and look up the related object in Python:

    class Task(models.Model):
        id = models.IntegerField(primary_key=True, db_column='taskid')
        userid = models.CharField()
    
        def user(self):
            try:
                return User.objects.get(id = int(self.userid))
            except ValueError:
                return None
            except User.DoesNotExist:
                return None
    
        def user_name(self):
            try:
                return self.user().name
            except AttributeError:
                return self.userid
    

    This is a non-starter: it means a whole extra database query each time a user name is looked up, and kills performance. So on to:

  4. Fetch all the users from the database, store them using Django’s cache framework, and use the cache to look up the user as needed?

    from django.core.cache import cache
    
    class Task(models.Model):
        id = models.IntegerField(primary_key=True, db_column='taskid')
        userid = models.CharField()
    
        def user(self):
            users = cache.get('users')
            if users is None:
                users = {user.id: user for user in User.objects.all()}
                cache.set('users', users)
            try:
                return users[int(self.userid)]
            except ValueError:
                return None
            except KeyError:
                return None
    
        def user_name(self):
            try:
                return self.user().name
            except AttributeError:
                return self.userid
    

    However, (a) there are so many users that filling the cache takes tens of seconds which means that the first person to query the site after a reboot is going to suffer an unacceptable delay; and (b) the cache is going to get out of date and need refreshing. Solving both of these is a Simple Matter of Programming, but this is looking over-complicated and error prone. Plan B would surely be better than this!

3. The solution

You can have two fields in a Django model that refer to the same database column, but with different field types!

class Task(models.Model):
    id = models.IntegerField(primary_key=True, db_column='taskid')
    user = models.ForeignKey(User, null=True,2 db_column='userid')
    userid = models.CharField()

    def user_name(self):
        try:
            return self.user.name
        except ValueError:
            return self.userid

Obvious when you know how…


  1.  Made up example.

  2.  One of the effects of setting null=True for a foreign key is that the select_related() method “by default, does not follow foreign keys that have null=True”. You need to request the field explicitly: in this case, by calling select_related('user').