Using html5lib to resolve relative URLs

,

Icon indicating an RSS feed.

David suggested that the RSS feed for garethrees.org should have the full content of the posts instead of the summary. The way he put it was, “It should be up to me how to summarize your feed.”

I wasn’t so sure. At the time my own feed reader was the one built into Safari, which is adequate but not great, and my habit was to visit posts at their own URLs. But (also at David’s suggestion) I tried out Google Reader, and I can now see that with a decent feed reader it makes sense to read most posts in the reader.

I learned some other things from Google Reader. For example, I can see that there are 3 people who subscribe to my feed. Because the date that Google gives for a post isn’t the <pubDate> from the feed but is instead the date on which Google first discovered the post in the feed, I can tell that the first subscriber picked up the feed on 2009-08-12. Welcome, Google Readers!

So, how to put the full posts in the feed? The main problem is that of translating relative URLs (in the post source) into absolute URLs (in the feed). It turns out that this has to be done for all relative URLs, even ones that consist only of a fragment identifier (that is, links within a single post), because Google Reader transforms <a href="#id"> into <a href="http://site-of-feed/#id">.

It turns out that this is really easy to do in Python using Google’s html5lib library.

import html5lib
import html5lib.serializer
import html5lib.treewalkers
import urlparse

# List of (ELEMENT, ATTRIBUTE) for HTML5 attributes which contain URLs.
# Based on the list at http://www.feedparser.org/docs/resolving-relative-links.html
url_attributes = [
    ('a', 'href'),
    ('applet', 'codebase'),
    ('area', 'href'),
    ('blockquote', 'cite'),
    ('body', 'background'),
    ('del', 'cite'),
    ('form', 'action'),
    ('frame', 'longdesc'),
    ('frame', 'src'),
    ('iframe', 'longdesc'),
    ('iframe', 'src'),
    ('head', 'profile'),
    ('img', 'longdesc'),
    ('img', 'src'),
    ('img', 'usemap'),
    ('input', 'src'),
    ('input', 'usemap'),
    ('ins', 'cite'),
    ('link', 'href'),
    ('object', 'classid'),
    ('object', 'codebase'),
    ('object', 'data'),
    ('object', 'usemap'),
    ('q', 'cite'),
    ('script', 'src')]

def absolutify(src, base_url):
    """absolutify(SRC, BASE_URL): Resolve relative URLs in SRC.
SRC is a string containing HTML. All URLs in SRC are resolved relative
to BASE_URL. Return the body of the result as HTML."""

    # Parse SRC as HTML.
    tree_builder = html5lib.treebuilders.getTreeBuilder('dom')
    parser = html5lib.html5parser.HTMLParser(tree = tree_builder)
    dom = parser.parse(src)

    # Handle <BASE> if any.
    head = dom.getElementsByTagName('head')[0]
    for b in head.getElementsByTagName('base'):
        u = b.getAttribute('href')
        if u:
            base_url = urlparse.urljoin(base_url, u)
            # HTML5 4.2.3 "if there are multiple base elements with href
            # attributes, all but the first are ignored."
            break

    # Change all relative URLs to absolute URLs by resolving them
    # relative to BASE_URL. Note that we need to do this even for URLs
    # that consist only of a fragment identifier, because Google Reader
    # changes href=#foo to href=http://site/#foo
    for tag, attr in url_attributes:
        for e in dom.getElementsByTagName(tag):
            u = e.getAttribute(attr)
            if u:
                e.setAttribute(attr, urlparse.urljoin(base_url, u))

    # Return the HTML5 serialization of the <BODY> of the result (we don't
    # want the <HEAD>: this breaks feed readers).
    body = dom.getElementsByTagName('body')[0]
    tree_walker = html5lib.treewalkers.getTreeWalker('dom')
    html_serializer = html5lib.serializer.htmlserializer.HTMLSerializer()
    return u''.join(html_serializer.serialize(tree_walker(body)))

Notes:

  1. If you’re using html5lib for other tasks, you might want to apply my patch for a bug in html5lib.treewalkers so that the serialization works reliably. For the application described above, I think the <body> won’t have any sibling elements, so you don’t need the patch. [This bug is fixed in release 0.90.]

  2. If your blog is in XHTML instead of HTML then your task is a bit harder. You can do the parse step easily enough using one of Python’s XML parsers, for example xml.sax. But then for the URL resolution step you have to take into account the XML Base specification, which is a bit hairier than the HTML <base> element. Though I suppose if you’re writing something for personal use only, and you don’t use xml:base yourself, you can get away with ignoring it.