Validating XHTML

If, like me, you often write documents directly in HTML or XHTML, then you probably find that it’s hard to write correct markup. There are lots of mistakes that are easy to make and hard to notice, such as failing to write the close tag for an element, omitting compulsory attributes like the alt attribute on img elements or forgetting to encode ampersands in a URL.

Of course, you are not obliged to care about the correctness of your markup. If you only intend the result to be read by humans, then it probably doesn’t matter if there are a few errors. Web browsers are very good at handling tag soup and displaying a plausible result. But if you want to do automated processing on your documents, or care about other people’s ability to do automated processing on your documents (for example, people using screen readers), it helps if they are valid. Or you might just care about the quality of your work.
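As a rough illustration of how easily some of these mistakes slip through, here is a minimal sketch (in Python 3, and not the checker described in this post) that feeds three small fragments to the standard library's expat parser. The function name well_formedness_error and the sample fragments are made up for the example. Note that this only tests XML well-formedness: it should catch the missing close tag and the bare ampersand, but a missing alt attribute is a validity error that only a checker working against the DTD will report.

```python
import xml.parsers.expat


def well_formedness_error(markup):
    """Return 'LINE:COLUMN: MESSAGE' for the first XML well-formedness
    error in MARKUP, or None if the parser accepts it.
    (Hypothetical helper, for illustration only.)"""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)
    except xml.parsers.expat.ExpatError as e:
        # e.lineno counts from 1; e.offset is the column, counting from 0.
        return "%d:%d: %s" % (e.lineno, e.offset,
                              xml.parsers.expat.ErrorString(e.code))
    return None


# Invented sample fragments, one per mistake mentioned above.
samples = {
    "missing close tag": "<p>Hello <em>world</p>",
    "bare ampersand": '<p><a href="page?a=1&b=2">link</a></p>',
    "missing alt": '<p><img src="cat.png" /></p>',
}
for name, markup in sorted(samples.items()):
    print("%s: %s" % (name, well_formedness_error(markup)))
```

The third fragment parses without complaint, which is exactly why a well-formedness check alone is not enough.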

I mostly write in XHTML 1.0 Transitional, so for validation I use a script I wrote back in 2001, check_xhtml.py. The following Emacs interface is convenient because it uses the compile function to run the checker, so I can use next-error (C-x `) to step through the errors in the source, just as with any other compilation.

(defvar checker-program (expand-file-name "~/python/check_xhtml.py")
  "Path to document checker program.")

(defun check-file ()
  "Check visited file using `checker-program'."
  (interactive)
  (let ((file (buffer-file-name)))
    (unless file (error "Buffer not visiting file"))
    (compile (concat (shell-quote-argument checker-program) " "
                     (shell-quote-argument file)))))

This works for me, but you might not be satisfied with check_xhtml.py.

The W3C validator is a fine program, but very inconvenient for the user, because there’s no way to automatically step to the offending lines of source. Cross-referencing a list of errors with an erroneous document is pretty tedious. So let’s make a program that uploads a file to the W3C validator, parses the resulting list of errors, and outputs them in a format that Emacs’ next-error command recognizes.

The first problem is that Python’s standard library (urllib2) has no built-in support for uploading files as multipart/form-data. Luckily a few programmers out there have already tackled the problem, and the best of the solutions I looked at is Will Holcomb’s MultipartPostHandler.py (as modified by Brian Schneider to support Unicode). So with that in hand, the rest of it is a straightforward application of XML parsing and DOM querying using the xml.dom.minidom library.

#!/usr/bin/python
#
#          VALIDATE.PY -- VALIDATE DOCUMENT USING W3 VALIDATOR
#                        Gareth Rees, 
#
# USAGE
#
#   validate.py FILE
#
#   Uploads FILE to the World Wide Web Consortium's validation service,
#   parses the results, and writes out errors and warnings on stderr in
#   gcc's "brief format" [1] for easy parsing by the "next-error"
#   command in Emacs.
#
# NOTES
#
#   Uses MultipartPostHandler.py [2,3] to encode the file using
#   the MIME type multipart/form-data [4] for upload.
#
# REFERENCES
#
#   [1] "GNAT User's Guide", section 3.2.1. Output and Error Message Control
#       <http://gcc.gnu.org/onlinedocs/gnat_ugn_unw/Output-and-Error-Message-Control.html>
#
#   [2] Brian Schneider (2007).
#       "MultipartPostHandler doesn't work for unicode files".
#       <http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html>
#
#   [3] Will Holcomb <wholcomb@gmail.com> (2006).
#       "MultipartPostHandler.py"
#       <http://pipe.scs.fsu.edu/PostHandler/MultipartPostHandler.py>
#
#   [4] L. Masinter (1998).
#       "RFC 2388: Returning Values from Forms: multipart/form-data".
#       <http://www.ietf.org/rfc/rfc2388.txt>

import re
import sys
import urllib2
import urlparse
import xml.dom.minidom
import MultipartPostHandler

def validate_w3(file, errs = sys.stderr):
    """validate_w3(FILE, ERRS): validate FILE using validator.w3.org.
Print errors and warnings to the stream ERRS (which defaults to
sys.stderr) using gcc's "brief format": "FILE:LINE:COLUMN: MESSAGE".
Return True if the document was valid (no errors or warnings), False
otherwise."""
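    opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler)
    # Close the uploaded file once the request has been sent.
    upload = open(file, 'rb')
    try:
        params = {
            'doctype': 'Inline',
            'uploaded_file': upload,
            }
        result = opener.open('http://validator.w3.org/check', params)
    finally:
        upload.close()
    # The validator's response is XHTML, so an XML parser can read it.
    dom = xml.dom.minidom.parse(result)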
    valid = True
    line_re = re.compile(r'line ([0-9]+)(?:, column ([0-9]+))?', re.I)
    # Each warning is contained in a <li class="msg_warn"> element and
    # each error in a <li class="msg_err"> element.
    for li in dom.getElementsByTagName('li'):
        if li.getAttribute('class') in ['msg_warn', 'msg_err']:
            valid = False
            line = '0:'
            col = ''
            # There may be a line number in text in an <em> element.
            for em in li.getElementsByTagName('em'):
                text = em.firstChild
                if text and text.nodeType == text.TEXT_NODE:
                    m = line_re.search(text.data)
                    if m:
                        line = m.group(1) + ':'
                        if m.group(2):
                            col = m.group(2) + ':'
                        break
            # Error message is in a <span class="msg"> element.
            for span in li.getElementsByTagName('span'):
                if span.getAttribute('class') == 'msg':
                    text = span.firstChild
                    if text and text.nodeType == text.TEXT_NODE:
                        errs.write("%s:%s%s %s\n" % (file, line, col, text.data))
                        break
    return valid

if __name__ == '__main__':
    sys.exit(0 if validate_w3(sys.argv[1]) else 1)

Set your checker-program variable to the location of the validate.py script, and off you go.
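For the curious, the upload step that MultipartPostHandler takes care of can be sketched roughly as follows. This is not Will Holcomb's code, just a minimal illustration (written for Python 3) of what a multipart/form-data body looks like. The function name encode_multipart, its boundary parameter and the file name index.html are invented for the example; the field names doctype and uploaded_file are the ones the script above sends.

```python
import uuid


def encode_multipart(fields, files, boundary=None):
    """Encode FIELDS (name -> text) and FILES (name -> (filename, bytes))
    as multipart/form-data. Return (content_type, body).
    (Hypothetical helper, for illustration only.)"""
    boundary = boundary or uuid.uuid4().hex
    delimiter = ("--" + boundary).encode("ascii")
    parts = []
    # Ordinary form fields: a header, a blank line, then the value.
    for name, value in fields.items():
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode("ascii"),
            b"",
            value.encode("utf-8"),
        ]
    # File fields also carry a filename and a content type.
    for name, (filename, data) in files.items():
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode("utf-8"),
            b"Content-Type: application/octet-stream",
            b"",
            data,
        ]
    # The closing delimiter has two extra hyphens.
    parts += [delimiter + b"--", b""]
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(parts)


content_type, body = encode_multipart(
    {"doctype": "Inline"},
    {"uploaded_file": ("index.html", b"<html></html>")},
    boundary="example-boundary")
print(content_type)
print(body.decode("ascii"))
```

A real handler also has to pick a boundary that cannot occur in the file's contents and set the Content-Length header, which is part of why it is nicer to reuse an existing one.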

This very application shows why you might care about XHTML correctness. Because the output of the W3C validator is itself valid XHTML 1.0 Strict, it’s trivial to parse and easy to automatically extract data. (Of course, even if the output were tag soup, we could no doubt get what we wanted with a bit of regular expression hacking. But it would be more work and most likely more brittle.)
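To try the parsing step without touching the network, the same DOM queries can be run against a canned fragment. The sketch below (Python 3) is only an illustration: the SAMPLE markup is a hand-written stand-in that borrows the class names the script looks for, not real validator output, and the names brief_messages and page.html are made up.

```python
import re
import xml.dom.minidom

# Hand-written stand-in for a validator response (an assumption, not
# copied from validator.w3.org).
SAMPLE = """<ol>
  <li class="msg_err">
    <em>Line 12, Column 7</em>:
    <span class="msg">end tag for "p" omitted</span>
  </li>
  <li class="msg_warn">
    <span class="msg">No character encoding declared</span>
  </li>
</ol>"""

LINE_RE = re.compile(r'line ([0-9]+)(?:, column ([0-9]+))?', re.I)


def brief_messages(xhtml, filename):
    """Return a list of 'FILE:LINE:COLUMN: MESSAGE' strings extracted
    from XHTML, using the same queries as validate_w3."""
    dom = xml.dom.minidom.parseString(xhtml)
    messages = []
    for li in dom.getElementsByTagName('li'):
        if li.getAttribute('class') not in ('msg_warn', 'msg_err'):
            continue
        # Fall back to line 0 when no position is given.
        line, col = '0:', ''
        for em in li.getElementsByTagName('em'):
            text = em.firstChild
            if text is not None and text.nodeType == text.TEXT_NODE:
                m = LINE_RE.search(text.data)
                if m:
                    line = m.group(1) + ':'
                    if m.group(2):
                        col = m.group(2) + ':'
                    break
        for span in li.getElementsByTagName('span'):
            if span.getAttribute('class') == 'msg':
                text = span.firstChild
                if text is not None and text.nodeType == text.TEXT_NODE:
                    messages.append("%s:%s%s %s"
                                    % (filename, line, col, text.data))
                    break
    return messages


for message in brief_messages(SAMPLE, "page.html"):
    print(message)
```

The first message carries a line and column that next-error can jump to; the second has no position, so it falls back to line 0.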

If you do use this program, please bear in mind some caveats:

  1. The validate.py script uploads your file to a web server over plain, unencrypted HTTP. I’m sure that the W3C is an ethical organization, and their privacy statement looks reasonable, but it’s only common sense not to upload anything that’s private or confidential.

  2. It’s not polite to spam the W3C validator with large numbers of validation requests. They don’t say anything about this in their FAQ, but they do say that the validator “costs a lot to develop, support, host and maintain”.

  3. If you use the validator a lot, you might want to donate to the W3C validator program.