Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Caching DTDs using lxml and etree

Jerry Stratton, October 16, 2009

I have a script that goes through every page on our site and validates it, every morning. This gives me a very fast heads-up when some data in a database contains evil characters, or when a change I made in an include file for use in one page ends up causing trouble on another page.

A couple of days ago that script started failing for every page:

    Checking 330 pages
    Unable to parse About web spam
        http://www.sandiego.edu/webdev/coding/email/spam.php
        failed to load HTTP resource, line 1, column 118
        dtd">
    Unable to parse Account Information
        http://www.sandiego.edu/unet/
        failed to load HTTP resource, line 1, column 118
        dtd">
    Unable to parse Add System Notices
        http://www.sandiego.edu/webdev/notices/add/
        failed to load HTTP resource, line 1, column 118
        dtd">

And so on, for every page. Line 1, column 118 was inside of the URL portion of the DOCTYPE:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

The DTD download appeared to be failing, but the real error was hidden behind “failed to load HTTP resource”. My immediate thought was that our “security” blocker had gone haywire: we use a filter that blocks all off-campus web access unless the user logs in from a browser first. As you can imagine, this throws a big wrench into automated scripts. This script had been caught by it before, and it returned exactly this error whenever it tried to load http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd. But I was able to replicate the error from the command line even after using a browser successfully, and I was able to fetch the DTD with both a GUI browser and command-line curl. So it didn’t seem like a network issue, at least on our end.
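One way to see past a generic “failed to load HTTP resource” message is to request the DTD directly and report whatever the HTTP layer says. This is a minimal Python 3 sketch using only the standard library; it is not part of the validation script, and probe is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Fetch url and return (ok, detail) so the underlying error is visible."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read()
        return True, "%d bytes" % len(body)
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (HTTPError must be
        # caught before URLError, because it is a subclass of it).
        return False, "HTTP %s %s" % (e.code, e.reason)
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, missing file...
        return False, "URL error: %s" % e.reason


# Example call (needs network access, so it is left commented out):
# print(probe("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"))
```

Running something like this from the same machine and account as the failing script would at least separate “the server refused us” from “we never reached the server”.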

I remembered that when I created the validation script, I deliberately cached the etree.XMLParser instance to let the parser act intelligently over multiple validation requests:

    from lxml import etree

    class Validator(object):
        parser = None

        def getParser(self):
            #create the parser on first use, then reuse it for every page
            if not self.parser:
                self.parser = etree.XMLParser(dtd_validation=True, no_network=False)
            return self.parser

        def validateXHTML(self, html, page):
            error = None
            try:
                parser = self.getParser()
                etree.fromstring(html, parser=parser)
            except etree.XMLSyntaxError, e:
                error = e.args[0]
            #self.error is defined elsewhere in the script (not shown here)
            return self.error(error, page, html)

But what if it wasn’t caching? What if it was trying to grab the DTD (and all subsidiary DTDs) 330 times in the few seconds it takes the script to run?

A quick Google search on “lxml caching dtds” and “etree caching dtds” brought up some postings indicating that, at least as of a year and a half ago, lxml did not cache. A few more searches brought up a page indicating that, at least as of 2008, w3.org does block scripts that pound their servers.

So if lxml doesn’t do caching, how do I intercept the requests for the DTDs and return a cached version? As usual, lots of postings saying how easy it is, and none giving an actual working example. The closest I got was a link to etree: Document loading and URL resolving. That page has an “example” of adding a Resolver to the etree.XMLParser object. After a lot of work, I came up with this:

    import os
    import urllib
    import urlparse

    from lxml import etree

    class CustomResolver(etree.Resolver):
        cache = '/Users/jerry/Documents/CMS/etc/caches/dtds'

        def resolve(self, URL, id, context):
            #determine cache path
            url = urlparse.urlparse(URL)
            filefolder, filename = os.path.split(url.path)
            filefolder = url.hostname + filefolder
            dtdFolder = os.path.join(self.cache, filefolder)
            dtdPath = os.path.join(dtdFolder, filename)
            #cache if necessary
            if not os.path.exists(dtdPath):
                print "CREATING CACHE FOR", URL
                if not os.path.exists(dtdFolder):
                    os.makedirs(dtdFolder)
                filename, headers = urllib.urlretrieve(URL, dtdPath)
            #resolve the cached file
            return self.resolve_file(open(dtdPath), context, base_url=URL)

I then updated my getParser method to add an instance of this CustomResolver to the XMLParser instance:

    def getParser(self):
        if not self.parser:
            self.parser = etree.XMLParser(dtd_validation=True, no_network=False)
            self.parser.resolvers.add(CustomResolver())
        return self.parser

I’ve decided to cache each DTD based on its URL: a folder for the hostname, and then a folder for each component of the DTD’s path. The urlretrieve function in urllib saves the URL to the cache folder, and then I use the resolve_file() method on the Resolver to return whatever it is that XMLParser is expecting. The documentation seems to recommend using resolve_filename(), but resolve_filename() doesn’t accept base_url as a parameter. Without base_url, all subsequent requests for files referenced relative to the cached DTD will assume that the DTD’s URL is in the cache folder. That won’t work with external DTDs such as the ones at w3.org that are referenced at the top of XHTML web pages. The resolve_file() and resolve_string() methods both allow specifying the base_url.
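The URL-to-folder mapping and the base_url problem can both be seen without lxml. The sketch below is Python 3, standard library only; cache_path_for and the /tmp/dtds cache location are hypothetical names for illustration, and urljoin merely stands in for the parser’s own handling of relative references.

```python
import os
from urllib.parse import urljoin, urlparse


def cache_path_for(url, cache_root):
    """Map a DTD URL to <cache_root>/<hostname>/<folders in the URL path>/<file>."""
    parts = urlparse(url)
    folder, filename = os.path.split(parts.path)
    # folder begins with "/", so append it to the hostname by hand;
    # os.path.join would throw the hostname away before an absolute path.
    return os.path.join(cache_root, parts.hostname + folder, filename)


DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

# The DTD refers to its entity files by relative name, such as "xhtml-lat1.ent".
# With the original URL as the base, that name still points at w3.org:
with_base_url = urljoin(DTD_URL, "xhtml-lat1.ent")

# With the cached copy as the base, the same name points into the cache folder:
cached_copy = "file:///tmp/dtds/www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
without_base_url = urljoin(cached_copy, "xhtml-lat1.ent")

print(cache_path_for(DTD_URL, "/tmp/dtds"))
print(with_base_url)
print(without_base_url)
```

Because the resolver builds its cache path from the URL it is handed, keeping base_url pointed at the original location is what lets the entity files land in the cache under the same hostname folder as the DTD.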

One thing this script doesn’t do is recheck the original DTD to see if it’s been updated. The XHTML 1.0 transitional and strict DTDs were both last updated in 2002; the time spent coding recaching doesn’t seem worth it.
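If rechecking ever did seem worth it, one cheap approach would be to treat a cached file as stale once it passes a certain age and download it again. This is only a Python 3 sketch of that idea; needs_refresh and the 30-day default are assumptions for illustration, not something the script does.

```python
import os
import time


def needs_refresh(path, max_age_days=30, now=None):
    """Return True if the cached file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    # Age is measured from the file's last modification time.
    age_in_seconds = now - os.path.getmtime(path)
    return age_in_seconds > max_age_days * 86400
```

The resolver’s “cache if necessary” test could call something like this instead of os.path.exists(), at the cost of hitting w3.org again every time a file expires.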

Yes, I am sometimes annoyed when working with Python. Why do you ask?

I don’t know that this code does what I expect it to do. I don’t even know that I’ve been blocked; if I have been, it isn’t an IP block, since I can download the DTDs from my web browser and from curl. But downloading the necessary files using my browser and then putting the downloaded DTDs into the cache folders fixed the problem.[1]
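When filling the cache by hand, it helps to know which files have to be there. The XHTML 1.0 Transitional DTD pulls in three entity files from the same folder at w3.org, so four files are needed in all. The Python 3 sketch below lists whichever of them are absent; missing_from_cache is a hypothetical helper, and it assumes the same hostname-then-path layout the resolver uses.

```python
import os
from urllib.parse import urlparse

BASE = "http://www.w3.org/TR/xhtml1/DTD/"
NEEDED = [BASE + name for name in (
    "xhtml1-transitional.dtd",
    "xhtml-lat1.ent",
    "xhtml-symbol.ent",
    "xhtml-special.ent",
)]


def missing_from_cache(urls, cache_root):
    """Return (url, expected_path) pairs for files not yet in the cache."""
    missing = []
    for url in urls:
        parts = urlparse(url)
        # Same layout as the resolver: <cache_root>/<hostname>/<url path>
        path = os.path.join(cache_root, parts.hostname + parts.path)
        if not os.path.exists(path):
            missing.append((url, path))
    return missing
```

Calling missing_from_cache(NEEDED, cache_root) with the cache folder as cache_root gives the URLs to download in a browser and the paths to save them to.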

This is an example of one of the things that make me really question using Python/Django. Python’s library documentation sucks. This issue has been known for over a year; there’s a bug on bugs.python.org; there’s a discussion on the lxml mailing list saying “it isn’t our problem[2] and it’s easy to fix”. Is there a simple solution described anywhere? If there is, I can’t find it; I can’t even find an acknowledgement that the issue still affects lxml. A short note in the XMLParser documentation saying “if you turn on dtd_validation, lxml will redownload the DTDs every time you validate a page using that instance; you should implement a caching function like this one…” (or “lxml will cache dtds during the same session” if that’s true) would help a lot. I had no idea there was going to be an issue until the script failed.

The example of overriding the resolver on the etree page seems to be an example in name only; it doesn’t work. It appears to show nothing more than how to use the string formatting operator in Python.

    return self.resolve_string('<!ENTITY myentity "[resolved text: %s]">' % url, context)

This is a simple class. Twelve lines of code. Those twelve lines took almost two days to track down to the point where I’m reasonably certain that it’s doing what I needed it to do. I’m still not sure; just reasonably certain.

  1. Note that I use a version of this script both at home and in the office. I tested the caching version of the script at home. Since the office was (possibly) blocked, I had to manually cache the files there.

  2. And saying it’s okay as is because DTD validation is turned off by default is misleading; lxml won’t validate entities unless DTD validation is on. Anyone who is validating XHTML web pages using lxml, if they have any entities at all in their pages, needs to turn on DTD validation.
