Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Using ETag and If-Modified-Since

Jerry Stratton, August 14, 2009

In Django Twitter tag and RSS object I wrote “If to-the-second timeliness is important, you’ll want to pay attention to the ETags that Twitter sends as well as throttling based on the cache’s timestamp.”

I ended up needing that for another project. The main difference is that you have to manage HTTP headers, and to manage HTTP headers you have to use urllib2 instead of urllib.

This change will require the addition of two methods, and modifying the reCache method. Also, change the import at the top of the file from urllib to urllib2.

Here’s the new reCache:

    def reCache(self):
        feedStream = self.openFeed()
        feed = None
        if feedStream:
            try:
                feed = xml.dom.minidom.parse(feedStream)
            except ExpatError, message:
                print "ExpatError opening URL:", message, self.feedURL
                feed = None
            except IOError, message:
                print "IOError opening URL:", message, self.feedURL
                feed = None
        if self.cache:
            if not feed:
                feed = self.readCache(forceRead=True)
            if feed:
                xmlString = feed.toprettyxml(encoding="utf-8")
                #if last created by a different user, remove it first
                self.ensureWritability(self.cache)
                cacheFile = open(self.cache, 'w')
                cacheFile.write(xmlString)
                cacheFile.close()
        return feed

This uses two new functions. One is easy: ensureWritability is the same code as before to make sure that the process that’s running this code can write to the cache file. I’ve moved it off into a separate method, because now we’re going to have to cache an ETag also.

    def ensureWritability(self, filepath):
        if os.path.exists(filepath) and not os.access(filepath, os.W_OK):
            os.remove(filepath)

The bulk of the new work is done in the openFeed method. The urllib2 module requires a lot more fiddling than urllib does, and managing ETags also requires some fiddling. You need to keep track of the feed’s ETag if one is provided, and then on later requests ask for the feed only “if the new feed doesn’t match this old feed”.

    def openFeed(self):
        feedRequest = urllib2.Request(self.feedURL)
        tagFile = None
        if self.cache:
            #tell the server to only give us the new file if the file has changed since we last got it
            if os.path.exists(self.cache):
                cacheTime = datetime.datetime.utcfromtimestamp(os.path.getmtime(self.cache))
                #HTTP dates are always expressed in GMT
                cacheStamp = cacheTime.strftime('%a, %d %b %Y %H:%M:%S GMT')
                feedRequest.add_header('If-Modified-Since', cacheStamp)
            #tell the server to only give us the file if the file has changed since the last etag we stored
            tagFile = self.cache + '.etag'
            if os.path.exists(tagFile):
                tag = open(tagFile).read().strip()
                if tag:
                    feedRequest.add_header('If-None-Match', tag)
        feedOpener = urllib2.build_opener()
        try:
            feedStream = feedOpener.open(feedRequest)
            error = None
        except urllib2.HTTPError, errorInfo:
            error = errorInfo.code
            errorText = 'HTTP Error'
        except urllib2.URLError, errorInfo:
            #a URLError carries a reason but no HTTP status code
            error, errorText = 'URL Error', errorInfo.reason
        if not error:
            tag = feedStream.headers.get('ETag')
            #only store the etag if we have a cache to store it next to
            if tag and tagFile:
                self.ensureWritability(tagFile)
                open(tagFile, 'w').write(tag)
            return feedStream
        elif error != 304:
            print "Problem with feed", self.feedURL, ":", error, errorText
        return None

Rather than just opening a URL and using it as we did with urllib, with urllib2 we need to create a “Request” object. We can then modify this request by adding headers to it before opening it.
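As a minimal sketch of that step on its own, the snippet below builds a request and attaches the two conditional headers without contacting any server. The URL and header values are placeholders rather than anything from the original project, and the try/except import is an assumption added so the same lines also run on Python 3, where urllib2's Request class lives in urllib.request.

```python
try:
    import urllib2  # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same Request class

# Placeholder URL and header values, for illustration only
request = urllib2.Request('http://example.com/feed.rss')
request.add_header('If-None-Match', '"abc123"')
request.add_header('If-Modified-Since', 'Fri, 14 Aug 2009 12:00:00 GMT')

# Request stores header names via str.capitalize(), so look them up that way
print(request.get_header('If-none-match'))
print(request.get_header('If-modified-since'))
```

Nothing is sent until the request is handed to an opener, so headers can be added in any order up to that point.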

  1. If there’s a cache file, send an “If-Modified-Since” header using that file’s timestamp. This asks the server not to send the page or feed if it has not been modified since that date and time.
  2. If there’s an etag file (here, the cache file’s path with “.etag” added to the end), send an “If-None-Match” header containing the value stored in that file. This asks the server not to send the page if its current ETag matches the one we’re sending.
  3. If there’s no error, we have a stream that we can return for use in functions, such as xml.dom.minidom.parse, that take a file-like object.
  4. If there is an error and it is a 304 (“Not Modified”), the file has not changed. We can safely ignore that error; it’s what we expect when nothing is new. For any error, 304 or otherwise, the method returns None, because the server hasn’t given us data we can use.
  5. If there’s no error and the server’s response includes an ETag header, the method writes the ETag to the .etag file for later use.
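Steps 1, 2 and 5 can be tried out without a network connection. The sketch below is not the class method above: conditional_headers and save_etag are hypothetical helper names, the file names are made up, and it uses email.utils.formatdate from the standard library to produce the GMT date string instead of strftime.

```python
import os
import tempfile
from email.utils import formatdate

def conditional_headers(cache_path):
    """Build the conditional request headers for a cached feed file (hypothetical helper)."""
    headers = {}
    if os.path.exists(cache_path):
        # usegmt=True gives the "GMT" suffix that HTTP dates use
        headers['If-Modified-Since'] = formatdate(os.path.getmtime(cache_path), usegmt=True)
    tag_path = cache_path + '.etag'
    if os.path.exists(tag_path):
        with open(tag_path) as tag_file:
            tag = tag_file.read().strip()
        if tag:
            headers['If-None-Match'] = tag
    return headers

def save_etag(cache_path, tag):
    """Store the ETag next to the cache file (hypothetical helper)."""
    with open(cache_path + '.etag', 'w') as tag_file:
        tag_file.write(tag)

# Demo with a throwaway cache file; the ETag value is a placeholder
cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, 'feed.xml')
with open(cache_path, 'w') as cache_file:
    cache_file.write('<rss/>')
os.utime(cache_path, (1250251200, 1250251200))  # 2009-08-14 12:00:00 UTC
save_etag(cache_path, '"abc123"')

headers = conditional_headers(cache_path)
print(headers)
```

Each entry in the returned dictionary would then be passed to the request's add_header method before the request is opened.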

This lets us be a good netizen by only requesting the full page if the full page has new data.
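The 304 handling from step 4 can also be isolated and exercised without a server. This is only a sketch under stated assumptions: open_or_none and fake_not_modified are hypothetical names, the 304 is raised by hand rather than by a real response, and unlike openFeed it re-raises other HTTP errors instead of printing them.

```python
try:
    from urllib2 import HTTPError  # Python 2
except ImportError:
    from urllib.error import HTTPError  # Python 3

def open_or_none(open_feed):
    """Call open_feed(); return its stream, or None if the server answered 304."""
    try:
        return open_feed()
    except HTTPError as error:
        if error.code == 304:
            return None  # not modified: the caller keeps using its cache
        raise

def fake_not_modified():
    # Stand-in for feedOpener.open(feedRequest) when nothing has changed
    raise HTTPError('http://example.com/feed.rss', 304, 'Not Modified', {}, None)

result = open_or_none(fake_not_modified)
print(result)
```

A None result is the cue to fall back on the cached copy, which is what reCache does when openFeed gives it nothing to parse.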

In response to Django Twitter tag and RSS object: I wanted to embed my twitter feed into my Django blog, and didn’t see any simple RSS readers for Python that did what I wanted.