Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Django Twitter tag and RSS object

Jerry Stratton, June 13, 2009

Python’s minidom makes it easy to parse RSS feeds, since RSS feeds are themselves just very simple XML. I wanted to parse my Twitter RSS feed into a context usable by Django templates.
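As a minimal sketch of what minidom does with a feed, here is a tiny made-up RSS document parsed from a string. It is written for Python 3, and both the sample feed and the first_text helper are assumptions for illustration only; they are not part of the module below.

```python
import xml.dom.minidom

# A made-up feed, just large enough to show the channel/item structure.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Hello</description>
      <pubDate>Mon, 16 Mar 2009 13:02:19 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>Hello again</description>
      <pubDate>Tue, 17 Mar 2009 08:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def first_text(node, tag):
    """Return the stripped text of the first <tag> under node, or None."""
    elements = node.getElementsByTagName(tag)
    if not elements or elements[0].firstChild is None:
        return None
    return elements[0].firstChild.data.strip()


document = xml.dom.minidom.parseString(SAMPLE_RSS)
channel = document.getElementsByTagName("channel")[0]
# The channel's own title comes first in document order, before any item title.
channel_title = first_text(channel, "title")
item_titles = [first_text(item, "title")
               for item in channel.getElementsByTagName("item")]
```

Everything in the classes that follow is a variation on those two calls: getElementsByTagName to find a node, and firstChild.data to read its text.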

I broke the feed down into the Feed, a Channel, and individual Items. Channels and Items are both XML nodes, so I made them inherit from a Node class that understands what is available in RSS.

  • #!/usr/bin/python
  • #provide an RSS feed object for use in Django or Mako templates
  • import datetime, os.path, time, urllib, xml.dom.minidom
  • from xml.parsers.expat import ExpatError
  • class Node(object):
    • def __init__(self, node):
      • self.node = node
      • self.title = self.getValue('title')
      • self.link = self.getValue('link')
      • self.description = self.getValue('description')
    • def __str__(self):
      • return self.title
    • def getValue(self, tag):
      • #return the stripped text of the first matching tag, or None if it is missing or empty
      • elements = self.node.getElementsByTagName(tag)
      • data = None
      • if elements and elements[0].firstChild:
        • data = elements[0].firstChild.data.strip()
      • return data
  • class Channel(Node):
    • def items(self, displayCount):
      • items = self.node.getElementsByTagName("item")
      • if displayCount:
        • items = items[:displayCount]
      • feedItems = []
      • for item in items:
        • feedItem = Item(item)
        • feedItems.append(feedItem)
      • return feedItems
  • class Item(Node):
    • def __init__(self, item):
      • super(Item, self).__init__(item)
      • self.pubDate = self.getValue('pubDate')
    • #provide a datetime for use by Django's date filters
    • def stamp(self):
      • #Mon, 16 Mar 2009 13:02:19 +0000
      • return datetime.datetime.strptime(self.pubDate, '%a, %d %b %Y %H:%M:%S +0000')
  • class Feed(object):
    • def __init__(self, feedURL, cache=None):
      • self.feedURL = feedURL
      • if cache:
        • self.cache = '/tmp/' + cache + '.rss'
      • else:
        • self.cache = None
    • #is the cache fresh enough to use?
    • def freshCache(self):
      • if self.cache and os.path.exists(self.cache):
        • #use cache if it is less than sixty minutes old
        • freshTime = time.time() - 60*60
        • if os.path.getmtime(self.cache) > freshTime:
          • return True
      • return False
    • def readCache(self, forceRead=False):
      • feed = None
      • if forceRead or self.freshCache():
        • try:
          • #pass the path so that minidom opens and closes the file itself
          • feed = xml.dom.minidom.parse(self.cache)
        • except (IOError, ExpatError):
          • feed = None
      • return feed
    • def reCache(self):
      • try:
        • feed = xml.dom.minidom.parse(urllib.urlopen(self.feedURL))
      • except ExpatError, message:
        • print "ExpatError opening URL:", message
        • feed = None
      • except IOError, message:
        • print "IOError opening URL:", message
        • feed = None
      • if self.cache:
        • if not feed:
          • feed = self.readCache(forceRead=True)
        • if feed:
          • xmlString = feed.toprettyxml(encoding="utf-8")
          • #if last created by a different user, remove it first
          • if os.path.exists(self.cache) and not os.access(self.cache, os.W_OK):
            • os.remove(self.cache)
          • cacheFile = open(self.cache, 'w')
          • cacheFile.write(xmlString)
          • cacheFile.close()
      • return feed
    • def context(self, displayCount=None):
      • context = {}
      • feed = self.readCache()
      • if not feed:
        • feed = self.reCache()
      • if feed:
        • channel = Channel(feed.getElementsByTagName("channel")[0])
        • feedItems = channel.items(displayCount)
        • context['items'] = feedItems
        • context['title'] = channel.title
        • context['feedURL'] = self.feedURL
      • return context

The Feed class caches, if possible, the output of the RSS feed, and tries not to make a request more often than once an hour.

I saved this file in an app I have called “resources”. Then I added a “tweet” tag to my templatetags:

  • import resources.rss
  • from django import template
  • register = template.Library()
  • def tweet(count=1):
    • feed = resources.rss.Feed('http://twitter.com/statuses/user_timeline/20020901.rss', "twitter")
    • context = feed.context(displayCount=count)
    • context['webURL'] = 'http://twitter.com/hoboes'
    • return template.loader.render_to_string("parts/tweet.html", context)
  • register.simple_tag(tweet)

This uses a dedicated Django template snippet to render the tweets:

  • <ul class="twitter">
    • {% for tweet in items %}
      • <li><a href="{{ tweet.link }}">{{ tweet.title|stripLeadingText:"hoboes:" }}</a></li>
    • {% endfor %}
  • </ul>

There’s a filter in there called “stripLeadingText” that I use to remove my Twitter name from the title:

  • def stripLeadingText(text, toStrip):
    • text = text.strip()
    • if text.startswith(toStrip):
      • text = text[len(toStrip):]
    • text = text.strip()
    • return text
  • register.filter('stripLeadingText', stripLeadingText)

I can then use a “tweet” template tag to display one or more tweets:

  • {% tweet %}
  • {% tweet 2 %}
  • {% tweet 10 %}

You can also, of course, provide the tweets directly to any template via your views, or turn this into a tweet/endtweet loop for custom tweet HTML on every page.

A couple of caveats:

  • You don’t want to use /tmp for your cache files. I just used it so that the example will most likely work.
  • Remember to compare dates as UTC time, using, for example, datetime.datetime.utcnow().
  • I’ve also seen pubDate formats which use GMT instead of +0000. You may need to account for that if you use multiple feeds and each uses a different format.
  • If to-the-second timeliness is important, you’ll want to pay attention to the ETags that Twitter sends as well as throttling based on the cache’s timestamp. I make either one or two requests per day, and I’m not sure ETags matter on Twitter anyway.
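
The two date caveats can be sketched with the standard library’s email.utils, which understands both the +0000 and the GMT forms of an RSS pubDate. This is only a sketch for Python 3; parse_pub_date is a hypothetical helper and is not part of the module above, which uses strptime with a fixed format.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(pub_date):
    """Parse an RSS pubDate string into a timezone-aware UTC datetime."""
    stamp = parsedate_to_datetime(pub_date)
    if stamp.tzinfo is None:
        # A "-0000" offset comes back naive; assume it means UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp.astimezone(timezone.utc)


def age_in_seconds(pub_date, now=None):
    """Compare against the current time in UTC, never local time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parse_pub_date(pub_date)).total_seconds()
```

Because the result carries its timezone, comparing it with a UTC “now” works regardless of which offset the feed happened to use.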

August 14, 2009: Using ETag and If-Modified-Since

In Django Twitter tag and RSS object I wrote “If to-the-second timeliness is important, you’ll want to pay attention to the ETags that Twitter sends as well as throttling based on the cache’s timestamp.”

I ended up needing that for another project. The main difference is that you have to manage HTTP headers, and to manage HTTP headers you have to use urllib2 instead of urllib.

This change requires adding two methods and modifying the reCache method. You also need to change the import at the top of the file from urllib to urllib2.

Here’s the new reCache:

  • def reCache(self):
    • feedStream = self.openFeed()
    • feed = None
    • if feedStream:
      • try:
        • feed = xml.dom.minidom.parse(feedStream)
      • except ExpatError, message:
        • print "ExpatError opening URL:", message, self.feedURL
        • feed = None
      • except IOError, message:
        • print "IOError opening URL:", message, self.feedURL
        • feed = None
    • if self.cache:
      • if not feed:
        • feed = self.readCache(forceRead=True)
      • if feed:
        • xmlString = feed.toprettyxml(encoding="utf-8")
        • #if last created by a different user, remove it first
        • self.ensureWritability(self.cache)
        • cacheFile = open(self.cache, 'w')
        • cacheFile.write(xmlString)
        • cacheFile.close()
    • return feed

This uses two new functions. One is easy: ensureWritability is the same code as before to make sure that the process that’s running this code can write to the cache file. I’ve moved it off into a separate method, because now we’re going to have to cache an ETag also.

  • def ensureWritability(self, filepath):
    • if os.path.exists(filepath) and not os.access(filepath, os.W_OK):
      • os.remove(filepath)
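
Since the ETag has to be cached as well, one possible way to store it is in a small file next to the cached feed. This is a sketch for Python 3 under that assumption; the “.etag” suffix and the etag_path, save_etag and load_etag names are hypothetical, not something the Feed class defines.

```python
import os


def etag_path(cache_path):
    # Hypothetical naming scheme: keep the ETag beside the cached feed.
    return cache_path + ".etag"


def save_etag(cache_path, etag):
    """Write the ETag to its own file, replacing any earlier value."""
    path = etag_path(cache_path)
    # Same idea as ensureWritability: remove a file we cannot overwrite.
    if os.path.exists(path) and not os.access(path, os.W_OK):
        os.remove(path)
    with open(path, "w", encoding="utf-8") as etag_file:
        etag_file.write(etag)


def load_etag(cache_path):
    """Return the stored ETag, or None if there is none to read."""
    try:
        with open(etag_path(cache_path), encoding="utf-8") as etag_file:
            return etag_file.read().strip() or None
    except OSError:
        return None
```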

The bulk of the new work is done in the openFeed method. urllib2 requires a lot more fiddling than urllib does, and managing ETags requires some fiddling of its own. You need to keep track of the feed’s ETag if one is provided and then, on later requests, ask for the feed only “if the new feed doesn’t match this old feed”.
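
The “ask only if it has changed” idea can be sketched as a conditional request. This is not the openFeed method itself; it is a standalone illustration written for Python 3’s urllib.request (the successor to urllib2), and the open_feed and build_conditional_request names and the opener parameter are assumptions made so that the logic can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def build_conditional_request(feed_url, etag=None, last_modified=None):
    """Build a request that asks for the feed only if it has changed."""
    request = urllib.request.Request(feed_url)
    if etag:
        # "Send the feed only if its ETag no longer matches this one."
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def open_feed(feed_url, etag=None, last_modified=None,
              opener=urllib.request.urlopen):
    """Return (stream, etag, last_modified).

    stream is None when there is nothing new to parse, either because
    the server answered 304 Not Modified or because the request failed.
    """
    request = build_conditional_request(feed_url, etag, last_modified)
    try:
        response = opener(request)
    except urllib.error.HTTPError as error:
        # 304 means the cached copy is still current; anything else is a real error.
        if error.code != 304:
            print("HTTPError opening URL:", error.code, feed_url)
        return None, etag, last_modified
    except urllib.error.URLError as error:
        print("URLError opening URL:", error.reason, feed_url)
        return None, etag, last_modified
    headers = response.headers
    return response, headers.get("ETag"), headers.get("Last-Modified")
```

With something like this, reCache would parse the returned stream when there is one and fall back to the cached file when it is None, which is the same shape as the feedStream check above.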
