Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Automatically link related URLs in Django

Jerry Stratton, December 1, 2010

When I wrote the first version of my blog software in PHP, one of the things I wanted to do was avoid typing links directly into the text. I used a fairly simple str_replace, taking advantage of str_replace’s ability to take arrays as its search and replace parameters. The first array contained the “title” of the link, and the second array contained the full “a” tag.

  • $body = str_replace($linkTitles, $linkTags, $body, 1);

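I also wanted it to link only the first instance of each item in the array, because I don’t like reading things filled with duplicate links. (Strictly speaking, str_replace’s fourth parameter is a by-reference count of the replacements made rather than a limit; preg_replace, which I switched to later, is the one whose fourth parameter is a limit.)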

It ended up getting a little more complicated than that as I fixed issues. The first one was that I had to sort the links by the length of the title, with the shortest titles first—otherwise, a short link might end up creating a link inside of an existing “a” tag created from an earlier, longer title.

  • function strLengthAscending($a, $b) {
    • $al = strlen($a);
    • $bl = strlen($b);
    • if ($al == $bl) {
      • return 0;
    • } else {
      • return ($al > $bl) ? 1 : -1;
    • }
  • }
  • uksort($links, 'strLengthAscending');
  • $linkTitles = array_keys($links);
  • $linkTags = array_values($links);
  • $body = str_replace($linkTitles, $linkTags, $body, 1);

Then I realized that this was case-sensitive, and was not linking text that didn’t exactly match the case of the title, so I switched to preg_replace:

  • $linkTitles = preg_replace('/^.*$/', '`\Q$0\E`i', $linkTitles);
  • $body = preg_replace($linkTitles, $linkTags, $body, 1);

But that, once again, started causing “a” tags within “a” tags, as all lower-case titles with no spaces could easily match hostnames.

  • //titles without spaces can't be case-insensitive, because that might hit URLs
  • $linkTitles = preg_replace('/^.* .*$/', '`\Q$0\E`i', $linkTitles);
  • $linkTitles = preg_replace('/^[^ ]*$/', '`\Q$0\E`', $linkTitles);
  • $body = preg_replace($linkTitles, $linkTags, $body, 1);

And so forth. It still ran the risk of putting a link inside of another link, but that was fairly rare, and I just adjusted the link text on the rare occasions when it happened.
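The ordering problem is easier to see in a few lines. Here is a minimal Python sketch of the same first-instance replacement, not the original PHP; the function name, titles, and URLs are made up for illustration.

```python
def link_titles(body, links, longest_first=False):
    """Replace the first occurrence of each title with its link tag."""
    for title in sorted(links, key=len, reverse=longest_first):
        # A count of 1 means only the first instance gets linked.
        body = body.replace(title, links[title], 1)
    return body

# Hypothetical titles and tags, for illustration only.
links = {
    "Django": '<a href="/d/">Django</a>',
    "Django templates": '<a href="/t/">Django templates</a>',
}
body = "I like Django templates."

shortest_first = link_titles(body, links)
longest_first = link_titles(body, links, longest_first=True)

print(shortest_first)
print(longest_first)
```

With the shortest title first, the longer title no longer matches once the shorter one has been wrapped in a tag, so nothing gets nested, at the cost of the longer title not being linked at all. With the longest title first, the shorter title matches inside the link text that was just created.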

Django

When I moved to Django, I initially just used a template tag to handle links. It was precise and linked specific text to specific URLs from another model. But I missed the ability to have text automatically get linked to related links, related pages, and related topics. I ended up writing an Autolinker class; it accepts a list of objects and titles that match those objects. Each object must have a “linkHTML” method that accepts a “title” parameter and creates the link tag with that link text.

  • import re
  • class Autolinker(object):
    • ignoreCaseLength = 12
    • def __init__(self):
      • self.links = {}
    • #any item that gets added must have a linkHTML method that accepts a title parameter
    • def addItem(self, title, item):
      • self.links[title] = item
    • def addLinks(self, links):
      • for link in links:
        • self.links[link.title()] = link
    • def replaceAll(self, haystack):
      • for title, link in sorted(self.links.items(), key=lambda x:-len(x[0])):
        • haystack = self.replace(title, link, haystack)
        • #we're going paragraph-by-paragraph, but don't want multiple links
        • if self.replaced:
          • del self.links[title]
      • return haystack
    • def regexOptions(self, text):
      • options = re.DOTALL
      • if len(text) > self.ignoreCaseLength:
        • options = options | re.IGNORECASE
      • return options
    • def replace(self, needle, replacement, haystack):
      • self.replacement = replacement
      • options = self.regexOptions(needle)
      • needle = re.compile('([^{}]*?)(' + re.escape(needle) + ')([^{}]*)', options)
      • self.needle = needle
      • self.replaced = False
      • return self.doReplace(haystack)
    • def doReplace(self, haystack):
      • return re.sub(self.needle, self.matcher, haystack)
    • def matcher(self, match):
      • fullText = match.group(0)
      • if not self.replaced:
        • #if it's inside of a django tag, don't make the change
        • if fullText[0] == '%' or fullText[-1] == '%':
          • return fullText
        • #if it's inside of a link already, don't make the change
        • leftText = match.group(1)
        • matchText = match.group(2)
        • rightText = match.group(3)
        • rightmostAnchor = leftText.rfind('<a')
        • if rightmostAnchor != -1:
          • anchorClose = leftText.rfind('</a>')
          • if anchorClose < rightmostAnchor:
            • #this is inside of an open a tag.
            • #but there might be a match in the rightText
            • fullText = leftText+matchText + self.doReplace(rightText)
            • return fullText
        • #check the right side for anchors, too.
        • leftmostAnchorClose = rightText.find('</a>')
        • if leftmostAnchorClose != -1:
          • anchorOpen = rightText.find('<a')
          • if anchorOpen == -1 or anchorOpen > leftmostAnchorClose:
            • #this is inside of an open a tag
            • return fullText
        • #otherwise, it is safe to make the change
        • fullText = leftText + self.replacement.linkHTML(title=matchText) + rightText
        • self.replaced = True
      • return fullText

It gets used by calling addItem to add individual objects with an arbitrary title, or addLinks to add a list of objects, each of which must have a “title” method, and then calling replaceAll(). Something like this will link the first mention of each related topic, each related link, and the parent object:

  • def blurb(self):
    • linker = Autolinker()
    • description = self.description()
    • #add topic links
    • linker.addLinks(self.topics())
    • #add related links
    • linker.addLinks(self.urls())
    • #add parent link
    • linker.addItem(self.parent.title, self.parent)
    • #autolink text
    • description = linker.replaceAll(description)
    • return description

Each item needs to know how to link itself, and put this knowledge into a linkHTML method; it might look like this:

  • #assumes: from django.template.loader import render_to_string
  • def linkHTML(self, title=None):
    • return render_to_string('parts/link.html', {'link': self, 'title': title})

The template snippet (parts/link.html) would look like this:

  • <a href="{{ link.get_absolute_url }}">{% firstof title link.title %}</a>
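For trying the Autolinker outside of a Django project, any object with the same two methods will do. This is a rough stand-in, assuming nothing beyond the contract described above; the class name StaticLink and the URL are hypothetical.

```python
from html import escape

class StaticLink:
    """Hypothetical stand-in for a model that knows how to link itself."""

    def __init__(self, title, url):
        self._title = title
        self.url = url

    def title(self):
        # addLinks() calls title() to get the text to search for.
        return self._title

    def linkHTML(self, title=None):
        # Fall back to the object's own title when none is passed,
        # much as firstof does in the template version.
        text = title or self._title
        return '<a href="%s">%s</a>' % (escape(self.url, quote=True), escape(text))

# Hypothetical URL, for illustration only.
link = StaticLink('preg_replace', '/links/preg-replace/')
print(link.linkHTML(title='PREG_REPLACE'))
print(link.linkHTML())
```

The matched text is passed in as the title so that the link keeps whatever capitalization appeared in the original text.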

What does it do?

It mainly uses Python regular expressions; the bulk of the search is handled by re.sub, but in order to ensure that “a” tags don’t get nested, it sends re.sub a method to call when there’s a match, and that method looks for potential issues.

  1. You call the “replaceAll” method with the text to be looked at for potential autolinking. This might be the text of a blog post, or a blurb for a review, or something like that. The replaceAll method goes through each link, longest title first, and calls the “replace” method. If a replacement occurred, it also removes that link from the list. This is useful when looping through paragraphs: even if replaceAll() is called more than once for a page’s text, only the first instance of a title showing up on the page gets linked. [1]
  2. The replace method sets up the replacement object, the needle text, and a boolean for whether or not replacement occurred, as properties on the Autolinker instance. It then calls the “doReplace” method with just the haystack as a parameter.
  3. The doReplace method just calls re.sub() with self.needle, a method that will handle matches, and the haystack.
  4. The “matcher” method is the guts of it: it takes the match and makes sure that this match is not inside of a Django tag or inside of an existing “a” element. It looks for “a” tags that open but don’t close on the left of the match, and for “a” tags that close but don’t open on the right of the match. If it can’t find either of them, the match is good. It sets the full text to the left text, plus the output of the replacement object’s linkHTML method (called with the match text as the link title), plus the right text. Finally, it sets self.replaced to True, so that the higher methods know that a replacement has occurred.
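The anchor test in step 4 can be pulled out and tried on its own. This is a sketch of the same string checks as a standalone function; the name inside_anchor is mine, not part of the class.

```python
def inside_anchor(left_text, right_text):
    """Return True if a match sitting between the two texts is inside an "a" element."""
    # An "<a" on the left with no "</a>" after it means a link is still open.
    last_open = left_text.rfind('<a')
    if last_open != -1 and left_text.rfind('</a>') < last_open:
        return True
    # A "</a>" on the right with no "<a" before it closes a link opened earlier.
    first_close = right_text.find('</a>')
    if first_close != -1:
        next_open = right_text.find('<a')
        if next_open == -1 or next_open > first_close:
            return True
    return False

print(inside_anchor('See <a href="/x">the ', ' page</a> now'))
print(inside_anchor('See <a href="/x">this</a> and ', ' too'))
print(inside_anchor('plain ', ' text <a href="/y">y</a>'))
```

Because these are plain string searches rather than HTML parsing, any tag that begins with “<a”, such as “<abbr”, will also look like an open link on the left.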

If necessary, the matcher method calls doReplace again with the text to the right of the match. (The regular expression should ensure no potential matches to the left of the match.)
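To see what the three groups capture, here is a small sketch using the same pattern shape that replace() builds. The sample text and the “link” template tag in it are made up for illustration.

```python
import re

title = "preg_replace"
# Same shape as in replace(): brace-free text, the title, more brace-free text.
needle = re.compile('([^{}]*?)(' + re.escape(title) + ')([^{}]*)', re.DOTALL)

# Hypothetical text containing a template tag and a bare mention.
text = "Use {% link preg_replace %} or call preg_replace directly."
matches = [m.groups() for m in needle.finditer(text)]

# The matcher skips any match that starts or ends with a percent sign.
skipped = [m.group(0) for m in needle.finditer(text)
           if m.group(0)[0] == '%' or m.group(0)[-1] == '%']

# The left group is non-greedy, so it stops at the first occurrence;
# any later occurrence ends up in the right group.
first = needle.search("first preg_replace, second preg_replace")

print(matches)
print(skipped)
print(first.groups())
```

Since the groups cannot cross a brace, a match that begins just inside “{%” starts with a percent sign, which is what the matcher tests for. The last example shows why doReplace is called again on the right text: a second occurrence can only ever be there.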

You can see examples in this post: I typed str_replace and preg_replace bare towards the top of the page, for example, and the Autolinker recognized that it matched the title of one of the related links at the bottom of this post; it then linked that text to the appropriate page.

Caveats

This is going to be processor-intensive for lots of links. I use a Django-based CMS that uploads flat files, so I don’t mind it. If you use Django directly to serve pages, you’ll probably want to perform autolinking on save: for each content field, keep two columns, one for the raw, unfiltered text and one for the filtered text. On a save where the raw text changes, filter the raw column into the public column.
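The two-column idea might look something like this. It is a framework-free sketch: the Page class and its field names are hypothetical, and in a real Django project the same check would go in the model’s save() method, with the Autolinker doing the filtering.

```python
class Page:
    """Hypothetical record with a raw column and a filtered (public) column."""

    def __init__(self, autolink):
        self.autolink = autolink   # callable: raw text in, linked text out
        self.raw_text = ''
        self.public_text = ''
        self._saved_raw = None     # the raw text as of the last save

    def save(self):
        # Only pay for autolinking when the raw text actually changed.
        if self.raw_text != self._saved_raw:
            self.public_text = self.autolink(self.raw_text)
            self._saved_raw = self.raw_text

# A fake filter that records its calls, standing in for the Autolinker.
calls = []
def fake_autolink(text):
    calls.append(text)
    return text.upper()

page = Page(fake_autolink)
page.raw_text = 'hello'
page.save()
page.save()   # unchanged raw text, so no second filtering pass
print(page.public_text, len(calls))
```

Pages are then served from the public column, so the cost of autolinking is paid once per edit rather than once per view.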

  1. The Autolinker is very page-centric, because it’s meant for Django, which itself is a bit page-centric, and for a blog where each page will have different relevant links to autolink.
