Automatically link related URLs in Django

Jerry Stratton, December 1, 2010

When I wrote the first version of my blog software in PHP, one of the things I wanted to do was avoid typing links directly into the text. I used a fairly simple str_replace, taking advantage of str_replace’s ability to take arrays as the search and replace parameters. The first array contained the “title” of the link, and the second array contained the full “a” tag.

$body = str_replace($linkTitles, $linkTags, $body, 1);

I also told it to only link the first instance of each item in the array, because I don’t like reading things filled with duplicate links.

It ended up getting a little more complicated than that as I fixed issues. The first one was that I had to sort the links by the length of the title, with the shortest titles first—otherwise, a short link might end up creating a link inside of an existing “a” tag created from an earlier, longer title.

function strLengthAscending($a, $b) {
    $al = strlen($a);
    $bl = strlen($b);
    if ($al == $bl) {
        return 0;
    } else {
        return ($al > $bl) ? 1 : -1;
    }
}
uksort($links, 'strLengthAscending');
$linkTitles = array_keys($links);
$linkTags = array_values($links);
$body = str_replace($linkTitles, $linkTags, $body, 1);

Then I realized that this was case-sensitive, and so was not linking text that didn’t exactly match the case of the title, so I switched to preg_replace:

$linkTitles = preg_replace('/^.*$/', '`\Q$0\E`i', $linkTitles);
$body = preg_replace($linkTitles, $linkTags, $body, 1);

But that, once again, started causing “a” tags within “a” tags, as all lower-case titles with no spaces could easily match hostnames.

//titles without spaces can't be case-insensitive, because that might hit URLs
$linkTitles = preg_replace('/^.* .*$/', '`\Q$0\E`i', $linkTitles);
$linkTitles = preg_replace('/^[^ ]*$/', '`\Q$0\E`', $linkTitles);
$body = preg_replace($linkTitles, $linkTags, $body, 1);

And so forth. It still ran the risk of putting a link inside of another link, but that was fairly rare, and I just adjusted the link text when it happened.
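
For comparison with the Django version, here is a rough Python rendering of the rules described above: shortest titles first, only the first occurrence linked, and case-insensitive matching only for titles that contain a space. It is a sketch, not the original PHP; the function name naive_autolink and the sample links are made up for illustration.

```python
import re

def naive_autolink(body, links):
    """links maps a title to its full <a> tag (hypothetical sample data)."""
    # shortest titles first, as in the PHP version
    for title in sorted(links, key=len):
        # titles without spaces stay case-sensitive, so they are less likely to hit hostnames
        flags = re.IGNORECASE if ' ' in title else 0
        tag = links[title]
        # a function replacement avoids backslash processing; count=1 links only the first match
        body = re.sub(re.escape(title), lambda m: tag, body, count=1, flags=flags)
    return body

links = {
    'Django': '<a href="/django/">Django</a>',
    'regular expressions': '<a href="/regex/">regular expressions</a>',
}
result = naive_autolink('Regular Expressions in Django, and more Django.', links)
```

Like the PHP, this does nothing to check whether a match already sits inside an “a” tag.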

Django

When I moved to Django, I initially just used a template tag to handle links. It was precise and linked specific text to specific URLs from another model. But I missed the ability to have text automatically get linked to related links, related pages, and related topics. I ended up writing an Autolinker class; it accepts a list of objects and titles that match those objects. Each object must have a “linkHTML” method that accepts a “title” parameter and creates the link tag with that link text.

import re

class Autolinker(object):
    #titles longer than this many characters are matched case-insensitively
    ignoreCaseLength = 12

    def __init__(self):
        self.links = {}

    #any item that gets added must have a linkHTML method that accepts a title parameter
    def addItem(self, title, item):
        self.links[title] = item

    #each item in the list must also have a title method
    def addLinks(self, links):
        for link in links:
            self.links[link.title()] = link

    def replaceAll(self, haystack):
        #try the longest titles first
        for title, link in sorted(self.links.items(), key=lambda x:-len(x[0])):
            haystack = self.replace(title, link, haystack)
            #we're going paragraph-by-paragraph, but don't want multiple links
            if self.replaced:
                del self.links[title]
        return haystack

    def regexOptions(self, text):
        options = re.DOTALL
        if len(text) > self.ignoreCaseLength:
            options = options | re.IGNORECASE
        return options

    def replace(self, needle, replacement, haystack):
        self.replacement = replacement
        options = self.regexOptions(needle)
        #group 1 is the text to the left of the match, group 2 the match, group 3 the text to the right
        needle = re.compile('([^{}]*?)(' + re.escape(needle) + ')([^{}]*)', options)
        self.needle = needle
        self.replaced = False
        return self.doReplace(haystack)

    def doReplace(self, haystack):
        return re.sub(self.needle, self.matcher, haystack)

    def matcher(self, match):
        fullText = match.group(0)
        if not self.replaced:
            #if it's inside of a django tag, don't make the change
            if fullText[0] == '%' or fullText[-1] == '%':
                return fullText
            #if it's inside of a link already, don't make the change
            leftText = match.group(1)
            matchText = match.group(2)
            rightText = match.group(3)
            rightmostAnchor = leftText.rfind('<a')
            if rightmostAnchor != -1:
                anchorClose = leftText.rfind('</a>')
                if anchorClose < rightmostAnchor:
                    #this is inside of an open a tag.
                    #but there might be a match in the rightText
                    fullText = leftText + matchText + self.doReplace(rightText)
                    return fullText
            #check the right side for anchors, too.
            leftmostAnchorClose = rightText.find('</a>')
            if leftmostAnchorClose != -1:
                anchorOpen = rightText.find('<a')
                if anchorOpen == -1 or anchorOpen > leftmostAnchorClose:
                    #this is inside of an open a tag
                    return fullText
            #otherwise, it is safe to make the change
            fullText = leftText + self.replacement.linkHTML(title=matchText) + rightText
            self.replaced = True
        return fullText

It gets used by calling addItem to add individual objects with an arbitrary title, or addLinks to add a list of objects, each of which must have a “title” method, and then calling replaceAll(). Something like this will replace the first mention of each related topic, related link, or the parent object with linked text to that item:

def blurb(self):
    linker = Autolinker()
    description = self.description()
    #add topic links
    linker.addLinks(self.topics())
    #add related links
    linker.addLinks(self.urls())
    #add parent link
    linker.addItem(self.parent.title, self.parent)
    #autolink text
    description = linker.replaceAll(description)
    return description

Each item needs to know how to link itself, and put this knowledge into a linkHTML method; it might look like this:

from django.template.loader import render_to_string

def linkHTML(self, title=None):
    return render_to_string('parts/link.html', {'link': self, 'title': title})

The template snippet (parts/link.html) would look like this:

<a href="{{ link.get_absolute_url }}">{% firstof title link.title %}</a>
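
For trying the Autolinker outside of a Django project, a stand-in object only needs the same two methods. This is a minimal sketch; the class name StubLink and its attributes are hypothetical, and it builds the tag with string formatting rather than rendering a template.

```python
class StubLink:
    """Hypothetical linkable object; a real one would render parts/link.html."""
    def __init__(self, name, url):
        self.name = name
        self.url = url

    def title(self):
        return self.name

    def get_absolute_url(self):
        return self.url

    def linkHTML(self, title=None):
        # like firstof: use the supplied title if there is one, otherwise the object's own
        return '<a href="%s">%s</a>' % (self.get_absolute_url(), title or self.title())

link = StubLink('str_replace', '/php/str_replace/')
custom = link.linkHTML(title='Str_Replace')
default = link.linkHTML()
```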

What does it do?

It mainly uses Python regular expressions; the bulk of the search is handled by re.sub, but in order to ensure that “a” tags don’t get nested, it sends re.sub a method to call when there’s a match, and that method looks for potential issues.
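
The core trick, handing re.sub a method instead of a replacement string, can be seen in isolation in a sketch like this one. The class name FirstMatchLinker and the URL are made up for illustration; it only shows how a callback can link the first match and leave the rest alone.

```python
import re

class FirstMatchLinker:
    """Hypothetical helper: links only the first match it is handed."""
    def __init__(self, url):
        self.url = url
        self.replaced = False

    def matcher(self, match):
        # re.sub calls this once per match; returning the matched text leaves it unchanged
        if self.replaced:
            return match.group(0)
        self.replaced = True
        return '<a href="%s">%s</a>' % (self.url, match.group(0))

linker = FirstMatchLinker('/php/str_replace/')
result = re.sub(r'str_replace', linker.matcher, 'str_replace is simple; str_replace is fast.')
```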

You call the “replaceAll” method with the text to be looked at for potential autolinking. This might be the text of a blog post, or a blurb for a review, or something like that. The replaceAll method goes through each link and calls the “replace” method. If a replacement was made, it also removes the link from the list. This is useful for looping through paragraphs: it ensures that even if replaceAll() is called more than once for a page’s text, only the first instance of a title showing up on the page gets linked.¹

The replace method sets up the replacement object, the needle text, and a boolean for whether or not replacement occurred, as properties on the Autolinker instance. It then calls the “doReplace” method with just the haystack as a parameter.

The doReplace method just calls re.sub() with self.needle, a method that will handle matches, and the haystack.

The “matcher” method is the guts of it: it takes the match and makes sure that this match is not inside of a Django tag or inside of an existing “a” element. It looks for “a” tags that open but don’t close on the left of the match, and for “a” tags that close but don’t open on the right of the match. If it can’t find either of them, the match is good. It sets the full text to the left text, plus the output of the replacement object’s linkHTML method (called with the match text as the link title), plus the right text. And it sets self.replaced to True, so that the higher methods know that a replacement has occurred.
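
The two anchor checks can be pulled out into a standalone function for experimenting. This is a sketch of the same logic, not part of the class; the name inside_anchor is hypothetical.

```python
def inside_anchor(left_text, right_text):
    """Return True if a match sitting between left_text and right_text is inside an <a> element."""
    # an "<a" on the left with no "</a>" after it means an anchor is still open
    last_open = left_text.rfind('<a')
    if last_open != -1 and left_text.rfind('</a>') < last_open:
        return True
    # a "</a>" on the right with no "<a" before it closes an anchor that was opened earlier
    first_close = right_text.find('</a>')
    if first_close != -1:
        next_open = right_text.find('<a')
        if next_open == -1 or next_open > first_close:
            return True
    return False

in_link = inside_anchor('See <a href="/x/">the ', ' page</a> now')
after_link = inside_anchor('See <a href="/x/">this</a> and ', ' too')
```

Because it searches for the bare string “<a”, this check will also treat other tags that begin with those two characters, such as “abbr”, as anchors.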

If necessary, the matcher method calls doReplace again with the text to the right of the match. (The regular expression should ensure no potential matches to the left of the match.)
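
A quick way to see why only the right side needs a second look is to run the same pattern on a string that contains the title twice. The sample sentence here is made up; the pattern is built the same way the replace method builds it.

```python
import re

needle = 'Django'
pattern = re.compile('([^{}]*?)(' + re.escape(needle) + ')([^{}]*)', re.DOTALL)
match = pattern.search('I like Django, and Django likes me.')

# the non-greedy first group stops at the first occurrence of the title;
# the greedy third group swallows everything after it, including later occurrences
left, found, right = match.groups()
```

Here the left text cannot contain the title, while the right text still holds the second occurrence, which is what doReplace gets called on.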

You can see examples in this post: I typed str_replace and preg_replace bare towards the top of the page, for example, and the Autolinker recognized that it matched the title of one of the related links at the bottom of this post; it then linked that text to the appropriate page.

Caveats

This is going to be processor-intensive for lots of links. I use a Django-based CMS that uploads flat files, so I don’t mind it. If you use Django directly to serve pages, you’ll probably want to perform autolinking on save: for each content field, have two columns, one for the raw, unfiltered text and one for the filtered text. On a save where the raw text changes, filter the raw column into the public column.
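
The filter-on-save idea might look something like this. It is a framework-free sketch so that it runs on its own: the class Page, its attribute names, and the placeholder autolink method are all hypothetical, and in Django the same logic would presumably live in a model’s save method with both columns as real fields.

```python
class Page:
    """Hypothetical stand-in for a model with a raw column and a filtered column."""
    def __init__(self, raw_text=''):
        self.raw_text = raw_text      # what the author typed
        self.public_text = ''         # autolinked text that gets served
        self._filtered_from = None    # the raw text the public column was built from

    def autolink(self, text):
        # placeholder for the expensive step, e.g. an Autolinker's replaceAll(text)
        return text.replace('Django', '<a href="/django/">Django</a>', 1)

    def save(self):
        # only re-filter when the raw text has changed since the last save
        if self.raw_text != self._filtered_from:
            self.public_text = self.autolink(self.raw_text)
            self._filtered_from = self.raw_text
        # a real model would write both columns to the database here

page = Page('Django is fun, and Django is fast.')
page.save()
```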

¹ The Autolinker is very page-centric, because it’s meant for Django, which itself is a bit page-centric, and for a blog where each page will have different relevant links to autolink.

preg_replace: “Searches subject for matches to pattern and replaces them with replacement.”
Python regular expression operations: “This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.”
str_replace: “This function returns a string or an array with all occurrences of search in subject replaced with the given replace value.”

Contents of Negative Space™ as a whole Copyright © 1994-2024 Jerry Stratton. Individual copyrights remain held by their respective authors unless they specify otherwise. Site titles, such as Negative Space, Strange Bedfellows, Biblyon Broadsheet, Highland Games, and FireBlade Coffeehouse are trademarks of Jerry Stratton.

Code and code snippets, to the extent that they are copyrightable, may be re-distributed under the terms of the GNU General Public License 3.

Automatically link related URLs in Django last modified December 6th, 2010.