Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Django: fix_ampersands and abbreviations

Jerry Stratton, May 22, 2011

I’ve been slowly converting all of my Django fields to use UTF8 instead of named entities. The combination of using named entities, talking about code, and talking about D&D makes it difficult to know when the ampersand should be converted and when it shouldn’t.

For the most part the conversion to UTF8 is going well, except that there’s no easy way to know when < and > need to be converted and when they don’t: sometimes I’m talking about HTML, and sometimes I’m using it. So for those two characters, I’ll need to keep using the named entities directly in my blog content.

It turns out, though, that the fix_ampersands filter handles this. The documentation makes it sound like fix_ampersands converts all ampersands, but in fact it uses a simple regular expression to exclude existing named entities and numeric character references.1

Unfortunately, this still leaves a few edge cases. Any ampersand abbreviation, such as Q&A, R&R, M&Ms, or R&D, runs the risk of triggering one of them whenever a semicolon follows it. For me, this comes up mainly when talking about role-playing games such as D&D; V&V; and T&T. Those first two ampersands look like named entities to fix_ampersands because django/utils/html.py takes the fast expedient of a very simple regular expression:

  • unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')
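
To see which ampersands that pattern leaves alone, here is a minimal sketch using only Python’s re module. It assumes fix_ampersands does little more than substitute &amp; wherever the pattern matches; the function name fix_ampersands_sketch and the sample strings are mine, not Django’s.

```python
import re

# Pattern as quoted from django/utils/html.py in the article.
unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')

def fix_ampersands_sketch(text):
    """Hypothetical stand-in: escape every ampersand the pattern matches."""
    return unencoded_ampersands_re.sub('&amp;', text)

# Real named entities and numeric references are left alone.
entities = fix_ampersands_sketch('&lt;b&gt; and &#38;')

# Bare ampersands with no semicolon after the next word are escaped.
bare = fix_ampersands_sketch('Q&A and R&R')

# An abbreviation followed by a semicolon is mistaken for an entity.
games = fix_ampersands_sketch('D&D; V&V; and T&T')

print(entities)  # &lt;b&gt; and &#38;
print(bare)      # Q&amp;A and R&amp;R
print(games)     # D&D; V&V; and T&amp;T
```

Only the last ampersand in the games example gets escaped; the first two are followed by a word character and a semicolon, which is all the pattern checks for.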

In my case, a simple change to the regex will handle the edge case examples above:

  • unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

There are no one-character named entities, so this version treats anything that looks like a single-character entity as a bare ampersand and escapes it. In place of “\w+” it uses “\w{2,}”, so that only names of two or more characters are excluded from replacement.2
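
A quick check of the modified pattern, again as a sketch with plain re rather than Django itself. The helper name escape and the test strings are assumptions for illustration.

```python
import re

# Modified pattern: only names of two or more characters count as entities.
two_plus_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

def escape(text):
    """Hypothetical helper: escape every ampersand the pattern matches."""
    return two_plus_re.sub('&amp;', text)

# Single-character abbreviations are now escaped even before a semicolon.
games = escape('D&D; V&V; and T&T')

# Genuine entities still pass through untouched.
entities = escape('&lt;b&gt;')

# Longer abbreviations followed by a semicolon still look like entities.
longer = escape('F&SF; AT&SF;')

print(games)     # D&amp;D; V&amp;V; and T&amp;T
print(entities)  # &lt;b&gt;
print(longer)    # F&SF; AT&SF;
```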

This won’t fix the problem when the abbreviation looks like a two-character (or more) entity, but in my case the problem has only shown up for one-character abbreviations. In Python it’s possible to “fix” this without hacking the core code directly. At the end of settings.py, I added:

  • #modify fix_ampersands to handle single-character abbreviations
  • import django.utils.html, re
  • django.utils.html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

This overwrites the regular expression used by django.utils.html with one that no longer treats single-character names as entities, so their ampersands get escaped like any other.
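
The reason this works is that a Python function looks up module-level names each time it is called, so rebinding the name in the module changes what the function uses. The sketch below shows the mechanism with a throwaway stand-in module rather than Django; standin_html and its simplified fix_ampersands are hypothetical, and it assumes the real function reads the pattern from its module’s globals in the same way.

```python
import re
import types

# Source for a hypothetical stand-in module with the two names that matter.
SOURCE = r"""
import re
unencoded_ampersands_re = re.compile(r'&(?!(\w+|#\d+);)')

def fix_ampersands(text):
    # The global name is looked up at call time, not at definition time.
    return unencoded_ampersands_re.sub('&amp;', text)
"""

standin_html = types.ModuleType('standin_html')
exec(SOURCE, standin_html.__dict__)

before = standin_html.fix_ampersands('D&D; and T&T')

# Rebind the module attribute from outside, as the settings.py lines do.
standin_html.unencoded_ampersands_re = re.compile(r'&(?!(\w{2,}|#\d+);)')

after = standin_html.fix_ampersands('D&D; and T&T')

print(before)  # D&D; and T&amp;T
print(after)   # D&amp;D; and T&amp;T
```

The patch only takes effect for code that reads the name through the module; anything that copied the compiled pattern beforehand would keep the old one.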

This is only a temporary fix. Once I finish converting every field so that entities appearing in them are switched to their UTF8 equivalents, I will know that there are no entities in my content except for the angle brackets. At that point, I should be able to make a new filter to replace fix_ampersands that excludes nothing except the left and right angle brackets. Something like (and this is untested):

  • re.compile(r'&(?!(lt|gt);)')

Because the only named entities that will be appearing are these two, I can exclude those directly and escape everything else.
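
Here is one way that untested pattern might be exercised outside Django. The function name escape_all_but_angle_brackets is hypothetical, and wiring it up as a template filter is left out.

```python
import re

# Exclude only &lt; and &gt;; every other ampersand gets escaped.
angle_only_re = re.compile(r'&(?!(lt|gt);)')

def escape_all_but_angle_brackets(text):
    """Hypothetical replacement for fix_ampersands."""
    return angle_only_re.sub('&amp;', text)

# Abbreviations of any length are escaped; the angle brackets survive.
mixed = escape_all_but_angle_brackets('D&D; F&SF; &lt;b&gt;')

# Any other entity still in the content would be escaped as well.
leftover = escape_all_but_angle_brackets('&eacute;')

print(mixed)     # D&amp;D; F&amp;SF; &lt;b&gt;
print(leftover)  # &amp;eacute;
```

The second case is why this has to wait until every other entity has been converted to UTF8: a stray entity would show up on the page as literal text.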

  1. I’ve submitted a patch to the documentation to add this.

  2. I submitted this as a patch, but it was (reasonably) refused, as it does not address the core limitation of fix_ampersands: it will still fail for fantasy and science fiction fans and railroad fans, who have abbreviations (F&SF, AT&SF) with two characters following the ampersand; I expect the military is filled with such abbreviations as well. Nor can the minimum length simply be raised, because two-character entities do exist, such as &lt; and &gt;.
