Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, Swift, BASIC, and whatever else I happen to feel like hacking at.

Auto-closing HTML tags in comments

Jerry Stratton, March 19, 2011

I was just on Watts up with that? and saw Anthony Watts’s biggest pet peeve: unclosed italics. However, his blog uses WordPress, and WordPress uses PHP. As it turns out, I’ve been working on integrated comments on my blog (I have it working now on the Biblyon Broadsheet) and tried to deal with this potential issue.

The way I did it was to use PHP’s built-in XML functionality. PHP’s XML objects can take in HTML and write out XML—with all tags fully closed.

Here is a stripped-down version of the method I use to do this, along with a simple test case:

    <?php
    function fixHTML($comment) {
        $xml = new DOMDocument('1.0');
        //loadHTML assumes ISO-8859-1 unless told otherwise; this hint assumes comments arrive as UTF-8
        if (@$xml->loadHTML('<?xml encoding="UTF-8">' . $comment)) {
            //pull just the body out and save it
            $body = $xml->getElementsByTagName('body');
            $body = $body->item(0);
            $xml = $xml->saveXML($body);
            //strip out the <body></body> tag
            $xml = substr($xml, 6, -7);
            return $xml;
        } else {
            return false;
        }
    }
    $testComment = 'I think your blog is the <i>greatest!';
    echo fixHTML($testComment);
    ?>

As you can see, the test comment has unclosed italicization. But once run through this function, that HTML becomes:

    <p>I think your blog is the <i>greatest!</i></p>

Presumably, you will already have used PHP’s strip_tags function to remove all tags except for the ones you want to allow. If not, you can add a strip_tags as the first line of this function.
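As a minimal sketch of that first step, here is strip_tags with an allow-list. The list of tags in $allowedTags is only an assumption for illustration; use whatever set makes sense for your comments.

```php
<?php
// Hypothetical allow-list; adjust it to the tags your comments should permit.
$allowedTags = '<a><b><i><em><strong><blockquote>';

$comment = 'Nice <i>post! <script>alert("hi")</script><font color="red">Really.</font>';

// strip_tags removes the disallowed tags but keeps the text inside them.
$stripped = strip_tags($comment, $allowedTags);

echo $stripped;
// prints: Nice <i>post! alert("hi")Really.
```

Note that the text inside the removed script tag survives as plain text, and that the italics are still unclosed, which is why the result still needs to go through the DOMDocument step.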

Another issue with allowing HTML in your comments, however, is attributes. The most common attribute is the “href” attribute on the “a” tag for making links. If you want to strip all attributes except the href, you can do that, too:

    $xml = new DOMDocument('1.0');
    if (@$xml->loadHTML($comment)) {
        //remove all attributes except href
        $xpath = new DOMXPath($xml);
        $attributeBearingNodes = $xpath->query('//*[@*]');
        foreach ($attributeBearingNodes as $node) {
            //copy the attribute list first: removing from the live list while looping over it can skip attributes
            $attributes = iterator_to_array($node->attributes);
            foreach ($attributes as $attributeName=>$attributeNode) {
                if (!($node->tagName == 'a' && $attributeName == 'href')) {
                    $node->removeAttribute($attributeName);
                }
            }
        }
        //...then pull out the body and save it, as in fixHTML
    }

This finds every tag that has an attribute on it, loops through them, and then loops through the attributes on those tags. If the attribute is not an “href” on an “a” tag, the attribute is removed.
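Putting the pieces together, a complete function might look something like the sketch below. The name stripAttributes is hypothetical, as is the test link. The loop copies the attribute list before removing anything, on the assumption that removing from the live list during the loop can skip entries.

```php
<?php
// Sketch only: stripAttributes() is a hypothetical name, not from the article.
function stripAttributes($comment) {
    $xml = new DOMDocument('1.0');
    if (!@$xml->loadHTML($comment)) {
        return false;
    }
    $xpath = new DOMXPath($xml);
    // Every element that carries at least one attribute.
    foreach ($xpath->query('//*[@*]') as $node) {
        // Loop over a copy so that removals do not disturb the iteration.
        foreach (iterator_to_array($node->attributes) as $attributeName => $attributeNode) {
            if (!($node->tagName == 'a' && $attributeName == 'href')) {
                $node->removeAttribute($attributeName);
            }
        }
    }
    $body = $xml->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // Save just the body, then strip the <body></body> wrapper.
    return substr($xml->saveXML($body), 6, -7);
}

$cleaned = stripAttributes('<a href="http://example.com/" onclick="evil()" style="color:red">a link</a>');
echo $cleaned;
```

The href should survive and the onclick and style attributes should be gone; whether the parser also wraps the link in a paragraph tag may depend on the libxml version PHP was built against.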

That doesn’t necessarily fix everything, though. The “href” attribute itself can contain JavaScript that runs directly from your page.

    //loop over a copy of the attribute list so that removing attributes does not skip any
    foreach (iterator_to_array($node->attributes) as $attributeName=>$attributeNode) {
        if (!($node->tagName == 'a' && $attributeName == 'href')) {
            $node->removeAttribute($attributeName);
        } else {
            //this is an HREF on an A tag, but we still want to avoid running javascript directly on the page
            $link = $attributeNode->value;
            $link = strtolower(trim($link));
            if (strpos($link, 'javascript') === 0) {
                $attributeNode->value = 'http://example.com/prettykittens';
            }
        }
    }

This will check every “a” tag’s “href” to make sure it doesn’t start with the word “javascript”. In the example, it replaces the offending link with a link to pretty kittens. In practice, it might be more appropriate to return an error at that point. Depending on the audience for your blog, you might also decide to play it extremely safe and reverse the logic: rather than only getting rid of “href” attributes that start with “javascript”, get rid of all of them that don’t start with “http”.

    if (strpos($link, 'http') !== 0) {

Doing it that way will also bar FTP links, email links, and other less commonly used links, but you don’t see too many of those around any more.
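If you go the allow-only-“http” route, a slightly stricter variation is to match the whole scheme rather than just the first four letters, so that something like “httpfoo:” does not get through. This is a sketch of that idea rather than the check used above; isWebLink is a hypothetical helper name.

```php
<?php
// isWebLink() is a hypothetical helper name, not part of the original code.
function isWebLink($link) {
    // Anchor on the full scheme, case-insensitively: only http:// and https:// pass.
    return preg_match('#^https?://#i', trim($link)) === 1;
}

$links = array('http://example.com/', 'javascript:alert(1)', 'mailto:me@example.com');
foreach ($links as $link) {
    echo $link, ' => ', isWebLink($link) ? 'keep' : 'reject', "\n";
}
```

Inside the attribute loop, the test would then read if (!isWebLink($link)) in place of the strpos comparison.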

This works best if you run it once, when the comment is stored, and save the cleaned HTML, so that you don’t have to run this code every time you display a comment.
