Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Excerpting partial XHTML using minidom

Jerry Stratton, April 4, 2009

For the front page of this blog, I want to excerpt the HTML that makes up the content of each post. Trying to use Python for this kept resulting in errors, because the XHTML I was excerpting is not itself the full page; it’s just the content.

The simple solution was to render the page content; chop it at 3,000 characters; use .rfind to look for the last <p> tag; and chop again at that tag, because whatever follows it is only a partial paragraph. For simple pages, this worked. For more complex pages, it ran the risk of the last item being a floating image that floats over the next article, or of chopping off the end but not the beginning of a blockquote or div.

The xml.dom.minidom “lightweight” XML parser seemed like the obvious solution. There were two problems I kept running into: the second top-level element always produced an error, because an XML document can have only one root element; and HTML entities such as &nbsp; also always produced an error, because the snippet has no DTD to define them. The solution ended up being pretty simple, although it’s a bit hacky.

  1. Surround the entire XHTML snippet with a single <div> tag, and get that element’s children after parsing;
  2. Convert all ampersands to &amp; before parsing, and convert them back after parsing.
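
A minimal sketch of those two steps in isolation, assuming Python 3. The sample snippet is made up, and parse_fragment is a hypothetical helper name, not part of the code in this post:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# made-up sample: two top-level elements and an HTML entity
snippet = '<p>Fish &amp; chips&nbsp;today</p><p>Second paragraph</p>'

def parse_fragment(text):
    """Hypothetical helper: escape ampersands, wrap in one root, and parse."""
    wrapped = '<div>' + text.replace('&', '&amp;') + '</div>'
    return minidom.parseString(wrapped).documentElement

# parsing the bare snippet raises a parser error
try:
    minidom.parseString(snippet)
    parsed_bare = True
except ExpatError:
    parsed_bare = False

shell = parse_fragment(snippet)

# serialize the children and convert the ampersands back
restored = ''.join(child.toxml().replace('&amp;', '&') for child in shell.childNodes)
```

Here the bare snippet fails to parse, while the wrapped version yields a div with two children, and the serialized children match the original text again once the ampersands are converted back.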

    from xml.dom import minidom

    #excerpt XHTML safely
    def excerptXHTML(content=None, maxSize=3000, element=None):
        if content:
            if len(content) < maxSize:
                return content
            #escape ampersands before encoding, so this works on text in Python 3 as well
            content = '<div>' + content.replace('&', '&amp;') + '</div>'
            xhtml = minidom.parseString(content.encode("utf-8")).childNodes[0]
        elif element:
            xhtml = element
        else:
            return ""
        excerpt = ""
        snippet = ""
        snippets = []
        #collect the elements that reach maxSize
        for tag in xhtml.childNodes:
            tagText = getExcerptElementText(tag)
            if tagText:
                if len(snippet) + len(tagText) <= maxSize:
                    snippet = snippet + "\n" + tagText
                    snippets.append(tag)
                elif len(snippet) < maxSize*.6:
                    #there's a large element; see if we can remove some of its children
                    snippedTagChildren = excerptXHTML(element=tag, maxSize=maxSize-len(snippet))
                    if snippedTagChildren:
                        tag = tag.cloneNode(False)
                        for child in snippedTagChildren:
                            tag.appendChild(child)
                        snippets.append(tag)
                    break
                else:
                    break
        if element:
            return snippets
        #walk back elements that shouldn't be at the end
        if snippets:
            while snippets and badExcerptEnding(snippets[-1]):
                snippets = snippets[0:-1]
            #join the good elements up
            for snippet in snippets:
                excerpt = excerpt + "\n" + getExcerptElementText(snippet)
        return excerpt

    #clean an XHTML snippet and return its useful text
    def getExcerptElementText(element):
        return element.toxml().strip().replace('&amp;', '&')

    #flag elements that shouldn't end an excerpt
    def badExcerptEnding(element):
        #text nodes and comments have no attributes to check
        if element.nodeType != element.ELEMENT_NODE:
            return False
        if element.hasAttribute('class'):
            classes = element.getAttribute('class').split(" ")
            if 'imagepull' in classes:
                return True
        return False

I considered assigning an ID to the outer div, and using getElementById() on the parsed document to grab the shell; but getElementById doesn’t appear to work on a snippet parsed this way, where no DTD declares which attribute is an ID.
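
A small sketch of that behaviour, assuming Python 3's minidom and a made-up one-line document. As far as I can tell, the lookup returns None until the attribute is explicitly marked as an ID with setIdAttribute():

```python
from xml.dom import minidom

doc = minidom.parseString('<div id="shell"><p>content</p></div>')

# nothing has told minidom that "id" is an ID attribute, so the lookup finds nothing
before = doc.getElementById('shell')

# mark the attribute as an ID by hand, then look it up again
doc.documentElement.setIdAttribute('id')
after = doc.getElementById('shell')
```

Since the wrapper div is always the first child of the parsed document, taking childNodes[0] is simpler than marking IDs by hand.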

The important methods and properties here are:

element.childNodes
This property is a list of all the immediate children of this element.
element.toxml()
This method takes the current element and converts it back to text suitable for being part of a web page.
element.hasAttribute() and element.getAttribute()
In the badExcerptEnding function, I check to see if this element has a class attribute; if it does, I get its value. I know that any element with the class “imagepull” is an image pull-out. It makes no sense to have that element at the end of an excerpt. When I start putting sidebars into the text, I’ll flag those, as well.
element.cloneNode()
This method takes the current element and returns a clone of that element without any childNodes. (Changing False to True returns a clone with childNodes).
element.appendChild()
This method appends a node to the list of childNodes on this node. I use this and cloneNode to cut large elements down to more manageable sizes.
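
As a quick illustration of the attribute methods, here is a sketch of a class check on a made-up fragment. The helper name has_class and the sample markup are assumptions for the example; the node-type guard is there because text nodes have no hasAttribute method:

```python
from xml.dom import minidom

# made-up sample: a paragraph followed by an image pull-out
doc = minidom.parseString(
    '<div><p>text</p><div class="imagepull right"><img src="x.png" alt=""/></div></div>'
)
paragraph, pullout = doc.documentElement.childNodes

def has_class(node, name):
    """Hypothetical helper: True if the element's class attribute lists this name."""
    if node.nodeType != node.ELEMENT_NODE:
        return False  # text nodes and comments have no attributes
    return node.hasAttribute('class') and name in node.getAttribute('class').split()

print(has_class(pullout, 'imagepull'))    # True
print(has_class(paragraph, 'imagepull'))  # False
```

Note that getAttribute() returns an empty string for a missing attribute, so the hasAttribute() check is mostly there for readability.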

I’m running through the list of top-level elements twice: once to determine how many will “fit” in the maxSize character limitation, and once again to actually zip them up. That’s because I occasionally need to remove elements from the end. I’m doing all of my processing off-line, so it makes more sense to do this in a readable fashion; if you’re doing this live, you’ll probably want to be more efficient.

When I render the page’s content, I get a series of paragraphs and the occasional div or blockquote (or, in the case of this post, a list). If I have one opening paragraph and then one really huge blockquote or source code listing, then without recursion my excerpt is going to be just that opening paragraph. Here, for example, the source code listing puts the excerpt over the 3,000-character maxSize that I’m setting.

I’m arbitrarily checking to see if the length of the snippet is currently less than 60% of maxSize. If it is, I assume that the tag that’s putting this snippet over maxSize is a large one. I call excerptXHTML() with just this tag, and with maxSize reduced by the length of what has already been collected. Since we already know that this is a single tag instead of a list, and it’s already been parsed, I modified the function to accept a parsed element. The function recognizes that it has an element and doesn’t do any of the trickery I had to do with text. It returns a list of the children that don’t put the snippet over the limit.
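
The three-way decision can be sketched on its own, using plain numbers instead of parsed elements. The function next_step and its parameter names are hypothetical, and it ignores the newline that the real loop adds between elements:

```python
def next_step(used, size, max_size=3000, threshold=0.6):
    """Hypothetical helper mirroring the three branches of the loop.

    used: characters already collected; size: length of the next element's text.
    """
    if used + size <= max_size:
        return "keep"   # the whole element fits
    if used < max_size * threshold:
        return "trim"   # recurse into the element with the remaining budget
    return "stop"       # close enough to the limit; end the excerpt here

print(next_step(500, 1000))   # keep
print(next_step(500, 4000))   # trim, with 3000 - 500 = 2500 characters left
print(next_step(2000, 4000))  # stop, because 2000 is not below 60% of 3000
```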

It then clones that large Node into an empty version of itself, and appends each of the children in that list to the clone. When rendered back to XML, this produces a safely smaller version of that element.
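
A minimal sketch of that clone-and-refill step, assuming Python 3 and a made-up blockquote. The list of kept children is chosen by hand here, rather than by the size check:

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<blockquote class="long"><p>one</p><p>two</p><p>three</p></blockquote>'
)
big = doc.documentElement

# pretend the size check decided that only the first two children fit
kept = list(big.childNodes)[:2]

# an empty copy of the element: same tag and attributes, no children
shell = big.cloneNode(False)
for child in kept:
    shell.appendChild(child)  # this moves the child out of the original element

print(shell.toxml())
# <blockquote class="long"><p>one</p><p>two</p></blockquote>
```

Because appendChild() moves a node rather than copying it, the original element loses the children that go into the clone. That is fine here, since the original is discarded once the excerpt is built.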
