Excerpting partial XHTML using minidom

Jerry Stratton, April 4, 2009

For the front page of this blog, I want to excerpt the HTML that makes up the content of each post. Trying to do this in Python kept resulting in parse errors, because the XHTML I was excerpting is not a full page; it’s just the content.

The simple solution was to render the page content; chop it at 3,000 characters; use .rfind to look for the last <p> tag; and chop again at that point, because whatever follows it is only a partial paragraph. For simple pages, this worked. For more complex pages, it ran the risk of the last item being a floating image that floats over the next article, or of chopping off the end but not the beginning of a blockquote or div.
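As a rough sketch of that chop-and-rfind approach, and of how it can leave a blockquote open: the function name naive_excerpt and the sample text here are made up for illustration, not the code this site actually ran.

```python
def naive_excerpt(html, max_size=3000):
    """Chop rendered HTML at max_size, then back up to the last <p> tag."""
    if len(html) <= max_size:
        return html
    chopped = html[:max_size]
    # the last <p> opens a paragraph that is now only partial; drop it
    last_paragraph = chopped.rfind('<p>')
    if last_paragraph > 0:
        chopped = chopped[:last_paragraph]
    return chopped.rstrip()

sample = '<p>one</p>\n<blockquote><p>two two two</p>\n<p>three three</p></blockquote>'
result = naive_excerpt(sample, max_size=50)
print(result)
```

The excerpt ends cleanly on a paragraph boundary, but the opening <blockquote> tag survives without its closing tag, which is the kind of breakage described above.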

The xml.dom.minidom “lightweight” XML parser seemed like the obvious solution. There were two problems I kept running into: a second top-level element always produced an error, and any entities also always produced an error. The solution ended up being pretty simple, although it’s a bit hacky.

Surround the entire XHTML snippet with a single <div> tag, and get that element’s children after parsing;
Convert all ampersands to &amp; before parsing, and convert them back after parsing.
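Here is a minimal sketch of those two steps in isolation. The helper name parse_fragment and the sample fragment are assumptions for illustration only; the sketch shows the unwrapped fragment failing to parse, and the wrapped, escaped version round-tripping.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parse_fragment(fragment):
    """Wrap an XHTML fragment in a div, escape ampersands, and return the div."""
    wrapped = '<div>' + fragment.replace('&', '&amp;') + '</div>'
    return minidom.parseString(wrapped).childNodes[0]

fragment = '<p>Caf&eacute; &amp; bar</p><p>Second paragraph</p>'

# unwrapped and unescaped, the fragment does not parse at all
try:
    minidom.parseString(fragment)
    parsed_raw = True
except ExpatError:
    parsed_raw = False

shell = parse_fragment(fragment)
# convert the escaped ampersands back when rendering each child
restored = [child.toxml().replace('&amp;', '&') for child in shell.childNodes]
print(parsed_raw)
print(restored)
```

Because every ampersand is escaped before parsing and unescaped after rendering, the parser never sees an entity it would have to resolve.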


from xml.dom import minidom

#excerpt XHTML safely
def excerptXHTML(content=None, maxSize=3000, element=None):
    if content:
        if len(content) < maxSize:
            return content
        content = '<div>' + content + '</div>'
        #escape ampersands while the content is still text, then encode
        content = content.replace('&', '&amp;')
        content = content.encode("utf-8")
        xhtml = minidom.parseString(content).childNodes[0]
    elif element:
        xhtml = element
    else:
        return ""

    excerpt = ""
    snippet = ""
    snippets = []
    #collect the elements that reach maxSize
    for tag in xhtml.childNodes:
        tagText = getExcerptElementText(tag)
        if tagText:
            if len(snippet) + len(tagText) <= maxSize:
                snippet = snippet + "\n" + tagText
                snippets.append(tag)
            elif len(snippet) < maxSize*.6:
                #there's a large element; see if we can remove some of its children
                snippedTagChildren = excerptXHTML(element=tag, maxSize=maxSize-len(snippet))
                if snippedTagChildren:
                    tag = tag.cloneNode(False)
                    for child in snippedTagChildren:
                        tag.appendChild(child)
                    snippets.append(tag)
                break
            else:
                break

    #a recursive call wants the list of elements, not the text
    if element:
        return snippets

    #walk back elements that shouldn't be at the end
    if snippets:
        while snippets and badExcerptEnding(snippets[-1]):
            snippets = snippets[0:-1]
        #join the good elements up
        for snippet in snippets:
            excerpt = excerpt + "\n" + getExcerptElementText(snippet)
    return excerpt

#clean an XHTML snippet and return its useful text
def getExcerptElementText(element):
    #convert the escaped ampersands back after rendering
    return element.toxml().strip().replace('&amp;', '&')

#flag elements that shouldn't end an excerpt
def badExcerptEnding(element):
    #text nodes have no attributes, so they can't be flagged
    if element.nodeType != element.ELEMENT_NODE:
        return False
    if element.hasAttribute('class'):
        classes = element.getAttribute('class').split(" ")
        if 'imagepull' in classes:
            return True
    return False

I considered assigning an ID to the outer div, and using xhtml.getElementById() to grab the shell; but getElementById doesn’t appear to work in XML snippets, only in full XML documents.
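One likely explanation, offered as an assumption rather than something tested on this site: minidom only treats an attribute as an ID when something declares it as one, which normally happens through a DTD in a full document. The sketch below shows the lookup returning nothing for a bare snippet, and the same lookup working once the attribute is marked by hand with setIdAttribute(). The id value "shell" is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString('<div id="shell"><p>Hello</p></div>')

# nothing declares "id" as an ID-typed attribute, so the lookup finds nothing
before = doc.getElementById('shell')

# marking the attribute as an ID by hand makes the same lookup succeed
doc.documentElement.setIdAttribute('id')
after = doc.getElementById('shell')

print(before)
print(after is doc.documentElement)
```

For a single wrapper div, taking childNodes[0] is still the shorter route.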

The important methods and properties here are:

node.childNodes: This property is a list of all the immediate children of this element.
node.toxml(): This method takes the current element and converts it back to text suitable for being part of a web page.
element.hasAttribute() and element.getAttribute(): In the badExcerptEnding function, I check to see if this element has a class attribute; if it does, I get its value. I know that any element with the class “imagepull” is an image pull-out. It makes no sense to have that element at the end of an excerpt. When I start putting sidebars into the text, I’ll flag those, as well.
node.cloneNode(False): This method takes the current element and returns a clone of that element without any childNodes. (Changing False to True returns a clone with childNodes).
node.appendChild(): This method appends a node to the list of childNodes on this node. I use this and cloneNode to cut large elements down to more manageable sizes.
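A small sketch of the read-only half of that list (childNodes, toxml(), hasAttribute() and getAttribute()) on a made-up two-element snippet; the markup and the variable names are illustrative only.

```python
from xml.dom import minidom

shell = minidom.parseString(
    '<div><p>Intro</p><div class="imagepull left"><img src="a.png"/></div></div>'
).childNodes[0]

summary = []
for node in shell.childNodes:
    classes = []
    # only ask for the class attribute when the element actually has one
    if node.hasAttribute('class'):
        classes = node.getAttribute('class').split(' ')
    summary.append((node.tagName, 'imagepull' in classes))
    print(node.toxml())

print(summary)
```

Splitting the class attribute on spaces matters because an element can carry several classes at once, as the pull-out div does here.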

I’m running through the list of top-level elements twice: once to determine how many will “fit” in the maxSize character limitation, and once again to actually zip them up. That’s because I occasionally need to remove elements from the end. I’m doing all of my processing off-line, so it makes more sense to do this in a readable fashion; if you’re doing this live, you’ll probably want to be more efficient.

When I render the page’s content, I get a series of paragraphs and the occasional div or blockquote (or, in the case of this post, a list). If I have one opening paragraph and then one really huge blockquote or source code listing, then without recursion my excerpt is going to be just that opening paragraph. Here, for example, the source code listing puts the excerpt over the 3,000-character maxSize that I’m setting.

I’m arbitrarily checking to see if the length of the snippet is currently less than 60% of maxSize. If it is, I assume that the tag that’s putting this snippet over maxSize is a large one. I call excerptXHTML() with just this tag, and with maxSize reduced by the length of the snippet so far. Since we already know that this is a single tag instead of a list, and it’s already been parsed, I modified the function to accept a parsed element. The function recognizes that it has an element and doesn’t do any of the trickery I had to do with text. It returns a list of the children that don’t put the snippet over the limit.

It then clones that large Node into an empty version of itself, and appends each of the children in that list to the clone. When rendered back to XML, this produces a safely smaller version of that element.
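The clone-and-append step can be sketched on its own like this. The helper trim_children, its budget parameter, and the sample blockquote are hypothetical simplifications; the real function above recurses and counts newlines as well.

```python
from xml.dom import minidom

def trim_children(element, budget):
    """Return an empty clone of element holding only the leading children that fit."""
    clone = element.cloneNode(False)
    used = 0
    # iterate over a copy: appendChild moves each node out of its old parent
    for child in list(element.childNodes):
        size = len(child.toxml())
        if used + size > budget:
            break
        clone.appendChild(child)
        used += size
    return clone

quote = minidom.parseString(
    '<blockquote class="long"><p>First.</p><p>Second.</p><p>Third.</p></blockquote>'
).childNodes[0]
trimmed = trim_children(quote, budget=30)
print(trimmed.toxml())
```

The shallow clone keeps the element’s own attributes, so the trimmed blockquote still carries its class; note that the children are moved, not copied, so the original element is left holding only what did not fit.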

xml.dom: Documentation for Document objects, Node objects, NodeList objects, Element objects, and the ubiquitous Python exceptions.
xml.dom.minidom: “xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller.”

Contents of Negative Space™ as a whole Copyright © 1994-2026 Jerry Stratton. Individual copyrights remain held by their respective authors unless they specify otherwise. Site titles, such as Negative Space, Strange Bedfellows, Biblyon Broadsheet, Highland Games, and FireBlade Coffeehouse are trademarks of Jerry Stratton.

Code and code snippets, to the extent that they are copyrightable, may be re-distributed under the terms of the GNU General Public License 3.

Excerpting partial XHTML using minidom last modified April 22nd, 2009.
