Mimsy Were the Borogoves

Automatically distributing images within XHTML

Jerry Stratton, August 20, 2009

The ability to safely and surely parse XHTML makes it easy to automate some boring tasks. For example, in my movie reviews I usually provide a handful of stills from the movie I’m reviewing. I don’t really care where they go on the page, as long as they’re relatively evenly distributed.

When I first started including images in my reviews back in 2001, I was just using soupy HTML. I automated image distribution by counting up the number of “paragraphs” and hoping that the image didn’t fall into a sidebar or table. If the image did, then I’d either change the review so that the image-unsafe code section moved, or I’d switch the review to manual mode.

Now that I’m using XHTML, I don’t have to worry: I can parse the XML and loop through the top-level elements.

As in Excerpting partial XHTML using minidom, loose XHTML has to be wrapped in a single element (I’m using a div) and its ampersands have to be encoded before minidom will parse it. Since I’m obviously going to be doing this for more than one purpose, it needs to be a function:

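    from xml.dom import minidom

    #wrap loose XHTML in a single root element and parse it
    def parseLooseXHTML(content):
        content = '<div>' + content + '</div>'
        #escape ampersands while content is still text, then encode for the parser
        content = content.replace('&', '&amp;')
        content = content.encode("utf-8")
        xhtml = minidom.parseString(content).childNodes[0]
        return xhtml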
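
To see why both steps matter, here is a minimal self-contained sketch, assuming Python 3. The function is restated under the hypothetical name parse_loose_xhtml so the block runs on its own, and the sample fragment is invented for illustration. Without the escaping step, an HTML entity such as &eacute; is undefined as far as the XML parser is concerned, and parsing fails.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parse_loose_xhtml(content):
    # Same steps as parseLooseXHTML, restated so this sketch runs alone.
    content = '<div>' + content + '</div>'
    content = content.replace('&', '&amp;')
    return minidom.parseString(content.encode('utf-8')).childNodes[0]

# Hypothetical fragment: two top-level paragraphs, one with an HTML entity.
sample = '<p>First paragraph</p><p>caf&eacute; scene</p>'

# Parsing the fragment with only the wrapper fails on the undefined entity.
try:
    minidom.parseString('<div>' + sample + '</div>')
    raw_parse_failed = False
except ExpatError:
    raw_parse_failed = True

# With the ampersands escaped, the fragment parses and the wrapper div
# exposes the top-level elements as its child nodes.
wrapper = parse_loose_xhtml(sample)
top_level_names = [node.nodeName for node in wrapper.childNodes]
second_text = ''.join(child.data for child in wrapper.childNodes[1].childNodes)
```

Here top_level_names lists the two p elements, and second_text still contains the literal text &eacute; because the entity passed through the parser as plain characters.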

After that, it’s a simple process of taking some XHTML content and a list of media and looping:

    from django.template.loader import render_to_string

    #insert automatic media between top-level HTML
    def simplemedia(content, media):
        mediaCount = len(media)
        if not mediaCount:
            return content
        currentMedia = 0
        characterCount = len(content)
        currentCharacter = 0
        xhtml = parseLooseXHTML(content)
        htmlParts = []
        for tag in xhtml.childNodes:
            tagText = getElementText(tag)
            if currentMedia < mediaCount:
                #place the next item once the text so far reaches its share of the total
                if currentCharacter >= characterCount*currentMedia/mediaCount:
                    #alternate the pullodd and pulleven classes from one item to the next
                    if currentMedia % 2:
                        mediaClass = ["pulleven"]
                    else:
                        mediaClass = ["pullodd"]
                    mediaHolder = media[currentMedia]
                    if mediaHolder.style:
                        mediaClass.append(mediaHolder.style.className)
                    mediaClass = ' '.join(mediaClass)
                    imageContext = {'link': mediaHolder.linkHTML(embed=True), 'style': mediaClass, 'caption': mediaHolder.caption}
                    htmlParts.append(render_to_string("parts/image_pull.html", imageContext))
                    currentMedia = currentMedia + 1
                currentCharacter = currentCharacter + len(tagText)
            htmlParts.append(tagText)
        content = "\n".join(htmlParts)
        return content

Each item in the list of media is an object that knows how to create its display HTML (method: linkHTML), and that contains properties for various parts of the media, such as the caption, any custom style, the title, and the URL.
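
The article does not show the media class itself, so the following is only a guess at its shape: a minimal sketch, assuming Python 3, with just the attributes that simplemedia touches. The class names Media and MediaStyle, the constructor arguments, and the markup returned by linkHTML are all hypothetical.

```python
from html import escape

class MediaStyle:
    """Hypothetical custom style; simplemedia only reads className."""
    def __init__(self, className):
        self.className = className

class Media:
    """Hypothetical stand-in for one item in the media list."""
    def __init__(self, url, title, caption='', style=None):
        self.url = url
        self.title = title
        self.caption = caption
        self.style = style

    def linkHTML(self, embed=False):
        # Assumed behaviour: embed the image itself, or just link to it.
        url = escape(self.url, quote=True)
        title = escape(self.title, quote=True)
        if embed:
            return '<img src="%s" alt="%s" />' % (url, title)
        return '<a href="%s">%s</a>' % (url, title)

# Build the same context dict that simplemedia passes to the template.
still = Media('/stills/a.jpg', 'Opening shot',
              caption='The opening shot', style=MediaStyle('wide'))
media_class = ['pullodd']
if still.style:
    media_class.append(still.style.className)
context = {
    'link': still.linkHTML(embed=True),
    'style': ' '.join(media_class),
    'caption': still.caption,
}
```

Any object with linkHTML, caption, and style (either None or something with a className) would work the same way in the loop.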

I’m using Django, so I can use render_to_string to render a template using a dict of items. The template looks like this:

    <div class="imagepull {{ style }}">
        {{ link }}
        {% if caption %}
            <p class="caption">{{ caption }}</p>
        {% endif %}
    </div>

You could do the same thing with Mako or other templating systems.
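
As one illustration of how little the template has to do, here is a rough equivalent using string.Template from the Python standard library. This is a stand-in chosen so the sketch runs without Django or Mako installed; it is not Mako syntax. The function name render_image_pull is hypothetical, and unlike a Django template this sketch does no escaping of its own.

```python
from string import Template

# The outer div; $caption is filled with either a caption paragraph or nothing.
IMAGE_PULL = Template('<div class="imagepull $style">\n$link\n$caption</div>')
CAPTION = Template('<p class="caption">$caption</p>\n')

def render_image_pull(context):
    # The "if caption" test from the template moves into Python here.
    caption = context.get('caption')
    caption_html = CAPTION.substitute(caption=caption) if caption else ''
    return IMAGE_PULL.substitute(
        style=context['style'],
        link=context['link'],
        caption=caption_html,
    )

with_caption = render_image_pull(
    {'link': '<img src="a.jpg" />', 'style': 'pullodd', 'caption': 'A still'})
without_caption = render_image_pull(
    {'link': '<img src="a.jpg" />', 'style': 'pulleven', 'caption': ''})
```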

And I’m using the same getElementText that I used in Excerpting XHTML:

    #clean an XHTML snippet and return its useful text
    def getElementText(element):
        return element.toxml().strip().replace('&amp;', '&')
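
The replace at the end undoes the ampersand encoding that was applied before parsing, so the markup comes back out the way it went in. A small sketch of that round trip, assuming Python 3; the function is restated as get_element_text so the block is self-contained, and the sample markup is invented:

```python
from xml.dom import minidom

def get_element_text(element):
    # Same one-liner as getElementText, restated so this sketch runs alone.
    return element.toxml().strip().replace('&amp;', '&')

# What a paragraph looks like after the pre-parse escaping step:
# every ampersand in the original markup has become &amp;
escaped = '<div><p>caf&amp;eacute; &amp;amp; bar</p></div>'
paragraph = minidom.parseString(escaped).documentElement.firstChild

# toxml() re-escapes the ampersands; the replace strips that layer off again.
restored = get_element_text(paragraph)
# restored is '<p>caf&eacute; &amp; bar</p>'
```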

The simplemedia function keeps track of the size of each element as it loops, so that larger elements count for more than smaller elements when distributing the images or other media. And I get nicely spaced graphics interspersed throughout my reviews, or any other page that uses images that don’t need to be precisely placed.
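
The placement rule can be tried out in isolation. The sketch below, assuming Python 3, keeps only the counting logic from simplemedia: item number n goes in front of the first element reached after n/mediaCount of the text has gone by. The function name media_positions and the sample lengths are made up for illustration, and the total is taken as the sum of the element lengths rather than len(content).

```python
def media_positions(element_lengths, media_count):
    """Return the index of the element each media item is placed before."""
    total_length = sum(element_lengths)
    positions = []
    current_media = 0
    current_character = 0
    for index, length in enumerate(element_lengths):
        if current_media < media_count:
            # Place the next item once the running character count
            # reaches that item's share of the total.
            if current_character >= total_length * current_media / media_count:
                positions.append(index)
                current_media += 1
            current_character += length
    return positions

# Five top-level elements of uneven size, three images to place.
lengths = [100, 400, 100, 100, 300]
positions = media_positions(lengths, 3)
# positions is [0, 2, 4]
```

The long second element pushes the second image later than a simple count of elements would. Because at most one item is placed per element, a page with fewer elements than images will leave some images unplaced under this rule.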
