Mimsy Were the Borogoves

Parsing JSKit/Echo XML comments files

Jerry Stratton, January 30, 2012

I just switched over from my temporary JSKit comments to custom local comments. The main reason I went with JSKit to begin with, rather than going without comments, was that they provide the comments in an XML file. That meant I was able to convert the JSKit/Echo comments on my site to the new system.

I wrote the conversion script in Python because my comments database uses Django on the back end.

  • #!/usr/bin/python
  • # -*- coding: utf-8 -*-
  • from optparse import OptionParser
  • import sys, urlparse, datetime
  • import xml.dom.minidom as minidom
  • parser = OptionParser(u'%prog [options] <jskit file>')
  • (options, args) = parser.parse_args()
  • if not args:
    • parser.print_help()
    • sys.exit(1)
  • def getEntry(comment, key):
    • entry = comment.getElementsByTagName(key)
    • # an empty element has no firstChild, so check for that too
    • if entry and entry[0].firstChild:
      • return entry[0].firstChild.data.strip()
    • return None
  • def getValue(comment, key):
    • possibilities = comment.getElementsByTagName('jskit:attribute')
    • entry = None
    • for possibility in possibilities:
      • if possibility.getAttribute('key') == key:
        • entry = possibility
        • break
    • if entry:
      • value = entry.getAttribute('value').strip()
      • return value
    • return None
  • def getPosterURL(webpresence):
    • if '],[' in webpresence:
      • webpresences = webpresence.split('],[')
    • else:
      • webpresences = [webpresence]
    • for webpresence in webpresences:
      • webpresence = webpresence.strip('["]')
      • if webpresence:
        • service, serviceURL = webpresence.split('","')
        • if service in ['login-twitter', 'login-blogspot']:
          • return serviceURL
        • if service not in ['login-openid', 'login-gfc']:
          • print 'Unknown service:', service, serviceURL
          • sys.exit()
    • return None

The “getEntry” function just gets the text of any subelement of an XML element by name. The “getValue” function gets the value of a specific keyed element named jskit:attribute, which is what JSKit uses to store the commenter’s IP address as well as, sometimes, their identity and personal web site.
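To make the difference between the two lookups concrete, here is a minimal sketch in Python 3 (the script itself is Python 2). The XML fragment, its values, and the placeholder namespace URI are made up for illustration and are not copied from a real JSKit export; the names get_entry and get_value are just restyled versions of the helpers above.

```python
import xml.dom.minidom as minidom

# Hypothetical sample, not real JSKit output; the namespace URI is a placeholder.
SAMPLE = """<rss xmlns:jskit="http://example.com/jskit-placeholder">
  <channel>
    <link>http://www.example.com/page</link>
    <item>
      <author>Alice</author>
      <description> First comment </description>
      <jskit:attribute key="IP" value="192.0.2.1"/>
    </item>
  </channel>
</rss>"""

def get_entry(comment, key):
    """Return the stripped text of the first subelement named key, or None."""
    entries = comment.getElementsByTagName(key)
    if entries and entries[0].firstChild:
        return entries[0].firstChild.data.strip()
    return None

def get_value(comment, key):
    """Return the value of the jskit:attribute whose key matches, or None."""
    for attribute in comment.getElementsByTagName('jskit:attribute'):
        if attribute.getAttribute('key') == key:
            return attribute.getAttribute('value').strip()
    return None

item = minidom.parseString(SAMPLE).getElementsByTagName('item')[0]
print(get_entry(item, 'author'))       # text of a plain subelement
print(get_value(item, 'IP'))           # value of a keyed jskit:attribute
print(get_value(item, 'Webpresence'))  # a key that is absent from this item
```

The first lookup reads element text; the second reads an XML attribute on whichever jskit:attribute element carries the requested key, and both fall back to None when nothing matches.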

Finally, “getPosterURL” tries to get their personal web site from the sites listed in jskit:attribute keyed as “Webpresence”. That element’s value contains both public sites and private login URLs. Since JSKit displayed them publicly, I thought it would be nice to the commenter to continue linking to their Twitter or Blogspot site in my new system. But, at least among my commenters, I can’t see any reason to link to a person’s openid or gfc URL. (And if the webpresence is not one of those four types, the function will immediately bail and let you know, so that you can add it to either the good list or the ignore list.)
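As a rough illustration of what that parsing does, here is a Python 3 sketch run on a made-up “Webpresence” value. The sample string is only an assumption about the format, inferred from how the function splits it; the names get_poster_url, PUBLIC_SERVICES, and IGNORED_SERVICES are hypothetical, and this version raises an exception where the script calls sys.exit().

```python
PUBLIC_SERVICES = {'login-twitter', 'login-blogspot'}  # worth linking to
IGNORED_SERVICES = {'login-openid', 'login-gfc'}       # private login URLs

def get_poster_url(webpresence):
    """Return the first public URL in a Webpresence value, or None."""
    for pair in webpresence.split('],['):
        # Trim the leftover brackets and quotes around each service/URL pair.
        pair = pair.strip('["]')
        if not pair:
            continue
        service, service_url = pair.split('","')
        if service in PUBLIC_SERVICES:
            return service_url
        if service not in IGNORED_SERVICES:
            # Unlike the script, raise instead of exiting so the caller decides.
            raise ValueError('Unknown service: %s %s' % (service, service_url))
    return None

# Hypothetical value in the format the parsing code implies.
sample = ('["login-openid","http://openid.example.com/alice"],'
          '["login-twitter","http://twitter.com/alice"]')
print(get_poster_url(sample))
```

The openid pair is skipped and the Twitter URL is returned; a value that contains only ignored services yields None.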

Here is the main loop that goes through every entry in the JSKit file to extract each comment and its associated info: commenter name, IP address, and potentially their web address:

  • empties = []
  • for xmlfile in args:
    • pages = minidom.parse(xmlfile)
    • for page in pages.getElementsByTagName('channel'):
      • pageURL = page.getElementsByTagName('link')[0].firstChild.data
      • comments = page.getElementsByTagName('item')
      • if comments:
        • print pageURL
        • url = urlparse.urlparse(pageURL)
        • for comment in comments:
          • pubDate = getEntry(comment, 'pubDate')
          • #not too sure about this--it ignores the time zone information
          • pubDate = datetime.datetime.strptime(pubDate, '%a, %d %b %Y %H:%M:%S +0000')
          • originalComment = getEntry(comment, 'description')
          • ipAddress = getValue(comment, 'IP')
          • poster = getEntry(comment, 'author')
          • if not poster:
            • poster = getValue(comment, 'user_identity')
          • if not poster:
            • poster = 'Guest'
          • posterURL = None
          • webpresence = getValue(comment, 'Webpresence')
          • if webpresence:
            • posterURL = getPosterURL(webpresence)
          • if not posterURL:
            • posterURL = ''
          • print "\t", pubDate
          • print "\t\tIP:", ipAddress
          • print "\t\tPoster:", poster
          • print "\t\tSnippet:", originalComment[:100].replace("\n", ' ')
          • if posterURL:
            • print "\t\tPoster’s web URL:", posterURL
        • print
      • else:
        • empties.append(pageURL)
  • # after all files are processed, list the pages that had no comments
  • if empties:
    • print 'Listed pages with no comments:'
    • print "\t"+"\n\t".join(empties)

This is pretty basic XML parsing in Python. If there is no name for the commenter, they get the name “Guest”. Python 2’s datetime.datetime.strptime can’t parse a time zone offset, which is why the format string hard-codes “+0000”; as far as I can tell, JSKit always provides the pubDate in universal time. So make sure that your database also expects universal time.
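If you are on Python 3, the offset can be parsed rather than hard-coded. The following is a sketch only, assuming the pubDate values follow the RFC 822 style that the strptime format above implies; the sample dates are made up, and parse_pub_date is a hypothetical helper name. It uses email.utils.parsedate_to_datetime from the standard library.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def parse_pub_date(pub_date):
    """Parse an RFC 822 style date string and normalize it to UTC."""
    parsed = parsedate_to_datetime(pub_date)
    if parsed.tzinfo is None:
        # No usable offset (for example "-0000"): assume it is already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

# Made-up dates: the same instant written with two different offsets.
print(parse_pub_date('Mon, 30 Jan 2012 14:05:00 +0000'))
print(parse_pub_date('Mon, 30 Jan 2012 09:05:00 -0500'))
```

Both calls produce the same UTC time. If your database stores naive universal times, dropping the tzinfo after normalizing (with .replace(tzinfo=None)) should give the same values the original script produces.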

The JSKit file also contains entries for pages that don’t have comments. Since I don’t understand why they are there, I also list out all of the “empties” at the end of the script. They don’t appear to be pages that had comments but which no longer do; at least one page on my site that used to have comments is not in that list.

October 29, 2012: Parsing JSKit/Echo XML using PHP

According to dpusa in the comments, you can manually insert comments into WordPress using something like:

  • $data = array(
    • 'comment_post_ID' => 256,
    • 'comment_author' => 'Dave',
    • 'comment_author_email' => 'dave@example.com',
    • 'comment_author_url' => 'http://hal.example.com',
    • 'comment_content' => 'Lorem ipsum dolor sit amet...',
    • 'comment_author_IP' => '127.3.1.1',
    • 'comment_agent' => 'manual insertion',
    • 'comment_date' => date('Y-m-d H:i:s'),
    • 'comment_date_gmt' => date('Y-m-d H:i:s'),
    • 'comment_approved' => 1,
  • );
  • $comment_id = wp_insert_comment($data);

In PHP, you should be able to loop through a jskit XML file using something like:

  • $comments = simplexml_load_file("/path/to/comments.xml");
  • function getJSKitAttribute($item, $key) {
    • $attribute = $item->xpath('./jskit:attribute[@key="' . $key . '"]/@value');
    • // xpath() returns an array of matches; it is empty when the key is absent
    • return $attribute ? (string) $attribute[0] : null;
  • }
  • foreach ($comments as $page) {
    • if ($page->item) {
      • $pageURL = $page->link;
      • echo $pageURL, "\n";
      • foreach ($page->item as $comment) {
        • $date = $comment->pubDate;
        • $text = $comment->description;
        • $IP = getJSKitAttribute($comment, 'IP');
        • echo "\t", substr($text, 0, 80), "\n";
        • echo "\t\t", $date, "\n";
        • echo "\t\t", $IP, "\n";
      • }
      • echo "\n";
    • }
  • }

You could then fill out the $data array with the values of $date, $text, $IP, etc., or hard-code them to default values if they don’t exist. Do this in place of (or in addition to) the three “echo” lines.

  • $data = array(
    • 'comment_post_ID' => $comment->guid,
    • 'comment_author' => $comment->author,
    • 'comment_content' => $text,
    • 'comment_author_IP' => $IP,
    • 'comment_agent' => 'manual insertion',
    • // wp_insert_comment expects a 'Y-m-d H:i:s' string, not a Unix timestamp
    • 'comment_date_gmt' => gmdate('Y-m-d H:i:s', strtotime($date)),
    • 'comment_approved' => 1,
  • );
  • $comment_id = wp_insert_comment($data);