Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Nisus “clean HTML” macro

Jerry Stratton, March 16, 2008

I’ve recently switched to Nisus Writer Pro for most of my writing. It’s a lot easier to work with than Microsoft Word (for me), especially when it comes to (a) writing and (b) maintaining styles.

One thing that Word arguably does better than Nisus is create HTML from its documents. While Word fills up its HTML-ified documents with a lot of extra crap, it at least creates somewhat structured documents. If you can throw out the crap, a Word-created HTML file is fairly reasonable. Nisus turns pretty much everything into paragraphs, even headlines. Like Word, it tries to recreate the print formatting of the document in HTML styles. Unlike Word, it uses arbitrary class names instead of reusing the document’s own style names in the HTML. So, not only is the HTML unstructured, but its layout is also difficult to modify.

I prefer to use different layouts on web pages than I use for print documents. The web is not print, and layouts that work in print are often wildly inappropriate for web browsing. What I want from Nisus is for it to keep headlines as Hx tags and to create predictable style classes so that I can optimize them for web viewing.

One thing Nisus has that leaves Word in the dust, however, is a serious scripting language. Nisus has a simple scripting language built in, and it has an advanced scripting language built in. The advanced scripting language in Nisus is Perl, and Nisus scripts can move in and out of Perl easily as needed.

  • $currentParagraph = Read Selection
  • Begin Perl
    • chomp($currentParagraph);
    • $currentParagraph = reverse($currentParagraph);
  • End
  • Type Text $currentParagraph

This script grabs the current selection from Nisus, then switches into Perl to work on it (reversing the text). Finally, it comes back out of Perl and types the now-reversed text in Nisus.

It turns out to be not particularly difficult to write a Nisus “macro” that will write an entire document to structured HTML, retaining tables and images. The only thing it loses is character-based styles. I don’t use them much, and hopefully, a future version of Nisus will obsolete this script so that I won’t have to worry about them.

The trick is in the RTF

Nisus can grab your selections in two formats: a basic text format that can be written out as paragraphs, and the RTF of the same selection, which contains all of the style information for that selection. RTF is not easy to read, but in this case it’s a little simpler because we don’t have to look at the RTF for an entire document; we can deal with it a paragraph at a time.

The script can loop through every paragraph, and use the “standard” version of the paragraph for writing to the HTML while using the RTF version to get the style name and any embedded images.

For example, here is a quick script that extracts every image in your document, and writes them to a folder called “images”.

  • #extract all images in document to “images” folder
  • $currentFolder = Document Property "enclosing folder path"
  • $imageFolder = "$currentFolder/images"
  • If File Exists $imageFolder
    • Prompt "Re-use existing folder?", "An images folder already exists. Cancel this script or overwrite existing items in that folder?", "Overwrite"
  • Else
    • #ensure that folder exists
    • Begin Perl
      • mkdir($imageFolder);
    • End
  • End
  • #get to just in front of the first image
  • Select Image 1
  • Select Start
  • $imageCount = 0
  • While Select Next Image
    • $imageCount += 1
    • $currentImage = Read Selection
    • $imageRTF = Encode RTF $currentImage
    • Begin Perl
      • #only write a file if this image is stored as a PNG
      • if ($imageRTF =~ /\\pngblip ([^}]+)/) {
        • $image = pack("H*", $1);
        • $imageFileName = "image_$imageCount.png";
        • $imageFilePath = "$imageFolder/$imageFileName";
        • if (open $imageHandle, ">", $imageFilePath) {
          • #the image is binary data, not text
          • binmode($imageHandle);
          • print $imageHandle $image;
          • close($imageHandle);
        • }
      • }
    • End
  • End

From “Begin Perl” to “End” is all Perl; the rest is Nisus. Variables created in Nisus can be used—and changed—by Perl, as $imageCount and $imageFolder are here. Variables created in Perl are not available in Nisus.

This script:

  1. Gets the folder where the current document lives;
  2. Gives a warning if there is already an “images” folder in the current folder, or creates the folder if it doesn’t exist;
  3. Moves the selection to just in front of the first image;
  4. Selects each image in turn and:
    1. Adds one to the image count; this will be used for the filename;
    2. Gets the selected image;
    3. Converts the image to RTF;
    4. Grabs the PNG from the RTF in Perl;
    5. Converts the PNG back to binary in Perl;
    6. Creates the filename as “image_” and the image number in Perl;
    7. Writes the binary data to that filename in Perl.

It isn’t necessary to know how to read RTF, just to look for the specific code you need. In this case, Nisus stores images in RTF as a “pngblip” (a PNG image): the control word is followed by a space and then the image data as hexadecimal text, ending at the closing curly bracket.
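
To see the “pngblip” extraction on its own, outside of Nisus, here is a minimal Perl sketch. The RTF fragment is made up for illustration (its hex digits are only the eight-byte PNG signature, not a full image), the name extractPng is hypothetical, and stripping whitespace from the hex data is an extra precaution that the macro above does not take.

```perl
use strict;
use warnings;

# Return the binary PNG data from an RTF fragment, or nothing if the
# fragment holds no pngblip. (extractPng is a hypothetical helper name.)
sub extractPng {
    my ($rtf) = @_;
    return unless $rtf =~ /\\pngblip ([^}]+)/;
    my $hex = $1;
    $hex =~ s/\s+//g;            # precaution: drop line breaks in the hex data
    return pack("H*", $hex);     # every two hex digits become one byte
}

# Made-up fragment: the hex digits are only the PNG signature.
my $sample = '{\pict\pngblip 89504e470d0a1a0a}';
my $png = extractPng($sample);
printf "%d bytes, starts with %s\n", length($png), substr($png, 1, 3);
```

Run from the command line, this should print “8 bytes, starts with PNG”.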

You might find this Nisus macro useful for inspecting a selection’s RTF:

  • #convert the current selection to RTF and put in new document
  • $currentParagraph = Read Selection
  • $currentRTF = Encode RTF $currentParagraph
  • New
  • View:Draft View
  • Insert Text $currentRTF

Being able to extract images is a big step towards being able to create clean HTML from Nisus documents.

Clean HTML

Here’s the script I currently have. Except for character-level styles, it works with everything I currently throw at it. There are some things it won’t work with, such as multiple levels of lists. I don’t use those in any of the documents I used for testing, so I haven’t added that functionality in (I’m not sure how easy it would even be).

The script moves some common functionality into a separate file I called “nisus.nwm”; more on that below.

  • $pageName = Document Property "file name without extension"
  • $title = $pageName
  • Begin Perl
    • require "/Users/jerry/bin/nisus.nwm";
    • $title = cleanText($title);
    • $pageName = slugify($pageName);
  • End
  • $currentFolder = Document Property "enclosing folder path"
  • $htmlPage = "$currentFolder/$pageName.html"
  • If File Exists $htmlPage
    • Prompt "Erase existing file?", "The file $htmlPage already exists. Cancel this script or overwrite the existing file?", "Overwrite"
  • End
  • Write to File "<html>\n", $htmlPage
  • Append to File "\t<head>\n", $htmlPage
  • Append to File "\t\t<title>$title</title>\n", $htmlPage
  • Append to File "\t\t<link href=\"$pageName.css\" rel=\"StyleSheet\" media=\"all\" />\n", $htmlPage
  • Append to File "\t</head>\n", $htmlPage
  • Select Paragraph 1
  • Select Start
  • Append to File "\t<body>\n", $htmlPage
  • $tabs = "\t\t"
  • $inList = false
  • $previousRow = 0
  • $previousColumn = 0
  • $listType = ""
  • $imageCount = 0
  • While Select Next Paragraph
    • $currentParagraph = Read Selection
    • $row = Selection Row Index
    • $column = Selection Column Index
    • $currentRTF = Encode RTF $currentParagraph
    • $precedingTag = ""
    • Begin Perl
      • require "/Users/jerry/bin/nisus.nwm";
      • $currentParagraph = cleanText($currentParagraph);
      • @precedingTags = ();
      • ($tag, $style) = parseParagraph($currentRTF);
      • #deal with ending lists
      • if ($inList && !isList($currentParagraph)) {
        • $inList = 0;
        • chop $tabs;
        • $precedingTags[$#precedingTags+1] = "$tabs</$listType>";
      • }
      • #handle tables
      • $needCell = 0;
      • if ($row > $previousRow) {
        • #new row
        • if ($row == 1) {
          • #new table
          • $precedingTags[$#precedingTags+1] = "$tabs<table>";
          • $tabs .= "\t";
        • } else {
          • #new row of existing table
          • chop $tabs;
          • $precedingTags[$#precedingTags+1] = "$tabs</td>";
          • chop $tabs;
          • $precedingTags[$#precedingTags+1] = "$tabs</tr>";
        • }
        • $precedingTags[$#precedingTags+1] = "$tabs<tr>";
        • $tabs .= "\t";
        • $needCell = 1;
        • $previousColumn = 0;
      • } elsif ($previousRow > $row) {
        • #end of table
        • chop $tabs;
        • $precedingTags[$#precedingTags+1] = "$tabs</td>";
        • chop $tabs;
        • $precedingTags[$#precedingTags+1] = "$tabs</tr>";
        • chop $tabs;
        • $precedingTags[$#precedingTags+1] = "$tabs</table>";
      • } elsif ($column > $previousColumn) {
        • #new column in existing row
        • chop $tabs;
        • $precedingTags[$#precedingTags+1] = "$tabs</td>";
        • $needCell = 1;
      • }
      • #need to open any cell(s)?
      • if ($needCell) {
        • #handle any empty cells before this one
        • $columnCount = $column-$previousColumn;
        • while ($columnCount>1) {
          • $precedingTags[$#precedingTags+1] = "$tabs<td></td>";
          • $columnCount--;
        • }
        • $precedingTags[$#precedingTags+1] = "$tabs<td>";
        • $tabs .= "\t";
      • }
      • #handle new lists
      • if ($tag eq "p") {
        • if (($newParagraph, $newList) = isList($currentParagraph)) {
          • $currentParagraph = $newParagraph;
          • $listType = $newList;
          • $tag = "li";
          • if (!$inList) {
            • $inList = 1;
            • $precedingTags[$#precedingTags+1] = "$tabs<$listType>";
            • $tabs .= "\t";
          • }
        • }
      • }
      • #is there an image here?
      • $imageFile = "";
      • while ($currentRTF =~ /\\pngblip ([^}]+)/) {
        • $image = pack("H*", $1);
        • $currentRTF =~ s/\\pngblip [^}]+//;
        • if ($image ne $oldImage) {
          • $imageCount++;
          • $imageFolder = "$currentFolder/images";
          • mkdir($imageFolder);
          • $imageFileName = "$imageCount.png";
          • $imageFilePath = "$imageFolder/$imageFileName";
          • if (open $imageHandle, ">", $imageFilePath) {
            • print $imageHandle $image;
            • close($imageHandle);
          • }
          • if (($style ne "image") && $currentParagraph) {
            • if ($imageCount % 2 == 1) {
              • $order = "odd";
            • } else {
              • $order = "even";
            • }
            • $currentParagraph = "<img class=\"inline $order\" src=\"images/$imageFileName\" />" . $currentParagraph;
          • } else {
            • if (!$style) {
              • $style = "image";
            • }
            • $currentParagraph .= "<img src=\"images/$imageFileName\" />";
          • }
        • }
        • $oldImage = $image;
      • }
      • #create tag and class
      • if ($style) {
        • $style = slugify($style);
        • $startTag = "<$tag class=\"$style\">";
      • } else {
        • $startTag = "<$tag>";
      • }
      • $endTag = "</$tag>";
      • if ($currentParagraph) {
        • $currentParagraph = "$tabs$startTag$currentParagraph$endTag";
      • }
      • $precedingTag = join("\n", @precedingTags);
    • End
    • If $precedingTag
      • Append to File "$precedingTag\n", $htmlPage
    • End
    • If $currentParagraph
      • Append to File "$currentParagraph\n", $htmlPage
    • End
    • $previousRow = $row
    • $previousColumn = $column
  • End
  • $closer = false
  • Begin Perl
    • #check for open items
    • @closers = ();
    • #lists
    • if ($inList) {
      • chop $tabs;
      • $closers[$#closers+1] = "$tabs</$listType>";
    • }
    • #tables
    • if ($previousRow) {
      • chop $tabs;
      • $closers[$#closers+1] = "$tabs</td>";
      • chop $tabs;
      • $closers[$#closers+1] = "$tabs</tr>";
      • chop $tabs;
      • $closers[$#closers+1] = "$tabs</table>";
    • }
    • $closer = join("\n", @closers);
  • End
  • If $closer
    • Append to File "$closer\n", $htmlPage
  • End
  • Append to File "\t</body>\n", $htmlPage
  • Append to File "</html>", $htmlPage

That looks fairly complex, but what it does is easy enough to grasp: it goes through each paragraph and writes it out to an HTML file. It uses the paragraph text to get the content, and it uses the paragraph RTF to get class names for styles, as well as heading levels.
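
Stripped of the Nisus parts, the per-paragraph step comes down to something like the following sketch. The names classNameFor and wrapParagraph and the “Block Quote” style are invented for illustration; the macro itself does this work with slugify and the “create tag and class” section.

```perl
use strict;
use warnings;

# Turn a style name from the document into a CSS-friendly class name.
sub classNameFor {
    my ($style) = @_;
    $style =~ s/^\s+|\s+$//g;    # trim stray spaces
    $style =~ s/\s+/_/g;         # spaces become underscores
    return lc $style;
}

# Wrap one paragraph's text in its tag, adding a class only when the
# paragraph has a non-default style.
sub wrapParagraph {
    my ($tag, $style, $text) = @_;
    my $start = $style
        ? sprintf('<%s class="%s">', $tag, classNameFor($style))
        : "<$tag>";
    return "$start$text</$tag>";
}

print wrapParagraph("h2", "", "Clean HTML"), "\n";
print wrapParagraph("p", "Block Quote", "Mimsy were the borogoves."), "\n";
```

The first call prints a plain h2; the second prints a paragraph with the class “block_quote”, which is the kind of predictable class name a stylesheet can target.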

Along the way it checks for lists and tables to make sure that those get converted correctly.
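
The table handling is the least obvious part, so here is a simplified, standalone sketch of the same idea: compare each paragraph’s row and column with the previous paragraph’s, and open or close tags on each change. It assumes a row index of 0 means “not in a table”; the name tableTags and the sample data are invented. Unlike the macro, it does not fill in skipped empty cells or deal with lists.

```perl
use strict;
use warnings;

# Each paragraph is [row, column, text]. Row 0 is assumed to mean
# "not inside a table"; the sample values below are invented.
sub tableTags {
    my @paragraphs = @_;
    my @out;
    my ($previousRow, $previousColumn) = (0, 0);
    for my $paragraph (@paragraphs) {
        my ($row, $column, $text) = @$paragraph;
        if ($row != $previousRow) {
            # leaving a table row: close its last cell and the row
            push @out, "</td>", "</tr>" if $previousRow;
            if ($row == 0) {
                push @out, "</table>";
            } elsif ($previousRow == 0) {
                push @out, "<table>";
            }
            # entering a table row: open the row and its first cell
            push @out, "<tr>", "<td>" if $row;
        } elsif ($row && $column != $previousColumn) {
            # same row, next cell
            push @out, "</td>", "<td>";
        }
        push @out, "<p>$text</p>";
        ($previousRow, $previousColumn) = ($row, $column);
    }
    # the document may end while a table is still open
    push @out, "</td>", "</tr>", "</table>" if $previousRow;
    return @out;
}

print join("\n", tableTags(
    [0, 0, "Intro"],
    [1, 1, "A"], [1, 2, "B"],
    [2, 1, "C"],
)), "\n";
```

The final check after the loop plays the same role as the “check for open items” block at the end of the macro: a document that ends inside a table still gets its closing tags.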

Support file

The above script calls a support file twice. You’ll need to store this file on your system and put your path where it gets required:

  • #Nisus subroutines
  • #edit in Nisus for encoding
  • use utf8;
  • sub cleanText {
    • my($text) = shift;
    • chomp $text;
    • $text =~ s/[ ]+$//;
    • $text =~ s/&/&amp;/g;
    • $text =~ s/‘/&lsquo;/g;
    • $text =~ s/’/&rsquo;/g;
    • $text =~ s/“/&ldquo;/g;
    • $text =~ s/”/&rdquo;/g;
    • $text =~ s/…/&hellip;/g;
    • $text =~ s/—/&mdash;/g;
    • $text =~ s/©/&copy;/g;
    • $text =~ s/ë/&euml;/g;
    • $text =~ s/é/&eacute;/g;
    • $text =~ s/\x{FFFC}//g;
    • $text =~ s/\x{2028}/<br \/>\n/g;
    • $text =~ s/\x{0C}//g;
    • $text =~ s/\x{0A}//g;
    • $text =~ s/^ +//;
    • $text =~ s/ +$//;
    • return $text;
  • }
  • sub slugify {
    • my($text) = shift;
    • $text = cleanText($text);
    • $text =~ s/\&[a-z]+\;//g;
    • $text =~ s/ /_/g;
    • $text = lc($text);
    • return $text;
  • }
  • #decide current tag
  • sub parseParagraph {
    • my($RTF) = shift;
    • my($tag) = "p";
    • my($style) = "";
    • #is this a heading?
    • if ($RTF =~ /\\tcl([0-9]) /) {
      • my($headingLevel) = $1;
      • if ($headingLevel) {
        • $tag = "h$headingLevel";
      • }
    • }
    • #get the style if one exists
    • #be careful not to get other info such as font name
    • if ($RTF =~ /\\tcl[0-9] ([^;\\]+);}/) {
      • $style = $1;
    • } elsif ($RTF =~ /\in0 ([a-z0-9 ]+);}/i) {
      • $style = $1;
    • } elsif ($RTF =~ /[0-9]+ ([a-z0-9 ]+);}/i) {
      • $style = $1;
    • }
    • #some styles are just the defaults, and don't need classes
    • if ($style eq "Normal") {
      • $style = "";
    • }
    • if ($style =~ /^Heading [0-9]$/) {
      • $style = "";
    • }
    • return ($tag, $style);
  • }
  • sub isList {
    • my($text) = shift;
    • if ($text =~ s/^[0-9]+\.\t//) {
      • return ($text, 'ol');
    • } elsif ($text =~ s/^&bull;\t//) {
      • return ($text, 'ul');
    • } else {
      • return ();
    • }
  • }
  • 1;

There’s probably a better way in Perl to convert diacriticals and other special characters to their HTML entities. Notice that the top of the file tells Perl to “use utf8”; that’s what Nisus uses when it sends text to your script. It tells Perl this automatically in the scripts it calls directly, but you need to specify it in any required files.
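
One possible “better way”, offered only as a sketch: keep the known entities in a lookup table and fall back to a numeric entity for any other non-ASCII character, so that a new diacritical doesn’t need its own substitution line. The name encodeEntities and the table contents are assumptions, not part of the support file above.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal non-ASCII characters

# Hypothetical lookup table; extend it with whatever your documents use.
my %entityFor = (
    "&" => "&amp;",
    "‘" => "&lsquo;",
    "’" => "&rsquo;",
    "“" => "&ldquo;",
    "”" => "&rdquo;",
    "…" => "&hellip;",
    "—" => "&mdash;",
    "©" => "&copy;",
    "ë" => "&euml;",
    "é" => "&eacute;",
);

sub encodeEntities {
    my ($text) = @_;
    # Replace "&" and every non-ASCII character in a single pass, so the
    # "&" of an entity we just wrote is never escaped a second time.
    # Characters missing from the table become numeric entities.
    $text =~ s/([&\x{80}-\x{10FFFF}])/$entityFor{$1} || sprintf("&#%d;", ord($1))/ge;
    return $text;
}

print encodeEntities("Café & “crème”"), "\n";
```

This should print “Caf&eacute; &amp; &ldquo;cr&#232;me&rdquo;”. If installing modules is an option, the HTML::Entities module on CPAN provides an encode_entities function that covers far more named entities than a hand-built table.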

Wish list

Obviously it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.

Less obviously, named sections would be useful. It would then be possible to put a DIV around each section with a predictable name, for applying styles to that section.

And if there were an Encode HTML equivalent to Encode RTF, this might allow me to maintain character-level styles.

Of course, the ultimate wish is for Nisus’s save as HTML to do all this automatically.

Note that Nisus also has what looks to be a very useful AppleScript dictionary.

And one final caveat: I don’t know RTF; that this works for me is no guarantee that it will work for you. It probably won’t.
