Nisus “clean HTML” macro

Jerry Stratton, March 16, 2008

I’ve recently switched to Nisus Writer Pro for most of my writing. It’s a lot easier to work with than Microsoft Word (for me), especially when it comes to (a) writing and (b) maintaining styles.

One thing that Word arguably does better than Nisus is create HTML from its documents. While Word fills up its HTML-ified documents with a lot of extra crap, it at least creates somewhat structured documents. If you can throw out the crap, a Word-created HTML file is fairly reasonable. Nisus turns pretty much everything into paragraphs, even headlines. Like Word, it tries to recreate the print formatting of the document in HTML styles. Unlike Word, it uses arbitrary class names instead of reusing the in-document style names in the HTML. So not only is the HTML unstructured, its layout is also difficult to modify.

I prefer to use different layouts on web pages than I use for print documents. The web is not print, and layouts that work in print are often wildly inappropriate for web browsing. What I want from Nisus is for it to keep headlines as Hx tags and to create predictable style classes so that I can optimize them for web viewing.

One thing Nisus has that leaves Word in the dust, however, is a serious scripting language. Nisus has two scripting languages built in: a simple macro language and an advanced one. The advanced scripting language in Nisus is Perl, and Nisus scripts can move in and out of Perl easily as needed.

$currentParagraph = Read Selection
Begin Perl
- chomp($currentParagraph);
- $currentParagraph = reverse($currentParagraph);
End
Type Text $currentParagraph

This script grabs the current selection from Nisus, and then switches into Perl to work on the selection (reversing the text). Finally, it comes back out of Perl and in Nisus types the now reversed text.
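
The Perl half of that macro can be tried on its own. Here is a minimal sketch, assuming a hard-coded string stands in for what Nisus would supply from Read Selection. It also shows why the assignment matters: reverse only reverses characters when it is called in scalar context.

```perl
use strict;
use warnings;

# Stand-in for the text Nisus would hand over from "Read Selection".
my $selection = "Nisus Writer Pro\n";

chomp($selection);

# Assigning to a scalar puts reverse() in scalar context,
# so it reverses the characters of the string.
my $reversed = reverse($selection);

# In list context reverse() reverses the list of arguments instead;
# with a single argument the text comes back unchanged.
my ($unchanged) = reverse($selection);

print "$reversed\n";    # orP retirW susiN
print "$unchanged\n";   # Nisus Writer Pro
```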

It turns out to be not particularly difficult to write a Nisus “macro” that will write an entire document to structured HTML, retaining tables and images. The only thing it loses is character-based styles. I don’t use them much, and hopefully, a future version of Nisus will obsolete this script so that I won’t have to worry about them.

The trick is in the RTF

Nisus can grab your selections in two formats: a plain-text format that can be written out as paragraphs, and the RTF of the same selection, which contains all of the style information for that selection. RTF is not easy to read, but in this case it’s a little simpler because we don’t have to look at the RTF for an entire document; we can deal with it a paragraph at a time.

The script can loop through every paragraph, and use the “standard” version of the paragraph for writing to the HTML while using the RTF version to get the style name and any embedded images.

For example, here is a quick script that extracts every image in your document, and writes them to a folder called “images”.

#extract all images in document to “images” folder
$currentFolder = Document Property "enclosing folder path"
$imageFolder = "$currentFolder/images"
If File Exists $imageFolder
- Prompt "Re-use existing folder?", "An images folder already exists. Cancel this script or overwrite existing items in that folder?", "Overwrite"
Else
- #ensure that folder exists
- Begin Perl
  - mkdir($imageFolder);
- End
End
#get to just in front of the first image
Select Image 1
Select Start
$imageCount = 0
While Select Next Image
- $imageCount += 1
- $currentImage = Read Selection
- $imageRTF = Encode RTF $currentImage
- Begin Perl
  - #only write a file if the RTF really contains PNG data;
  - #otherwise $1 would still hold the previous match
  - if ($imageRTF =~ /\\pngblip ([^}]+)/) {
    - $image = pack("H*", $1);
    - $imageFileName = "image_$imageCount.png";
    - $imageFilePath = "$imageFolder/$imageFileName";
    - if (open $imageHandle, ">", $imageFilePath) {
      - print $imageHandle $image;
      - close($imageHandle);
    - }
  - }
- End
End

From “Begin Perl” to “End” is all Perl; the rest is Nisus. Variables created in Nisus can be used—and changed—by Perl, as $imageCount and $imageFolder are here. Variables created in Perl are not available in Nisus.

This script:

Gets the folder where the current document lives;
Gives a warning if there is already an “images” folder in the current folder, or creates the folder if it doesn’t exist;
Moves the selection to just in front of the first image;
Selects each image in turn and:
1. Adds one to the image count; this will be used for the filename;
2. Gets the selected image;
3. Converts the image to RTF;
4. Grabs the PNG from the RTF in Perl;
5. Converts the PNG back to binary in Perl;
6. Creates the filename as “image_” and the image number in Perl;
7. Writes the binary data to that filename in Perl.

It isn’t necessary to know how to read RTF, just to look for the specific code you need. In this case, Nisus stores images in RTF as a “pngblip” (a PNG image): the keyword, a space, and then the image data written out as hexadecimal text, ending at a closing curly bracket.
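
As a rough sketch of steps 4 and 5 outside of Nisus, the following standalone Perl pulls the hex text out of an RTF fragment and packs it back into binary. The fragment is made up for illustration (it holds only the eight-byte PNG signature, not a real image), extract_png is a hypothetical helper name, and stripping whitespace from the hex is an assumption about how the data might be wrapped.

```perl
use strict;
use warnings;

# Hypothetical RTF fragment: the hex-encoded PNG signature
# stands in for a full image.
my $rtf = '{\pict\pngblip 89504e470d0a1a0a}';

sub extract_png {
    my ($fragment) = @_;
    # capture everything between "\pngblip " and the closing brace
    return unless $fragment =~ /\\pngblip ([^}]+)/;
    my $hex = $1;
    # assumption: the hex text may be broken across lines
    $hex =~ s/\s+//g;
    # turn each pair of hex digits back into one byte
    return pack("H*", $hex);
}

my $png = extract_png($rtf);
my $looksLikePNG = substr($png, 0, 8) eq "\x89PNG\r\n\x1a\n";
printf "%d bytes, PNG signature: %s\n", length($png), ($looksLikePNG ? "yes" : "no");
```

Returning nothing when the pattern does not match lets the caller skip paragraphs that contain no image, rather than packing whatever a previous match left in $1.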

You might find this Nisus macro useful for inspecting a selection’s RTF:

#convert the current selection to RTF and put in new document
$currentParagraph = Read Selection
$currentRTF = Encode RTF $currentParagraph
New
View:Draft View
Insert Text $currentRTF

Being able to extract images is a big step towards being able to create clean HTML from Nisus documents.

Clean HTML

Here’s the script I currently have. Except for character-level styles, it works with everything I currently throw at it. There are some things it won’t work with, such as multiple levels of lists. I don’t use those in any of the documents I used for testing, so I haven’t added that functionality in (I’m not sure how easy it would even be).

The script separates some common functionality into a separate file I called “nisus.nwm”; more on that below.

$pageName = Document Property "file name without extension"
$title = $pageName
Begin Perl
- require "/Users/USER/bin/nisus.nwm";
- $title = cleanText($title);
- $pageName = slugify($pageName);
End
$currentFolder = Document Property "enclosing folder path"
$htmlPage = "$currentFolder/$pageName.html"
If File Exists $htmlPage
- Prompt "Erase existing file?", "The file $htmlPage already exists. Cancel this script or overwrite the existing file?", "Overwrite"
End
Write to File "<html>\n", $htmlPage
Append to File "\t<head>\n", $htmlPage
Append to File "\t\t<title>$title</title>\n", $htmlPage
Append to File "\t\t<link href=\"$pageName.css\" rel=\"StyleSheet\" media=\"all\" />\n", $htmlPage
Append to File "\t</head>\n", $htmlPage
Select Paragraph 1
Select Start
Append to File "\t<body>\n", $htmlPage
$tabs = "\t\t"
$inList = false
$previousRow = 0
$previousColumn = 0
$listType = ""
$imageCount = 0
While Select Next Paragraph
- $currentParagraph = Read Selection
- $row = Selection Row Index
- $column = Selection Column Index
- $currentRTF = Encode RTF $currentParagraph
- $precedingTag = ""
- Begin Perl
  - require "/Users/USER/bin/nisus.nwm";
  - $currentParagraph = cleanText($currentParagraph);
  - @precedingTags = ();
  - ($tag, $style) = parseParagraph($currentRTF);
  - #deal with ending lists
  - if ($inList && !isList($currentParagraph)) {
    - $inList = 0;
    - chop $tabs;
    - $precedingTags[$#precedingTags+1] = "$tabs</$listType>";
  - }
  - #handle tables
  - $needCell = 0;
  - if ($row > $previousRow) {
    - #new row
    - if ($row == 1) {
      - #new table
      - $precedingTags[$#precedingTags+1] = "$tabs<table>";
      - $tabs .= "\t";
    - } else {
      - #new row of existing table
      - chop $tabs;
      - $precedingTags[$#precedingTags+1] = "$tabs</td>";
      - chop $tabs;
      - $precedingTags[$#precedingTags+1] = "$tabs</tr>";
    - }
    - $precedingTags[$#precedingTags+1] = "$tabs<tr>";
    - $tabs .= "\t";
    - $needCell = 1;
    - $previousColumn = 0;
  - } elsif ($previousRow > $row) {
    - #end of table
    - chop $tabs;
    - $precedingTags[$#precedingTags+1] = "$tabs</td>";
    - chop $tabs;
    - $precedingTags[$#precedingTags+1] = "$tabs</tr>";
    - chop $tabs;
    - $precedingTags[$#precedingTags+1] = "$tabs</table>";
  - } elsif ($column > $previousColumn) {
    - #new column in existing row
    - chop $tabs;
    - $precedingTags[$#precedingTags+1] = "$tabs</td>";
    - $needCell = 1;
  - }
  - #need to open any cell(s)?
  - if ($needCell) {
    - #handle any empty cells before this one
    - $columnCount = $column-$previousColumn;
    - while ($columnCount>1) {
      - $precedingTags[$#precedingTags+1] = "$tabs<td></td>";
      - $columnCount--;
    - }
    - $precedingTags[$#precedingTags+1] = "$tabs<td>";
    - $tabs .= "\t";
  - }
  - #handle new lists
  - if ($tag eq "p") {
    - if (($newParagraph, $newList) = isList($currentParagraph)) {
      - $currentParagraph = $newParagraph;
      - $listType = $newList;
      - $tag = "li";
      - if (!$inList) {
        - $inList = 1;
        - $precedingTags[$#precedingTags+1] = "$tabs<$listType>";
        - $tabs .= "\t";
      - }
    - }
  - }
  - #is there an image here?
  - $imageFile = "";
  - while ($currentRTF =~ /\\pngblip ([^}]+)/) {
    - $image = pack("H*", $1);
    - $currentRTF =~ s/\\pngblip [^}]+//;
    - if ($image ne $oldImage) {
      - $imageCount++;
      - $imageFolder = "$currentFolder/images";
      - mkdir($imageFolder);
      - $imageFileName = "$imageCount.png";
      - $imageFilePath = "$imageFolder/$imageFileName";
      - if (open $imageHandle, ">", $imageFilePath) {
        - print $imageHandle $image;
        - close($imageHandle);
      - }
      - if (($style ne "image") && $currentParagraph) {
        - if ($imageCount % 2 == 1) {
          - $order = "odd";
        - } else {
          - $order = "even";
        - }
        - $currentParagraph = "<img class=\"inline $order\" src=\"images/$imageFileName\" />" . $currentParagraph;
      - } else {
        - if (!$style) {
          - $style = "image";
        - }
        - $currentParagraph .= "<img src=\"images/$imageFileName\" />";
      - }
    - }
    - $oldImage = $image;
  - }
  - #create tag and class
  - if ($style) {
    - $style = slugify($style);
    - $startTag = "<$tag class=\"$style\">";
  - } else {
    - $startTag = "<$tag>";
  - }
  - $endTag = "</$tag>";
  - if ($currentParagraph) {
    - $currentParagraph = "$tabs$startTag$currentParagraph$endTag";
  - }
  - $precedingTag = join("\n", @precedingTags);
- End
- If $precedingTag
  - Append to File "$precedingTag\n", $htmlPage
- End
- If $currentParagraph
  - Append to File "$currentParagraph\n", $htmlPage
- End
- $previousRow = $row
- $previousColumn = $column
End
$closer = false
Begin Perl
- #check for open items
- @closers = ();
- #lists
- if ($inList) {
  - chop $tabs;
  - $closers[$#closers+1] = "$tabs</$listType>";
- }
- #tables
- if ($previousRow) {
  - chop $tabs;
  - $closers[$#closers+1] = "$tabs</td>";
  - chop $tabs;
  - $closers[$#closers+1] = "$tabs</tr>";
  - chop $tabs;
  - $closers[$#closers+1] = "$tabs</table>";
- }
- $closer = join("\n", @closers);
End
If $closer
- Append to File "$closer\n", $htmlPage
End
Append to File "\t</body>\n", $htmlPage
Append to File "</html>", $htmlPage

That looks fairly complex, but what it does is easy enough to grasp: it goes through each paragraph and writes it out to an HTML file. It uses the paragraph text to get the content, and it uses the paragraph RTF to get class names for styles, as well as heading levels.

Along the way it checks for lists and tables to make sure that those get converted correctly.
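
The list handling is easier to see in isolation. This is a simplified sketch rather than the macro itself: it assumes the paragraphs arrive as a plain Perl array, it ignores tables and indentation, and the names is_list and paragraphs_to_html are made up for the example. Unlike the macro, it also closes the open list when the list type changes.

```perl
use strict;
use warnings;
use utf8;

# Same test as the support file's isList: a number, a period and a tab
# marks an ordered item; a bullet and a tab marks an unordered item.
sub is_list {
    my ($text) = @_;
    return ($text, 'ol') if $text =~ s/^[0-9]+\.\t//;
    return ($text, 'ul') if $text =~ s/^•\t//;
    return;
}

sub paragraphs_to_html {
    my @paragraphs = @_;
    my @html;
    my $openList = '';    # tag of the list currently open, if any
    for my $paragraph (@paragraphs) {
        my ($item, $type) = is_list($paragraph);
        if (defined $type) {
            # open a list, or switch lists, before the first item
            if ($openList ne $type) {
                push @html, "</$openList>" if $openList;
                push @html, "<$type>";
                $openList = $type;
            }
            push @html, "\t<li>$item</li>";
        } else {
            # an ordinary paragraph ends any open list
            if ($openList) {
                push @html, "</$openList>";
                $openList = '';
            }
            push @html, "<p>$paragraph</p>";
        }
    }
    # the document may end while a list is still open
    push @html, "</$openList>" if $openList;
    return @html;
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for paragraphs_to_html("Intro", "1.\tFirst", "2.\tSecond", "Outro");
```

The final push mirrors the “check for open items” block at the end of the macro: a list that runs to the end of the document still needs its closing tag.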

Support file

The above script requires a support file in two places. You’ll need to store this file on your system and put your own path in both require lines:

#Nisus subroutines
#edit in Nisus for encoding
use utf8;
sub cleanText {
- my($text) = shift;
- chomp $text;
- $text =~ s/[ ]+$//;
- $text =~ s/&/&amp;/g;
- $text =~ s/‘/&lsquo;/g;
- $text =~ s/’/&rsquo;/g;
- $text =~ s/“/&ldquo;/g;
- $text =~ s/”/&rdquo;/g;
- $text =~ s/…/&hellip;/g;
- $text =~ s/—/&mdash;/g;
- $text =~ s/©/&copy;/g;
- $text =~ s/ë/&euml;/g;
- $text =~ s/é/&eacute;/g;
- $text =~ s/\x{FFFC}//g;
- $text =~ s/\x{2028}/<br \/>\n/g;
- $text =~ s/\x{0C}//g;
- $text =~ s/\x{0A}//g;
- $text =~ s/^ +//;
- $text =~ s/ +$//;
- return $text;
}
sub slugify {
- my($text) = shift;
- $text = cleanText($text);
- $text =~ s/\&[a-z]+\;//g;
- $text =~ s/ /_/g;
- $text = lc($text);
- return $text;
}
#decide current tag
sub parseParagraph {
- my($RTF) = shift;
- my($tag) = "p";
- my($style) = "";
- #is this a heading?
- if ($RTF =~ /\\tcl([0-9]) /) {
  - my($headingLevel) = $1;
  - if ($headingLevel) {
    - $tag = "h$headingLevel";
  - }
- }
- #get the style if one exists
- #be careful not to get other info such as font name
- if ($RTF =~ /\\tcl[0-9] ([^;\\]+);}/) {
  - $style = $1;
- } elsif ($RTF =~ /\in0 ([a-z0-9 ]+);}/i) {
  - $style = $1;
- } elsif ($RTF =~ /[0-9]+ ([a-z0-9 ]+);}/i) {
  - $style = $1;
- }
- #some styles are just the defaults, and don't need classes
- if ($style eq "Normal") {
  - $style = "";
- }
- if ($style =~ /^Heading [0-9]$/) {
  - $style = "";
- }
- return ($tag, $style);
}
sub isList {
- my($text) = shift;
- if ($text =~ s/^[0-9]+\.\t//) {
  - return ($text, 'ol');
- } elsif ($text =~ s/^•\t//) {
  - return ($text, 'ul');
- } else {
  - return ();
- }
}
1;

There’s probably a better way in Perl to convert diacriticals and other special characters to their HTML entities. Notice that the top of the file tells Perl to “use utf8”; that’s what Nisus uses when it sends text to your script. It tells Perl this automatically in the scripts it calls directly, but you need to specify it in any required files.
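
One possible alternative, sketched here with core Perl only, is to replace every non-ASCII character with a numeric character reference instead of listing characters one by one. The name encode_entities_numeric is hypothetical and not part of the support file; escaping < and > is an extra assumption, so it would have to run before cleanText inserts its own br tags. If the CPAN module HTML::Entities is installed, its encode_entities function is another option for named entities.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the support file above.
sub encode_entities_numeric {
    my ($text) = @_;
    # escape the markup characters first, so the ampersands
    # added by the last step are not escaped a second time
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    # replace each non-ASCII character with its numeric reference
    $text =~ s/([^\x00-\x7F])/sprintf("&#%d;", ord($1))/ge;
    return $text;
}

# curly quotes, e-acute and a copyright sign, written as escapes
my $sample = "\x{201C}Caf\x{E9}\x{201D} & \x{A9}";
print encode_entities_numeric($sample), "\n";
# prints: &#8220;Caf&#233;&#8221; &amp; &#169;
```

Numeric references are less readable in the HTML source than named entities, but nothing has to be added to the script when a new accented character turns up. Note that slugify only strips named entities, so it would need a matching change.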

Wish list

Obviously it’d be a whole lot nicer (and likely more reliable) if I could get the style names directly from Nisus instead of having to parse them out of the RTF.

Less obviously, named sections would be useful. It would then be possible to put a DIV around each section with a predictable name, for applying styles to that section.

And if there were an Encode HTML equivalent to Encode RTF, this might allow me to maintain character-level styles.

Of course, the ultimate wish is for Nisus’s save as HTML to do all this automatically.

Note that Nisus also has what looks to be a very useful AppleScript dictionary.

And one final caveat: I don’t know RTF; that this works for me is no guarantee that it will work for you. It probably won’t.

Contents of Negative Space™ as a whole Copyright © 1994-2024 Jerry Stratton. Individual copyrights remain held by their respective authors unless they specify otherwise. Site titles, such as Negative Space, Strange Bedfellows, Biblyon Broadsheet, Highland Games, and FireBlade Coffeehouse are trademarks of Jerry Stratton.

Code and code snippets, to the extent that they are copyrightable, may be re-distributed under the terms of the GNU General Public License 3.

Nisus “clean HTML” macro last modified May 22nd, 2011.
