Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

ht://Dig and smart quotes

Jerry Stratton, December 24, 2006

Despite its slow (and probably, currently, stalled) development, ht://Dig is a very nice little search tool for any site large enough to need one. It handles synonyms, boolean searches, and even (in 3.2) phrase searching. It is easy to create subsearches for subsections of a web site. It compiles easily on Linux and on Mac OS X. I use it for the Negative Space search page, and I’ve been very happy with it.

One big problem, however, which has become more noticeable now that I’m updating a lot of the Negative Space pages, is that it doesn’t recognize left and right single and double intelligent quotes. When such quotes appear in the text, it displays their entity codes (&ldquo;, for example) rather than the actual characters.

It does this because it keeps a specific list of valid entities and only maintains ampersand-encoded entities if it knows doing so won’t hurt the results display. Unfortunately, this list is set at compile-time. If you want to change it, you need to modify a C file and recompile.
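To make the symptom concrete, here is a minimal sketch of what a fixed list of known entities implies. It is not ht://Dig's code: decodeKnownEntities and the known map are names made up for this example, and the real codec is more involved. The only point is that an entity missing from the list comes through as its raw code.

```cpp
#include <map>
#include <string>

// Hypothetical decoder, for illustration only (not taken from ht://Dig).
// Replaces "&name;" with its mapped text when the name is in `known`,
// and leaves every other entity exactly as it was written.
std::string decodeKnownEntities(const std::string& text,
                                const std::map<std::string, std::string>& known)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < text.size()) {
        if (text[i] == '&') {
            std::string::size_type semi = text.find(';', i);
            if (semi != std::string::npos) {
                // Name between '&' and ';'
                std::string name = text.substr(i + 1, semi - i - 1);
                auto it = known.find(name);
                if (it != known.end()) {
                    out += it->second;   // known entity: substitute it
                    i = semi + 1;
                    continue;
                }
            }
        }
        out += text[i];                  // anything else is copied through
        ++i;
    }
    return out;
}
```

With only "copy" in the map, a string containing &copy; and &ldquo; comes back with the first replaced and the second still showing as &ldquo;, which is the behavior described above.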

ht://Dig 3.2

In ht://Dig 3.2.0b6, the list of valid entities is in HtSGMLCodec.cc in the htcommon directory. In that file there is a list of entities that spans several lines, with the lines looking like:

  • myTextFromString << "&uml;|&copy;|&ordf;|&laquo;|&not;|&shy;|&reg;|&macr;|&deg;|";

These lines are the list of valid entities. You can add your own line, such as:

  • myTextFromString << "&ldquo;|&rdquo;|&lsquo;|&rsquo;";

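Notice that every line but the last ends in a vertical bar, which separates one entity from the next. If you add your line after the current last line, that line will need a trailing bar of its own.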

After you make this change, you’ll need to redig your site before the change takes effect. Once it does, intelligent quotes will be displayed intelligently in search results.
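The vertical bar matters because the lines are appended into one long string that is later cut apart at each bar. The sketch below shows what happens when the bar between two lines is missing. It uses std::string and std::vector as stand-ins for ht://Dig's own string classes. splitEntities and buildList are hypothetical helpers, and "&macr;|&deg;" is only a placeholder for whatever the last line in your copy of the file happens to be.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for splitting a '|'-separated entity list.
// Not ht://Dig's implementation; it only illustrates the separator.
std::vector<std::string> splitEntities(const std::string& list)
{
    std::vector<std::string> entities;
    std::string::size_type start = 0;
    while (start <= list.size()) {
        std::string::size_type bar = list.find('|', start);
        if (bar == std::string::npos)
            bar = list.size();           // last entry runs to the end
        entities.push_back(list.substr(start, bar - start));
        start = bar + 1;
    }
    return entities;
}

// Appends the new quote entities after a placeholder "last line",
// with or without a bar joining the two.
std::string buildList(bool barBeforeNewLine)
{
    std::string text = "&macr;|&deg;";   // placeholder: old last line, no trailing bar
    if (barBeforeNewLine)
        text += "|";
    text += "&ldquo;|&rdquo;|&lsquo;|&rsquo;";
    return text;
}
```

Without the joining bar, the split produces a single fused entry, "&deg;&ldquo;", so neither entity would be recognized on its own. With the bar, all six entries come out separately.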

ht://Dig 3.1.6

The latest “stable” release is still 3.1.6. However, if you can use 3.2, I’d recommend it, both for this issue and for the additional functionality. Unfortunately, on the host that I use, 3.2 trips their “reaper” because it uses more memory than user processes are allowed to use. So for the moment I’m still using 3.1.6.

In ht://Dig 3.1.6, the change needs to be made in SGMLEntities.cc in the htdig directory. That file keeps a different list of special entities, written one entity per line, with entries that look like:

  • { "quot", '"' } ,
  • { "trade", 153 } , /* trade mark */

You can add your own after the yuml listing but before the { 0, 0 } listing:

  • { "lsquo", '\'' } , /* left single quote */
  • { "rsquo", '\'' } , /* right single quote */
  • { "ldquo", '"' } , /* left double quote */
  • { "rdquo", '"' } , /* right double quote */

After making this change and re-digging your site, intelligent single and double quotes will be replaced with dumb single and double quotes; not quite as nice a solution as in 3.2, but much better than displaying the entity codes themselves.
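For anyone wondering why the new entries have to go before the { 0, 0 } line: a table that ends in a zero entry is normally walked until that entry is reached, so anything placed after it is never seen. Here is a minimal sketch of that pattern. The Entity struct and lookupEntity are assumptions made for the example. The real declarations in SGMLEntities.cc may differ, and the real code may not do a simple linear scan.

```cpp
#include <cstring>

// Hypothetical entry type; the real struct in SGMLEntities.cc may differ.
struct Entity {
    const char*   name;   // entity name without the '&' and ';'
    unsigned char code;   // character substituted for the entity
};

static const Entity entities[] = {
    { "quot",  '"'  },
    { "trade", 153  },    /* trade mark */
    { "lsquo", '\'' },    /* left single quote */
    { "rsquo", '\'' },    /* right single quote */
    { "ldquo", '"'  },    /* left double quote */
    { "rdquo", '"'  },    /* right double quote */
    { 0, 0 }              /* end-of-table marker */
};

// Walks the table until the { 0, 0 } marker.
// Returns the replacement character, or 0 if the name is not listed.
unsigned char lookupEntity(const char* name)
{
    for (const Entity* e = entities; e->name != 0; ++e) {
        if (std::strcmp(e->name, name) == 0)
            return e->code;
    }
    return 0;
}
```

Looking up "ldquo" here gives a plain double quote, while a name that is not in the table, such as "hellip", gives 0.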
