Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, Swift, BASIC, and whatever else I happen to feel like hacking at.

Creating searchable PDFs in Ventura

Jerry Stratton, June 14, 2023

Your pour at tea… you play at cards: “You need new delicacies whenever you entertain.” From The Ladies Home Journal, February, 1926.; PDFKit; Hills Brothers Company; Dromedary; twenties; 1920s

An example pre-Ventura.

I’ve updated my searchablePDF (Zip file, 8.9 KB) script with a workaround for new behavior in Ventura. As you can see from my archive of old promotional cookbook pamphlets I’ve been using it extensively—and I have a lot more to come! Reading these old books and trying out some of their recipes has been ridiculously fun, and even occasionally tasty.

Recently, a friend of mine gave me an old recipe pamphlet in very bad shape; but it isn’t currently available on any of the online archives, so I decided to scan it anyway, more for historical purposes than as a useful cookbook. This was the first archival scan I did since upgrading to Ventura, and when I was finished and created the PDF, it was practically unreadable.

It was also a lot smaller than the source images. All of the previous PDFs had very predictable file sizes. When running this script under Monterey, all I had to do was sum up all of the source images (easily done by just doing Get Info on the folder they’re in) and that was the size of the resultant PDF file.
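The Get Info check can also be scripted. This is a minimal sketch, assuming the scans sit directly inside one folder. The helper name `totalFileSize` is made up for illustration and is not part of searchablePDF.

```swift
import Foundation

// Hypothetical helper, not part of the searchablePDF script:
// sums the byte sizes of the files directly inside a folder, skipping subfolders.
func totalFileSize(inFolder path: String) throws -> Int {
    let fileManager = FileManager.default
    let folder = URL(fileURLWithPath: path)
    var total = 0
    for name in try fileManager.contentsOfDirectory(atPath: path) {
        let itemPath = folder.appendingPathComponent(name).path
        var isDirectory: ObjCBool = false
        guard fileManager.fileExists(atPath: itemPath, isDirectory: &isDirectory),
              !isDirectory.boolValue else { continue }
        let attributes = try fileManager.attributesOfItem(atPath: itemPath)
        total += (attributes[.size] as? NSNumber)?.intValue ?? 0
    }
    return total
}
```

Comparing that total against the size of the finished PDF is a quick way to spot whether the image data was regenerated.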

In retrospect, it sounds like in Monterey, where I originally wrote the script, PDFKit was re-using the original image data. I always scan at somewhere from 300 to 600 dpi. In Ventura it seems to be regenerating the image data and using the lower resolution of PDFs to do so. PDFs use 72 dots per inch; but PDFKit in Monterey did not downgrade the actual images to 72 dots per inch. Ventura does.

No problem, I thought, I’ll just increase the image’s size before creating the PDFPage. This worked—to improve the image’s quality. But the PDF opened in Preview as if it were bigger than the screen! Sure enough, Preview reported the document as being some immense number of inches wide and tall.
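The arithmetic behind both symptoms is short. This sketch assumes a 4-inch-wide page scanned at 300 dpi, and assumes the enlarged image's size was read as points, one point per pixel. The function names are illustrative and not from the script.

```swift
// Hypothetical helpers for illustration; none of these names come from the script.
let pointsPerInch = 72.0

// Pixels per inch an image ends up with when drawn across a page of the given width.
func effectivePPI(pixelsWide: Double, pageWidthInches: Double) -> Double {
    return pixelsWide / pageWidthInches
}

// Page width a viewer would report if a size in points is converted to inches.
func reportedInches(points: Double) -> Double {
    return points / pointsPerInch
}

let scanWidthPixels = 4.0 * 300.0          // 1200 pixels across the scan
let pageWidthPoints = 4.0 * pointsPerInch  // 288 points across the PDF page

// Regenerating the image at the page's point size leaves only 72 pixels per inch.
print(effectivePPI(pixelsWide: pageWidthPoints, pageWidthInches: 4.0))

// Treating each scanned pixel as a point makes the page about 16.7 inches wide.
print(reportedInches(points: scanWidthPixels))
```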

While flailing around looking for a solution, I noticed that if I didn’t bother trying to normalize the images to always be the correct ratio (i.e., for a 4x6 book, the ratio should be ⅔) but just assigned the 4x6 pageSize to image.size it both maintained the image quality and it created PDFs of the correct size in inches.

Because the image wasn’t normalized to the PDF’s size, however, there was weird white space either vertically or horizontally, depending on whether the original image was slightly too wide or slightly too tall.
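Which side gets the white space follows from the aspect ratios. This is a rough sketch, assuming the image is scaled to fit the page while keeping its proportions. That is a guess at what PDFKit does, and `leftoverSpace` is a made-up name.

```swift
// Hypothetical illustration, not code from the script.
// Scales an image to fit inside a page without distorting it, then reports
// how much of the page is left uncovered, in the page's own units.
func leftoverSpace(imageWidth: Double, imageHeight: Double,
                   pageWidth: Double, pageHeight: Double)
    -> (horizontal: Double, vertical: Double) {
    // The smaller of the two ratios is the largest scale that still fits.
    let scale = min(pageWidth / imageWidth, pageHeight / imageHeight)
    let horizontal = pageWidth - imageWidth * scale
    let vertical = pageHeight - imageHeight * scale
    return (horizontal, vertical)
}

// A scan that is slightly too wide for a 4x6 page fills the width
// and leaves about 0.15 inches of vertical white space.
let tooWide = leftoverSpace(imageWidth: 1230, imageHeight: 1800,
                            pageWidth: 4, pageHeight: 6)

// A scan that is slightly too tall fills the height
// and leaves about 0.11 inches of horizontal white space.
let tooTall = leftoverSpace(imageWidth: 1200, imageHeight: 1850,
                            pageWidth: 4, pageHeight: 6)
```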

So why not assign the page size to the new, normalized NSImage instead? Unlike doing so with the original image, this actually reduced the quality of the image: the result was identical, in both file size and image quality, to creating the NSImage at that page size to begin with.

The Baker’s Dozen: Cover of the General Foods The Baker’s Dozen, from McCall’s Magazine in 1976.; cookbooks; Baker’s Coconut

One of the pamphlets I’ve scanned and then converted to PDF using this script.

Giving the normalized NSImage the page’s size after creation created very low quality images. Giving the normalized NSImage a doubled or quadrupled page size after creation improved the quality but also generated zoomed PDFs. This at least meant that the original image’s quality wasn’t being thrown out by the new NSImage. It just wasn’t being used by PDFKit.

Because altering .size directly worked, sort of, with the original NSImage, I experimented with creating file representations of the new, normalized NSImage. At the very worst, I assumed I could literally write out the normalized image at high quality, re-read it as an NSImage, and then adjust its .size. That turned out not to be necessary, however. All that’s required is requesting a representation at the higher resolution; once that representation is created, PDFKit appears to use it rather than generating a very low resolution one.


    // A data representation needs to be generated, or PDFKit under Ventura will generate a very low quality image
    // at 72 pixels per inch instead of using the original image as it did under Monterey.
    // `compression` is the JPEG compression factor set elsewhere in the script.
    func toPageSize(image: NSImage, pageSize: NSSize, identifier: String) -> NSImage {
        // Start from the image's TIFF data so that it can be turned into a bitmap.
        guard let tiffData = image.tiffRepresentation else {
            print("Trouble adding representation to image", identifier)
            exit(0)
        }
        guard let bitmapVersion = NSBitmapImageRep(data: tiffData) else {
            print("Trouble getting bitmap of image representation for image", identifier)
            exit(0)
        }
        // Re-encode the bitmap as a JPEG at full pixel resolution.
        guard let jpegData = bitmapVersion.representation(using: NSBitmapImageRep.FileType.jpeg, properties: [.compressionFactor: compression]) else {
            print("Trouble turning bitmap into a jpeg for image", identifier)
            exit(0)
        }
        guard let pagedImage = NSImage(data: jpegData) else {
            print("Trouble creating image from data for image", identifier)
            exit(0)
        }
        // Only now assign the PDF page dimensions; the pixel data stays at scan resolution.
        pagedImage.size = pageSize
        return pagedImage
    }

This generates a JPEG and then rereads the JPEG data into a new NSImage. After creation, it assigns the correct PDF dimensions to that new image.

Because the default JPEG compression is not the same as whatever compression I used when scanning the images, this still generates files of a different size, but it’s a lot closer, and the quality of the PDF’s pages is back up to the quality of the original images. If you look in the real code, you’ll see that I set a compression factor of (as I write this) 0.537 (the value can range from 0.0 to 1.0). That seems to generate PDFs that approximate the sum of the constituent images. I’ll probably turn that into a command-line option if I end up needing different factors, but it seems to work well for now.
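If the factor ever does become a command-line option, the parsing could look something like this. The `--compression` flag is hypothetical and does not exist in the script as of this writing. The sketch only shows one way to read the value and keep it inside the 0.0 to 1.0 range.

```swift
// Hypothetical sketch: searchablePDF has no --compression option as of this writing.
// Returns the number that follows "--compression", clamped to 0.0...1.0,
// or the fallback when the flag is absent or its value is not a finite number.
func compressionFactor(from arguments: [String], fallback: Double = 0.537) -> Double {
    guard let flagIndex = arguments.firstIndex(of: "--compression"),
          flagIndex + 1 < arguments.count,
          let value = Double(arguments[flagIndex + 1]),
          value.isFinite else {
        return fallback
    }
    return min(max(value, 0.0), 1.0)
}

// Usage: read the factor once at startup from the real command line.
let compression = compressionFactor(from: CommandLine.arguments)
```

Falling back to the current hardcoded value means that runs without the flag would behave as they do today.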

That’s the major change to the script for Ventura. It’s very much a workaround. I’d rather keep the original image intact, as in Monterey, than re-compress it as this solution does. If you know how to retain the pre-Ventura behavior of NSImage and PDFKit, consider answering this question on StackOverflow.

I’ve also added one new feature, for scanning in very low-quality pamphlets. There is now a --missing option to insert a blank page or multiple blank pages into the document at a specific page, with the text “Missing page xx”.

Besides being mouse-eaten, the low-quality pamphlet that got me started on this is missing the middle sheet, which is the middle four pages. So, when generating the PDF from my scans of that pamphlet, I include --missing 46 4 and it adds four blank pages after page 46.
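The page bookkeeping for --missing 46 4 can be sketched as follows. This is not the script's implementation. It assumes one scan per page, and assumes each placeholder is labeled with the page number it stands in for. The `Page` type and `pageList` function are illustrative names.

```swift
// Hypothetical sketch of the page bookkeeping; the script's own code may differ.
enum Page: Equatable {
    case scan(Int)        // the nth scanned image, counting from 1
    case missing(String)  // the text to draw on a blank placeholder page
}

// Builds the page order for a document with `scanCount` scans, inserting
// `missingCount` placeholder pages right after page `missingAfter`.
func pageList(scanCount: Int, missingAfter: Int, missingCount: Int) -> [Page] {
    var pages: [Page] = []
    for scan in stride(from: 1, through: scanCount, by: 1) {
        pages.append(.scan(scan))
        if scan == missingAfter {
            for offset in 0..<max(missingCount, 0) {
                pages.append(.missing("Missing page \(missingAfter + offset + 1)"))
            }
        }
    }
    return pages
}

// Equivalent of "--missing 46 4" for a pamphlet with 48 scanned pages:
// the placeholders are labeled for pages 47 through 50, then the scans resume.
let pages = pageList(scanCount: 48, missingAfter: 46, missingCount: 4)
```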

It’s adding the text to the NSImage and not to the PDFPage, something I should probably change. Or possibly not; it lets me use the subroutines for altering the images that I know already work. In order to get around Ventura’s resolution issue, I create the blank image at a higher resolution (currently hardcoded at eight times the page’s dimensions), do the above conversion, and then assign the real page size to it.

There’s still a minor problem that I haven’t tracked down. Most times I run the script I get a notice that “CoreGraphics PDF has logged an error. Set environment variable "CG_PDF_VERBOSE" to learn more.” If I set the environment variable, all I get is a series of warnings, all the same: “Invalid image orientation, assuming 1.”

Since I don’t change the orientation of any of the images, I’m guessing that the invalid orientation value is in the original. “Assuming 1” appears to be the correct thing to do, since there are no incorrectly oriented pages in the PDF.

You can use setenv("CG_PDF_VERBOSE", "1", 1) to view the more detailed errors yourself if you wish—assuming you even see them at all. If, as I suspect, the error is the result of the image editor I’m using, you may not see that error at all.
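In a Swift script, that call could sit near the top of the file. The placement is an assumption: the variable presumably needs to be set before any PDF work begins, but that is not confirmed here.

```swift
import Foundation

// Ask CoreGraphics for its detailed PDF diagnostics.
// The third argument (1) overwrites the variable if it is already set.
// Assumption: this has to run before any PDF pages are created.
let status = setenv("CG_PDF_VERBOSE", "1", 1)
if status != 0 {
    print("Could not set CG_PDF_VERBOSE")
}
```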

In response to Create searchable PDFs in Swift: This Swift script will take a series of image scans, OCR them, and turn them into a PDF file with a simple table of contents and searchable content—with the original images as the visually readable content.