OCRing PDFs

For batch OCR on derivative PDFs.

A common workflow is to generate PDF derivatives for multi-page documents. These then need OCR to create embedded text that can be extracted and added to Hyrax's index during upload to support discovery.

Step-by-step guide

  1. Generate PDF derivatives during the Create Derivatives step.

    1. If you need to run OCR on original PDFs in the \masters folder that don't already have embedded text, copy them into the \derivatives folder first.

    2. It is generally good practice to keep the original PDFs as well as the ones you run OCR on. The OCR process significantly edits the file and makes assumptions about PDF files that may not hold, so there is some potential for data loss.

  2. Enter the package ID for the package that includes the PDFs.

    1. By default, this process will run OCR on every PDF in the package within the \derivatives folder.

    2. PDFs that already have some pages with embedded text will be ignored.

  3. Optionally check the box and enter a subpath if you want to limit OCR to a certain folder or file path.

    1. Use Unix-style (/) path separators or a double backslash (\\) for Windows-style path separators

    2. Subpaths are relative to the \derivatives folder, so if you want to run OCR on the PDFs only in this folder:

\\Lincoln\Library\SPE_Processing\backlog\ua200\ua200_2Wmfb6o4Vjb2Hs7jCuvqSY\derivatives\Councils\Executive (SEC)\2009-10 SEC\Agendas

Then enter:

Councils/Executive (SEC)/2009-10 SEC/Agendas

  4. Click "Submit"!
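
The subpath from step 3 can also be derived from a full Windows path. The following is a minimal sketch, not part of the tool itself: the helper name derivatives_subpath is hypothetical, and it assumes the package's folder is literally named "derivatives" and that the first folder with that name is the one the subpath is relative to.

```python
from pathlib import PureWindowsPath


def derivatives_subpath(full_path: str) -> str:
    """Return the part of a Windows path below the derivatives folder,
    joined with Unix-style (/) separators.

    Hypothetical helper for illustration; the folder name "derivatives"
    is an assumption taken from the guide above.
    """
    parts = PureWindowsPath(full_path).parts
    lowered = [part.lower() for part in parts]
    if "derivatives" not in lowered:
        raise ValueError("path is not inside a derivatives folder")
    # Keep only the components after the first "derivatives" folder.
    index = lowered.index("derivatives")
    return "/".join(parts[index + 1:])


if __name__ == "__main__":
    example = (
        r"\\Lincoln\Library\SPE_Processing\backlog\ua200"
        r"\ua200_2Wmfb6o4Vjb2Hs7jCuvqSY\derivatives"
        r"\Councils\Executive (SEC)\2009-10 SEC\Agendas"
    )
    print(derivatives_subpath(example))
```

PureWindowsPath is used so the conversion behaves the same on any operating system, including for UNC paths like the one in the example.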

OCR-ing existing PDFs

For existing PDFs, this process extracts the page images from each PDF, runs Tesseract on the individual page images to produce one PDF per page, and then recombines those pages into a single PDF. Since this process alters the images, it's a good idea to keep the original PDFs in the \masters folder.

PDFs that already have embedded text in any pages will be skipped.
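
One way to check whether a PDF would be skipped is to ask pypdf for the text of each page. This is a sketch under assumptions, not the tool's actual check: the function names are hypothetical, and "embedded text" is taken to mean that extract_text() returns something other than whitespace for at least one page.

```python
def has_embedded_text(pages) -> bool:
    """Return True if any page yields non-whitespace text.

    `pages` is any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a pypdf PdfReader.
    """
    return any((page.extract_text() or "").strip() for page in pages)


def pdf_has_embedded_text(pdf_path: str) -> bool:
    """Open a PDF with pypdf and report whether any page has text."""
    # Imported here so has_embedded_text() stays usable without pypdf.
    from pypdf import PdfReader

    return has_embedded_text(PdfReader(pdf_path).pages)
```

Running pdf_has_embedded_text() over a folder before submitting a package gives a rough preview of which files the OCR step is likely to leave alone.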

What it does

  1. Finds every PDF recursively within the \derivatives folder or a subpath of it

  2. Makes a temporary "converting-ocr" directory

  3. Extracts an image of each page to the "converting-ocr" directory

    1. Ignores the PDF if any page already has embedded text or contains multiple images

  4. Fixes page size and rotation to match original PDF

  5. Runs Tesseract 5.3.0 on each image to convert it to a PDF with embedded text

  6. Uses pypdf to combine PDF pages into a single PDF
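
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the tool's actual code: all function names are hypothetical, the page-image extraction and the page size and rotation fix (steps 3 and 4) are left out, and it assumes the tesseract executable is on the PATH and pypdf is installed.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Step 1: recursively list PDFs under root (extension match is case-insensitive)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def tesseract_command(image_path, output_base):
    """Build the command for step 5.

    `tesseract <image> <outputbase> pdf` writes <outputbase>.pdf
    containing the page image plus a searchable text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def ocr_page_images(image_paths, work_dir):
    """Steps 2 and 5: run Tesseract on each page image inside a working directory."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    page_pdfs = []
    for image in image_paths:
        base = work_dir / Path(image).stem
        subprocess.run(tesseract_command(image, base), check=True)
        page_pdfs.append(Path(str(base) + ".pdf"))
    return page_pdfs


def combine_pages(page_pdfs, output_path):
    """Step 6: merge the single-page PDFs, in order, into one file."""
    # Imported here so the other helpers work without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for page_pdf in page_pdfs:
        writer.append(str(page_pdf))
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

For example, combine_pages(ocr_page_images(images, "converting-ocr"), "out.pdf") would OCR a list of page images in order and write the merged result, where images and out.pdf are placeholders.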