OCRing PDFs

Use this process to run batch OCR on derivative PDFs.

A common workflow is to generate PDF derivatives for multi-page documents. These then need OCR to create embedded text that can be extracted and added to Hyrax's index during upload to support discovery.

Step-by-step guide

1. Generate PDF derivatives during the Create Derivatives step.
   1. If you need to run OCR on original PDFs in the \masters folder that don't already have embedded text, copy them into the derivatives folder.
   2. It is generally good practice to keep the original PDFs alongside the copies you run OCR on. The OCR process significantly edits the file and makes assumptions about PDF files that may not hold, so there is some potential for data loss.
2. Enter the package ID for the package that includes the PDFs.
   1. By default, this process runs OCR on every PDF in the package's \derivatives folder.
   2. PDFs that already have embedded text on some pages are skipped.
3. Optionally, check the box and enter a subpath to limit OCR to a certain folder or file path.
4. Click "Submit".
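
The selection rules in the steps above can be sketched in Python. This is a minimal illustration, not the tool's actual implementation: the function names `select_pdfs` and `needs_ocr`, the `derivatives` folder sitting directly under a package root, and the idea of passing in already-extracted per-page text are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def select_pdfs(package_root: Path, subpath: Optional[str] = None) -> List[Path]:
    """Return the PDFs under <package_root>/derivatives, optionally limited to a subpath.

    Hypothetical helper: the folder layout and matching rule are assumptions.
    """
    base = package_root / "derivatives"
    target = base / subpath if subpath else base
    if target.is_file():
        # The subpath points at a single file rather than a folder.
        return [target] if target.suffix.lower() == ".pdf" else []
    if not target.is_dir():
        return []
    # Search the folder recursively and keep only files with a .pdf extension.
    return sorted(
        p for p in target.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def needs_ocr(page_texts: Iterable[str]) -> bool:
    """Return False if any page already carries embedded text (such PDFs are skipped)."""
    return not any(text.strip() for text in page_texts)
```

Here `page_texts` stands for the text extracted from each page by whatever PDF library is in use; keeping that extraction outside the function keeps the skip rule itself easy to check in isolation.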

What it does

This process runs Tesseract 5.3.0.
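
This page does not describe how the process invokes Tesseract, so the following is only a sketch of how a single page image could be turned into a searchable PDF with the Tesseract command line. It assumes the page has already been rasterized to an image, that the `tesseract` executable is on the PATH, and that English (`eng`) is the OCR language; the function names `build_tesseract_command` and `ocr_page` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> List[str]:
    """Build a Tesseract CLI call that writes <output_base>.pdf with a text layer."""
    # CLI usage: tesseract imagename outputbase [options...] [configfile...]
    # The trailing "pdf" config asks Tesseract for searchable PDF output.
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_page(image: Path, output_base: Path, lang: str = "eng") -> Path:
    """Run Tesseract on one page image and return the path of the resulting PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract appends ".pdf" to the output base name itself.
    return output_base.with_name(output_base.name + ".pdf")
```

Running `tesseract --version` on the command line is a quick way to confirm which Tesseract release is installed.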
