OCRing PDFs
For batch OCR on derivative PDFs.
A common workflow is to generate PDF derivatives for multi-page documents. These then need OCR to create embedded text that can be extracted and added to Hyrax's index during upload to support discovery.
Step-by-step guide
Generate PDF derivatives during the Create Derivatives step.
If you need to run OCR on original PDFs in the \masters folder that don't already have embedded text, copy them into the \derivatives folder first.
It is generally good practice to keep the original PDFs alongside the copies you run OCR on. The OCR process substantially rewrites the file and makes assumptions about PDF files that may not hold, so there is some potential for data loss.
Enter the ID of the package that contains the PDFs.
By default, this process will run OCR on every PDF in the package within the \derivatives folder.
PDFs that already have some pages with embedded text will be ignored.
Optionally check the box and enter a subpath if you want to limit OCR to a certain folder or file path.
Use Unix-style (/) path separators, or a double backslash (\\) for Windows-style path separators.
Subpaths are relative to the \derivatives folder, so to run OCR only on the PDFs in this folder:
\\Lincoln\Library\SPE_Processing\backlog\ua200\ua200_2Wmfb6o4Vjb2Hs7jCuvqSY\derivatives\Councils\Executive (SEC)\2009-10 SEC\Agendas
Then enter:
Councils/Executive (SEC)/2009-10 SEC/Agendas
Click "Submit"!
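The subpath rule above can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function name normalize_subpath is hypothetical and is not part of the actual tool, and rejecting ".." segments is an added safety check rather than documented behavior.

```python
import re
from pathlib import PurePosixPath


def normalize_subpath(subpath: str) -> PurePosixPath:
    """Turn a user-entered subpath into a path relative to the derivatives folder.

    Hypothetical helper: accepts "/" or one or more backslashes as separators.
    """
    # Split on any run of forward slashes or backslashes, dropping empty pieces
    parts = [part for part in re.split(r"[\\/]+", subpath.strip()) if part]
    # Assumption: a subpath should never climb out of the derivatives folder
    if ".." in parts:
        raise ValueError(f"subpath must stay inside derivatives: {subpath!r}")
    return PurePosixPath(*parts)


unix_style = normalize_subpath("Councils/Executive (SEC)/2009-10 SEC/Agendas")
windows_style = normalize_subpath("Councils\\\\Executive (SEC)\\\\2009-10 SEC\\\\Agendas")
print(unix_style == windows_style)
```

Both spellings of the example subpath resolve to the same relative path, which could then be joined onto the package's derivatives folder.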
OCR-ing existing PDFs
For existing PDFs, this process extracts the page images from each PDF, runs Tesseract on the individual page images to convert each one into a single-page PDF, and then recombines those pages into a single PDF. Since this process alters the images, it's a good idea to keep the original PDFs in the \masters folder.
PDFs that already have embedded text in any pages will be skipped.
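A check like the skip rule above might look like the following sketch. It assumes pypdf is used for the test (the page names pypdf only for combining pages), and the function names are hypothetical. The decision itself is kept in a pure helper so it can be tried without any PDF files.

```python
from pathlib import Path
from typing import Iterable


def any_page_has_text(page_texts: Iterable[str]) -> bool:
    """Return True if at least one page yields non-whitespace text."""
    return any(text.strip() for text in page_texts)


def pdf_has_embedded_text(pdf_path: Path) -> bool:
    """Assumed approach: read each page's text layer with pypdf."""
    # Third-party import kept local so the helper above works without pypdf
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return any_page_has_text(page.extract_text() or "" for page in reader.pages)


# A PDF whose third page has text would be skipped; an image-only one would not
print(any_page_has_text(["", "  \n", "Agenda"]))
print(any_page_has_text(["", "   "]))
```

Because a single page with text is enough to skip the whole file, a partially OCR'd PDF would need its text layer removed, or a fresh image-only copy made, before this process would handle it.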
What it does
Finds every PDF recursively within the \derivatives folder or a subpath of it
Makes a temporary "converting-ocr" directory
Extracts an image of each page to the "converting-ocr" directory
Ignores the PDF if any page already has embedded text or contains multiple images
Fixes page size and rotation to match original PDF
Runs Tesseract 5.3.0 on each image to convert it to a PDF with embedded text
Uses pypdf to combine PDF pages into a single PDF
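Parts of the steps above can be sketched in Python as follows. This is a rough outline under assumptions, not the actual implementation: the function names are hypothetical, the image extraction and page size and rotation fixes are left out because the page does not say how they are done, and Tesseract is assumed to be available on the PATH.

```python
import subprocess
from pathlib import Path


def find_pdfs(root: Path) -> list[Path]:
    """Return every PDF under root, searching subfolders recursively."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def tesseract_command(image: Path, output_base: Path) -> list[str]:
    """Build the Tesseract call that writes <output_base>.pdf with a text layer."""
    return ["tesseract", str(image), str(output_base), "pdf"]


def ocr_page_images(images: list[Path], combined_pdf: Path) -> None:
    """OCR each page image, then merge the single-page PDFs in page order."""
    # Third-party import kept local so the helpers above work without pypdf
    from pypdf import PdfWriter

    writer = PdfWriter()
    for image in images:
        output_base = image.with_suffix("")  # e.g. page-001.png -> page-001
        subprocess.run(tesseract_command(image, output_base), check=True)
        writer.append(str(output_base) + ".pdf")
    writer.write(str(combined_pdf))
```

In this sketch the page images would sit in the temporary "converting-ocr" directory, and the list passed to ocr_page_images would need to be in page order so the combined PDF matches the original.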