
...

Councils/Executive (SEC)/2009-10 SEC/Agendas

Click "Submit"!

OCR-ing existing PDFs

For existing PDFs, this process extracts the page images from each PDF, runs tesseract on each page image to produce a one-page PDF with embedded text, and then recombines those pages into a single PDF. Because the process alters the images, it's a good idea to keep the original PDFs in the masters folder.

A PDF is skipped if any of its pages already has embedded text.
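
The skip rule can be illustrated with a minimal sketch. It assumes pypdf is used for the check (the page only confirms pypdf for the final combine step), and the function names any_page_has_text and should_skip are hypothetical, not taken from the actual script.

```python
from pathlib import Path
from typing import Iterable


def any_page_has_text(page_texts: Iterable[str]) -> bool:
    """Return True if at least one page yields non-whitespace text."""
    return any(text.strip() for text in page_texts)


def should_skip(pdf_path: Path) -> bool:
    """Hypothetical skip check: True when any page already has embedded text."""
    # Imported lazily so the pure helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # Assumption: image-only pages yield no extractable text.
    return any_page_has_text(page.extract_text() or "" for page in reader.pages)
```

Note that this sketch treats whitespace-only text as "no text"; the real process may apply a different threshold.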

What it does

Finds every PDF recursively within the \derivatives folder or a subpath of it
Makes a temporary "converting-ocr" directory
Extracts an image of each page to the "converting-ocr" directory
   1. Skips the PDF if any page already has embedded text or contains multiple images
Fixes page size and rotation to match original PDF
Runs tesseract 5.3.0 on each image to convert it to a PDF with embedded text
Uses pypdf to combine PDF pages into a single PDF
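
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the helper names (find_pdfs, tesseract_command, ocr_images, combine_pages) are hypothetical, the image-extraction and page size/rotation steps are left out, and tesseract is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path

WORK_DIR_NAME = "converting-ocr"  # temporary directory name from the steps above


def find_pdfs(root: Path) -> list[Path]:
    """Recursively list PDFs under root, ignoring the temporary work directory."""
    return sorted(p for p in root.rglob("*.pdf") if WORK_DIR_NAME not in p.parts)


def tesseract_command(image: Path, output_base: Path) -> list[str]:
    """Build the tesseract CLI call; tesseract appends '.pdf' to output_base."""
    return ["tesseract", str(image), str(output_base), "pdf"]


def ocr_images(images: list[Path], work_dir: Path) -> list[Path]:
    """Run tesseract on each page image and return the per-page PDFs in order."""
    work_dir.mkdir(parents=True, exist_ok=True)
    page_pdfs = []
    for image in images:
        base = work_dir / image.stem
        subprocess.run(tesseract_command(image, base), check=True)
        page_pdfs.append(work_dir / f"{image.stem}.pdf")
    return page_pdfs


def combine_pages(page_pdfs: list[Path], output: Path) -> None:
    """Merge the one-page PDFs into a single PDF with pypdf."""
    # Imported lazily so the helpers above work without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for page_pdf in page_pdfs:
        writer.append(str(page_pdf))
    with output.open("wb") as fh:
        writer.write(fh)
```

Given a list of extracted page images for one PDF, a call such as combine_pages(ocr_images(images, work_dir), output) would produce the final searchable file. Writing the result to a separate output path, rather than over the source, keeps the original untouched.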

...
