...

Councils/Executive (SEC)/2009-10 SEC/Agendas

  1. Click "Submit"!

OCR-ing existing PDFs

For existing PDFs, this process extracts the page images from each PDF, runs tesseract on each page image to produce a one-page PDF with embedded text, and then recombines those pages into a single PDF. Because the process alters the images, it's a good idea to keep the original PDFs in the masters folder.

PDFs that already have embedded text on any page are skipped.
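As a rough illustration of the skip rule, the sketch below checks whether any page of a PDF yields extractable text. It assumes pypdf is used for the inspection (the page only names pypdf for the final merge), and the function names `has_embedded_text` and `read_page_texts` are hypothetical, not taken from the actual script.

```python
def has_embedded_text(page_texts):
    """Return True if any page's extracted text is non-empty after trimming."""
    return any(text.strip() for text in page_texts)


def read_page_texts(pdf_path):
    """Extract the text layer of every page (assumes pypdf is installed)."""
    # Imported lazily so the pure check above works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() gives an empty string for image-only pages.
    return [(page.extract_text() or "") for page in reader.pages]


def should_skip(pdf_path):
    """Hypothetical wrapper: skip PDFs that already carry embedded text."""
    return has_embedded_text(read_page_texts(pdf_path))
```

A scanned, image-only PDF would typically return only empty strings here and so would not be skipped, although the real script may use a different test.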

What it does

  1. Finds every PDF recursively within the \derivatives folder or a subpath of it
  2. Makes a temporary "converting-ocr" directory
  3. Extracts an image of each page to the "converting-ocr" directory
    1. Skips the PDF if any page already has embedded text or contains multiple images
  4. Fixes page size and rotation to match original PDF
  5. Runs tesseract 5.3.0 on each image to convert it to a PDF with embedded text
  6. Uses pypdf to combine PDF pages into a single PDF
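
The steps above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not the actual script: the function names and the `page-0001.png` naming are hypothetical, image extraction and the page size and rotation fix (steps 3 and 4) are left out because the page does not say which tool performs them, and tesseract is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Step 1: recursively list PDFs under root (case-insensitive extension)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def tesseract_command(image_path, output_base):
    """Build the CLI call; `tesseract IMAGE OUTPUTBASE pdf` writes OUTPUTBASE.pdf."""
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def ocr_page(image_path, output_base):
    """Step 5: run tesseract on one page image and return the resulting PDF path."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    return Path(f"{output_base}.pdf")


def combine_pages(page_pdfs, output_pdf):
    """Step 6: merge the one-page PDFs, in order, into a single PDF."""
    # Imported lazily so the helpers above work without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for page_pdf in page_pdfs:
        writer.append(str(page_pdf))
    with open(output_pdf, "wb") as handle:
        writer.write(handle)
```

For a hypothetical page image `converting-ocr/page-0001.png`, `ocr_page("converting-ocr/page-0001.png", "converting-ocr/page-0001")` would be expected to leave `converting-ocr/page-0001.pdf` next to it, ready to be passed to `combine_pages`.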

...