scripts-fabq/notes/pdf.md
Fabrice Quenneville b8b8817cd4 feat(notes): add guide for OCR and PDF manipulation on Linux
- Added instructions for setting up Tesseract with language support.
- Documented steps for converting PDFs to images using `pdftoppm` and alternatives like `ImageMagick`.
- Included examples for single and multi-page OCR processing.
- Detailed methods for merging extracted text into a single file.
- Added troubleshooting tips for improving OCR results and handling selectable PDFs with `pdftotext`.
2024-12-05 16:09:04 -05:00


PDF


OCR

Install necessary tools

Update the package list and install Tesseract with support for the desired language. Replace fra with your desired language code (e.g., eng for English).

apt update
apt install tesseract-ocr tesseract-ocr-fra

Verify the installation

Check the installed version of Tesseract and list available languages to ensure your chosen language is installed (e.g., fra for French).

tesseract --version
tesseract --list-langs
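
The same check can be scripted so that a setup script stops early when a language pack is missing. This is a minimal sketch, assuming a POSIX-style shell with grep; the function name has_tess_lang is hypothetical and not part of Tesseract.

```shell
# has_tess_lang is a hypothetical helper, not part of Tesseract.
# It succeeds when the given language code appears as a whole line
# in the output of `tesseract --list-langs`.
has_tess_lang() {
    # 2>&1: some Tesseract versions print the list on stderr.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# Example usage (requires tesseract to be installed):
#   has_tess_lang fra || echo "fra is missing: apt install tesseract-ocr-fra"
```

Matching the whole line (-x) avoids false positives, for example a partial code such as fr matching fra.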

Install a utility to convert PDF pages into images

For PDFs that require OCR, you need a utility to convert PDF pages into images. Install poppler-utils, which includes pdftoppm.

apt install poppler-utils

Convert PDF to images

Convert the PDF into JPEG images, with each page saved as a separate file. Pages are numbered sequentially (e.g., output-1.jpg, output-2.jpg, etc.); pdftoppm zero-pads the number when the PDF has ten or more pages (e.g., output-01.jpg).

pdftoppm -jpeg your_file.pdf output
  • Tip: Use a dedicated output directory to avoid overwriting existing files.

  • Alternative tools: If poppler-utils isn't available, consider using ImageMagick:

    convert -density 300 your_file.pdf output-%04d.jpg
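
The tip about a dedicated output directory can be sketched as a small wrapper around pdftoppm. The function name pdf_to_images, the default directory pages, and the file prefix page are assumptions chosen for illustration; only the -r and -jpeg options come from pdftoppm itself.

```shell
# pdf_to_images is a hypothetical helper name.
# Usage: pdf_to_images your_file.pdf [output_dir] [dpi]
pdf_to_images() {
    local pdf="$1"
    local outdir="${2:-pages}"   # assumed default directory name
    local dpi="${3:-300}"        # 300 DPI is a common choice for OCR
    mkdir -p "$outdir" || return 1
    # Writes $outdir/page-1.jpg, $outdir/page-2.jpg, ...
    # (numbers are zero-padded when the PDF has 10 or more pages).
    pdftoppm -r "$dpi" -jpeg "$pdf" "$outdir/page"
}

# Example usage (requires poppler-utils):
#   pdf_to_images your_file.pdf pages 300
```

Keeping the images in their own directory also makes cleanup simple once the text has been extracted.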
    

Perform OCR on a single image

Run Tesseract OCR on an image to extract text. Specify the language using the -l option.

tesseract output-1.jpg output-text -l fra

The extracted text will be saved in output-text.txt.

Perform OCR on multiple images

For multi-page PDFs, process all images in a loop. This extracts text from each image and saves it to a separate .txt file.

for img in output-*.jpg; do
    tesseract "$img" "${img%.jpg}" -l fra
done
  • Note: The ${img%.jpg} syntax removes the .jpg extension, ensuring each .txt file matches its corresponding image.
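
The suffix-removal expansion used in the loop can be tried on its own, without running Tesseract. A small illustration, where the filename output-03.jpg is only an example value:

```shell
img="output-03.jpg"

# "%.jpg" strips the shortest match of ".jpg" from the end of the value.
base="${img%.jpg}"

echo "$base"        # prints: output-03
echo "$base.txt"    # prints: output-03.txt (Tesseract appends .txt itself)
```

Because Tesseract adds the .txt extension to the output base name, passing the stripped name avoids results such as output-03.jpg.txt.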

Combine all text files

Merge the text from all processed pages into a single file. This is useful for assembling the full content of the PDF.

cat output-*.txt > complete_text.txt

If the page numbers are not zero-padded, the shell glob lists output-10.txt before output-2.txt. In that case, sort the names by version number before merging:

ls output-*.txt | sort -V | xargs cat > complete_text.txt
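
The ls pipeline above splits names on whitespace, so it can misbehave when filenames contain spaces. One possible variant passes NUL-delimited names instead. This sketch assumes GNU sort (-z, -V) and GNU xargs (-0); the function name merge_pages and its default arguments are hypothetical.

```shell
# merge_pages is a hypothetical helper name.
# Usage: merge_pages [prefix] [output_file]
merge_pages() {
    local prefix="${1:-output}"
    local out="${2:-complete_text.txt}"
    # printf emits each matching name followed by a NUL byte,
    # sort -zV orders them by version number (1, 2, ..., 10),
    # and xargs -0 hands the names to cat in that order.
    printf '%s\0' "$prefix"-*.txt | sort -zV | xargs -0 cat > "$out"
}

# Example usage, run in the directory that holds the page files:
#   merge_pages output complete_text.txt
```

Choose an output name that does not match the prefix-*.txt pattern, otherwise a second run would merge the previous result into itself.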

Troubleshooting and Tips

If Tesseract doesn't recognize text:

  • Ensure the images have sufficient quality and resolution. Use -r with pdftoppm to increase the DPI (e.g., -r 300 for 300 DPI).
  • Install the language pack that matches the document's language or script. Tesseract can also combine several languages in one run for mixed-language pages.
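
For pages that mix languages, Tesseract accepts several language codes joined with + in the -l option (e.g., fra+eng). A minimal sketch of building that value from a list of codes; join_langs is a hypothetical helper, not a Tesseract feature.

```shell
# join_langs is a hypothetical helper: joins its arguments with "+".
join_langs() {
    local IFS='+'   # "$*" joins arguments with the first character of IFS
    echo "$*"
}

join_langs fra eng    # prints: fra+eng

# Example usage (requires both language packs to be installed):
#   tesseract output-1.jpg output-text -l "$(join_langs fra eng)"
```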

Using pdftotext for simpler PDFs:
If the PDF contains selectable text (not just images), pdftotext from poppler-utils can extract text directly without OCR.

pdftotext your_file.pdf output.txt
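
Whether a PDF has a usable text layer can be tested before choosing between pdftotext and the OCR steps. This is a rough heuristic sketch rather than a definitive check: it treats any letter or digit in the pdftotext output as evidence of selectable text. The function name has_text_layer is hypothetical.

```shell
# has_text_layer is a hypothetical helper name.
# Passing "-" as the output file makes pdftotext write to standard output.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example usage (requires poppler-utils):
#   if has_text_layer your_file.pdf; then
#       pdftotext your_file.pdf output.txt
#   else
#       echo "No text layer found; use the OCR steps instead."
#   fi
```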

ImageMagick note: in ImageMagick 7 and later, use magick in place of convert (e.g., magick -density 300 your_file.pdf output-%04d.jpg).

Verify language codes for Tesseract:
You can find a list of supported languages on the Tesseract GitHub page.