Fabrice Quenneville b8b8817cd4 feat(notes): add guide for OCR and PDF manipulation on Linux

- Added instructions for setting up Tesseract with language support.
- Documented steps for converting PDFs to images using `pdftoppm` and alternatives like `ImageMagick`.
- Included examples for single and multi-page OCR processing.
- Detailed methods for merging extracted text into a single file.
- Added troubleshooting tips for improving OCR results and handling selectable PDFs with `pdftotext`.

2024-12-05 16:09:04 -05:00

2.9 KiB

Raw Blame History

PDF

PDF
- Table of Contents
- OCR

OCR

Install necessary tools Update the package list and install Tesseract with support for the desired language. Replace fra with your desired language code (e.g., eng for English).

apt update
apt install tesseract-ocr tesseract-ocr-fra

Verify the installation

Check the installed version of Tesseract and list available languages to ensure your chosen language is installed (e.g., fra for French).

tesseract --version
tesseract --list-langs

Install a utility to convert PDF pages into images

For PDFs that require OCR, you need a utility to convert PDF pages into images. Install poppler-utils, which includes pdftoppm.

apt install poppler-utils

Convert PDF to images

Convert the PDF into JPEG images, with each page saved as a separate file. Each page will be named sequentially (e.g., output-1.jpg, output-2.jpg, etc.).

pdftoppm -jpeg your_file.pdf output

Tip: Use a dedicated output directory to avoid overwriting existing files.
Alternative tools: If poppler-utils isn’t available, consider using ImageMagick:
```
convert -density 300 your_file.pdf output-%04d.jpg
```

Perform OCR on a single image

Run Tesseract OCR on an image to extract text. Specify the language using the -l option.

tesseract output-1.jpg output-text -l fra

The extracted text will be saved in output-text.txt.

Perform OCR on multiple images

For multi-page PDFs, process all images in a loop. This extracts text from each image and saves it to a separate .txt file.

for img in output-*.jpg; do
    tesseract "$img" "${img%.jpg}" -l fra
done

Note: The ${img%.jpg} syntax removes the .jpg extension, ensuring each .txt file matches its corresponding image.

Combine all text files

Merge the text from all processed pages into a single file. This is useful for assembling the full content of the PDF.

cat output-*.txt > complete_text.txt

If filenames are out of order, use a sorting approach before merging:

ls output-*.txt | sort -V | xargs cat > complete_text.txt

Troubleshooting and Tips

If Tesseract doesn’t recognize text:

Ensure the images have sufficient quality and resolution. Use -r with pdftoppm to increase the DPI (e.g., -r 300 for 300 DPI).
Try additional Tesseract language packs for better recognition of specific text styles.

Using pdftotext for simpler PDFs:
If the PDF contains selectable text (not just images), pdftotext from poppler-utils can extract text directly without OCR.

pdftotext your_file.pdf output.txt

Replace convert with magick for newer versions.

Verify language codes for Tesseract:
You can find a list of supported languages on the Tesseract GitHub page.

2.9 KiB Raw Blame History Unescape Escape

PDF

Table of Contents

OCR

2.9 KiB

Raw Blame History