diff --git a/notes/pdf.md b/notes/pdf.md
new file mode 100644
index 0000000..c8fc53b
--- /dev/null
+++ b/notes/pdf.md
@@ -0,0 +1,104 @@
+# PDF
+
+## Table of Contents
+
+- [PDF](#pdf)
+  - [Table of Contents](#table-of-contents)
+  - [OCR](#ocr)
+
+## OCR
+
+**Install necessary tools**
+Update the package list and install Tesseract together with the language pack you need. Replace `fra` with your language code (e.g., `eng` for English).
+
+```bash
+apt update
+apt install tesseract-ocr tesseract-ocr-fra
+```
+
+**Verify the installation**
+
+Check the installed version of Tesseract and list the available languages to confirm that your chosen language is installed (e.g., `fra` for French).
+
+```bash
+tesseract --version
+tesseract --list-langs
+```
+
+**Install a utility to convert PDF pages into images**
+
+Tesseract reads images, so a PDF that requires OCR must first be converted page by page. Install `poppler-utils`, which includes `pdftoppm`.
+
+```bash
+apt install poppler-utils
+```
+
+**Convert PDF to images**
+
+Convert the PDF into JPEG images, one file per page. Files are numbered sequentially (e.g., `output-1.jpg`, `output-2.jpg`); for longer documents `pdftoppm` may zero-pad the page number (e.g., `output-01.jpg`).
+
+```bash
+pdftoppm -jpeg your_file.pdf output
+```
+
+- **Tip**: Use a dedicated output directory to avoid overwriting existing files.
+- **Alternative tools**: If `poppler-utils` isn’t available, consider using `ImageMagick`:
+
+  ```bash
+  convert -density 300 your_file.pdf output-%04d.jpg
+  ```
+
+**Perform OCR on a single image**
+
+Run Tesseract on an image to extract its text. Specify the language with the `-l` option.
+
+```bash
+tesseract output-1.jpg output-text -l fra
+```
+
+The extracted text will be saved in `output-text.txt`.
+
+**Perform OCR on multiple images**
+
+For multi-page PDFs, process all images in a loop. This extracts text from each image and saves it to a separate `.txt` file.
+
+```bash
+for img in output-*.jpg; do
+  tesseract "$img" "${img%.jpg}" -l fra
+done
+```
+
+- **Note**: The `${img%.jpg}` expansion strips the `.jpg` extension, so each `.txt` file is named after its corresponding image.
+
+**Combine all text files**
+
+Merge the text from all processed pages into a single file. This is useful for assembling the full content of the PDF.
+
+```bash
+cat output-*.txt > complete_text.txt
+```
+
+If the glob does not return the pages in numeric order (for example, `output-10.txt` before `output-2.txt`), sort the names by version number before merging:
+
+```bash
+ls output-*.txt | sort -V | xargs cat > complete_text.txt
+```
+
+**Troubleshooting and Tips**
+
+**If Tesseract doesn’t recognize text**:
+
+- Ensure the images have sufficient quality and resolution. Use `-r` with `pdftoppm` to increase the DPI (e.g., `-r 300` for 300 DPI).
+- Make sure the language pack matching the document is installed and passed with `-l`; several languages can be combined (e.g., `-l fra+eng`).
+
+**Using `pdftotext` for simpler PDFs**:
+  If the PDF contains selectable text (not just images), `pdftotext` from `poppler-utils` can extract text directly without OCR.
+
+```bash
+pdftotext your_file.pdf output.txt
+```
+
+On ImageMagick 7 and later, use `magick` in place of `convert` in the ImageMagick command above.
+
+**Verify language codes for Tesseract**:
+  You can find a list of supported languages on the [Tesseract GitHub page](https://github.com/tesseract-ocr/tesseract).
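
The multi-page loop and the merge step described above can be wrapped in two small shell functions. This is only a minimal sketch, assuming Bash and a `sort` that supports `-V` (GNU coreutils); the function names `ocr_pages` and `merge_pages` and their arguments are hypothetical helpers, not part of Tesseract or poppler.

```bash
#!/usr/bin/env bash
# Sketch only: ocr_pages and merge_pages are illustrative helper names.

# OCR every <prefix>-*.jpg image; Tesseract writes <image name>.txt.
# Usage: ocr_pages <prefix> <lang>
ocr_pages() {
  local prefix=$1 lang=$2 img
  for img in "$prefix"-*.jpg; do
    [ -e "$img" ] || continue   # skip the literal pattern if nothing matched
    tesseract "$img" "${img%.jpg}" -l "$lang"
  done
}

# Concatenate <prefix>-*.txt into <outfile> in numeric page order.
# <outfile> must not itself match <prefix>-*.txt.
# Usage: merge_pages <prefix> <outfile>
merge_pages() {
  local prefix=$1 out=$2 f
  printf '%s\n' "$prefix"-*.txt | sort -V | while IFS= read -r f; do
    if [ -e "$f" ]; then cat "$f"; fi
  done > "$out"
}
```

For example, `ocr_pages output fra` followed by `merge_pages output complete_text.txt` would cover the loop and merge steps in one go. Reading names line by line avoids the word splitting that `ls | xargs` can introduce when a path contains spaces.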