scripts-fabq/notes/pdf.md
Fabrice Quenneville b8b8817cd4 feat(notes): add guide for OCR and PDF manipulation on Linux
- Added instructions for setting up Tesseract with language support.
- Documented steps for converting PDFs to images using `pdftoppm` and alternatives like `ImageMagick`.
- Included examples for single and multi-page OCR processing.
- Detailed methods for merging extracted text into a single file.
- Added troubleshooting tips for improving OCR results and handling selectable PDFs with `pdftotext`.
2024-12-05 16:09:04 -05:00

# PDF
## Table of Contents
- [PDF](#pdf)
- [Table of Contents](#table-of-contents)
- [OCR](#ocr)
## OCR
**Install necessary tools**
Update the package list and install Tesseract with support for the desired language. Replace `fra` with your desired language code (e.g., `eng` for English).
```bash
apt update
apt install tesseract-ocr tesseract-ocr-fra
```
**Verify the installation**
Check the installed version of Tesseract and list available languages to ensure your chosen language is installed (e.g., `fra` for French).
```bash
tesseract --version
tesseract --list-langs
```
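The language check can also be scripted. The helper below is a minimal sketch, not part of the original notes: `has_tesseract_lang` is a hypothetical name, and it assumes `tesseract --list-langs` prints a header line followed by one language code per line.

```bash
# Hypothetical helper: succeeds if the given language code is installed.
has_tesseract_lang() {
    # Older Tesseract releases print the list on stderr, so merge both streams.
    # grep -x matches a whole line only, so "fr" does not match "fra".
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}
```

For example, `has_tesseract_lang fra || echo "fra is missing"` would report a missing French pack before any OCR is attempted.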
**Install a utility to convert PDF pages into images**
For PDFs that require OCR, you need a utility to convert PDF pages into images. Install `poppler-utils`, which includes `pdftoppm`.
```bash
apt install poppler-utils
```
**Convert PDF to images**
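Convert the PDF into JPEG images, with each page saved as a separate file. Pages are numbered sequentially after the `output` prefix (e.g., `output-1.jpg`, `output-2.jpg`, etc.). For documents with ten or more pages, `pdftoppm` zero-pads the page number (e.g., `output-01.jpg`).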
```bash
pdftoppm -jpeg your_file.pdf output
```
- **Tip**: Use a dedicated output directory to avoid overwriting existing files.
- **Alternative tools**: If `poppler-utils` isn't available, consider using `ImageMagick`:
```bash
convert -density 300 your_file.pdf output-%04d.jpg
```
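The tip about a dedicated output directory can be sketched as a small wrapper around `pdftoppm`. The function name `pdf_to_images`, the default directory `pages`, and the 300 DPI default are assumptions made for illustration.

```bash
# Hypothetical helper: render each page of a PDF into its own directory.
# Usage: pdf_to_images your_file.pdf [out_dir] [dpi]
pdf_to_images() {
    local pdf="$1" out_dir="${2:-pages}" dpi="${3:-300}"
    # Create the directory first so existing files elsewhere are untouched.
    mkdir -p "$out_dir" || return 1
    # -r sets the resolution in DPI; files are written as <out_dir>/output-<page>.jpg
    pdftoppm -jpeg -r "$dpi" "$pdf" "$out_dir/output"
}
```

With this layout, the later OCR and merge steps would operate on `pages/output-*.jpg` and `pages/output-*.txt` instead of the current directory.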
**Perform OCR on a single image**
Run Tesseract OCR on an image to extract text. Specify the language using the `-l` option.
```bash
tesseract output-1.jpg output-text -l fra
```
The extracted text will be saved in `output-text.txt`.
**Perform OCR on multiple images**
For multi-page PDFs, process all images in a loop. This extracts text from each image and saves it to a separate `.txt` file.
```bash
for img in output-*.jpg; do
    tesseract "$img" "${img%.jpg}" -l fra
done
```
- **Note**: The `${img%.jpg}` syntax removes the `.jpg` extension, ensuring each `.txt` file matches its corresponding image.
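To see what the expansion does without running Tesseract, it can be tried on a sample value. The path `pages/output-03.jpg` below is only an illustrative name.

```bash
img="pages/output-03.jpg"
base="${img%.jpg}"      # remove the shortest suffix matching ".jpg"
echo "$base"            # pages/output-03
echo "${base}.txt"      # pages/output-03.txt, the file Tesseract would write
```

Tesseract appends `.txt` to the output base itself, which is why the loop passes the name without an extension.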
**Combine all text files**
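Merge the text from all processed pages into a single file. This is useful for assembling the full content of the PDF. If you ran the single-image example earlier, delete `output-text.txt` first, because it also matches the `output-*.txt` pattern and would be merged along with the pages.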
```bash
cat output-*.txt > complete_text.txt
```
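The shell expands `output-*.txt` in lexical order, so `output-10.txt` comes before `output-2.txt` when page numbers are not zero-padded. If the pages end up out of order, sort the filenames by version number before merging: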
```bash
ls output-*.txt | sort -V | xargs cat > complete_text.txt
```
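A variant that avoids parsing `ls` output is sketched below. The function name `merge_pages` is hypothetical, and the `-z` and `-V` options assume GNU `sort`.

```bash
# Hypothetical helper: concatenate per-page text files in page order.
# Usage: merge_pages <dir> <merged_file>
merge_pages() {
    local dir="$1" merged="$2"
    # NUL-delimited names survive spaces in paths;
    # sort -V compares embedded numbers, so page 2 comes before page 10.
    printf '%s\0' "$dir"/output-*.txt | sort -zV | xargs -0 cat > "$merged"
}
```

For example, `merge_pages . complete_text.txt` would merge the page files in the current directory. The merged file should not itself match `output-*.txt`, or it would be picked up on a later run.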
**Troubleshooting and Tips**
**If Tesseract doesn't recognize text**:
- Ensure the images have sufficient quality and resolution. `pdftoppm` renders at 150 DPI by default; use `-r` to increase it (e.g., `-r 300` for 300 DPI).
- Make sure the language pack matches the language of the document. For mixed-language text, install each pack and combine the codes with `+` (e.g., `-l fra+eng`).
**Using `pdftotext` for simpler PDFs**:
If the PDF contains selectable text (not just images), `pdftotext` from `poppler-utils` can extract text directly without OCR.
```bash
pdftotext your_file.pdf output.txt
```
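Whether a PDF has selectable text can be tested before deciding on OCR. The sketch below uses a hypothetical helper named `has_text_layer`. It treats any non-whitespace output from `pdftotext` as evidence of a text layer, which is a simplifying assumption: a scanned PDF with a poor embedded text layer would also pass.

```bash
# Hypothetical helper: succeeds if pdftotext extracts any non-whitespace text.
has_text_layer() {
    # "-" as the output name sends the text to stdout instead of a file.
    # tr removes whitespace, including the form feeds emitted between pages.
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

For example, `has_text_layer your_file.pdf && pdftotext your_file.pdf output.txt` would skip the image conversion and OCR steps when direct extraction is likely to work.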
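**Using ImageMagick 7 or later**:
In the ImageMagick command shown earlier, replace `convert` with `magick`. Version 7 keeps `convert` only as a legacy command.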
**Verify language codes for Tesseract**:
You can find a list of supported languages on the [Tesseract GitHub page](https://github.com/tesseract-ocr/tesseract).