feat(notes): add guide for OCR and PDF manipulation on Linux

- Added instructions for setting up Tesseract with language support.
- Documented steps for converting PDFs to images using `pdftoppm` and alternatives like `ImageMagick`.
- Included examples for single and multi-page OCR processing.
- Detailed methods for merging extracted text into a single file.
- Added troubleshooting tips for improving OCR results and handling selectable PDFs with `pdftotext`.
parent
115eec5c62
commit
b8b8817cd4
104
notes/pdf.md
Normal file
@ -0,0 +1,104 @@
# PDF

## Table of Contents

- [PDF](#pdf)
  - [Table of Contents](#table-of-contents)
  - [OCR](#ocr)

## OCR

**Install necessary tools**

Update the package list and install Tesseract together with the language data you need. The commands assume a Debian-based system and a root shell; prefix them with `sudo` otherwise. Replace `fra` with the code of your language (e.g., `eng` for English).

```bash
apt update
apt install tesseract-ocr tesseract-ocr-fra
```

**Verify the installation**

Check the installed Tesseract version, then list the available languages to confirm that your chosen language (e.g., `fra` for French) appears.

```bash
tesseract --version
tesseract --list-langs
```
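
To make the same check from a script rather than by eye, the language list can be filtered with `grep`. This is a minimal sketch: the function name `has_tess_lang` is made up for illustration, and the `2>&1` is there on the assumption that some older Tesseract releases print the list to stderr.

```bash
# has_tess_lang is a hypothetical helper, not part of Tesseract.
# It succeeds if the given language code is installed.
has_tess_lang() {
  # --list-langs prints a header line, then one language code per line.
  # grep -x only accepts a whole-line match, so "fra" does not match "frak".
  tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, `has_tess_lang fra || echo "fra is not installed" >&2` prints a warning when the French data is missing.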

**Install a utility to convert PDF pages into images**

Tesseract reads images, not PDFs, so a scanned PDF must first be converted into one image per page. Install `poppler-utils`, which includes `pdftoppm`.

```bash
apt install poppler-utils
```
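
Before starting a long OCR run, it can help to confirm that every tool used in this guide is on the `PATH`. The sketch below uses only the shell builtin `command -v`; the helper name `require_cmds` is hypothetical.

```bash
# require_cmds is a hypothetical helper: it reports every missing command
# on stderr and returns non-zero if at least one is absent.
require_cmds() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `require_cmds tesseract pdftoppm pdftotext`.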

**Convert PDF to images**

Convert the PDF into JPEG images, one file per page. The files are numbered sequentially (e.g., `output-1.jpg`, `output-2.jpg`); `pdftoppm` zero-pads the number when the document has ten or more pages (e.g., `output-01.jpg`).

```bash
pdftoppm -jpeg your_file.pdf output
```

- **Tip**: Use a dedicated output directory to avoid overwriting existing files.
- **Alternative tools**: If `poppler-utils` isn’t available, consider using `ImageMagick`:

  ```bash
  convert -density 300 your_file.pdf output-%04d.jpg
  ```
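
The tip about a dedicated output directory could look like the sketch below. The helper name `pdf_to_images`, the default directory `pages`, and the 300 DPI default are all assumptions chosen for illustration; only the `pdftoppm` flags `-jpeg` and `-r` are taken from the tool itself.

```bash
# pdf_to_images is a hypothetical helper.
# Usage: pdf_to_images your_file.pdf [output_dir] [dpi]
pdf_to_images() {
  local pdf="$1" outdir="${2:-pages}" dpi="${3:-300}"
  mkdir -p "$outdir" || return 1
  # Writes "$outdir"/output-<n>.jpg, one file per page.
  pdftoppm -jpeg -r "$dpi" "$pdf" "$outdir/output"
}
```

After `pdf_to_images your_file.pdf`, change into `pages` before running the OCR commands, since they expect the `output-*.jpg` files in the current directory.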

**Perform OCR on a single image**

Run Tesseract on one image to extract its text. The second argument is the output base name, to which Tesseract appends `.txt`; the `-l` option sets the language.

```bash
tesseract output-1.jpg single_page -l fra
```

The extracted text is saved in `single_page.txt`. The base name deliberately does not start with `output-`, so this file is not picked up by the `output-*.txt` pattern used when merging pages later.
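
To judge the OCR quality of a page without creating a file, Tesseract can print to the terminal when the output base is the special name `stdout`. The following is a small sketch; the helper name `ocr_preview` and its defaults (`fra`, 20 lines) are assumptions.

```bash
# ocr_preview is a hypothetical helper.
# Usage: ocr_preview image [lang] [lines]
ocr_preview() {
  local img="$1" lang="${2:-fra}" lines="${3:-20}"
  # "stdout" as the output base makes Tesseract print the text instead of
  # writing a .txt file; messages on stderr are discarded.
  tesseract "$img" stdout -l "$lang" 2>/dev/null | head -n "$lines"
}
```

For example, `ocr_preview output-1.jpg eng 5` shows the first five recognized lines of the first page.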

**Perform OCR on multiple images**

For multi-page PDFs, process all images in a loop. Each image is processed separately and its text is saved to a `.txt` file with the same base name.

```bash
for img in output-*.jpg; do
  tesseract "$img" "${img%.jpg}" -l fra
done
```

- **Note**: The `${img%.jpg}` expansion strips the `.jpg` suffix, so each `.txt` file is named after its image (e.g., `output-1.jpg` produces `output-1.txt`).
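
For long documents, a slightly more defensive variant of the loop can be useful. The sketch below skips the case where no image matches and does not redo pages that already have a text file; the helper name `ocr_pages` and the skip rule are assumptions, not part of the original workflow.

```bash
# ocr_pages is a hypothetical helper.
# Usage: ocr_pages [lang]   (run in the directory holding output-*.jpg)
ocr_pages() {
  local lang="${1:-fra}" img base
  for img in output-*.jpg; do
    # Without matches the pattern stays literal; skip that case.
    [ -e "$img" ] || continue
    base="${img%.jpg}"   # output-1.jpg -> output-1
    # Skip pages that already have a non-empty text file.
    [ -s "$base.txt" ] && continue
    tesseract "$img" "$base" -l "$lang" || echo "OCR failed: $img" >&2
  done
}
```

Because finished pages are skipped, an interrupted run can be resumed by calling `ocr_pages` again.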

**Combine all text files**

Merge the text of all processed pages into a single file to reassemble the full content of the PDF.

```bash
cat output-*.txt > complete_text.txt
```

The shell expands `output-*.txt` in lexical order, so unpadded names end up out of order (`output-10.txt` before `output-2.txt`). In that case, sort by version number before merging:

```bash
ls output-*.txt | sort -V | xargs cat > complete_text.txt
```
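
When proofreading the merged text, it can help to see where each page starts. The sketch below keeps the version sort from above and adds a header line per page; the helper name `merge_pages` and the `=====` header format are assumptions, and `sort -V` requires GNU sort.

```bash
# merge_pages is a hypothetical helper.
# Usage: merge_pages [target]   (run in the directory holding output-*.txt)
merge_pages() {
  local target="${1:-complete_text.txt}" f
  # sort -V orders output-2.txt before output-10.txt.
  printf '%s\n' output-*.txt | sort -V | while IFS= read -r f; do
    printf '===== %s =====\n' "$f"
    cat "$f"
  done > "$target"
}
```

The target name must not itself match `output-*.txt`, otherwise the merged file would be read as one of its own inputs.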

**Troubleshooting and Tips**

**If Tesseract doesn’t recognize text**:

- Ensure the images have sufficient quality and resolution. Use `-r` with `pdftoppm` to set the DPI (e.g., `-r 300` for 300 DPI); the default is 150 DPI.
- Make sure the installed language pack matches the document. For mixed-language text, install the additional packs and combine them with `+` (e.g., `-l fra+eng`).
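
Both tips can be combined into a second pass over a difficult document. The sketch below re-renders the pages at a higher resolution and runs OCR with one or more languages. The helper name `reocr_hires`, the `hires` file prefix, and the choice of grayscale PNG are assumptions; whether grayscale helps depends on the scan.

```bash
# reocr_hires is a hypothetical helper.
# Usage: reocr_hires your_file.pdf [langs] [dpi]
reocr_hires() {
  local pdf="$1" langs="${2:-fra}" dpi="${3:-300}" img
  # -gray renders grayscale; -png avoids JPEG compression artifacts.
  pdftoppm -r "$dpi" -gray -png "$pdf" hires || return 1
  for img in hires-*.png; do
    [ -e "$img" ] || continue
    # Several language packs can be combined with "+", e.g. fra+eng.
    tesseract "$img" "${img%.png}" -l "$langs"
  done
}
```

For example, `reocr_hires your_file.pdf fra+eng 400` writes `hires-*.png` and the matching `hires-*.txt` files next to the earlier output.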

**Using `pdftotext` for simpler PDFs**:

If the PDF contains selectable text (not just images), `pdftotext` from `poppler-utils` can extract text directly without OCR.

```bash
pdftotext your_file.pdf output.txt
```
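
To decide between `pdftotext` and OCR automatically, one option is to test whether `pdftotext` returns any visible characters. This is a rough heuristic, not a guarantee: a PDF may have a text layer on some pages only. The helper name `has_text_layer` is hypothetical.

```bash
# has_text_layer is a hypothetical helper.
# It succeeds if pdftotext finds any non-whitespace text in the PDF.
has_text_layer() {
  # "-" as the output file sends the extracted text to stdout.
  pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}

# Example use (commented out so the block only defines the helper):
# if has_text_layer your_file.pdf; then
#   pdftotext your_file.pdf output.txt
# else
#   echo "No text layer found; fall back to OCR." >&2
# fi
```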

For the ImageMagick alternative, replace `convert` with `magick` on ImageMagick 7 and later.

**Verify language codes for Tesseract**:

The supported languages and their codes are listed on the [Tesseract GitHub page](https://github.com/tesseract-ocr/tesseract).