Steps to Repair and OCR a Scanned or Corrupted PDF in Ubuntu
Step 1: Clean or Repair the PDF
Use Ghostscript to rebuild damaged cross-reference tables and fix malformed PDF structure.
sudo apt install ghostscript
gs -o fixed.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dNOPAUSE -dBATCH "input.pdf"
What it does:
- Repairs broken references (xref errors)
- Normalizes streams and compression
- Outputs a clean, standards-compliant PDF (fixed.pdf)
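If Ghostscript cannot fix the file, try qpdf. It has no separate repair flag; it attempts to recover a damaged file automatically when it reads it, so rewriting the file is enough:
sudo apt install qpdf
qpdf "input.pdf" fixed.pdf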
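The Ghostscript-then-qpdf fallback described above can be sketched as a small shell function. This is only a sketch under stated assumptions: `repair_pdf` is a hypothetical helper name, and `gs` and `qpdf` are assumed to be on the PATH. Ghostscript can exit successfully even when it had to discard damaged objects, so a zero exit status does not prove the repair was complete.

```shell
#!/usr/bin/env bash
# Hypothetical helper: repair a PDF with Ghostscript, falling back to qpdf.
# Prints the name of the tool that produced the output file.
repair_pdf() {
    local src="$1" dst="$2" rc=0

    # -o names the output file and already implies -dNOPAUSE -dBATCH.
    if gs -o "$dst" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$src" >/dev/null 2>&1; then
        echo "ghostscript"
        return 0
    fi

    # qpdf tries to recover damaged files on its own while reading them.
    # Exit status 3 means it finished with warnings but still wrote output.
    qpdf "$src" "$dst" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ] || [ "$rc" -eq 3 ]; then
        echo "qpdf"
        return 0
    fi

    echo "repair failed for $src" >&2
    return 1
}
```

Called as repair_pdf input.pdf fixed.pdf, it prints which tool succeeded, which is handy to log when processing many files.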
Step 2: Run OCR on the Cleaned PDF
Use OCRmyPDF to embed searchable text into the PDF.
sudo apt install ocrmypdf tesseract-ocr tesseract-ocr-eng tesseract-ocr-fil
ocrmypdf --jobs 4 --deskew --clean -l eng+fil fixed.pdf output_ocr.pdf
What it does:
- Performs OCR using Tesseract (English + Filipino)
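- Deskews pages and cleans them before OCR (--clean only affects the image passed to Tesseract; use --clean-final to keep the cleaned image in the output)
- Embeds a text layer for search and selection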
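Before running the OCR command, it may help to confirm that the requested Tesseract language packs are actually installed, since a missing pack makes the run fail. The following is a minimal sketch: `missing_langs` is a hypothetical helper name, and it assumes `tesseract --list-langs` prints one language code per line after a header line, as current Ubuntu packages do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each requested language code that Tesseract
# does not list as installed. Prints nothing when all are available.
missing_langs() {
    local have lang
    # Older Tesseract releases print the list on stderr, so capture both streams.
    have=$(tesseract --list-langs 2>&1)
    for lang in "$@"; do
        # -x matches the whole line, -F treats the code as a fixed string.
        printf '%s\n' "$have" | grep -qxF "$lang" || echo "$lang"
    done
}
```

For the command above, missing_langs eng fil would print fil if the tesseract-ocr-fil package has not been installed.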
If OCRmyPDF fails on rendering, use an alternate renderer:
ocrmypdf --pdf-renderer sandwich fixed.pdf output_ocr.pdf
If the PDF is too broken, or already contains an unusable text layer, force every page to be rasterized and OCRed:
ocrmypdf --force-ocr fixed.pdf output_ocr.pdf
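The retry logic above could be automated roughly as follows. This is a sketch rather than a definitive script: `ocr_pdf` is a hypothetical helper name, and it relies on OCRmyPDF exiting with status 6 when it refuses to process a file because a page already contains text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run OCRmyPDF, retrying with --force-ocr only when
# the first attempt stops because the input already has a text layer.
ocr_pdf() {
    local src="$1" dst="$2" rc=0

    ocrmypdf --jobs 4 --deskew --clean -l eng+fil "$src" "$dst" || rc=$?

    # Exit status 6: prior OCR or existing text was found on a page.
    if [ "$rc" -eq 6 ]; then
        rc=0
        # --force-ocr rasterizes every page and replaces any existing text.
        ocrmypdf --jobs 4 --deskew --clean -l eng+fil --force-ocr "$src" "$dst" || rc=$?
    fi

    return "$rc"
}
```

Limiting the retry to status 6 is a deliberate choice: --force-ocr discards existing text and can enlarge the file, so it is not applied to failures it cannot fix, such as a missing input file.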
Step 3: Verify OCR Success
Check that text extraction works (pdftotext is provided by the poppler-utils package):
pdftotext output_ocr.pdf - | head
If you see readable text, the OCR worked successfully.
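For batch jobs, the visual check can be approximated by counting extracted words. This is a rough sketch: `has_text_layer` is a hypothetical helper name and the default threshold of 20 words is an arbitrary assumption. A word count cannot tell whether the recognized text is accurate, only that a text layer exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when pdftotext extracts at least
# min_words words from the PDF (default 20, an arbitrary threshold).
has_text_layer() {
    local pdf="$1" min_words="${2:-20}" words

    # Strip whitespace because some wc implementations pad the count.
    words=$(pdftotext "$pdf" - 2>/dev/null | wc -w | tr -d '[:space:]') || words=0

    [ "${words:-0}" -ge "$min_words" ]
}
```

A call such as has_text_layer output_ocr.pdf && echo "OCR OK" fits naturally at the end of a processing loop.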
Summary Workflow
| Step | Tool | Command | Purpose |
|---|---|---|---|
| 1 | Ghostscript | gs -o fixed.pdf -sDEVICE=pdfwrite ... | Clean and repair corrupted PDF |
| 2 | QPDF | qpdf input.pdf fixed.pdf | Alternate PDF repair if Ghostscript fails |
| 3 | OCRmyPDF | ocrmypdf --jobs 4 --deskew --clean fixed.pdf output_ocr.pdf | Add searchable text layer |
| 4 | Verify | pdftotext output_ocr.pdf - \| head | Confirm OCR success |