Convert a PDF via OCR to a TXT file in UTF-8 encoding
Given one or more PDFs that may include text-as-image content, use OCR (Optical Character Recognition) to convert the content to TXT files (in UTF-8 encoding).
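The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: it assumes the pdf2image and pytesseract packages are installed (along with Poppler and Tesseract themselves), and the function names pages_to_text and ocr_pdf_to_txt are hypothetical.

```python
from pathlib import Path


def pages_to_text(page_texts):
    """Join per-page OCR results into one string, separated by blank lines."""
    return "\n\n".join(text.strip() for text in page_texts)


def ocr_pdf_to_txt(pdf_path, txt_path, dpi=300):
    """Rasterize each PDF page, OCR it, and write the result as UTF-8.

    Hypothetical helper; assumes pdf2image and pytesseract are available.
    """
    # Imported here so pages_to_text works without the OCR stack installed.
    from pdf2image import convert_from_path  # requires Poppler on the PATH
    import pytesseract  # requires the Tesseract binary on the PATH

    images = convert_from_path(str(pdf_path), dpi=dpi)
    page_texts = [pytesseract.image_to_string(image) for image in images]
    Path(txt_path).write_text(pages_to_text(page_texts), encoding="utf-8")
    return len(images)
```

Rasterizing every page first means text-as-image content and ordinary text pages go through the same OCR path, at the cost of speed.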
A survey of existing PDF-to-TXT solutions found none that meet all of the following criteria:
ocr (e.g., C:\Users\mark\Desktop\ocr)
C:\Program Files (x86)\Tesseract-OCR or C:\Program Files\Tesseract-OCR. Move this folder into your equivalent of C:\Users\mark\Desktop\ocr, so that it is now located at Desktop\ocr\Tesseract-OCR.
Desktop\ocr\poppler-0.68.0_x86).
C:\Users\mark\Desktop\ocr\Tesseract-OCR) and press OK.
C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin and press OK.
Desktop\ocr).
cmd.exe terminal, and navigate to the folder via the command line (e.g., cd Desktop\ocr\ocr2text-master)
pip install --user --requirement requirements.txt
echo %PATH%. The output must include your equivalent of C:\Users\mark\Desktop\ocr\Tesseract-OCR and C:\Users\mark\Desktop\ocr\poppler-0.68.0_x86\poppler-0.68.0\bin for the script to work.
ocr (i.e., /Users/mark/Desktop/ocr)
(sudo port install tesseract) or Homebrew (brew install tesseract)
/Users/mark/Desktop/ocr).
cd /Users/mark/Desktop/ocr/ocr2text)
pip install --user --requirement requirements.txt
sudo apt-get install tesseract-ocr
pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
pip install --user --requirement requirements.txt
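Whichever platform you are on, the script needs the Tesseract and Poppler command-line tools to be reachable through the PATH. The following is a minimal sketch of a check you could run before the first conversion. It is not part of the script, and the names REQUIRED_TOOLS and missing_tools are hypothetical.

```python
import shutil

# Executables mentioned in the setup steps: Tesseract plus two Poppler tools.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "pdftocairo")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

On Windows, shutil.which also tries the extensions listed in PATHEXT, so tesseract matches tesseract.exe.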
If you have successfully completed the setup steps and are using Python version 3, usage should now be a breeze:
On the command line, navigate to the directory where you downloaded the script and run:
python ocr2text.py
You will see the following:
********************************
*** PDF to TXT file, via OCR ***
********************************
Indicate file or folder of source PDF(s) []:
(Press [Enter] for current working directory)
Enter the full path to the file or directory to convert.
Destination folder for TXT []:
(Press [Enter] for current working directory)
Enter the full path to the directory where the result file(s) should be written.
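The two prompts above accept either a path or an empty answer that falls back to the current working directory. One way this could be handled is sketched below. This is an assumption about the behaviour, not the script's actual code, and the helper names resolve_dir_or_cwd, collect_pdfs and txt_destination are hypothetical.

```python
from pathlib import Path


def resolve_dir_or_cwd(answer):
    """Treat an empty answer (just [Enter]) as the current working directory."""
    answer = answer.strip()
    return Path(answer) if answer else Path.cwd()


def collect_pdfs(source):
    """Return the PDF(s) named by source, which may be a file or a folder."""
    source = Path(source)
    if source.is_file():
        return [source] if source.suffix.lower() == ".pdf" else []
    # Folder: take the PDFs directly inside it (no recursion in this sketch).
    return sorted(
        path for path in source.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def txt_destination(pdf_path, dest_dir):
    """Build the output path: same base name, .txt extension, in dest_dir."""
    return Path(dest_dir) / (Path(pdf_path).stem + ".txt")
```

Matching the extension case-insensitively means files such as SCAN.PDF are picked up as well.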
The script will now convert the PDF via OCR into a plaintext file:
For testing purposes, a test_files directory is included. You can press [Enter] for the source and destination directories and verify that the image.pdf file is converted. It will also be located in the test_files directory:
Converted C:\Users\mark\ocr2text\image.pdf
Percent: [##########] 100%
1 file converted
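A status line in this format can be produced with a small helper. The sketch below only mirrors the sample output shown above and is not the script's actual implementation. The function names, the "-" used for the unfilled part of the bar, and the plural wording are all assumptions.

```python
def render_progress(done, total, width=10):
    """Format a text progress bar such as 'Percent: [#####-----] 50%'."""
    # An empty batch is treated as complete to avoid dividing by zero.
    fraction = done / total if total > 0 else 1.0
    fraction = min(max(fraction, 0.0), 1.0)
    filled = int(fraction * width)
    bar = "#" * filled + "-" * (width - filled)
    return "Percent: [{}] {}%".format(bar, int(fraction * 100))


def summary_line(count):
    """Format the closing summary; the plural form is an assumption."""
    return "{} file{} converted".format(count, "" if count == 1 else "s")


if __name__ == "__main__":
    print(render_progress(1, 1))
    print(summary_line(1))
```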