How to extract text from an image in Python
One of the fastest ways to do so is to use the pytesseract library. It's a Python wrapper for Google's Tesseract-OCR engine that makes it easy to recognize text in an image. We might also need pillow, a fork of the Python Imaging Library.
So let's go step by step!
Step 1. Installing dependencies
Installing Tesseract-OCR and related libraries (example for Ubuntu 18.04+ users):
apt install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
Small bonus for Windows:
- Open this link
- Select the version of the exe you would like to install (I'm currently using 4.0.0-beta.1, but it's OK to select something newer. I do not recommend installing versions 3 and lower, because they support fewer languages)
- Install it and add it to your PATH
- Well done!
For Python we'll need to install the libraries mentioned at the beginning via pip (or pipenv if you prefer):
pip install pillow pytesseract
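Before writing the script, it can help to confirm that the tesseract binary is actually reachable, since pytesseract only calls that executable under the hood. Here is a minimal sketch that uses only the standard library; the helper name find_tesseract is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    # shutil.which searches PATH the same way the shell does
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print(f"tesseract found at {path}")
```

If the binary lives outside PATH (which can happen on Windows), pytesseract lets you point to it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.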
Step 2. Creating script
# Importing libs
from PIL import Image
import pytesseract
# Transforming image to string and printing it!
print(pytesseract.image_to_string(Image.open('test.png')))
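The script above hard-codes test.png. If you want to reuse it for other files, one possible variation is to wrap the call in a function that checks the path first. This is only a sketch: the function name image_to_text is hypothetical, and passing lang="eng" is an assumption that the English language data is installed.

```python
from pathlib import Path


def image_to_text(image_path, lang="eng"):
    """OCR one image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        # Fail early with a clear message instead of an error from inside Pillow
        raise FileNotFoundError(f"No such image: {path}")
    # Imported here so the path check works even before the OCR libraries are installed
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling image_to_text('test.png') should then give the same text as the script above.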
Step 3. Run!
I will use this image for tests. I called it test.png, as in our script. And now, run the script!
python our_script.py
Expected result:
Well, that's it! You are breathtaking!
Small bonus
No pillows
We can do pretty much the same thing without the pillow library, but you will be restricted to the formats that pytesseract supports:
import pytesseract
print(pytesseract.image_to_string('test.png'))
Boxes, confidences, line and page numbers
If you need more detailed data from the image, such as bounding boxes, confidences, and line and page numbers, you can use this:
from PIL import Image
import pytesseract
print(pytesseract.image_to_data(Image.open('test.png')))
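By default image_to_data returns one tab-separated string, with a header row followed by one row per page, block, line, and word. Here is a rough sketch of how that string could be filtered down to the words Tesseract is reasonably sure about. The helper name confident_words and the threshold of 60 are my own assumptions, and the sample string is hand-written in the same column layout rather than taken from a real OCR run.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, conf, left, top, width, height) tuples for words at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or "-1")
        # Rows describing pages, blocks, and lines have conf == -1 and no text
        if text and conf >= min_conf:
            box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
            words.append((text, conf) + box)
    return words


# A hand-written sample in the same column layout, used instead of a real OCR run
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t70\t24\t41\tw0rld\n"
)
print(confident_words(sample))
```

With real data you would pass in the string returned by pytesseract.image_to_data(Image.open('test.png')) instead of the sample.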
Convert to pdf
import pytesseract
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
# Write the PDF bytes to a file
with open('test.pdf', 'w+b') as f:
    f.write(pdf)