How to extract text from an image in Python
One of the fastest ways to do so is to use the pytesseract library. It's a Python wrapper for Google's Tesseract-OCR engine that makes it easy to recognize text in an image. We might also need Pillow, a fork of the Python Imaging Library (PIL), to open image files.
So let's go step by step!
Step 1. Installing dependencies
Installing Tesseract-OCR and related libraries (example for Ubuntu 18.04+ users):
apt install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
Small bonus for Windows:
- Open this link
- Select the version of the exe you would like to install (I'm currently using 4.0.0-beta.1, but it's fine to select something newer. I do not recommend installing versions 3 and lower, because they support fewer languages)
- Install and add it to your
- Well done!
For Python, we'll need to install the libraries mentioned at the beginning via pip (or pipenv if you prefer):
pip install pillow pytesseract
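Before moving on, it can help to confirm that everything from this step is actually visible to Python. Here is a minimal sketch using only the standard library. The helper name check_setup is my own and not part of pytesseract, and it assumes the Tesseract executable is called tesseract and that the packages import as pytesseract and PIL.

```python
import importlib.util
import shutil


def check_setup(binary="tesseract", modules=("pytesseract", "PIL")):
    """Report which pieces of the OCR setup can be found.

    check_setup is a hypothetical helper written for this tutorial.
    """
    # shutil.which returns the full path of the executable, or None
    report = {binary: shutil.which(binary) is not None}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


for name, found in check_setup().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If tesseract shows up as missing even though it is installed (a common case on Windows), pytesseract also lets you set the path to the executable explicitly through pytesseract.pytesseract.tesseract_cmd.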
Step 2. Creating script
# Importing libs
from PIL import Image
import pytesseract

# Transforming image to string and printing it!
print(pytesseract.image_to_string(Image.open('test.png')))
Step 3. Run!
I will use a test image saved as test.png, the same name as in our script. And now, run the script!
Well, that's it! You are breathtaking!
We can do pretty much the same thing without the pillow library, but you will be restricted to the formats pytesseract supports:
import pytesseract

print(pytesseract.image_to_string('test.png'))
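image_to_string also accepts optional lang and config arguments, which is handy when the defaults give poor results. Below is a rough sketch, not the one true way to do it. The helpers build_config and read_text are hypothetical names of mine, and the --psm / --oem flags assume Tesseract 4 or newer and that the language data you ask for (eng here) is installed.

```python
def build_config(psm=3, oem=None):
    """Build a Tesseract option string such as '--oem 1 --psm 6'."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes run from 0 to 13")
    parts = []
    if oem is not None:
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


def read_text(path, lang="eng", psm=3):
    """OCR one image with an explicit language and page segmentation mode."""
    # Imported here so build_config works even without pytesseract installed
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(path), lang=lang, config=build_config(psm)
    )


print(build_config(psm=6))
# Example call (needs Tesseract, test.png and the 'eng' language data):
# print(read_text('test.png', lang='eng', psm=6))
```

Page segmentation mode 6 tells Tesseract to treat the image as a single uniform block of text, which often suits screenshots and cropped paragraphs.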
Boxes, confidences, line and page numbers
If you need more detailed data from the image, such as boxes, confidences, line and page numbers, you can use this one:
from PIL import Image
import pytesseract

print(pytesseract.image_to_data(Image.open('test.png')))
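By default image_to_data returns one tab-separated string with a header row, so you usually want to parse it before doing anything useful. Here is a small sketch that keeps only the words above a confidence threshold. The function name confident_words and the threshold of 60 are my own choices, and SAMPLE is hand-written to show the shape of the data, not real OCR output.

```python
import csv
import io

# Hand-written sample in the shape of image_to_data output (NOT real OCR output)
SAMPLE = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96\tHello",
    "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t41\tw0rld",
])


def confident_words(tsv, min_conf=60.0):
    """Return (text, conf, box) tuples for words at or above min_conf."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row["text"] or "").strip()
        conf = float(row["conf"])
        # Rows describing pages, blocks and lines carry conf -1 and no text
        if not text or conf < min_conf:
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append((text, conf, box))
    return words


print(confident_words(SAMPLE))
```

With a real image you would pass pytesseract.image_to_data(Image.open('test.png')) instead of SAMPLE. If you would rather skip the parsing, pytesseract can also return a dictionary via output_type=pytesseract.Output.DICT.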
Convert to pdf
import pytesseract

pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)
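The same function can produce hOCR (HTML with word positions) by passing extension='hocr'. Here is a hedged sketch that writes the result next to the source image. The helpers target_path and convert are hypothetical names of mine, and the sketch assumes you want the output file to share the image's name.

```python
from pathlib import Path


def target_path(image_path, extension="pdf"):
    """Output path next to the image, e.g. scans/test.png -> scans/test.pdf."""
    if extension not in ("pdf", "hocr"):
        raise ValueError("extension must be 'pdf' or 'hocr'")
    return Path(image_path).with_suffix("." + extension)


def convert(image_path, extension="pdf"):
    """Write the OCR result next to the source image and return its path."""
    import pytesseract  # imported lazily; needs Tesseract installed

    data = pytesseract.image_to_pdf_or_hocr(str(image_path), extension=extension)
    out = target_path(image_path, extension)
    out.write_bytes(data)  # the result is bytes for both pdf and hocr
    return out


print(target_path("test.png", "hocr"))
# Example call (needs Tesseract and test.png):
# convert("test.png", extension="hocr")
```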