Extract text using tesseract OCR

One of the fastest ways to do so is to use the pytesseract library. It's a Python wrapper for Google's Tesseract-OCR engine that makes it easy to recognize text in an image. We may also need Pillow, the maintained fork of the Python Imaging Library (PIL), to open the image.

So let's go step by step!

Step 1. Installing dependencies

Installing Tesseract-OCR and related libraries (example for Ubuntu 18.04+ users):

apt install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

Small bonus for Windows:

  1. Open this link
  2. Select the version of the exe you would like to install (I'm currently using 4.0.0-beta.1, but it's fine to pick something newer. I don't recommend installing version 3 or lower, because they support fewer languages)
  3. Install and add it to your PATH
  4. Well done!
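
If Python still can't find Tesseract after step 3, pytesseract lets you point at the executable directly through pytesseract.pytesseract.tesseract_cmd. Here is a minimal sketch of that idea. The install path below is only an assumption (adjust it to wherever your installer put the exe), and locate_tesseract and use_tesseract are hypothetical helper names, not part of pytesseract:

```python
import shutil
from pathlib import Path

# Assumed install location, adjust it to wherever your installer put the exe.
ASSUMED_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def locate_tesseract(fallback=ASSUMED_EXE):
    """Return the path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")  # found when the exe is on PATH
    if on_path:
        return on_path
    if Path(fallback).is_file():  # otherwise try the assumed location
        return str(fallback)
    return None


def use_tesseract(executable):
    """Tell pytesseract which executable to call."""
    import pytesseract  # imported here so locate_tesseract works without it

    pytesseract.pytesseract.tesseract_cmd = executable


print(locate_tesseract())
```

If this prints None, Tesseract is neither on PATH nor at the assumed location. Otherwise, pass the printed path to use_tesseract() before calling any OCR function.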

For Python, we'll need to install the libraries mentioned at the beginning via pip (or pipenv if you prefer):

pip install pillow pytesseract
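
To double-check that pip put both libraries where Python can see them, something like the sketch below should do. It only uses the standard library, and missing_modules is a hypothetical helper name. Note that pillow is imported as PIL:

```python
import importlib.util

# Import names, not pip names: the pillow package is imported as PIL.
REQUIRED_MODULES = ("PIL", "pytesseract")


def missing_modules(names=REQUIRED_MODULES):
    """Return the modules from `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("All set!" if not missing else f"Still missing: {', '.join(missing)}")
```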

Step 2. Creating script

# Importing libs
from PIL import Image
import pytesseract

# Transforming image to string and printing it!
print(pytesseract.image_to_string(Image.open('test.png')))
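
If you plan to reuse the one-liner, it can be wrapped in a small function. This is just a sketch: extract_text is a hypothetical name, and lang="eng" assumes the English language data is installed. It also checks that the file exists first, so a typo in the file name gives a clear error:

```python
from pathlib import Path


def extract_text(image_path, lang="eng"):
    """Run OCR on one image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")

    # Imported here so the path check above works even before the libs are installed
    from PIL import Image
    import pytesseract

    # The with-block closes the image file once OCR is done
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling extract_text('test.png') then returns the same string the script above prints.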

Step 3. Run!

I will use this image for the test:

[image: test.png]

I named it test.png, as in our script. Now run the script!

python our_script.py

Expected result:

[image: result]

Well, that's it! You are breathtaking!

Small bonus

No pillows

We can do pretty much the same thing without importing Pillow by passing the file path directly, but then you are restricted to the image formats Tesseract itself supports:

import pytesseract

print(pytesseract.image_to_string('test.png'))

Boxes, confidences, line and page numbers

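If you need more than plain text from the image (boxes, confidences, line and page numbers), you can use this one:

from PIL import Image
import pytesseract

# Prints a tab-separated table: one row per detected element, with its box, confidence and text
print(pytesseract.image_to_data(Image.open('test.png')))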
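
By default image_to_data returns one big tab-separated string, so you will probably want to parse it. Below is a rough sketch that does it with the standard csv module. SAMPLE_TSV is a hand-written sample in the same column layout (illustrative values, not real OCR output), and parse_words is a hypothetical helper name:

```python
import csv
import io

# Hand-written sample in the TSV layout of image_to_data
# (illustrative values, not real OCR output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t12\t60\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t80\t12\t70\t20\t91\tworld\n"
)


def parse_words(tsv_text, min_conf=0.0):
    """Return the recognized words with their boxes and confidences."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, paragraph and line rows have conf == -1 and no text
        if not text or conf < min_conf:
            continue
        box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
        words.append(
            {"text": text, "conf": conf, "line": int(row["line_num"]), "box": box}
        )
    return words


for word in parse_words(SAMPLE_TSV):
    print(word)
```

With real output, pass the string returned by image_to_data instead of SAMPLE_TSV. pytesseract can also hand back a dictionary if you call it with output_type=pytesseract.Output.DICT, which saves the parsing step.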

Convert to PDF

import pytesseract

# image_to_pdf_or_hocr returns bytes, so the output file is opened in binary mode
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)
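
The same function also produces hOCR if you pass extension='hocr'. Here is a small sketch that handles both cases and names the output file after the image. output_path_for and save_ocr_output are hypothetical helper names, and the OCR call itself still needs Tesseract installed:

```python
from pathlib import Path


def output_path_for(image_path, extension):
    """Swap the image suffix for the output one: test.png -> test.pdf."""
    return Path(image_path).with_suffix("." + extension)


def save_ocr_output(image_path, extension="pdf"):
    """Write a PDF or hOCR file next to the image and return its path."""
    import pytesseract  # imported lazily; needs Tesseract installed to run

    # image_to_pdf_or_hocr returns bytes for both 'pdf' and 'hocr'
    data = pytesseract.image_to_pdf_or_hocr(str(image_path), extension=extension)
    out_path = output_path_for(image_path, extension)
    out_path.write_bytes(data)
    return out_path


print(output_path_for("test.png", "hocr"))
```

With everything installed, save_ocr_output('test.png') writes test.pdf and save_ocr_output('test.png', 'hocr') writes test.hocr.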