Install Tesseract as the OCR Engine

This section provides a tutorial example on how to install Tesseract as the OCR engine on CentOS.

What Is Tesseract? Tesseract is an open source OCR (Optical Character Recognition) engine and command line program.

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some conversion to C++ in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

Here is what I did to install Tesseract on my CentOS computer.

1. Search for "tesseract" package information. It seems to be in the EPEL repository:

herong$ sudo dnf info tesseract
  Extra Packages for Enterprise Linux 8 - x86_64

  Available Packages
  Name         : tesseract
  Version      : 4.1.0
  Release      : 1.el8
  Architecture : x86_64
  Size         : 1.4 M
  Source       : tesseract-4.1.0-1.el8.src.rpm
  Repository   : epel
  Summary      : Raw OCR Engine
  URL          : https://github.com/tesseract-ocr/tesseract
  License      : ASL 2.0
  Description  : A commercial quality OCR engine originally developed
               : at HP between 1985 and 1995. In 1995, this engine was
               : among the top 3 evaluated by UNLV. It was open-sourced
               : by HP and UNLV in 2005.

2. Try to install it. The installation fails, because no enabled repository provides a required library, "liblept.so.5".

herong$ sudo dnf install tesseract

  Error:
   Problem: conflicting requests
    - nothing provides liblept.so.5()(64bit) needed
      by tesseract-4.1.0-1.el8.x86_64

3. Search for "liblept.so" on the Internet. I found that liblept.so is the shared library provided by the Leptonica package.
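As an alternative to the Internet search, dnf itself can usually report which package provides a missing capability. The following is a minimal sketch, not part of the original session. It assumes the repository id is "PowerTools" (later CentOS 8 releases use the lowercase id "powertools"), and find_provider is a hypothetical helper name.

```shell
#!/bin/sh
# Capability string copied from the dnf error message in step 2.
CAPABILITY='liblept.so.5()(64bit)'

# Hypothetical helper: ask dnf which package provides a capability.
# Assumption: the repository id is "PowerTools" on this system.
find_provider() {
    if ! command -v dnf >/dev/null 2>&1; then
        echo "dnf not found; skipping query for $1"
        return 0
    fi
    dnf --enablerepo=PowerTools repoquery --whatprovides "$1" \
        || echo "query failed; check the repository id with 'dnf repolist --all'"
}

find_provider "$CAPABILITY"
```

If the repository is reachable, the query should list the leptonica package found in the next step.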

4. Search for the "leptonica" package. I found it in the "PowerTools" repository and installed it from there:

herong$ sudo dnf --enablerepo=PowerTools install leptonica
  Installed:
    leptonica-1.76.0-2.el8.x86_64

5. Install Tesseract again. With Leptonica in place, the installation now succeeds:

herong$ sudo dnf install tesseract
  Installed:
    tesseract-4.1.0-1.el8.x86_64
    tesseract-langpack-eng-4.0.0-6.el8.noarch
    tesseract-tessdata-doc-4.0.0-6.el8.noarch

  Complete!
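Steps 1 to 5 can be condensed into a short script. This is a sketch under assumptions rather than the exact sequence used above: it assumes EPEL may not be enabled yet (so it installs epel-release first), it assumes the repository id "PowerTools", and it lets dnf pull in leptonica as a dependency instead of installing it separately. The run and install_tesseract functions are hypothetical helpers that only print the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical helper: print each command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

install_tesseract() {
    # Assumption: EPEL is not enabled yet. This step is harmless if it is.
    run sudo dnf install epel-release
    # Enable PowerTools for this transaction only, so that dnf can
    # resolve liblept.so.5 from the leptonica package.
    run sudo dnf --enablerepo=PowerTools install tesseract
    # Confirm that the binary is on the PATH.
    run tesseract --version
}

install_tesseract
```

Run it as is to review the commands, or with RUN=1 in the environment to execute them.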

6. Test the "tesseract" command:

herong$ tesseract --help
  Usage:
    tesseract --help | --help-extra | --version
    tesseract --list-langs
    tesseract imagename outputbase [options...] [configfile...]

  OCR options:
    -l LANG[+LANG]        Specify language(s) used for OCR.
  NOTE: These options must occur before any configfile.

  Single options:
    --help                Show this help message.
    --help-extra          Show extra help for advanced users.
    --version             Show version information.
    --list-langs          List available languages for tesseract engine.
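Based on the usage line "tesseract imagename outputbase [options...]" shown above, a first OCR run could look like the following sketch. The file names sample.png and sample are hypothetical, and ocr_image is an illustrative helper, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper: OCR one image into <outputbase>.txt using English.
ocr_image() {
    image="$1"
    outputbase="$2"
    if [ ! -f "$image" ]; then
        echo "image $image not found"
        return 1
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed"
        return 1
    fi
    # List the installed language packs, then recognize the text.
    tesseract --list-langs
    tesseract "$image" "$outputbase" -l eng
    # By default Tesseract writes the recognized text to <outputbase>.txt.
    cat "$outputbase.txt"
}

# sample.png is a placeholder; replace it with an image of your own.
ocr_image sample.png sample || echo "OCR example skipped"
```

The "eng" language is available here because the tesseract-langpack-eng package was installed together with Tesseract in step 5.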
