Linux Apps Tutorials - Herong's Tutorial Examples - v1.02, by Herong Yang
Install Tesseract as the OCR Engine
This section provides a tutorial example on how to install Tesseract as the OCR Engine.
What Is Tesseract? Tesseract is an open source OCR (Optical Character Recognition) engine and command line program.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.
Here is what I did to install Tesseract on my CentOS computer.
1. Search for "tesseract" package information. It seems to be in the EPEL repository:
herong$ sudo dnf info tesseract

Extra Packages for Enterprise Linux 8 - x86_64
Available Packages
Name         : tesseract
Version      : 4.1.0
Release      : 1.el8
Architecture : x86_64
Size         : 1.4 M
Source       : tesseract-4.1.0-1.el8.src.rpm
Repository   : epel
Summary      : Raw OCR Engine
URL          : https://github.com/tesseract-ocr/tesseract
License      : ASL 2.0
Description  : A commercial quality OCR engine originally developed
             : at HP between 1985 and 1995. In 1995, this engine was
             : among the top 3 evaluated by UNLV. It was open-sourced
             : by HP and UNLV in 2005.
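Since the package comes from EPEL, it helps to confirm that this repository is enabled before installing. Here is a minimal sketch, assuming the repository id is "epel" as shown in the Repository field above. The helper name has_repo is my own and is not part of dnf.

```shell
#!/bin/sh
# has_repo (hypothetical helper): succeed if the given repository id
# appears in the list of enabled repositories.
has_repo() {
    # "dnf repolist --enabled" prints one repository per line,
    # with the repository id in the first column.
    dnf -q repolist --enabled |
        awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Usage (not run automatically):
#   has_repo epel || sudo dnf install epel-release
```

On CentOS, the "epel-release" package is the usual way to add the EPEL repository definition, but the exact package source may differ on other distributions.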
2. Try to install it. The installation fails because no enabled repository provides a required library, "liblept.so.5".
herong$ sudo dnf install tesseract

Error:
 Problem: conflicting requests
  - nothing provides liblept.so.5()(64bit) needed by tesseract-4.1.0-1.el8.x86_64
3. Search for "liblept.so" on the Internet. I found that liblept.so is the shared library built from the Leptonica package.
4. Search for the "Leptonica" package. I found it in the "PowerTools" package repository and installed it from there:
herong$ sudo dnf --enablerepo=PowerTools install leptonica

Installed:
  leptonica-1.76.0-2.el8.x86_64
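Steps 2 to 4 can also be done without leaving the terminal. The sketch below pulls the missing capability out of the dnf error text and asks dnf which package provides it. The helper names missing_caps and find_provider are hypothetical. The repository id "PowerTools" is taken from the command above and may be spelled differently on other releases.

```shell
#!/bin/sh
# missing_caps (hypothetical helper): read dnf error text on stdin and
# print each capability named in a "nothing provides X needed by Y" line.
missing_caps() {
    sed -n 's/.*nothing provides \(.*\) needed by .*/\1/p'
}

# find_provider (hypothetical helper): ask dnf which package provides a
# capability, also looking in a repository that is normally disabled.
find_provider() {
    cap=$1
    repo=${2:-PowerTools}   # assumption: repo id as used in this tutorial
    dnf --enablerepo="$repo" provides "$cap"
}

# Usage (not run automatically):
#   sudo dnf install --assumeno tesseract 2>&1 | missing_caps
#   find_provider 'liblept.so.5()(64bit)'
```

The capability string has to be quoted, because the shell would otherwise try to interpret the parentheses.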
5. Install Tesseract again. This time the installation succeeds, because the "liblept.so" dependency is now satisfied by Leptonica.
herong$ sudo dnf install tesseract

Installed:
  tesseract-4.1.0-1.el8.x86_64
  tesseract-langpack-eng-4.0.0-6.el8.noarch
  tesseract-tessdata-doc-4.0.0-6.el8.noarch

Complete!
6. Test the "tesseract" command:
herong$ tesseract --help

Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
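With the command working, the "tesseract imagename outputbase" form from the usage text above can be wrapped in a small script. This is only a sketch: the helper name ocr_to_text and the file name scan.png are placeholders of mine, and it assumes the English language pack installed in step 5.

```shell
#!/bin/sh
# ocr_to_text (hypothetical helper): run OCR on one image and print the
# recognized text. Tesseract appends ".txt" to the output base name.
ocr_to_text() {
    image=$1
    lang=${2:-eng}          # assumption: the "eng" language pack is installed
    if [ ! -r "$image" ]; then
        echo "cannot read image: $image" >&2
        return 1
    fi
    workdir=$(mktemp -d) || return 1
    # Follows the usage line: tesseract imagename outputbase [options...]
    if tesseract "$image" "$workdir/out" -l "$lang" >/dev/null 2>&1; then
        cat "$workdir/out.txt"
        status=0
    else
        echo "tesseract failed on $image" >&2
        status=1
    fi
    rm -rf "$workdir"
    return $status
}

# Usage (not run automatically):
#   ocr_to_text scan.png > scan.txt
```

Writing the result into a temporary directory keeps the current directory clean. The "-l" option is placed after the output base, matching the position of [options...] in the usage text.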