Text recognition with Tesseract OCR through the tess4j wrapper

This article uses Tesseract-OCR for text recognition. Tesseract is an open-source OCR (Optical Character Recognition) engine that can read image files in multiple formats and convert them to text; it currently supports more than 60 languages, including Chinese. Tesseract was originally developed by HP and later maintained by Google.

I first downloaded Tesseract and jTessBoxEditorFX, then prepared the training data: 10 page images of the "Interim Regulations on Real Estate Registration". The goal is to use Tesseract-OCR to convert the images to text, that is, to recover the Chinese text they contain.

Tesseract-OCR usually achieves a good recognition rate on English text and digits, but the existing trained data files give unsatisfactory results on Chinese. Tesseract supports training on your own data, so the first few images can be used as training data to build a custom .traineddata file, which is then used for text recognition.

Prepare the data:

1. Generate a TIFF file of the dataset with the jTessBoxEditor tool.

2. Generate 5 box files from the command line (cmd).

3. Open each TIFF in jTessBoxEditor, correct the characters and their positions, and save.

4. Generate the .tr file from the command line.

5. Create a new font feature file (font_properties).

6. Extract the characters from all files.

7. Run the command that generates the unicharset file, then generate the shape file.

8. Run the command that generates the shapetable file, then generate the aggregated character feature file.

9. Finally, merge the training results from all .tr files into a .traineddata file.
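
The exact commands are not listed in this article. As a rough sketch, the steps above resemble the legacy Tesseract 3.x training workflow, and the Java snippet below only builds and prints one plausible command sequence for it (it does not run anything). The language name mylang, the font name font, the use of chi_sim to bootstrap the box files, and the Windows rename command are assumptions made for illustration, not the settings actually used here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: builds the command lines as strings, it does not execute them.
class TrainingCommands {

    // Tesseract's training file naming convention: <lang>.<font>.exp<N>
    static String baseName(String lang, String font, int page) {
        return lang + "." + font + ".exp" + page;
    }

    // One line of the font_properties file (step 5):
    // <font> <italic> <bold> <fixed> <serif> <fraktur>
    static String fontProperties(String font) {
        return font + " 0 0 0 0 0";
    }

    static List<String> build(String lang, String font, int pages) {
        List<String> cmds = new ArrayList<>();
        StringBuilder boxFiles = new StringBuilder();
        StringBuilder trFiles = new StringBuilder();

        // Step 2: one box file per TIFF, bootstrapped with the stock chi_sim data (assumption).
        for (int i = 0; i < pages; i++) {
            String base = baseName(lang, font, i);
            cmds.add("tesseract " + base + ".tif " + base + " -l chi_sim batch.nochop makebox");
            boxFiles.append(' ').append(base).append(".box");
            trFiles.append(' ').append(base).append(".tr");
        }
        // Step 3 is manual: correct every box file in jTessBoxEditor before continuing.

        // Step 4: one .tr feature file per corrected TIFF/box pair.
        for (int i = 0; i < pages; i++) {
            String base = baseName(lang, font, i);
            cmds.add("tesseract " + base + ".tif " + base + " box.train");
        }

        // Step 6: collect the character set from the box files.
        cmds.add("unicharset_extractor" + boxFiles);

        // Steps 7-8: shape table, then character feature files.
        cmds.add("shapeclustering -F font_properties -U unicharset" + trFiles);
        cmds.add("mftraining -F font_properties -U unicharset -O " + lang + ".unicharset" + trFiles);
        cmds.add("cntraining" + trFiles);

        // Step 9: prefix the outputs with the language name, then merge them.
        for (String f : new String[] {"inttemp", "pffmtable", "normproto", "shapetable"}) {
            cmds.add("rename " + f + " " + lang + "." + f);
        }
        cmds.add("combine_tessdata " + lang + ".");
        return cmds;
    }

    public static void main(String[] args) {
        System.out.println("font_properties: " + fontProperties("font"));
        for (String c : build("mylang", "font", 5)) {
            System.out.println(c);
        }
    }
}
```

With these assumed names, the last command would produce mylang.traineddata, which is the file that later gets copied into Tesseract's tessdata folder.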

The specific operations are as follows.


1. Use jTessBoxEditorFX to train on the text. I trained on 10 images in total and then generated the .traineddata file.



![insert image description here](https://img-blog.csdnimg.cn/20210112142019930.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDgyMTkyMA==,size_16,color_FFFFFF,t_70#pic_center)

2. Use the tess4j package to call Tesseract for text recognition, first with Tesseract's built-in Simplified Chinese library, chi_sim.

![insert image description here](https://img-blog.csdnimg.cn/20210112142139714.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDgyMTkyMA==,size_16,color_FFFFFF,t_70#pic_center)

3. Check the recognition results with text comparison software: the error rate is 20%.

![insert image description here](https://img-blog.csdnimg.cn/2021011214224235.gif)

4. Import the 5 .traineddata files we trained ourselves into Tesseract's tessdata folder and run text recognition again.

![insert image description here](https://img-blog.csdnimg.cn/2021011214232064.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDgyMTkyMA==,size_16,color_FFFFFF,t_70#pic_center)

5. Check again with the text comparison software: the recognition rate now reaches 98%.

![insert image description here](https://img-blog.csdnimg.cn/20210112141518372.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDgyMTkyMA==,size_16,color_FFFFFF,t_70#pic_center)
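
The recognition and checking steps can be approximated in a few lines of Java. With tess4j, the recognition step typically amounts to creating a Tesseract instance, calling setDatapath and setLanguage("chi_sim"), and then doOCR on the image file. To stay dependency-free, the sketch below swaps that for a call to the tesseract command-line program, so it assumes tesseract is on the PATH. The article does not say how the comparison software computes its percentages; the character-level edit distance used here is an assumption, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class OcrCheck {

    // Command line for one recognition run; "stdout" makes tesseract print the text.
    static List<String> ocrCommand(String image, String lang) {
        return Arrays.asList("tesseract", image, "stdout", "-l", lang);
    }

    // Runs tesseract and returns the recognized text (needs tesseract on the PATH).
    static String runOcr(String image, String lang) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(ocrCommand(image, lang))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return text;
    }

    // Levenshtein distance over UTF-16 code units, using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Character error rate: edits needed to turn the OCR output into the reference,
    // divided by the reference length. Recognition rate is then 1 - errorRate.
    static double errorRate(String reference, String recognized) {
        if (reference.isEmpty()) {
            return recognized.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(reference, recognized) / reference.length();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: OcrCheck <image> <reference.txt> <lang>");
            return;
        }
        String reference = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        String recognized = runOcr(args[0], args[2]);
        // Ignore whitespace so that line breaks in the OCR output are not counted as errors.
        double rate = errorRate(reference.replaceAll("\\s+", ""), recognized.replaceAll("\\s+", ""));
        System.out.printf("error rate: %.1f%%%n", rate * 100);
    }
}
```

Running it once with chi_sim and once with the name of the custom trained data would give two error rates that can be compared in the same way as the 20% and 98% figures above.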


