Bangla optical character recognition from printed text using Tesseract Engine

Faruque, MD Yamin; Adeeb, MD Zahin; Kamal, Muhammad Maswood; Ahmed, Redwan

View/Open

16201059, 19241013, 16201038, 16241005_CSE.pdf (3.210Mb)

Date

2021-01

Publisher

Brac University

Abstract

Optical Character Recognition (OCR) is a technology that detects and extracts text from images. In our project, we design our OCR system around the Bangla language, primarily because many text recognition models for English are available on the market but very few exist for Bangla. Our proposed system comprises acquiring the input image, pre-processing it, passing it into the Tesseract OCR engine (the backbone of our system), and finally getting digital output of the text. We have used the latest version of Tesseract, version 5; even though it is in its alpha stage, it is still stable for end users. Next, to improve accuracy, we have focused on pre-processing the image as thoroughly as possible and laid out our chosen algorithm for each step. For example, for binarization, we have used Otsu’s Thresholding algorithm, as this gave us the best results. For segmentation, we have used the Fully Automatic Page Segmentation from Tesseract’s own repertoire of segmentation modes. We then trained through Tesseract’s new LSTM engine and improved upon its existing trained file with our fonts, which we selected based on their popularity of use. We calculated our accuracy at the word level, and our model gave us an average accuracy of 95.9% on multiple fonts and in multiple real-life scenarios. In the best-case scenario, we even managed to secure 100% accuracy. Finally, we have discussed future improvements, such as the addition of a custom dictionary to our model, and how it would increase the overall accuracy in all cases.
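
Two of the steps named in the abstract, Otsu binarization and word-level accuracy, can be illustrated with a minimal Python sketch. This is not the thesis code: the function names `otsu_threshold`, `binarize`, and `word_accuracy` are hypothetical, the image is assumed to be a flat list of 8-bit grayscale values, and the accuracy formula (aligned matching words divided by reference words) is an assumption, since the abstract does not state the exact definition used. The Tesseract call itself is omitted because it needs an installed engine and trained data.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level t (0..255) that maximises between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of gray levels in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break               # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: Sequence[int]) -> List[int]:
    """Map each pixel to 0 (at or below the Otsu threshold) or 1 (above it)."""
    t = otsu_threshold(pixels)
    return [1 if p > t else 0 for p in pixels]


def word_accuracy(reference: str, recognized: str) -> float:
    """Fraction of reference words that align with the recognized words.

    Assumed metric for illustration only; words are split on whitespace.
    """
    ref_words = reference.split()
    out_words = recognized.split()
    if not ref_words:
        return 1.0 if not out_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)
```

In a full pipeline, the binarized image would be passed to Tesseract with a Bangla trained file, and `word_accuracy` would compare the engine output against a manually prepared ground-truth transcription.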

Keywords

Optical Character Recognition; Bangla Language; Bangla OCR; Tesseract; RNN; LSTM; OpenCV; Otsu’s Thresholding Algorithm; Python; jTessEditorFX; Image Processing; Custom Dictionary

Description

Cataloged from PDF version of thesis.

Includes bibliographical references (pages 44-45).

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2021.

Department

Department of Computer Science and Engineering, Brac University

Type

Thesis

Collections

Thesis & Report, BSc (Computer Science and Engineering)