Implementation of an Optical Character Recognizer (OCR) for Bengali language

Chowdhury, Muhammed Tawfiq; Islam, Md. Saiful; Bipul, Baijed Hossain

dc.contributor.advisor	Rahman, Dr. Md. Khalilur
dc.contributor.author	Chowdhury, Muhammed Tawfiq
dc.contributor.author	Islam, Md. Saiful
dc.contributor.author	Bipul, Baijed Hossain
dc.date.accessioned	2015-09-03T07:14:51Z
dc.date.available	2015-09-03T07:14:51Z
dc.date.copyright	2015
dc.date.issued	2015-08-24
dc.identifier.other	ID 11101009
dc.identifier.other	ID 11101061
dc.identifier.other	ID 11101047
dc.identifier.uri	http://hdl.handle.net/10361/4374
dc.description	Cataloged from PDF version of thesis report.
dc.description	Includes bibliographical references (page 44).
dc.description	This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2015.	en_US
dc.description.abstract	Optical character recognition (OCR) is the process of extracting text from an image. The main purpose of an OCR is to produce editable documents from existing paper documents or image files. Developing an OCR requires a number of algorithms; noise removal, skew identification and correction, and segmentation are among the steps involved. An OCR primarily works in two phases: character detection and word detection. In more sophisticated approaches, an OCR also performs sentence detection to preserve the document's structure. In this paper, we discuss the process of developing an OCR for the Bengali language. Considerable effort has been put into developing an OCR for Bengali. Though some OCRs have been developed, none of them is completely error free. For our thesis, we trained the Tesseract OCR engine to develop an OCR for the Bengali language. Tesseract is currently the most accurate OCR engine. This engine was developed at HP Labs and is currently owned by Google. We used a number of software tools to prepare our training files. Our OCR's library contains 18110 characters and 2617 words. We used the "Solaimanlipi" font in our project. We used 200 input files to test the accuracy of our OCR. We used the latest version of Tesseract, 3.03, for the Windows operating system. For clean image files, the accuracy of our software was as high as 97.56%. Note that we measured accuracy as the percentage of correctly recognized characters and words.	en_US
dc.description.statementofresponsibility	Muhammed Tawfiq Chowdhury
dc.description.statementofresponsibility	Md. Saiful Islam
dc.description.statementofresponsibility	Baijed Hossain Bipul
dc.format.extent	52 pages
dc.language.iso	en	en_US
dc.publisher	BRAC University	en_US
dc.rights	BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject	Computer science and engineering	en_US
dc.subject	Optical Character Recognition (OCR)	en_US
dc.subject	Bengali language	en_US
dc.subject	Tesseract	en_US
dc.subject	JTessboxEditor	en_US
dc.subject	Netbeans IDE	en_US
dc.title	Implementation of an Optical Character Recognizer (OCR) for Bengali language	en_US
dc.type	Thesis	en_US
dc.contributor.department	Department of Computer Science and Engineering, BRAC University
dc.description.degree	B. Computer Science and Engineering

Files in this item

Name: 11101009.pdf
Size: 2.645Mb
Format: PDF

This item appears in the following Collection(s)

Thesis & Report, BSc (Computer Science and Engineering) [1480]