Similarity search for Bangla

Morshed, Mahbub; Shahed, Md. Shahid

dc.contributor.advisor	Abdullah, Matin Saad
dc.contributor.author	Morshed, Mahbub
dc.contributor.author	Shahed, Md. Shahid
dc.date.accessioned	2011-12-07T10:55:26Z
dc.date.available	2011-12-07T10:55:26Z
dc.date.copyright	2011
dc.date.issued	2011-04
dc.identifier.other	ID 09201023
dc.identifier.other	ID 07101007
dc.identifier.uri	http://hdl.handle.net/10361/1518
dc.description	This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2011.	en_US
dc.description	Cataloged from PDF version of thesis report.
dc.description	Includes bibliographical references (page 29).
dc.description.abstract	Because of typos and misspellings, search engines cannot always provide users with the information they are looking for. Large search engines like Google provide a "Did you mean" suggestion, but such options are not included in most open source search engines. Our goal was to find a way to implement an exhaustive similarity search efficiently and to develop such an option for a Bangla search engine. We used Solr for this, configuring it with Levenshtein distance and the Jaro-Winkler algorithm to provide "Did you mean" for English. To implement this for Bangla, however, we needed a stemmer for Bangla, which was not present in Solr. To build an efficient stemmer, the tokens must be tagged properly according to their parts of speech, because the stemming process differs between parts of speech. There are different approaches to the problem of assigning a part of speech (POS) tag to each word of a natural language sentence. We used NLTK to develop a regular expression tagger for Bangla verbs using the common suffixes ( 1 i r ) found in Bangla grammar. We then analyzed its performance on main verbs extracted from a 100K token tagged corpus. In this thesis we also compare the performance of a few POS tagging techniques for the Bangla language, e.g. a statistical approach (n-gram) and a transformation-based approach (Brill's tagger). A supervised POS tagging approach requires a large annotated training corpus to tag properly. We used the 100K token hand-tagged corpus developed by Microsoft India to implement these techniques.	en_US
dc.description.statementofresponsibility	Mahbub Morshed
dc.description.statementofresponsibility	Md. Shahid Shahed
dc.format.extent	31 pages
dc.language.iso	en	en_US
dc.publisher	BRAC University	en_US
dc.rights	BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject	Computer science and engineering
dc.title	Similarity search for Bangla	en_US
dc.type	Thesis	en_US
dc.contributor.department	Department of Computer Science and Engineering, BRAC University
dc.description.degree	B. Computer Science and Engineering

Files in this item

Name: Morshed-09201023, Shahed-07101 ...
Size: 5.135Mb
Format: PDF

This item appears in the following Collection(s)

Thesis & Report, BSc (Computer Science and Engineering) [1480]
