
dc.contributor.advisorSadeque, Farig Yousuf
dc.contributor.advisorAkther, Afroza
dc.contributor.authorAnam, Rubayt
dc.contributor.authorTabassum, Anika
dc.contributor.authorArnob, Shanjid Ahmed
dc.contributor.authorKhan, Md. Tazfiq
dc.date.accessioned2025-01-20T05:03:18Z
dc.date.available2025-01-20T05:03:18Z
dc.date.copyright©2024
dc.date.issued2024-11
dc.identifier.otherID 20101617
dc.identifier.otherID 20101143
dc.identifier.otherID 20101207
dc.identifier.otherID 20301172
dc.identifier.urihttp://hdl.handle.net/10361/25217
dc.descriptionThis thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.en_US
dc.descriptionCataloged from PDF version of thesis.
dc.descriptionIncludes bibliographical references (pages 54-57).
dc.description.abstractThis work compares the performance of two deep ASR models, Wav2Vec2 and Whisper, in recognizing Bengali, a language with complex linguistics, dialectal variation, and phonetic intricacies. ASR technology has become crucial for interacting with devices in many applications because it converts spoken language into text or commands. Core ASR components include feature extraction, acoustic and language modeling, and decoding; together they provide a systematic way of translating speech into actionable formats. Despite advances, ASR systems still struggle with pronunciation diversity, code-switching, and ambient noise, especially in languages like Bengali that exhibit significant phonetic and dialectal variation. Wav2Vec2 and Whisper were selected for their previous successes in other languages, but it was essential to study how well they adapt to Bengali ASR applications. The work focused on the models’ sensitivity to parameter tuning and their ability to generalize across linguistic features. Wav2Vec2 showed flexibility, with noticeable improvements in word error rate (WER) when parameters such as learning rate, dropout, and gradient accumulation were tuned, which demonstrates its adaptability to Bengali’s phonetic nuances. Each tuning configuration progressively improved model stability and evaluation accuracy, positioning Wav2Vec2 as a potential candidate for real-world Bengali applications that require precision and flexibility. In contrast, Whisper was rigid under tuning: its WER stayed at 100% across all settings, so it was insensitive to the tuning applied. This may reflect a structural limitation of the Whisper model that affects its performance in high-precision applications involving linguistically complex languages like Bengali.
For both models, preprocessing and feature extraction standardized the audio data and included tokenization for linguistic alignment; inference was run in parallel on both models to compare performance. While Wav2Vec2 showed promise with incremental improvements in transcription accuracy, Whisper’s inability to adapt underscores the need for architectural revisions to meet the demands of Bengali ASR tasks. These results also suggest that Wav2Vec2, owing to the flexibility of its parameter tuning, is more capable of handling the linguistic diversity of Bengali, whereas Whisper is suitable only for standardized languages, where rigidity poses no problem. The study concludes that Wav2Vec2 is more appropriate for sophisticated Bengali ASR applications, especially in dialectally diverse or noisy contexts. In contrast, Whisper may require essential changes to become compatible with a linguistically complex setting. Constraints in the present study, namely dataset diversity, limited computational resources, and model tunability, further highlight the importance of specialized ASR frameworks for the linguistic diversity inherent in Bengali and similar languages.en_US
dc.description.statementofresponsibilityRubayt Anam
dc.description.statementofresponsibilityAnika Tabassum
dc.description.statementofresponsibilityShanjid Ahmed Arnob
dc.description.statementofresponsibilityMd. Tazfiq Khan
dc.format.extent67 pages
dc.language.isoenen_US
dc.publisherBRAC Universityen_US
dc.rightsBRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subjectSpeech recognitionen_US
dc.subjectWav2Vec 2.0en_US
dc.subjectVoice recognitionen_US
dc.subjectMachine learningen_US
dc.subjectBangla languageen_US
dc.subjectLanguage modelingen_US
dc.subject.lcshAutomatic speech recognition--Bengali language.
dc.subject.lcshNatural language processing (Computer science).
dc.subject.lcshSpeech processing systems.
dc.subject.lcshComputational linguistics.
dc.subject.lcshAcoustical engineering.
dc.titleEnhancing Bangla speech recognition through acoustic and language modelingen_US
dc.typeThesisen_US
dc.contributor.departmentDepartment of Computer Science and Engineering, BRAC University
dc.description.degreeB.Sc. in Computer Science

