dc.contributor.advisor | Sadeque, Farig Yousuf | |
dc.contributor.advisor | Akther, Afroza | |
dc.contributor.author | Anam, Rubayt | |
dc.contributor.author | Tabassum, Anika | |
dc.contributor.author | Arnob, Shanjid Ahmed | |
dc.contributor.author | Khan, Md. Tazfiq | |
dc.date.accessioned | 2025-01-20T05:03:18Z | |
dc.date.available | 2025-01-20T05:03:18Z | |
dc.date.copyright | ©2024 | |
dc.date.issued | 2024-11 | |
dc.identifier.other | ID 20101617 | |
dc.identifier.other | ID 20101143 | |
dc.identifier.other | ID 20101207 | |
dc.identifier.other | ID 20301172 | |
dc.identifier.uri | http://hdl.handle.net/10361/25217 | |
dc.description | This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024. | en_US |
dc.description | Cataloged from PDF version of thesis. | |
dc.description | Includes bibliographical references (pages 54-57). | |
dc.description.abstract | This work presents a performance comparison of two deep ASR models, Wav2Vec2
and Whisper, in recognizing the Bengali language, with its complex linguistics, dialectal
variations, and phonetic intricacies. ASR technology has become crucial in interacting
with devices in various applications because it bridges spoken speech to text or
commands. Core ASR components include feature extraction, acoustic and language
modeling, and decoding. These systematically provide a way of translating speech
into actionable formats. Despite these advances, ASR systems continue to face challenges
in managing pronunciation diversity, code-switching, and ambient noise,
especially in languages like Bengali, which encompass significant phonetic and dialectal
variations. Wav2Vec2 and Whisper were selected based on their previous successes
in other languages, but it was essential to study how readily they would adapt
to Bengali ASR applications. The work focused on the models’ sensitivity
to parameter tuning and their generalization capability across linguistic
features. Wav2Vec2 showed flexibility, with noticeable improvements in WER by
tuning parameters such as learning rate, dropout, and gradient accumulation, which
showcases its adaptability to Bengali’s phonetic nuances. Each tuning configuration
showed progressive enhancements in model stability and evaluation accuracy, hence
positioning Wav2Vec2 as one of the potential candidates for a real-world application
in Bengali, which requires precision and flexibility.
In contrast, Whisper proved rigid under tuning: its WER stayed at 100%
for all settings, so it was insensitive to these changes. This may reflect a
structural limitation within the Whisper model, which influences its performance
in high-precision applications involving more linguistically complicated languages
like Bengali. Preprocessing and feature extraction standardized the audio data for
both models and included tokenization for linguistic alignment, and parallel inference
runs were performed to compare performance. While Wav2Vec2 showed promise with
incremental improvements in transcription accuracy, Whisper’s inability to adapt
underscores the need for architectural revisions to meet the demands of Bengali ASR
tasks. These results also suggest that Wav2Vec2, owing to the flexibility of its parameter
tuning process, is more capable of handling the linguistic diversity of Bengali.
Whisper, on the other hand, is suitable only for standardized
languages, where rigidity does not pose any problem. The current paper concludes
that Wav2Vec2 should be more appropriate for sophisticated ASR applications in the
Bengali language, especially under dialectally diverse or noisy contexts. In contrast,
Whisper may require essential changes to become compatible with a linguistically
complex setting. Constraints on dataset diversity, computational resources,
and model tunability in the present study further underline the importance
of specialized ASR frameworks for the linguistic diversity inherent
in Bengali and similar languages. | en_US |
dc.description.statementofresponsibility | Rubayt Anam | |
dc.description.statementofresponsibility | Anika Tabassum | |
dc.description.statementofresponsibility | Shanjid Ahmed Arnob | |
dc.description.statementofresponsibility | Md. Tazfiq Khan | |
dc.format.extent | 67 pages | |
dc.language.iso | en | en_US |
dc.publisher | BRAC University | en_US |
dc.rights | BRAC University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. | |
dc.subject | Speech recognition | en_US |
dc.subject | Wav2Vec 2.0 | en_US |
dc.subject | Voice recognition | en_US |
dc.subject | Machine learning | en_US |
dc.subject | Bangla language | en_US |
dc.subject | Language modeling | en_US |
dc.subject.lcsh | Automatic speech recognition--Bengali language. | |
dc.subject.lcsh | Natural language processing (Computer science). | |
dc.subject.lcsh | Speech processing systems. | |
dc.subject.lcsh | Computational linguistics. | |
dc.subject.lcsh | Acoustical engineering. | |
dc.title | Enhancing Bangla speech recognition through acoustic and language modeling | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Department of Computer Science and Engineering, BRAC University | |
dc.description.degree | B.Sc. in Computer Science | |