Real-Time Bengali Speech-to-Text Conversion Using CMU Sphinx
Abstract
This paper demonstrates the use of speech-to-text technology to convert naturally spoken, continuous Bangla into Bengali Unicode text with good accuracy. To achieve this, we used Sphinx-4, the open-source speech recognition framework developed at Carnegie Mellon University (CMU), which is written in Java and provides the tools needed to develop an acoustic model for a new language such as Bangla. The acoustic model is built with algorithms such as Baum-Welch from training data that we collected ourselves. Our main objective was to train the system adequately on a word-by-word basis using recordings from a variety of speakers, so that it could reliably recognize speech from new speakers. We used Audacity, a free digital audio workstation (DAW), to process the recordings: continuous frequency profiling to improve the signal-to-noise ratio (SNR), vocal levelling, normalization, and syllable splitting and merging. These steps ensure an error-free, one-to-one word mapping between each utterance and the text of its corresponding transcription file. The result is a speech-to-text recognition system with an acceptable accuracy of around 75%, trained on recorded speech from 10 speakers, both male and female, using transcript files that we wrote ourselves.
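The abstract names Baum-Welch as the algorithm used to build the acoustic model. As a hedged illustration only, the sketch below runs Baum-Welch (expectation-maximization) re-estimation on a toy discrete hidden Markov model. Real Sphinx acoustic models use continuous Gaussian-mixture emissions over acoustic feature vectors and far more states, so the state count, symbol alphabet, and starting probabilities here are invented for illustration and are not the authors' configuration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def forward_backward(obs: List[int], pi: Vector, A: Matrix, B: Matrix
                     ) -> Tuple[Matrix, Matrix, Vector]:
    """Scaled forward-backward pass (Rabiner-style scaling to avoid underflow)."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    scale = [0.0] * T
    # Forward pass: each alpha[t] is renormalised to sum to 1; scale[t] keeps the factor.
    for t in range(T):
        for j in range(N):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = prior * B[j][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    # Backward pass, divided by the same scale factors.
    beta[T - 1] = [1.0] * N
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]
    return alpha, beta, scale


def baum_welch_step(obs: List[int], pi: Vector, A: Matrix, B: Matrix
                    ) -> Tuple[Vector, Matrix, Matrix, float]:
    """One EM iteration; returns re-estimated (pi, A, B) and log P(obs | old model)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta, scale = forward_backward(obs, pi, A, B)
    # E-step: state occupancies gamma[t][i] and expected transition counts xi_sum[i][j].
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi_sum = [[0.0] * N for _ in range(N)]
    for t in range(T - 1):
        for i in range(N):
            for j in range(N):
                xi_sum[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                 * beta[t + 1][j] / scale[t + 1])
    # M-step: normalise expected counts into new probabilities.
    new_pi = gamma[0][:]
    new_A = []
    for i in range(N):
        occ = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([xi_sum[i][j] / occ for j in range(N)])
    new_B = []
    for i in range(N):
        occ = sum(gamma[t][i] for t in range(T))
        new_B.append([sum(gamma[t][i] for t in range(T) if obs[t] == k) / occ
                      for k in range(M)])
    log_likelihood = sum(math.log(c) for c in scale)
    return new_pi, new_A, new_B, log_likelihood


# Toy run: 2 hidden states, 3 observation symbols (hypothetical quantised feature codes).
obs = [0, 1, 2, 2, 1, 0, 0, 2, 1, 1, 2, 0]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
history = []
for _ in range(20):
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    history.append(ll)
print(f"log-likelihood: {history[0]:.4f} -> {history[-1]:.4f}")
```

The property worth noticing is that each iteration never decreases the likelihood of the training data, which is why Baum-Welch can refine a model from untranscribed state alignments, given only the observations and the word-level transcripts.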
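The goal of the Audacity clean-up is to raise, not lower, the signal-to-noise ratio, defined as SNR = 10·log10(P_signal / P_noise) dB. The sketch below illustrates this on synthetic data. It is an assumption-laden stand-in: a 440 Hz tone plays the role of speech, Audacity's profile-based spectral noise reduction is modelled crudely as a uniform 12 dB attenuation of the noise, and the 16 kHz sample rate and -1 dBFS peak-normalization target are illustrative choices rather than settings reported in this work.

```python
import math
import random
from typing import List


def mean_power(samples: List[float]) -> float:
    """Average power (mean squared amplitude) of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal: List[float], noise: List[float]) -> float:
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def peak_normalize(samples: List[float], target_dbfs: float = -1.0) -> List[float]:
    """Scale samples so the largest absolute value sits at target_dbfs (full scale = 1.0)."""
    peak = max(abs(s) for s in samples)
    gain = 10.0 ** (target_dbfs / 20.0) / peak
    return [s * gain for s in samples]


RATE = 16000  # assumed 16 kHz mono recordings
random.seed(0)
speech = [0.5 * math.sin(2 * math.pi * 440 * n / RATE) for n in range(RATE)]  # stand-in "voice"
noise = [random.gauss(0.0, 0.05) for _ in range(RATE)]

# Crude stand-in for noise reduction: attenuate the noise component by 12 dB.
attenuation = 10.0 ** (-12.0 / 20.0)
reduced_noise = [x * attenuation for x in noise]

snr_before = snr_db(speech, noise)
snr_after = snr_db(speech, reduced_noise)
cleaned = peak_normalize([s + x for s, x in zip(speech, reduced_noise)])

print(f"SNR before: {snr_before:.1f} dB, after: {snr_after:.1f} dB")
print(f"peak after normalisation: {max(abs(s) for s in cleaned):.3f}")
```

Because the noise power scales with the square of the attenuation factor, a 12 dB cut in noise raises the SNR by exactly 12 dB here. Peak normalization is applied afterwards so that every speaker's recordings reach the recognizer at a comparable level.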
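The one-to-one mapping between each utterance and its transcription can also be checked mechanically before training. The sketch below assumes the SphinxTrain-style layout, in which each transcript line has the form `<s> word word </s> (utterance_id)` and a parallel fileids list names the audio files in the same order. The paper does not specify its exact file layout, so the helper `check_alignment`, the sample Bengali utterances, and the speaker/file IDs are all hypothetical.

```python
import re
from typing import List, Set

# Assumed SphinxTrain-style transcript line: "<s> word word ... </s> (utterance_id)"
TRANSCRIPT_LINE = re.compile(r"^<s>\s+(.*?)\s+</s>\s+\((\S+)\)$")


def check_alignment(transcription: List[str], fileids: List[str],
                    dictionary: Set[str]) -> List[str]:
    """Return human-readable problems; an empty list means the files line up."""
    problems = []
    if len(transcription) != len(fileids):
        problems.append(f"{len(transcription)} transcript lines but {len(fileids)} fileids")
    for n, (line, fileid) in enumerate(zip(transcription, fileids), start=1):
        match = TRANSCRIPT_LINE.match(line.strip())
        if match is None:
            problems.append(f"line {n}: malformed transcript line {line!r}")
            continue
        words, utt_id = match.group(1).split(), match.group(2)
        # The utterance id must equal the audio file's basename, in the same order.
        expected = fileid.strip().rsplit("/", 1)[-1]
        if utt_id != expected:
            problems.append(f"line {n}: transcript id {utt_id!r} != fileid {expected!r}")
        # Every transcribed word needs a pronunciation entry in the dictionary.
        for word in words:
            if word not in dictionary:
                problems.append(f"line {n}: word {word!r} not in dictionary")
    return problems


# Hypothetical example data.
transcription = [
    "<s> আমি ভাত খাই </s> (spk01_0001)",
    "<s> তুমি কেমন আছ </s> (spk01_0002)",
]
fileids = ["spk01/spk01_0001", "spk01/spk01_0003"]
dictionary = {"আমি", "ভাত", "খাই", "তুমি", "কেমন"}
for problem in check_alignment(transcription, fileids, dictionary):
    print(problem)
```

Here the second utterance is flagged twice: its ID does not match the audio file at the same position, and one word has no dictionary entry. Both kinds of mismatch would otherwise silently corrupt the word-level alignment that training relies on.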