Comprehensive analysis and development of deep learning models for Bengali character’s spectrogram image classification in child speech: introduction of spectro SETNet
Date
2024-05
Publisher
Brac University
Author
Ahmed, Syed Istiaque
Hossain, Md. Jubayer
Hoque, Kayes Mohammad Bin
Tusher, Mahmadur Rahman
Islam, Sajedur
Abstract
In the rapidly developing field of linguistic technology, phoneme recognition plays
a key role in language understanding and language learning. This research develops
a recognition system for Bangla vowels, consonants, and numbers as spoken by
children aged three to six years. Combining modern technological methods with
classical phonetic education, we classify spectrogram images of Bengali children's
speech. Among modern machine learning (ML) techniques, image recognition and large
language models (LLMs) are the most pervasive, and they have now been extended to
the less explored domain of Bangla phoneme spectrogram image recognition. From our
group of 21 participants, we generated a balanced set of 31,147 spectrogram images,
a new dataset that we created from scratch. The dataset was prepared meticulously
to serve as a complete resource for researchers working on phoneme recognition for
Bangla-speaking children. We then trained ten pre-existing deep learning models on
our dataset and optimized their performance for Bangla phoneme recognition. Of
these, the SENet
model stood out with an accuracy of 96.89% on our test set. The ResNet50 and VGG19
models ranked second and third, with accuracies of 88.8% and 87% respectively.
Based on these findings, we
propose a novel architecture, Spectrogram SE-Transformer Block Network (Spectro
SETNet), a hybrid that adds squeeze-and-excitation (SE) and Transformer blocks to
the ResNet50 model in order to handle more complicated data while limiting
computational cost. Our hypothesis is that the model not only improves the accuracy
of Bengali speech recognition for children but also offers a new standard for
processing more complex data with less computational power.
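
The squeeze-and-excitation (SE) mechanism named in the abstract can be illustrated with a minimal sketch. This is not the authors' Spectro SETNet implementation. It only shows, in plain Python, how a generic SE block reweights the channels of a feature map: each channel is averaged (squeeze), the averages pass through a small two-layer gate (excitation), and each channel is scaled by its gate. The function name se_block, the toy input, and the weights w1 and w2 are hypothetical values chosen for illustration, not trained parameters.

```python
import math

def se_block(feature_map, w1, w2):
    """Apply a squeeze-and-excitation gate to a [C][H][W] nested-list feature map.

    w1 has shape [C // r][C] and w2 has shape [C][C // r]. Both are
    hypothetical, hand-picked weights here, not trained values.
    """
    # Squeeze: global average pool each channel to a single number.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expand back to C channels and squash with a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates

# Toy input (assumption): 2 channels of 2x2 feature values.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w1 = [[0.5, 0.5]]      # reduces 2 channels to 1 hidden unit
w2 = [[1.0], [-1.0]]   # expands 1 hidden unit back to 2 channel gates
scaled, gates = se_block(features, w1, w2)
```

In a real network the gate weights are learned, and the block sits after a convolutional stage so that informative channels are emphasized at little extra computational cost.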