Into the heart of Bangla speech: advancing speech sentiment recognition with semi-supervised multimodal machine learning model leveraging an iterative SHAP-based feature selection
Date
2024-06
Publisher
Brac University
Author
Shruti, Abanti Chakraborty
Abstract
Automatic sentiment recognition from speech data is crucial for various applications.
As AI has grown in popularity, the importance of speech sentiment
analysis is increasing along with the amount of speech data in every industry.
Bengali is the seventh most spoken language in the world, yet research on voice sentiment
analysis in this language is lacking. This thesis investigates novel techniques
to enhance speech sentiment recognition in under-resourced languages like Bengali.
We explore the efficacy of both unimodal (speech only) and multimodal (speech, image,
and text) approaches with different fusion techniques. This research proposes a
semi-supervised Random Forest model, which achieved consistent and robust performance
across different modality combinations. This model demonstrated high accuracy
with fewer features, showcasing the efficiency and effectiveness of SHAP-based
semi-supervised learning in handling unlabeled data. Additionally, eight different
feature extraction techniques have been employed to extract acoustic features, and
VGG19 and Bangla Word2Vec are used to extract image and text features, respectively. Moreover,
this study has experimented with different modality-based methods such as
LSTM, CNN, and BanglaBERT. We have used BanglaSER, SUBESCO, and KBES
datasets for our experiments. Among the various models tested, early fusion techniques
proved the most effective, achieving an accuracy of up to 83% when combining
speech and text modalities with LSTM classifiers, while the proposed semi-supervised
model reached its highest accuracy of 77% with the audio, text, and image modalities. In contrast,
late fusion techniques showed reduced performance, though including speech
and image modalities improved accuracy to 62%. Detailed performance comparisons
for unimodal systems indicate that traditional Random Forest models perform
well with fully labeled datasets, but our semi-supervised model works comparatively
well with only 20% labeled data. Moreover, our proposed semi-supervised AdaBoost
model, using only 20 features and SHAP-based feature importance, outperformed
the traditional model trained with 50 features. Remarkably, the proposed Random
Forest model trained with 20% labeled and 80% unlabeled data achieved over
70% accuracy across different feature selection methods, with the weighted feature
selection technique achieving the highest accuracy of 72%. We believe this thesis
will contribute significantly to Bangla speech sentiment recognition by providing
a robust, efficient, and interpretable framework.
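
The abstract contrasts early fusion with late fusion without spelling out the difference. The following is a minimal sketch of the two ideas, not the thesis implementation: early fusion is shown as concatenating per-modality feature vectors before a single classifier sees them, and late fusion as averaging per-modality class probabilities after separate classifiers have run. The function names `early_fusion` and `late_fusion`, the averaging rule, and all numeric values are assumptions made for illustration only.

```python
from typing import Dict, List


def early_fusion(features: Dict[str, List[float]], order: List[str]) -> List[float]:
    """Feature-level fusion: concatenate modality vectors in a fixed order."""
    fused: List[float] = []
    for name in order:
        fused.extend(features[name])
    return fused


def late_fusion(probabilities: Dict[str, List[float]]) -> List[float]:
    """Decision-level fusion: average class probabilities across modalities.

    Averaging is one simple rule chosen for this sketch; the thesis may
    combine per-modality decisions differently.
    """
    vectors = list(probabilities.values())
    n_classes = len(vectors[0])
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all modalities must score the same set of classes")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n_classes)]


# Toy inputs (made up): 3 acoustic features and 2 text features for one clip.
joint_vector = early_fusion(
    {"speech": [0.1, 0.2, 0.3], "text": [0.5, 0.5]},
    order=["speech", "text"],
)

# Toy per-modality class probabilities (made up) for two sentiment classes.
averaged = late_fusion({"speech": [0.75, 0.25], "image": [0.25, 0.75]})

# In early fusion a single classifier is trained on joint_vector;
# in late fusion the final label is the argmax of the averaged scores.
predicted_class = max(range(len(averaged)), key=lambda i: averaged[i])
```

With early fusion the classifier can learn interactions between modalities, at the cost of a larger input vector; late fusion keeps each modality's model independent and only merges their outputs.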