Multimodal Emotion Recognition from Speech and Text Using Heterogeneous Ensemble Techniques
Abstract
Emotion recognition and sentiment analysis serve many purposes, from analyzing
human behavior under specific conditions to enhancing the customer experience for
various services. In this paper, a multimodal approach is used to identify 4 classes
of emotions by combining both speech and text features to improve classification
accuracy. The methodology involves the implementation of several models for both
audio and text domains, combined using 4 different heterogeneous ensemble techniques: hard voting, soft voting, blending, and stacking. The effects of the different
ensemble learning methods on the accuracy of the multimodal classification task
are also investigated. The results of this study show that stacking is the best
performing ensemble technique, and the implementation outperforms several existing methods for 4-class emotion detection on the IEMOCAP dataset, achieving a
weighted accuracy of 81.2%.
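To make the distinction between the ensemble techniques concrete, the following is a minimal illustrative sketch (not the paper's implementation) contrasting soft voting with stacking using scikit-learn. The base estimators and the synthetic 4-class data here are stand-ins assumed for demonstration; in the paper, the base models are per-modality speech and text classifiers.

```python
# Illustrative sketch only: soft voting vs. stacking over two base
# classifiers, on synthetic 4-class data standing in for fused
# speech+text features. Not the authors' actual models.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for fused multimodal feature vectors (4 emotion classes).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [("svm", SVC(probability=True, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000))]

# Soft voting: average the base models' predicted class probabilities
# and take the argmax.
voter = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

# Stacking: a meta-learner is trained on the base models' out-of-fold
# predictions, learning how to weight each model per class.
stacker = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

print("soft voting accuracy:", voter.score(X_te, y_te))
print("stacking accuracy:", stacker.score(X_te, y_te))
```

Hard voting differs from soft voting only in that each base model casts a single class label rather than a probability distribution; blending is similar to stacking but trains the meta-learner on a held-out split instead of out-of-fold predictions.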