Early threat warning via speech and emotion recognition from voice calls
Abstract
The aim of this system is to identify potential threats and provide an early warning for such
cases, based on voice communication such as voice chat over telecommunication networks or
social media. The intended result is achieved in three major steps. First, speech from both
real-time audio recordings and accent groups is converted to text, primarily using IBM
Watson’s Speech to Text. Second, the resulting text is used to identify possible trigger
words or word patterns from a classified selection of threat-related and negative words.
Finally, the same audio source is used to detect emotions from frequency shifts: vocal
features are extracted from the audio input and processed using multiple classifier
algorithms such as Support Vector Machines (SVMs), Random Forests and Naïve Bayes. Libraries
such as LibROSA are used to extract primary audio features such as Mel Frequency Cepstral
Coefficients (MFCCs) to generate accurate predictions. The system achieves an accuracy of
approximately 84% in detecting emotion from speech when using an SVM with the RBF (Radial
Basis Function) kernel.
Keywords— Emotion Recognition; Support Vector Machines; Speech to Text; Random
Forest; Feature Extraction; MFCC
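
As an illustration of the second step (matching a transcript against a selection of
threat-related words and word patterns), the following is a minimal sketch only. The word
list, the patterns and the function name find_triggers are hypothetical placeholders and do
not reproduce the lexicon or matching rules used by the system; the transcript is assumed to
be plain text already returned by the speech-to-text step.

```python
import re

# Hypothetical trigger lexicon (assumption): the system's actual word list
# is not reproduced here.
TRIGGER_WORDS = {"kill", "hurt", "attack"}

# Hypothetical multi-word patterns (assumption), matched on lowercased text.
TRIGGER_PATTERNS = [re.compile(r"\bblow\s+up\b")]


def find_triggers(transcript):
    """Return trigger words and phrases found in a transcript.

    Single-word hits come first, in order of appearance, followed by
    multi-word pattern hits.
    """
    text = transcript.lower()
    # Simple tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text)
    hits = [tok for tok in tokens if tok in TRIGGER_WORDS]
    for pattern in TRIGGER_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits
```

A call such as find_triggers(transcript) returns an empty list when no listed word or pattern
occurs, so the result can be used directly as a flag for raising an alert.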