Urban sound classification using convolutional neural network and long short-term memory based on multiple features
Abstract
There are many sounds all around us and our brain can easily and clearly identify
them. Furthermore, our brain processes the received sound signals continuously and
provides us with relevant environmental knowledge. Although not as accurate as the brain,
some smart devices can extract the necessary information from an audio signal with the
help of different algorithms, and as the days pass, more and more research is being
conducted to increase the accuracy of this information extraction. Over the years, several
models such as CNN, ANN, and RCNN, along with many machine learning techniques, have been
adopted to classify sound accurately, and these have shown promising results in recent
years in distinguishing spectro-temporal images. For our research, we use seven features:
Chromagram, Mel-spectrogram, Spectral contrast, Tonnetz, MFCC, Chroma CENS, and Chroma CQT.
We have employed two models, LSTM and CNN, for the classification of audio signals, and the
dataset used for the research is UrbanSound8K. The novelty of the research lies in showing
that the LSTM achieves better classification accuracy than the CNN when the MFCC feature is
used. Furthermore, we have augmented the UrbanSound8K dataset and shown that the accuracy
of the LSTM is higher than that of the CNN on both the original and the augmented dataset.
Moreover, we have tested the accuracy of the models with respect to the features used. This
has been done by applying each of the features separately to each of the models, in
addition to the two forms of feature stacking that we have performed. The first form of
feature stacking contains the features Chromagram, Mel-spectrogram, Spectral contrast,
Tonnetz, and MFCC, while the second form contains MFCC, Mel-spectrogram, Chroma CQT, and
Chroma STFT. Likewise, we have stacked features in different combinations to expand our
research. In this way our LSTM model reaches an accuracy of 98.80%, which is
state-of-the-art performance.
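As a rough illustration of how the seven features named above could be extracted and stacked per clip, a minimal Python sketch is given below. The use of librosa, the time-averaging of each feature, the 40 MFCC coefficients, and the file path are assumptions for illustration; the abstract does not specify the exact extraction settings.

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    """Compute the seven per-clip features named in the abstract (assumed librosa calls)."""
    y, sr = librosa.load(path, sr=None)
    harmonic = librosa.effects.harmonic(y)  # Tonnetz is typically computed on the harmonic component
    feats = {
        "chroma_stft": librosa.feature.chroma_stft(y=y, sr=sr),        # Chromagram
        "mel":         librosa.feature.melspectrogram(y=y, sr=sr),     # Mel-spectrogram
        "contrast":    librosa.feature.spectral_contrast(y=y, sr=sr),  # Spectral contrast
        "tonnetz":     librosa.feature.tonnetz(y=harmonic, sr=sr),     # Tonnetz
        "mfcc":        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
        "chroma_cens": librosa.feature.chroma_cens(y=y, sr=sr),        # Chroma CENS
        "chroma_cqt":  librosa.feature.chroma_cqt(y=y, sr=sr),         # Chroma CQT
    }
    # Average each feature over time to obtain one fixed-length vector per clip (an assumption).
    return {name: f.mean(axis=1) for name, f in feats.items()}

def stack_features(feats, names):
    """Concatenate a chosen combination of per-clip feature vectors."""
    return np.concatenate([feats[name] for name in names])

feats = extract_features("path/to/UrbanSound8K/clip.wav")  # hypothetical path
# First stacking: Chromagram, Mel-spectrogram, Spectral contrast, Tonnetz, MFCC
stacked_1 = stack_features(feats, ["chroma_stft", "mel", "contrast", "tonnetz", "mfcc"])
# Second stacking: MFCC, Mel-spectrogram, Chroma CQT, Chroma STFT
stacked_2 = stack_features(feats, ["mfcc", "mel", "chroma_cqt", "chroma_stft"])
```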
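For the classification side, the following is a minimal Keras sketch of an LSTM classifier over MFCC frame sequences for the ten UrbanSound8K classes; the layer sizes, dropout rates, and padded sequence length are illustrative assumptions rather than the architecture that produced the reported 98.80% accuracy.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

TIME_STEPS = 174  # assumed number of padded MFCC frames per 4-second clip
N_MFCC = 40       # assumed number of MFCC coefficients per frame

# Two stacked LSTM layers read the MFCC frame sequence of each clip,
# followed by a softmax over the 10 UrbanSound8K classes.
model = Sequential([
    Input(shape=(TIME_STEPS, N_MFCC)),
    LSTM(128, return_sequences=True),
    Dropout(0.3),
    LSTM(64),
    Dropout(0.3),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Replacing the recurrent layers with convolutional layers over the same stacked features would give a rough CNN counterpart of the kind compared against in the abstract.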