AVCL: Audio-Video Clustering for Learning Conversation Labeling Using Neural Networks and NLP
Abstract
Audiovisual data is the most extensively used and widely distributed type of data on the internet in today’s information and communication age. However, the audiovisual data a user needs is often hard to retrieve because much of it is not correctly categorized. Consequently, locating the required audiovisual data when it is needed is difficult, and a great amount of potentially useful information fails to reach users in a timely manner. A piece of data is only as valuable as the time frame in which it is found. Moreover, because audiovisual files such as lecture notes, recorded classes, and recorded conversations are quite large, skimming through them for the necessary information is time consuming, and the likelihood of not finding the right information on time is high. We therefore propose a novel model that will
take any audio or video file as input and label it according to its content utilizing
Convolutional Neural Networks, BERT, and several machine learning techniques.
Our proposed model accepts any audiovisual file as input, extracts features from its contents, and uses a convolutional neural network together with a transformer to recognize and transcribe the speech in the conversations. Using BERT models and cosine similarity, keywords and key phrases are extracted from the transcript, and the input file is labeled with the key phrases and keywords most similar to the context of its content, so that anyone who later needs similar information can quickly locate the audiovisual file.
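As a rough illustration of the recognition-and-transcription step described above, the sketch below runs a pretrained CNN-plus-transformer speech recognizer over an audio file. The specific model name (facebook/wav2vec2-base-960h) and the file name lecture.wav are illustrative assumptions, not the configuration used in this paper.

```python
# Minimal sketch of the speech-to-text step, assuming the Hugging Face
# `transformers` library is installed. Model name and audio path are
# illustrative placeholders, not the paper's exact recognizer.
from transformers import pipeline

# wav2vec 2.0 pairs a convolutional feature encoder with a transformer,
# mirroring the CNN + transformer combination described in the abstract.
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

transcript = asr("lecture.wav")["text"]  # hypothetical input file
print(transcript)
```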
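The labeling step can be sketched in the same spirit: embed the transcript and a set of candidate phrases with a BERT-based encoder, then rank the candidates by cosine similarity to the transcript's overall context. The sentence-transformers model name and the n-gram candidate extraction below are assumptions chosen for illustration, not the paper's exact pipeline.

```python
# Minimal sketch of BERT + cosine-similarity labeling, assuming the
# `sentence-transformers` and `scikit-learn` packages. The model choice
# and candidate-extraction scheme are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def label_transcript(transcript: str, top_k: int = 5) -> list[str]:
    # Collect candidate keywords/key phrases (1- to 3-word n-grams).
    vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english")
    candidates = vectorizer.fit([transcript]).get_feature_names_out()

    # Embed the full transcript and each candidate with a BERT-based encoder.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    doc_emb = model.encode([transcript])
    cand_embs = model.encode(list(candidates))

    # Rank candidates by cosine similarity to the transcript's context
    # and keep the top_k as labels for the audiovisual file.
    scores = cosine_similarity(doc_emb, cand_embs)[0]
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_k]]

if __name__ == "__main__":
    print(label_transcript("Today's lecture covers convolutional neural "
                           "networks and how pooling layers reduce dimensions."))
```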