Analysis of real-time hostile activity detection from spatiotemporal features using time-distributed deep convolutional neural networks, recurrent neural networks and attention-based mechanisms
Abstract
In recent years, there has been a surge of hostile activities in public places across the globe. Advances in technology have made it possible to monitor public places through real-time surveillance. Video surveillance has become essential for ensuring public safety, as it significantly lowers the crime rate and allows facilities to be monitored within its reach. Hence, CCTV cameras are installed wherever security is a priority. Although CCTV cameras greatly improve security, the main drawback of these surveillance systems is that they require constant human monitoring. To eradicate this issue, an automated surveillance system can be built using artificial intelligence, deep learning, and IoT (Internet of Things). In this research, we explore deep learning video classification techniques that can help automate surveillance systems to detect violence as it happens. Traditional machine learning and image classification techniques fall short when classifying videos because they classify each frame independently, which causes predictions to flicker. Consequently, many researchers have proposed video classification techniques that consider spatiotemporal features. However, deploying such deep learning models is not always practical in an IoT environment. For this reason, we cannot use features such as skeleton points or optical flow, which are acquired through technologies like pose estimation or depth sensors. Although these techniques yield higher accuracy, they are computationally heavy. Keeping these constraints in mind, we experimented with various video classification and action recognition techniques such as ConvLSTM, LRCN (with both custom CNN layers and VGG-16 as the feature extractor), CNN-Transformer, and C3D (3D CNN). We achieved test accuracies of 80% with ConvLSTM, 83.33% with CNN-BiLSTM, 70% with VGG16-BiLSTM, 76.76% with CNN-Transformer, and 80% with the C3D model.
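As a concrete illustration of the time-distributed approach described above, the sketch below builds a minimal LRCN-style model in Keras: a small CNN applied to each frame via TimeDistributed, followed by a bidirectional LSTM over the per-frame features. The clip length, frame resolution, and filter counts here are illustrative assumptions, not the exact configuration evaluated in this work.

```python
# Minimal LRCN-style sketch (assumed hyperparameters, not the paper's exact model):
# a per-frame CNN wrapped in TimeDistributed, feeding a bidirectional LSTM
# for binary violence classification.
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 16, 64, 64, 3  # assumed: clips of 16 frames, 64x64 RGB

model = models.Sequential([
    # Apply the same small CNN independently to every frame in the clip.
    layers.TimeDistributed(
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        input_shape=(SEQ_LEN, H, W, C)),
    layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
    layers.TimeDistributed(
        layers.Conv2D(32, (3, 3), activation="relu", padding="same")),
    layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
    layers.TimeDistributed(layers.Flatten()),
    # Model temporal dependencies across the per-frame feature vectors.
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),  # violent vs. non-violent
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Classifying the whole clip from the LSTM's final state, rather than each frame separately, is what suppresses the frame-by-frame prediction flicker noted above.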