Optical flow based violence detection from video footage using hybrid MobileNet and Bi-LSTM
Abstract
This thesis presents a novel approach for the automatic detection, categorization, and sub-categorization of violent and non-violent behaviors in video footage, addressing the growing need for enhanced security protocols in both public and private sectors. Surveillance cameras are widely available and affordable; however, their use is often inefficient because of the limitations of human real-time monitoring, which can delay responses to unanticipated events and underscores the need for more efficient monitoring measures. We automate violence detection using machine learning and deep learning techniques, integrating object and motion detection through optical flow analysis with a MobileNet-Bi-LSTM fusion architecture. This methodology surpasses conventional methods by incorporating both spatial features and temporal dynamics. Recognizing the importance of training data for an effective violence detection system, we invested considerable effort in enhancing our dataset: in addition to an existing dataset, we systematically compiled further video footage spanning a diverse array of environments, lighting conditions, and situations. This range is crucial for enabling our model to generalize to real-world scenarios. To uphold quality and precision, a thorough annotation procedure meticulously labeled actions as 'violent' or 'non-violent,' along with specific subcategories of violence such as 'Beating,' 'Use of Weapons,' and 'Burning.' For an in-depth review, a comparative study examined two distinct methodologies.
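The MobileNet-Bi-LSTM fusion mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's exact configuration: the clip length of 16 frames, the LSTM width of 64 units, and the use of MobileNetV2 with untrained weights (to keep the sketch self-contained) are all illustrative assumptions.

```python
# Hypothetical sketch of a MobileNet + Bi-LSTM video classifier.
# Assumptions: 16-frame clips at 224x224x3; weights=None instead of
# pretrained ImageNet weights; layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 16, 224, 224, 3  # assumed clip shape

def build_mobilenet_bilstm(num_classes=2):
    # Per-frame spatial features from a MobileNetV2 backbone.
    backbone = tf.keras.applications.MobileNetV2(
        include_top=False, weights=None,
        input_shape=(H, W, C), pooling="avg")
    inputs = layers.Input(shape=(SEQ_LEN, H, W, C))
    # Apply the CNN to every frame, then model temporal
    # dynamics across the clip with a bidirectional LSTM.
    x = layers.TimeDistributed(backbone)(inputs)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_mobilenet_bilstm()
```

The key design point is the division of labor: the CNN extracts spatial features frame by frame, while the Bi-LSTM captures temporal dynamics across the clip, mirroring the spatial-plus-temporal fusion described in the abstract.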
The first approach is a binary classification that categorizes actions into two distinct classes: 'Non-Violence' and 'Violence.' The second approach categorizes the videos of our unique dataset, named the 'Beating-Burning-Weapon (BBW) Violence' Dataset, into the same two main groups, with 'Violence' further subdivided into three sub-categories: 'Beating,' 'Burning,' and 'Use of Weapons.' In our comprehensive evaluation, we tested two violence detection methods on the two previously mentioned datasets. The 'Frame Selection at Equal Intervals' method achieved the higher accuracy, 90.16% on the 'Real Life Violence Situations (RLVS)' Dataset and 85.32% on the BBW Violence Dataset, making it the more accurate choice. The 'Merged Frame Stacking' method, which offers greater computational efficiency, achieved respectable accuracies of 85% and 74% on the RLVS and BBW Violence Datasets, respectively. These results provide a foundational baseline for violence detection and highlight method-specific advantages and trade-offs. Our research holds significant potential for proactive security management by promptly detecting and responding to possible threats.
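The 'Frame Selection at Equal Intervals' sampling step can be sketched as below. This is a hedged illustration under assumed parameters: the function name and the choice of 4 frames in the toy example are hypothetical, not taken from the thesis.

```python
# Hypothetical sketch of equal-interval frame sampling: pick a fixed
# number of frames spaced evenly across the full length of a video.
import numpy as np

def select_frames_equal_intervals(video, num_frames):
    """Return `num_frames` frames spaced evenly from first to last."""
    total = len(video)
    # Evenly spaced indices covering the whole video, endpoints included.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    return [video[i] for i in indices]

# Toy example: a "video" of 100 dummy frames, sampled down to 4.
video = list(range(100))
clip = select_frames_equal_intervals(video, num_frames=4)
# clip → [0, 33, 66, 99]
```

Sampling the entire duration at equal intervals preserves the clip's overall temporal structure, which is consistent with the higher accuracy this method achieved relative to stacking merged frames.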