Detecting anomalies in security cameras with 3D-convolutional neural network and convolutional long short-term memory

This paper presents a novel deep learning-based approach for anomaly detection in surveillance videos. The proposed approach is built on a deep network trained to recognize objects and human activity in video. To detect anomalies in surveillance videos, it combines the strengths of a 3D-convolutional neural network (3DCNN) and convolutional long short-term memory (ConvLSTM): the 3DCNN extracts spatiotemporal features from the video frames, while the ConvLSTM captures temporal relationships between frames. The technique was evaluated on five large-scale real-world datasets (UCFCrime, XDViolence, UBIfights, CCTVFights, UCF101) containing both indoor and outdoor video clips, as well as synthetic datasets with a range of object shapes, sizes, and behaviors. The results demonstrate that combining 3DCNN with ConvLSTM can increase precision and reduce false positives, achieving high accuracy and area under the receiver operating characteristic curve (ROC-AUC) in both indoor and outdoor scenarios when compared to the state-of-the-art techniques included in the comparison. This is an open access article under the CC BY-SA license.


1. INTRODUCTION
Protection is the main driver for the deployment of surveillance systems: even for individuals, a significant concern is the capacity to monitor people in order to keep them safe, and to respond quickly to that purpose. Although the use of monitoring devices has risen, the human capacity to watch them has not [1]. As a result, a great deal of oversight is necessary to identify odd events that could harm people or a business, even though abnormal events are improbable relative to normal events, so considerable labor and time are lost [2].
For organizations like law enforcement and security, surveillance footage is a valuable source of information. Automated systems are used to keep an eye on both indoor and outdoor spaces, including parking lots, malls, and airports. With the use of 2D or 3D cameras, the captured video streams are transformed into images, and computer vision algorithms analyze these images to find objects, people, and actions in the scene. In order to detect and respond to unforeseen incidents like robberies, assaults, vandalism, or traffic accidents, video surveillance systems must be able to recognize anomalous actions in these settings.

ISSN: 2088-8708
However, compared to typical events, anomalous occurrences are uncommon. The development of computer vision systems that automatically detect anomalous actions in surveillance videos is essential, since monitoring surveillance videos manually is vital but time-consuming. It can be challenging to detect changes in the scene in many surveillance recordings due to their low quality and discontinuous character. Conventional methods solve this problem with handcrafted feature extractors that find abnormal events. These methods take a lot of work, and they are challenging to maintain as video formats change over time. Machine learning innovations in recent years have made it possible to train algorithms to detect anomalies without explicitly defining features.
The problem definition for detecting anomalies in surveillance videos involves developing algorithms that can identify events or behaviors that deviate from the expected norm in a given environment. This task can be particularly challenging due to the complexity and variability of real-world environments, as well as the need for algorithms to operate in real time, with minimal delay or latency. Furthermore, the algorithms must distinguish between normal and abnormal events with a high level of accuracy to avoid false positives or negatives. To achieve this, various approaches have been proposed, such as machine learning techniques that automatically learn from training data and identify patterns of normal behavior. Another approach is to use deep learning techniques, such as convolutional neural networks (CNNs), which have shown promising results in detecting anomalies in surveillance videos. However, the effectiveness of these approaches depends heavily on the quality and quantity of available training data, as well as the specific features used to represent normal and abnormal behaviors. Additionally, the development of more advanced sensors and cameras that capture high-quality video data with greater detail and resolution has also contributed to improving accuracy.
In this research, we suggest a novel strategy for instantly identifying and classifying anomalies in video recordings according to different anomaly classes, such as assault, robbery, and fighting, employing CNNs that draw attributes from video frames. In this method, we first choose convolutional long short-term memory (ConvLSTM) to learn the long-term spatial and temporal characteristics of anomalies, then a 3D-convolutional neural network (3DCNN) to learn the short-term spatial and temporal characteristics. To improve training stability and performance, we then merge these networks into a single architecture that classifies surveillance videos. To learn picture properties that are discriminative for the various anomaly classes, multiple layers of convolutional networks are trained on thousands of images; for each class, they are taught to distinguish the normal from the abnormal frames in a video clip. To do this, a feature extracted from a typical frame is compared with a feature extracted from an anomaly frame of the same class to determine how similar they are, and the frame is classified as normal or abnormal from a similarity score computed between the two feature vectors. The biggest disadvantage of this approach is the large number of training images and datasets required for the network to recognize important image features. We trained our proposal on UCFCrime, a sizable dataset with more than 128 hours of recorded video, classified into 8 anomaly classes and 1 normal class. Using the held-out test data, we evaluate the performance of our proposal and find that it outperforms other existing approaches and achieves suitable classification accuracy for the various kinds of anomaly events.
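As a sketch of the frame-level similarity comparison just described, the snippet below scores a frame feature against a reference normal-frame feature using cosine similarity; the function names and the 0.8 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def similarity_score(feat_a, feat_b):
    """Cosine similarity between two frame feature vectors, in [-1, 1]."""
    denom = np.linalg.norm(feat_a) * np.linalg.norm(feat_b)
    return float(np.dot(feat_a, feat_b) / denom)

def classify_frame(frame_feat, normal_feat, threshold=0.8):
    """Label a frame normal when it is close enough to the reference
    normal-frame feature of the same class, abnormal otherwise."""
    if similarity_score(frame_feat, normal_feat) >= threshold:
        return "normal"
    return "abnormal"
```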
To identify the various forms of anomalies, we first detail the dataset used in this study as well as how it was pre-processed, trained, and evaluated using a 3DCNN technique. Following that, we summarize the findings on the test dataset and report each dataset's classification accuracy and area under the receiver operating characteristic curve (ROC-AUC). This paper is structured as follows: section 2 reviews the literature connected to this research; section 3 discusses the 3DCNN and section 4 the ConvLSTM; the proposed method is described in section 5 and the datasets in section 6, followed by the experimental results, discussion, and conclusions.

2. RELATED WORK
Using computer vision to recognize certain actions in security camera footage has gained prominence in the action detection industry, and this work falls within the field of computer vision. To automate the task of video anomaly detection, many academics have been working to build effective machine learning approaches. Figure 1 displays the distribution of papers on anomaly detection published in the public domain between 2015 and 2021. A model for detecting anomalies in surveillance footage is presented in [3], [4]. The system has two phases, and numerous handcrafted features have been demonstrated on this platform. Convolutional 3D (C3D) features have also been extracted from video data using deep learning approaches, with anomaly detection performed by a support vector machine (SVM); Sultani et al. [5] used these methods. The next stage is behavior modeling, during which the SVM is trained with a bag-of-visual-words (BOVW) representation to model typical behavior. Campus violence is the most dangerous form of bullying in schools and is a problem for society worldwide. As remote monitoring and artificial intelligence (AI) capabilities develop, a number of techniques, including video-based ones, can help prevent campus violence. To identify campus violence, Ye et al. [6] utilize aural and visual data. Role-playing is employed to collect campus violence data, and 16 frames are extracted from each video to create 4096-dimension feature vectors. Applying the 3DCNN for feature extraction and classification, a total precision of 92% is attained.
Lv et al. [7] present the weakly supervised anomaly localization (WSAL) technique, which focuses on temporally localizing anomalous regions inside anomalous videos, motivated by their striking contrast with normal content. The evolution of adjacent temporal segments is evaluated in order to identify abnormal portions. To achieve this, a high-order context encoding model is proposed that measures dynamic fluctuations in addition to extracting semantic representations, making effective use of the temporal context. Video classification is more difficult than static image classification because it is challenging to accurately capture both the spatial and temporal information of successive video frames. Ji et al. [8] proposed the 3D convolution operator for computing features from both spatial and temporal data. Wu et al. [9] provide a self-supervised sparse representation (S3R) framework in 2022 that models the idea of anomaly at the feature level by exploiting the synergy between dictionary-based representation and self-supervised learning. To improve the discriminativeness of feature magnitudes for recognizing anomalies, Chen et al. [10] proposed the magnitude contrastive loss and the feature amplification mechanism, with test results on the XDViolence and UCFCrime benchmark datasets for violence and crime.

3. 3D-CONVOLUTIONAL NEURAL NETWORK
A 3DCNN is a particular kind of neural network composed of several convolutional layers, followed by layers of fully connected nonlinear units, organized in multiple parallel planes to form a 3D structure. Just as 2D convolutional layers extract spatial patterns from image data, a convolution along the time dimension can extract temporal patterns from the data. However, if our data includes both spatial and temporal patterns, as video data does, we should investigate these two types of patterns jointly, since they can combine to produce more complex spatio-temporal patterns. The fundamental principle behind a 3DCNN is to successively process an image or a video clip in both dimensions (spatial and temporal) in order to produce the desired outcome.
Detecting anomalies in security cameras with 3D-convolutional neural network and ... (Esraa A. Mahareek)

3DCNN accomplishes this by extending the CNN convolution kernel into three dimensions. Utilizing a 3DCNN [11] is efficient for extracting video features: it extracts the spatio-temporal information from the entire video for a more thorough analysis. Given the data format of the video, the 3D convolution kernel is used to extract regional spatio-temporal neighborhood information. The 3DCNN formula is given in (1):

$v_{ij}^{xyz} = \mathrm{ReLU}\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \quad (1)$

where ReLU denotes the activation function of the hidden layer, $v_{ij}^{xyz}$ is the current value at point $(x, y, z)$ of the $j$th feature map in the $i$th layer, $b_{ij}$ denotes the bias of the $i$th layer and $j$th feature map, $w_{ijm}^{pqr}$ is the $(p, q, r)$th kernel value associated with the $m$th feature map in the previous layer, $P_i$ and $Q_i$ denote the height and width of the convolution kernel, and $R_i$ its size along the temporal axis.
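A direct NumPy transcription of (1) may make the indexing concrete. This naive loop version assumes no padding (the kernel must stay inside the input volume) and is written for clarity, not speed.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv3d_value(v_prev, w, b, x, y, z):
    """Value v_ij^{xyz} of one output feature map at point (x, y, z),
    following Eq. (1).
    v_prev: previous-layer feature maps, shape (M, X, Y, Z)
    w:      kernels, shape (M, P, Q, R)
    b:      scalar bias b_ij for this feature map."""
    M, P, Q, R = w.shape
    acc = b
    for m in range(M):
        for p in range(P):
            for q in range(Q):
                for r in range(R):
                    acc += w[m, p, q, r] * v_prev[m, x + p, y + q, z + r]
    return relu(acc)
```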

4. CONVOLUTIONAL LONG SHORT-TERM MEMORY
ConvLSTM was developed specifically for spatio-temporal sequence prediction problems. Compared to a regular LSTM, ConvLSTM may be more effective at extracting spatial and temporal characteristics from feature map sets [11]. This allows ConvLSTM, which analyzes and predicts occurrences in time series, to incorporate the spatial data of a single feature map. ConvLSTM can therefore be applied to dynamic anomaly recognition to address timing difficulties more successfully. The ConvLSTM [12] is defined by the following equations.
The inputs are $P_1, P_2, \ldots, P_t$, the cell outputs are $C_1, C_2, \ldots, C_t$, and the hidden states are $K_1, K_2, \ldots, K_t$. The gates $i_t$, $f_t$, and $O_t$ are the 3D tensors of ConvLSTM, whose last two dimensions are the spatial ones (rows and columns):

$i_t = \sigma(W_{pi} * P_t + W_{ki} * K_{t-1} + W_{ci} \circ C_{t-1} + b_i) \quad (2)$
$f_t = \sigma(W_{pf} * P_t + W_{kf} * K_{t-1} + W_{cf} \circ C_{t-1} + b_f) \quad (3)$
$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{pc} * P_t + W_{kc} * K_{t-1} + b_c) \quad (4)$
$O_t = \sigma(W_{po} * P_t + W_{ko} * K_{t-1} + W_{co} \circ C_t + b_o) \quad (5)$
$K_t = O_t \circ \tanh(C_t) \quad (6)$

The operators $*$ and $\circ$ stand for the convolution operator and the Hadamard product, respectively. In this instance, the ConvLSTM is supplemented with a batch normalization layer and a dropout layer.
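One ConvLSTM step can be sketched in NumPy for the single-channel case. The explicit weight layout below (one input kernel, one hidden-state kernel, and one peephole map per gate) is our assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, k):
    """'Same'-padding 2D cross-correlation of map x (H, W) with kernel k."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def convlstm_step(P_t, K_prev, C_prev, W, b):
    """One ConvLSTM step for single-channel maps, following Eqs. (2)-(6).
    W maps each gate name to (input kernel, hidden kernel, peephole map);
    b maps each gate name to a scalar bias."""
    i_t = sigmoid(conv2d_same(P_t, W['i'][0]) + conv2d_same(K_prev, W['i'][1])
                  + W['i'][2] * C_prev + b['i'])
    f_t = sigmoid(conv2d_same(P_t, W['f'][0]) + conv2d_same(K_prev, W['f'][1])
                  + W['f'][2] * C_prev + b['f'])
    C_t = f_t * C_prev + i_t * np.tanh(conv2d_same(P_t, W['c'][0])
                                       + conv2d_same(K_prev, W['c'][1]) + b['c'])
    O_t = sigmoid(conv2d_same(P_t, W['o'][0]) + conv2d_same(K_prev, W['o'][1])
                  + W['o'][2] * C_t + b['o'])
    K_t = O_t * np.tanh(C_t)  # new hidden state
    return K_t, C_t
```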

5. PROPOSED METHOD
To classify videos, 3DCNN and ConvLSTM are combined. We go over the design of the 3DCNN-ConvLSTM model in this part. We propose a 3DCNN followed by a ConvLSTM network as a feature extraction model for the dynamic anomaly detection process. Figure 2 displays the architecture of our model. The input layer is a stack of 16 downscaled continuous anomalous video frames of size 32×32×3. The architecture consists of four 3D convolutional layers with 32, 32, 64, and 64 filters, respectively, all with the same 3×3×3 kernel size. After that, a ConvLSTM layer with 64 units was introduced. ReLU and batch normalization layers are placed after each 3DCNN layer, and 3D max pooling and dropout layers sit between each pair of 3DCNN layers, with dropout rates of 0.3 and 0.5. A fully connected layer with 512 units is followed by a softmax output layer whose number of output units equals the number of anomalous video classes.
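Since the text leaves some details open, the following pure-Python shape trace shows one consistent reading of the layer stack; the 2×2×2 pooling window, "same" convolution padding, and pooling after each pair of convolutions are assumptions, not values taken from the paper.

```python
def conv3d_same(shape, filters):
    # Conv3D with "same" padding preserves the (frames, height, width) size.
    t, h, w, _ = shape
    return (t, h, w, filters)

def maxpool3d(shape, pool=2):
    # Assumed 2x2x2 pooling window (the pool size is not stated in the text).
    t, h, w, c = shape
    return (t // pool, h // pool, w // pool, c)

# Input: a stack of 16 consecutive 32x32 RGB frames.
shape = (16, 32, 32, 3)
for filters, pool_after in [(32, False), (32, True), (64, False), (64, True)]:
    shape = conv3d_same(shape, filters)   # Conv3D + ReLU + batch norm
    if pool_after:
        shape = maxpool3d(shape)          # 3D max pooling + dropout

# shape is now (4, 8, 8, 64); this volume feeds the 64-unit ConvLSTM layer,
# then the 512-unit dense layer and the softmax classifier.
```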

6. DATASETS
The practice of identifying and analyzing anomalies in video data is gaining popularity. We apply our method to identify and analyze anomalies in several important video datasets, namely UCFCrime [5], XDViolence [13], UBIfights [14], NTU CCTVFights [15], and UCF101 [16].
The first dataset consists of 128 hours of video in various sizes and types. These videos cover eight categories of crimes, 1,900 videos in total: assault, arson, fighting, breaking and entering, explosion, arrest, abuse, and traffic accidents. Additionally, "Normal" videos, those without any footage of crimes, are part of the collection. This dataset can be used for two tasks. The first is a general analysis of anomalies, taking all anomalies as one group and all usual activities as another. Figure 3 depicts the distribution of the number of videos by class for each UCFCrime class.

With a runtime of 217 hours and a total of 4,754 untrimmed videos with audio signals and weak labels, the second dataset, XDViolence [13], is a sizable, multi-scene dataset. The third dataset, UBIfights, is focused on a specific anomaly detection task while still providing a wide range of fighting scenarios. It consists of 80 hours of video, fully annotated at the frame level, comprising 1,000 videos, of which 784 show typical daily events and 216 show fight scenarios. All unnecessary video clips, including video introductions and news, were removed to avoid interfering with the learning process. The titles of the videos describe their type, such as those shot with fixed, rotating, or movable cameras, in indoor or outdoor settings, and in red, green, and blue (RGB) or grayscale, or both.
The fourth dataset, UCF101, consists of 101 real-world activity categories of YouTube videos, with 13,320 videos in total. The videos are separated into 25 groups, each including four to seven videos of a given activity drawn from the 101 action categories. Videos from the same group may share similar backdrops, points of view, and other traits. CCTVFights, the final dataset, contains 1,000 videos of real fights recorded by CCTVs or handheld cameras. There are 280 CCTV videos in total, with fights lasting an average of 2 minutes and anywhere from 5 seconds to 12 minutes. In addition, it contains 720p videos of real fights obtained from other sources (referred to as non-CCTV in this text), primarily from mobile cameras but occasionally from dashcams, drones, and helicopters. These videos average 45 seconds in length and range from 3 seconds to 7 minutes; some of them contain several fights, which helps the model draw broader conclusions. Table 1 gives a detailed description of the datasets used in this experiment.
The deep learning model was tested on a Windows 10 Pro machine with an Intel Core i7 CPU and 16 GB of RAM. The system was implemented in Python using the Anaconda environment and the Spyder editor, with Keras and TensorFlow as the deep learning libraries. Data handling and pre-processing were done with the Python OpenCV library. Deep learning models have a variety of hyperparameters that affect how well they train and perform; here we discuss how our network is impacted by the number of iterations, one of the most important hyperparameters in modern deep learning systems. Thanks to graphics processing unit (GPU) parallelism, the model may in practice be trained with fewer iterations, which drastically speeds up computation. Training took longer with larger iteration counts than with smaller ones, but testing accuracy was higher. The size of the model training dataset can significantly affect both the number of iterations and the batch size.

7. EXPERIMENTAL RESULTS
Performance evaluation is a crucial task. The performance on the multi-class classification problem is therefore evaluated using the AUC, one of the most basic evaluation criteria for determining whether a classification model is effective. AUC measures the degree of separability: it demonstrates how well the model can distinguish between classes.
Accuracy and AUC are the two metrics employed for classification methods. A highly accurate model produces very few incorrect predictions, but accuracy does not take into account the cost of those wrong predictions to the business. When applied to such business problems, accuracy measurements abstract away the true positive (TP) and false positive (FP) characteristics and give model forecasts an excessive degree of confidence, which is detrimental to business objectives. AUC is the preferable statistic in such cases because it captures the trade-off between sensitivity and specificity across thresholds. Additionally, while accuracy evaluates the performance of a single model at a single threshold, AUC evaluates performance at several thresholds and makes models directly comparable.
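To make the sensitivity/specificity trade-off concrete, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive outscores a randomly chosen negative, counting ties as half. A minimal sketch:

```python
def roc_auc(labels, scores):
    """ROC-AUC via the rank (Mann-Whitney) formulation.
    labels: 1 for anomalous, 0 for normal; scores: model anomaly scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```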
The performance of the trained models was assessed using the AUC and recognition accuracy. Table 2 summarizes the results of our trials for recognition accuracy and AUC utilizing the suggested 3DCNN-ConvLSTM model with a batch size of 32 and 10, 30, and 50 iterations. Figures 4 and 5 depict the performance on the UCFCrime training and validation runs across 10 and 30 iterations, respectively; the training runs clearly showed that the model performed effectively. Performance on the UCF101 training and validation datasets is shown in Figure 6 at 100 iterations. The training accuracy of the model was almost 100%. The best recognition accuracy for the UCF101 dataset was 100% after 50 iterations, when 25% of the dataset was used to test the trained model, while the model's accuracy for the UCFCrime, XDViolence, CCTVFights, and UBIfights datasets is 98.5%, 95.1%, 99.1%, and 97.1%, respectively. When trained for 50 iterations, which takes 25 hours for the UCFCrime dataset, the model achieves the highest recognition accuracy among the five datasets. The model's AUC for the UCFCrime, XDViolence, CCTVFights, UBIfights, and UCF101 datasets is 92.2%, 87.7%, 94.3%, 93.3%, and 92.3%, respectively. These results are competitive with those of the recent studies compared in Tables 3, 4, 5, and 6. To properly assess the model, Table 3 contrasts our results with those of additional models reported by other studies for the UCFCrime dataset. It shows that our proposal produces the top AUC, 92.2%, across 50 iterations, and achieves 87.7% AUC and 95.1% accuracy for the XDViolence dataset.

Table 3. Comparison of our model's output with that of additional models for the UCFCrime dataset
Reference  AUC     Technique                                                Year
[10]       86.98%  Magnitude-contrastive glance-and-focus network (MGFN)    2022
[9]        85.99%  Self-supervised sparse representation (S3R)              2022
[7]        85.38%  Weakly supervised anomaly localization (WSAL)            2020
[17]       84.89%  Learning causal temporal relation (LCTR) and feature
                   discrimination for anomaly detection (FDAD)              2021
[18]       84.48%  Multi-stream network with late fuzzy fusion              2022
[19]       84

Entries reported for the UCF101 dataset (Table 6) include:
[23]       98.64%  Frame selection (SMART)                                  2020
[24]       98.6%   OmniSource                                               2020
[25]       98.2%   Text4Vis                                                 2022
[26]       98.2%

8. CONCLUSION
We proposed an anomaly detection model using deep learning in this work, since it is an effective artificial intelligence method for categorizing videos. To solve the problem of anomaly detection, the 3DCNN and ConvLSTM models work together. We evaluated the proposed method by applying it to five large-scale datasets. The five datasets displayed excellent performance, and model training accuracy was 100%. The recognition accuracies were 98.5%, 99.2%, and 94.5%, respectively. In comparison to 3DCNN alone, 3DCNN+ConvLSTM performed admirably on the datasets. Our study's findings demonstrate that the model is superior to the competing models in terms of accuracy. As a continuation of this work, we plan to develop a model for anticipating anomalies in surveillance footage.

Figure 1. Papers on violence detection published annually

Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 993-1004

To categorize a test video, it is split into 16 consecutive frames and fed into the trained model. The features that the model has learned are used to determine a likelihood score for each frame. Given the predictions for the 16 frames, the majority voting scheme predicts the label of the video sequence from the probability score of each frame. The majority voting formula is given in (7):

$Y = \mathrm{mode}(C(X_1), C(X_2), \ldots, C(X_{16})) \quad (7)$

where $X_1, X_2, \ldots, X_{16}$ denote the frames collected from the tested video, $Y$ denotes the predicted class label of the video, and $C(X_1), C(X_2), \ldots, C(X_{16})$ denote the predicted class label for each frame.
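Equation (7)'s majority vote is straightforward to sketch; breaking ties by first occurrence is an assumption the text does not specify.

```python
from collections import Counter

def majority_vote(frame_labels):
    """Y = mode(C(X1), ..., C(X16)): the predicted video label is the most
    frequent per-frame class prediction."""
    return Counter(frame_labels).most_common(1)[0][0]
```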

Figure 3. The percentage of UCFCrime classes

For evaluation, we split each dataset into 75:25 training and testing divisions; the held-out videos are used for testing. Each split is further divided into five folds of videos for training or validation.
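The 75:25 train/test division can be sketched as below; the fixed seed and function name are illustrative assumptions.

```python
import random

def split_dataset(video_paths, train_frac=0.75, seed=0):
    """Shuffle a list of video paths and split it into 75% training
    and 25% testing partitions."""
    rng = random.Random(seed)
    items = list(video_paths)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```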

Figure 7. The real-time detection of anomalies in explosion videos
Figure 8. The real-time detection of anomalies in abuse videos

Table 1 .
Specifics about each dataset used for comparison

Table 2 .
Performance evaluation of our model using the UCFCrime, CCTVFights, UBIfight, XDViolence, and UCF101 datasets

To fully assess the model, Table 4 contrasts the outcomes with those of additional models from previous studies for the XDViolence dataset; it shows that our proposal offers the top AUC, 87.7%, at 50 iterations. Table 5 contrasts the outcomes with those of additional models for the UBIfights dataset; our proposal offers the top AUC, 93.3%, at 50 iterations. Table 6 contrasts the outcomes with those of previous research for the UCF-101 dataset; our proposal offers the highest accuracy, 100%, at 50 iterations. Figures 7 and 8 show typical real-time detections on abuse and explosion videos, for instance.

Table 4 .
Comparison of our model's output with that of additional models for the XDViolence dataset

Table 5 .
Comparison of our model's output with that of additional models for the UBIfights dataset

Table 6 .
Comparison of our model's output with that of additional models for the UCF101 dataset