Automatic video censoring system using deep learning

ABSTRACT


INTRODUCTION
"We become what we see" is an ancient famous quote that holds even today.It points out the importance of the visual information effect on people in a way that is profoundly ingrained.Internet is full of things that are inappropriate for children and even teenagers that can affect their young minds in a very negative way.Violent scenes that encourage young viewers to reproduce the roles performed on television by their characters will cost them their lives.Films claiming that drug use is health-destroying simply promote drug use as young minds become curious to do it in real life.These scenes should be censored tighter.
The internet's media presence is vast because of the widespread use of video-sharing sites and cloud services. This data volume makes it impossible to manually monitor the content of those videos. One of the most crucial questions about video content is whether it contains an objectionable topic, for example, violence or abuse in any form. Beyond telling whether a video is appropriate or inappropriate, it is also important to identify which parts of it contain such content, so that parts that would be discarded in a simple broad analysis can be preserved.
Censoring boards and committees exist for large-scale movies and television shows, but people are moving away from movie theatres and television toward other sources such as smartphones, tablets, and computers as their primary source of entertainment. There are several sources of information and entertainment, such as Netflix, Amazon Prime Video, YouTube, Instagram, Facebook, and Twitter. These can be very convenient owing to their portability and their reach across the majority of the world's population, yet no mechanism has been put in place to censor the content shared through these platforms.
This is a big issue, especially in developing countries like India. Unlike mainstream movies certified by the central board of film certification (CBFC) and television monitored by the broadcasting content complaints council (BCCC), the over-the-top (OTT) platforms have no regulatory authority monitoring the content they stream, and therefore enjoy free rein. The contents on these websites are subject to observation by the supreme court when in direct contradiction to the various laws of the country, but legal loopholes and grey areas remain troubling [1]. Also, many notorious groups deliberately put inappropriate content on social media; there have been many instances of content depicting violence and bloodshed that is unbearable even for fully grown adults. The proposed system will help filter such content.
Censoring content in videos may be framed as a binary classification problem with two classes: appropriate and inappropriate. Some of the popular machine learning (ML) algorithms are logistic regression, k-nearest neighbors, decision trees, support vector machines (SVM), and the naïve Bayes algorithm. But here the data to be classified consists of complex, highly variable images. Due to this complexity, traditional ML algorithms are not enough, as discussed in the literature review section below, so the problem has to be tackled with deep learning (DL). The best known and most effective deep learning technique is the convolutional neural network. Through this work, different deep learning models have been compared to find the model best suited to automatic censoring of videos and images by classifying frames as violent or non-violent. The complete architecture of the system is provided in this paper along with the results, and conclusions are drawn at the end.
A lot of research has been conducted in the domain of video censoring using machine learning and deep learning; the most notable work relevant to this project is mentioned here. Nievas et al. [2] evaluated various state-of-the-art video descriptors for fight detection on datasets containing action-movie scenes and National Hockey League footage. The popular bag-of-words approach was found to classify the frames with 90% accuracy. It was discovered that for the hockey dataset the choice of low-level feature descriptor and vocabulary size was not important, whereas on the movie dataset descriptor choice was critical; in either circumstance, the motion SIFT (MoSIFT) descriptor significantly outperformed the best spatio-temporal interest points (STIP).
Deniz et al. [3] proposed a new approach for the detection of violent actions in which the major discriminative attribute is extreme acceleration patterns. The proposed algorithm yielded a 12% accuracy improvement over state-of-the-art methods in surveillance scenarios, including sports where complex actions take place. It was hypothesized from the experiments that motion alone is sufficient for categorization, and that appearance could be a source of confusion in the detection process while costing additional computation. The method was found to be 15-fold faster while using far fewer features. It can also be used as the first stage of a maximum-accuracy system with STIP or MoSIFT features.
Fu et al. [4] presented an efficient approach for violence detection in videos using motion analysis, without recognition of action events, gestures, or complex behavior. A heuristic framework built on a decision-tree structure was suggested to derive more accurate motion information from optical-flow images, distinguishing movement types by the motion, count, size, and direction of the extracted movement regions. Motion attraction was also suggested to measure the intensity between two motion zones, from which various statistics are computed as classification features.
Bilinski and Bremond [5] proposed a technique based on improved fisher vectors (IFV) that allows both spatio-temporal and local features for video representation, for the purpose of recognizing violence in videos. Compared with IFV using spatio-temporal positions, the proposed extension achieved similar or better accuracy. The new strategy also proved more effective than previous approaches for violence detection. A sliding-window approach was studied as well, and IFV was reformulated to increase the speed of the framework, evaluated on four state-of-the-art datasets.
Nar et al. [6] showed that posture recognition could be used for abnormal-behavior detection in front of automatic teller machines (ATMs). The experiments were done on data captured using a Kinect 3D camera. Logistic regression was used as the classification technique, with gradient descent for calculating the optimal parameters.
Xie et al. [7] proposed a motion-vector-based algorithm for violence detection in surveillance videos. First, the system extracts motion vectors from compressed videos, then analyses the spatio-temporal distribution of the motion-vector magnitudes and directions to obtain region motion vectors (RMV), and finally uses a radial-basis SVM to classify the RMV and determine violent behavior in monitored videos. It can improve the real-time detection of violent activities in video-monitoring systems and improve the retrieval of the position of violence in historical videotapes or other media sources. The methodology is highly appropriate for front-end video encoders operating on an embedded digital signal processor (DSP) network, since it avoids the motion detection and tracking steps normally applied in conventional behavior analysis. The methodology classified the UCF sports dataset with 91.9% accuracy.
Ribeiro et al. [8] worked on the problem of violence detection in settings where detection is difficult due to external factors such as cluttered and dynamic backgrounds and occlusion that make the scene complex. They proposed a rotation-invariant feature modelling motion coherence, specific to the violence-detection problem, that differentiates structured from unstructured movements. The values of the histogram of optical-flow vectors, obtained from the second-order statistics over time instants, are computed locally and densely and integrated on the Riemannian spherical manifold. Experiments showed that the accuracy of the proposed approach is similar to that of state-of-the-art models in both laboratory and real-world settings.
Senst et al. [9] presented an automatic violence detection technique based on Lagrangian theory. They proposed a new spatio-temporal feature that uses appearance, camera-motion compensation, and longer-term motion information, along with an extended bag-of-words procedure as a per-video classification scheme to ensure suitable spatial and temporal scales. The proposed architecture was thoroughly reviewed by the London Metropolitan Police on multiple challenging datasets and real-world non-public data.
Zhou et al. [10] proposed FightNet, a convolutional neural network (CNN) modelling long-term temporal structure for detecting violent interactions. The model was trained on the UCF101 dataset. Acceleration fields were investigated for capturing motion features and found to be responsible for a 0.3% increase in accuracy. The system achieved higher accuracy with decreased computational cost. First, each video is framed as red, green, blue (RGB) images. Second, the optical-flow field is calculated from consecutive images, and the acceleration field is obtained from the optical flow. Third, FightNet takes three input modalities: RGB images for the spatial network, and optical-flow and acceleration images for the temporal networks. By fusing the predictions from all inputs, it infers whether a video shows a violent incident or not.
Chaudhary et al. [11] put forward a technique for automatically detecting violence in surveillance videos. The proposed method contains three key steps: moving-object identification, object tracking, and behavior understanding for movement recognition. Key features (speed, direction, center, and dimensions) are defined using the feature-extraction method, which helps trace objects across video frames. For foreground extraction, the Gaussian mixture model was utilized. This method could categorize the videos with 90% accuracy.
Zhou et al. [12] also proposed a different algorithm involving optical-flow fields. First, motion regions are segmented according to the distribution of the optical-flow field. Next, within the motion regions, two kinds of low-level features are extracted to represent the appearance and dynamics of aggressive actions: the local histogram of oriented gradients (LHOG), derived from RGB images, and the local histogram of optical flow (LHOF), extracted from optical-flow images. The collected features are then coded using the bag-of-words (BoW) model to remove redundant information, producing a fixed-length vector for each video. Finally, SVMs are used to classify the vectors at the video level.
Khan et al. [13] proposed a model to classify cartoon content as inappropriate for children. The model used the transfer-learning approach, taking advantage of the MobileNet model. The system proved to be 97% accurate.
Song et al. [14] proposed 3D ConvNets with modified pre-processing steps for the detection of violence. Instead of uniform sampling, their new sampling method identifies keyframes and uses those for sequence categorization rather than every frame. This greatly reduces computation, resulting in faster classification, while keeping exceptional precision on benchmark datasets. An important point is that shorter clips are treated with uniform sampling, and only lengthy clips use the new method, which maintains both accuracy and speed; for longer videos, fixed-rate sampling introduces redundancy and discontinuity of motion. For three public violence-detection datasets (hockey fight, movies, and crowd violence), individualized strategies are applied to suit the varied clip lengths. The proposed scheme obtains competitive results: 99.62% on hockey fight, 99.97% on movies, and 94.3% on crowd violence.
Khan et al. [15] proposed a model based on a similar approach to Song et al. [14]. Initially, the entire film is divided into images, and then an individual image is chosen from each scene depending on salience intensity. These pictures are then classified by a lightweight model, fine-tuned through a transfer-learning technique to distinguish violence from non-violence in a film. Finally, all non-violent scenes are combined into a sequence to produce a non-violent film that children and sensitive viewers can watch. The model was tested on benchmark datasets and good accuracy was achieved. Freitas et al. [16] proposed a multi-modal method that classifies inappropriate and appropriate scenes using both visual and audio features with convolutional neural networks. InceptionV3 was used for video and AudioVGG for audio classification; principal component analysis was then used for feature selection, and finally an SVM performs the last categorization. The model achieved 98.94% and 98.95% F1 scores for inappropriate and appropriate content, respectively.
Gkountakos et al. [17] proposed a framework for crowd-based violence detection using ResNet and 3D ConvNets. The framework was compared with several state-of-the-art models on the Violent-Flows dataset, where the 3D-ResNet50 framework proved both more accurate and quicker than the compared models.
Roman and Chávez [18] proposed an approach for violence detection and localization in surveillance videos based on ConvNets and dynamic images. Instead of using the computationally costly optical flow, they used dynamic images, which besides reducing the computational cost also help the analysis of long temporal information, yielding good accuracies at lower computational cost.
Li et al. [19] proposed a multi-stream method for violence detection in video surveillance systems. The suggested solution improves the detection of violent acts by merging three separate streams: a spatial RGB stream, a temporal stream, and a local spatial stream. The attention-based spatial RGB stream discovers, through soft-attention mechanisms, the spatial regions of people that are highly likely to be action regions. The temporal stream uses optical flow as input to retrieve temporal characteristics, while the local spatial stream uses block images as input to learn local spatial characteristics. The proposed algorithm was tested on a self-compiled elevator surveillance dataset and found to be satisfactory.
Accattoli et al. [20] used C3D, a 3D ConvNet architecture that computes video descriptor features, to extract motion features without any prior information. These descriptors were then passed as input to a linear support vector machine to characterize videos as either aggressive or peaceful. The model outperformed state-of-the-art models for both crowd and individual violent-action recognition.
The technique developed by Sharma et al. [21] was composed of three phases. In the first step, the whole film is split into shots, and a random sample from each shot is chosen. The next step is transfer learning, used to fine-tune the classification of violence and non-violence. Finally, all the non-violent segments are stitched together to produce a violence-free movie that can be watched by children and sensitive individuals. The model was evaluated on the Violence in Movies dataset, the Hockey Fights dataset, and the VSD benchmark dataset, obtaining an overall accuracy of 96.3%. MobileNetV2 [22] has been a popular lightweight model for the task of object detection. Xception [23], InceptionV3 [24], VGG16 [25], and ResNet50 [26] are winners of the ImageNet challenges in different years that showed excellent accuracies on the ImageNet dataset.

METHOD

Dataset
The movies fight detection dataset [27] has been used for comparison purposes. The dataset comprises 200 videos of variable length taken from various international movies. For computation, individual frame images have been extracted from these videos. A few samples of non-violent and violent images from the dataset are shown in Figures 1 and 2.

System methodology
The overall methodology is explained using the following steps; the complete workflow of the system is illustrated through the flowchart in Figure 3. Input raw video is taken from the dataset. Frames are split from the uploaded video. Image-processing techniques are applied, and the frames are processed using a deep neural network built on a pretrained base model. Each frame is then classified as violent or non-violent and, accordingly, either passed unchanged or Gaussian-blurred.
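The per-frame decision logic of this workflow can be sketched as a small pipeline. The classifier and blur functions below are stand-ins for the trained model and the Gaussian blur described later, not the actual implementations:

```python
from typing import Callable, List


def censor_frames(frames: List, is_violent: Callable, blur: Callable) -> List:
    """Pass non-violent frames through unchanged; blur violent ones.

    `is_violent` stands in for the deep-learning classifier and
    `blur` for the Gaussian-blur step; both are supplied by the caller.
    """
    return [blur(f) if is_violent(f) else f for f in frames]
```

The real system plugs the trained network and an image blur into these two slots; the control flow itself stays this simple.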

Input raw video
The first step is the input step. The system can accept both image and video input, in one of the common formats such as JPG or PNG for images and MP4 for videos. The minimum expected resolution of the input is 240×240 pixels.

Splitting frames from video
Every video consists of a series of frames, so if the input is a video, it is first separated into individual frames. The system is capable of handling a high frame rate, but for testing purposes, to decrease processing time (because of the unavailability of high-specification hardware), the sampling rate has been set to ten frames per second.
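The downsampling from the source frame rate to ten frames per second amounts to keeping evenly spaced frame indices. A minimal sketch of that index selection (in practice the frames themselves would be read with something like OpenCV's `cv2.VideoCapture` and only these indices kept; the function name is ours, not from the paper):

```python
def sampled_frame_indices(total_frames: int, video_fps: float,
                          target_fps: float = 10.0) -> list:
    """Indices of frames to keep when downsampling a video to target_fps."""
    if target_fps >= video_fps:
        return list(range(total_frames))  # nothing to drop
    step = video_fps / target_fps  # e.g. 30 fps -> 10 fps gives step 3
    indices = []
    i = 0.0
    while round(i) < total_frames:
        indices.append(round(i))
        i += step
    return indices
```

For a 30 fps clip sampled at 10 fps, this keeps every third frame.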

Image processing
To give more flexibility, the input can be of any resolution greater than 240×240 pixels. But since the system is designed to work on inputs of 240×240 pixels, all frames are converted to this dimension. This resolution has been chosen owing to the hardware bottleneck; given hardware of higher specifications, the system can easily be adjusted to work on higher-resolution frames.
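In practice this resizing would be done with a library call such as OpenCV's `cv2.resize`; as a dependency-light illustration of the operation, a nearest-neighbour resize can be written directly with NumPy indexing (the function name is ours):

```python
import numpy as np


def resize_frame(frame: np.ndarray, size: int = 240) -> np.ndarray:
    """Nearest-neighbour resize of an H x W (x C) frame to size x size."""
    h, w = frame.shape[:2]
    # map each output row/column back to its nearest source row/column
    rows = (np.arange(size) * h / size).astype(int)
    cols = (np.arange(size) * w / size).astype(int)
    return frame[rows][:, cols]
```

Real pipelines would prefer area or bilinear interpolation for better visual quality; the point here is only that every frame is mapped to the fixed 240×240 model input.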

Processing using proposed deep neural network
The processed frames are fed to the DL model and the output is one of two categories, i.e., violent or non-violent. The network is built from a pretrained VGG19 base together with convolution 2D and max-pooling layers, followed by dropout, dense, and flatten layers. The architecture of the model is explained below and illustrated in Figure 4.
− Dense layer
The dense layer is a deeply connected neural network layer that is very common and frequently used; it operates on the frame features and returns an output. Placed after the dropout layer, it enforces further learning in the model and is the last layer that processes multi-dimensional inputs and outputs.

− Flatten layer
The flatten layer converts the multidimensional feature maps into a single long one-dimensional vector so that the final classification can be performed quickly by the fully connected output layer. It thus helps in the transition from the convolution layers to the fully connected layers.
− Dense output layer
This is the last layer in the proposed model, where the actual classification takes place. The sigmoid activation function is used here for classification, and the output of this layer is one of the two classes, violent or non-violent.
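The effect of the final flatten and dense-sigmoid stages can be sketched numerically. The weights and threshold below are hypothetical placeholders, not the trained parameters:

```python
import numpy as np


def sigmoid(x: float) -> float:
    """Logistic activation mapping any score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def classify_features(feature_map: np.ndarray,
                      weights: np.ndarray, bias: float) -> str:
    """Flatten a feature map and apply one dense sigmoid unit (binary output)."""
    flat = feature_map.reshape(-1)          # flatten layer: N-D -> 1-D
    score = sigmoid(flat @ weights + bias)  # dense output layer, sigmoid
    return "violent" if score >= 0.5 else "non-violent"
```

A score at or above 0.5 is read as the violent class, below 0.5 as non-violent, which is the standard decision rule for a single sigmoid output.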

Applying blur to violent images
Based on the output of the model, a Gaussian blur is applied to the frame: if the frame is classified as violent, it is blurred; otherwise, no change is made. Violent frames in the output can thus be recognized and differentiated from non-violent frames by the blurred texture observed in the frame.
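In practice the blur would come from a library routine such as OpenCV's `cv2.GaussianBlur`; as a self-contained illustration of what that operation does, a separable Gaussian blur for a grayscale frame can be written with NumPy alone (function names are ours):

```python
import numpy as np


def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(gray: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Separable Gaussian blur on a 2-D grayscale frame (zero-padded edges)."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    # a 2-D Gaussian is separable: blur along rows, then along columns
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, gray)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    return blurred
```

For color frames, the same blur is applied to each channel; a larger sigma gives a heavier blur and stronger censoring.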

Merge to form video
The processing so far has been done on individual frames. In this step, all the frames are combined again and encoded to form the video output (or an image, in the case of image input): the unit frames are merged and encoding is performed to get the video back.
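Conceptually, merging stacks the per-frame arrays back into one video tensor before encoding. A minimal sketch (the function name is ours; the actual file encoding would be done by a writer such as OpenCV's `cv2.VideoWriter` at the original frame rate):

```python
import numpy as np


def merge_frames(frames: list) -> np.ndarray:
    """Stack processed frames back into a single video tensor (T x H x W x C).

    The resulting tensor would then be handed, frame by frame, to a video
    encoder (e.g. cv2.VideoWriter with an MP4 codec) to produce the output file.
    """
    if not frames:
        raise ValueError("no frames to merge")
    return np.stack(frames, axis=0)
```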

Saving to disk
In this step, the output video or image file is stored on the storage medium at the specified path, which by default is the same path as the input file. After the frames have been classified as violent or non-violent and processed accordingly, the merged video is stored on the local hard disk.

RESULTS AND DISCUSSION
It has been inferred from the literature review that the deep learning approach is more effective than the alternative ML approaches, so some of the popular deep learning models were assessed for the task of classifying violent and non-violent scenes. For comparison, the last layer of every network was removed and the base-model layers were set as non-trainable. All the models used for transfer learning were concatenated with an identical head-layer architecture so that the results would not be biased by anything except the base-model architecture. Tables 1 to 4 show the training accuracies, validation accuracies, training losses, and validation losses, respectively, for all the considered models on the same dataset. Both VGG16 and ResNet50 showed considerably lower training losses than the other compared models, but the validation losses for VGG16 were significantly lower than those for ResNet50.
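The freeze-the-base, replace-the-head protocol described above can be sketched in Keras. The head shown is a plausible stand-in for the identical head used in the comparison, not necessarily the exact Figure 4 layers, and `weights=None` is used here only so the sketch runs without downloading pretrained weights (the experiments would use ImageNet weights):

```python
import tensorflow as tf


def build_transfer_model(base_cls=tf.keras.applications.VGG16):
    """Frozen pretrained base + identical trainable head, for fair comparison.

    Any keras.applications base (ResNet50, Xception, InceptionV3,
    MobileNetV2, ...) can be passed in place of VGG16.
    """
    base = base_cls(include_top=False, weights=None, input_shape=(240, 240, 3))
    base.trainable = False  # base-model layers set as non-trainable
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # violent / non-violent
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because only the head differs from random initialization and the head is shared across all bases, any accuracy difference in Tables 1 to 4 can be attributed to the base architecture.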
The time required for training and validation of ResNet101 and VGG19, the most complex models among those compared, is more than double that of ResNet50 and VGG16, yet their accuracy is worse than that of ResNet50 and VGG16. Xception lies somewhere between VGG16 and ResNet101. MobileNetV2, being the lightest model, has the least training time, but its accuracy is also worse than VGG16 and ResNet50. The best result is 99.68% training and 95.59% validation accuracy. Due to the absence of high-specification hardware, the system reduces the output resolution to 240×240 pixels and the sampling rate to 10 frames per second; it is capable of working at higher sampling rates and producing higher-resolution output if better hardware is used for processing.
Shaveta Bhatia

She is a professor at the Department of Computer Applications, MRIIRS, and Deputy Director of Manav Rachna Online Education. She has been awarded her Ph.D. degree in Computer Applications and completed her master's in computer applications (MCA) from Kurukshetra University. She has 18 years of academic and research experience and is a member of various professional bodies such as ACM, IAENG, and CSI. She has participated in various national and international conferences and is actively involved in various projects, with more than 40 publications to her credit in reputed national and international journals and conferences. She is also a member of the editorial board of various highly indexed journals. Her specialized domains include mobile computing, web applications, data mining, and software engineering, and she guides research scholars in these areas. She can be contacted at email: shaveta.fca@mriu.edu.in.

Mridula Batra
She is a faculty member at the Faculty of Computer Applications, MRIIRS. She has participated in various national and international conferences and is actively involved in various projects, with more than 20 publications to her credit in reputed national and international journals and conferences. She is also a member of the editorial board of various highly indexed journals. She can be contacted at email: Mridula.fca@mriu.edu.in.

Figure 4. Proposed deep neural network architecture

The training times for VGG16, ResNet50, and Inception are similar. Since VGG16 and ResNet50 showed the best accuracies among all the models, they were trained for 5 more epochs to obtain the most accurate models. Both models start to overfit after epoch 13, and the validation and training losses are at their optimum at epoch 13, as observed from Figures 5 to 12. So, model training was stopped at epoch 13, at 99.68% training and 95.59% validation accuracy for VGG16, and 99.69% training and 92.20% validation accuracy for ResNet50. VGG16 thus performed best, with a notably higher validation accuracy than that achieved by ResNet50.

Table 3. Training losses