Attention correlated appearance and motion feature followed temporal learning for activity recognition

ABSTRACT


INTRODUCTION
Convolutional neural networks (CNNs) with fixed-size input and output vectors and sophisticated structures have demonstrated high performance in classifying subject activities in videos [1]-[3]. To improve performance in visual perception, several generations of CNNs have been created whose input vectors take one image or multiple images. In particular, multiple images are commonly adopted as an input vector, which embeds temporal as well as spatial information [2], [4]. To further improve learning, many researchers have used temporal networks to perform large-scale visual learning and activity classification from video clips, where the recurrent connections of the temporal networks aid in understanding video context over time [2], [4]-[7].
Since not all spatiotemporal patterns of subject interaction are equally likely, some actions can be identified by their appearance alone [5]. For example, playing musical instruments changes little motion in the video sequences, blowing candles is characterized by the varying candlelight illumination, and the swing of a golf club is rapid. Such categories can be recognized from semantic features using a few red green blue (RGB) frames of the video sequence. In contrast, walking, the transition from standing to walking, and running do not change the identity of the subject. The motion being performed can be fast, and individual frames can be ambiguous. Therefore, motion cues are necessary, and compensated optical flows can capture them [2], [4]. Another important reason is that current CNN architectures are not able to take full advantage of temporal information, so their performance is often dominated by appearance recognition. The structure associated with temporal information plays a critical role in achieving good performance in activity recognition. Accordingly, we investigate a general temporal network structure that has a feature generation layer, a temporal layer, and a fully connected layer. Six topologies associated with the temporal layer are further explored: many-to-one, many-to-many plus global maximum pooling, many-to-many plus global average pooling, many-to-many plus many-to-one, bidirectional many-to-one, and many-to-many plus bidirectional many-to-one, each with two cell types, long short-term memory (LSTM) and gated recurrent unit (GRU). Instead of the usual feed-forward neural network with dropout and softmax making the final prediction, the inference classifier uses a predicting voting classifier (PVC) scheme, based on multiple nearest class mean (NCM) classifiers [8], SoftMax, and majority voting [9], to determine the action class. In this study, two datasets, HMDB51 and UCF101 [2], are adopted to evaluate the proposed deep neural network (DNN). To ensure a fair comparison, the UCF101 dataset is used to validate 12 different types of temporal network structure, which are simulated and compared to find the best one. Simulation results reveal that the temporal network structure using the bidirectional GRU layer yields the best performance, with an average accuracy of around 91.8% (split 1). This is because neighboring image frames in a video clip have forward and backward relationships. Additionally, GRU may be superior to LSTM under some conditions. Compared with conventional DNNs, the proposed temporal network using the bidirectional GRU layer performs fairly well in activity recognition.
ISSN: 2088-8708  Attention correlated appearance and motion feature followed temporal learning for … (Manh-Hung Ha) 1511
Deep learning architectures can learn representational features automatically, and their impressive results have led to extensive use in various pattern recognition domains. Xception [10] is an extension of the Inception architecture that replaces the standard Inception modules with depthwise separable convolutions. It is applied in 2D space and has proven powerful at extracting spatial information [9]. Some action recognition studies [11]-[14] used two-dimensional convolutional neural networks (2DCNNs) to extract spatial information, which is known as an auxiliary clue. To improve accuracy, extending CNNs from images to video requires exploiting temporal information. Among the various temporal network architectures, LSTM is the most popular because it can maintain observations in memory for extended periods of time [15]. Further research explicitly demonstrated the robustness of LSTM even as experimental conditions deteriorated and indicated its potential for robust real-world recognition [16], [17].
To generate subject descriptors reliably and precisely, the recognition process may focus on the meaningful parts to increase accuracy. For example, attention features were generated automatically from the intermediate layer(s) of a DNN and then used to focus on the most meaningful part of an image for identification [2], [4]. In [5], a recurrent mechanism that assigns weighted attention to the feature map from the convolutional layer was proposed for action recognition based on RGB images. Instead of using the RGB stream, the spatiotemporal attention mechanism adopts the joint points from the 3D skeleton for action recognition. Its authors developed an end-to-end network with three temporal networks that individually performed the classifications by selectively focusing on the discriminative joints of the skeleton (spatial attention) and assigning weights to the key sequential images (temporal attention) [2], [4].
In the processing pipeline approach, an RGB stream and a flow stream are applied to process the multiple streams of the video data. Two frames are sampled from each input video. The network processes the frame sequence, and the predictions are merged simply by late fusion [18]. Fusion modes of early, late, average, concatenation, sum, and 3DConvNet have been applied at several feature levels [2]. We then have predictions from the two streams, based on the spatial and temporal streams separately. The last step is to combine the two streams to produce the final output through an attention mechanism for the temporal network.
To effectively determine the subject's actions, many methods are commonly used, such as Bayesian methods, hidden Markov models, Gaussian mixture models (GMMs) [19], support vector machines (SVMs) [20], and feed-forward neural networks. To handle diversely growing data sets, a model-free method may be a good choice. The classic model-free methods include the K-nearest neighbor (KNN) [21] and NCM classifiers [8]. The class-incremental learning mechanism was developed to train multiple NCM classifiers simultaneously with feature representations. The deep NCM (DNCM) classifier directly learns highly non-linear visual representations to yield performance as good as softmax-optimized networks. The few-shot learning approach was used to learn a deep representation based on the NCM [8]. Our contributions in this paper are as follows: i) we propose an architecture for video activity recognition that composes rich spatial and temporal feature abstractions through attention mechanisms, which enhances performance, eases learning, and improves the interpretability of the model; ii) we conduct extensive experiments on temporal networks and compare six topologies with LSTM and GRU cells to obtain well-defined architectures on the UCF101 and HMDB51 datasets; and iii) we propose the PVC method for incremental prediction, which achieves significantly better accuracy than existing incremental counterparts.
The remainder of this paper is organized as follows. Section 2 describes the proposed framework in detail. We then conduct an implementation on the UCF101 and HMDB51 datasets. Section 3 shows the effectiveness of the architecture and analyzes the obtained results. Finally, the paper is concluded in section 4.

PROPOSED DNN FOR ACTION RECOGNITION
As shown in Figure 1, the proposed generic DNN, which consists of two-stream fusion attention (2SFA), a temporal layer, and a classifier, is devised for action recognition. In the first step, a video clip is decomposed into multiple video segments. Each segment has an interval of a few seconds, which is sufficient to contain an action. Neighboring video segments overlap, and the RGB and optical flow sequences from each video segment are used as the inputs. Owing to the fixed dimension of the input neural layer in the proposed DNN, the number of frames and the frame size in a video segment may need to be converted. The preprocessing techniques of upsampling, downsampling, and size scaling are employed to transform the frames in a video segment into the required number of frames of a specific size, which are input to two 2DCNNs to produce appearance and motion feature maps. In each video segment, the fusion attention (FA) generation layers yield the local descriptors. The temporal networks continuously process the outputs of the FA generation layers to generate latent spatial-temporal features. Finally, the inference classifier of the PVC scheme is used to make the final class determination.
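The segmentation and frame-count conversion described above can be illustrated with a minimal sketch. The function names, the stride-based overlap, and the example numbers (a 200-frame clip, 90-frame segments at an assumed 30 frames per second, 15 output frames) are illustrative assumptions and are not taken from the paper's implementation. Frame size scaling is omitted.

```python
def split_into_segments(num_frames, segment_len, stride):
    """Return (start, end) index pairs of overlapping segments; end is exclusive."""
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    if num_frames <= segment_len:
        return [(0, num_frames)]  # clip shorter than one segment: keep it whole
    starts = list(range(0, num_frames - segment_len + 1, stride))
    if starts[-1] + segment_len < num_frames:
        starts.append(num_frames - segment_len)  # extra segment so the tail is covered
    return [(s, s + segment_len) for s in starts]


def resample_indices(start, end, target):
    """Pick `target` evenly spaced frame indices from [start, end).

    Indices repeat when the segment is shorter than `target` (upsampling)
    and skip frames when it is longer (downsampling).
    """
    length = end - start
    if length <= 0 or target <= 0:
        raise ValueError("segment must be non-empty and target positive")
    return [start + (i * length) // target for i in range(target)]


# Hypothetical numbers: 200-frame clip, 90-frame segments, 50% overlap, 15 frames each.
segments = split_into_segments(200, 90, 45)
inputs = [resample_indices(s, e, 15) for s, e in segments]
```

With a stride smaller than the segment length, neighboring segments share frames, which matches the overlap described in the text; the same index rule covers both upsampling and downsampling.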

2SFA network
Figure 1 illustrates our concept of a generic network whose two parallel streams, appearance features and motion features, are fused by attention. The output of the attention-based fusion of the two streams is fed into a one-layer bidirectional GRU to learn temporal information. Empirically, our architecture is more effective for video recognition.

Appearance RGB and flow stream adopt backbone 2DCNN structure
The RGB stream works on a video clip to capture the spatial appearance feature. T individual video frames are used as network inputs, followed by several convolutional and pooling layers of an Inception V3 model pre-trained on ImageNet [2], which is then fine-tuned on the UCF101 and HMDB51 datasets. Finally, after the fully connected (FC) layers, the network outputs are passed through the FA layer to yield the predicted probabilities of the video classes.
In order to capture the motion information, k*T sampled frames are used for the flow stream. In our experiment, k was set to 6 for the HMDB51 dataset and 5 for the UCF101 dataset. The temporal 2DCNN takes as input the stacked optical flows of pairs of consecutive frames, namely the horizontal and vertical components of the calculated displacement vector fields. To further consider temporal information, one can stack the optical flow images of each frame at time t and its subsequent frames into a stacked 2L-channel optical flow image [2]. The network architecture and training process of the temporal CNN are basically the same as those of its spatial counterpart, except that the input images have a different number of channels. There are multiple stacked 2L-channel optical flow images in a video. The way of fusing predictions on these individual images is also the same as that of the spatial channel. We then have predictions from the two CNNs, based on the spatial and temporal streams separately. The last step is to combine the two streams to produce the final output.
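As a rough illustration of the 2L-channel stacking, the sketch below interleaves the horizontal and vertical flow components of L consecutive frames. The function name, the list-of-lists representation, and the interleaved channel order are assumptions made for this example; the paper does not specify the channel order.

```python
def stack_flow(flows, t, L):
    """Stack the flow fields of frames t .. t+L-1 into 2L channels.

    `flows` is a list of (dx, dy) pairs, where dx and dy are H x W nested
    lists holding the horizontal and vertical displacement components.
    Channels are interleaved as dx_t, dy_t, dx_{t+1}, dy_{t+1}, ...
    (an assumed order).
    """
    if L <= 0 or t < 0 or t + L > len(flows):
        raise ValueError("window [t, t+L) must lie inside the flow sequence")
    channels = []
    for dx, dy in flows[t:t + L]:
        channels.append(dx)
        channels.append(dy)
    return channels


# Toy 1 x 1 flow fields for five frame pairs (hypothetical values).
toy_flows = [([[float(i)]], [[float(-i)]]) for i in range(1, 6)]
stacked = stack_flow(toy_flows, t=1, L=3)  # 2 * 3 = 6 channels
```

Sliding t over the sequence yields the multiple stacked 2L-channel images mentioned in the text, each of which is classified in the same way as a single RGB frame.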

FA layer scheme
To be sensitive to the meaningful details of an action, we propose FA generation layers between the 2DCNNs and the temporal networks that highlight the distinguishable components among actions. The FA generation layer, as depicted in Figure 2, combines the appearance feature map with the coordinate values of the motion feature map and computes attention weights from the combined data. It then convolves the attention weights with the feature elements to produce the output of the FA. This is because the tracks of some actions depend on the movement of only a part of a subject. Hence, an attention feature map with reasonably emphasized regions can be beneficial to action distinction.
Here, the mapping function denotes the fully connected operation, and ⨀ refers to the concatenation operation. This fully connected layer multiplies the input vector by the corresponding weights and adds the biases, and the accumulated data go through the tanh function to yield the output results. These operations build up the correlation between the appearance feature map and the motion feature map. Second, the attention parameters are normalized over each single feature frame, across its spatial dimensions, in (2). The output is the attention feature of the FA generation layer. This operation embeds the normalized attention into the static feature map to highlight the critical part of each single feature map and also reduces the dimension.
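The following is a minimal sketch of one way such a fusion attention step could be computed for a single frame. It is an illustration under stated assumptions, not the paper's exact formulation: the per-position scalar score, the softmax normalization over spatial positions, the weighted sum over the appearance map, and all names (`fusion_attention`, `weights`, `bias`) are assumed for this example.

```python
import math


def fusion_attention(appearance, motion, weights, bias):
    """Attend over spatial positions of one frame.

    appearance, motion: lists with one feature vector per spatial position.
    weights, bias: parameters of an assumed fully connected scoring layer
    applied to the concatenated [appearance, motion] vector.
    Returns (attention weights, attended appearance feature vector).
    """
    scores = []
    for a, m in zip(appearance, motion):
        x = a + m  # list concatenation of the two feature vectors
        scores.append(math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias))
    # Softmax over positions (assumed normalization), shifted for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum over positions: H*W vectors collapse into one vector.
    dim = len(appearance[0])
    attended = [sum(al * a[d] for al, a in zip(alphas, appearance)) for d in range(dim)]
    return alphas, attended


# Toy example: two positions; the scoring layer looks only at the motion value.
alphas, attended = fusion_attention(
    appearance=[[1.0, 0.0], [0.0, 1.0]],
    motion=[[0.0], [2.0]],
    weights=[0.0, 0.0, 1.0],
    bias=0.0,
)
```

In this toy case the second position has the larger motion response, so it receives the larger attention weight and dominates the attended appearance feature.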

Different temporal network for activity recognition
In this study, we introduce the temporal network to enhance the classification of temporal sequence features. The temporal network in Figure 1 is depicted in Figure 3 with six temporal topologies that interpret the spatial and temporal relationships: many-to-one in Figure 3(a), many-to-many plus global maximum pooling in Figure 3(b), many-to-many plus global average pooling in Figure 3(c), many-to-many plus many-to-one in Figure 3(d), bidirectional many-to-one in Figure 3(e), and many-to-many plus bidirectional many-to-one in Figure 3(f). The many-to-one network has T cells to yield one output vector. The topologies of many-to-many plus global maximum pooling and many-to-many plus global average pooling each include T cells and generate one output vector from global maximum pooling and global average pooling, respectively. The many-to-many plus many-to-one network cascades the many-to-many and many-to-one layers, where 2T cells are used to produce one output vector. The bidirectional many-to-one network consists of 2T cells with forward and backward connections to generate two output vectors. The many-to-many plus bidirectional many-to-one network needs 3T cells to produce two output vectors. The bidirectional networks not only provide the memories for learning dependencies but also allow the networks to make predictions from the future to the past as well as from the past to the future, which increases the classification power of the network layer. The cells inside these topologies can be either LSTM or GRU, both of which keep memory and state and remember the characteristics of previous cells over the long term. LSTM discloses the memory state unit with separate input and forget gates, whereas GRU exposes the whole state information to the other cells through its reset gate. The structure of LSTM includes nonlinear function gates and a memory cell. The content of the next memory cell is computed from the previous cell's memory, scaled by the activation of the forget gate, plus the new candidate signal, scaled by the input gate. LSTM also uses the output gate to control the information received by the hidden state variable. Similarly, GRU has gating units that modulate the information flow inside the cell without a separate memory unit. The main differences between LSTM and GRU can be described as follows. In GRU, a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The reset and update gates can individually ignore a part of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it or completely ignore it by replacing it with the new target value. The reset gate controls which parts of the state are used to compute the next target state, introducing an additional non-linear effect in the relationship between the past and future states [21].
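To make the bidirectional many-to-one topology concrete, the sketch below runs a GRU cell over a sequence in both directions and keeps only the last state of each pass, giving two output vectors. It uses one common convention for the GRU update, h_t = (1 - z) * h_{t-1} + z * candidate; some libraries swap the roles of z and 1 - z. The parameter layout and all names are assumptions for this example, not the paper's implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W[r], x)) + sum(u * hi for u, hi in zip(U[r], h)) + b[r]
        for r in range(len(b))
    ]


def gru_step(p, x, h):
    """One GRU step with update gate z and reset gate r."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h, p["bz"])]
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h, p["br"])]
    reset_h = [ri * hi for ri, hi in zip(r, h)]  # reset gate masks the old state
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], reset_h, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def many_to_one(p, xs, hidden):
    """Run T cells over the sequence and return only the final state."""
    h = [0.0] * hidden
    for x in xs:
        h = gru_step(p, x, h)
    return h


def bidirectional_many_to_one(fwd, bwd, xs, hidden):
    """2T cells in total: a forward pass and a backward pass, two output vectors."""
    return many_to_one(fwd, xs, hidden), many_to_one(bwd, xs[::-1], hidden)


# Toy parameters (hypothetical): 1-D input, 1-D state, gates fixed at 0.5.
toy = {
    "Wz": [[0.0]], "Uz": [[0.0]], "bz": [0.0],
    "Wr": [[0.0]], "Ur": [[0.0]], "br": [0.0],
    "Wh": [[1.0]], "Uh": [[0.0]], "bh": [0.0],
}
forward_out, backward_out = bidirectional_many_to_one(toy, toy, [[1.0], [0.0]], hidden=1)
```

The two passes see the same frames in opposite order, so their final states generally differ; here the backward pass ends on the non-zero input and therefore retains more of it.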

Incremental predicting class
In Figure 1, the outputs from the temporal networks pass through individual fully connected neural layers to yield feature vectors for further classification. Based on these two feature vectors, the classifiers SVM, GMM, DNCM, KNN, global average pooling (GAP)+SoftMax, FC+SoftMax, and PVC are investigated and compared. In this work, the PVC scheme, which includes the NCM classifier, SoftMax, and majority voting, is developed to achieve good performance.
The NCM classifier can be regarded as incremental class learning and classification [8]. The feature vector can be represented as equally long sub-feature vectors of the corresponding classes. Figure 4 shows the block diagram of the proposed PVC scheme, in which the feature vector is the output from the GRU through the FC layer. It forms a C-action capsule associated with the corresponding classes in a given dataset, on which the NCM classifier is performed to obtain the class group whose mean has the minimum Euclidean distance.
In the beginning, the training data in each epoch are used to generate the class means of the NCM classifier. Afterward, the class means are updated from the samples classified into the class groups at each epoch (each mini-batch) [8]. The softmax is applied to the Euclidean distances from each feature vector to all class means to compute the probability of each class, so that the class whose mean is closest to the feature vector is the most probable. Finally, the majority voting module from the scikit-learn toolbox predicts the class label based on the argmax of the sum of the predicted probabilities to make the final class determination for a video clip. PVC thus assigns the input data to the class label with the largest probability by the argmax function. Unlike a loss function calculated by margin loss, the loss function of PVC, which uses the class means recomputed at each epoch, is estimated during training by minimizing the cross-entropy loss as in a regular classification neural network, in a formula similar to that of [8].
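The inference path of the PVC scheme (class means, softmax over distances, and voting by summed probabilities) can be sketched as follows. This is a simplified illustration: the softmax is taken over negative Euclidean distances, which is an assumption, the voting is written out by hand rather than through scikit-learn, and all function names are hypothetical.

```python
import math


def class_means(features, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def ncm_probabilities(feature, means):
    """Softmax over negative Euclidean distances to the class means (assumed form)."""
    classes = sorted(means)
    neg = [-math.dist(feature, means[c]) for c in classes]
    top = max(neg)
    exps = [math.exp(v - top) for v in neg]
    total = sum(exps)
    return {c: e / total for c, e in zip(classes, exps)}


def vote(segment_features, means):
    """Sum the class probabilities over all segments of a clip and take the argmax."""
    totals = {c: 0.0 for c in means}
    for f in segment_features:
        for c, p in ncm_probabilities(f, means).items():
            totals[c] += p
    return max(totals, key=totals.get)


# Toy data (hypothetical): two classes in a 2-D feature space.
means = class_means([[0, 0], [0, 2], [10, 10], [10, 12]], ["a", "a", "b", "b"])
clip_label = vote([[1, 1], [9, 9], [0, 0]], means)
```

Because the class means are plain averages, a new class can be added by computing one more mean without retraining the others, which is what makes the NCM classifier suitable for incremental learning.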

EXPERIMENT RESULTS AND DISCUSSION
In order to make sense of the sequence in our system, a single frame is used instead of a group of frames as a segment of the video. We assume that analyzing three seconds of video at a time is enough to make a good prediction of the activities. For this, the frames extracted from three seconds are downsampled to T frames (in this study, the typical value is T=15) to form one single pattern, which is the input of the neural network. A longer video is divided into separate patterns (or clips).

Dataset and parameter setups for experiments
The UCF101 dataset, with 13,320 video clips associated with 101 action categories, was used. The video clips in this dataset are split into three groups: training and validation (9,537 entries in total) and testing (3,783 entries). The HMDB51 dataset has 6,766 video clips covering 51 action categories with diverse background contexts and variations in camera motion. It is likewise split into three groups, with 3,570 videos for training and validation and 1,530 videos for testing.
In each of the parameter setups and simulation situations, training runs for up to 100 epochs, and a trained model is obtained at each epoch. When the accuracy does not improve for 10 consecutive epochs, the training process is terminated by an early stopping scheme. Additionally, the validation loss and accuracy associated with the trained model at each epoch are stored. Among these values of validation loss and accuracy from all trained models, the model with the lowest loss and highest accuracy is chosen for testing. During training, the optimization is performed by Adam. With a momentum of 0.9, the weights are learned by mini-batch stochastic gradient descent. The number of frames extracted from a video clip is T, and these T frames are the input frames. Our computation platform is an 8-core Intel Core i7 computer with 32 GB of RAM and a graphics processing unit (GPU) of an Nvidia Titan X 1080/12 GB, where the operating system is 64-bit Ubuntu 18.04 and the GPU is supported by the Anaconda distribution for Python.
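The stopping and model-selection rule described above can be sketched in a framework-independent way. The callback `run_epoch`, which is assumed to train for one epoch and return the validation loss and accuracy, and the tie-breaking order (lowest loss first, then highest accuracy) are assumptions for this example.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """Train for up to `max_epochs`, stopping after `patience` epochs without
    an accuracy improvement, and return the best stored epoch.

    `run_epoch(epoch)` is a hypothetical callback returning (val_loss, val_acc).
    """
    history = []
    best_acc = float("-inf")
    stale = 0
    for epoch in range(max_epochs):
        val_loss, val_acc = run_epoch(epoch)
        history.append((epoch, val_loss, val_acc))  # store metrics of every epoch
        if val_acc > best_acc:
            best_acc = val_acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    # Select the stored model with the lowest loss, then the highest accuracy.
    best = min(history, key=lambda rec: (rec[1], -rec[2]))
    return best, history


def fake_epoch(epoch):
    """Synthetic metrics: accuracy improves until epoch 4, then stalls."""
    acc = min(epoch, 4) / 10
    loss = 1.0 - acc + (0.01 * epoch if epoch > 4 else 0.0)
    return loss, acc


best, history = train_with_early_stopping(fake_epoch)
```

In the synthetic run the accuracy stops improving after epoch 4, so training halts ten epochs later and the epoch-4 model is the one selected for testing.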

Simulations and comparisons of six temporal network types using LSTM and GRU
In this paper, we developed a model that combines a CNN and the temporal network. In our model, a CNN extracts the features from each video. For the temporal network, we evaluated the action recognition framework over six temporal topologies using LSTM and GRU in simulations. Finally, a classification layer using FC+softmax predicts the category. Figure 5 displays the validation loss and accuracy of the 12 types of temporal networks, where the symbols LSTM1/GRU1, LSTM-GMP/GRU-GMP, LSTM-GAP/GRU-GAP, LSTM2/GRU2, BDLSTM1/BDGRU1, and BDLSTM2/BDGRU2 denote the temporal networks using LSTM/GRU with the topologies of many-to-one, many-to-many plus global maximum pooling, many-to-many plus global average pooling, many-to-many plus many-to-one, bidirectional many-to-one, and many-to-many plus bidirectional many-to-one, respectively. In Figure 5(a), comparing the six temporal topologies using LSTM shows that BDLSTM1 yields the best performance. Likewise, among the topologies using GRU, BDGRU1 performs best, as depicted in Figure 5(b). Overall, the best network is BDGRU1, the bidirectional many-to-one topology using GRU cells, which performs well and is fast. In particular, the temporal networks using GRU converge faster than those with the corresponding topologies using LSTM. The accuracy of the temporal networks using GRU is also likely better than that of the corresponding topologies using LSTM for most categories of the UCF101 dataset. GRU, on average, has lower complexity, easier modification, faster training, and higher performance than LSTM with fewer training data, whereas LSTM remembers longer sequences and performs better at tasks requiring long-distance relationships. Table 1 compares the performances of the 12 types of temporal networks, including the F1 measure and testing accuracy as well as the training accuracy and validation accuracy/loss. The results reveal that 2SFA+BDGRU1 is the best, with a validation loss of 0.78, a testing accuracy of 91.8%, and an F1 measure of 0.91 averaged over the micro and macro factors.
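For reference, the micro- and macro-averaged F1 measures can be computed from predicted and true labels as in the sketch below. This is a generic illustration of the two averaging schemes with a hypothetical function name and toy labels; it is not the evaluation code used for Table 1.

```python
def f1_micro_macro(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy labels (hypothetical): three samples of class 0 and one of class 1.
micro, macro = f1_micro_macro([0, 0, 0, 1], [0, 0, 1, 1])
```

For single-label classification the micro F1 equals the overall accuracy, while the macro F1 is more sensitive to classes with few samples.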

Model 2SFA + BGRU1 + PVC
Figures 6(a) and 6(b) show the learning of the proposed model on the UCF101 and HMDB51 datasets. On UCF101, 2SFA+BGRU1+PVC obtained a validation accuracy of approximately 91% and a validation loss of approximately 0.78. On HMDB51, it reached a validation accuracy of about 71% and a validation loss of 0.8.
For quantitative evaluation, Figure 6(c) and Figure 6(d) use the precision-recall (PR) curve to compare the classification performance of the different GRU models. In Figure 6(c), the 2SFA+BDGRU1 model achieves the best results on the PR curve evaluation metrics. This means that, on the UCF101 dataset, the proposed DNNs are better than the other models in terms of the AUC-PR metric. In Figure 6(d), the AUC of the PR curve of the 2SFA+BDGRU1 model reaches promising results of 0.972 and 0.965 on UCF101 and HMDB51, respectively. We found that the proposed model can reach 1.0 for PR on UCF101 and 0.9 on HMDB51.
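As a reminder of how a PR curve and its area are obtained for one class treated as positive against the rest, a minimal sketch is given below. The function names, the trapezoidal area rule, the added starting point at recall 0 and precision 1, and the toy scores are assumptions for illustration; ties between scores are not handled specially.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, sweeping the threshold from high to low.

    scores: predicted confidence for the positive class.
    labels: 1 for positive samples, 0 for negative samples.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 1.0)]  # conventional starting point of the curve
    tp = 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((tp / positives, tp / rank))
    return points


def auc_trapezoid(points):
    """Area under the curve by the trapezoidal rule over recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy scores (hypothetical): a perfect ranking and an imperfect one.
perfect = auc_trapezoid(pr_curve([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
imperfect = auc_trapezoid(pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A ranking that places every positive sample above every negative one gives an area of 1.0, and any misordering lowers it, which is why AUC-PR is a convenient single-number summary for comparing models.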
We compare two approaches, 2SFA and 2SFA+BDGRU1, on the three different splits of the UCF101 and HMDB51 datasets. The results are shown in Table 2, which reveals the average accuracies over the three splits of the UCF101 and HMDB51 datasets. The experiments reach 91.2% and 70.3% over the 3 folds of the UCF101 and HMDB51 datasets, respectively, where the proposed system adopts 2SFA+BDGRU1. The results clearly show that the proposed system, which employs 2SFA and a one-layer temporal bidirectional GRU, achieves the best performance. Most of our daily activities can be completed in this manner.
Evaluation of Inference Classifiers: In addition to the feature generation, correlation, and highlights from the spatial and temporal domains, the final interpretation needs an adequate classifier in the proposed 2SCNN+FA+BDGRU1 to improve the performance. Herein, the popularly used inference classifiers FC+softmax, SVM, GMM, and KNN, together with our PVC, are explored. The proposed PVC works as a trainable feature distinguisher. The results listed in Table 3 for the UCF101 dataset indicate that the proposed PVC is superior to the other classifiers FC+softmax, SVM, and GMM by 0.08%, 0.34%, and 1.07%, respectively, with increases from 0.02% to 0.43% on the HMDB51 dataset. Hence, the proposed PVC is a good choice for the final classifier in our DNN.
We compare different fusion strategies in Table 4, where we report the average accuracy on UCF101. We see that max and average perform considerably lower than sum and concatenation. Table 5 shows the average accuracy of our proposed model compared with previous DNNs in which all of the modalities use only 2DConvNet on the same datasets. Because of the use of different network architectures and improvement schemes, the proposed DNNs are nearly as accurate as hybrid DNNs on UCF101 and achieve higher accuracy than those on HMDB51 [9], [22], [23]. Most of the methods are not directly comparable to our results. Using only 2DConvNet, we obtained a new state-of-the-art result of 70.8% on HMDB51 and a high accuracy of 91.9% on UCF101, the second best.
Table 5. Performance comparison of the proposed model using only 2DConvNet classification on the UCF101 and HMDB51 datasets (accuracy, %)
Modality | Feature set | UCF101 | HMDB51
ConvNet [11] | Slow fusion spatio-temporal | 65.4 | -
LRCN [15] | Learning sequential dynamics | 82.9 | -
DT [12] | Multi-view super vector | 83.5 | 55.9
LSTM-comp [16] | RGB+Flow model | 84.3 | -
iDT [24] | Dense trajectories by camera motion | - | 57.2
boVW [13] | Bag of visual words and fusion | 87.9 | 61.9
FstCN [25] | SCI fusion | 88.1 | 59.1
MoFAP [26] | Single Shot multi-Span in FC3D | 88.3 | 61.7
Motion Infor [27] | TrajShape + TrajMF | 78.5 | 57.0
 | TrajShape + TrajMF + Wang and Schmid [18] | 87.2 | 57.3
Gaussian Pyramid [22] | Multi-skip feat. stacking | 89.1 | 65.4
Hybrid fusion+ DeepNet [23] | Supervised mid-to-end learning + non-linear classification | 90.6 | 67.8
STAN [28] | FRA+OPF + CLIP | 93.6 | -
Confidence Distillation [14] | Distillation loss for student and teacher learning | 91.2 | -
2STG | 2SFA+ BGRU1 + PVC | 91.9 | 70.8

CONCLUSION
We have proposed a DNN architecture with a fusion attention layer, additional temporal networks, and a PVC layer. We investigated a general temporal network structure that has a feature generation layer, a temporal layer, and a fully connected layer. The six topologies of the temporal layer were explored with LSTM and GRU cells. The UCF101 dataset was adopted for simulations of the 12 resulting types of temporal networks. The experimental results reveal that the temporal networks with the bidirectional temporal layer using GRU show the best performance. The reason is that the bidirectional topology can effectively interpret the forward and backward temporal relationships in a video clip. Additionally, GRU exhibits relatively better performance than LSTM. Hence, the temporal network with the bidirectional topology using GRU is recommended for the proposed fusion two-stream flow temporal neural network. We also show that the proposed DNN model can improve performance with a PVC layer as the classifier on the UCF101 and HMDB51 datasets, which suggests the importance of learning on highly abstract spatial-temporal features. The simulation results show that the proposed system demonstrates fairly good performance in recognizing the activities of subjects in different situations.

Figure 1. Our proposed DNN including 2SFA followed by a temporal network for action recognition

Figure 3. Six temporal topologies: (a) many-to-one, (b) many-to-many plus global maximum pooling, (c) many-to-many plus global average pooling, (d) many-to-many plus many-to-one, (e) bidirectional many-to-one, and (f) many-to-many plus bidirectional many-to-one

ISSN: 2088- 8708 Figure 5 .
Figure 5. Validation loss of the 12 types of variant temporal networks, (a) evaluating six types of LSTM and (b) evaluating six types of GRU

For all the fusion methods shown in Table 4, fusion by our proposed FA results in higher performance compared to other methods. The performance is slightly better because the layer fuses the spatial correspondences between appearance and motion, to which the correlation attention weights pay more attention in the relevant area.

Table 3. Accuracy on the test dataset: comparison of PVC with other methods on multi-class classification (split 1)