New Hybrid Deep Learning Method to Recognize Human Action from Video

The number of internet users, and the bandwidth available to them, has grown enormously in recent years. Because connectivity is so inexpensive, sharing information (text, audio, and video) has become more popular and faster. Video content must be analyzed so that it can be classified for different user purposes, and several machine learning approaches to video classification have been developed to save users time and effort. Recognizing human actions with deep neural networks has become a popular research topic in recent years. Although significant progress has been made in video recognition, many challenges in the video domain remain. Convolutional neural networks (CNNs) typically require a fixed-size input, which constrains the network topology and reduces recognition accuracy. Although this problem has been addressed for still images, it remains open for video. We present a ten-layer stacked three-dimensional (3D) convolutional network with spatial pyramid pooling to handle variable-size video frames in video recognition. As the name suggests, the architecture consists of three parts: a ten-layer stacked 3D CNN, DenseNet, and SPPNet. We evaluated our model on the KTH dataset. The experimental findings show that our model outperforms existing models for video-based action recognition by a margin of about 2% accuracy.


INTRODUCTION
The internet is now widely used by people all around the world, and social media plays an important role in distributing information. Users also express their feelings about specific topics on social sites, so that others can quickly learn what is going on; these opinions can in turn be used to estimate public reactions to events. However, hiring people to evaluate human activities across such a large volume of content is extremely difficult and time-consuming, so researchers use machine learning to classify video activities and assess public views. Video action classification is a form of mining that analyzes video using video processing and feature identification in order to discover people's points of view by collecting and evaluating social and other subjective knowledge resources. Other methodologies are less trustworthy and successful than deep learning: in terms of precision and performance, deep learning currently exceeds previous techniques in image [1], video [2], audio [3], and text [1][4][5][6] classification.
Video action classification has many applications: crime detection in real-time game video [7], video scenario classification [8], event or occasion prediction [9], animated movie classification [10], sport player action recognition [11], and movie classification [12]. Commonly used algorithms include both supervised and unsupervised methods such as SVM and CNN, and recurrent variants (LSTM, GRU, etc.) are widely used in action recognition. This introduction surveys the most widely used video classification methods, their operating mechanisms, benefits, and drawbacks. One method identifies video using Naive Bayes together with a video classification dictionary; when the predictor-independence assumption holds, a Naive Bayes classifier can outperform other models. Another line of work applies internet video classification to detect hateful speech [11]. A further study used SVM classification with a data pilot and weighted production to improve performance; however, the SVM algorithm does not perform well when the target classes overlap.
K-means has been used in a variety of ways for video labeling. Peng [8] recently completed another video categorization task in which the video is segmented to extract and share visual features, and the standard k-means clustering step refines the initial grouping of labeled video samples. When k is kept small, k-means usually runs faster than hierarchical clustering; its main disadvantage is that the value of k is difficult to estimate. The k-nearest-neighbor (KNN) approach is simple and straightforward for classification and feature extraction, and Hidden Markov Models (HMMs) have also been employed: a combination of R-CNN and HMM has been applied to real-time video in a study of children's facial speech [13]. HMMs generalize naturally to variable-length input sequences and can be trained quickly and directly on raw sequence data; their disadvantages are that they impose many unstructured requirements and cannot rely on hidden states alone to resolve them. Other work demonstrates that 3D CNNs are well suited to video classification and analyzes their results using multi-level pipeline template-based designs that accelerate both 2D and 3D CNNs on FPGAs [14]. Recurrent neural networks (RNNs) explicitly carry temporal state across video frames of varying lengths by building networks with recurrence that allows information to persist [15]; this looping structure lets the network interpret an input sequence, and wherever context is needed, the RNN can draw on past observations. Within the RNN family there are two forms of LSTM as well as the GRU. An RNN with LSTM neurons has been trained on sports video sequences with SIFT features [16], and Baccouche's work has long been admired for its stability. With the development of deep learning algorithms and models, feature extraction has become automatic.
RNNs can be further improved through backpropagation. More recently, new slices of footage with improved fidelity have been processed with 2D gated bidirectional neural networks for violence recognition. Other strategies are less dependable and efficient than deep learning [4], which makes this family of methods the more efficient choice.

RELATED STUDY
A straightforward way to apply deep learning to action analysis is to add temporal data to the convolution operation. Huang et al. [17] proposed a 3D CNN that uses 3D convolutional layers to extract features along the spatial and temporal dimensions, collecting spatiotemporal information across neighboring frames. Although the 3D CNN is simple, it introduces a large number of parameters, increasing the network's complexity. DenseNets [18], with their dense connections, have recently attracted the interest of computer vision researchers and have demonstrated outstanding results in natural image classification. In a feed-forward manner, the architecture connects every layer to every other layer; during training, this dense connection structure helps gradient flow. Compared with a 3D convolutional neural network, the dense connections of DenseNet reduce the repeated recomputation of intermediate features, which removes a significant number of network parameters and thereby reduces the network's complexity. DenseNets, however, have so far been used mainly on images and have not fared as well on video. A further constraint is the fixed input size, which can affect recognition accuracy for pictures or sub-images of arbitrary size or scale. The spatial pyramid pooling (SPP) model removes this fixed-size constraint; applying SPP in the image domain reduces cropping and therefore the loss of image information. Can the video domain benefit from avoiding such information loss, and what effect does applying SPP to video have? Taking into account the benefits and drawbacks of the 3D CNN and DenseNet networks, as well as the SPP network model, we present a new network model for video action recognition.
To put our method in context, we summarize the most relevant work on video action classification. One video action detection method uses a 3D CNN model with dense pooling and achieves 91.97% accuracy [19]; it is suitable not only for long videos but also for short video analysis. SVM classification techniques were used to recognize 2391 clips of six human actions performed by 25 participants in four different settings [20]; with 71.7% accuracy, this technique performed well in the early stages of video classification. An approach for video recognition using a spatially windowed data algorithm achieved 81.2% accuracy [21], but its handling of multiscale features needs improvement. A novel 3D CNN model for action recognition uses high-level features to regularize the outputs and combines the predictions of several independent models [17]; its accuracy was 90.5%, although data labeling and preprocessing should be enhanced for even better results.
To address some of these shortcomings, we propose a new network design based on 3D convolutional networks, densely connected CNNs, and the SPP network, which we call "Stacked 3D-DenseNet-SPP". DenseNet serves as the main network structure in our architecture, and we merge it with a ten-layer stacked 3D CNN by feeding component information into the DenseNet. Compared with a 3D convolutional network alone, this not only makes fuller use of temporal and spatial information but also mitigates gradient vanishing and, to some extent, reduces the number of training parameters. The spatial pyramid pooling architecture is then added on top of the underlying network to accept inputs of any scale or size.
Our main contributions are as follows. First, we apply spatial pooling with normalization in a 3D CNN. Second, we select and handle the important features. Third, we develop a new hybrid deep learning method for video action classification with a ten-layer stacked 3D CNN to achieve higher accuracy. Finally, we evaluate the proposed method and compare it against existing approaches to demonstrate its better performance.
The remainder of this document is organized as follows. The first section provides background on video classification research. Section 2 critically examines current related work on video action classification and summarizes the problem and our contributions. Section 3 describes the primary technique, Section 4 presents the results, Section 5 contains the discussion, and Section 6 concludes.

METHOD
This section describes the methodology for video action classification, including a description of the different layers used to develop the proposed model, with the main figures, equations, and algorithms. Each layer is described sequentially below.
A 3D convolutional neural network can extract spatial features from the preceding layer's feature maps as well as temporal features across adjacent frames. In general, a 3D CNN obtains both spatial and temporal features from video input in order to recognize actions; its strong performance is due to the 3D convolution and 3D pooling operations. 3D convolution convolves a 3D kernel with the cube formed by stacking multiple consecutive frames. In this way, the local features in a CNN layer are connected to several consecutive frames in the preceding layer, allowing motion information to be captured. We typically add 3D max pooling after the 3D convolution layers to reduce the network parameters and the size of the video representations. 3D pooling simply adds the temporal dimension to 2D pooling, and in a 3D convolutional neural network it reduces both computation and overfitting.
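To make the pooling operation concrete, the following is a minimal NumPy sketch of 3D max pooling over non-overlapping (T, H, W) windows. It is an illustration of the operation described above, not the paper's implementation; the window sizes and array shapes are assumed for the example.

```python
import numpy as np

def max_pool_3d(x, pool=(2, 2, 2)):
    """Naive 3D max pooling over (T, H, W); x has shape (T, H, W, C).

    Non-overlapping windows with stride equal to the pool size; any
    remainder along an axis is dropped, as in 'valid' pooling.
    """
    pt, ph, pw = pool
    T, H, W, C = x.shape
    t, h, w = T // pt, H // ph, W // pw
    x = x[:t * pt, :h * ph, :w * pw]
    x = x.reshape(t, pt, h, ph, w, pw, C)
    return x.max(axis=(1, 3, 5))

clip = np.arange(4 * 4 * 4 * 1, dtype=float).reshape(4, 4, 4, 1)
pooled = max_pool_3d(clip)
print(pooled.shape)  # (2, 2, 2, 1): each axis halved
```

Each output element is the maximum over a 2x2x2 spatiotemporal window, which is exactly how 3D pooling extends 2D pooling with the temporal dimension.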
We introduce a new deep learning architecture, a 3D DenseNet with spatial pyramid pooling (3D-DenseNet), that can be applied to classification tasks. It is inspired by the hierarchical feature pooling model, which accepts images of arbitrary size, and by the growing popularity of densely connected networks in video classification. To create the Stacked 3D DenseNet, we use DenseNet as the primary network together with 3D pooling and 3D CNN layers.
Dense connections are used to increase information flow across layers, resulting in fewer network parameters and faster execution: the layers are linked and can be reached directly from one another, so fewer transitions between layers are required.
Assume that xl is the output of the l-th layer. In a typical feed-forward network, the output is computed as xl = fl(xl-1), where fl is a non-linear transformation of the l-th layer. In most cases, especially in deep neural networks, this formulation cannot prevent gradients from vanishing or exploding. In the dense connection structure, by contrast, the l-th layer delivers its feature maps to all following layers and, in turn, receives the feature maps of all preceding layers as input: xl = fl([x0, x1, ..., xl-1]), where [x0, x1, ..., xl-1] is the concatenation of the feature maps generated in layers 0, 1, ..., l-1 and fl denotes the composite function. By employing dense connections, the DenseNet structure obtains better features through improved gradient propagation. The data in this study comprise not only spatial but also temporal information.
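The concatenation xl = fl([x0, x1, ..., xl-1]) means that each layer's input width grows linearly with depth. The following sketch tallies the channel counts inside one dense block; the input width and growth rate are illustrative values, not the paper's configuration.

```python
def dense_block_channels(c_in, growth_rate, num_layers):
    """Channel count seen by each layer of a dense block.

    Layer l receives the concatenation [x0, x1, ..., x_{l-1}], so its
    input width is c_in + l * growth_rate; each layer contributes
    growth_rate new feature maps to the running concatenation.
    """
    widths = [c_in + l * growth_rate for l in range(num_layers)]
    c_out = c_in + num_layers * growth_rate
    return widths, c_out

widths, c_out = dense_block_channels(c_in=16, growth_rate=12, num_layers=4)
print(widths)  # [16, 28, 40, 52]
print(c_out)   # 64
```

Because each layer only has to produce growth_rate new channels rather than a full-width output, the parameter count stays small even as information from every earlier layer remains directly accessible.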
The composite function fl consists of three operations: batch normalization, ReLU activation, and 3D convolution, whose output serves as input to the next 3D convolution layer. We use several densely connected dense blocks in the proposed model to make greater use of the stacked 3D CNN and 3D pooling. A spatial pyramid pooling layer is placed before the fully connected layer to accommodate frames of any size. Conventional SPP networks are commonly employed in 2D topologies, where the top layer's two-dimensional feature map is pooled and passed to the fully connected layer; in our model, a 3D feature volume is fed into the SPP structure instead. To connect the two properly, we extend the typical SPP topology with a temporal dimension. At the same time, the spatial pyramid pooling is performed only over the spatial dimensions, which guarantees that the SPP layer is not affected by temporal interference: at every level of the SPP, the temporal extent is kept the same, while the spatial features are pooled as in a standard SPP. Our approach can thus be seen as a step forward from the densely connected network model. Compared with DenseNet, our model can not only process 3D visual information but also accept video frames of any size; compared with the 3D convolutional neural network, it contains fewer parameters. Fig. 1 illustrates the main Stacked 3D CNN model.
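The key property of applying SPP only over the spatial dimensions is that the per-frame descriptor length becomes independent of the frame's height and width while the temporal axis passes through unchanged. The following is a minimal sketch of that idea, with illustrative pyramid levels (1x1, 2x2, 4x4); it is not the paper's exact SPP configuration.

```python
import numpy as np

def spp_per_frame(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling over (H, W) only; x has shape (T, H, W, C).

    For each pyramid level n, the H x W plane is split into an n x n grid
    and max-pooled per cell, so every frame maps to a fixed-length vector
    of C * sum(n * n for n in levels), regardless of H and W. The
    temporal axis T is left untouched.
    """
    T, H, W, C = x.shape
    feats = []
    for n in levels:
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = x[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1], :]
                feats.append(cell.max(axis=(1, 2)))  # shape (T, C)
    return np.concatenate(feats, axis=1)  # shape (T, C * sum(n*n))

a = spp_per_frame(np.random.rand(8, 30, 40, 3))
b = spp_per_frame(np.random.rand(8, 17, 25, 3))
print(a.shape, b.shape)  # both (8, 63): 3 * (1 + 4 + 16) = 63
```

Two clips with different spatial resolutions produce descriptors of identical width, which is what allows the subsequent fully connected layer to accept frames of any size.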

Algorithm for Model development
This section presents the algorithm for our proposed model. The sequential steps of model development are given below; the algorithm shows how our method operates from start to end, with each step describing one operation performed by the proposed method.
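As a rough aid to following the pipeline, the sketch below traces how a clip's (T, H, W) extent shrinks through stacked 3D-convolution ('same' padding) plus 3D-max-pooling blocks before SPP. The kernel size, pool size, and block count are illustrative assumptions, not the paper's exact configuration.

```python
def trace_shapes(t, h, w, conv_blocks=3, pool=2):
    """Trace the (T, H, W) extent of a clip through repeated
    3D-conv ('same' padding) + 3D-max-pool (stride = pool) blocks.

    'Same' convolution preserves T, H, W, so only pooling changes the
    extent: each block divides every axis by the pool size.
    """
    shapes = [(t, h, w)]
    for _ in range(conv_blocks):
        t, h, w = t // pool, h // pool, w // pool
        shapes.append((t, h, w))
    return shapes

# A 16-frame clip at the KTH resolution of 160 x 120 pixels
print(trace_shapes(16, 120, 160))
# [(16, 120, 160), (8, 60, 80), (4, 30, 40), (2, 15, 20)]
```

Tracing shapes like this makes it easy to check that enough temporal extent survives to the SPP stage for the pyramid levels chosen.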

Data set used in video action classification
In media-related fields of research, several authors have spent considerable effort collecting and labeling video datasets. To assess the performance of our model, we use the KTH dataset, which contains 2391 video sequences covering six categories of human actions. All of the clips were shot with a static camera at a frame rate of 25 frames per second over varying backgrounds; each clip lasts about 4 seconds at a resolution of 160 × 120 pixels. More detailed information about the KTH data is summarized in Table 1; the dataset is available at https://www.csc.kth.se/cvap/actions/. We use 80% of the data for training and 20% for validation. The confusion matrix for the KTH actions is given in Fig. 2, which illustrates the relationships among all the video action classes in the KTH data.
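An 80/20 split of the 2391 clips can be sketched as follows. This is an illustrative shuffled split with an assumed random seed, not the official KTH train/test protocol.

```python
import random

def split_clips(clip_ids, train_frac=0.8, seed=42):
    """Shuffle clip identifiers and split them into training and
    validation sets (illustrative; seed and split are assumptions)."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

train, val = split_clips(range(2391))
print(len(train), len(val))  # 1912 479
```

Shuffling before splitting matters here because the KTH clips are grouped by action class; without it, an 80/20 cut could leave some classes absent from the validation set.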

RESULT
This section presents the results and their analysis. To evaluate the proposed method, we compare the accuracy of each model with ours for every video class in the KTH data. Table 2 shows the comparison; our model outperforms all compared methods. The table lists the per-class performance of each method on the KTH data, and it clearly shows that our method performs best, with the highest average accuracy of 93.8%, a margin of about 2%. 3D CNN-based approaches outperform the SVM algorithm, and among the 3D CNN methods, our stacked 3D CNN performs best for video action classification on the KTH data. The code and data for this paper are available at https://github.com/shafiq-islam-cse/Video-action-recognitnion. Fig. 3 plots the same comparison; each bar indicates per-class performance, with the color legend shown beside the figure. The figure reports recognition performance for the different video actions (walking, boxing, handclapping, jogging, waving, and running) in the KTH data.
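The per-class and average accuracies reported in Table 2 follow directly from a confusion matrix like the one in Fig. 2: per-class accuracy is each diagonal entry divided by its row sum, and overall accuracy is the trace divided by the grand total. The sketch below uses a small hypothetical 3-class matrix (the KTH experiments use six classes, but the arithmetic is identical).

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, cols = predicted class
cm = np.array([[48,  1,  1],
               [ 2, 45,  3],
               [ 0,  4, 46]])

# Per-class accuracy (recall): correct predictions / true instances per class
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Overall accuracy: all correct predictions / all instances
overall_acc = cm.trace() / cm.sum()

print(per_class_acc)           # [0.96 0.9  0.92]
print(round(overall_acc, 3))   # 0.927
```

Off-diagonal mass shows which actions are confused with one another, which is why the confusion matrix is a more informative summary than the average accuracy alone.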
To examine performance in more detail, we trained our method for 50 epochs and recorded the results after every 10 epochs. As Table 3 clearly shows, epoch 50 yields better accuracy than the earlier checkpoints; the table reports our model's performance at each number of epochs. Figs. 4 and 5 show the overall training and validation performance of our method. In both figures the x-axis represents the number of epochs; the y-axis shows the accuracy achieved at each epoch in Fig. 4 and the loss in Fig. 5.

DISCUSSION
The experiments revealed that our approach not only enables training on video frames of any size but also performs better, by a margin of about 2% accuracy. For action recognition, we combined the 3D CNN model, normalization, and DenseNet in this research. A variety of other deep architectures exist, including two-stream ResNet models, which are frequently employed on images and perform well there; it would be fascinating to see whether such models could be extended to video, a possibility we will investigate in future work. However, efficient video categorization requires an