A review on Video Classification with Methods, Findings, Performance, Challenges, Limitations and Future Work

In recent years, the number of web users and the available bandwidth have grown rapidly. Low-cost internet connectivity makes the sharing of information (text, audio, and video) more common and faster. This video content needs to be analyzed so that its class can be predicted for different user purposes. Many machine learning approaches have been developed for video classification to save people time and energy. There are many existing review papers on video classification, but they have limitations such as limited analysis, poor structure, no mention of research gaps or findings, and no clear description of advantages, disadvantages, and future work. Our review attempts to overcome most of these limitations. This study reviews existing video classification procedures, examines the existing methods comparatively and critically, and recommends the most effective and productive process. First, our analysis examines the classification of videos with taxonomical details, latest applications, process, and dataset information. Second, it covers the overall inconveniences, difficulties, shortcomings, potential work, data, and performance measurements, together with related recent work on deep learning and machine learning models. Comparing video classification systems by their tools, benefits, drawbacks, and other features of the techniques used is also a key task of this review. Lastly, we present a quick summary table based on selected features. In terms of precision and independent feature extraction, the RNN (Recurrent Neural Network), the CNN (Convolutional Neural Network), and combined approaches perform better than the CNN-dependent method.


INTRODUCTION
The internet is now commonly used by people worldwide. Social media plays an essential role in distributing and sharing content (audio, video, text, and images) [1]. At the same time, users share their emotions about particular topics on social media, so that other users can quickly find out what is happening; for this reason, user views are used to estimate public opinion on certain issues. However, employing a person to evaluate people's views across a multitude of content is very difficult and time consuming. To evaluate public attitudes, researchers have therefore proposed machine learning approaches to data mining. Video classification is a part of this mining: it analyzes text through natural language processing and video through machine learning in order to find people's views by gathering and analyzing social media and other sources of subjective information. Deep learning methods are generally reported to be more reliable and effective than other approaches. This paper reviews different approaches for video classification. There are a number of review and survey papers on video classification. Some recent review papers are listed here with their contributions and limitations. One review covers deep learning for video classification and captioning tasks [2]. It addresses only deep learning-based approaches, with a good description of deep models, data, and feature extraction tools, but it does not mention research gaps, advantages, or performance. A simple review of video classification techniques was proposed in 2019 by Q. Ren [3]. It is very brief because it only presents video classification approaches with a short description; it does not describe methods, datasets, performance metrics, research gaps, or the limitations of existing methods. In 2020, a very simple review of video classification was given by Anusya [4].
This review simply gives an introduction and states some recent existing methods in video classification for tagging. It has several shortcomings: it contains limited information and does not cover research limitations or the tools used in existing methods. Another recent review on video classification was published in 2020 by Rani [5]. It states video classification approaches and gives a summary-based description of recent works. Its limitations are a short description and insufficient analysis of recent work to identify research outputs, gaps, and findings. Another systematic and recent review [6], on live sport video classification, appeared in 2020. It properly presents recent works in live sport video classification with tools, feature extraction, video interaction features, etc. It is a longer review and has no summary table for research gaps, findings, advantages, and disadvantages of existing methods.
The explanation above indicates that most reviewers have historically only summarized existing research. Current survey articles usually describe the techniques and related studies with a basic introduction to each method. In traditional survey papers on video research, we usually see similar classification trends in the comparative or associated studies. Our review differs in that it applies several forms of critical analysis. The following can be mentioned as our key contribution to this review paper:
The rest of this paper is arranged as follows. Section 1 gives background knowledge on video classification techniques. Section 2 presents a critical analysis of recent research with its advantages, disadvantages, a feature-based quick summary, drawbacks, challenges, limitations, and future work, followed by the conclusions. The overall methodology of this review is shown in Figure 1.

Video classification architecture
Video classification techniques follow some basic steps that should be carried out in sequential order. Figure 2 shows the basic steps in video classification. The first step is data collection, followed by preprocessing of the data for feature extraction, and then method execution for feature matching and classification. In the data collection step, data can take the form of video, text, speech, or images associated with the video under review. Preprocessing plays an important role in video processing: it handles video conversion, segmentation, and analysis for further feature or information extraction. Feature extraction, feature matching, and classification with an algorithm form the main part of the video classification process.

Data set used in video classification
Many science and research organizations have invested a lot of time in gathering and labeling video datasets in media-related fields of research. YouTube-8M, UCF50, UCF101, HMDB51, and many more are widely used datasets. The small datasets include Weizmann, KTH, and Hollywood, which have a smaller overall quantity of videos and video types but are very well labeled. Medium-sized datasets with more than 50 categories include UCF101, THUMOS'14, and HMDB51. Large datasets include YouTube-8M (collected by Google), Sports-1M, ActivityNet, Kinetics, and others. More detailed information is summarized in Table 1. Here the Weizmann and KTH datasets are static, whereas all other datasets are dynamic.

Performance metrics in Video classification
In this section we describe the most common performance metrics for video classification. Performance metrics demonstrate how well an approach works on a dataset. In the scope of video classification, several performance measures are used, such as precision (the fraction of items predicted as positive that are truly positive) and recall (the fraction of truly positive items that the classifier detects). Table 2 presents some of the performance measures used to assess research in the latest work on video classification.
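The two metrics named above can be illustrated with a minimal sketch. The function name `precision_recall` and the six example labels are hypothetical and are not taken from any of the reviewed works; the sketch only shows the standard definitions for one class treated as positive.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # true positives
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # false positives
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground-truth and predicted labels for six videos.
y_true = ["sports", "sports", "news", "sports", "news", "news"]
y_pred = ["sports", "news", "news", "sports", "sports", "news"]
p, r = precision_recall(y_true, y_pred, positive="sports")
```

In this toy case two of the three videos predicted as "sports" are correct, and two of the three true "sports" videos are found, so both values are 2/3.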

Applications of video classification
There are many applications of video classification. Some of them are listed with recent work references in Table 3. For the application of video classification in a firewall task, the user must specify the types of videos that are allowed through. Live streaming prediction, action recognition, violence detection, character recognition, traffic control, social media analysis, emotion analysis, movie review, and event prediction are also applications of video analysis.

APPROACH USED IN VIDEO CLASSIFICATION
Because many videos are present in the real world, an effective way to classify them is important. The main aim of a video classification method is to determine whether a video belongs to a category such as sports, films, funny videos, or education. There are three different approaches to classifying video, based on audio, video (visual), and text features. Apart from these three, a hybrid approach (combining more than one of them) can also be used to categorize videos. Figure 3 shows the taxonomy of video classification approaches.

Text based Approach
In this approach, text is obtained from the video and evaluated for classification. The text may be visible on screen or extracted from speech. In the first category, the text is taken from what is shown on the screen, for example the scoreboard, the number on a player's jersey, or subtitles on the display. Text of this sort can be extracted with OCR [18] [19]. In the second category, the text is derived from speech through speech recognition. This technique is used primarily for subtitles and closed captions. Closed captions are also used for other kinds of sound, such as animal sounds or songs. Subtitles are placed on the video to make the content clear.

Audio Based Approach
This technique is used more than text-based analysis, which is attributed to the fact that audio processing takes less time and energy. Audio and its features need less storage space than video. For audio processing, the signal is divided into samples (frames), and some features are computed for each of them. In certain cases, these frames may overlap. Features can be taken from the time domain as well as the frequency domain.
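As a rough illustration of the framing step described above, the following sketch splits a signal into overlapping frames and computes two time-domain features and one frequency-domain feature. The function names, the frame length, the hop size, and the synthetic tone are all assumptions made for this example; real systems would typically use an audio library and an FFT rather than the direct DFT shown here.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a 1-D signal into frames; hop < frame_len gives overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Time-domain feature: mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Time-domain feature: fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def dft_magnitude(frame, k):
    """Frequency-domain feature: magnitude of the k-th DFT bin (direct sum)."""
    n = len(frame)
    re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
    im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
    return math.hypot(re, im)


# Synthetic tone with exactly 4 cycles per 32 samples (hypothetical input).
signal = [math.sin(2 * math.pi * 4 * i / 32) for i in range(64)]
frames = frame_signal(signal, frame_len=32, hop=16)  # 50% overlap
features = [(short_time_energy(f), zero_crossing_rate(f), dft_magnitude(f, 4))
            for f in frames]
```

With a hop of half the frame length, the 64-sample signal yields three frames, and the energy of the tone concentrates in DFT bin 4 of each frame.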

Video Based Approach
Most researchers use this approach because humans interpret most information through vision. Some authors have also coupled visual features with audio and text where necessary. Visual features are primarily derived from image sequences or video files. A video is fundamentally a sequence of pictures, also called a collection of frames. Visual features typically depend on color, motion, or shot length. These features convey details of lighting, movement, background, or video pace.
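To make the idea of frame-level visual features concrete, here is a minimal sketch that treats a video as a list of frames, computes a coarse intensity histogram per frame, and compares consecutive histograms as a crude cue for shot changes or strong motion. The frame representation (nested lists of RGB tuples), the bin count, and the function names are assumptions for illustration only.

```python
def gray_histogram(frame, bins=4):
    """Normalized histogram of pixel intensities (0-255) for one frame."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for r, g, b in row:
            gray = (r + g + b) // 3  # simple average as a stand-in for luminance
            counts[min(gray * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def histogram_difference(h1, h2):
    """L1 distance between two histograms; large values suggest a big visual change."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Hypothetical 4x4 frames: two dark frames followed by one bright frame.
dark = [[(10, 10, 10)] * 4 for _ in range(4)]
bright = [[(250, 250, 250)] * 4 for _ in range(4)]
video = [dark, dark, bright]

hists = [gray_histogram(f) for f in video]
diffs = [histogram_difference(a, b) for a, b in zip(hists, hists[1:])]
```

Here the difference between the first two frames is 0.0 and the difference between the second and third is 2.0, the maximum for normalized histograms, which is the kind of jump a shot-boundary detector would look for.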

Comparison among video classification approach
From the descriptions above, each approach has its own way of working and its own successes. Based on the suitability of the existing approaches to applications, we present a comparison in Table 4, which explains the advantages and disadvantages of each approach in detail. The disadvantages listed in Table 4 include large data size, expensive computation, the need for pre-processing, and difficult identification of shots and tracks.

Background study on video classification method
Some commonly used methods are supervised, such as SVM and CNN, and others are unsupervised. There are also a variety of other solutions in video classification (LSTM, GRU, etc.). This section describes the most widely used video classification methods, including their working technique, applications, advantages, and disadvantages. One way to classify video uses Naïve Bayes and a dictionary [8]. If the assumption of independent predictors holds, a Naive Bayes classifier works better than other models. The primary limitation of Naive Bayes is this assumption of independent predictors. SVM is also a method commonly used in video classification [20]. Another approach uses it to identify hateful speech in web video [14]. Another work on classifying Twitter videos [9] used an SVM tool with weighting in order to improve classification precision. The SVM method does not perform well on noisy data or when the target classes overlap.
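The Naive Bayes idea mentioned above can be sketched on text tokens, for example words taken from subtitles or tags. This is a generic multinomial Naive Bayes with add-one smoothing, not the specific method of [8]; the tiny training set and the function names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Pick the label with the highest log posterior under the independence assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical token lists, e.g. from subtitles or tags of four videos.
docs = [
    ("goal match team score".split(), "sports"),
    ("team player goal".split(), "sports"),
    ("election vote minister".split(), "news"),
    ("minister speech vote".split(), "news"),
]
model = train_nb(docs)
label = predict_nb(model, ["goal", "team"])
```

Each token contributes an independent factor to the score, which is exactly the independence assumption the text describes as both the strength and the main limitation of the method.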
K-means is used in various ways for video labeling. A video classification task was performed recently by Peng [13]. This approach retrieves visual features from video and shares visual feature resources by segmenting the video. The initial clustering of labeled video samples is refined with the standard k-means clustering algorithm. K-means is usually faster than hierarchical clustering when k is kept small. Its drawback is that the value of k is hard to estimate. The k-nearest neighbor (KNN) method is simple and easy to apply for classification as well as extraction purposes. The HMM (Hidden Markov Model) is also used: a new approach to the study of child face speech applies R-CNN and HMM to real-time video surveillance [21]. The advantages of HMM are a solid theoretical basis, fast learning algorithms, the ability to learn directly from raw sequence data, and support for variable-length inputs, which makes it one of the simplest general models for sequence data.
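The standard k-means procedure referred to above can be sketched as follows. This is plain Lloyd's algorithm on 2-D points with caller-supplied initial centroids; the points stand in for hypothetical per-video feature vectors, and the choice of k = 2 and of the starting centroids is an assumption of the example (the text notes that choosing k is the hard part in practice).

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy input the assignment stabilizes after the first iteration, with three points in each cluster.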
HMM's drawbacks include a large number of unstructured parameters and an inability to express dependencies between hidden states. One paper, on an efficient deeply pipelined template-based architecture for accelerating entire 2-D and 3-D CNNs on FPGA, reports that 3D CNN is well suited to video classification and analyzes its performance [22]. Action recognition has been performed with 3D deep convolutional neural networks [23]. 3D convolutions combine spatial information and motion information successfully. Long-term RNN models map temporal dynamics explicitly onto variable-length sequences of video frames. To accomplish this, the RNN uses networks with loops that allow information to persist [24]. The neural network uses this loop structure to process the input sequence, drawing on previous outputs wherever context is needed.
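The statement that 3D convolutions combine spatial and motion information can be illustrated with a minimal sketch of a single-channel "valid" 3-D convolution (implemented, as in most deep learning libraries, as a cross-correlation). The function name, the tiny videos, and the temporal-difference kernel are assumptions chosen so the effect is easy to see; this is not the architecture of [22] or [23].

```python
def conv3d_valid(video, kernel):
    """'Valid' 3-D cross-correlation of a T x H x W video with a t x h x w kernel."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for a in range(T - t + 1):          # slide over time
        plane = []
        for b in range(H - h + 1):      # slide over rows
            row = []
            for c in range(W - w + 1):  # slide over columns
                row.append(sum(
                    video[a + i][b + j][c + k] * kernel[i][j][k]
                    for i in range(t) for j in range(h) for k in range(w)
                ))
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical 2 x 1 x 1 kernel that subtracts the previous frame from the next one,
# so it responds only when pixel values change over time.
kernel = [[[-1.0]], [[1.0]]]
static = [[[5.0, 5.0], [5.0, 5.0]] for _ in range(3)]
moving = [[[0.0, 0.0], [0.0, 0.0]],
          [[1.0, 1.0], [1.0, 1.0]],
          [[3.0, 3.0], [3.0, 3.0]]]

static_response = conv3d_valid(static, kernel)
moving_response = conv3d_valid(moving, kernel)
```

The static clip produces an all-zero response, while the changing clip produces non-zero values, which is the temporal sensitivity that a 2D convolution applied frame by frame cannot provide.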
RNN has two main variants, LSTM and GRU. An RNN with long short-term memory (LSTM) neurons has been trained on sports video sequences with SIFT features [25]. The work of Baccouche has been well regarded for its consistency. With the development of deep learning techniques and architectures, feature extraction has become automatic. An RNN can be optimized through backpropagation. A more recent work uses 2D bidirectional gated recurrent networks to identify aggression in videos with higher fidelity. Kyunghyun Cho introduced Gated Recurrent Units (GRUs) as a gating mechanism for recurrent neural networks [26]. Deep learning is more reliable and efficient than other techniques [27]. Table 5 offers a detailed comparison of the deep learning methods for video classification; the recoverable rows are listed below.

RNN. Disadvantage: has short memory ability and may not work in real situations.
LSTM. Advantage: can capture both spatial and temporal features from sequence data. Disadvantage: gradient explosion; takes more training time.
GRU. Advantage: can capture both spatial and temporal features from sequence data in less time. Note: the reset gate of the GRU controls whether the previous hidden state needs to be ignored.
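The role of the GRU reset gate noted above can be shown with a minimal scalar sketch of one GRU update. The weight names and values are hypothetical, and the final interpolation follows one common convention, `(1 - z) * h_prev + z * h_cand`; some formulations swap the roles of `z` and `1 - z`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, w):
    """One scalar GRU update. w is a dict of hypothetical weights keyed by name."""
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev + w["bz"])  # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev + w["br"])  # reset gate
    # When r is close to 0, the candidate state ignores the previous hidden state.
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev) + w["bh"])
    return (1.0 - z) * h_prev + z * h_cand


# Biases chosen so the reset gate is almost closed and the update gate almost open.
weights = {"wz": 0.0, "uz": 0.0, "bz": 50.0,
           "wr": 0.0, "ur": 0.0, "br": -50.0,
           "wh": 1.0, "uh": 1.0, "bh": 0.0}
h_closed = gru_step(0.5, 0.9, weights)

# Running the same cell over a short hypothetical input sequence.
h = 0.0
for x in [0.1, 0.4, -0.2]:
    h = gru_step(x, h, weights)
```

With the reset gate nearly closed, the new state depends almost entirely on the current input (here it is approximately tanh(0.5)), regardless of the previous hidden state of 0.9.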

Result discussion with critical analysis on related works on Video Classification
Many deep learning approaches can use large-scale data collections and computing capability to resolve the limitations of current methods with increased precision and accuracy. This section presents analyses of recent progress on the classification of videos. Table 6 contains analytical columns for data, methods, model, type, advantages, and disadvantages of the most recent video classification methods in 2020. Throughout the field of video classification research, there are several related works. Methods are subdivided into the following types: traditional machine learning and deep learning. SVM is the classification tool generally used in video classification; another system uses Naïve Bayes and a dictionary [8]. K-means is used in various ways for video classification. A video classification work was performed recently by Peng [14]. A recent analysis focused on YouTube video content classification with the Random Forest algorithm [28]. An end-to-end video classification approach based on information graphs and k-nearest neighbor classification is presented in [29].
A new approach to the study of child face speech applies R-CNN and HMM to real-time video surveillance [21]. A deep learning architecture automatically learns and represents data across different processing layers by directly classifying specific input data or video frames [30]. Unlike a typical hand-designed architecture, no descriptors or hand-crafted feature extractors are required. For example, in a deep learning model, local features are learned directly from an image rather than from the whole picture [31]. Deep learning techniques are able to identify high-level or complicated behavior, which has attracted enormous research interest [32]. Widespread examples of deep learning models are the CNN, the recurrent neural network (RNN), and long short-term memory (LSTM). The use of deep learning for video data analysis was motivated by the outstanding success and high accuracy of deep learning methods in such visual tasks. At first, CNNs operated separately on still pictures for feature extraction [23]. However, a 2D CNN cannot retrieve temporal information from video streams. For large-scale video classification, the paper [33] applies convolutional neural networks (CNN) and reveals that the slow fusion model performs better than the usual early fusion model. The work in [34] evaluated CNN with LSTM-RNN and identified the potential for a stronger design based on Recurrent Convolutional Neural Networks. In [35] a two-stream CNN structure is used, with one stream for spatial features and the other for temporal features.
The study in [36] uses CNN features to recognize activities and behavior. Rank pooling encodes temporal information by ordering the video frames sequentially, and a bi-level optimization approach is used to learn the convolutional neural network. A CNN feature extractor combined with a batch-normalized LSTM can also be used to improve performance [37]. Non-linear context gating was introduced in [38] to model interdependencies between features, and it was then used to classify videos. 3D CNN was designed to retrieve both spatial and temporal information from video frames in order to fix this limitation of 2D CNN [39]. RNN was subsequently applied to action recognition. The RNN-based approach efficiently captures temporal information from both current and past observations [40], and its prediction is based on these observations. However, the RNN architecture has only a short-term memory, which limits it in real-world cases. The LSTM model was proposed to mitigate this problem. This model extracts temporal information from sequential video frames. The LSTM model has a memory cell that determines when hidden states are to be remembered and forgotten [41]. The LSTM model is widely used in computer vision applications, including action recognition, owing to its strong performance. Table 6 presents some recent approaches for video classification. The literature contains different methods of video classification based on text, audio, and video feature extraction. The different algorithms, such as HMM, ANN, SVM, and RNN, all have their own advantages and disadvantages. If two or more of these approaches can be combined, the advantages of each method become available in one scheme.
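One simple way to combine two approaches, as suggested above, is late fusion: each classifier produces per-class scores, and the scores are averaged with weights before the final decision. The sketch below is only an illustration of that idea; the function name, the scores of the hypothetical visual and audio classifiers, and the equal weights are all assumptions.

```python
def late_fusion(score_dicts, weights):
    """Weighted average of per-class scores coming from several classifiers."""
    total = sum(weights)
    classes = score_dicts[0].keys()
    return {c: sum(w * s[c] for s, w in zip(score_dicts, weights)) / total
            for c in classes}


# Hypothetical class scores for one video from a visual and an audio classifier.
visual = {"sports": 0.6, "news": 0.4}
audio = {"sports": 0.2, "news": 0.8}

fused = late_fusion([visual, audio], weights=[1.0, 1.0])
label = max(fused, key=fused.get)
```

In this toy case the visual classifier alone would choose "sports", but the fused scores favor "news", showing how a second modality can change the decision. The weights could be tuned on validation data if one modality is known to be more reliable.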

Quick Summary of video classification techniques based on used features
Many methods are used to estimate the outcome in video classification. Deep learning and machine learning based video classification systems work by learning and tuning numerous parameters, with normalization and the support of a layered neural network. Machine learning and the underlying neural network models are used to produce a successful outcome in hybrid or deep learning. This section gives a quick overview of the tools and features used for video classification. We chose twenty features, shown in Table 7, based on the methods used, tools, algorithms, and so forth. Table 8 gives a short review of recent techniques for classification tasks based on machine learning, organized around the twenty features selected in Table 7. We list the latest works and use the sign ✓ to show which of the twenty features each work matches.

Future Work and challenges in Video classification
This research may be extended to incorporate new methods in the future. The characteristics of frames and frame retrieval are key to effective video classification. Patterns also play a role in boosting the quality of classification tasks. One remaining challenge is incorporating more types of video into the datasets with more efficient and generic features, and researching methods that explicitly account for camera movement. Further challenges are classifying longer videos, recognizing multiple actions in a video, finding correlations among different videos, and classifying the actions of multiple objects in a video. Live streaming game video prediction is a trending topic in video classification.

CONCLUSION
This article critically reviews different approaches and methods in video classification with their advantages, findings, limitations, challenges, data summary, research gaps, and performance. From this analysis, it is concluded that the video-based approach works better than the text-based and audio-based approaches. Text extraction is the least employed process in video classification. In different applications audio and video feature extraction are used, but the performance of classification tasks can be further enhanced if visual and audio features are given the same importance in the collection of video features. The audio-based solution needs few computing resources. There is also an opportunity to classify videos in multiple ways to overcome the limitations of existing methods. New techniques could be created by first segmenting images, identifying them, applying a threshold, and then classifying them. Classification algorithms may be used for movie, game, or event forecasting. There is also an opportunity to classify videos in order to identify elements of a movie, such as songs, fight scenes, and funny scenes. There are many existing methods for video classification that have already shown their performance. This review also shows the limitations of existing methods, such as the inability to handle multiple features at a time, the higher training time of deep learning, the lower adaptability of traditional machine learning, and low accuracy in handling multilevel video. Overcoming these limitations is the trend and an opportunity for researchers. Future work includes classifying longer videos, recognizing multiple actions in a video, finding correlations among different videos, and classifying the actions of multiple objects in a video. Live streaming game video prediction is also a trending direction for future work in video classification.