Text Classification Using Long Short-Term Memory with GloVe Features

Received 19 December 2019, Revised 30 January 2020, Accepted 04 February 2020. When classifying text with traditional algorithms, problems of high feature dimensionality and data sparseness often occur. Traditional machine learning algorithms classify text with high efficiency and stability, but they have certain limitations when training on large-scale datasets. In this case, a multi-label text classification technique is needed that can assign four labels to a news article dataset. Deep learning is the method proposed here for solving these problems in text classification. The experiments use one deep learning method, the Recurrent Neural Network, with the Long Short-Term Memory (LSTM) architecture. In this study, the model is built through trial-and-error experiments using LSTM with 300-dimensional Global Vector (GloVe) word embedding features. By tuning the parameters and comparing the eight proposed LSTM models on a large-scale dataset, we show that LSTM with GloVe features can achieve good performance in text classification. The results show that the highest accuracy of text classification using LSTM with GloVe, 95.17%, is obtained by the sixth model, with average precision, recall, and F1-score of 95%. In addition, the training curves of LSTM with the GloVe feature are, on average, close to a good fit.


INTRODUCTION
Text classification is an important part of Natural Language Processing with many applications [1], such as sentiment analysis [2] [3], information search [4], ranking [5], and document classification [6]. Text classification models generally fall into two categories: machine learning and deep learning. Much research on text classification has involved traditional machine learning algorithms such as k-Nearest Neighbors [7] [8], Naive Bayes [9] [10], Support Vector Machine [11] [12], and Logistic Regression [13]. These traditional classification algorithms have high efficiency and stability, but they have certain limitations in the case of large-scale dataset training [14].
Recently, neural network-based models have become increasingly popular [15] [16] [17]. Although these models achieve excellent performance in practice, they tend to be relatively slow both during training and testing, which limits their use on very large datasets [14]. Several recent studies have shown that the success of deep learning in text classification depends heavily on the effectiveness of the word embedding [17]. Specifically, Shen et al. (2018) quantitatively showed, using the concept of intrinsic dimension, that text classification tasks based on word embedding can have the same level of difficulty regardless of the model used [1].
Deep learning methods applied to text classification include convolutional neural networks [16] [17], autoencoders [19] [20], and deep belief networks [21]. The Recurrent Neural Network (RNN) is one of the most popular architectures in natural language processing (NLP) because its recurrent structure is suitable for processing variable-length text. The deep learning method proposed in this study is an RNN with the Long Short-Term Memory (LSTM) architecture. An RNN can use a distributed word representation by first mapping each token of a text into a vector, which together form a matrix. LSTM, in turn, was developed to solve the exploding and vanishing gradient problems faced when training traditional RNNs [22]. Beyond its extended memory, LSTM is used for text classification in this study because it treats a sequence as an integrated whole that cannot be cut apart, just as a text document changes meaning when its sentences are cut apart. Word embeddings are used as the input features of the LSTM before classifying the text.

RESEARCH METHOD

Methodology
In general, the research methodology used in preparing this research requires a clear framework for its stages. The research framework, shown in Figure 1, consists of: a literature review covering research from the past one to five years; data preparation, where the dataset used in this study is AGNews with 400,000 data samples; data pre-processing by removing punctuation and tokenization; the classification process with LSTM; analysis of the results; and drawing conclusions. The classification process with LSTM consists of three sub-processes: training, validation, and testing.

Feature Extraction
Feature extraction is an important part of machine learning, especially for text data. Text datasets are mostly unstructured data, from which meaning and structure must be produced before machine learning algorithms can use them. Recently, T. Mikolov introduced an effective technique for extracting features from text using the concept of embedding: placing words into a vector space based on their context. This word embedding approach, called Word2Vec, solves the problem of representing contextual word relationships in a computable feature space [23]. J. Pennington et al. in 2014 developed GloVe, a method for learning vector-space representations of words, released by Stanford's NLP lab [24]. This study uses the 300-dimensional GloVe embeddings as the input features of the LSTM.
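As a minimal sketch, the pre-trained vectors can be read into an embedding matrix that later initializes the network's embedding layer. The file name, the vocabulary cap, and the `word_index` mapping (produced by the tokenizer described later) are assumptions for illustration, not details reported in the paper:

```python
import numpy as np

# Read the pre-trained 300-d GloVe vectors into a word -> vector map.
# "glove.6B.300d.txt" is the standard file distributed by Stanford NLP;
# the exact file used in this study is an assumption.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Rows of the matrix follow the tokenizer's word_index; words without
# a pre-trained GloVe vector remain all-zero.
embedding_dim = 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None and i < vocab_size:
        embedding_matrix[i] = vector
```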

Recurrent Neural Network
RNN is a type of neural network with a memory state for processing sequence inputs. Traditional RNNs suffer from the vanishing and exploding gradient problems during training [25]. Recurrent node activation includes feedback to itself from one time step to the next. RNNs fall into the deep learning category because data is processed automatically without hand-defined features [26]. An RNN can use its internal states (memory) to process an input sequence, which makes it applicable to tasks such as Natural Language Processing (NLP) [15], speech recognition [25], music synthesis [27], and financial time-series processing [28]. RNN training relies on the Backpropagation Through Time (BPTT) algorithm for calculating gradients, and the vanishing gradient problem it exposes led to the development of LSTM and GRU, the two most popular and powerful recurrent models currently used in NLP. The basic equation of the RNN hidden state is $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$.

Long Short-Term Memory
Long Short-Term Memory (LSTM) has recently become a popular tool among NLP researchers for its superior ability to model and learn from sequential data. These models have shown state-of-the-art results on various public benchmarks, ranging from sentence classification [29] and various tagging problems [30] to language modeling [16] [17] and sequence-to-sequence prediction [26]. LSTM aims to solve the vanishing and exploding gradient problems of the RNN. It replaces the hidden vectors of a recurrent neural network with memory blocks equipped with gates. In principle this can maintain long-term memory by training appropriate gating weights, and it has proven very useful in achieving state-of-the-art results on various problems, including speech recognition [31]. LSTM was proposed by Hochreiter and Schmidhuber (1997) specifically to address this problem of learning long-term dependencies. An LSTM keeps separate memory cells inside, which update and expose their contents only when necessary [32]. The LSTM gating mechanism implements three gates: (1) the input gate, (2) the forget gate, and (3) the output gate [33].
Each LSTM unit, shown in Figure 2, has a memory cell whose state at time t is represented as c_t. Reading and modifying the cell are controlled by sigmoid gates: the input gate i_t, forget gate f_t, and output gate o_t. The LSTM is computed as follows: at time step t, the model receives input from two external sources, the previous hidden state h_{t-1} and the current input x_t. The hidden state h_t is calculated from the input vector x_t received at time t and the previous hidden state h_{t-1}. When the hidden-layer node states are calculated, the input gate, output gate, forget gate, and x_t simultaneously affect the state of the node.

Fig. 2. LSTM Architecture
A step-by-step explanation of the LSTM cell and its gates is provided below:
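In the standard formulation of Hochreiter and Schmidhuber, consistent with the notation above, the gates and states at time step $t$ are:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W$, $U$, $b$ are the learned weights and biases of each gate.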

Evaluation
The multi-label evaluation metrics are computed from the confusion matrix with the following equations:
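Using the standard definitions, with TP, TN, FP, and FN counted per class from the confusion matrix and macro-averaged over the four classes:

$$
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{F1} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{aligned}
$$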

Optimization
There are several types of optimizers for deep learning models, such as SGD, Adam, and RMSProp. This paper applies Adam and RMSProp for training the data. The Adam optimizer can handle sparse gradient issues [34]. It is an extension of stochastic gradient descent that has recently seen wide adoption for deep learning applications such as Natural Language Processing. Adam keeps running averages of the first two moments of the gradient and updates the parameters as

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned}
$$

where $m$ and $v$ are the running averages of the first two moments of the gradient, $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected estimates, and $g$ is the gradient on the current minibatch. RMSProp adapts the learning rate for each parameter. It divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight [35]:

$$
E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2
$$

where $\gamma$ is the forgetting factor. The parameters are then updated as

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
$$
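As a small illustrative sketch, assuming the Keras API, the two optimizers at the learning rates tuned in this study can be instantiated as follows:

```python
from tensorflow.keras.optimizers import Adam, RMSprop

# The two optimizers compared in this study, at the two tuned
# learning rates (0.001 and 0.0001).
adam = Adam(learning_rate=0.001)
rmsprop = RMSprop(learning_rate=0.0001)  # rho corresponds to the forgetting factor γ
```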

One-hot encoding
The first pre-processing step in this research is one-hot encoding, which converts categorical text labels into numbers. Machine learning algorithms cannot work with categorical data directly; categorical data must be converted to numbers. This applies here because the research performs sequence classification with a deep learning method, the Long Short-Term Memory Recurrent Neural Network.
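A minimal sketch of one-hot encoding the four class labels, assuming the Keras utility `to_categorical`; the label-to-index mapping shown is illustrative, not taken from the paper:

```python
from tensorflow.keras.utils import to_categorical

# Illustrative mapping of the four AGNews topics to integer indices.
label_index = {"World": 0, "Entertainment": 1, "Sports": 2, "Business": 3}
labels = ["Sports", "World", "Business"]

int_labels = [label_index[label] for label in labels]
one_hot_labels = to_categorical(int_labels, num_classes=4)
# one_hot_labels[0] -> [0., 0., 1., 0.], i.e. "Sports"
```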

Tokenization and Remove Punctuation
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens; that is, splitting the text into units that carry minimal meaning. It is a mandatory step before all other types of processing. The process divides the text into sentences and the sentences into typographic tokens, which also separates out punctuation. The features generated by tokenizing form the training data. In this step, padding is also applied to mark the end of each sentence, because the network is trained sentence by sentence.
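A minimal sketch using the Keras preprocessing utilities; the vocabulary cap of 20,000 words and the padded length of 100 tokens are assumed values, not figures reported in the paper:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["Wall St. Bears Claw Back Into the Black (Reuters)"]  # sample headline

# Tokenizer lower-cases the text, strips punctuation by default, and
# maps each remaining word to an integer index.
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad (or truncate) every sequence to a fixed length so that the LSTM
# receives rectangular batches.
padded = pad_sequences(sequences, maxlen=100, padding="post")

word_index = tokenizer.word_index
vocab_size = min(len(word_index) + 1, 20000)
```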

Dataset
Previous research by Zhang (2015) and Wang (2018) has shown that these methods work well with large-scale datasets [16] [36]. Of the eight large-scale datasets available, the AGNews dataset was chosen for training. AGNews is a topic classification dataset of Internet news articles, consisting of titles and descriptions classified into four classes: World, Entertainment, Sports, and Business. The dataset is shown in Table 1, with the following content specifications:

Training Process
The AGNews dataset is divided into 80% for training and 20% for testing. The training portion is not used for LSTM testing, and vice versa. From the 80% training data, 10% is used for the validation process. Each split is drawn randomly, with an automatic data split.
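A minimal sketch of this split, assuming scikit-learn's `train_test_split` and the `padded` sequences and `one_hot_labels` from the earlier sketches; the paper does not state which splitting mechanism was used:

```python
from sklearn.model_selection import train_test_split

# Random 80/20 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    padded, one_hot_labels, test_size=0.2, shuffle=True, random_state=42
)
# The further 10% validation split can be taken automatically from the
# training portion, e.g. with validation_split=0.1 in model.fit().
```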

Training Models
The hyper-parameters used are the Relu and Tanh activation functions, and the Adam and RMSProp optimizers, validated with learning rates of 0.001 and 0.0001 to minimize error. The word embedding dimension is 300. The structure and hyper-parameters used in the LSTM validation with the GloVe features are shown in Table 2.
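As a minimal sketch of one of the eight configurations (Relu activation, Adam, learning rate 0.001), assuming the Keras Sequential API; the LSTM width of 128 units and the input length of 100 are assumed values, since the text does not report them:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    # Frozen 300-d GloVe embedding matrix from the feature-extraction step.
    Embedding(vocab_size, 300, weights=[embedding_matrix],
              input_length=100, trainable=False),
    LSTM(128, activation="relu"),    # Tanh is used in the alternate models
    Dropout(0.5),
    Dense(4, activation="softmax"),  # one output per AGNews class
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_split=0.1, epochs=50)
```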

LSTM Models
The LSTM sequence classification training process uses the 300-dimensional Global Vector (GloVe) word embedding feature. Each model is trained with the embedding matrix obtained from pre-processing the GloVe features at the input, Relu or Tanh activation on the hidden gate, softmax activation on the output gate, the Adam or RMSprop optimizer, dropout 0.5, and 50 epochs; the eight models are tuned with learning rates of 0.001 and 0.0001. The learning rate hyperparameter controls the speed at which the model learns. Specifically, it controls the amount of apportioned error with which the model weights are updated each time they are updated, such as at the end of each batch of training examples. The learning rate is perhaps the most important hyperparameter.

Model 1
Table 3 shows the evaluation performance of the LSTM training process trained with Relu activation and the Adam optimizer at a learning rate of 0.001. The training accuracy obtained in model 1 is 95.33%. The confusion matrix is used to calculate precision, recall, and F1-score; the results appear in Table 4 as the evaluation performance of the test. The results in Table 4 show that the training and testing accuracy values differ little, at about 95%, with average precision, recall, and F1-score of 95%. The per-epoch comparison of training and testing is shown in the accuracy curve in Figure 3 and the loss curve in Figure 4.

Model 2
Table 5 shows the performance evaluation of the LSTM training process trained with Tanh activation and the Adam optimizer at a learning rate of 0.001. The training accuracy obtained in model 2 is 95.34%. The confusion matrix is used to calculate precision, recall, and F1-score; the results appear in Table 6 as the evaluation performance of the test. With the same optimizer and learning rate, the two models above with Relu and Tanh activation produce values that differ little: training and testing accuracy and average precision, recall, and F1-score of 95%. Figure 5 shows the comparison curve of training and testing accuracy, and Figure 6 shows the loss curve.

Model 3
Model 3 is trained with the Relu activation hyperparameter, the RMSprop optimizer, and a learning rate of 0.001. The training evaluation performance is shown in Table 7, and the testing evaluation in Table 8. The accuracy obtained in the training process is 94.25%, with average precision, recall, and F1-score of 94%; the testing accuracy is not much different, at 94.37%. The comparison curves of training and testing accuracy and loss can be seen in Figure 7 and Figure 8.

Model 4
Model 4 is trained with the Tanh activation hyperparameter, the RMSprop optimizer, and a learning rate of 0.001. The training evaluation performance is shown in Table 9, and the testing evaluation in Table 10. The accuracy obtained in the training process is 94.32%, with average precision, recall, and F1-score of 94%; the testing accuracy is 94.56%. The test results in Table 10 show a macro-average precision of 95%, while the recall and F1-score are 94%, with an accuracy of 95%. The comparison curve of training and testing accuracy can be seen in Figure 9 and the loss in Figure 10. The Adam and RMSprop optimizers trained with a learning rate of 0.001 show results that are not much different.

Model 5
LSTM model 5 is trained with the same hyperparameters and a tuned learning rate of 0.0001. Table 11 shows the training evaluation performance and Table 12 the classification testing evaluation with Relu activation, the Adam optimizer, and 300-dimensional GloVe word embedding. The accuracy values in training and testing for learning rates of 0.001 and 0.0001 with the same Adam optimizer differ little, with precision, recall, and F1-score of 95%. However, the accuracy and loss curves obtained at a learning rate of 0.0001 are better than those at a learning rate of 0.001, as can be seen in Figure 11 and Figure 12.

Model 6
In model 6, training is carried out with the same hyperparameters: Tanh activation, the Adam optimizer, and a learning rate of 0.0001. The training evaluation performance and confusion matrix are shown in Table 13, with a training accuracy of 95%. The testing evaluation performance is in Table 14, with average precision, recall, and F1-score of 95%. Figure 13 shows the comparison curve of training and testing accuracy over 50 epochs. Although the loss in the validation process continues to decrease, at the 40th epoch it becomes equal to and then slightly greater than the training loss, as can be seen in Figure 14.

Model 7
Table 15 shows the training evaluation performance and confusion matrix over 50 epochs, with an accuracy of 93.24%. The testing evaluation performance in precision, recall, and F1-score is in Table 16. The accuracy curve of training and testing can be seen in Figure 15 and the loss curve in Figure 16, which show that the RMSprop optimizer with a tuned learning rate of 0.0001 fits better than RMSprop with a learning rate of 0.001, although accuracy and loss fluctuate slightly.

Model 8
Table 17 presents the four-class multi-label confusion matrix. The testing evaluation performance is in Table 18, with average precision, recall, and F1-score of 94%. The comparison curve of training and testing accuracy for this model can be seen in Figure 17 and the loss in Figure 18.

Table 19 shows a comparison of the testing accuracy of the eight LSTM models using the GloVe word embedding. Among the eight tuned LSTM models using the GloVe word embedding feature, the highest test accuracy is 95.17%, in model 6, with the Tanh activation parameter, the Adam optimizer, and a learning rate of 0.0001. The accuracy and loss curves closest to a good fit belong to the models with a learning rate of 0.0001, with either the Adam or the RMSprop optimizer. Table 20 shows the comparison with previous works.

CONCLUSION
Text classification using LSTM was carried out through trial-and-error experiments. Text classification using LSTM with the GloVe feature involves hyper-parameter tuning to obtain the best model. The LSTM structure and hyperparameters used, based on the test results, are: the GloVe feature embedding at the input, the softmax activation function at the output, the Relu and Tanh activation functions, the categorical cross-entropy loss function, learning rates of 0.001 and 0.0001, and 50 epochs. The highest accuracy with the GloVe feature is in the sixth model, at 95.17%, with average precision, recall, and F1-score of 95%. It can be concluded that the LSTM evaluation results using the GloVe feature achieve good performance in both accuracy and the training curves.