Text classification based on gated recurrent unit combined with support vector machine

Received Jan 22, 2019 Revised Jan 18, 2020 Accepted Feb 1, 2020

As the amount of unstructured text data produced by humanity grows rapidly on the Internet, intelligent techniques are required to process it and extract different types of knowledge from it. The gated recurrent unit (GRU) and the support vector machine (SVM) have both been applied successfully to natural language processing (NLP) systems with remarkable results. GRU networks perform well in sequential learning tasks and overcome the vanishing and exploding gradient issues of standard recurrent neural networks (RNNs) when capturing long-term dependencies. In this paper, we propose a text classification model that improves on this norm by using a linear support vector machine (SVM) as the replacement for softmax in the final output layer of a GRU model. Furthermore, the cross-entropy loss function is replaced with a margin-based function. Empirical results show that the proposed GRU-SVM model achieves comparatively better results than the baseline approaches BLSTM-C and DABN.


INTRODUCTION
With the rapid and continuous production of digital information and the growing number of electronic texts on the World Wide Web, such as news sites, forums, and blogs, text classification has become a serious problem for large organizations and companies that must manage and organize textual data. Text classification is a technique for automatically assigning unorganized text to predefined classes or categories based on the text content through a given classification system. Nowadays, text classification has become an important task for automatically sorting documents into their respective categories. Furthermore, text classification techniques are used in many NLP applications, such as spam filtering [1], user intention analysis [2], text classification [3, 4], personalized news recommendation [5], email categorization [6], sentiment analysis [7], and sentence classification [8], all of which require allocating predefined categories to a sequence of text. One of the most general and basic techniques for representing texts is bag-of-words. However, the bag-of-words approach loses the word order and ignores the semantics of words [5]. Therefore, in this context, how to organize and use these large amounts of text information becomes particularly important. At present, the most common and important classifiers adopted in traditional machine-learning-based text classification are naive Bayes, support vector machines, neural networks, k-nearest neighbors, fastText, and decision trees [9]. Comparatively, SVM has greater implementation requirements to satisfy its theory, but it has produced better results in many fields [10, 11]. However, its performance generally depends on the quality

of hand-crafted features [12]. Bin [13] proposed a recent type of RNN known as the GRU, which jointly exploits the advantages of GRU and CNN to recognize medical relations in clinical records using only word-embedding features. Over the years, various deep learning approaches have been presented to address this issue and have extended the application of RNNs to real-world problems. However, based on the literature and our experiments, we found that the main deficiencies of the standard RNN are the vanishing and exploding gradient issues. These make the training of RNNs difficult in two ways: (i) it cannot process very long sequences when using the hyperbolic tangent (tanh) activation function; (ii) it is very unstable when using the rectified linear unit (ReLU) as the activation function. In this research, to overcome these issues, we use a newer type of recurrent network, the GRU, initially introduced by Kyunghyun Cho in 2014. The GRU model shows strong performance on real-world sequential tasks. Motivated by this, we combine two of the strongest state-of-the-art approaches, the gated recurrent unit (GRU) and the support vector machine (SVM), both of which can be traced back to [14]. A GRU has the capability of a dynamical system in which the network output is decided not by the current input alone but also by the current network state; the output depends on previous computations. It is characterized by good modeling of sequential data and full use of sequence information.
Furthermore, in recent years the gated recurrent unit (GRU) has been used because it addresses the shortcomings of standard RNNs on long text. The advantage of the GRU model is that its gates control how much information is received at the current step, how much is forgotten, and how much information is passed on. Through these gate controls, the GRU has a proficient learning ability for long text. In the existing literature, many of these models have been implemented memory-free; consequently, RNNs with memory operations are clearly more appropriate for this type of task. In this research paper, we make a threefold contribution.
Firstly, we concentrate on the gated recurrent unit (GRU) network, a type of gated RNN, which greatly reduces the vanishing and exploding gradient issues of RNNs through a gating mechanism that yields a simpler architecture while preserving the effectiveness of the LSTM (another type of RNN). Secondly, we propose to replace the hyperbolic tangent (tanh) activation function with the leaky rectified linear unit (LReLU) activation in the candidate-state equation; LReLU units have been demonstrated to perform better than sigmoid non-linearities in feed-forward DNNs. Thirdly, we present a revision of the standard architecture by implementing a linear support vector machine (SVM) as the replacement for the traditional softmax in the final output layer of a GRU model.

SUPPORT VECTOR MACHINE
The support vector machine was introduced by Vapnik in the 1990s and its popularity has increased quickly. This is essentially due to its state-of-the-art performance in various real-world applications and its better generalization on unseen datasets. Another significant property is that, unlike neural networks, SVMs generate consistent results. The SVM was originally designed for two-class problems containing both positive and negative objects. The basic idea of SVMs is to map the input vector to a high-dimensional feature space and find the optimal linear hyperplane that separates the two target classes with the maximum margin [15]. Furthermore, the number of support vectors can be used to estimate the generalization performance, which degrades as the number of support vectors increases. The mapping from the input space to the feature space is achieved through the kernel trick, which allows mapping to a higher-dimensional space without the need to construct this space explicitly. The kernel can be any function that satisfies Mercer's theorem [16]. The radial basis kernel and the linear kernel are among the most extensively applied kernels with SVMs.
However, the development of new kernels that capture the similarities present in various applications is an active research field, and several kernels have been developed in recent years to handle particular applications. SVMs were initially designed as binary classifiers, so several methods have been used to extend them to the multi-class problem [17]. The principal methods are "one against all" (OAA) and "one against one" (OAO): in the first technique n SVM classifiers are constructed, one for each class, while in the second n(n − 1)/2 binary SVM classifiers are applied, and majority voting among these classifiers is used when predicting new points. The basic equation of the SVM, used to estimate the decision function from a training dataset [18], is:

f(x) = sign(Σ_{n=1}^{N_s} α_n y_n K(x, x_n) + b)

where b is the bias term, N_s is the number of support vectors, y_n ∈ {−1, +1} is the class label to which the support vector x_n belongs, and the coefficients α_n are obtained as the solution of a quadratic optimization problem. The number of support vectors cannot exceed the number of data points in the dataset; preferably it should be a comparatively small fraction of the dataset [19].
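As a concrete illustration of a multi-class linear SVM on text, the sketch below uses scikit-learn's LinearSVC, which follows the one-against-rest scheme; the documents, labels, and query are invented for illustration and are not from the paper's corpus.

```python
# Minimal one-vs-rest linear SVM text classifier (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["stock prices rose sharply", "the team won the final match",
        "new vaccine trial results", "markets closed lower today"]
labels = ["economy", "sports", "medicine", "economy"]

# C trades off margin width against slack (misclassification) penalties.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["the match ended in a draw"]))
```

In a GRU-SVM pipeline the TF-IDF features would be replaced by the GRU's final hidden state, but the decision rule is the same.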

GRU-BASED MODEL
The GRU was recently introduced by [20]. The GRU is one of the newer generations of gated RNNs, applied to solve the vanishing and exploding gradient issues of standard RNNs when capturing long-term dependencies, issues which arise from the non-linear dynamics of these networks. It was designed to adaptively update or reset its memory contents and is a slightly simpler variant of the LSTM [21]. It combines the input and forget gates into a single "update gate" and has an additional "reset gate". The GRU model is simple and has fewer parameters than traditional LSTM models, and it is progressively gaining popularity. As Figure 1 illustrates, multiple neurons compose a single input layer, with the number of neurons determined by the size of the feature space; equivalently, the output layer neurons correspond to the output space. Unlike the LSTM, the GRU fully exposes its memory content at each time step and balances the new and previous memory contents through a leaky combination, with its adaptive time constant controlled by the update gate. The activation h_t of the GRU at time t is a linear interpolation between the previous activation h_{t−1} and the candidate activation ĥ_t:

h_t = (1 − z_t) h_{t−1} + z_t ĥ_t

The update gate z_t determines how much the unit updates its activation:

z_t = σ(W_z x_t + U_z h_{t−1})

The candidate activation ĥ_t is computed similarly to the update gate:

ĥ_t = tanh(W x_t + U (r_t ∗ h_{t−1}))

where x_t is the input vector, h_t is the hidden state vector, r_t is the reset gate, and ∗ denotes the element-wise product. When the reset gate is off (r_t = 0), the unit forgets the previous state, which is similar to letting it read the first symbol of an input sequence. The reset gate is evaluated using:

r_t = σ(W_r x_t + U_r h_{t−1})
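A single GRU time step following the update-gate, reset-gate, and candidate-activation equations above can be sketched in NumPy; the dimensions and random weights are illustrative only (a trained model learns them by backpropagation).

```python
# One GRU time step in NumPy (illustrative weights, not a trained model).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)           # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_cand      # leaky interpolation

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# Shapes: Wz, Wr, W are (d_h, d_in); Uz, Ur, U are (d_h, d_h).
params = [rng.normal(size=(d_h, d)) for d in (d_in, d_h, d_in, d_h, d_in, d_h)]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), *params)
print(h.shape)  # → (3,)
```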
The update gate z_t decides how much of the previous information (from time step t − 1) should be thrown away and what new state should be added. Units whose reset gates are frequently active capture short-term dependencies, while units with long-term dependencies have active update gates.

THE PROPOSED GRU-SVM ARCHITECTURE
In existing systems, most researchers have used machine learning and data mining algorithms to classify text, but in our proposed model we concentrate on deep learning algorithms. Several studies have verified that deep learning algorithms produce better accuracy than traditional machine learning algorithms. In this paper, we use an RNN-based algorithm to obtain better accuracy. The architecture of our proposed system is shown in Figure 2. Similarly to the work of Alalshakmubarak [22], we propose to apply SVM as the classifier in a deep learning architecture, specifically a GRU. The parameters are then trained through the gating mechanism of the GRU architecture [23]. The GRU overcomes the deficiencies of existing RNNs by augmenting the RNN with update and reset gates that take x_t and h_{t−1} as input and generate ĥ_t, as follows:
- Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
- Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})
- Candidate hidden state: ĥ_t = tanh(W x_t + U (r_t ∗ h_{t−1}))
The second modification consists of replacing the standard hyperbolic tangent with the Leaky-ReLU (LReLU) activation function. In particular, we modify the calculation of the candidate state ĥ_t in (9) as follows:

ĥ_t = LReLU(W x_t + U (r_t ∗ h_{t−1}))

where W and U are weight matrices, σ is the logistic sigmoid activation function, and h_{t−1} is the hidden state from the previous time step. For the final layer, an SVM replaces the traditional softmax layer of the GRU; the optimal separating hyperplane parameters are obtained by solving the regularized primal problem of the SVM, as shown in equation (12):

J(w, ξ) = min (1/2) wᵀw + C Σ_{i=1}^{N} ξ_i    (12)

subject to y_i (wᵀ φ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, …, N. Consider a training dataset G = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^m is the i-th input pattern and y_i ∈ {+1, −1} is its corresponding observed label. In the text classification model, x_i denotes the attributes of a text vector, y_i is the class label, and C is a constant representing a trade-off between the margin and the total classification error.
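The margin-based objective that replaces cross-entropy can be illustrated with a small NumPy sketch of the L2-SVM (squared hinge) loss; the weights, scores, and penalty C below are invented purely for illustration.

```python
# Sketch of the L2-SVM objective: L2 regularizer plus squared hinge slack.
import numpy as np

def l2_svm_loss(scores, y, w, C=1.0):
    """scores: (n,) raw outputs w·x + b; y: (n,) labels in {-1, +1}."""
    slack = np.maximum(0.0, 1.0 - y * scores)    # hinge slack per sample
    return 0.5 * w @ w + C * np.sum(slack ** 2)  # (1/2)wᵀw + C·Σξᵢ²

w = np.array([0.5, -0.5])
scores = np.array([2.0, -1.5, 0.2])              # last point violates the margin
y = np.array([1, -1, 1])
print(l2_svm_loss(scores, y, w))
```

Only the margin-violating third point contributes slack (0.8² = 0.64), so the total loss is 0.25 + 0.64 = 0.89.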
φ(x) is a non-linear function that maps the input space into a higher-dimensional feature space. The margin between the two classes is 2/‖w‖.
The L2-SVM is applied in the proposed GRU-SVM structure. For prediction, the decision function f(x) = sign(wx + b) generates a score vector over the classes, so to obtain the predicted class label y of a data point x, the argmax function is employed. The proposed GRU-SVM approach is summarized as follows:
a. Input the dataset features {x_i | x_i ∈ R^m} to the GRU network.
b. Initialize the learning parameters (weights and biases) with arbitrary values; they will be adjusted through training.
c. The GRU network cell state is calculated based on the input features x_i and its learning parameter values.
d. At the last time step, the prediction of the model is computed using the decision function of the SVM: f(x) = sign(wx + b).
e. An optimization method is employed for loss minimization (the Adam optimizer was applied in this research); optimization adjusts the weights and biases based on the computed loss.
f. This process is repeated until the recurrent neural network achieves the appropriate accuracy or reaches the highest possible accuracy.
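The steps above can be sketched end-to-end with TensorFlow/Keras: a GRU encoder feeding a linear (non-softmax) output layer trained with squared hinge loss and Adam. All sizes, the embedding layer, and the random data are illustrative assumptions, not the authors' exact implementation.

```python
# End-to-end GRU-SVM training sketch (illustrative sizes and random data).
import numpy as np
import tensorflow as tf

num_classes, seq_len, vocab = 10, 50, 5000
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab, 64),                    # step a: feature input
    tf.keras.layers.GRU(128),                                # step c: GRU state
    tf.keras.layers.Dense(num_classes, activation="linear",  # step d: SVM-style scores
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
])
# step e: margin-based (squared hinge) loss minimized with Adam, not softmax CE.
model.compile(optimizer="adam", loss="squared_hinge")

x = np.random.randint(0, vocab, size=(32, seq_len))
y = tf.one_hot(np.random.randint(0, num_classes, size=32), num_classes) * 2 - 1  # {-1,+1}
model.fit(x, y, epochs=1, verbose=0)                         # step f: repeat to convergence
pred = np.argmax(model.predict(x, verbose=0), axis=1)        # argmax over class scores
```

Squared hinge expects targets in {−1, +1}, hence the rescaled one-hot labels.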

EXPERIMENT SETUP
In this section, we explain the experimental settings and empirical results of the proposed model.

Collection of datasets
In this section, we adopt a standard experimental dataset consisting of a Chinese corpus collected at Fudan University by Dr. Li Rongla. In an automatic text classification system, the experimental dataset is typically divided into two parts: the training set and the testing set. We randomly selected 10 categories from the corpus and deleted some erroneous documents. The final corpus consists of 10 categories, with 1885 documents in the training set and 940 documents in the testing set. According to the text categorization system setting, each class should include a particular amount of training text. Each of these texts was classified by the classifier, and the classification results were then compared against the correct labels; in this way we can determine the impact of the model with the SVM classifier. A detailed description of the dataset is presented in Table 1.
In addition, three evaluation indices measure the efficiency of text classification: precision P = TP/(TP + FP), recall R = TP/(TP + FN), and the F1-score F1 = 2PR/(P + R), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
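These indices can be computed per class and macro-averaged with scikit-learn; the labels below are invented purely to illustrate the computation and do not come from the paper's experiments.

```python
# Macro-averaged precision, recall, and F1 on illustrative labels.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f1, 3))
```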

Hyperparameters and software
Data preprocessing and manipulation were performed in Python 3.6, based on the sklearn, numpy, and pandas packages. The deep learning GRU network and traditional DNNs were executed with TensorFlow, an open-source software library for numerical computation with data-flow graphs. All methods were evaluated with the predefined assessment measures. All simulations were run on a machine with an Intel Core i7-3770 CPU @ 3.40 GHz and 4 GB of RAM. The details of the hyperparameters of the proposed model are presented in Table 2.

RESULTS AND DISCUSSION
To evaluate and compare the performance of the GRU-SVM text classification algorithm with the baseline DABN and BLSTM-C classification algorithms, we experimented on the well-known Chinese corpus dataset. We conclude that the GRU-SVM approach has the best text classification capability in terms of recall, precision, and F1 compared with the existing approaches of deep autoencoder belief networks (DABN) [24] and bidirectional long short-term memory with a convolutional layer (BLSTM-C) [25]. The proposed technique performed better text classification on categories such as computer, art, entertainment, politics, education, sports, and transportation, while the proposed improved GRU-SVM model does not classify the business, economy, and medicine classes as accurately. This may be because eliminating related features of a test sample loses some information, which affects the recall rate; consequently, further improvement is still required. Overall, the proposed GRU-SVM model provides a powerful algorithm for performing text classification tasks. Figures 3 to 5 show the performance of the proposed GRU-SVM model against the traditional baseline approaches.

CONCLUSION
In this paper, we have proposed a gated recurrent unit network whose traditional softmax output layer is replaced with an SVM for text classification. We performed an experiment on the well-known Chinese corpus dataset to evaluate the performance of the GRU-SVM. The dataset is publicly available and includes 1885 documents for training and 940 documents for testing. The results were compared with the two state-of-the-art DABN and BLSTM-C models. In this experiment, the proposed GRU-SVM model achieved the best text classification accuracy of 94.75%, which demonstrates the potential of applying GRU-SVM to multi-class classification problems with a small number of classes. Furthermore, our proposed method achieved much better performance in terms of precision, recall, and F1 than DABN and BLSTM-C, particularly when the storage size is limited.