Using deep learning models for learning semantic text similarity of Arabic questions

ABSTRACT


INTRODUCTION
Semantic text similarity (STS) is a measurement used to determine how equivalent linguistic terms are to each other. The linguistic terms usually studied are documents, sentences, words, and questions [1]. STS improves the understanding of the semantic similarity between linguistic terms and increases the accuracy of several knowledge-based applications. This understanding gives STS a great impact on many applications in artificial intelligence and computational linguistics, such as information retrieval, word sense disambiguation, knowledge acquisition, and natural language processing (NLP) [2,3].
In the NLP field, STS applications extend from paraphrase identification and question similarity to machine translation [4]. Research on STS has increased greatly during the past few years, much of it driven by the annual SemEval competitions [5]. SemEval is an international workshop for semantic evaluation organized by SIGLEX [1].
With the advent of Web 2.0 and advancements in social computing, question-answering platforms have become widely used. According to Quora, a well-known platform for question answering and knowledge sharing, over 10 million users visit the site every month. With such a large number of visitors, similar questions will inevitably be asked and answered by several users, confusing other users trying to find the right answer and forcing writers to repeat the same answer in response to similar questions.
Two questions asking about the same thing can be formed using different vocabulary and syntactic structure, which makes detecting the semantic similarity between questions a challenging task. This research proposes three models for analyzing the semantic similarity of Arabic question pairs: i) a supervised machine learning model using XGBoost [6] trained with pre-defined features, ii) an adapted Siamese deep learning recurrent architecture based on the work of [7], and iii) a pre-trained deep bidirectional transformer based on the BERT model [8].
This paper presents several new non-trivial extensions to our preliminary work described in [9]:
- Our preliminary work [9] contains only traditional machine learning models such as XGBoost, SVM, and decision trees. In this manuscript, we have designed and implemented various deep learning models using a transfer learning technique.
- We enlarged the dataset used for training and testing. In [9], we trained our models on 9,568 pairs of questions, whereas in this paper we trained our models on 15,712 pairs of questions, i.e., 31,424 distinct questions.
- As in our preliminary work [9], two of our models were trained using pre-engineered features, including character-level features, word-level features, morphological features, semantic features, and word embedding features. Unlike our preliminary work, our best-achieved model, the BERT-based model, was able to learn the semantic similarity among pairs of Arabic questions without the need for pre-engineered features, increasing the generality and applicability of our approach.
- On top of the previous technical contributions, we discuss our work in light of other related research efforts in the area of Arabic text similarity detection using deep learning. Moreover, the paper provides a detailed description of the models along with the parameters used to train them for the best results.

The rest of this paper is organized as follows: Section 2 presents a brief survey of the literature on STS. Then, section 3 describes our method for detecting similar Arabic questions. Section 4 presents the results of our intensive experiments. In section 5, the results are analyzed and discussed. Finally, section 6 concludes the paper with avenues for future work.

RELATED WORK
Many researchers from various fields have applied semantic text similarity (STS) in different applications. This section compares and contrasts our research contribution in light of other research work in the field. Our work is related to the body of research that applied machine learning and deep learning techniques to solve STS problems, including [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. However, all of the previously mentioned approaches designed their STS models for English-language texts. Even though some of their models can be applied to Arabic texts, they will not produce high accuracy, since they were neither designed for nor trained on Arabic text. Therefore, these approaches cannot solve the problem we are trying to solve, that is, accurately and efficiently detecting similar Arabic questions.
Although the majority of researchers in the STS field have developed techniques for the English language, a few have developed STS approaches for Arabic. Next, we discuss the main research efforts for detecting the semantic similarity of Arabic texts. Mohammad et al. [27] proposed an enhanced approach for paraphrase identification (PI) and STS in Arabic tweets. Sagher et al. [28] proposed a CNN deep learning model to classify Arabic sentences into three categories. The authors of [29] used and compared different STS methods to measure the cross-language semantic similarity of short sentences and phrases. Various approaches have used STS to detect plagiarism in Arabic texts, such as [30][31][32]. Ferrero et al. [33] proposed two different approaches to measure the STS of cross-language sentences for Arabic-English text. Moreover, [34] proposed a query-based Arabic text summarization approach that accepts an Arabic document as well as user queries. Finally, [35] adopted a morphological word2vec method for neural machine translation in low-resource settings.

RESEARCH METHOD
This section describes our method of developing a machine learning approach for accurately and efficiently detecting whether two Arabic questions are similar.

Arabic question pairs dataset
In order to evaluate our models, the Arabic question pairs dataset provided by the mawdoo3.com company is used. The dataset was manually annotated by mawdoo3's data annotation team. As shown in Table 1, the dataset consists of around 15k pairs of Arabic questions annotated as "similar" or "not similar". The data was divided into two files: a training file with 11,997 pairs of questions and a testing file with 3,715 pairs of questions. Table 2 shows an example of two pairs of questions selected from the dataset, representing the two main categories: similar, shown as "Yes", or not similar, shown as "No". In the first row of Table 2, Question1 asks about the birth city of the encyclopedic thinker Al-Razi, and Question2 asks about the city of the Al-Razi museum. Clearly, these two questions are not similar, since they ask about two different things. On the other hand, Question1 and Question2 in the second row of Table 2 both ask about the first country in which the communist political ideology started. These two questions were written in two different ways, but they still have a similar meaning.
The utilized dataset is nearly balanced, with 55.01% of pairs labeled as "not similar" and the rest, 44.99%, labeled as "similar". To further analyze the dataset, we computed the common words among each question pair. Figure 1 shows that the number of overlapping words between pairs of both "similar" and "not similar" questions is around 2. Therefore, relying on the overlapping words between a pair of questions to decide whether they are similar will not give good results. Thus, Figure 1 shows that our problem is very challenging to solve.
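The overlap analysis above can be sketched as follows; the function name and the toy question pairs are our own illustration, not the paper's exact implementation:

```python
# Minimal sketch: count distinct words shared by a question pair.
def common_word_count(q1: str, q2: str) -> int:
    """Number of distinct words appearing in both questions."""
    return len(set(q1.split()) & set(q2.split()))

# Toy pairs (illustrative only): one near-duplicate pair, one unrelated pair.
pairs = [
    ("ما هي عاصمة فرنسا", "ما عاصمة دولة فرنسا"),
    ("كيف اتعلم البرمجة", "ما هو الطقس اليوم"),
]
overlaps = [common_word_count(q1, q2) for q1, q2 in pairs]
```

As the paper notes, similar and dissimilar pairs both tend to share only about two words, so this count alone is a weak signal.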

Data pre-processing
In order to prepare the dataset for further processing, enhance the accuracy, and reduce the noise in the data, various Arabic pre-processing steps were applied:
- Removal of non-Arabic words.
- Removal of hyperlinks and hashtags in all posts.
- Removal of Arabic diacritics (short-vowel marks).
- Removal of punctuation and symbols such as "?, (, ), ', !, @, $, #, -".
- Normalization, which removes the HAMZA from the ALEF, i.e., the ALEF variants are replaced with the bare form of the letter.
- Removal of Arabic stop words.
The NLTK library [36], written in Python, was used to implement the data pre-processing and cleaning phase.
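The steps above could be sketched with plain regular expressions as below; the character ranges, the tiny stop-word list, and the function names are our own assumptions and stand in for the paper's NLTK-based pipeline:

```python
import re

# Sketch of the Arabic pre-processing steps (illustrative, not the
# paper's exact rules).
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")       # fathatan .. sukun
PUNCTUATION = re.compile(r"[?,()'!@$#\-\.:؟،]")
NON_ARABIC = re.compile(r"[A-Za-z0-9]+")
URL_OR_HASHTAG = re.compile(r"(https?://\S+|#\S+)")
STOP_WORDS = {"من", "في", "على", "عن", "هل"}             # tiny illustrative list

def normalize_alef(text: str) -> str:
    # Replace ALEF-with-HAMZA (above/below) and ALEF MADDA by bare ALEF.
    return re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)

def preprocess(text: str) -> str:
    text = URL_OR_HASHTAG.sub(" ", text)    # drop hyperlinks and hashtags
    text = NON_ARABIC.sub(" ", text)        # drop non-Arabic tokens
    text = ARABIC_DIACRITICS.sub("", text)  # strip diacritics
    text = PUNCTUATION.sub(" ", text)       # strip punctuation and symbols
    text = normalize_alef(text)             # ALEF normalization
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```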

Features extraction
As mentioned earlier, two of the three models developed in this research were trained with pre-engineered features. After the data was cleaned, the following features were extracted:
- Character-level features: the total number of characters in the pair of questions, the number of different characters between the question pair, the ratio of different characters, the number of similar characters between the question pair, and the ratio of similar characters.
- Word-level features: the total number of words in the pair of questions, the number of different words between the question pair, the ratio of different words, the number of similar words between the question pair, and the ratio of similar words. Moreover, the similarity type of the question pair is used as another feature; it is computed as a binary feature based on the questions' interrogative particles (i.e., the similarity of the first word in each question). Also, text overlap features were computed at the word level based on our previous research [27]. They include the number of overlapping words divided by the number of words in Question1, the number of overlapping words divided by the number of words in Question2, and the harmonic mean of these two features.
- Morphological features: Stemming was used to represent each question. The Arabic language is morphologically rich [37]; therefore, representing the pair of questions using their stemmed words increases the chance of word similarity at the surface level.
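A compact sketch of the character-level, word-level, and overlap features is given below; the feature names and the exact ratio definitions are our assumptions about the paper's formulation:

```python
# Sketch of the pre-engineered pair features (illustrative definitions).
def pair_features(q1: str, q2: str) -> dict:
    w1, w2 = set(q1.split()), set(q2.split())
    c1, c2 = set(q1.replace(" ", "")), set(q2.replace(" ", ""))
    shared_w = len(w1 & w2)
    # Overlap features: shared words relative to each question, plus the
    # harmonic mean of the two ratios.
    r1 = shared_w / max(len(w1), 1)
    r2 = shared_w / max(len(w2), 1)
    harmonic = 2 * r1 * r2 / (r1 + r2) if (r1 + r2) else 0.0
    return {
        "total_chars": len(q1) + len(q2),
        "shared_chars": len(c1 & c2),
        "diff_chars": len(c1 ^ c2),
        "total_words": len(q1.split()) + len(q2.split()),
        "shared_words": shared_w,
        "diff_words": len(w1 ^ w2),
        # Binary feature: do both questions start with the same
        # interrogative particle?
        "same_interrogative": q1.split()[0] == q2.split()[0],
        "overlap_q1": r1,
        "overlap_q2": r2,
        "overlap_harmonic": harmonic,
    }
```

In the paper, a vector of such features (plus morphological and embedding features) is what the XGBoost and Siamese models consume.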

The developed models
This research implements three models for analyzing the semantic similarity of Arabic question pairs: i) a supervised machine learning model using XGBoost [6], ii) an adapted Siamese deep learning recurrent architecture based on the work of [7], and iii) a pre-trained deep bidirectional transformer based on the BERT model [8].
We extracted features from the dataset to train the first two models, the XGBoost and the Siamese neural network. The BERT model, however, is applied directly without any feature extraction step. We carefully selected these three models among many others for several reasons: XGBoost was the best performing model in [9], the Siamese deep learning model works well in semantic text similarity [7,39], and Google's BERT model is the state-of-the-art model used in several natural language processing (NLP) applications.

Supervised-machine learning model using XGBoost
XGBoost [6] is short for eXtreme Gradient Boosting. XGBoost is a scalable machine learning system for tree boosting, available as an open-source package. Among the 29 winning solutions in the machine learning competitions published by Kaggle in 2015, 17 adapted XGBoost. Of these 17 solutions, 8 used XGBoost alone to train the model, while the remaining 9 combined XGBoost with artificial neural networks in ensembles.
The XGBoost approach provides parallel tree boosting, known as gradient boosted regression trees (GBRT) or gradient boosting machines (GBM), which is a scalable and efficient implementation of the gradient boosting framework proposed by [40,41]. The XGBoost algorithm combines weak base learners into a stronger learner in an iterative manner. It is available in several languages such as Python, R, and Julia, and can be integrated with several data science pipelines such as scikit-learn. The XGBoost model is trained in an additive manner: as shown in (1), at the t-th iteration a new function f_t needs to be added to minimize the objective

L^(t) = Σ_i l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t)    (1)

where ŷ_i^(t−1) is the prediction of the i-th instance at the (t−1)-th iteration, l is the loss function, and Ω is a regularization term penalizing the complexity of f_t.
In this work, we used the XGBoost Python package introduced in [6] to train the model with the pre-defined features in order to develop an enhanced approach for learning semantic similarities in Arabic questions. The XGBoost classifier was trained using the extracted features explained in section 3.3, with a maximum tree depth of 6 and a learning rate (eta) of 0.06, 0.04, and 0.02, for 6,000 epochs each.

A Siamese deep learning recurrent architecture
The Siamese neural network implements two symmetric neural networks with shared weights to learn semantic similarities among inputs. Siamese neural networks are used in many semantic similarity applications, such as face verification using symmetric convolutional networks [42], speech understanding and speaker-specific information extraction [43], and semantic text similarity [7,39].
In this research, we utilized the Siamese neural network architecture to develop an enhanced approach for learning semantic similarities in Arabic questions. The model consists of two symmetric branches, each containing an embedding layer followed by a bi-directional long short-term memory (bi-LSTM) layer and then an LSTM layer. Being Siamese-based, the weights used to train the bi-LSTM and LSTM layers are shared between the two questions (see Figure 2). The outputs of the symmetric branches are then concatenated with the output of the features layer and fed to fully connected dense layers with batch normalization [44] and dropout [45] layers. The final layer is a dense classification layer with a sigmoid activation that produces a binary classification value indicating whether the questions are similar or not.
The batch normalization [44] and dropout [45] layers are used to regularize the output of the Siamese layers and to avoid common problems in deep learning: i) internal covariate shift (i.e., "the change in the distribution of network activations due to the change in network parameters during training" [44]) and ii) overfitting during neural network training. The term "dropout" refers to "randomly dropping out units (hidden and visible) in a neural network. By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections" [45]. L2 regularization with a value of 0.001 was used in the fully connected dense layers.
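A small-scale Keras sketch of this architecture is given below; the reduced dimensions, dropout rate, and dense-layer width are our own choices for illustration (the paper uses hidden and embedding sizes of 100):

```python
import numpy as np
from tensorflow.keras import layers, models, regularizers

VOCAB, EMB, SEQ_LEN, HIDDEN = 1000, 16, 10, 16

# Shared layers: one embedding, one bi-LSTM, one LSTM used for BOTH
# questions, so their weights are tied (the Siamese property).
embed = layers.Embedding(VOCAB, EMB)
bi_lstm = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True))
lstm = layers.LSTM(HIDDEN)

def encode(inp):
    return lstm(bi_lstm(embed(inp)))

q1_in = layers.Input(shape=(SEQ_LEN,))
q2_in = layers.Input(shape=(SEQ_LEN,))
feat_in = layers.Input(shape=(8,))       # pre-engineered features

# Concatenate both encoded questions with the feature vector, then
# classify through dense + batch-norm + dropout layers.
merged = layers.concatenate([encode(q1_in), encode(q2_in), feat_in])
x = layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001))(merged)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(1, activation="sigmoid")(x)   # similar / not similar

model = models.Model([q1_in, q2_in, feat_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```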

A Pre-trained deep bidirectional transformer based on BERT model
In our work, we also used the bidirectional encoder representations from transformers (BERT) model [8]. BERT is a state-of-the-art model used in several natural language processing (NLP) applications. BERT is a language representation model released by Google in October 2018 that utilizes the transformer architecture to train representations using unannotated data. The BERT model representations build on contextual representation approaches such as semi-supervised sequence learning [46], ULMFit [47], and ELMo [48].
The BERT model has many versions; see Table 3 for more details. The Uncased versions of the BERT model mean that the text has been lower-cased before tokenization, while the Cased versions mean that the case of the text is preserved. For this research, BERT-Base, Uncased is used, which consists of 12 layers with a hidden size of 768, 12 attention heads, and 110M trained parameters.

As shown in Figure 3, the input to the BERT model represents a pair of sentences, here a pair of questions, as one token sequence. The two questions are packed together into the input token sequence. The first token of the sequence is a special classification token called "[CLS]". To differentiate the two questions in the token sequence, a special token called "[SEP]" is used to separate them. Then, a learned embedding is added to every token to indicate whether it belongs to Question1 or Question2. The input embedding is denoted as "E", the final hidden vector of the special "[CLS]" token is denoted as "C", and the final hidden vector of the i-th input token is denoted as T_i. The input embeddings are the sum of the token embeddings, the segmentation embeddings, and the position embeddings.

As presented in Figure 4, the BERT-based model utilizes the transformer architecture to learn the semantic similarity of the input questions. Transformers [49] implement layers of multi-head self-attention with feed-forward and skip connections. In contrast to the traditional attention mechanism [50], multi-head self-attention attends only to the input text sequence, and the multi-head functionality enables each layer to attend to different words within the input sequence. The positional encoding mechanism represents the input sequence order, the positions of words within the sequence, and the distances between words as a vector that is added to the embedding layer.
These vectors help in capturing the contextual information within the input sequence. Each self-attention layer is followed by a residual connection, represented by a normalization layer that adds the input vector of the self-attention layer to its output vector, helping to carry forward information that would otherwise be lost to the next layer. For more information, the reader is referred to [49].
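The input packing described above can be illustrated with a toy whitespace tokenizer; real BERT uses WordPiece tokenization, so the token strings here are only placeholders:

```python
# Illustration of BERT's [CLS]/[SEP] input packing for a question pair.
def pack_pair(q1: str, q2: str):
    tokens = ["[CLS]"] + q1.split() + ["[SEP]"] + q2.split() + ["[SEP]"]
    # Segment ids: 0 for Question1's span (including [CLS] and the first
    # [SEP]), 1 for Question2's span.
    sep = tokens.index("[SEP]")
    segment_ids = [0] * (sep + 1) + [1] * (len(tokens) - sep - 1)
    return tokens, segment_ids

tokens, segs = pack_pair("اين ولد الرازي", "اين يقع متحف الرازي")
```

The segment ids play the role of the learned question embedding described above, and position embeddings would be added on top of both.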

EXPERIMENTATION AND RESULTS
In this study, the three developed models are evaluated using the Arabic questions dataset provided by mawdoo3.com. These three models are XGBoost [6], the Siamese neural network [7], and the BERT model [8]. The F1 measure, the harmonic mean of precision and recall, is used to evaluate the performance of the models.
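For reference, the F1 measure used here is the standard one and can be computed as follows (a minimal implementation of the textbook definition, not the paper's evaluation script):

```python
# F1 as the harmonic mean of precision and recall, for binary labels
# (1 = "similar", 0 = "not similar").
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```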
The XGBoost classifier was trained using the pre-engineered features computed on the training dataset, with a maximum tree depth of 6 and a learning rate (eta) of 0.06, 0.04, and 0.02 for 6,000 epochs. The Siamese-based model, on the other hand, was also trained using the pre-engineered features. The shared bi-LSTM and LSTM layers had 100 hidden units and an input size of 100. The model was trained for 100 epochs with early stopping triggered at epoch 98. Early stopping, used to avoid overfitting, is based on monitoring the validation loss value. Only the model with the best weights was saved and then used for evaluating the test dataset. We trained the model with the following hyper-parameters: hidden units=100, embedding size=100, batch size=512, learning rate=0.001, and number of epochs=98.
Finally, the BERT-based model was trained for 20 epochs with a data embedding size of 100, a batch size (BS) of 16, a learning rate (LR) of 2e-5 to 1e-5, a warm-up proportion (WP) of 0.1, and a number of iterations per loop (IPL) of 1,000 to 250,000. The model was trained on the stemmed version of the question pairs without using the other pre-engineered features. As shown in Table 4, the F1 score obtained in this research is the best result on this dataset, including our preliminary work in [9], whose best model achieved an F1-measure of 82.61%. Figure 5 depicts the model accuracy on both the training and validation datasets during the training phase, and Figure 6 shows the corresponding loss values. As depicted in both figures, no model overfitting can be seen during the training phase.

DISCUSSION
Taking a closer look at the experimentation results, it can be seen that the BERT-based model outperforms the other two models in terms of F1, scoring 3% higher than the Siamese-based model and 6% higher than the XGBoost one. The results are in line with the literature, as transformer-based models such as BERT [8], ULMFit [47], and ELMo [48] are revolutionizing NLP research and lead in many NLP tasks such as text classification and sequence-to-sequence labeling.
In contrast to the other two developed models, the BERT-based model was able to learn the semantic similarity among input question pairs without the need for pre-engineered features. This demonstrates the power of transformers in handling NLP tasks more efficiently than CNN- and RNN-based models. Computing features can also reduce the applicability of the developed model in production services: users may have a negative experience while waiting for the model to compute the features before classifying the input text.
Focusing on the pre-engineered features, the selected features boosted the results of the developed models. The Siamese-based model achieved an F1 value of only 78.186% without the pre-engineered features (relying only on the embedding features). This indicates how powerful the selected features are, with a margin of improvement reaching 10% for the Siamese-based model. It also emphasizes how powerful our BERT-based model is compared to the feature-less Siamese-based model, with a difference of around 15% in the achieved results.

CONCLUSION AND FUTURE WORK
This research proposes three different approaches to analyze the semantic similarity between a pair of Arabic questions. The first model is a supervised machine learning model using XGBoost trained with a set of pre-engineered features, the second is an adapted Siamese-based deep learning recurrent architecture also trained with pre-engineered features, and the third is a pre-trained deep bidirectional transformer based on the BERT model. The proposed approaches were evaluated using a dataset collected by mawdoo3.com (see section 3.1). The evaluation results show that the BERT-based model outperforms the other two proposed models with a 6% enhancement in the F1-score (see section 5).

In this research, we have only considered detecting whether two questions are similar. Retrieving all questions similar to a given question using our approach is an interesting avenue for future work. Besides that, we plan to enhance the BERT-based model architecture by combining the pre-engineered features with it and investigating their impact on the model's results. Moreover, we are planning to extract features from the BERT model and feed them into other machine learning approaches, utilizing the flexible encoder-decoder architecture in a transfer learning mechanism.