Arabic named entity recognition using a deep learning approach

Most Arabic Named Entity Recognition (NER) systems depend heavily on external resources and handcrafted feature engineering to achieve state-of-the-art results. To overcome these limitations, we propose in this paper a deep learning approach to the Arabic NER task. We introduce a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) and experiment with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model receives two sources of information about words as input, pre-trained word embeddings and character-based representations, eliminating the need for any task-specific knowledge or feature engineering. We obtain state-of-the-art results on the standard ANERcorp corpus with an F1 score of 90.6%.


INTRODUCTION
The Named Entity Recognition (NER) task aims to identify and categorize proper nouns and important nouns in a text into a set of predefined categories of interest such as persons, organizations and locations [1]. NER is a mandatory preprocessing module in several natural language processing (NLP) applications such as syntactic parsing [2], question answering [3] and entity coreference resolution [4]. Achieving the best performance on the NER task requires large amounts of external resources such as gazetteers, extensive hand-crafted feature engineering and heavy data pre-processing. However, developing such task-specific resources and features is costly and time-consuming. For morphologically rich languages like Arabic, the task becomes even more challenging due to the language's unique characteristics. The highly agglutinative nature of Arabic allows the same word to take on different morphological forms, which generates considerable data sparseness. Also, the absence of diacritics in most Modern Standard Arabic texts creates a lot of ambiguity, since many words share the same surface form without diacritics but have different named entity (NE) tags. Furthermore, unlike most European languages, Arabic has no capitalization, so it is not possible to use capitalization as a feature for detecting named entities. Finally, only a very limited number of linguistic resources, such as gazetteers and NE-annotated corpora, are freely available for researchers to build decent Arabic NER systems.
Mainly, researchers interested in NER for the Arabic language follow three approaches: rule-based [5], [6], machine learning (ML)-based [7], [8] and hybrid approaches [9]-[11]. These three approaches suffer from the same issues, since they need a lot of language-specific knowledge and extensive feature engineering to obtain useful results. This is even more accentuated by the lack of linguistic resources and the complex morphology of the language. Deep learning (DL) has proven to be very effective, yielding state-of-the-art results in various common NLP tasks such as sequence labeling [15], sentiment analysis [16], [17] and machine translation [18] for the English language. Unlike traditional approaches, DL is an end-to-end paradigm that does not rely on data preprocessing, manual feature engineering or large amounts of task-specific resources, and can be adapted to various languages and domains. This makes it a very attractive solution for a complex and low-resource language like Arabic.
Motivated by the success of deep learning in several NLP applications, we introduce an Arabic NER system based on deep neural networks. In the DL literature, two neural network architectures are widely used: convolutional neural networks (CNN) [19] and long short-term memory (LSTM) [20]. The neural network architecture that we introduce in this paper embraces both models. We employ a CNN to induce character-level representations of words and feed them, in conjunction with word embeddings, to a bidirectional LSTM network (BiLSTM) that performs the training. Finally, we use a conditional random fields (CRF) [21] layer to decode the input sequence.
Since the careful selection of optimal parameters can often make a huge difference in the performance of neural network architecture, we thoroughly investigated the impact of diverse hyperparameters on the overall performance of the chosen neural architecture and selected the best ones for our final model.
The main contributions of this paper are as follows:
a. Proposing a deep learning approach to address the Arabic NER task.
b. Evaluating and selecting the optimal hyperparameters for the proposed neural network architecture.
c. Confirming the advantage of integrating character-based representations for morphologically rich languages like Arabic.
d. Achieving state-of-the-art results on the standard ANERCorp corpus without the need for any feature engineering or domain-specific knowledge.

PROPOSED APPROACH
In this section, we outline the deep learning approach that we adopted to tackle the NER task for the Arabic language. We propose a neural network architecture composed of a BiLSTM layer and a CRF layer. First, we compute the character representation of each word using either a CNN or a BiLSTM (see Section 2.5 for details), then we concatenate it with the word embedding before feeding it into the BiLSTM layer. This layer is composed of two LSTM networks: the forward LSTM reads the word sequence from the beginning, while the backward LSTM reads it in the opposite order. Finally, the output vectors of both LSTM networks are concatenated and sent as input to the CRF layer to generate the tag predictions for the input sequence. The architecture of our neural network is illustrated in detail in Figure 1. We briefly describe the layers of our model in the following sections.

LSTM
Long short-term memory (LSTM) networks are variants of recurrent neural networks (RNN) specially designed to address the well-known exploding and vanishing gradient issues by adding an extra memory cell. LSTMs are very effective at capturing long-distance dependencies. They take as input a sequence of vectors (x_1, x_2, ..., x_n) of length n and return an output sequence of vectors (h_1, h_2, ..., h_n) called hidden states. The LSTM implementation used is represented by the following formulas at time t:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
c_t = (1 - i_t) ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ denotes the element-wise sigmoid function and ⊙ the element-wise product. i_t is the input gate vector, c_t the cell state vector and o_t the output gate vector. All W and b are trainable parameters.
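The formulas above can be sketched directly in NumPy. This is a minimal illustration of one LSTM step under the coupled input/forget-gate formulation (the forget gate is tied to 1 - i_t, and the peephole connections W_ci, W_co are taken as element-wise vectors for simplicity); it is not the paper's actual Keras/Theano implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: input gate i_t, coupled forget gate (1 - i_t),
    cell state c_t, output gate o_t, hidden state h_t."""
    W_xi, W_hi, w_ci, b_i = params["i"]
    W_xc, W_hc, b_c = params["c"]
    W_xo, W_ho, w_co, b_o = params["o"]

    i_t = sigmoid(W_xi @ x_t + W_hi @ h_prev + w_ci * c_prev + b_i)
    c_t = (1 - i_t) * c_prev + i_t * np.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + w_co * c_t + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

With all parameters at zero and zero inputs, the gates sit at σ(0) = 0.5 and both the cell state and hidden state remain zero, which is a quick sanity check of the update equations.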

BiLSTM
Despite their capability to capture long-distance dependencies, standard LSTMs are not very effective on sequence tagging tasks like NER. In fact, an LSTM unit can only take information from the past context, but for sequence tagging it is very useful to have access to both past and future information. To overcome this constraint, we use a bidirectional LSTM. The basic idea is to use two separate LSTM units: a forward LSTM that reads the sequence of words and induces a representation of the past context, and a backward LSTM that takes the same sequence in reverse and induces a representation of the future context. The final representation of a word is the combination of its past and future context representations.
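The forward/backward combination can be illustrated with a toy recurrence. The sketch below (a hypothetical tanh RNN stands in for each LSTM direction; names and weights are illustrative, not from the paper) shows how the backward pass is run on the reversed sequence, re-aligned, and concatenated with the forward pass:

```python
import numpy as np

def run_rnn(inputs, W, U, h0):
    """Toy tanh recurrence standing in for one LSTM direction;
    returns the hidden state at every timestep."""
    hs, h = [], h0
    for x in inputs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return hs

def bidirectional(inputs, W_f, U_f, W_b, U_b, d_h):
    h0 = np.zeros(d_h)
    fwd = run_rnn(inputs, W_f, U_f, h0)              # past context
    bwd = run_rnn(inputs[::-1], W_b, U_b, h0)[::-1]  # future context, re-aligned
    # word i is represented by [forward_i ; backward_i]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output vector has twice the hidden dimension, one half summarizing everything to the word's left, the other everything to its right.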

CRF layer
To predict the final tag sequence for the input sentence, we feed the output of the BiLSTM layer to a classifier. A very simple example of a classifier layer is softmax, which is suitable for simple tasks where the output tags are independent. For more complex sequence tagging tasks like NER, where there are strong dependencies between output tags, the independence assumption does not hold: in NER with the IOB2 format, for instance, I-LOC cannot follow B-PER. Hence, instead of decoding each tag independently, we jointly decode the tag predictions using a conditional random field component, which maximizes the tag probability of the whole sentence.
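The effect of joint decoding can be sketched with a tiny Viterbi decoder over a hand-built transition matrix that assigns -inf to IOB2-invalid transitions. This is an illustrative toy (hard constraints instead of the learned CRF transition scores used in the paper):

```python
import numpy as np

TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def allowed(prev, curr):
    """IOB2 constraint: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True

def viterbi(emissions):
    """Pick the best whole-sentence tag path; invalid transitions score
    -inf, so the decoder cannot emit e.g. B-PER followed by I-LOC."""
    n, k = emissions.shape
    trans = np.array([[0.0 if allowed(p, c) else -np.inf for c in TAGS]
                      for p in TAGS])
    score, back = emissions[0].copy(), []
    for t in range(1, n):
        total = score[:, None] + trans + emissions[t][None, :]
        back.append(total.argmax(axis=0))
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return [TAGS[i] for i in reversed(path)]
```

A greedy per-token decoder would happily output B-PER followed by I-LOC when those have the highest local scores; the joint decoder instead selects the best globally consistent path.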

Word embeddings
Word embeddings are dense, low-dimensional, real-valued vectors learned over unlabeled data using unsupervised approaches. Each word in an input sentence can be mapped to a pre-trained word embedding. Word embeddings generalize well, even to unseen words, since they capture useful semantic and syntactic relationships between words. These characteristics allow them to significantly boost the performance of various NLP tasks [15], [22]. In our neural network architecture, we use pre-trained word embeddings as input to efficiently initialize the lookup table of our model.
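The lookup-table initialization can be sketched as follows: pre-trained vectors are copied in where available, and the remaining (out-of-vocabulary) rows are drawn from a small uniform range. The range and seed here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def build_lookup_table(vocab, pretrained, dim, seed=0):
    """Initialize the embedding matrix: copy the pre-trained vector when a
    word is covered, otherwise draw a small random vector for OOV words."""
    rng = np.random.default_rng(seed)
    table = np.empty((len(vocab), dim))
    for idx, word in enumerate(vocab):
        if word in pretrained:
            table[idx] = pretrained[word]
        else:
            table[idx] = rng.uniform(-0.25, 0.25, dim)
    return table
```

During training, the rows of this table are updated by back-propagation like any other parameters; the pre-trained values merely provide a good starting point.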

Character representations
The use of word embeddings alone is usually sufficient to get the best performance for the English language. For morphologically rich languages like Arabic, the richness of morphological forms makes vocabulary sizes larger and the out-of-vocabulary (OOV) rate relatively higher. Hence the need for another representation of a word, based on its characters, to effectively capture orthographic and morphological information such as prefixes and suffixes and encode it into neural representations usable by our model. There are mainly two ways to learn character representations. We can use convolutional neural networks [15] to encode a character-based representation of a word; Figure 2 shows the CNN architecture used. Alternatively, we can use bidirectional LSTMs [22] to generate a character-based representation of a word from its characters; Figure 3 describes the BiLSTM architecture.
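The CNN variant can be sketched as a 1-D convolution over character embeddings followed by max-pooling over time, which produces a fixed-size vector regardless of word length. This is a minimal NumPy illustration (no padding or bias, illustrative filter shapes), not the paper's Keras implementation:

```python
import numpy as np

def char_cnn(char_embeds, filters, width=3):
    """char_embeds: (n_chars, d_char) matrix of character embeddings.
    filters: (n_filters, width * d_char) flattened convolution filters.
    Slide each filter over character windows, then max-pool over time."""
    n, d = char_embeds.shape
    # One row per window of `width` consecutive characters, flattened
    windows = np.stack([char_embeds[i:i + width].reshape(-1)
                        for i in range(n - width + 1)])
    conv = windows @ filters.T          # (n_windows, n_filters)
    return conv.max(axis=0)             # (n_filters,) word representation
```

In practice, short words are padded so that at least one window exists; max-pooling is what lets words of different lengths map to the same output dimensionality.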

EXPERIMENTS
This section provides details about the training of our neural network. Since the achievement of state-of-the-art results using neural networks requires the selection and optimization of many hyperparameters, we will also study the impact of the hyperparameters and the parameter initialization on the overall performance of our models. We will precisely evaluate the impact of the following hyperparameters: pre-trained word embeddings, character representation, dropout, and optimizers.

Network training
Our neural model is implemented using the Keras API with the Theano library as a backend [23]. Training is done with the back-propagation algorithm using the Adam optimizer. We clip gradients to a norm of 1 to deal with exploding gradients. For all our experiments, we train with a mini-batch size of 8 for up to 50 epochs and apply early stopping with a patience of 5 based on the performance on the validation set. The remaining default hyperparameter settings are summarized in Table 1.
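The early-stopping logic described above (50 epochs maximum, patience of 5 on the validation score) can be sketched in a few lines. The `run_epoch` callback is a hypothetical stand-in for one epoch of training plus validation scoring:

```python
def train_with_early_stopping(run_epoch, max_epochs=50, patience=5):
    """run_epoch(epoch) -> validation F1 for that epoch.
    Stop once the score has not improved for `patience` epochs in a row;
    return the best epoch and its score."""
    best_f1, best_epoch, waited = -1.0, -1, 0
    for epoch in range(max_epochs):
        f1 = run_epoch(epoch)
        if f1 > best_f1:
            best_f1, best_epoch, waited = f1, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_f1
```

In Keras this corresponds to the built-in `EarlyStopping` callback; the sketch just makes the stopping criterion explicit.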

Pre-trained word embeddings
We employ pre-trained word representations to initialize our lookup table. We learned our own word embeddings of dimension 50 using the Arabic Wikipedia dump of December 2016. To assess whether the choice of learning algorithm matters, we experiment with 5 models, namely SkipGram [24], CBOW [25], GloVe [26], FastText [27] and Hellinger PCA (H-PCA) [28]. We also assess the impact of the vector size by varying it between 50 and 500 for the best performing algorithm.

Character representations
In this experiment, we check whether the use of character representations is helpful and really has a tangible impact on the performance of the network. Additionally, we compare the CNN and BiLSTM approaches to learning character-based representations and analyze which one is to be preferred in terms of performance.

Dropout
Dropout is a key method to regularize the neural model and mitigate overfitting. In this experiment, we evaluate three setups: no dropout, naive dropout, and variational dropout [29]. The dropout rate is selected from the set {0.25, 0.5, 0.75}.
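The difference between the two dropout variants is in how masks are sampled: naive dropout draws a fresh mask at every timestep, while variational dropout samples one mask per sequence and reuses it across all timesteps. A minimal NumPy sketch of the two (inverted-dropout scaling, illustrative only):

```python
import numpy as np

def naive_dropout(seq, rate, rng):
    """Fresh Bernoulli mask at every timestep, scaled by 1/(1-rate)."""
    return [x * (rng.random(x.shape) >= rate) / (1 - rate) for x in seq]

def variational_dropout(seq, rate, rng):
    """One mask sampled per sequence and reused at every timestep,
    so the same units are dropped along the whole sequence."""
    mask = (rng.random(seq[0].shape) >= rate) / (1 - rate)
    return [x * mask for x in seq]
```

Reusing the mask across timesteps is what makes variational dropout suitable for recurrent connections, where resampling at every step would accumulate noise over long sequences.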

Optimizer
The optimizer is the algorithm that minimizes the objective function of the neural network. The choice of optimizer can influence both the performance and the training time of our model. We experiment with 6 popular optimizers, namely Stochastic Gradient Descent (SGD), Adagrad, Adadelta, Adam, Nadam, and RMSProp.
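To make the comparison concrete, here is a minimal sketch of a single parameter update under plain SGD versus Adam (standard textbook update rules with their usual default constants; not tied to any particular framework):

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Plain SGD: step in the negative gradient direction."""
    return w - lr * grad

def adam_update(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step; `state` carries the moment estimates (m, v, t)."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```

Adam's per-parameter adaptive step size is what makes it far less sensitive to the learning rate than SGD, which matters when, as in our experiments, the learning rate is not manually tuned per optimizer.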

Data sets
To evaluate the impact of the hyperparameters, we use the Arabic Wikipedia named entity corpus (AQMAR) [30]. It is a small annotated corpus of 74K tokens that we chose for convenience, given the limited computational power available to run our experiments. The corpus statistics are depicted in Table 2. For the comparison with previous state-of-the-art Arabic NER systems, we use the ANERCorp corpus. It is a publicly available dataset and considered the standard benchmark for the Arabic NER task. The corpus statistics are summarized in Table 3.
The training of neural networks is a non-deterministic process, as it typically depends on a random number generator to initialize the weights of the network [30]. To mitigate the impact of this randomness on the evaluation of our neural network, we execute all the experiments 5 times and use the average of the F1 scores as the comparison metric.

Table 4 shows the impact of the various pre-trained word embeddings on the Arabic NER task. Although we ran all five learning algorithms with their default settings on the same unlabeled data, FastText consistently outperformed the other models with an average F1 score of 70.86%. The second best model is SkipGram with an F1 score of 61.91%. In fact, FastText is an extension of SkipGram, but instead of using words directly, it learns word embeddings over character n-grams. This simple trick allows it to take the morphology of words into account and helps it deal with rare and out-of-vocabulary words, which are frequent in a morphologically rich language like Arabic. Hence, our empirical results show that FastText is more suitable for these types of languages than the other learning algorithms.

In Table 5, we vary the size of the FastText word embeddings to see if it influences the performance of our system. Surprisingly, increasing the vector size did not further enhance the performance, even with values as large as 500; rather, it decreased it. A vector dimension of 50 was therefore optimal in our case. While intrinsic tasks like word similarity usually show a clear preference for higher vector dimensionality to effectively capture semantic relationships between words, extrinsic tasks like NER require more careful tuning to find the optimal dimensionality and tend to favor lower vector sizes [31].

Concerning character representations, Table 6 shows that using them yields significantly better performance on the Arabic NER task.
Precisely, the CNN approach was superior to the BiLSTM one in all 5 runs of our setup. Thus, we adopt it as the default setting for all subsequent experiments due to its superiority and its higher computational efficiency. Interestingly, recent studies [31] suggest that there is no statistical difference when character representations are applied to the English NER task. Indeed, for languages like English that do not exhibit morphological richness, character representations are not mandatory for the best results, but for morphologically rich languages like Arabic they are crucial to deal with the complexity and the higher number of rare and out-of-vocabulary words observed.

In Table 7, we study the impact of dropout. We evaluate three options: naive dropout, variational dropout, and no dropout, and select the dropout rate from the set {0.25, 0.5, 0.75}. We observe the best performance with a dropout rate of 0.25. Naive dropout produces the best results with an average F1 of 71.08%, while variational dropout yields a competitive 70.52%.

Table 8 depicts the results for the different optimizers applied to our neural network, using the settings recommended by the authors of each optimizer. Adam shows the best performance, yielding the highest score of 70.86%. Nadam, a variant of Adam with Nesterov momentum, achieves a very competitive 70.57%. Remarkably, SGD produces the worst score, 33.82%. SGD is very sensitive to the choice of the learning rate, and since we did not fine-tune it manually, it failed to converge to a minimum. Furthermore, early stopping did not help, as SGD usually needs more epochs to find the global minimum of the objective function.

RESULTS AND DISCUSSION
In order to compare our neural network with the best performing Arabic NER systems, we apply our BiLSTM-CRF model to the standard ANERcorp dataset using the best hyperparameters selected during our previous experiments. Since the performance of naive and variational dropout was quite close, as was that of the Adam and Nadam optimizers, we experimented with the different combinations of these hyperparameters together with the other best settings to make sure we had the optimal setup for our model. Table 9 summarizes the results. The best performance of our BiLSTM-CRF model is achieved using Nadam as optimizer and variational dropout, with an average F1 score of 90.60%. In Table 10, we present the results of our system in comparison with three previous top performing systems for Arabic NER. Our system achieves significant improvements over [7] and [9] on the standard ANERcorp dataset with an F1 score of 90.6%, and performs on par with the state-of-the-art system of [10], scoring only 0.06% lower.
In fact, the system introduced by Shaalan and Oudah [10] is a hybrid model that combines a machine learning-based component and a rule-based component. It relies heavily on the task-specific and language-dependent knowledge provided by the rule-based component and uses many handcrafted engineered features, including morphological features, POS tags, capitalization features and gazetteers, to achieve state-of-the-art performance. Our BiLSTM-CRF model, on the other hand, has the advantage of being a true end-to-end system that does not require any feature engineering, data pre-processing or external resources and can therefore be easily extended to other domains with minimal tweaking.

Table 10. Comparison with previous top performing Arabic NER systems (F1 %):
[7] 75.66
Abdallah et al. [9] 88.33
Shaalan and Oudah [10] 90.66
Our system 90.60

CONCLUSION
This paper proposes a neural network architecture for Arabic NER based on bidirectional LSTMs. We evaluated different commonly used hyperparameters for our BiLSTM-CRF architecture to assess their impact on the overall performance. Our best model obtains state-of-the-art results with an F1 of 90.6% using FastText pre-trained word embeddings, CNN character representations, variational dropout and the Nadam optimizer.
In comparison with previous state-of-the-art Arabic NER systems, our neural model is truly end-to-end and does not depend on any data preprocessing, external task-specific resources or handcrafted feature engineering. It is also very flexible, allowing the effortless addition of other types of named entities such as numeral and temporal NEs, facilities, geo-political NEs, etc.
Our ongoing work is to explore multi-task learning approaches and see whether they can further improve our model. We also hope to extend this work to other domains, such as noisy user-generated text, which is more challenging.