ATAR: Attention-based LSTM for Arabizi transliteration

A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25 k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).


INTRODUCTION
As stated by many researchers [1][2][3], social media users express themselves in ways that differ from the standard format. Social media content exhibits frequent use of informal vocabulary, non-standard abbreviations, typos, and many idiosyncrasies such as repeating letters for emphasis and writing out non-linguistic content like emojis and sound reactions [4][5][6]. For several reasons, these issues are more complicated for Arabic content. Examples of these reasons include the prevalent use of dialectal Arabic (DA) and its grave deviations from modern standard Arabic (MSA) [7]. Another reason is the common use of a non-standard romanized way of writing Arabic words known as Arabizi. There are many reasons for the widespread use of Arabizi, such as the lack of support for Arabic script on some devices/platforms, the difficulties some users face in typing Arabic script, and the relative ease of code-switching between Arabizi and English or French compared with Arabic script. Even though Arabizi is not used by all social media users, it is common enough to warrant studies focusing solely on it [8][9][10][11][12][13][14][15][16][17].
In this work, we address the problem of Arabizi transliteration by presenting ATAR, an ATtention-based encoder-decoder model for ARabizi transliteration. This novel neural network-based approach follows the celebrated attention-based encoder-decoder model of [32]. To evaluate ATAR, we present a "first of its kind" dataset consisting of 21.5 K words from the Jordanian dialect.
The rest of this paper is organized as follows: the following section gives a high-level view of the related work, while section 3 presents our ATAR model and discusses its details. Section 4 describes the dataset we created and section 5 presents our evaluation of the proposed model on the collected dataset. Finally, the paper is concluded in section 6.

RELATED WORK
Due to the importance of the Arabizi-Arabic script transliteration problem, several companies, such as Google and Microsoft, have invested money and effort into developing tools for this problem. Examples of such tools include Google Ta3reeb (http://www.google.com/ta3reeb), Microsoft Maren (https://www.microsoft.com/en-us/download/details.aspx?id=20530), Facebook's automatic translation services (https://engineering.fb.com/ml-applications/expanding-automatic-machine-translation-to-more-languages/), Rosette Chat Translator (https://www.basistech.com/text-analytics/rosette/chat-translator/), and Yamli (https://www.yamli.com/). However, these tools are mostly closed-source, and very little is known about the approaches they follow or the resources they employ. On the other hand, the effort within the Arabic NLP research community to address the Arabizi-Arabic script transliteration problem has been rather modest. The existing resources are limited and not publicly available, and the proposed approaches do not follow the new and exciting approaches in the field of sequence learning [33].
Existing work on Arabizi transliteration, such as [2, 22-24, 34, 35], followed basic approaches that used character-to-character mappings to generate lattices of multiple alternative words, from which a further selection is made using language models. The approach proposed by [36] combines a rule-based model and a discriminative model based on conditional random fields (CRF) for transliterating Tunisian dialect Arabizi texts into standard Arabic. As for the datasets used by these works, only that of [23] is reported to be publicly available [2]; however, it is very small, with only 2.2 K word pairs. It was used in the development of [24]'s system in addition to 6,300 Arabic-English proper name pairs from [37]. The reported accuracy of [24]'s system is 69.4% and it was later used by [2].
Another interesting effort in creating useful resources for the Arabizi transliteration problem is the work of Bies et al. [2]. The authors discussed how the linguistic data consortium (LDC) collected and annotated a huge parallel corpus of Arabizi content and its Arabic script counterpart as part of the DARPA broad operational language translation (BOLT) program (Phase 2). The corpus consisted of more than 408 K words and it mainly focused on the Egyptian dialect.
A few papers [27][28][29][30] discussed the use of deep learning for the problem of Arabic transliteration. In [27,28], the authors claimed to use a standard RNN encoder-decoder model for transliterating sentences written in the Algerian dialect, but they did not provide any details of the model. Moreover, the dataset they considered is rather small (1.3 k sentences). In a more detailed work, Younes et al. [29] used a standard RNN encoder-decoder model for transliterating words in the Tunisian dialect. Their dataset was relatively big, with 45.6 k word pairs. In a follow-up work [30], they expanded their work and discussed how to adapt three well-known machine translation models to the problem of transliterating the Tunisian dialect. The first was a CRF, while the second was a bidirectional RNN with long short-term memory cells (BLSTM). The third was a BLSTM with a CRF decoder. The results show the superiority of the latter approach over the former two.
Transliteration systems have been proposed for many languages other than Arabic. However, such systems are usually designed to transliterate between two closely related languages. Examples include the work of Musleh et al. [38] on transliterating Urdu to Hindi, the work of Nakov et al. [39] on transliterating Portuguese and Italian to look like Spanish and the work of Nakov et al. [40] on transliterating Macedonian to Bulgarian.

ATAR: ATTENTION-BASED LSTM FOR ARABIZI TRANSLITERATION
Over the past decade, deep learning approaches have made a ground-breaking impact on many fields such as NLP, image processing, and computer vision [41][42][43][44]. A particularly interesting and challenging set of problems, known as sequence learning problems, has been heavily studied by deep learning researchers. A special kind of neural network, known as the recurrent neural network (RNN), has been shown to perform very well on many sequence learning problems in natural language understanding (NLU) and natural language generation (NLG). However, RNNs suffer from issues such as the vanishing gradient problem. To address this problem, Hochreiter and Schmidhuber [31] proposed to equip RNNs with memory cells, creating what they called LSTM networks.
For sequence-to-sequence problems (like the one at hand), a general approach known as the encoder-decoder approach has been found to be very successful. The approach is based on the idea of learning efficient representations of the input using an RNN (or LSTM) as an "encoder network" and using another RNN (or LSTM) as a "decoder network" that takes this feature representation as input, processes it to make its decisions, and produces an output (see https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning). In the rest of this section, we present the details of our attention-based LSTM model for Arabizi transliteration, which we call ATAR.

Model architecture
Our transliteration model is inspired by the attentional sequence-to-sequence (seq2seq) model proposed by [32], which is based on the encoder-decoder architecture shown in Figure 1. The seq2seq architecture consists of an RNN encoder, which learns representations of an input sequence $X = \{x_1, x_2, \ldots, x_n\}$ of varying length, and an RNN decoder, which reads the hidden representation produced by the encoder and generates an output sequence $Y = \{y_1, y_2, \ldots, y_m\}$ of varying length. The model takes its input from an embedding layer that maps a one-hot vector of vocabulary size, which in our case is the number of distinct letters (47 in Arabizi and 36 in Arabic), to a fixed-size dense vector representing the semantic features of the input letter. It is worth mentioning that there is no <UNK> token in our case because each word (i.e., sequence) is a combination of a limited set of predefined letters. In our architecture, each unit in the encoder and decoder is an LSTM cell, which solves the vanishing gradient problem with its memory cells [31].
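To make the architecture concrete, the following is a minimal sketch of such a character-level attentional encoder-decoder in TensorFlow/Keras. The vocabulary sizes (47 Arabizi letters and 36 Arabic letters) come from the text above; the embedding and hidden dimensions, the special-token handling, and all other settings are illustrative assumptions, not the configuration reported in Table 3.

```python
# A minimal character-level encoder-decoder sketch in TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers

SRC_VOCAB = 47 + 2   # Arabizi letters plus assumed <start>/<end> markers
TGT_VOCAB = 36 + 2   # Arabic letters plus assumed <start>/<end> markers
EMB_DIM, HID_DIM = 64, 256  # illustrative placeholders, not Table 3 values

# Encoder: embed each Arabizi character and run an LSTM over the sequence.
enc_inputs = layers.Input(shape=(None,), dtype="int32", name="arabizi_chars")
enc_emb = layers.Embedding(SRC_VOCAB, EMB_DIM)(enc_inputs)
enc_outputs, state_h, state_c = layers.LSTM(
    HID_DIM, return_sequences=True, return_state=True)(enc_emb)

# Decoder: an LSTM initialized with the encoder's final state generates
# Arabic characters step by step (teacher forcing during training).
dec_inputs = layers.Input(shape=(None,), dtype="int32", name="arabic_chars")
dec_emb = layers.Embedding(TGT_VOCAB, EMB_DIM)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(
    HID_DIM, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Dot-product (Luong-style) attention over all encoder hidden states.
context = layers.Attention()([dec_outputs, enc_outputs])
dec_concat = layers.Concatenate()([context, dec_outputs])
probs = layers.Dense(TGT_VOCAB, activation="softmax")(dec_concat)

model = tf.keras.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```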
Instead of relying on a single thought vector from the encoder, many researchers [32,45] proposed augmenting the encoder-decoder architecture with attention. The idea behind the attention mechanism is to link each time step of the decoder with the most "convenient" time step(s) of the encoder input sequence. This is done by utilizing a global attentional model that takes all the hidden states of the encoder $h_s$ and the current target state $h_t$ into consideration to calculate the attention score. In this paper, the dot product function is used to compute the attention score: $\mathrm{score}(h_t, h_s) = h_t^\top h_s$.
Following this step, the alignment vector $a_{ts}$ is computed for each state by applying a softmax function to normalize all scores, producing a probability distribution conditioned on the target state:

$$a_{ts} = \frac{\exp\left(\mathrm{score}(h_t, h_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, h_{s'})\right)}$$

The decoder then computes a global context vector $c_t$ as a weighted average over all the source states, based on the alignment vector:

$$c_t = \sum_{s} a_{ts} h_s$$

The decoder then takes the context vector as an additional input at the next time step.
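The sketch below spells out this global dot-product attention computation step by step. It mirrors the equations above and is illustrative rather than the exact implementation used in ATAR.

```python
# Explicit global dot-product attention, mirroring the equations above.
import tensorflow as tf

def global_dot_attention(dec_state, enc_states):
    """dec_state: [batch, hid]; enc_states: [batch, src_len, hid]."""
    # score(h_t, h_s) = h_t . h_s  ->  [batch, src_len]
    scores = tf.einsum("bh,bsh->bs", dec_state, enc_states)
    # a_ts: softmax over the source positions
    align = tf.nn.softmax(scores, axis=-1)
    # c_t = sum_s a_ts * h_s  ->  [batch, hid]
    context = tf.einsum("bs,bsh->bh", align, enc_states)
    return context, align
```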

DATASET
We use Arabizi-Arabic script parallel words to perform our Arabizi transliteration experiments. Due to the lack of such publicly available parallel data, we crawled Arabizi data written in the Jordanian dialect from different sources, such as Twitter, Facebook, and ASK. The crawled words are ones regularly used in daily life. We were able to collect 21.5 K unique Arabizi words, which were then transliterated into the Jordanian dialect using only Arabic letters. A group of native speakers validated the parallel data by correcting any spelling mistakes, removing redundant letters, and omitting any unneeded special characters.
One of the contributions of this work is to make this "first of its kind" dataset publicly available at https://github.com/bashartalafha/Arabizi-Transliteration. Table 1 shows samples of our parallel data. The average length of the collected words is about 5 letters per word, with a maximum word length of 12 letters and a minimum of 2 letters.

Table 1. Examples of our parallel corpus

It is worth mentioning that the same word in Arabizi can have different representations in the Jordanian dialect, since not all people write it the same way, yet all these representations are correct. Table 2 shows a few such examples. This issue was faced by earlier work on Arabizi transliteration, such as [2,9], and it is discussed in detail therein. As stated by these researchers, such variations can penalize the model and give it a lower score, since some transliterations are correct even though they differ from the reference.

EXPERIMENTS AND EVALUATION
To evaluate the performance of our proposed model, we implement it using TensorFlow (we select TensorFlow for its efficiency and ease of use; for a comparison of different deep learning frameworks, the interested reader is directed to [46]) and perform several experiments using our dataset. After shuffling and lowercasing the data, we use the first 80% of the dataset as the training set, the next 10% as the validation set, and the remaining 10% as the testing set. As for the evaluation metrics, we use the two most common measures for the Arabizi transliteration task: accuracy and bilingual evaluation understudy (BLEU) [47]. Finally, to aid the reproducibility of our results, both the dataset and the model are made publicly available at https://github.com/bashartalafha/Arabizi-Transliteration.
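As an illustration, a split and scoring along these lines might look as follows. The file name, the transliterate inference function, and the choice of computing BLEU over characters are assumptions made for the sake of the example, not details confirmed by the paper.

```python
# A sketch of the 80/10/10 split and evaluation described above.
import random
from nltk.translate.bleu_score import corpus_bleu

pairs = []
with open("arabizi_arabic_pairs.tsv", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        arabizi, arabic = line.rstrip("\n").split("\t")
        pairs.append((arabizi.lower(), arabic))  # lowercase the Arabizi side

random.seed(0)
random.shuffle(pairs)
n = len(pairs)
train = pairs[: int(0.8 * n)]
valid = pairs[int(0.8 * n): int(0.9 * n)]
test = pairs[int(0.9 * n):]

# Word-level accuracy and character-level BLEU against the references.
predictions = [transliterate(src) for src, _ in test]  # hypothetical inference fn
accuracy = sum(p == ref for p, (_, ref) in zip(predictions, test)) / len(test)
bleu = corpus_bleu([[list(ref)] for _, ref in test],
                   [list(p) for p in predictions])
print(f"accuracy={accuracy:.2%}, BLEU={100 * bleu:.2f}")
```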
When using an attentional encoder-decoder sequence-to-sequence translation model, one has to account for the many hyperparameters that can affect its performance. This issue is so important that complete studies have been dedicated to it, such as [48], which reported the use of more than 250 K GPU hours for experimentation. For our work, we use the work of Britz et al. [48] as well as Ruder's blog (http://ruder.io/deep-learning-nlp-best-practices/) and Brownlee's blog (https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/) to guide our search for the best hyperparameter values; a sketch of such a search is given below. The values that give the best performance are listed in Table 3. For this configuration, the accuracy is 79% and the BLEU score is 88.49. ATAR thus achieves good results. However, it does have its limitations, such as the lack of support for the various Arabic dialects. To address this, one might benefit from existing multi-dialect parallel datasets [49][50][51][52][53][54] or build new ones (perhaps by benefiting from unsupervised approaches for dialect translation [7]). Another issue that can be addressed before adopting ATAR in real-life scenarios is increasing the model's accuracy. This can be done either by considering other sequence-to-sequence models, such as Facebook's convolutional sequence-to-sequence model [55] and Google's attention-only Transformer model [56], or by combining it with a neural diacritization model [57,58].
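As a rough illustration of the hyperparameter search mentioned above, the toy loop below sweeps a small grid on the validation set. The grid values, the build_atar helper, and the train_ds/valid_ds datasets are hypothetical placeholders, not the configuration reported in Table 3.

```python
# Toy grid search over a few hyperparameters (illustrative values only).
import tensorflow as tf

best = None
for hid_dim in (128, 256, 512):
    for dropout in (0.0, 0.2):
        for lr in (1e-3, 5e-4):
            model = build_atar(hid_dim=hid_dim, dropout=dropout)  # hypothetical builder
            model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                          loss="sparse_categorical_crossentropy")
            hist = model.fit(train_ds, validation_data=valid_ds,
                             epochs=20, verbose=0)  # assumed tf.data pipelines
            score = min(hist.history["val_loss"])
            if best is None or score < best[0]:
                best = (score, hid_dim, dropout, lr)
print("best (val_loss, hid_dim, dropout, lr):", best)
```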

CONCLUSION
In this paper, we addressed the Arabizi transliteration problem. This work makes two significant contributions to this problem. The first is to collect and publicly distribute the first large-scale Arabizi-Arabic script parallel corpus focusing on the Jordanian dialect, consisting of more than 25 k pairs carefully created and inspected by native speakers to ensure the highest quality. In the second contribution, we presented one of the first detailed and reproducible efforts to employ the celebrated attention-based seq2seq model for Arabizi transliteration. The presented model, which we call ATAR, performed very well in the experiments we conducted, reaching an impressive accuracy of 79% and a BLEU score of 88.49. Future directions include experimenting with other sequence-to-sequence models, such as Facebook's convolutional sequence-to-sequence model and Google's attention-only Transformer model. We also plan to expand our work to other Arabic dialects. Finally, we will explore the generation of more accurate MSA text from the transliteration output by combining our model with a neural diacritization model.