Arabic spellchecking: a depth-filtered composition metric to achieve fully automatic correction

Digital learning environments have evolved considerably in recent years thanks to advances in information technology. Computer-assisted text creation and editing tools represent a future market in which natural language processing (NLP) concepts will be used. This is particularly the case for the automatic correction of spelling mistakes, which data entry operators use daily. Unfortunately, current spellcheckers remain writing aids: they cannot perform correction automatically without the user's assistance. In this paper, we propose a filtered composition metric based on the weighting of two lexical similarity distances in order to achieve auto-correction. The approach consists of two phases. The first correction phase combines two well-known distances: the edit distance, weighted by relative weights derived from key proximity on the Arabic keyboard and from the calligraphic similarity between Arabic letters, is combined with the Jaro-Winkler distance to better weight and filter solutions that have the same metric. The second phase acts as a booster for the first: it applies a probabilistic bigram language model to the candidate solutions, which may have the same lexical similarity measure after the first correction phase. Evaluating our filtered composition measure on a dataset of errors yielded an auto-correction rate of 96%. This is an open access article under the CC BY-SA license.

INTRODUCTION
Over the past several years, theoretical linguistics, computational linguistics and new information and communication technologies have evolved remarkably. As a result of these advances, thousands of electronic documents such as newspapers, emails, blogs and personal and professional documents (theses, final study projects, and scans) are produced every day. Text is often typed in a hurry and not revised afterwards, so errors are common in typed text.
Spelling correction systems in word processing applications are therefore of paramount importance for effective and unambiguous writing. Given this need, automatic spelling correction applications are now ubiquitous and are integrated into computer tools such as word processors, email, social media, information retrieval systems and search engines, which are used every day by millions of people around the world. Such tools make writing more effective and reduce the degree of ambiguity in the text, because errors are economically expensive.
Int J Elec & Comp Eng, ISSN: 2088-8708
In the field of natural language processing (NLP), automatic spell checking and correction remains the oldest and most important research axis. Research in this area dates back to the 1960s [1], [2] and continues to the present. The first studies aimed to model the notion of a spelling error: according to Damerau [3] and later Levenshtein [4], an error is a single or multiple combination of the elementary editing operations of insertion, deletion and permutation of characters inside a lexicon word. Based on this model, several methods and algorithms have been proposed for spelling correction. We distinguish two broad categories of correction methods: combinatorial methods and metric methods. Combinatorial methods [5], [6] generate all the possible sequences from which the erroneous word could be derived by applying elementary editing operations, and keep only the sequences that belong to the lexicon [7], [8].
Metric methods compare the erroneous word with the entries of a given dictionary by calculating a lexical similarity measure [9]. The solutions for the erroneous form are those with the minimum metric. Metric methods are regarded as the best methods because they yield better results, and they are the ones implemented in spelling correction systems [10]. A different approach, used for the Indonesian language, relies on dictionary search and a morphological analysis module as its spelling correction strategy [11]-[14]. The major limitation of metric-based correction methods is that they do not differentiate between several solutions that have the same lexical similarity measure. Such spellcheckers therefore need the user's assistance: they present the nearest solutions to the erroneous word, and the user selects the correct one.
Our main goal in this article is to improve the correction process in order to obtain a fully automatic spell checker. In other words, is it possible to develop a fully automatic spelling correction system that does not negotiate solutions with the writer, so that the first solution in the suggestion list is the one desired by the user? To reach this goal, we ran a learning test on a training corpus in order to identify and analyze the nature of the spelling errors committed. Based on this analysis, we integrated the parameters estimated on our learning corpus, namely probabilistic weights related to the elementary editing operations, to improve the spelling correction process [15].
After collecting our corpus of typed texts, we identified and analyzed the errors committed by the operators and stored the misspelled words with their corresponding correct words in a database. Generally, the spelling errors committed consist of insertions, deletions and permutations, known as the elementary editing operations. Table 1 gives the error rate for each type of elementary editing operation, calculated from our training test.

− Analysis of permutation errors
The first question raised by Table 1 was why data entry operators commit so many permutation errors. Analyzing these errors led us to conclude that they are mainly caused by two factors: the proximity between keys on the Arabic keyboard and the calligraphic similarity between some Arabic characters [16].

− Analysis of deletion and insertion errors
While the analysis of permutation errors readily revealed a direct link between the permuted characters in a lexical word, this was not the case for deletion and insertion errors. Nevertheless, after examining the distribution of keys on the Arabic keyboard, we arrived at two different interpretations of these two types of errors: 85.30% of the erroneous insertions and/or deletions of characters in a lexical word depend on the proximity, to the left or to the right, of keys on the Arabic keyboard. The remaining 14.70% do not depend on key proximity, such as the insertion or deletion of blank spaces.
Table 2 illustrates our interpretations with examples of insertion and deletion errors. We added two columns to show whether there is a proximity relationship on the keyboard between the inserted or deleted characters (see the Arabic keyboard in Figure 1). For example, for the first erroneous word and its corresponding correct word in the database, we concluded that the erroneous insertion of the character " " is due to the proximity, on the left and on the right, between the characters " " and " ".
According to these statistics and the interpretations deduced from them, we can confirm that the majority of the editing errors made (insertion, deletion, permutation) are caused by the proximity and similarity of the character keys on the Arabic keyboard [17], [18]. In the rest of this paper, we model these interpretations as matrices of probabilistic weights. These weights are assigned to each elementary editing operation when calculating lexical similarity during the spelling correction process [19].

THE PROPOSED METHOD
In this article, we propose to assign probabilistic weights related to the proximity between keyboard keys and to the calligraphic similarity between Arabic characters. These weights are assigned to the different editing operations when calculating the lexical similarity between the erroneous word and Arabic dictionary entries, for a spelling correction method based on edit distance. The analysis of permutation errors showed that a permutation can be caused by two factors: the proximity between keys on the Arabic keyboard or the degree of calligraphic similarity between some Arabic characters. We therefore model these factors as a proximity matrix and a calligraphic similarity matrix over the Arabic alphabet, in order to introduce them into the Levenshtein algorithm [4]. We then test whether this weighted edit distance [20] improves the correction rate by better refining the ranking of the closest solutions to the detected erroneous word, in order to achieve effective auto-correction [21], [22].
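As a rough illustration of the proximity matrix idea, the sketch below derives permutation costs from key positions on a grid. The layout fragment (the first six keys of two rows of a common Arabic layout), the adjacency rule and the weights `near` and `far` are all assumptions made for this example; they are not the matrices estimated in this work.

```python
# Hypothetical fragment of an Arabic keyboard layout: two rows, six keys each.
# Assumption: this is only an illustrative layout, not the paper's Figure 1.
LAYOUT_ROWS = [
    ["ض", "ص", "ث", "ق", "ف", "غ"],
    ["ش", "س", "ي", "ب", "ل", "ا"],
]


def build_proximity_matrix(rows, near=0.5, far=1.0):
    """Return {(a, b): cost} where neighbouring keys get the cheaper cost."""
    positions = {
        ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)
    }
    matrix = {}
    for a, (ra, ca) in positions.items():
        for b, (rb, cb) in positions.items():
            if a == b:
                matrix[a, b] = 0.0  # identical keys: no cost
            elif abs(ra - rb) <= 1 and abs(ca - cb) <= 1:
                matrix[a, b] = near  # adjacent keys: likely slip of the finger
            else:
                matrix[a, b] = far  # distant keys: full cost
    return matrix


def permutation_cost(matrix, a, b, default=1.0):
    """Look up the cost of typing `a` instead of `b` (hypothetical helper)."""
    if a == b:
        return 0.0
    return matrix.get((a, b), default)


PROXIMITY = build_proximity_matrix(LAYOUT_ROWS)
```

A calligraphic similarity matrix could be stored in the same `{(a, b): cost}` form and combined with this one, for instance by taking the smaller of the two costs; that combination rule is likewise an assumption.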

Definition of the weighted edit distance
The lexical similarity between two sequences $X = x_1 x_2 \ldots x_m$ of length $m$ and $Y = y_1 y_2 \ldots y_n$ of length $n$ is calculated using a new measure, called the weighted edit distance and denoted $Ed_{wei}$. Writing $Ed_{wei}(i, j)$ for the weighted edit distance between the first $i$ characters of $X$ and the first $j$ characters of $Y$, the calculation is given by the following recurrence relation:

$$Ed_{wei}(i, j) = \min \begin{cases} Ed_{wei}(i-1, j) + Cost_{del}(x_{i-1}) \\ Ed_{wei}(i, j-1) + Cost_{ins}(y_{j-1}) \\ Ed_{wei}(i-1, j-1) + Permut(x_{i-1}, y_{j-1}) \end{cases}$$

where:
− $Cost_{del}(x_{i-1})$: the cost of deleting the character $x_{i-1}$, given by $Proxim(x_{i-1}/y_j)$
− $Cost_{ins}(y_{j-1})$: the cost of inserting the character $y_{j-1}$, given by $Proxim(x_i/y_{j-1})$
− $Permut(x_{i-1}, y_{j-1})$: the cost of permutation between the characters $x_{i-1}$ and $y_{j-1}$
− $Proxim(x_i/x_j)$: the cost related to the proximity between the keyboard keys of the characters $x_i$ and $x_j$
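A minimal dynamic-programming sketch of a weighted edit distance of this kind is given below. It makes several simplifying assumptions: the deletion and insertion costs depend only on the character concerned (not on the neighbouring character, as $Proxim$ does in the definition above), the boundary rows accumulate deletion or insertion costs, and the three cost functions are hypothetical callables supplied by the caller.

```python
def weighted_edit_distance(x, y, cost_del, cost_ins, cost_permut):
    """Weighted edit distance between strings x and y.

    cost_del(c), cost_ins(c) and cost_permut(a, b) are caller-supplied
    cost functions (assumed here; the paper derives them from keyboard
    proximity and calligraphic similarity).
    """
    m, n = len(x), len(y)
    # d[i][j] = distance between the first i chars of x and first j of y
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # assumed base case: delete all of x[:i]
        d[i][0] = d[i - 1][0] + cost_del(x[i - 1])
    for j in range(1, n + 1):  # assumed base case: insert all of y[:j]
        d[0][j] = d[0][j - 1] + cost_ins(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + cost_del(x[i - 1]),
                d[i][j - 1] + cost_ins(y[j - 1]),
                d[i - 1][j - 1] + cost_permut(x[i - 1], y[j - 1]),
            )
    return d[m][n]


def unit(_char):
    """Unit insertion/deletion cost."""
    return 1.0


def unit_permut(a, b):
    """Unit permutation cost, zero for identical characters."""
    return 0.0 if a == b else 1.0
```

With the unit costs the function reduces to the classical Levenshtein distance; passing a cheaper `cost_permut` for neighbouring keys makes candidates that differ by a plausible typing slip rank closer to the erroneous word.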
To test and evaluate our new weighted edit distance, we ran a test on a dataset of errors covering the different kinds of editing operations. We randomly selected 547 errors from our learning corpus. Figure 2 summarizes the ranking rates of the exact solutions at the different ranks, according to the erroneous word/correct word database.

Figure 2. Comparison of correction rates between edit distance and weighted edit distance
According to these results, our weighting remarkably improved the ranking rate, with a score of around 67.50%. However, even with the probabilistic weights integrated to improve the correction rate, we did not reach our objective of auto-correction. Our second step is therefore to look for another filter to increase the correction rate.
After an in-depth analysis of the spelling errors committed when typing Arabic texts (learning test), we found that the majority of elementary editing errors occur either at the beginning or in the middle of the word, and rarely at the end. This finding guided our choice of another measure to compose with the previously defined weighted edit distance.
Among the distances used for measuring similarity, the Jaro-Winkler distance [23], [24] is well suited to this situation, as are n-gram language models [25], [26]. Combining them with our weighted edit distance is expected to further refine the weighting and allow us to reach our end objective. The choice of the Jaro-Winkler distance is justified because the majority of editing errors are made at the beginning of the lexical word.
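For reference, the standard Jaro-Winkler measure can be sketched as follows. This is the textbook formulation, with the usual scaling factor `p = 0.1` and a prefix cap of 4 characters; whether this work uses the same parameter values is an assumption. The function returns a similarity in [0, 1]; the corresponding distance is one minus that value.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the common-prefix length."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)
```

Because of the prefix bonus, two candidates with the same Jaro similarity are separated by how many leading characters they share with the erroneous word, which is what makes the measure usable as a tie-breaking filter.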

The filtered composition approach
To summarize what has been established in this paper, we define in this section our filtered composition, which composes three metrics: the weighted edit distance, the Jaro-Winkler distance and the bigram probabilistic language model. More formally, the similarity measure between $X$ and $Y$, called $D_{cf}$, is obtained by the following relationship:

where:
− $Ed_{wei}(X, Y)$: the weighted edit distance between strings $X$ and $Y$
− $J_{wink}(X, Y)$: the Jaro-Winkler distance between strings $X$ and $Y$
− $Pr(X/Y)$: the bigram language model probability of a word $X$ appearing after a word $Y$, estimated on a corpus of correct forms

Let $w_{err}$ be an erroneous word, $Dict$ a dictionary of a given language and $S_{ols}$ the set of closest candidate solutions $s_i$ for the error $w_{err}$. According to the new measure, the filtered composition, the best corrections are those that satisfy:

As shown in Figure 3, our spelling correction procedure consists of the following steps:
− Check whether $w_{err}$ is in the lexicon.
− Otherwise, calculate the edit distance to the words of the lexicon and return only the 10 solutions with the minimum distance.
− Pass this list of solutions to our filtered composition, which weights the solutions relative to one another.
− Finally, return the new list generated by our probabilistic lexical measure and test whether the word corresponding to the error is in the first position of the list.

Example: In our training corpus, a data entry operator committed the following error: " ". The corresponding correct word in our learning dataset is " ". Table 3 shows how our new measure moved the correct word for this error to the first position, ahead of the other suggested solutions.
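The procedure listed above can be sketched as a two-stage ranking. The sketch is deliberately generic: the second-stage `score` function is a parameter, because the exact form of $D_{cf}$ is not reproduced here, and the bigram model is a plain relative-frequency estimate without smoothing. The function names and the English toy data in the usage note are assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Classical (unweighted) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = cur
    return prev[-1]


def bigram_model(sentences):
    """Return prob(word, previous), estimated by relative frequency."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        for previous, word in zip(tokens, tokens[1:]):
            contexts[previous] += 1
            bigrams[previous, word] += 1

    def prob(word, previous):
        if contexts[previous] == 0:
            return 0.0  # unseen context: no evidence (no smoothing here)
        return bigrams[previous, word] / contexts[previous]

    return prob


def suggest(w_err, previous_word, lexicon, score, k=10):
    """Return candidate corrections for w_err, best first.

    score(w_err, candidate, previous_word) is a caller-supplied measure
    where lower is better (a stand-in for the filtered composition).
    """
    if w_err in lexicon:
        return [w_err]  # the word is correct, nothing to rank
    # Stage 1: keep the k lexicon entries closest in edit distance.
    shortlist = sorted(sorted(lexicon), key=lambda w: levenshtein(w_err, w))[:k]
    # Stage 2: reorder the shortlist with the composite measure.
    return sorted(shortlist, key=lambda w: score(w_err, w, previous_word))
```

As a usage example under these assumptions, take `prob = bigram_model([["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "cut", "healed"]])` and a score that breaks edit-distance ties with the bigram probability, `score = lambda e, c, p: (levenshtein(e, c), -prob(c, p))`. Calling `suggest("cst", "the", ["cat", "cut", "sat", "ran", "the", "healed"], score)` then places "cat" ahead of "cut": both are one edit away from "cst", but "cat" follows "the" more often in the toy corpus.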

RESULTS AND DISCUSSION
To evaluate the relevance of our new approach in terms of correction rate (accuracy rate) and to validate our design choices, we ran a spelling correction test using our new measure, the filtered composition. The test was carried out on a set of 547 erroneous words extracted from our error corpus. This test set covers the different kinds of elementary editing operations.
In this test, we evaluate the performance of our new measure more precisely, checking whether it corrects the errors exactly by ranking the corresponding correct words in the first position of the suggested solutions list. At the end of this test, we compared the accuracy rates of the different distances mentioned in this study. According to the results in Figure 4, our filtered composition achieved a markedly higher accuracy rate than the other measures cited in this article.
Our proposed similarity measure, the filtered composition, achieved a precision rate of up to 96%, which means that 96% of the solutions ranked in the first position of the solution lists are the correct words corresponding to the erroneous words in our training corpus. The precision rates of the other measures studied in this article do not exceed 68%: 67.70% for Jaro-Winkler, 67.31% for the weighted edit distance and 48.36% for the edit distance. However, our $D_{cf}$ measure ranked 3.41% of the solutions (correct words) in the second position of the list. This can be explained by the fact that some solutions have the same weighted edit distance and the same probability under the bigram language model. As an example, the two solutions " " and " " for the erroneous word " " have the same probability of appearing in front of the word " ", and the same measures returned by Jaro-Winkler and the weighted edit distance.
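The rates reported here are shares of test cases in which the correct word appears at a given rank of the suggestion list. A minimal sketch of that computation is shown below; the function name and the input format (pairs of a suggestion list and the expected word) are assumptions for illustration, not the evaluation code used in this work.

```python
from collections import Counter


def rank_rates(cases):
    """Return {rank: percentage of cases}, given (suggestions, correct) pairs.

    Rank 1 means the correct word is the first suggestion; rank 0 is used
    here for cases where the correct word is not suggested at all.
    """
    ranks = Counter()
    total = 0
    for suggestions, correct in cases:
        if correct in suggestions:
            ranks[suggestions.index(correct) + 1] += 1
        else:
            ranks[0] += 1
        total += 1
    return {rank: 100.0 * count / total for rank, count in ranks.items()}
```

Under this definition, the auto-correction rate is the value at rank 1, and the share of correct words pushed to second place is the value at rank 2.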

CONCLUSION
In this paper, we presented a new approach to correcting spelling errors in Arabic texts. This approach is based on a filtered composition of robust lexical similarity distances and probabilistic language models. From a training phase, we were able to assign probabilistic weights to the different elementary editing operations in the edit distance. The observation that data entry operators commit editing errors either at the beginning or in the middle of words led us to introduce the Jaro-Winkler distance, and finally to use the bigram language model to better weight, refine and filter candidate solutions against one another.
The experimental results obtained on a corpus of errors show that we achieved our objective in this work, namely highly effective auto-correction, with an auto-correction rate of 96%. Assigning probabilistic weights to the different editing operations when calculating the edit distance contributed to this very high rate, which validates our design choices. The results are very encouraging and show the value of embedding our new measure in a spelling correction system.