An exploratory research on grammar checking of Bangla sentences using statistical language models

N-gram based language models are popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results and outperformed previous methods. In this work, we propose an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and perform a comparative performance analysis between the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.


INTRODUCTION
Natural Language Processing (NLP) is the field of study that investigates how computers can be used to recognize and operate on natural languages [1]. NLP is an important branch of Artificial Intelligence (AI) and, like other branches of AI, has many applications; examples of AI applications include rice grain classification [2], anomalous sound event detection [3], robotic navigation [4], and recommendation systems for buying houses [5]. One application of NLP is grammar checking [6]. Although many tools and techniques for grammar checking have been developed in recent years, as described in [7][8][9][10], grammar checkers still have considerable limitations.
There are mainly two approaches to implementing a grammar checker, namely the rule-based approach [11] and the statistical approach [12]. In rule-based grammar checkers, a set of manually developed grammatical rules is used to decide the correctness of the given text, and developing such rules requires time and high-level linguistic expertise in the target language. In the statistics-based approach, by contrast, the grammar rules are built from a text corpus of the target language using statistical methods, where common sequences that occur often can be considered correct and the uncommon ones incorrect. A language model (LM) is a widely used statistical technique that builds a statistical machine from a text corpus of the target language that can estimate the distribution of the language as accurately as possible. A central issue in LM estimation is data sparseness, in which case LMs fail to approximate accurate probabilities due to limited training data. Smoothing [13] is a technique that resolves this problem by adjusting the maximum likelihood estimator to compensate for data sparseness. In practice, LMs are usually implemented in conjunction with smoothing techniques for better performance. Many smoothing techniques are available, of which Witten-Bell (WB) [14] and Kneser-Ney (KN) [15] are two of the most effective and widely used. A number of good works have been done for Bangla in different problem domains of NLP, e.g. autocomplete [16], autocorrection of spelling [17], and word prediction [18]. Furthermore, there has been much development in grammar checking research for many different languages. Nevertheless, although Bangla is one of the ten most spoken languages in the world [19], there has been little development in Bangla language processing, especially in grammar checking. Though some efforts have been made, there is still plenty of room for improvement.
In [20] the authors presented an n-gram LM to design a Bangla grammar checker, where the n-gram probability distributions of the parts-of-speech (POS) tags of words are used as features. A sentence is detected as grammatically correct if the product of the probabilities of all the n-grams in the sentence is greater than zero, and as incorrect otherwise. Because of this, their method suffers from the data sparsity problem, which severely degrades the performance of the system. Moreover, they used a very small corpus of only 5000 words to build the n-gram model and tested the model on a test set of simple sentences. The authors in [21] presented another n-gram based statistical technique for grammar checking. Rather than using the probability of POS tags of words, this time the n-gram probability distribution of words is used to train and test the system. To deal with the sparsity problem of n-gram models, they used WB smoothing with their n-gram model. They trained their statistical n-gram model with a small experimental corpus of 1 million words and used a test set of 1000 correct and 1000 incorrect sentences. However, they did not clarify how the threshold between correct and incorrect sentences is determined, which limits the practical applicability of the approach. Moreover, in our previous work [22], a statistical method was proposed which used an n-gram based LM combined with WB smoothing and the backoff technique to determine the grammatical correctness of simple Bangla sentences, and which presented promising results. Nevertheless, there is still room for improvement, and further analysis is required to find an enhanced, robust and well performing statistical grammar checking system for Bangla.
The issues and facts mentioned above motivated this work, in which a comprehensive comparative study of the performance of WB and KN smoothing based LMs for grammar checking of Bangla sentences has been performed to find the best possible LM, settings and methods for the development of a more accurate and robust grammar checker for Bangla. The presented technique was trained on a large Bangla corpus of 20 million words collected from various online newspapers. An improved strategy is proposed to determine an appropriate threshold to distinguish between grammatical and ungrammatical sentences. The threshold was finalized in two stages, by performing cross validation on the training set and then testing on a separate validation set. The proposed method was tested on an updated, realistic and challenging test set of 15000 correct and 15000 incorrect sentences consisting of all kinds of simple and complex sentences with varying lengths. The rest of the paper is organized as follows: section 2 presents some theoretical background on n-gram based sentence probability calculation, section 3 describes the methodology used for developing the system, section 4 presents the experimental results, and section 5 concludes the paper.

STATISTICAL LANGUAGE MODELING
N-gram statistical LMs are widely used statistical methods for solving various NLP problems.

N-gram language models
A language model (LM) is a probability distribution over all possible sentences or strings in a language. Let us assume that $S$ denotes a sentence consisting of a specified sequence of words such that $S = w_1 w_2 w_3 \dots w_k$. An n-gram LM considers the word sequence or sentence to be a Markov process [23]. Its probability is calculated as,

$$P(S) = \prod_{i=1}^{k+1} P(w_i \mid w_{i-n+1} \dots w_{i-1}) \tag{1}$$

where $n$ refers to the order of the Markov process. When $n = 3$ we call it a trigram LM, which is estimated using information about the co-occurrence of 3-tuples of words. The probability $P(w_i \mid w_{i-n+1} \dots w_{i-1})$ can be calculated as,

$$P(w_i \mid w_{i-n+1} \dots w_{i-1}) = \frac{c(w_{i-n+1} \dots w_i)}{c(w_{i-n+1} \dots w_{i-1})}$$

where $c(\cdot)$ denotes the number of times a word sequence occurs in the training corpus. In practice, to calculate the probability of a sentence, a start token <s> and an end token </s> are used to indicate the start and end of the sentence respectively, so that $w_{k+1}$ is </s>. For example, the probability of the English sentence "Kader ate a mango" can be calculated using a bigram LM with (1) as,

P(Kader ate a mango) = P(Kader | <s>) * P(ate | Kader) * P(a | ate) * P(mango | a) * P(</s> | mango)
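The bigram computation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the two-sentence toy corpus and the function names `train_bigram_counts` and `bigram_sentence_probability` are hypothetical.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and bigrams over tokenized sentences padded with <s> and </s>."""
    context_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_sentence_probability(words, context_counts, bigram_counts):
    """Multiply the maximum likelihood bigram probabilities, as in (1) with n = 2."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


# Hypothetical toy corpus, used only to make the example runnable.
corpus = [
    ["Kader", "ate", "a", "mango"],
    ["Kader", "ate", "a", "banana"],
]
context_counts, bigram_counts = train_bigram_counts(corpus)
p = bigram_sentence_probability(
    ["Kader", "ate", "a", "mango"], context_counts, bigram_counts
)
```

On this toy corpus every factor is 1 except P(mango | a) = 1/2, so `p` evaluates to 0.5.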

Data sparsity problem
For any n-gram that appears an adequate number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable word sequences are bound to be missing from it. That means there will be many cases in which correct n-gram sequences are assigned zero probability. For example, suppose that in the training set the bigram একটি (ekti) আম (aam) occurs 5 times but, although it is correct, the similar bigram একটি (ekti) আপেল (apple) never occurs. Since the bigram একটি (ekti) আপেল (apple) has zero count in the training corpus, its probability in the bigram model will be zero: P(আপেল (apple) | একটি (ekti)) = 0. Consequently, the probability of the sentence will be P(কাদের একটি আপেল খেয়েছে) = 0. This probability is zero because, according to (1), the sentence probability is calculated by multiplying the constituent n-gram probabilities, and if one of them is zero then the total probability is zero. Therefore, zero-frequency n-gram sequences that do not occur in the training data but appear in the test set pose a great problem for simple n-gram models in accurately estimating sentence probabilities.
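The effect of a single unseen bigram can be checked numerically. The sketch below uses the counts from the example (the bigram "ekti aam" seen 5 times, "ekti apple" never); the assumption that "ekti" occurs only before "aam", and the probabilities of the other bigrams in the sentence, are hypothetical values chosen for illustration.

```python
from collections import Counter

# Counts from the example: "ekti aam" occurs 5 times, "ekti apple" never.
# Assumption: "ekti" is followed only by "aam" in this toy training set.
bigram_counts = Counter({("ekti", "aam"): 5})
context_counts = Counter({"ekti": 5})


def mle_probability(prev, cur):
    """Unsmoothed maximum likelihood estimate c(prev cur) / c(prev)."""
    return bigram_counts[(prev, cur)] / context_counts[prev]


p_seen = mle_probability("ekti", "aam")
p_unseen = mle_probability("ekti", "apple")

# Hypothetical probabilities for the remaining bigrams of the sentence.
other_bigram_probs = [0.4, 0.3, 0.5]
sentence_prob = p_unseen
for prob in other_bigram_probs:
    sentence_prob *= prob
```

However large the other factors are, the single zero factor forces `sentence_prob` to zero, which is the behaviour that smoothing is meant to prevent.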

Smoothing
Smoothing techniques are used to keep an LM from assigning zero probability to unseen word sequences, and have become an indispensable part of any LM. In this work, we utilized the two most widely used smoothing algorithms for language modelling, namely Witten-Bell (WB) smoothing and Kneser-Ney (KN) smoothing. Smoothing techniques are often implemented in conjunction with two useful strategies that take advantage of the lower order n-grams when the higher order n-grams yield zero or low probabilities. These are the backoff [24] and interpolation [25] strategies.

Witten-Bell smoothing
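Witten-Bell (WB) smoothing uses the counts of word sequences occurring once to estimate the counts of zero-frequency word sequences. Originally, the WB smoothing algorithm was implemented as an instance of linear interpolation, taking advantage of lower order n-gram counts:

$$P_{WB}(w_i \mid w_{i-n+1} \dots w_{i-1}) = \lambda(w_{i-n+1} \dots w_{i-1})\, P_{ML}(w_i \mid w_{i-n+1} \dots w_{i-1}) + \big(1 - \lambda(w_{i-n+1} \dots w_{i-1})\big)\, P_{WB}(w_i \mid w_{i-n+2} \dots w_{i-1})$$

Here, $P_{ML}$ is the maximum likelihood estimate, $1 - \lambda(w_{i-n+1} \dots w_{i-1})$ is the total probability mass that is reserved for all the zero-count n-grams, and $\lambda(w_{i-n+1} \dots w_{i-1})$ is the leftover probability mass for all non-zero count n-grams. With a small adjustment, WB smoothing can also be implemented as a backoff language model, in which the lower order estimate is used only when the higher order n-gram has a zero count.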
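As a rough illustration of the interpolated form, the following sketch implements WB smoothing for a bigram model that interpolates with a unigram distribution. It assumes the standard Witten-Bell weight, 1 − λ(h) = T(h) / (T(h) + c(h)), where T(h) is the number of distinct word types seen after the context h; this definition, the transliterated toy corpus and the names `train` and `p_wb` are assumptions for illustration, not details confirmed by the text above.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Collect bigram, context and unigram counts plus the follower types of each context."""
    bigrams, contexts, unigrams = Counter(), Counter(), Counter()
    followers = defaultdict(set)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
            unigrams[cur] += 1
            followers[prev].add(cur)
    return bigrams, contexts, unigrams, followers


def p_wb(cur, prev, model):
    """Interpolated Witten-Bell bigram probability with a unigram lower order model."""
    bigrams, contexts, unigrams, followers = model
    p_lower = unigrams[cur] / sum(unigrams.values())
    c_h, t_h = contexts[prev], len(followers[prev])
    if c_h == 0:
        return p_lower  # unseen context: rely on the lower order model only
    lam = c_h / (c_h + t_h)  # mass kept for n-grams that were actually seen
    return lam * bigrams[(prev, cur)] / c_h + (1 - lam) * p_lower


# Hypothetical transliterated toy corpus.
corpus = [
    ["kader", "ekti", "aam", "kheyeche"],
    ["rahim", "apel", "kheyeche"],
]
model = train(corpus)
p_unseen = p_wb("apel", "ekti", model)  # bigram never seen in training
total = sum(p_wb(w, "ekti", model) for w in model[2])  # sum over the vocabulary
```

The unseen bigram now receives a small non-zero probability, and the probabilities for the context "ekti" still sum to one over the vocabulary.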

Kneser-Ney smoothing
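In Kneser-Ney (KN) smoothing, the lower order distribution that one combines with a higher order distribution is built on the intuition that, rather than calculating the probability of a word in proportion to its number of occurrences, it should be calculated based on the number of different words it follows. In its original definition, Kneser and Ney defined KN smoothing as a backoff language model combining lower order models with the higher order model using the backoff strategy as:

$$P_{KN}(w_i \mid w_{i-n+1} \dots w_{i-1}) = \begin{cases} \dfrac{\max\{c(w_{i-n+1} \dots w_i) - D,\ 0\}}{c(w_{i-n+1} \dots w_{i-1})} & \text{if } c(w_{i-n+1} \dots w_i) > 0 \\[2ex] \gamma(w_i \mid w_{i-n+1} \dots w_{i-1})\, P_{KN}(w_i \mid w_{i-n+2} \dots w_{i-1}) & \text{otherwise} \end{cases}$$

where $\gamma(w_i \mid w_{i-n+1} \dots w_{i-1})$ represents the backoff weights assigned to the lower order n-grams, which determine the impact of the lower order value on the result. The discount $D$ represents the amount that is subtracted from the count of each higher order n-gram. $D$ can be estimated from the total number of n-grams occurring exactly once ($n_1$) and twice ($n_2$) as $D = \frac{n_1}{n_1 + 2 n_2}$. The probability for the lower order n-grams can be calculated as

$$P_{KN}(w_i \mid w_{i-n+2} \dots w_{i-1}) = \frac{N_{1+}(\bullet\, w_{i-n+2} \dots w_i)}{\sum_{w_i} N_{1+}(\bullet\, w_{i-n+2} \dots w_i)}$$

where,

$$N_{1+}(\bullet\, w_{i-n+2} \dots w_i) = \left|\{\, w_{i-n+1} : c(w_{i-n+1} \dots w_i) > 0 \,\}\right|$$

With a small modification, the interpolated version of KN can be defined as follows:

$$P_{KN}(w_i \mid w_{i-n+1} \dots w_{i-1}) = \frac{\max\{c(w_{i-n+1} \dots w_i) - D,\ 0\}}{c(w_{i-n+1} \dots w_{i-1})} + \frac{D}{c(w_{i-n+1} \dots w_{i-1})}\, N_{1+}(w_{i-n+1} \dots w_{i-1}\, \bullet)\, P_{KN}(w_i \mid w_{i-n+2} \dots w_{i-1})$$

where $N_{1+}(w_{i-n+1} \dots w_{i-1}\, \bullet)$ is the number of distinct words that follow the context $w_{i-n+1} \dots w_{i-1}$ in the training corpus.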
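A compact sketch of interpolated KN smoothing for the bigram case is given below, using the discount estimate D = n1 / (n1 + 2 n2) and a continuation probability based on the number of distinct preceding words. It follows the textbook formulation; the transliterated toy corpus and the names `estimate_discount`, `train` and `p_kn` are hypothetical, and a practical system would extend this recursively to higher orders.

```python
from collections import Counter


def estimate_discount(ngram_counts):
    """D = n1 / (n1 + 2 * n2), from n-grams seen exactly once and exactly twice."""
    n1 = sum(1 for c in ngram_counts.values() if c == 1)
    n2 = sum(1 for c in ngram_counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)


def train(sentences):
    """Collect bigram and context counts plus the type counts needed by KN."""
    bigrams, contexts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    followers = Counter(prev for prev, _ in bigrams)  # distinct words after each context
    preceders = Counter(cur for _, cur in bigrams)    # distinct words before each word
    return bigrams, contexts, followers, preceders


def p_kn(cur, prev, model, discount):
    """Interpolated Kneser-Ney bigram probability."""
    bigrams, contexts, followers, preceders = model
    p_cont = preceders[cur] / len(bigrams)  # continuation probability of cur
    c_h = contexts[prev]
    if c_h == 0:
        return p_cont  # unseen context: use the continuation probability alone
    discounted = max(bigrams[(prev, cur)] - discount, 0) / c_h
    backoff_mass = discount * followers[prev] / c_h
    return discounted + backoff_mass * p_cont


# Hypothetical transliterated toy corpus.
corpus = [
    ["kader", "ekti", "aam", "kheyeche"],
    ["rahim", "apel", "kheyeche"],
]
model = train(corpus)
D = estimate_discount(model[0])
p_unseen = p_kn("apel", "ekti", model, D)
total = sum(p_kn(w, "ekti", model, D) for w in model[3])  # sum over the vocabulary
```

On this toy corpus seven bigram types occur once and one occurs twice, so D = 7/9, and the unseen bigram "ekti apel" receives the probability 7/72 instead of zero.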

PROPOSED GRAMMAR CHECKING METHODOLOGY
In this section we present the grammar checking methodology that we used to evaluate and analyse the performance of the smoothing algorithms. It is an updated version of the grammar checker we presented and described in our previous work. The overall framework or workflow of the system is depicted in Figure 1. The working procedure of the grammar checker consists of three main phases: the training phase, the validation phase and the testing phase. The training process in the proposed system starts by accepting the training corpus and the value of n as input. After accepting the input text and the value of n, the possible n-gram patterns of words are extracted and the frequencies of the n-grams are calculated. Using these n-gram frequencies, LMs are trained based on the algorithms discussed in the previous sections. In the validation phase, the best possible threshold is calculated for separating the correct and incorrect sentences. The validation process starts by accepting a validation or held-out set consisting of correct and incorrect test sentences. Then the probabilities of these test sentences are calculated and a threshold value is determined that best separates the grammatical and ungrammatical sentences. To do so, we first need to define a method to calculate the sentence probability properly, which is discussed next.

Calculation of sentence probability
The sentence probability in n-gram LMs is usually calculated using (1) by first finding the constituent n-grams in the sentence, as shown in section 2.1. Since probabilities are by definition less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes. As a result, sentence length (i.e. the number of word tokens in the sentence) has a negative effect on the probability of a sentence. A longer sentence tends to have a smaller probability even when its constituent n-grams have high probabilities. So, a long correct sentence might have a smaller probability than a short incorrect sentence because of this effect. To deal with this impact of sentence length on the sentence probability calculation, a new sentence probability scoring function is introduced in this work, defined in (8) by normalizing the sentence probability in (1).
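The length effect described above, and the way a normalization can remove it, can be illustrated with a small sketch. It uses one common normalization, the geometric mean of the constituent n-gram probabilities computed in log space; this choice is an assumption for illustration and may differ from the scoring function defined in (8). The probability values are hypothetical.

```python
import math


def normalized_score(ngram_probs):
    """Geometric mean of the n-gram probabilities, computed in log space."""
    if not ngram_probs or min(ngram_probs) <= 0:
        return 0.0
    log_sum = sum(math.log(p) for p in ngram_probs)
    return math.exp(log_sum / len(ngram_probs))


# Hypothetical n-gram probabilities for two sentences.
short_incorrect = [0.5, 0.01, 0.5]  # short sentence containing one very unlikely n-gram
long_correct = [0.3] * 12           # long sentence made of moderately likely n-grams

raw_short = math.prod(short_incorrect)
raw_long = math.prod(long_correct)
score_short = normalized_score(short_incorrect)
score_long = normalized_score(long_correct)
```

With the raw product the long correct sentence scores lower than the short incorrect one, while after normalization the ordering is reversed, which is the behaviour a length-normalized score is meant to provide.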

Optimal threshold calculation
In the validation phase, the optimal threshold for the n-gram based classifier is calculated in two stages. In the first stage, we used 10-fold cross validation on the training set, which consists of only grammatically correct sentences. Since a correct sentence typically has a higher probability than an incorrect one, in each fold we selected the lowest probability score among the sentences of the training part as the threshold, used that threshold to classify the test sentences, and computed the misclassification error for that threshold. The threshold with the minimum misclassification error is chosen as the final threshold of this stage. The process is an improved version of the one we used in our previous work. The process is explained in Algorithm 1.

Algorithm 1. Preliminary threshold selection from training set in stage 1
Input: S = training data set; L = corresponding true labels of positive and negative sentences in VS; LM = language model to be used
1. Set MCR_min = 1 (the minimum misclassification rate) and let T denote the final threshold
2. Divide S into 10 folds S_1, S_2, …, S_10
3. For each fold S_i Do
4.     Set S_test = S_i and S_train = S − S_i
5.     Train the LM on S_train
6.     t = Find the minimum probability in S_train and set it as the current threshold
7.     probs = Test the LM on S_test using t as threshold
8.     mcr = Find the misclassification rate for the current threshold
9.     If mcr < MCR_min then Set MCR_min = mcr and T = t
10. End For
11. return T // T is the final threshold selected

Though the method in the first stage works well, it introduces a lot of false positives in the final classification. Since we use the minimum probability score of the correct (positive) sentences as the threshold, it ensures a high number of true positives, but the region above the threshold overlaps with a substantial number of incorrect sentences in the probability distribution, hence the high false positive rate. To reduce the number of false positives and to improve the overall classification performance, in the second stage we used a method that gradually increases the threshold to reduce the number of false positives while maintaining the balance between false positives and false negatives. This method is applied on a separate validation set consisting of an equal number of positive and negative sentences to finalize the optimal threshold. This process is explained in Algorithm 2.

In the testing phase, the classification LMs are tested on a separate test set consisting of grammatical and ungrammatical sentences using the optimum threshold calculated in the validation phase. If a sentence has a probability less than the optimum threshold then it is classified as ungrammatical, otherwise as grammatical.

Algorithm 2 (stage 2, applied on the validation set VS)
1. Set th = t_0 // th is the final threshold
2. Divide the range [t_0, 1] into k equal sized thresholds in THS = {t_1, t_2, …, t_k}
3. For each threshold t in THS Do
4.     Calculate [TP, FP, TN, FN] using t as threshold on VS and hence calculate fpr_t and fnr_t for t
       …
8. End For
9. return th // th is the final threshold selected
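The two-stage procedure can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: `train_lm` is a hypothetical callable that trains an LM and returns a sentence-scoring function, and because the update rule inside the loop of Algorithm 2 is not spelled out above, the sketch assumes that the threshold with the smallest gap between the false positive rate and the false negative rate is kept.

```python
def select_stage1_threshold(sentences, train_lm, k=10):
    """Stage 1: k-fold cross validation on correct-only training sentences."""
    folds = [sentences[i::k] for i in range(k)]
    best_mcr, best_t = 1.0, None
    for i, test_fold in enumerate(folds):
        train_part = [s for j, fold in enumerate(folds) if j != i for s in fold]
        score = train_lm(train_part)           # returns a sentence-scoring function
        t = min(score(s) for s in train_part)  # lowest score among training sentences
        # All held-out sentences are correct, so any score below t is an error.
        mcr = sum(score(s) < t for s in test_fold) / len(test_fold)
        if best_t is None or mcr < best_mcr:
            best_mcr, best_t = mcr, t
    return best_t


def select_stage2_threshold(t0, scores, labels, k=100):
    """Stage 2: scan k thresholds in (t0, 1] on a labelled validation set."""
    positives = sum(labels)
    negatives = len(labels) - positives

    def rate_gap(t):
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        fn = sum(s < t and y for s, y in zip(scores, labels))
        return abs(fp / negatives - fn / positives)

    best_t, best_gap = t0, rate_gap(t0)
    for step in range(1, k + 1):
        t = t0 + (1.0 - t0) * step / k
        gap = rate_gap(t)
        if gap < best_gap:  # assumed rule: keep the most balanced threshold
            best_t, best_gap = t, gap
    return best_t


# Toy usage: "sentences" are pre-computed scores and the hypothetical LM is the identity.
pre_scored = [0.9, 0.4, 0.7, 0.6, 0.8, 0.5]
identity_lm = lambda train_part: (lambda s: s)
t0 = select_stage1_threshold(pre_scored, identity_lm, k=3)

val_scores = [0.62, 0.71, 0.83, 0.94, 0.12, 0.23, 0.52, 0.66]
val_labels = [True] * 4 + [False] * 4  # True = grammatical
th = select_stage2_threshold(t0, val_scores, val_labels, k=6)
```

In the toy run, stage 1 returns the lowest training score (0.4), and stage 2 raises the threshold to about 0.6, where only one incorrect sentence is still accepted and no correct sentence is rejected.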

RESULTS AND ANALYSIS
The main focus of this section is to investigate the performance of the grammar checking system with respect to factors such as the smoothing algorithm used, the n-gram order and the length of the target sentences. To train and test the LMs we used a large corpus of 20 million words containing 181820 grammatically correct sentences. Around 80% of the corpus is used for training. The validation set consists of 20000 correct sentences and 20000 incorrect sentences. The grammatically incorrect sentences are artificially created by inserting, deleting or replacing words in the correct sentences in the set. The test set contains 15000 correct and 15000 incorrect sentences. In our previous work, we tested the methods on a test set containing only simple sentences of 5 to 10 words. This time we tested the techniques on a more difficult and practical test set consisting of all kinds of simple, complex and compound sentences with lengths ranging from 5 to 20 words. The experiments were run on a machine with a 2.40 GHz Intel Core i3 processor and 12 GB of RAM, running Microsoft Windows 8. The experimental system was developed using the Python programming language. The comparative performance of the LMs was evaluated by precision, recall and f-score. The overall performance of the different LMs, based on the smoothing technique and n-gram order used, is presented in Table 1. Table 1 presents the results of the different LMs for each metric (precision, recall and f-score) in two columns. The gray shaded column shows the results obtained using the threshold selection method from our previous work [22], and the other column shows the results attained using the two-stage threshold selection procedure explained in Algorithm 1 and Algorithm 2, which is proposed in this work. Our newly proposed two-stage optimum threshold selection approach provides significantly improved results for all the LMs compared to the previous approach.
The two-stage approach significantly increases the precision, and hence the overall f-score, for all the LMs at the cost of a small or insignificant reduction in recall for grammatical sentences. Similarly, for ungrammatical sentences the recall scores are significantly improved, resulting in a much improved f-score with a negligible loss of precision. This improvement is due to the reduction in false positives while keeping a balance between false positives and false negatives. These results demonstrate the advantage of our proposed method over the previous one. From the new results in Table 1 it is evident that KN-interp with its 5-gram model outperforms all the other LMs in terms of precision, recall and f-score for both grammatical and ungrammatical sentences, achieving the highest f-scores of 72.92% and 68.51% respectively. In terms of f-score, as Table 1 shows, WB-backoff produces the second best results for both grammatical and ungrammatical sentences, with the KN-backoff model providing the third best performance. The models rank similarly in terms of precision and recall with one or two exceptions; for example, on the recall metric KN-backoff performs slightly better than WB-backoff. In Table 1, T1 is the threshold calculated using the technique of our previous work [22], and T2 is the threshold calculated using the two-stage threshold selection technique introduced in this work.
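For reference, the per-class precision, recall and f-score used in this comparison can be computed as in the sketch below. The label strings, the function name `class_metrics` and the eight toy predictions are hypothetical and do not correspond to any value in Table 1.

```python
def class_metrics(true_labels, predicted_labels, target):
    """Precision, recall and f-score for one class (e.g. grammatical sentences)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical predictions: G = grammatical, U = ungrammatical.
truth = ["G", "G", "G", "G", "U", "U", "U", "U"]
preds = ["G", "G", "G", "U", "G", "G", "U", "U"]

grammatical = class_metrics(truth, preds, "G")
ungrammatical = class_metrics(truth, preds, "U")
```

In this toy case two incorrect sentences are accepted as grammatical, which lowers the precision for the grammatical class and the recall for the ungrammatical class; these are the two quantities that a reduction in false positives improves.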
The performance of the LMs improves as the n-gram order grows, but the improvement gets smaller with each higher order. Though the performance of most of the LMs tends to increase from the 4-gram order to the 5-gram order, the differences are very small. Figure 2 and Figure 3 depict this effect, presenting the f-scores of the LMs by n-gram order for grammatical and ungrammatical sentences respectively. Though not presented here, similar effects can be observed in terms of precision and recall.
Since we are using a data set consisting of sentences of varied length, we next examine whether sentence length has any effect on the performance of the LMs. Figures 4 and 5 present the f-scores of our two best performing LMs, KN-interp and WB-backoff, by the length of the sentences tested, for grammatical and ungrammatical data respectively. From Figures 4 and 5, we find that the performance of the LMs gradually decreases with increasing sentence length. This is understandable, since sentences with more words tend to be more complex in structure and more difficult to judge. But this degradation in performance is linear, not exponential, and the changes are very small. This shows the robustness of our sentence probability calculation function defined in (8). Though not presented here, the other LMs (KN-backoff and WB-interp) and the other metrics show a similar dependency on sentence length. So, we can conclude that the interpolated version of the KN LM, i.e. KN-interp, outperforms all the other LMs in terms of all performance metrics. The performance of the LMs improves with higher n-gram order, with the 4-gram and 5-gram models showing similar performance with negligible differences, and the length of the sentence does not affect the performance of the LMs significantly.

CONCLUSION
The goal of this research was to design and develop a robust grammar checking system for the Bangla language which can accurately judge realistic, simple and complex sentences for grammaticality. To that end, a statistical grammar checking system based on n-gram language modelling has been designed and developed. To achieve robust performance with n-gram models, two of the most widely used smoothing techniques, namely Kneser-Ney and Witten-Bell, were used and compared to find the best performing system. Furthermore, the LMs were tested on a newly developed, challenging test set containing 30000 sentences of all types (simple, complex and compound) to obtain realistic performance results. Our experimental results show that the Kneser-Ney interpolated smoothing based 5-gram LM outperforms the others in terms of all the metrics, achieving f-scores of 72.92% and 68.51% for grammatical and ungrammatical data respectively. To further this research, more features such as parts-of-speech tags and other linguistic features can be added to improve the performance of the system.