Online handwriting Arabic recognition system using k-nearest neighbors classifier and DCT features

Received Aug 24, 2020; Revised Jan 19, 2020; Accepted Mar 5, 2021

With advances in machine learning techniques, handwriting recognition systems have gained a great deal of importance. Lately, the increasing popularity of handheld computers, digital notebooks, and smartphones has drawn more interest to the field of online handwriting recognition. In this paper, we propose an enhanced method for recognizing Arabic handwritten words using a direction-based segmentation technique and discrete cosine transform (DCT) coefficients as structural features. The main contribution of this research is combining a total of 18 features (16 DCT coefficients together with the state and dots of each character) and using the k-nearest neighbors (KNN) classifier to classify the segmented characters based on the extracted features. A dataset consisting of 2500 words in total is used to validate the proposed method. The average accuracy of 99.10% obtained in recognizing handwritten characters shows that the proposed approach, through its multiple phases, is efficient in separating, distinguishing, and classifying Arabic handwritten characters using the KNN classifier. The limited availability of online datasets of Arabic handwritten words is the main issue in this field. However, the dataset used will be made available for research via the website.


INTRODUCTION
Online recognition of handwritten scripts has remained an open research problem over the last four decades [1,2]. The area has recently attracted more attention owing to the wide use of touch-screen notebooks, handheld computers, and smartphones around the world [3,4]. Even so, for most languages more effort has been dedicated to offline script recognition than to online handwriting recognition [5][6][7]. Online text recognition refers to text handwritten on touch-screen or tablet-type devices. In essence, online recognition deals with the coordinate pairs of the handwriting in real time, while the writing is taking place. Offline recognition is comparatively simple: such systems take a complete image of the script from a digital input source, such as a scanner, and binarize it using a threshold technique, so that the image pixels are either on (1) or off (0) for typewritten [8,9] or handwritten text [10,11]. Online recognition of Arabic script is a challenging problem for many reasons. Handwritten characters suffer not only from variation in scale, location, and orientation, but also from person-dependent deformations that produce a large set of handwriting samples per character [12]. Additionally, writing on a digital platform is not as accurate as writing on paper. Many applications require real-time automatic recognition of handwritten scripts. With the growing popularity of handheld devices, the number of applications that require automatic text and digit recognition is also increasing. Although keyboard applications exist on these devices, many application areas require users to write on the screen or tablet of the device with a stylus pen or the tip of a finger [13,14]. The main motivation of this research paper is to design and implement a methodology for real-time Arabic handwriting recognition.
Our objective is to achieve high classification accuracy by proposing a novel approach to the preprocessing and segmentation of Arabic handwritten text. Much of the research on online recognition has concentrated on English, whereas Arabic script recognition has received less attention in previous studies. This paper aims to propose enhanced solutions to the major challenges of Arabic handwriting, namely size variations, dimensional variations, and irregularity of stylus points. A novel methodology for online recognition of Arabic handwritten words in real time is presented. The technique is based on direction-based segmentation, enhanced structural features, and an efficient classifier. The system is tested on a dataset consisting of more than 2500 handwritten Arabic word samples. These words were passed through four consecutive phases in order to separate each character of each sample and classify it correctly. An average accuracy of 99.10% is achieved by the proposed system. This paper is organized as follows: Section 2 introduces the methodology used for the online Arabic handwriting recognition system, Section 3 presents the results and discussion of the classification process, and Section 4 gives the conclusion and future work.

PROPOSED METHOD
The architecture of the proposed system is illustrated in Figure 1 and consists of five main phases. In the first phase, raw data are collected from a touch device as X and Y coordinates. In the second phase, pre-processing is applied to remove noise or distortions that are present in the raw data. This phase includes four steps: size normalization, centering, point resampling, and calculation of the direction of the pen or finger movement. In the segmentation phase, the data are represented at the character or sub-stroke level, so that the nature of every character or sub-stroke can be analysed individually [2,15]. In the feature extraction phase, the features that define each character are obtained: discrete cosine transform (DCT) coefficients that determine the form of each character, the character position within the word, and the character dots with their position. Finally, the classification phase uses the k-nearest neighbors (KNN) classifier to recognize the handwritten samples and verify the accuracy of the proposed method. The five phases are described below.

Dataset
The data acquisition phase is the first step of a pattern recognition system [16,17]. It supplies the raw data used in the subsequent phases to train and test the proposed system [18]. For this phase, a sample of 2500 words was obtained using a software interface that converts each handwritten word to the (x, y) coordinates of the pen trajectory. This dataset was collected by another research project [19].

Pre-processing phase
The pre-processing phase is applied to reduce noise and aberrations that may occur in the acquired text because of the limitations of the hardware or software used while writing. These include irregular text size, non-centered text, points missing from the lines and curves of the text path because of high writing speed, and uneven distances between neighbouring points [11]. The phase consists of several steps, each of which performs a specific function in filtering the raw data. It can also improve the overall recognition accuracy and is considered an important phase of pattern recognition systems. The following steps are applied in the pre-processing phase of the proposed system. Size normalization is performed so that any type of script can be recognized without ambiguity, by converting the acquired text size into a standard size. This step is carried out using (1),
where S is the series of x and y coordinates. After resizing, the coordinates are translated to the centre (x0, y0) so that every text is moved to the same position relative to the origin. This centering step is based on (2).
Then, the resampling process removes redundant points acquired by the digital device and adds missing points along the text trajectories. The resampling procedure is summarized in Figure 2. Finally, the Freeman chain code [20] is computed to represent the direction of movement for every word. This code represents the directions of the pen movements, which are numbered from 1 to 8 as illustrated in Figure 3. Figure 4 shows the results of performing the pre-processing steps on the word "له".
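The four pre-processing steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: equations (1) and (2) are not reproduced in this text, so the bounding-box scaling rule, the `target` size, and the `step` spacing are assumptions, as is the numbering of the chain codes (code 1 along +x, increasing counter-clockwise, with y pointing up), which may differ from Figure 3.

```python
import math

def normalize_size(points, target=100.0):
    """Scale the trajectory so the longer bounding-box side equals `target`.
    `target` is a hypothetical standard size, not a value from the paper."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    span = max(max(xs) - min(xs), max(ys) - min(ys))
    scale = target / span if span > 0 else 1.0
    return [(x * scale, y * scale) for x, y in points]

def center(points, origin=(0.0, 0.0)):
    """Translate the trajectory so its bounding-box centre lands on (x0, y0)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mid_x = (max(xs) + min(xs)) / 2.0
    mid_y = (max(ys) + min(ys)) / 2.0
    return [(x - mid_x + origin[0], y - mid_y + origin[1]) for x, y in points]

def resample(points, step):
    """Return points spaced `step` apart along the polyline (linear interpolation)."""
    out = [points[0]]
    prev = points[0]
    carried = 0.0  # distance travelled since the last emitted point
    for cur in points[1:]:
        seg = math.dist(prev, cur)
        while seg > 0 and carried + seg >= step:
            t = (step - carried) / seg
            prev = (prev[0] + t * (cur[0] - prev[0]), prev[1] + t * (cur[1] - prev[1]))
            out.append(prev)
            seg = math.dist(prev, cur)
            carried = 0.0
        carried += seg
        prev = cur
    return out

def freeman_code(p, q):
    """Direction code 1..8 for the move p -> q (assumed numbering, see lead-in)."""
    angle = math.atan2(q[1] - p[1], q[0] - p[0])
    return int(round(angle / (math.pi / 4))) % 8 + 1

def chain_codes(points):
    """Chain code of every move between consecutive, distinct points."""
    return [freeman_code(p, q) for p, q in zip(points, points[1:]) if p != q]

raw = [(0, 0), (4, 0), (4, 4)]                      # an L-shaped toy trajectory
prepared = center(normalize_size(raw, target=4.0))  # same size here, moved to the origin
resampled = resample(prepared, step=2)
codes = chain_codes(resampled)
print(codes)  # [1, 1, 3, 3]: two moves to the right, then two moves up
```

Resampling by arc length both thins out densely sampled regions and fills gaps left by fast writing, which is why it is applied before the direction codes are computed.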

Segmentation
In this phase, the data are represented as isolated characters or sub-strokes, so that the nature of every character or sub-stroke can be analysed individually. For this purpose, a new segmentation algorithm was developed. First, the handwritten word is divided into a set of strokes that are gathered into groups representing the characters of the written word. In this operation, the Freeman chain codes are grouped into three sets that describe three main directions: up, down, and horizontal (left and right). These sets are labelled as areas A, B, and C, respectively, as shown in Figure 5. [...] a specific threshold (t = 2.5), it will be a character dot or a transition from one character to another. In this case, a new stroke is created to store the sequence of points starting from the second point. Figure 6 shows the initial strokes of the word "له" after performing the first part of the segmentation algorithm.
Figure 6. The initial strokes after applying the first part of the segmentation phase
In the second part of the segmentation algorithm, the resulting strokes are filtered and then merged or removed according to the set of rules explained below. The baseline (writing line), a hidden line that describes the position of the characters or dots in the word, is calculated. Once the baseline is found, the final form of each character can be determined.
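The first part of the segmentation can be illustrated with a rough sketch. Two points are assumptions here, because the corresponding sentence and figure are incomplete in this text: the threshold t = 2.5 is taken to apply to the distance between two consecutive points, and the assignment of chain codes to areas A, B, and C follows a hypothetical numbering (code 1 along +x, increasing counter-clockwise) rather than Figure 5.

```python
import math

THRESHOLD = 2.5  # t from the text; what exactly it is compared against is assumed

# Hypothetical grouping of chain codes 1..8 into the three direction areas.
DIRECTION_AREA = {
    2: "A", 3: "A", 4: "A",   # upward moves
    6: "B", 7: "B", 8: "B",   # downward moves
    1: "C", 5: "C",           # horizontal moves (right and left)
}

def split_on_gaps(points, threshold=THRESHOLD):
    """Start a new stroke whenever two consecutive points are farther apart
    than `threshold` (a dot or a transition to another character)."""
    strokes = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if math.dist(prev, cur) > threshold:
            strokes.append([cur])       # the new stroke starts at the second point
        else:
            strokes[-1].append(cur)
    return strokes

def area_runs(codes):
    """Collapse a chain-code sequence into runs of the same direction area."""
    runs = []
    for code in codes:
        area = DIRECTION_AREA[code]
        if runs and runs[-1][0] == area:
            runs[-1][1] += 1
        else:
            runs.append([area, 1])
    return [tuple(run) for run in runs]

points = [(0, 0), (1, 0), (2, 0), (10, 5), (11, 5)]
print(split_on_gaps(points))          # two strokes: three points, then two points
print(area_runs([3, 3, 2, 1, 1, 7]))  # [('A', 3), ('C', 2), ('B', 1)]
```

The merge and filter rules of the second part depend on the baseline and on stroke intersections, so they are not covered by this sketch.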
First, if a stroke contains three points or fewer and is not related to any other stroke (in other words, it is not a complementary stroke of another stroke), the stroke is treated as dots belonging to a character. Otherwise, it is a character or a part of a character. The remaining strokes are then passed through a set of tests to identify the form of each stroke and determine its relationship to its adjacent strokes or, sometimes, to distant strokes. For example, a stroke whose points run in the vertical direction (class A or B) is the character Alif ('ـا') or the character Lam ('لـ') if the distances between these points are vertical and not far apart. In this case, the stroke has reached its final form. Apart from these cases, any stroke is considered a part of a character. Moreover, if a stroke intersects another one, the two are combined into a single stroke that forms a loop in a character, as in Fa ('فـ') and Sad ('ـصـ'). Also, if a stroke has a connection point with the next stroke and the two have opposite parts horizontally or vertically, they are grouped together. Examples of characters with opposite parts are Noon ('ن'), Ba ('ب'), Jeem ('ج'), Lam ('ل'), and Lam-Alif ('ال'). These steps are repeated for each stroke until it satisfies all the rules and cannot be merged any further. Figure 7 illustrates the results of applying the segmentation algorithm to the previous example "له". After the final forms of the characters are obtained, the number and position of the dots are calculated. The dots and the state of each character are recorded in text files as numeric codes, as explained in Table 1.

Table 1. Numeric codes for the state and dots of each character
Feature | Code | Description
State | 1 | The initial form of the character (connected only from the left), e.g. "لـ".
State | 2 | The middle form (connected from both left and right), e.g. "ـبـ".
State | 3 | The character is at the end of the word or connected only from the right, e.g. "ـه".
State | 4 | The isolated form (not connected), e.g. "و".
Dots | 1 | Single dot above the character.
Dots | -1 | Single dot below the character.
Dots | 2 | Two dots above the character.
Dots | -2 | Two dots below the character.
Dots | 3 | Three dots above the character.
Dots | 0 | No dots in the character.

Character features extraction
Feature extraction rests on the observation that only some features of the data points are important for pattern recognition tasks [21][22][23]. In this study, frequency-domain features of the character signals are used. The segmentation phase produces a sequence of strokes that represent the characters of a word. The sampling points of a word are therefore grouped by stroke, one group per stroke, for as many strokes as the word contains. The number of sample points may differ from one stroke to another; the strokes are not necessarily of equal length. The discrete cosine transform (DCT) allows a sequence to be reconstructed accurately from only a few DCT coefficients. It is used to convert the character signals into frequency-domain coefficients [24]. The feature vectors are constructed from the discrete cosine transform coefficients, which were found to describe the characteristics of every Arabic letter class accurately. Because the signals representing characters are dominated by low-frequency components, the eight lowest coefficients of each axis signal (x and y) are used to form the feature vector [24]. In total, there are 16 coefficients in the feature vector to represent the shape of every character. These coefficients are combined with the state and dots features of each character, so the final number of features used to describe and classify each character is 18.
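As an illustration of how such an 18-element feature vector could be assembled, the sketch below computes the eight lowest DCT coefficients of the x and y signals of one stroke and appends the state and dots codes. The DCT variant (type II with orthonormal scaling) and the zero-padding of strokes shorter than eight points are assumptions; the text does not specify either.

```python
import math

def dct_lowest(signal, n_coeffs=8):
    """First `n_coeffs` DCT-II coefficients (orthonormal scaling) of a 1-D signal."""
    n = len(signal)
    coeffs = []
    for k in range(n_coeffs):
        if k >= n:
            coeffs.append(0.0)  # assumed padding for strokes with fewer points
            continue
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs

def feature_vector(stroke, state, dots):
    """8 DCT coefficients of x + 8 of y + state code + dots code = 18 features."""
    xs = [x for x, _ in stroke]
    ys = [y for _, y in stroke]
    return dct_lowest(xs) + dct_lowest(ys) + [float(state), float(dots)]

# A toy vertical stroke in its initial form (state 1) with one dot below (-1).
stroke = [(0.0, float(i)) for i in range(10)]
features = feature_vector(stroke, state=1, dots=-1)
print(len(features))  # 18
```

Because only the low-order coefficients are kept, strokes with different numbers of sample points map to vectors of the same length, which a distance-based classifier requires.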

Classification
In this study, the k-nearest neighbors (KNN) classifier was applied using the Lazy-IBK classifier in Weka. This technique serves as the recognizer that classifies the Arabic handwritten characters. Before classification, a feature dataset was prepared with 115 class labels that identify each character. The dataset contains 6650 records in total, which represent the features of the characters in the collected handwritten words. The dataset used to train and test the KNN classifier is described in the next sub-sections.
Any new character can be classified by finding its k nearest neighbors among the training data. This is done by ranking the candidates of the 115 classes after computing the weight of each. A distance function is required for comparing the similarity of the writing points. In this classifier, the Euclidean distance is computed between the testing point and all the training points, and the K smallest distances identify the nearest neighbors. Equation (3) gives the Euclidean distance between two feature vectors:

$$d(C_{testing}, C_{training}) = \sqrt{\sum_{i=1}^{n} \left(fv_{test_i} - fv_{train_i}\right)^{2}} \qquad (3)$$

where $C_{testing}$ is the testing character, $C_{training}$ is the matched character from the training set, $fv_{test_i}$ and $fv_{train_i}$ are the i-th elements of the testing and training feature vectors, and $n$ is the number of features (18 in this study).
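A minimal sketch of nearest-neighbor classification with the Euclidean distance is shown below. It is a plain Python illustration, not the Weka implementation used in the study; the two-element feature vectors and the class labels are hypothetical stand-ins for the 18 features and 115 classes.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label). Majority vote among the k nearest."""
    nearest = sorted(train, key=lambda record: euclidean(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [
    ([0.0, 0.0], "alif"),
    ([0.1, 0.2], "alif"),
    ([1.0, 1.0], "ba"),
    ([0.9, 1.2], "ba"),
]
print(euclidean([0.0, 0.0], [3.0, 4.0]))    # 5.0
print(knn_predict(train, [0.8, 1.0], k=1))  # ba
```

With k = 1, the prediction is simply the label of the single closest training record, so no training step beyond storing the records is needed.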
Weka software (Waikato environment for knowledge analysis, version 3.8) was used to implement the KNN classifier by applying the Lazy-IBK classifier. This platform offers an easy way to compare a variety of machine learning techniques, and training and testing can be performed on datasets with the help of a collection of visualization methods and tools. KNN is an example of instance-based learning: the training dataset is stored, and any new unclassified case is classified by matching it to similar cases in the training set [25,26].
In this study, two different test modes were used. In the first mode, called full training set mode, the whole dataset is used to both train and test the model. In the second mode, called split criterion mode, the classifier is trained on a specific proportion of the records and tested on the remainder. The training proportion was varied from 90% down to 10% across runs. The number of nearest neighbors used by the KNN classifier was set to 1. The results and observations for each of the two test modes are explained below.
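The difference between the two test modes can be made concrete with a small sketch. It uses a toy two-class dataset and a plain 1-nearest-neighbor rule; the data, the labels, and the shuffling seed are hypothetical, and the code only mirrors the idea of the two modes rather than Weka's evaluation procedure.

```python
import math
import random

def nearest_label(train, query):
    """1-NN: label of the training record closest to `query` (Euclidean distance)."""
    return min(train, key=lambda record: math.dist(record[0], query))[1]

def accuracy(train, test):
    """Fraction of test records whose predicted label matches the true label."""
    hits = sum(nearest_label(train, features) == label for features, label in test)
    return hits / len(test)

def full_training_set(data):
    """Train and test on the same records."""
    return accuracy(data, data)

def percentage_split(data, train_fraction, seed=0):
    """Shuffle, train on the first `train_fraction` of the records, test on the rest."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return accuracy(shuffled[:cut], shuffled[cut:])

# Toy data: two well-separated classes of 20 records each.
data = [((i * 0.1, 0.0), "a") for i in range(20)] + \
       [((i * 0.1, 5.0), "b") for i in range(20)]

print(full_training_set(data))      # 1.0: every record is its own nearest neighbor
print(percentage_split(data, 0.9))  # 1.0 on this separable toy set
```

In full training set mode with one neighbor, each record finds itself at distance zero, so this mode gives an optimistic estimate; the split mode measures how well the stored records generalize to unseen ones.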

RESULTS AND DISCUSSION
The classification process determines the efficiency of the recognition system and the validity of the processes that make up each phase of the system. Therefore, the k-nearest neighbors (KNN) classifier was used as the primary recognition engine. The results obtained in this study are based on two test modes: full training set and split criterion. In the full training set mode, testing is performed on the same set the classifier is trained on. In the split criterion mode, the classifier is first trained on a specific proportion of the data to build the model, and the model is then tested on the rest of the records. Table 2 shows the accuracy results of the KNN classifier for each mode.
In the full training set test mode, the whole dataset was used to train and test the model, and the classification rate was therefore high (99.10%). In the split criterion mode, the dataset was divided in a different proportion each time. The best result was obtained when the records were divided into 90% for training and 10% for testing: the classifier classified 95.64% of the characters correctly. The lowest accuracy was 91.29%, when the records were divided into 10% for training and 90% for testing.
The classifier achieved classification rates of at least 90% in all cases, and fewer than 9% of the characters were incorrectly classified in the split criterion mode. This indicates that the proposed approach can classify most of the characters properly and that the pre-processing, segmentation, and feature extraction phases succeeded in representing the characters properly, which enabled the system to achieve these results.
Building the model took no more than 0.02 seconds, whereas testing took from 0.33 to 0.89 seconds in the split criterion mode and 4.11 seconds in the full training set mode. The precision (P) of the classifier was above 0.9 in each test mode, as shown in Table 3. The highest precision was 0.993 in the full training set mode, while it ranged from 0.916 to 0.962 in the split criterion mode. Overall, the KNN classifier achieved high precision and accuracy within a short period of time.
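For readers unfamiliar with the precision measure, the sketch below shows one common way to compute a single precision value for a multi-class classifier: per-class precision, TP / (TP + FP), averaged with weights proportional to the number of true instances of each class. Whether Table 3 uses this particular averaging is an assumption, and the labels are hypothetical.

```python
from collections import Counter

def weighted_precision(actual, predicted):
    """Per-class precision TP / (TP + FP), weighted by each class's true count."""
    support = Counter(actual)                 # true instances per class
    predicted_count = Counter(predicted)      # TP + FP per class
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)  # TP per class
    total = 0.0
    for label, count in support.items():
        if predicted_count[label]:            # precision is undefined if never predicted
            total += count * correct[label] / predicted_count[label]
    return total / len(actual)

actual    = ["a", "a", "a", "b", "b"]
predicted = ["a", "a", "b", "b", "b"]
# Class a: 2 of 2 predictions correct (1.0); class b: 2 of 3 correct (0.667).
print(round(weighted_precision(actual, predicted), 3))  # 0.867
```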

CONCLUSION
Online handwritten text recognition is more complex than offline recognition. Consequently, fewer studies and methods have been developed for online handwritten text recognition than for offline recognition. Online character recognition is used in many applications, such as translation on handheld devices, smart boards, and writing on touch-screen devices. Many smart devices currently on the market support applications that accept handwritten text. However, the Arabic language needs more attention, as few applications support it; the main reason for this limited support is the structure of written Arabic words.

In this research, a dataset containing 2500 handwritten samples that represent all cases of the Arabic characters was used. These samples were processed and filtered during the pre-processing phase and divided into their constituent characters during the segmentation phase. In the feature extraction phase, the features of the characters were obtained using discrete cosine transform (DCT) coefficients. Finally, feature vectors of 18 elements were used to assign each character to one of the predetermined classes using the k-nearest neighbors (KNN) classifier.
In the final results, the KNN classifier correctly classified up to 99.10% of the 6650 characters. The proposed system and its algorithms have thus proved effective and efficient in recognizing the characters of the handwritten words and, in turn, in recognizing these words as a single unit. Regarding this finding, most studies evaluate performance on private datasets that are not publicly available. In this study, the dataset was obtained from previous research. However, the studies mentioned were conducted without a segmentation step, so the results of the current study are not directly comparable with theirs.
Following this study on online Arabic text recognition and the implementation of the system, it is recommended that future studies improve the segmentation algorithm so that it covers different writing modes, such as slanted writing and overlapping characters. Finally, the proposed system, through its multiple phases, achieved the objectives of this study, and its concepts can be built upon in future studies.