The impact of the image processing in the indexation system

ABSTRACT


INTRODUCTION
Old handwritten Arabic documents are part of a rich cultural heritage and contain a wealth of information. Repetitive manipulation of these manuscripts should be avoided, as it could destroy them.
To exploit the wealth of information contained in these manuscripts, digitization is a convenient way to preserve them. Recent advances in pattern recognition, storage, and network technology have paved the way for many digitization projects dealing with Latin scripts, such as Better Access to Manuscripts and Browsing of Images [1], Electronic Access to Medieval Manuscripts (EAMMS) [2], etc. This paper deals with the problem of query-by-example word spotting in handwritten Arabic documents. From our survey of word spotting systems, we found that few researchers treat handwritten Arabic documents, although millions of such documents have been written in various disciplines. In the Arabic handwritten case, a recognition system faces various problems, which can be summarized as follows:
-The cursive nature of the Arabic script
-The Arabic language contains many diacritical marks
-The form of the same letter can change at the beginning and at the end of a word
-Each writer has his own style
The word spotting process requires too much time and effort to be performed by manual inspection. To facilitate search in digitized document images, many word spotting researchers have relied on text line or word segmentation steps [3][4][5][6]. First, an initial step is performed to segment the text into word candidates [7]. Then, the candidates are represented by their sequences of features [8,9]. Finally, to compare the query word with these candidates, a similarity measure based on Dynamic Time Warping [8] or Hidden Markov Models [10] is used. The main problem with these approaches is that they are very sensitive and need a costly segmentation step to select candidate regions. The segmentation step is usually not easy, and any error affects the word representation and therefore the matching step. This explains why research on word spotting and retrieval has been oriented towards segmentation-free methods over the last few years.
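As a minimal illustration of the sequence-matching step used by the segmentation-based methods above, the following Python sketch computes a Dynamic Time Warping distance between two feature sequences (the frame features themselves are hypothetical; real systems use descriptors such as those of [8,9]):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences.

    a, b: 2-D arrays of shape (length, dim); the local cost is the
    Euclidean distance between frames.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Two identical sequences give a distance of zero; the query is matched against every candidate and the lowest-distance candidates are returned.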
Recent research on word spotting architectures has proposed approaches that do not need any segmentation step. In [11], Leydier et al. avoid the segmentation step in their word retrieval system by using features suited to any type of alphabet, computing local key points with a simple descriptor based on gradient information. Following the same approach, Zhang and Tan use features based on the Heat Kernel Signature [12]. The drawback of these methods is that they are not scalable to large datasets, because they use a costly distance computation in the matching phase. For this reason, Rothacker et al. propose in [13] to combine a bag-of-visual-words representation with Hidden Markov Models to avoid the segmentation step.
In [14], Almazán et al. avoid the segmentation step by representing documents with a grid of HOG descriptors, where a sliding-window paradigm is used to locate the regions that are most similar to the query in the dataset. Then, they use SVMs to get a better representation of the query. To solve the memory problem, the authors later moved to a bag-of-features representation. The method proposed by Marçal et al. [15] uses the query-by-example [16] paradigm, where local patches are described with a bag-of-visual-words model powered by Scale-Invariant Feature Transform descriptors. Then, the spatial pyramid matching framework is used.
In [17], Rodriguez uses a model-based approach to measure the similarity between vector sequences, where several features are extracted from all the images using a sliding window, such as local gradient histograms [18], zoning features [19] and column features [20]. Each sequence is mapped to an HMM and a similarity measure is computed between them. In [21], Petitjean et al. explain how template matching influences pattern spotting in historical document images by integrating and evaluating different template matching methods.
Degradation in handwritten documents can take different forms, such as the discoloration of ink and interfering patterns like ink bleed-through and show-through [22]. Therefore, before any process such as feature extraction or text segmentation, appropriate pre-processing is essential in order to correct this degradation [23]. In the present work, the document images are pre-processed in order to enhance them and to eliminate the strongly interfering background; this step improves the extraction phase. To enable efficient feature extraction, finding effective and robust features is an important task, which affects the word retrieval performance [24]. Here, the Scale-Invariant Feature Transform (SIFT) algorithm has been applied to extract and characterize interest points in the document; this algorithm has shown its efficiency in previous research [25,26]. To solve the memory problem caused by the descriptors' dimension, we propose to use a bag-of-features approach [25]: the SIFT descriptors are used to create histograms, and K-means clustering is applied to create the bag of visual descriptors [27,28]. Then, we represent the image regions as histograms using the bag-of-visual-words method [29,30].
At this stage, when working with data in high-dimensional spaces, the codebook size (the number of clusters) becomes very important, as it affects the region representation and, subsequently, the matching phase. For this reason, we choose the best codebook size by analysing the curse-of-dimensionality curve [31,32]. The last step is the computation of the distance between histograms [33]; a floating distance proposed to measure the similarity score will be presented.
The remainder of this paper is organized as follows. Section 2 presents the pre-processing stage used to correct the degradation of the handwritten documents. Section 3 describes the proposed word spotting system. Section 4 studies the influence of the pre-processing step on the proposed system, where we report experimental results and analysis. Finally, conclusions and further research directions are drawn in Section 5.

PRE-PROCESSING
The digitization of Arabic handwritten documents appears today as a necessity to preserve these rare documents. However, digitization is only the first step in a process of classification and indexing that exploits this wealth of information. For this reason, we have adopted an indexing method for scanned Arabic handwritten documents. Document images, and especially scanned handwritten documents, are complex and contain a large amount of relevant information, most of which is connected by relations of colors or intensities. Analysis and pre-processing of document images are avoided in some word spotting scenarios [14,15], where the variation of colors or intensities in the document is not large. To overcome this variation in other documents, such as the Ibn Sina database, we propose to start the indexing system with a pre-processing step.
Separating text from the image background is a very broad domain, and much research addresses this problem through a rough estimation of the text and background regions [34][35][36][37]. In [36], global binarization thresholding is used to identify the text and background classes; then, a noise model is built and used to adjust the threshold value. In [23], in order to identify text, background, and uncertain pixels, a binarization paradigm is presented; the uncertain pixels are then binarized using a classifier trained on the text and background classes.
In this paper, we address the enhancement and restoration of Arabic handwritten document images using a series of multi-level classifiers [23]. We have modeled the pre-processing step as shown in Figure 1, based on these classifiers, which can be used in an enhancement or restoration method. These multi-level classifiers are maps that extract relevant information at different levels (local, regional and global) and provide a value for each pixel in the image. Several classifiers exist; in this work, we use the estimated background, the stroke grey level, and the edge profile.
-Estimated background: a high-level classifier [38] that uses several other classifiers to arrive at an estimated background as close as possible to the true background of the image, as shown in Figure 1.
-Stroke grey level: this classifier provides a grey value for each pixel [23]; the estimated intensity of a stroke is calculated by averaging the intensities of its pixels, and an interpolated value is assigned to the non-text pixels (background, figure interference, etc.), as shown in Figure 1.
-Edge profile: a classifier used to overcome the interference problem; the edge profile is calculated from the gradient of the histogram in each region of the image, as shown in Figure 1-d.
Figure 2 shows the process of the pre-processing step. We apply the restoration method to the handwritten Arabic documents; Figure 3 shows the pre-processing results.
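To make the estimated-background idea concrete, here is a much simplified Python sketch (not the exact multi-level classifier of [23]; the threshold and window size are illustrative): pixels darker than a threshold are treated as text, and the background value at every pixel is the mean of the surrounding non-text pixels, so an interpolated value is assigned under the strokes.

```python
import numpy as np

def estimate_background(img, text_thresh=128, win=5):
    """Rough background estimation for a grayscale page.

    Pixels darker than `text_thresh` are treated as text; each pixel of
    the estimated background is the mean of the non-text pixels in a
    (2*win+1)^2 neighbourhood, which interpolates over text strokes.
    """
    h, w = img.shape
    bg = np.empty((h, w), dtype=float)
    mask = img >= text_thresh  # True where the pixel looks like background
    global_bg = img[mask].mean()
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - win), min(h, y + win + 1)
            x0, x1 = max(0, x - win), min(w, x + win + 1)
            patch, m = img[y0:y1, x0:x1], mask[y0:y1, x0:x1]
            # average the surrounding background pixels; fall back to the
            # global background mean when the window contains only text
            bg[y, x] = patch[m].mean() if m.any() else global_bg
    return bg
```

Subtracting (or dividing by) such an estimate flattens the interfering background before feature extraction.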

PROPOSED SYSTEM
We address the word spotting problem using a bag-of-visual-descriptors model powered by the Scale-Invariant Feature Transform, which describes each detected interest point. As shown in [27], the performance of this model depends on the number of visual descriptors extracted from the image. In order to represent each region of the images by a histogram while taking the different word sizes into account, we densely divide the image into a set of local regions. For this, we define three window sizes, H*H, 2H*H and 3H*H, to match the different local region sizes. The aim of this multi-scale representation is to capture all the different word sizes, as shown in Figure 4. The proposed method has been applied to different handwritten document images. Figure 5 shows the process of the proposed word spotting system.
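The dense multi-scale division can be sketched as follows (a minimal illustration; the stride is an assumption, the paper only fixes the three window sizes H*H, 2H*H and 3H*H):

```python
def multiscale_regions(width, height, H, step=None):
    """Generate the sliding local regions used to cover words of
    different sizes: windows of H*H, 2H*H and 3H*H pixels, all of
    height H, moved over the page with stride `step` (H//2 by default).

    Returns a list of (x, y, w, h) boxes that fit inside the page.
    """
    step = step or max(1, H // 2)
    regions = []
    for scale in (1, 2, 3):  # widths H, 2H and 3H share the same height H
        w = scale * H
        for y in range(0, height - H + 1, step):
            for x in range(0, width - w + 1, step):
                regions.append((x, y, w, H))
    return regions
```

Each such box is later encoded as one bag-of-visual-words histogram and compared against the query.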

Images representation
In [39], Lladós et al. show the influence of word representations on handwritten word spotting in historical documents, and how a bag-of-visual-words representation using SIFT descriptors can outperform classical approaches such as DTW based on sequence features. Here, we use the Scale-Invariant Feature Transform algorithm because its descriptors are invariant to translation, rotation and scaling. We show the impact of the pre-processing stage on the indexation system, and then we show how the results can be improved by using a floating distance similarity. The SIFT detector extracts the interest points from the images and then describes them; the algorithm used in our implementation is inspired by the one proposed by Lowe [40].
The main drawback of this approach at this stage is that it uses a costly distance computation and needs a large amount of computer memory, which is not scalable to large datasets. This problem is caused by the colossal number of key points detected in the document: the average number of key points in each region is 94, so a region is represented by a 94*128 descriptor matrix, and each document image contains on average more than 10 000 regions; in this case, we require approximately 114 MB of RAM to store each image. To solve this problem caused by the dimension of the descriptors, instead of representing the image regions by their raw descriptors, we encode each region as a histogram using a bag-of-visual-descriptors framework, inspired by models used in natural language processing and based on a sparse histogram of occurrence counts of visual words. The main steps of this technique are:
-Feature extraction
-Clustering to build the codebook
-Quantization
-Construction of the region histograms using the codebook
In the clustering step, we use 10% of all the descriptors and quantize them into K different clusters using the k-means algorithm. Finally, all regions are described by their histograms, obtained by assigning each descriptor of the region to the nearest visual descriptor in the codebook. Thus, each region j is represented by a histogram of accumulated frequencies H_j = (h_1,j, h_2,j, h_3,j, ..., h_k,j), where k is the number of visual descriptors in the codebook and h_i,j is the cumulative frequency of the i-th visual descriptor in the j-th region.
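The steps above can be sketched in Python as follows (a minimal stand-in: a plain k-means replaces the production clustering, and the 2-D toy descriptors stand in for 128-d SIFT vectors):

```python
import numpy as np

def build_codebook(descriptors, k, iters=20, seed=0):
    """Learn a codebook of k visual descriptors with a plain k-means.

    descriptors: (n, d) array, e.g. a 10% sample of all SIFT vectors.
    """
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest center, then recompute
        dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = descriptors[labels == c].mean(axis=0)
    return centers

def region_histogram(region_descriptors, codebook):
    """Quantization: encode one region as a histogram of visual words."""
    dist = np.linalg.norm(region_descriptors[:, None] - codebook[None], axis=2)
    labels = dist.argmin(axis=1)
    return np.bincount(labels, minlength=len(codebook))
```

The histogram h_j sums to the number of descriptors in region j, and its length k replaces the 94*128 raw representation, which is where the memory saving comes from.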

Impact of the pre-processing step
Information extraction from handwritten document images is one of the major challenges in the field of document image analysis. In this part, we show how the pre-processing step influences the indexation system. The word "يوم" is written in three different ways in the same document, as shown in Figure 6: the instances are written with different colors and different diacritical marks, and some regions of these words are degraded due to the antiquity of the manuscripts and their manual manipulation. For this reason, and in order to assess the impact of pre-processing, we have tested the proposed system on both the grayscale and the pre-treated images. We extract the interest points from each word and, for each point in the first word, we search for a similar one in the second word, as shown in Figure 7. As we can see in Table 1, the number of matched points in the grayscale image is higher than in the pre-treated image, due to false detections; most of these false key points come from non-text regions and do not describe the word strokes in the image. To examine the influence of these false detections and to compare the results between pre-treated and untreated images in the word spotting framework, we apply the proposed system (Figure 5). In the similarity step, we fix a threshold (Tf) and return every region whose similarity distance is less than Tf. Then we calculate the F-score of the system, the harmonic mean of precision and recall, which measures the ability of the system to return all relevant results and reject the others.
The F-score is calculated as follows:
F-score = 2 * (Precision * Recall) / (Precision + Recall)
To evaluate the performance of the approach on handwritten Arabic documents, we vary the size of the codebook. As shown in Figure 8, the F-score depends on the codebook size, and the approach based on pre-processing provides a good performance in terms of F-score (Table 2); the best mean F-score (0.778) is obtained for 300 code words.
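The harmonic mean above is straightforward to compute; a small helper for the evaluation (the guard for the degenerate zero case is our own convention):

```python
def f_score(precision, recall):
    """F-score: harmonic mean of precision and recall.

    Returns 0.0 when both precision and recall are zero, so the
    measure is defined for queries with no correct retrievals.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a precision of 0.5 and a recall of 0.5 give an F-score of 0.5, while a system that retrieves nothing relevant scores 0.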

The curse of dimensionality impact in the word spotting system
In this section, we analyse the impact of the curse of dimensionality on the bag-of-visual-words system. This term was coined by Richard Bellman [31,32] to describe the various phenomena that appear when working with data in high-dimensional spaces. For this study, we use two different Arabic handwritten documents from Gallica, the open-access digital library of the French National Library, which provides digitized handwritten documents, magazines, images, etc. First, we extract the interest points from each region of the pre-treated images using the SIFT algorithm. Then, in the learning step, the descriptors of the first 4 images of the document are grouped to provide the cluster centres (codebook). In this stage, we use the k-means algorithm with different numbers of centres k, from 100 to 900. Then, to calculate the similarity between the histogram of each region of the document image and the query histogram, we use the cosine distance:
S = 1 - (Σ_{i=1..k} H_i,j * R_i) / (sqrt(Σ_{i=1..k} H_i,j^2) * sqrt(Σ_{i=1..k} R_i^2))     (3)
where H_i,j is the occurrence of the i-th centre of the codebook in the j-th region and R_i is the occurrence of the i-th centre of the codebook in the query. To judge that a region j is similar to the query, the cosine distance S should be less than a certain threshold. For this, for each codebook size, we use thresholds varying from 0 to 0.7 in steps of 0.05, and we calculate the recall and precision measures. The recall and precision curves show that the result depends on the threshold and on the codebook size, as shown in Figure 9. To evaluate the impact of the codebook size, we calculate the F-score obtained with the best threshold for each size, as shown in Figure 10. We note that the best result is given for k=300; the F-score decreases beyond this size due to the impact of the high-dimensional space.
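The cosine distance of (3) translates directly into code (histograms here are plain integer arrays, as produced by the bag-of-visual-words encoding):

```python
import numpy as np

def cos_distance(h, r):
    """Cosine distance S = 1 - cos(H_j, R) between a region histogram h
    and the query histogram r; a region is retrieved when S falls
    below the threshold."""
    h = np.asarray(h, dtype=float)
    r = np.asarray(r, dtype=float)
    denom = np.linalg.norm(h) * np.linalg.norm(r)
    if denom == 0:
        return 1.0  # an empty histogram matches nothing
    return 1.0 - float(h @ r) / denom
```

Identical histograms give S = 0, and histograms with no visual word in common give S = 1, which is why the retrieval thresholds are swept between 0 and 0.7.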
Now, we study the influence of the codebook size on the threshold. From (3), when the codebook size N increases, the probability of not finding a given visual descriptor of the codebook among the descriptors of a region increases; that is, the probability P(H_i,j = 0) increases and, subsequently, the probability P(H_i,j * R_i = 0) increases, where R is the histogram of the query and H_j is the histogram of the j-th region of the document.

So the cosine similarity
D = (Σ_{i=1..N} H_i,j * R_i) / (sqrt(Σ_{i=1..N} H_i,j^2) * sqrt(Σ_{i=1..N} R_i^2))
decreases; thereafter, when the size N increases, the similarity distance S = 1 - D increases. Figure 11 shows the curve of the best threshold for each codebook size; as we can see, the threshold depends on the size of the codebook. For this reason, and to overcome the problem linked to the curse of dimensionality, we use a codebook with 300 visual descriptors, which gives the best result. At this stage, we have demonstrated that the F-score measure depends on the codebook size. To complete the experiments, we should study the impact of the number of descriptors (n) on the threshold. From (3), when the number of interest points n increases, the probability of finding every visual descriptor of the codebook in the histogram of a given region increases; that is, the probability P(H_i,j ≠ 0) increases and, subsequently, the probability P(H_i,j * R_i ≠ 0) increases. So D increases, with 0 ≤ D ≤ 1, and when n increases, the similarity distance S = 1 - D decreases: n ↗ ⇒ D ↗ ⇒ S ↘. This shows that the number of interest points influences the threshold: each word has a threshold related to its own number of descriptors. Therefore, the similarity distance should take the number of descriptors into account by using a floating threshold. Figure 12 shows the threshold curves as a function of the number of interest points. By analysing these curves, we can use the equation of the regression line as a floating threshold for each query. Note that two different regions with the same number of interest points will have the same threshold but different histograms, because they do not contain the same interest points, and therefore different similarity distances to the query.
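The regression line read off Figure 12 can be fitted as below (a sketch: the calibration pairs of interest-point counts and best thresholds are hypothetical, and a simple least-squares line stands in for the curve analysis):

```python
import numpy as np

def fit_floating_threshold(n_points, best_thresholds):
    """Fit the progression line threshold = a*n + b from calibration
    measurements (number of interest points n, best fixed threshold),
    so that each new query receives a threshold adapted to its own
    number of descriptors."""
    a, b = np.polyfit(n_points, best_thresholds, 1)  # degree-1 least squares
    return lambda n: a * n + b
```

At query time, the threshold applied to the cosine distance is then `thr(n)` for a query with n interest points, instead of a single fixed Tf for all queries.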

EXPERIMENTAL RESULTS
We tested our methodology on Arabic handwritten documents from the digital library Gallica; Figure 13 presents qualitative results for two different documents. The system used is based on SIFT-BoVW descriptors with 300 visual words and the floating threshold. We report here some queries for which the system automatically returns the regions similar to the query, without having to choose the K best results, which is a problem in other systems [14,41]. Then, we use a filtering step to select the single best result when overlapping regions are returned, based on their similarity scores and positions.
In comparison with the state of the art, we report the retrieval performance in terms of mAP in Table 3. We can see how pre-processing improves the results by keeping the informative interest points in each word and discarding the others, and how the floating threshold overcomes the problem of the varying number of interest points in different regions.
Figure 13. The retrieved images for some queries in the two evaluated documents

CONCLUSION
In this paper, we have presented an efficient segmentation-free word spotting approach for Arabic manuscripts. The proposed method gives excellent results compared with other methods in the literature. We have shown how the pre-processing step can improve the results of a bag-of-SIFT-descriptors system by providing informative and discriminative features. Then, we have shown how the results can be further improved by, first, choosing the best codebook size through an analysis of the curse-of-dimensionality curve and, second, using a floating cosine distance threshold. Finally, we have presented a comparison with other methods; we tested our approach with an experimental setup based on MATLAB code applied to the Gallica database.