Combined cosine-linear regression model similarity with application to handwritten word spotting

Similarity and distance measures are widely used to quantify the similarity or dissimilarity between vector sequences; in document image analysis, such measures play a central role in matching and pattern recognition. In this paper, we survey various distance measures used in image matching and discuss the limitations associated with the existing distances. We then introduce the concept of a floating distance, which adapts the decision threshold to each word, based on a combination of linear regression and the cosine distance. Experiments are carried out on handwritten Arabic document images from the Gallica library and show that the proposed floating distance outperforms the traditional distances in a word spotting system.


INTRODUCTION
In pattern recognition, objects can be represented as sequences of features, and a similarity measure serves as a tool to judge the similarity or dissimilarity between two such sequences. In the literature, two objects are said to be similar when they look or lie near each other without being identical. Many similarity and distance measures have been used widely across fields: text similarity [1], document similarity [2], handwriting recognition [3][4][5], handwritten character recognition [6, 7], speech recognition [8][9][10], and video analysis [11]. Choosing a good similarity measure is therefore a fundamental step for many applications and domains, such as information retrieval and recognition, chemistry, and clustering or classification.
In pattern recognition, the term similarity denotes a score that represents the strength of the relationship between two sequences of data items. Pattern recognition, and word spotting in particular, relies on a similarity measure to retrieve images that resemble each other. Mathematically, a distance is used to calculate how far apart two sequences of data are; it is also called a dissimilarity in other domains, the concept being the same up to inversion [12]. Figure 1 shows a chronological overview of the similarity and distance measures commonly used in various fields, based on feature sequences [13, 14]; they divide into three groups: one group of distance-based measures, and two groups of similarity measures that are respectively correlation based and non-correlation based.
Similarity measures are an important topic and have been extensively applied in fields such as decision making, pattern recognition, machine learning, and market prediction, using distance functions such as the Euclidean, Manhattan, Minkowski, and cosine distances. However, these distance measures come with limitations, which we discuss in this paper and aim to overcome in our research. The organization of this paper is as follows. First, we give a chronological overview of the similarity and distance measures used in similarity search and retrieval systems. Then, we investigate in detail the limitations of existing distances in word spotting. Finally, we propose a floating distance based on the combination of linear regression and the cosine distance, and we apply it to handwritten Arabic document images from Gallica.

OVERVIEW OF SIMILARITY AND DISTANCE METRICS
Over the years, different distance measures have been used to calculate the similarity between two data objects; the aim of these metrics is to find a distance function that allows a separation or classification between the elements of a data set [15, 16]. These distances operate directly in fields such as similarity search and retrieval systems, where performance depends on the chosen function. Here, we present several similarity distances commonly used in pattern recognition and word spotting systems.

Euclidean distance
The Euclidean distance (1) is the standard metric for geometrical cases. Between two data points x and y it is defined as:

d(x, y) = √( Σ_i (x_i − y_i)² )     (1)

i.e., the root of the squared differences between coordinates. It is the default metric in the k-means algorithm and the most widely used in clustering problems.
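As a minimal sketch (not the paper's MATLAB implementation; the function name is illustrative), the Euclidean distance can be computed as:

```python
import math

def euclidean(x, y):
    # Root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```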

Manhattan distance
The Manhattan or taxicab distance (2) is a metric that sums the absolute differences between the coordinates of two data points. Also known as the L1 or rectilinear distance, it is defined as:

d(x, y) = Σ_i |x_i − y_i|     (2)
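A matching sketch for the Manhattan distance (illustrative function name):

```python
def manhattan(x, y):
    # Sum of absolute coordinate differences (L1 / taxicab distance).
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan([0, 0], [3, 4]))  # 7
```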

Minkowski distance
The Minkowski distance (3), known as the generalized metric, can be used for data that are ordinal and quantitative. The distance of order p between two variables is defined as:

d(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)     (3)

It reduces to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1.
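A short sketch showing the generalization (the function name is illustrative); the two special cases p = 1 and p = 2 recover the previous distances:

```python
def minkowski(x, y, p):
    # Order-p Minkowski distance; p=1 gives Manhattan, p=2 Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski([0, 0], [3, 4], 1))  # 7.0  (Manhattan)
print(minkowski([0, 0], [3, 4], 2))  # 5.0  (Euclidean)
```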

Cosine distance
The cosine distance (4) is derived from the cosine of the angle between two vectors in an n-dimensional space:

cos(x, y) = ( Σ_i x_i y_i ) / ( √(Σ_i x_i²) · √(Σ_i y_i²) ),  d(x, y) = 1 − cos(x, y)     (4)

The cosine is the dot product of the two vectors divided by the product of their lengths; the corresponding distance is one minus this value. It is often used in information retrieval and text mining.
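A minimal sketch of the cosine distance as defined above (illustrative function name; assumes non-zero vectors):

```python
import math

def cosine_distance(x, y):
    # 1 minus the cosine of the angle between the two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)
```

Identical directions give distance 0; orthogonal vectors give distance 1.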

Jaccard distance
The Jaccard distance (5), a term coined by Paul Jaccard, measures the similarity of two data items. It is derived from the Jaccard coefficient:

J(A, B) = |A ∩ B| / |A ∪ B|,  d(A, B) = 1 − J(A, B)     (5)

This function is best suited to computing the similarity between small sets, and is often used in the recommendation field.
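A sketch for set-valued items (illustrative function name; assumes at least one set is non-empty):

```python
def jaccard_distance(a, b):
    # 1 minus |A ∩ B| / |A ∪ B| for two collections of items.
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```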

Chebyshev distance
The Chebyshev distance (6), also called the Tchebychev or maximum value distance, takes the largest absolute difference between the coordinates of two data points:

d(x, y) = max_i |x_i − y_i|     (6)
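And a matching sketch for the Chebyshev distance (illustrative function name):

```python
def chebyshev(x, y):
    # Maximum absolute coordinate difference (L-infinity distance).
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev([0, 0], [3, 4]))  # 4
```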

COMPARISON OF DISTANCES METRIC IN WORD SPOTTING SYSTEMS
The aim of query-by-example word spotting is to search for and localize the regions that are similar to a given query. Obtaining the best performance remains a challenge, because several tasks influence such a system: image preprocessing, feature extraction, similarity measurement, and classification. Our work focuses on the influence of the similarity distance in the context of handwritten document image retrieval, where many existing metrics can be adapted. We stress that the presented similarity measure is applicable to any problem where the sizes of the queries differ from one another.
In this part, we present a comparison of distance measures in a word spotting system. We use the Arabic handwritten document images from Gallica, the open-access digital library of the National Library of France. First, we divide the document images into a set of densely sampled local regions; these local regions are the basic structure used to spot words in a document. We extract all interest points from the images with the SIFT detector [17] and represent each interest point by its SIFT descriptor. Then, in the learning step, we select the first 4 document images, group all their descriptors into a bag of visual descriptors, and compute cluster centers with the k-means algorithm for different numbers of centers (100, 200, 300, 400). Finally, instead of representing each region by its raw descriptors, we encode it as a bag-of-visual-descriptors histogram [18, 19], assigning each descriptor in the region to the nearest visual descriptor in the codebook. Each region j is thus represented by a histogram of accumulated frequencies H_j = (h_{j,1}, h_{j,2}, …, h_{j,N}), where N is the number of visual descriptors in the codebook and h_{j,k} is the cumulative frequency of the k-th visual descriptor in the j-th region.
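The region-encoding step can be sketched as follows. This is a simplified illustration of nearest-center histogram encoding, not the paper's SIFT/k-means pipeline; `encode_region` and its arguments are hypothetical names, and descriptors/centers are plain coordinate lists:

```python
def encode_region(descriptors, codebook):
    # Assign each descriptor to its nearest codebook center
    # (squared Euclidean distance); the region becomes a
    # frequency histogram over the codebook.
    hist = [0] * len(codebook)
    for d in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        hist[nearest] += 1
    return hist

# Toy codebook with two centers and three descriptors.
codebook = [[0, 0], [10, 10]]
descriptors = [[1, 1], [9, 9], [0, 2]]
print(encode_region(descriptors, codebook))  # [2, 1]
```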
The process of the system is presented in Figure 2. To evaluate the effect of the similarity measure and determine the appropriate distance for word spotting, we vary the metric and compute the detection accuracy, measured by the F-score over different queries. The F-score is calculated as follows:

F = 2 · (precision · recall) / (precision + recall)

Table 1 shows the F-score results on the Gallica dataset; we observe that the F-score depends on both the codebook size and the distance metric used. With the cosine metric and a codebook of size 300 we obtain F-score = 0.78, against F-score = 0.62 for the Euclidean distance and F-score = 0.52 for the Chebyshev distance. The best result is therefore obtained with the cosine metric and 300 clusters.
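As a quick sketch, the F-score (harmonic mean of precision and recall) can be computed as follows (illustrative function name):

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f_score(0.5, 1.0))  # 0.666... (2/3)
```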

LINEAR REGRESSION MODEL SIMILARITY WITH APPLICATION TO HANDWRITTEN WORD SPOTTING

Analysis of cosine distance measure for word spotting
As shown in the previous part, the metric that gives the best result is the cosine distance. However, Table 1 shows that the size of the codebook directly influences performance. We therefore believe it is necessary to analyze the phenomena that appear when using a codebook in high-dimensional spaces; these phenomena can be explained by the curse of dimensionality, a term coined by Richard E. Bellman [20, 21]. To this end, the system presented in Figure 2 is used to evaluate in detail codebook sizes from 100 to 900 centers.
Then, to calculate the similarity between the histogram of each region of the document image and the query's histogram, we use the cosine dissimilarity defined by:

S = 1 − ( Σ_i H_{i,j} R_i ) / ( √(Σ_i H_{i,j}²) · √(Σ_i R_i²) )

where H_{i,j} is the occurrence count of the i-th codebook center in the j-th region and R_i is the occurrence count of the i-th codebook center in the query. To select the regions that are similar to the query, the dissimilarity S must fall below a certain threshold: for two perfectly similar regions the cosine equals 1, so S = 0. The lower the threshold, the more similar the retained regions.
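A minimal sketch of this dissimilarity and the threshold test used for matching (function names are illustrative; assumes non-empty histograms):

```python
import math

def cosine_dissimilarity(h, r):
    # S = 1 - cos(h, r); S = 0 when the two histograms coincide.
    dot = sum(a * b for a, b in zip(h, r))
    nh = math.sqrt(sum(a * a for a in h))
    nr = math.sqrt(sum(b * b for b in r))
    return 1.0 - dot / (nh * nr)

def is_match(region_hist, query_hist, threshold):
    # A region matches the query when S falls below the threshold.
    return cosine_dissimilarity(region_hist, query_hist) < threshold
```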
For each codebook size, we compute the F-score obtained with the best threshold for that size, as shown in Figure 3. We observe that the size influences performance: the best result is obtained for k = 300, and performance decreases beyond this size. Next, we study the influence of the codebook size on the threshold. When the codebook size N increases, the probability of finding an empty cell in the histogram of a given region increases, as shown in Figure 3: when N increases, the probability P(R_i) = 0 increases, and when N decreases it decreases, where R_i is the histogram of the query, H_{i,j} is the histogram of the j-th region of the document, and Σ_i H_{i,j} is the number of interest points in the j-th region. So when the histogram size N increases, the number of empty cells increases too; consequently, the threshold should take into account the size of the codebook (number of clusters). Figure 4 shows an example of query and region histograms. Experimentally, Figure 5 plots the best threshold for each codebook size; the threshold depends on the codebook size and can be represented by y = 0.000265x + 0.3993. At this stage, we must also study the influence of the number of interest points (number of descriptors n) on the threshold. When n increases, the probability of finding an empty cell in the histogram of a given region decreases: the probability P(R_i) = 0 decreases when n increases, and increases when n decreases. In fact, as the number of interest points n grows, the number of empty cells shrinks.
Consequently, the threshold should take into account the number of interest points (number of descriptors), because this number changes from one region to another. Experimentally, Figure 6 shows the threshold curve according to the number of descriptors; the threshold depends on the descriptors, and this dependence can be represented by y = −0.0009x + 0.5219. Under a fixed threshold, two regions are declared similar if S is below a given value; two regions with the same number of interest points would share the same threshold, yet this does not mean they are similar, since they may have different histograms and hence different similarity distances. To conclude this part, each word/region of a document has a certain number of interest points/descriptors, so a compromise between the number of descriptors and the threshold must be sought.
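The floating threshold as a function of the descriptor count can then be sketched directly from the fitted line reported above (the coefficients are those from the paper; the function name is illustrative):

```python
def floating_threshold(n_descriptors):
    # Fitted linear model: the threshold decreases as the number
    # of descriptors in the query/region grows.
    return -0.0009 * n_descriptors + 0.5219

print(floating_threshold(100))  # ~0.4319
```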

Combination of cosine-linear regression similarity
Regression focuses on the relationship between outcomes and inputs, and provides a model with some explanatory value. In our case, the inputs are the codebook size and the number of interest points, and the outcome is the threshold value that decides whether two regions are similar. Linear regression is a commonly used modeling technique in which the outcome variable is expressed as a linear combination of the input variables; for a given set of input values, the linear regression model provides the expected outcome.
The model is written as:

y = β₀ + β₁x₁ + … + β_{p−1}x_{p−1} + ε

where y is the threshold variable and x_j, for j = 1, 2, …, p−1, are the input variables; β₀ is the value of y when each x_j equals zero (offset); β_j is the change in y per unit change in x_j (coefficient); and ε ~ N(0, σ²), the errors being independent of each other. This random error ε is assumed to be normally distributed with zero mean and constant variance σ². The equation used is: T = −0.0009 · n + 0.5219 for N = 300.
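The coefficients of such a line can be recovered from (descriptor count, threshold) pairs by ordinary least squares. A minimal single-variable sketch, not the paper's implementation (function name is illustrative):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = b0 + b1 * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Synthetic points lying exactly on T = -0.0009 * n + 0.5219:
xs = [100, 300, 500, 700]
ys = [-0.0009 * x + 0.5219 for x in xs]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # recovers ~0.5219 and ~-0.0009
```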
However, the performance of regression analysis in practice depends on the type and form of the data and on how they relate to the chosen regression method. As shown in Figures 5 and 6, the threshold depends on the codebook size and on the number of descriptors of each word; the proposed method is a generic similarity measure that takes these parameters into account and imposes no condition specific to handwritten documents. The proposed similarity measure is evaluated on the word spotting task: the aim is to query Arabic handwritten documents under the query-by-example paradigm and to select/report the regions similar to the query.
We report in Figure 7 some queries for which the system automatically returns regions similar to the query, without any manual choice of the threshold, which is a problem in other systems [22, 23]. A filtering step then selects the single best result when confusable regions are returned, based on their similarity scores and positions. Table 2 compares the proposed method with the state of the art in terms of precision on the Gallica dataset; an improvement is observed with the proposed similarity distance over the other methods. Figure 7. The retrieved regions for some queries

CONCLUSION
In this paper, we presented a generic similarity measure for word spotting that takes into account the codebook size and the number of descriptors of the query, without imposing any condition specific to handwritten documents. A comparison with other methods shows that the proposed generic similarity measure gives excellent results in terms of precision. We tested our method with an experimental setup based on MATLAB code applied to the Gallica database.