Geographical queries reformulation using a parallel association rules generator to build spatial taxonomies

Geographical queries need a special reformulation process in information retrieval systems (IRS) because of their specificities and hierarchical structure, a fact that most web search engines ignore. In this paper, we propose an automatic approach for building a spatial taxonomy that models the notion of adjacency and is used to reformulate the spatial part of a geographical query. The approach exploits the top-ranked documents retrieved when submitting a spatial entity composed of a spatial relation and the name of a city. A transactional database is then constructed in which each extracted document is a transaction containing the names of the cities that share the country of the submitted city. The frequent pattern growth (FP-growth) algorithm is applied to this database in its parallel version (parallel FP-growth, PFP) in order to generate the association rules that form the country's taxonomy in a Big Data context. Experiments conducted on Spark show that query reformulation using the taxonomy built with the proposed approach improves the precision and the effectiveness of the IRS.


INTRODUCTION
Most human activities are located in geographical space. It is therefore not surprising that a large number of web documents contain geographical references. A study of the Excite search engine shows that one query in every five has a geographical context [1]. Web users searching for spatially located information often require geographically specific information, such as geographic terms in Web pages and user queries or even the user's location [2]. However, retrieval systems currently offer limited support for operationalizing a user's geospatial queries. Geographic information deals with physical objects that are in some cases hard to express in words and that are most of the time described by ambiguous terms. These arguments suggest that it would be very useful for search engines to take into account the spatial scope of geographical queries. Current search engines generally handle queries with a keyword matching approach, without inferring the geographical scope of the spatial terms. Moreover, the hierarchical structure that forms the geographical context and the personalized relationships between geographical objects are also ignored by most search engines, which provide access to a very large number of heterogeneous and distributed information resources. Thus, when the name of a place is typed into a typical search engine together with a spatial preposition (e.g. "near"), the engine retrieves web pages that include that name in their text instead of places that are close to the specified place, which are what the user intends. The first step in a spatial analysis of text is the annotation of spatial named entities.
Int J Elec & Comp Eng ISSN: 2088-8708  Geographical queries reformulation using a parallel association rules generator … (Omar El Midaoui) 2587
Several techniques have proved able to carry out this annotation, such as the works of [3,4], which perform this task using external resources named "gazetteers". As defined in the literature, a gazetteer is a dictionary or geographic directory whose entries are names of places. Each entry may be associated with information such as membership in one or more administrative structures (town, region and country), physical characteristics (mountain, river and road), statistical data, and a geometric representation expressed in a geographic reference system.
Other works propose categorizing these spatial named entities after identification. For example, Bouamor exploits the structure of documents [5] extracted from the collaborative encyclopedia "Wikipedia": named entities are identified using the title, and their categorization is based on the analysis of the first sentence of the description part or of the category part at the end of the article. Buscaldi and Rosso have also proposed a technique for categorizing spatial named entities using the thesaurus Geo-WordNet [6].
On the other hand, some approaches aim more particularly at the disambiguation of recognized place names [7]. Ambiguity can be understood as a word or a phrase having several meanings [8]. Two types of ambiguity must be handled [9]: a geo/non-geo ambiguity arises when the entity also has a non-geographical meaning, like the term "Turkey", and a geo/geo ambiguity occurs when the named entity refers to two different places, such as Rabat in Malta and Rabat in Morocco.
A hybrid approach is proposed in [10], which first locates the names of places but also searches for these terms in ontological resources to identify related, potentially geographic, terms. Domain-specific taxonomies [11] also play an important role in many applications, for improving search results [12] or helping with query reformulation [13,14]. More recently, other external resources have also been created and used, such as the geographic markup language (GML) data used in the geographical information retrieval model proposed by Fang and Zhang [15], which simplifies information acquisition by extracting and analyzing the attribute, spatial and structure features of the exploited GML data. This data is generated by special services and stored in semi-structured documents using GML, a coding specification for geographic information that is based on the extensible markup language (XML) and standardized as ISO 19136:2007. However, the hierarchical structure of geographic taxonomies and ontologies is their main strength, especially when considering the adjacency links between places [16].
In this paper, we propose a geographical taxonomy builder based on the parallel FP-growth algorithm (PFP), whose inputs are text documents, and we complete the process with a query reformulation approach for geographical queries. The thematic part of the query is improved using query expansion, a specific type of query reformulation method [17] that expands each search query with meaningful related concepts, captured from a manually or automatically constructed knowledge structure, so that the query represents its intent more clearly.
The expansion technique used to improve the thematic part of a geographical query in this work is based on the work of Nakade et al. [18], who propose a semantic query expansion method for retrieving relevant tweets. This method uses a thesaurus (from thesaurus.com) to find synonyms of the original search topics and reformulates a query by combining synonyms and search topics using parentheses and OR operators. According to an evaluation of this approach on a corpus of 35000 tweets, the overall retrieval performance was improved.
The proposed approach is tested on a collection created during our experiments, which contains 50 queries and 2500 documents. For the taxonomy building step, we used 1500 documents: 30 retrieved documents for each city in a list of the 50 most popular cities. For the evaluation of the reformulation step, we used 20 retrieved documents per submitted query (10 before and 10 after reformulation), i.e. 1000 documents for the 50 queries. Thus, 2500 documents were used in total. The documents of the collection were retrieved automatically using the Google web services whenever needed.
The main contributions of this approach are the use of the parallel FP-growth algorithm in a geographical context, by restricting the terms of the input documents to the absolute spatial entities of the country for which the taxonomy is to be created. The approach also includes a step that classifies a submitted query (geo/non-geo) and separates the geographical and thematic entities of a geographical query, so that the two entities can be reformulated in two different manners.
Section 2 introduces our proposed approach for constructing a geographical taxonomy of adjacency using the PFP algorithm, and section 3 explains our query reformulation technique. The results of our experiments are presented in section 4. Finally, section 5 draws conclusions and outlines future work.

THE GEOGRAPHICAL TAXONOMY BUILDER
A taxonomy consists of a number of names arranged in a hierarchical system that describes a specific domain [19]. A taxonomy starts from a general concept of a domain and associates with it terms that describe the domain more and more precisely while moving down the hierarchy. In this work, we introduce an automatic approach that builds a geographical taxonomy of adjacency. To this end, we exploit the best-ranked documents retrieved by the search engine when submitting the spatial part of a query. This spatial part contains a spatial relationship of adjacency and the name of the city for which we are constructing the branch of the taxonomy. The proposed approach is based on the parallel FP-growth algorithm.
The geographical query model used considers two types of spatial entities: absolute and relative spatial entities (ASE and RSE). Geographical named entities such as the city of "London" are well-known named entities and are defined as absolute spatial entities (ASE), while complex spatial entities such as "near London" are labelled as relative spatial entities (RSE).

The FP-growth algorithm
The FP-growth technique [20] is an association rule learning algorithm, where "FP" stands for frequent pattern. Given a dataset of transactions as input, the first step of the algorithm computes item frequencies and identifies the frequent items. It differs from the Apriori algorithm [21,22], designed for the same purpose, in its second step, which uses a prefix-tree structure, called FP-tree, to encode transactions without explicitly generating candidate sets, which are usually expensive to generate. After this step, the frequent item sets are extracted from the FP-trees.
FP-growth is a two-phase algorithm. The first phase constructs the FP-tree and the second mines frequent patterns from it. Constructing an FP-tree requires two scans of the database. The first scan selects the frequent items, which are then sorted by frequency in descending order to form a new structure called the F-list. The second scan constructs the FP-tree. First, while the database tuples are reordered according to the F-list, the non-frequent items are removed, based on the support value of every item (1); the frequent items are the items whose support value is greater than the minimum support threshold (minsup). Then the reordered transactions are inserted into the FP-tree. Considering a set of items A, with N the number of transactions in the database, f(A) the frequency of A in the database and P(A) the probability of occurrence of A in the same database, the support of A is calculated using expression (1):
$$\mathrm{support}(A) = P(A) = \frac{f(A)}{N} \qquad (1)$$
The constructed FP-tree is the input of the growth part of the algorithm. FP-growth traverses the nodes of the FP-tree, beginning with the least frequent item in the F-list. While traversing each node, FP-growth collects the items on the path from the node to the root of the tree. These collected items constitute the elements of the conditional pattern base of the current item in the F-list, defined as a small database of patterns that co-occur with this item. FP-growth then creates small FP-trees from the conditional pattern bases and re-executes the algorithm recursively on the new FP-trees until no conditional pattern base can be generated. Finally, the association rules used in decision support are generated from the resulting FP-trees. A rule is valid when its confidence (2) is greater than or equal to a fixed minimum confidence "minconf".
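As an illustration of the two scans described above, the following minimal Python sketch builds an F-list and an FP-tree for a toy database. It is not the implementation used in the experiments: the class `FPNode`, the function `build_fp_tree`, the inclusive threshold, the alphabetical tie-breaking and the toy transactions are all assumptions made for this example.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, minsup):
    n = len(transactions)
    # First scan: count the transactions in which each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    # F-list: frequent items in descending frequency order.
    # Assumption: the threshold is inclusive and ties are broken alphabetically.
    f_list = sorted(
        (item for item, c in counts.items() if c / n >= minsup),
        key=lambda item: (-counts[item], item),
    )
    rank = {item: r for r, item in enumerate(f_list)}
    # Second scan: drop non-frequent items, reorder, insert into the tree.
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return f_list, root


# Toy transactions (hypothetical data, for illustration only).
transactions = [
    ["rabat", "sale", "temara"],
    ["rabat", "sale"],
    ["rabat", "kenitra"],
    ["casablanca"],
]
f_list, root = build_fp_tree(transactions, minsup=0.5)
```

With these toy transactions, only the items that occur in at least half of the transactions remain in the F-list, and transactions sharing a prefix share a path in the tree, which is what makes the structure compact.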
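Considering A and B as sets of items, the confidence that A implies B is measured as (2):
$$\mathrm{confidence}(A \rightarrow B) = P(B \mid A) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} \qquad (2)$$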
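The two measures can be checked on a small example. The sketch below computes support and confidence directly from a list of transactions; the function names and the toy transactions are assumptions chosen for illustration and do not come from the experiments.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A); returns 0.0 when A never occurs."""
    sup_a = support(antecedent, transactions)
    if sup_a == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / sup_a


# Toy transactions (hypothetical data, for illustration only).
toy_db = [
    ["rabat", "sale", "temara"],
    ["rabat", "sale"],
    ["rabat", "kenitra"],
    ["casablanca"],
]
sup_rabat = support(["rabat"], toy_db)                    # 3 of 4 transactions
conf_rabat_sale = confidence(["rabat"], ["sale"], toy_db)  # (2/4) / (3/4)
conf_sale_rabat = confidence(["sale"], ["rabat"], toy_db)  # (2/4) / (2/4)
```

The example also shows that confidence is not symmetric: the two opposite rules between the same pair of items generally have different confidence values.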

The parallel FP-growth
The parallel FP-growth algorithm (PFP) works on distributed machines [23]. Its partitioning is done in such a way that each machine executes an independent group of mining tasks. This partitioning eliminates computational dependencies between machines, and thereby communication between them. Given a transaction database DB, the steps of the PFP algorithm are as follows:
- Sharding: splitting DB into successive parts and storing those parts on n different machines. Each resulting part is called a shard.
- Parallel counting: counting the support values of all items appearing in each shard. This step implicitly discovers the vocabulary of items, which is normally unknown for a huge database. The result of this step is the F-list.
- Grouping items: considering I the set of discovered vocabulary, splitting the |I| items appearing in the F-list into Q groups. The list of groups is called the G-list, and each group is given a unique group-id (gid). As the F-list and the G-list are both small, this step can be executed on a single node of the cluster in a few seconds.
- Parallelizing: selecting group-dependent transactions, to which the FP-growth algorithm is applied in order to build local FP-trees in parallel and grow their conditional FP-trees recursively.
- Aggregating: aggregating the results generated in the parallelizing step into the final result.
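The first four steps listed above can be sketched on a single machine as follows. This is a simplified simulation written for illustration, not the MLlib implementation: the function names, the round-robin assignment of items to groups, the minimum count and the toy database are assumptions. The selection of group-dependent transactions follows the usual PFP scheme, in which each reordered transaction is scanned from its last item and, for every group met for the first time, the prefix ending at that item is sent to the group.

```python
from collections import Counter


def shard(db, n):
    """Sharding: split db into successive parts of (almost) equal size."""
    size = -(-len(db) // n)  # ceiling division
    return [db[i:i + size] for i in range(0, len(db), size)]


def parallel_count(shards):
    """Parallel counting: each loop iteration stands for one machine."""
    total = Counter()
    for part in shards:
        total.update(Counter(item for t in part for item in set(t)))
    return total


def make_f_list(counts, min_count):
    """F-list: frequent items sorted by descending count, then by name."""
    frequent = (i for i, c in counts.items() if c >= min_count)
    return sorted(frequent, key=lambda i: (-counts[i], i))


def group_items(f_list, q):
    """Grouping items: G-list mapping each item to a group-id (round-robin)."""
    return {item: idx % q for idx, item in enumerate(f_list)}


def group_dependent_transactions(db, f_list, g_list):
    """Parallelizing (selection part): transactions needed by each group."""
    rank = {item: r for r, item in enumerate(f_list)}
    out = {}
    for t in db:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        seen = set()
        for pos in range(len(ordered) - 1, -1, -1):
            gid = g_list[ordered[pos]]
            if gid not in seen:
                seen.add(gid)
                out.setdefault(gid, []).append(ordered[:pos + 1])
    return out


# Toy database (hypothetical items, for illustration only).
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
shards = shard(db, 2)
counts = parallel_count(shards)
f_list = make_f_list(counts, min_count=2)
g_list = group_items(f_list, q=2)
per_group = group_dependent_transactions(db, f_list, g_list)
```

Each entry of `per_group` can then be mined independently with FP-growth, which is why no communication between machines is needed during the mining itself.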
PFP distributes the work of growing FP-trees based on the groups of transactions. Thus, this approach is more scalable than a single-node implementation. PFP is implemented in the machine learning library (MLlib) of Spark and takes three parameters: the minimum support threshold used to identify frequent itemsets, the minimum confidence for generating association rules, and the number of shards used to distribute the job.

Geographical taxonomy of adjacency
We consider a database whose transactions are documents and whose items are the cities of the country that contains the city of the user query. We propose to build a spatial taxonomy of adjacency (Figure 1) based on the PFP algorithm. The documents that form the input transactional database are restricted to the absolute spatial entities they contain. Thus, the items considered are the ASEs. After applying the PFP algorithm, and starting from the capital of the country for which the taxonomy is built, the fusion of all the generated FP-trees forms our geographical taxonomy.
Figure 1. A two-level taxonomy for the ASE0
Validation step: In this contribution, we also propose a validation step for each arc of the taxonomy. To validate an arc in the aggregating step, we verify whether its two ends (the two ASEs that form the arc) mutually generate each other in the FP-trees.
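The restriction of each document to its absolute spatial entities can be sketched as follows. This is a minimal illustration under stated assumptions: the function `document_to_transaction`, the case-insensitive whole-word matching, the short ASE list and the two toy documents are all invented for the example.

```python
import re


def document_to_transaction(text, ase_list):
    """Return the sorted ASEs of `ase_list` that occur in `text`."""
    lowered = text.lower()
    found = set()
    for ase in ase_list:
        # Whole-word, case-insensitive match of the city name.
        pattern = r"\b" + re.escape(ase.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.add(ase)
    return sorted(found)


# Hypothetical ASE list and documents, for illustration only.
ases = ["Rabat", "Sale", "Temara", "Casablanca", "Kenitra"]
documents = [
    "Hotels near Rabat and Sale, a short drive from Temara.",
    "Casablanca is larger than Rabat.",
]
transaction_db = [document_to_transaction(d, ases) for d in documents]
```

Every other term of the document is discarded, so the transactional database contains only city names and stays small even for long web pages.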
As shown by the example in Figure 1, ASE1 implies ASE0 (ASE1 → ASE0) and ASE0 implies ASE1 (ASE0 → ASE1). This means that these two absolute spatial entities are highly related, so we assume that they are geographically close to each other. In this case, the arc between the two ASEs is kept and the taxonomy evolves into a two-level taxonomy, as in Figure 1. In contrast, ASE0 implies ASE3 (ASE0 → ASE3), but ASE3 does not imply ASE0 (ASE3 ↛ ASE0), so this arc is not validated and is removed from the taxonomy. The double implication is considered a validation of the information generated by the descending implication.
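The validation by double implication can be sketched as a filter on the generated rules. The code below is one possible reading of the step, not the authors' implementation: rules are represented as (antecedent, consequent) pairs of single ASEs that already passed the minimum confidence, the taxonomy is grown breadth-first from the root, and the rule set (including ASE2 and ASE4) is hypothetical.

```python
def validated_children(rules, node):
    """ASEs b such that both node -> b and b -> node were generated."""
    return sorted(b for (a, b) in rules if a == node and (b, a) in rules)


def build_taxonomy(rules, root):
    """Grow the taxonomy level by level, keeping only validated arcs."""
    taxonomy, frontier, visited = {}, [root], {root}
    while frontier:
        node = frontier.pop(0)
        children = [c for c in validated_children(rules, node) if c not in visited]
        taxonomy[node] = children
        visited.update(children)
        frontier.extend(children)
    return taxonomy


# Hypothetical rule set, for illustration only.
rules = {
    ("ASE0", "ASE1"), ("ASE1", "ASE0"),  # mutual: arc kept
    ("ASE0", "ASE2"), ("ASE2", "ASE0"),  # mutual: arc kept
    ("ASE0", "ASE3"),                    # one direction only: arc removed
    ("ASE1", "ASE4"), ("ASE4", "ASE1"),  # mutual: second-level arc
}
taxonomy = build_taxonomy(rules, "ASE0")
```

In this example ASE3 never enters the taxonomy because the reverse rule is missing, while ASE4 appears at the second level under ASE1.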

REFORMULATION OF A GEOGRAPHIC QUERY USING A TAXONOMY
In order to reformulate a geographical query, we first separate the different components of the query based on the approach of geographic information extraction (GIE) proposed in [24]. This approach utilizes a methodology of semantic annotation for the detection of geographical markers: first, the absolute spatial entity is detected and annotated. Then the spatial entity (SE) is constructed considering this ASE and a lexicon of spatial relations. What remains of the query is marked as its thematic entity (TE).
We also contribute a modification to this step: we adapted the cited GIE approach based on the following hypothesis. Hypothesis. If the spatial relation is not present in the query, the occurrence of an ASE does not mean that the query has a geographical intent. An example is a query containing "George Washington". Another is a query searching for "Hôtel de Paris": here, "Paris" is part of the name of a hotel that is located in Tangier, Monte Carlo or Monaco.
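A simplified version of this separation and of the hypothesis can be sketched as follows. It is only an illustration, not the GIE approach of [24]: the function `split_query`, the small lexicon of spatial relations and the ASE list are assumptions, and a query is treated as geographical only when a spatial relation is immediately followed by an ASE.

```python
import re


def split_query(query, spatial_relations, ase_list):
    """Return the TE, SR and ASE of a geographical query, or None otherwise."""
    # Longest entries first, so that multi-word relations and names win.
    for sr in sorted(spatial_relations, key=len, reverse=True):
        for ase in sorted(ase_list, key=len, reverse=True):
            pattern = r"\b" + re.escape(sr) + r"\s+" + re.escape(ase) + r"\b"
            match = re.search(pattern, query, flags=re.IGNORECASE)
            if match:
                rest = query[:match.start()] + " " + query[match.end():]
                return {"te": " ".join(rest.split()), "sr": sr, "ase": ase}
    # No spatial relation followed by an ASE: not handled as geographical.
    return None


# Hypothetical lexicon and ASE list, for illustration only.
relations = ["near", "close to", "près de"]
ases = ["Rabat", "Paris", "Washington"]

geo = split_query("restaurants près de Rabat", relations, ases)
non_geo_1 = split_query("Hôtel de Paris", relations, ases)
non_geo_2 = split_query("George Washington biography", relations, ases)
```

The two last queries contain an ASE but no spatial relation, so they are left untouched instead of being reformulated spatially.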
After separating the different entities of the user query, we continue by interpreting the spatial relationship contained in the spatial entity of the query. The interpretation is done using a lexicon of adjacency spatial relations, and the reformulation process depends on its result. If the spatial relation detected in the query is a relation of adjacency, we reformulate the spatial part of the query using the country's taxonomy [12] and the thematic part using a semantic expansion method. Our query reformulation approach is inspired by the work of Nakade et al. [18]. Logically, a query that contains a relation of adjacency means that the user intends to retrieve places that are around the ASE of the query. Thus, we propose to eliminate the entire spatial entity and to replace it with the direct child-node items (CNIs) of the query's ASE in the geographical taxonomy, as follows:
- User query = TE SR ASE
- Reformulated query = expanded TE + "CNI1" or "CNI2" or …
In the reformulated query, quotes are used so that the search engine looks for the desired place as a whole rather than for the separate words of its name when the ASE is composed of several terms (e.g. submitting New York unquoted can lead the search engine to search for New and York as two independent terms). Moreover, the Boolean operator 'or' is used to ensure that the retrieval returns documents that include, for example, "CNI1" or "CNI2" or both of them, and so on for all the child nodes used to reformulate the query.
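The reformulation pattern above can be sketched as a small function. This is an illustrative sketch only: the function `reformulate`, the dictionary-based taxonomy and synonym table, and the placeholder child nodes "CNI1" and "CNI2" are assumptions, and the operator is written in upper case (OR), as in the expansion method of [18].

```python
def reformulate(te, ase, taxonomy, synonyms=None):
    """Build: expanded TE + "CNI1" OR "CNI2" OR ... for the given ASE."""
    synonyms = synonyms or {}
    expanded = []
    for term in te.split():
        alternatives = [term] + synonyms.get(term, [])
        if len(alternatives) > 1:
            # Synonyms are grouped with parentheses and OR operators.
            expanded.append("(" + " OR ".join(alternatives) + ")")
        else:
            expanded.append(term)
    # The spatial entity (SR + ASE) is dropped and replaced by the quoted
    # direct child nodes of the ASE in the taxonomy.
    places = " OR ".join(f'"{child}"' for child in taxonomy.get(ase, []))
    return " ".join(expanded + ([places] if places else []))


# Hypothetical taxonomy branch and synonym table, for illustration only.
taxonomy = {"Rabat": ["CNI1", "CNI2"]}
synonyms = {"hotel": ["lodging"]}
new_query = reformulate("hotel", "Rabat", taxonomy, synonyms)
```

When the ASE has no child node in the taxonomy, this sketch returns only the expanded thematic part; how such queries should be handled is left open here.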

EXPERIMENTATION RESULTS
To apply the proposed approach, we used a lexicon of spatial relationships and a database of validated ASEs associated with their countries. To test the performance of the proposed taxonomy building technique, we took our country, Morocco, as an example. In order to use web pages created by Moroccans themselves, we performed our tests in French. "Rabat", the capital of Morocco, is the ASE that we took as the root of our taxonomy. The search engine used in our experiments is the Google web service.
We applied our method to a transaction database constructed by iterating over the list of Morocco's ASEs (a selected list of 50 cities and villages of Morocco). For every ASE, we selected the first thirty web pages retrieved when submitting a relative spatial entity containing the current ASE. As a pre-processing step, we removed accents from the extracted documents to minimize the matching gap between ASEs caused by the different ways in which the authors of the documents write city names, because we had previously noticed that the mismatching problem arises particularly for names that contain accents [25]. Then, we varied the SR of the submitted spatial entities to verify whether the SR influences the performance of the proposed approach. The spatial relations used in this test step are those of Table 1. First, the five top-ranked documents were extracted for the ASE Rabat associated with every spatial relationship of Table 1. A database (DB) containing 35 transactions (seven spatial relations with five documents each) was constructed from these documents. The parallel FP-growth algorithm was applied to this DB, and association rules were generated between Rabat and every Moroccan ASE that co-occurs with it in the database. After that, we varied the minimum support from 0.2 to 0.8 without the validation step, with the minimum confidence fixed to 0.6, as shown in Table 2. We then computed the error rate and the number of rules generated in each case.
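The accent-removal pre-processing mentioned above can be sketched with the Python standard library. The function name `strip_accents` is an assumption; the technique shown, Unicode decomposition followed by the removal of combining marks, is one common way to do it and is not necessarily the one used in the experiments.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same city name written with and without accents now matches.
same_city = strip_accents("Salé") == strip_accents("Sale")
relation = strip_accents("près de")
```

Applying the same function to both the documents and the ASE list keeps the two sides of the matching consistent.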

From Table 2, we notice that with minsup=0.8 the algorithm returns no results in some cases and otherwise gives 1 or 2 answers. The same holds for minsup=0.6, which does not exceed 2 correct answers. The value 0.2 generally gives a high error rate and sometimes returns a very high number of responses, up to 22 resulting ASEs in the case of SR 1, of which only 6 are correct adjacent ASEs. Thus, we favored a minimum support of 0.4 because it gives the best trade-off between a minimal error and an acceptable number of answers. The next step of the experiments compares the cases with and without the validation step when aggregating the generated FP-trees to build the taxonomy of adjacency, based on a minimum support of 0.4 and a minimum confidence of 0.6.
Comparing the results with and without validation, we note that the error rate decreases when using the validation step, with the exception of SR 4: of its 3 results, which include 2 correct ASEs, validation eliminated one of the correct ASEs and kept the erroneous one. For SR 3, the only ASE obtained without validation was eliminated by the validation step. In general, we conclude that the validation step reduces errors sufficiently.
To minimize the error rate while keeping as many correct results as possible (so that the validation step eliminates only the erroneous ASEs), we propose to compute the average of the two supports of the opposite rules (e.g. ASE1→ASE2 and ASE2→ASE1). Table 3 shows that using the average support solves the problems mentioned for SR 3 and SR 4.
Table 3. The error rate and the number of correct rules generated with and without the validation step and using the average of support between the two cases, varying the spatial relation used for item sets containing the ASE "Rabat" with a minsup of 0.4
Comparing the seven spatial relations, we favor SR 7, "près de", which gives the most interesting result, with 0% error and six correct ASEs as child nodes of Rabat in the taxonomy of adjacency (Figure 2). We also varied the value of the minimum confidence and noticed that the best results are obtained with the initial value of 0.6. Using these favorable conditions, we continued the construction of Morocco's taxonomy with a minimum support of 0.4, a minimum confidence of 0.6 and the average of support for validating links, as shown in Figure 3. To build this taxonomy, we used 50 Moroccan ASEs and retrieved the first 30 documents returned when submitting each ASE with the selected relationship. Thus, 1500 documents were used in this test. We also used different numbers of nodes in order to evaluate the effectiveness of our parallel algorithm. The baseline is one node with two cores at 2.3 GHz and 8 GB of memory. Table 4 shows the total execution time of our technique while varying the number of nodes from 1 to 4; consequently, the number of cores varies from 2 to 8 and the RAM capacity from 8 GB to 44 GB.
Table 4 shows that, in terms of execution time, the effectiveness of our parallel technique increases with the number of cores. Relative to the baseline, we obtained a speedup of 3.13 when using 4 nodes. However, Figure 4 shows that this improvement is not proportional to the number of nodes used. Thus, we expect the efficiency of the algorithm to stagnate at some level.
In order to evaluate the precision of the results of our approach and confirm the results of the preceding tests, we proposed 50 geographical queries that were submitted to the Google search services without reformulation, with reformulation using the taxonomy built with the PFP algorithm, and with reformulation using the geographical query expansion method (GQEM) proposed in [26], a natural language processing (NLP) method that modifies and/or expands both the thematic and geospatial parts.
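The sub-linear speedup reported for Table 4 can be related to Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n the number of processing units. The sketch below uses only the speedup of 3.13 on 4 nodes stated above; treating each node as one processing unit, and the function names themselves, are assumptions made for this illustration, and the estimated fraction is not a figure reported in the experiments.

```python
def parallel_efficiency(speedup, nodes):
    """Speedup divided by the number of nodes (1.0 means linear scaling)."""
    return speedup / nodes


def amdahl_speedup(p, nodes):
    """Amdahl's law: speedup for a parallel fraction p on `nodes` units."""
    return 1 / ((1 - p) + p / nodes)


def amdahl_parallel_fraction(speedup, nodes):
    """Invert Amdahl's law to estimate p from an observed speedup."""
    return (1 - 1 / speedup) / (1 - 1 / nodes)


efficiency = parallel_efficiency(3.13, 4)
p_estimate = amdahl_parallel_fraction(3.13, 4)
# Upper bound on the speedup if p stayed constant with unlimited nodes.
speedup_limit = 1 / (1 - p_estimate)
```

Under these assumptions the efficiency on 4 nodes is below 1, and the bounded `speedup_limit` is consistent with the expectation that the gain stagnates as nodes are added.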
We compared the Precision at 10, the Mean Average Precision and the execution time of the newly proposed approach with two cases: queries without reformulation and queries reformulated using GQEM. The performance is presented as a percentage in order to show the enhancement in precision and effectiveness brought by the proposed approach. The percentage of improvement of the execution time was measured based on Amdahl's law.
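For reference, the two precision measures can be computed from binary relevance judgments of a ranked result list, as in the sketch below. The function names and the toy judgments are assumptions; in particular, average precision is normalized here by the number of relevant documents retrieved, which is a simplification adopted because the total number of relevant documents on the web is unknown.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the first k results that are relevant (1) or not (0)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of the precision values measured at each relevant result."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Average of the average precision over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


# Toy relevance judgments for two queries (hypothetical data).
query_1 = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
query_2 = [0] * 10
p10 = precision_at_k(query_1)
ap = average_precision(query_1)  # (1/1 + 2/3 + 3/4) / 3
map_score = mean_average_precision([query_1, query_2])
```

Precision at 10 ignores the order of the first ten results, whereas average precision rewards rankings that place relevant documents early.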
From Table 5, we notice that the approach presented in this manuscript gives an interesting improvement in the precision of the geographical queries used in our experiments. The improvement in execution time of the new approach over GQEM is expected, since it is due to the parallelisation of the process. In order to evaluate our approach on other geographical places, we conducted tests on cities in France using the same parameter values, taking the capital Paris as a starting point. Applying our adjacency taxonomy builder to 120 documents in order to retrieve places adjacent to Paris gives the results illustrated in Table 6. As shown in Table 6, 100% of the results are correct. This is due to the high precision and the double validation of our approach.

CONCLUSION AND FUTURE WORKS
In this paper, we proposed a new method for constructing geographical taxonomies of adjacency using the parallel FP-growth algorithm, and a technique for reformulating geographical queries that contain a spatial entity of adjacency. We tested the taxonomy builder by forming Morocco's taxonomy of adjacency. During our experiments, we varied the minimum support threshold and the spatial relationship used in order to find the parameters that extract the most appropriate frequent item sets and association rules. We then constructed the Moroccan taxonomy using a minimum support of 0.4, a minimum confidence of 0.6 and the spatial relation "près de", because these conditions gave the best results during our experiments. The proposed reformulation technique was tested on 50 queries with a geographical intent and thematic entities from different fields. These queries were reformulated based on the spatial taxonomy of adjacency. Finally, we compared the retrieved results using the Precision at 10, the mean average precision and the execution time. The results show that reformulation based on our proposed approach, using a small number of reformulation terms, improved the values of these indicators significantly. Considering the experimental results, we conclude that the presented method is an efficient way to interpret queries containing a spatial entity of adjacency and to improve their results and effectiveness. The enhancement brought by the proposed method is due to the use of a data mining approach based on an association rules technique, the processing of geographical data in a personalized manner, the consideration of the hierarchical structure of this data through taxonomies, and the parallelisation of the process in order to minimise the execution time.
As future work, we intend to propose a new method of geographical query reformulation, based on big data technologies and an in-depth analysis of users' behaviours through a study of a search engine's traces.