An adaptation of Text2Onto for supporting the French language

ABSTRACT

- Formal specification: expressed in logical form, readable and understandable by machines, thus favoring unambiguous interpretation, which excludes the use of natural language.
- Explicit specification: explicitly defines the concepts and the axioms that bind them.
- Conceptualization: an abstract and simplified representation of the studied domain. According to Lee et al., a conceptualization is "an extraction of a domain vocabulary and is an abstract and simplified view of the domain that we wish to represent" [7].
- Shared: it describes consensual knowledge. An ontology does not represent the viewpoint of an individual; it reflects a general consensus, accepted and shared by the members of a whole community.
In addition, we are currently witnessing the third industrial revolution: the digital revolution. The changes it brings are rapidly and radically transforming the modern era, and the field of data analysis is no exception. This is due, on the one hand, to technological advances in computing and, on the other hand, to progress in information processing models such as business intelligence [8], artificial intelligence [9], and big data [10].
However, much of the operational information system data in organizations remains unstructured and non-transactional. Indeed, according to Tseng & Chou [11], 20% of an organization's data is transactional, while the rest is non-transactional and often stored in textual documents in unstructured form. As a result, this data cannot be used appropriately and effectively in the era of information capitalization. One way to add value to this data is to extract from it a knowledge base in ontological form. Building such a base manually is laborious, tedious and time-consuming. To support this construction, several research projects have sought to automate the tasks involved. This automation is known as 'Ontology Learning,' which refers to all the techniques and methods whose objective is the automatic or semi-automatic construction of ontologies via extraction, acquisition, and generation from structured, semi-structured, or unstructured resources.
Many tools perform this extraction. Text2Onto stands out as one of the best-known tools in this field. It has been the subject of numerous comparative evaluations of ontology learning tools, such as [12,13], in which it has performed well. The original version of Text2Onto is intended for processing corpora written in English and relies on the Princeton WordNet (PWN) for English to perform some linguistic sub-processes, such as one variant of the algorithms intended for discovering SubclassOf relationships. Hence, this tool is limited to the processing of textual resources written in English. To overcome this problem, we present in this paper our approach for customizing and adapting the English-language version of the open-source Text2Onto tool to support the extraction of ontologies from French documents.

ONTOLOGY LEARNING
The notion of ontology is at the heart of the semantic web domain. Research addressing this notion has proliferated, as illustrated in Figure 1, which gives an overview of statistics on articles indexed in "Scopus.com" containing the term 'ontology.' These works cover one or more phases of the management of an ontology, such as its construction, its storage, and its exploitation. The construction of an ontology can be classified into two families: manual construction and semi-automatic/automatic construction. Manual construction is characterized by its reliability. Nevertheless, it is very tedious, very expensive and time-consuming.
The second family, referred to as 'Ontology Learning,' covers all approaches whose objective is the semi-automatic construction of ontologies via extraction, generation and acquisition. Within Ontology Learning, several approaches have emerged to support and facilitate this laborious construction. Among them, we distinguish ontology building from textual documents, dictionaries, knowledge bases, semi-structured data and relational databases. These approaches thus differ primarily in the data sources they consider.
Several methodologies have emerged to define a roadmap for such a construction. Among the best known are Methontology [14], On-To-Knowledge [15], DILIGENT [16] and NeOn [17]. The construction of an ontology brings together a considerable set of tasks, and automating this construction amounts to automating all or some of these tasks. The approaches for ontology construction from textual documents are grouped into several categories. Among others, we mention approaches based on linguistic techniques, on statistical techniques, and those based on concept learning.
In this context, one of the most popular ontology learning tools is Text2Onto [18]. This tool has been used in several research works. In fact, a Google Scholar search for the term 'Text2onto' returns 988 pages of 10 results each, taking into consideration that we have excluded patents and [...]. Moreover, our choice of Text2Onto is justified by its performance, confirmed by several research works, such as that of Toader Gherasim [12], who compared this tool with three other tools (OntoLearn, ASIUM and SPRAT), and that of Jinsoo Park et al. [13], who compared it with three other tools (DODDLE-OWL, OntoBuilder and OntoLT). Text2Onto provides the designer with several algorithms dedicated to the different tasks of ontology generation [18]. The ontology designer chooses, among these algorithms, those to use for each task. The designer's intervention allows the tool to be parameterized and the intermediate results to be refined and validated before the final ontology is generated.
Extracting an ontology from text is based on analysis primitives. These primitives belong to natural language processing and can be organized in a pyramidal model in the form of a layer 'cake,' as illustrated in Figure 2. Adapting Text2Onto to the French language requires, first of all, adapting these primitives.

TEXT2ONTO
Text2Onto [18] is a framework for learning ontologies from textual data according to a transparent process. Its algorithms are grouped into seven sets: "Concept," "Instance," "Similarity," "SubclassOf," "InstanceOf," "Relation," and "SubtopicOf." Among the algorithms of the "SubclassOf" group is WordNetConceptClassification, which is based on the hypernym structure of WordNet.
The designers of this tool implemented this algorithm in order to learn subclass-of relations by exploiting the hypernym relationship in WordNet. However, WordNet is an electronic lexical database for English. This constitutes a limitation for dealing with other languages with this tool, especially for learning subclass-of relations related to French.
In fact, P. Cimiano and J. Völker developed Text2Onto to overcome the limitations of other tools such as TextToOnto [20], ASIUM [21], Mo'k Workbench [22], OntoLearn [23] or OntoLT [24]. These limitations include dependency on a specific or proprietary ontology model, neglect of the interaction with the end user, and lack of consideration for changes in the data source. To overcome the first limitation, the authors relied on a level of abstraction in which the ontological structure is represented by a meta-model based on modeling primitives. This gives their model the ability to translate the learned ontology into RDFS, OWL or F-Logic.
To overcome the second limitation, the authors adopted a probabilistic ontology model (POM), in which results are presented to the end user together with confidence probabilities. Because the results are ordered by certainty and can be limited to those exceeding a confidence threshold, the interaction with the end user becomes effective. To overcome the third limitation, the developers built the tool following a data-driven design pattern, which lets it handle changes in a corpus adequately: corpus changes are detected, and the corresponding POM changes are calculated from them without recomputing the POM for the whole corpus.
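The ordering and thresholding of POM results described above can be illustrated with a minimal sketch. The `Relation` class, the function name and the sample values below are hypothetical and are not taken from the Text2Onto source code; the sketch only assumes that each learned element carries a confidence in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical POM entry: a learned relation with its confidence."""
    kind: str
    source: str
    target: str
    confidence: float


def rank_and_filter(relations, threshold):
    """Keep relations whose confidence reaches the threshold, most certain first."""
    kept = [r for r in relations if r.confidence >= threshold]
    return sorted(kept, key=lambda r: r.confidence, reverse=True)


# Illustrative values only, not results from the paper.
pom = [
    Relation("SubclassOf", "chien", "animal", 0.82),
    Relation("SubclassOf", "chat", "meuble", 0.12),
    Relation("InstanceOf", "Paris", "ville", 0.95),
]
shown = rank_and_filter(pom, threshold=0.5)
```

With these sample values, only the two relations above the threshold are shown to the user, the most certain one first.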

Linguistic preprocessing
To extract ontologies from a corpus, Text2Onto's authors relied on a combination of machine learning and natural language processing techniques. Text2Onto is based on the General Architecture for Text Engineering (GATE) framework, which gives the tool a certain flexibility for adaptation and customization. Linguistic preprocessing in Text2Onto begins with tokenization and sentence splitting. The result of this step is then used by a POS tagger, which assigns each token its syntactic category. Finally, preprocessing is completed by a morphological analyzer and a stemmer that perform lemmatization or stemming. Basic preprocessing therefore involves the following three steps:
- Tokenization and sentence splitting.
- Attribution of syntactic categories.
- Lemmatization or stemming.
After these steps, the annotated result is processed by the Java Annotation Patterns Engine (JAPE). This GATE module is used for surface analysis and for the identification of modeling primitives.
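As a rough illustration of how the three basic steps chain together, the following sketch runs them on a two-sentence English text. It does not use GATE: the regular expressions and the tiny `LEXICON` are hypothetical stand-ins for the real tokeniser, POS tagger and morphological analyzer.

```python
import re

# Hypothetical toy lexicon: token -> (POS tag, lemma). A real tagger and
# morphological analyzer would replace this lookup.
LEXICON = {
    "cats": ("NNS", "cat"),
    "are": ("VBP", "be"),
    "animals": ("NNS", "animal"),
    "dogs": ("NNS", "dog"),
    "bark": ("VBP", "bark"),
}


def split_sentences(text):
    # Step 1a: split after sentence-final punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Step 1b: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(token):
    # Steps 2 and 3: attach a syntactic category and a lemma to the token.
    pos, lemma = LEXICON.get(token.lower(), ("UNK", token.lower()))
    return {"token": token, "pos": pos, "lemma": lemma}


def preprocess(text):
    return [[annotate(t) for t in tokenize(s)] for s in split_sentences(text)]


result = preprocess("Cats are animals. Dogs bark.")
```

Each sentence becomes a list of annotated tokens, which is the kind of input a pattern engine such as JAPE then operates on.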

Concepts and instances
In the context of ontology learning, the extraction of terms that can represent concepts and relationships is based on two types of approaches: statistical approaches and linguistic approaches. Linguistic approaches are based on the analysis of the grammatical structure of the sentences that make up the text and on the identification of the syntactic (subject, direct object, etc.) and morpho-syntactic (noun, verb, adjective) roles of each word or group of words. Statistical approaches, in contrast, are based on calculating the relevance of a term in a corpus. In this context, the Text2Onto developers have implemented algorithms to calculate:
- Relative term frequency (RTF).
- Term frequency-inverse document frequency (TFIDF).
- Entropy.
- C-value/NC-value.
These values are normalized for each term and used as probabilities in the POM.
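To make the first two measures concrete, here is a minimal sketch of RTF and TFIDF over a toy tokenized corpus. The aggregation of TFIDF over documents (a sum) and the normalization (division by the largest score) are assumptions made for this example; the source only states that the values are normalized.

```python
import math
from collections import Counter


def relative_term_frequency(docs):
    """RTF: occurrences of a term divided by the total number of terms."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def tfidf(docs):
    """TFIDF summed over documents (the summation is an assumption)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, c in Counter(doc).items():
            scores[term] += (c / len(doc)) * math.log(n_docs / doc_freq[term])
    return dict(scores)


def normalize(scores):
    """Scale scores into [0, 1] by the largest value (an assumption)."""
    top = max(scores.values(), default=0.0)
    return {t: (s / top if top else 0.0) for t, s in scores.items()}


docs = [["ontologie", "concept", "ontologie"], ["concept", "relation"]]
rtf = relative_term_frequency(docs)
confidence = normalize(tfidf(docs))
```

In this toy corpus, "concept" occurs in every document, so its TFIDF score is zero, while "ontologie" receives the highest normalized value.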

SubclassOf relationships
In the original version of Text2Onto, the developers implemented three algorithms to extract SubclassOf relationships: PatternConceptClassification, VerticalRelationsConceptClassification and WordNetConceptClassification, following the approach proposed by P. Cimiano et al. [25]. These algorithms extract SubclassOf relationships from text written in English. To support French, we had to implement variants of these three algorithms adapted to the specifications of the free WordNet for French (WOLF) and to the particularities of French.

- The implementation of the PatternConceptClassification algorithm is based on natural language pattern-matching techniques, in particular on Hearst patterns, illustrated in Figure 3, which are lexico-syntactic patterns that signal hyponym and hypernym relationships.
- The implementation of the VerticalRelationsConceptClassification algorithm is based on the heuristic of "vertical relations" [26]. This algorithm is used for composite noun phrases.
- The implementation of the WordNetConceptClassification algorithm is based on the hypernym relationships from Princeton WordNet.

Figure 3. Hearst patterns

The normalized frequency of occurrences of a pattern represents the certainty of a relationship, in other words, the probability of a SubclassOf relationship.
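The pattern-based extraction and its frequency-based confidence can be sketched as follows. The two French patterns, the naive `singular` helper and the normalization by hyponym are hypothetical choices made for illustration; they are not the patterns of Figure 3 nor the authors' implementation.

```python
import re
from collections import Counter, defaultdict

# Hypothetical French Hearst-style patterns with named groups
# "hyper" (superclass candidate) and "hypo" (subclass candidate).
PATTERNS = [
    re.compile(r"(?P<hyper>\w+) (?:tels|telles) que (?:les |des )?(?P<hypo>\w+)"),
    re.compile(r"(?P<hypo>\w+) et (?:d')?autres (?P<hyper>\w+)"),
]


def singular(word):
    # Naive stand-in for lemmatization: drop a final plural "s".
    return word[:-1] if word.endswith("s") else word


def extract_pairs(sentences):
    """Count (hyponym, hypernym) pairs matched by any pattern."""
    counts = Counter()
    for sentence in sentences:
        for pattern in PATTERNS:
            for m in pattern.finditer(sentence.lower()):
                pair = (singular(m.group("hypo")), singular(m.group("hyper")))
                counts[pair] += 1
    return counts


def confidences(counts):
    """Normalize each pair count by all matches of the same hyponym."""
    per_hyponym = defaultdict(int)
    for (hypo, _), c in counts.items():
        per_hyponym[hypo] += c
    return {pair: c / per_hyponym[pair[0]] for pair, c in counts.items()}


corpus = [
    "Les fruits tels que les pommes sont sucrés.",
    "Les pommes et autres fruits se conservent au frais.",
    "Les produits tels que les pommes sont exportés.",
]
scores = confidences(extract_pairs(corpus))
```

Here "pomme" is matched twice as a kind of "fruit" and once as a kind of "produit", so the first candidate receives the higher confidence.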

InstanceOf relationships
The creators of Text2Onto adopted an approach based on the notion of similarity. First, they compute context vectors for the instances as well as for the concepts. Second, they compute the similarity between the vectors of the instances and those of the concepts using the skew divergence [27]. Finally, they assign each instance to the concept with the greatest similarity, as in [23].
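A minimal sketch of this assignment step is given below. It uses the usual definition of the skew divergence, s_alpha(q, r) = KL(r || alpha*q + (1 - alpha)*r); the value alpha = 0.99, the argument order (instance as r, concept as q) and the toy context counts are assumptions, not details confirmed by the source.

```python
import math


def to_distribution(counts):
    """Turn raw context counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def skew_divergence(q, r, alpha=0.99):
    """KL(r || alpha*q + (1 - alpha)*r); lower means more similar."""
    total = 0.0
    for feature, r_val in r.items():
        if r_val > 0:
            mix = alpha * q.get(feature, 0.0) + (1 - alpha) * r_val
            total += r_val * math.log(r_val / mix)
    return total


def most_similar_concept(instance_ctx, concept_ctxs, alpha=0.99):
    """Assign the instance to the concept with the smallest divergence."""
    r = to_distribution(instance_ctx)
    return min(
        concept_ctxs,
        key=lambda c: skew_divergence(to_distribution(concept_ctxs[c]), r, alpha),
    )


# Hypothetical context counts (verbs co-occurring with each term).
concepts = {
    "ville": {"habiter": 3, "visiter": 2},
    "personne": {"parler": 3, "manger": 2},
}
paris = {"visiter": 2, "habiter": 1}
assigned = most_similar_concept(paris, concepts)
```

Mixing a small share of r into q keeps the divergence finite even when the concept vector lacks a context that the instance has, which is the practical reason for preferring it over the plain KL divergence.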

ADAPTING TEXT2ONTO FOR THE SUPPORT OF THE FRENCH LANGUAGE
Adapting the source code of Text2Onto requires taking theoretical challenges and technical constraints into account, since each language has its own lexical and syntactic morphology. We begin the natural language processing of a corpus of French documents with a preprocessing step of elementary decomposition, whose purpose is to extract the terms that make up the sentences.
This phase treats a sentence as a sequence of symbols, the text being the set of all of them. Splitting the sentences of the text into such symbols is called tokenization. For this, we used, customized and adapted the processing resources offered by GATE. GATE has several types of resources, including processing resources, language resources, and visual resources. For this phase, we used and parameterized the default tokeniser, the default gazetteer and the SentenceSplitter resources from the CREOLE plugin (Collection of REusable Objects for Language Engineering).
Adapting the CREOLE processing tools requires customizing their parameters. Tokenization refers to the fragmentation of the text into tokens using grammatical rules. These tokens are the words, numbers and punctuation marks that constitute the elementary building blocks of the text. In this work, we implemented and used a French Tokeniser.rules file listing the grammatical rules for tokenizing text written in French. Each rule is composed of two parts: the left part is a regular expression to match, and the right part gives the annotations to assign. The gazetteer is used to annotate the proper names of countries, provinces, cities, organizations, days of the week, etc., based on files containing lists of these names. For example, the file 'country.lst' lists the names of countries, in our context written in French. The gazetteer uses an index file, 'lists.def', to access all the list files.
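The two-part structure of the tokenization rules and the gazetteer lookup can be mimicked in a few lines. This sketch does not reproduce the GATE rule syntax or the content of the authors' files: the rules, the in-memory stand-ins for 'lists.def' and 'country.lst', and the annotation names are all hypothetical.

```python
import re

# Each rule pairs a left-hand regular expression with right-hand annotations.
# Illustrative rules only, not the content of the French Tokeniser.rules file.
RULES = [
    (re.compile(r"[^\W\d_]+"), {"kind": "word"}),
    (re.compile(r"\d+"), {"kind": "number"}),
    (re.compile(r"[^\w\s]"), {"kind": "punctuation"}),
]

# In-memory stand-ins for the index file and one list file (hypothetical data).
LISTS_DEF = {"country.lst": "country"}
LIST_FILES = {"country.lst": ["France", "Maroc", "Allemagne"]}


def tokenize(text):
    """Apply the first matching rule at each position, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for pattern, annotation in RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append({"string": match.group(), **annotation})
                pos = match.end()
                break
        else:
            # No rule applies: emit the character as an unclassified symbol.
            tokens.append({"string": text[pos], "kind": "symbol"})
            pos += 1
    return tokens


def gazetteer_lookup(tokens):
    """Annotate single-token names found in any list referenced by the index."""
    entries = {
        name: list_type
        for filename, list_type in LISTS_DEF.items()
        for name in LIST_FILES[filename]
    }
    return [(t["string"], entries[t["string"]]) for t in tokens if t["string"] in entries]


tokens = tokenize("La France compte 13 régions.")
lookups = gazetteer_lookup(tokens)
```

Multi-word names and case variants are ignored here for brevity; a real gazetteer has to handle both.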
To segment a text into sentences, we use the 'Sentence Splitter' resource offered by GATE. This resource consists of a cascade of finite-state transducers. It uses gazetteer lists and JAPE patterns defining the grammatical rules for sentence structure, which we established for sources written in French. GATE provides a syntax-based tagger (POSTagger: Part-Of-Speech Tagger) for identifying, independently of the language, the syntactic category as well as the lemma of a given token. Within GATE, tree tagging is done using specific shell scripts named treeTaggerBinary. To execute these scripts, we use a Java wrapper shell script.
We have developed algorithms for processing textual resources in French, including the WolfNet concept classification algorithm, whose purpose is to extract relationships of the SubclassOf type. To learn subclass-of relations from French text, we implemented in Text2Onto variants of the algorithms that exploit the hypernym structure in WOLF. To access this hypernym structure, we developed JWOLF [28], a Java API based on the Java Architecture for XML Binding (JAXB) API for manipulating XML files.
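The idea of reading a hypernym structure from an XML wordnet can be illustrated with the sketch below. It is written in Python rather than with JAXB, and it is not JWOLF: the XML layout (element names `SYNSET`, `ID`, `LITERAL` and `ILR` with a `type` attribute), the synset identifiers and the sample entries are simplified assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed layout of a wordnet XML file (sample data only).
WOLF_SAMPLE = """
<WN>
  <SYNSET><ID>s1</ID><SYNONYM><LITERAL>chien</LITERAL></SYNONYM>
    <ILR type="hypernym">s2</ILR></SYNSET>
  <SYNSET><ID>s2</ID><SYNONYM><LITERAL>canidé</LITERAL></SYNONYM>
    <ILR type="hypernym">s3</ILR></SYNSET>
  <SYNSET><ID>s3</ID><SYNONYM><LITERAL>animal</LITERAL></SYNONYM></SYNSET>
</WN>
"""


def load(xml_text):
    """Index synset literals and hypernym links by synset identifier."""
    literals, parents = {}, {}
    for synset in ET.fromstring(xml_text.strip()).iter("SYNSET"):
        sid = synset.findtext("ID")
        literals[sid] = [lit.text for lit in synset.iter("LITERAL")]
        parents[sid] = [
            ilr.text for ilr in synset.iter("ILR") if ilr.get("type") == "hypernym"
        ]
    return literals, parents


def hypernyms(word, literals, parents):
    """Collect the literals of all direct and indirect hypernym synsets."""
    stack = [sid for sid, lits in literals.items() if word in lits]
    seen, found = set(stack), set()
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent in literals and parent not in seen:
                seen.add(parent)
                found.update(literals[parent])
                stack.append(parent)
    return found


literals, parents = load(WOLF_SAMPLE)
candidates = hypernyms("chien", literals, parents)
```

A SubclassOf candidate is then any pair (term, hypernym) in which both words also occur as concepts extracted from the corpus.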

DISCUSSION
Initially, adapting Text2Onto to process corpora of French documents seemed an easy task. However, as the work progressed, technical challenges kept arising and required painstaking effort. Because Text2Onto is built on GATE, it is tightly coupled to this architecture. This strong dependency requires learning and assimilating the concepts and notions on which GATE is built. The first lesson we learned is that it is sometimes better to build from scratch than to try to adapt a pre-existing solution, in particular if the solution is no longer supported or if no contacts are available to shed light on the adaptation approach. The second lesson is that, before customizing a solution, one should master the theoretical notions and the technical tools on which it is built.
With tenacity and determination, we met the challenge of adapting Text2Onto and produced a version able to generate an ontology from textual resources written in French. As shown in Figure 4, the results obtained exceeded our expectations. Indeed, when applying this version to a simple document, we found that it discovers relationships whose existence we had not suspected. In Figure 5, we give the graphical representation of the instances of the SubclassOf relation extracted and illustrated in Figure 4. Most surprising is the ability of this version to discover instances of this relationship that we had not even imagined, so much so that we relied on the features offered by the WOLF Browser [28] to verify the results.

CONCLUSION
In this paper, we presented our approach for adapting the Text2Onto tool to extract ontologies from corpora of textual documents written in French. Despite the obstacles encountered due to the lack of documentation and support, we produced an enhanced version of Text2Onto able to generate an ontology from a French corpus. Adapting this tool nevertheless proved extremely demanding, and although this version does extract an ontology from a corpus of French documents, its performance does not yet meet our expectations. This work has, however, opened the way to a more sophisticated version.
In future work, more effort is needed, on the one hand, to further improve the algorithms that we have adapted and, on the other hand, to optimize execution time and resource usage in terms of CPU consumption and memory occupation. This version also remains to be evaluated on a large scale. This contribution has broadened our horizons in the field of ontology learning. One of its auxiliary results is an investigation of the foundations on which Text2Onto is built and of the technical and methodological aspects of its development. This paves the way for an informed adaptation of the tool to support other languages.