Ontology engineering of automatic text processing methods

Currently, ontologies are recognized as the most effective means of formalizing and systematizing knowledge and data in a scientific subject area (SSA). Practice has shown that ontology design patterns are effective in developing ontologies of scientific subject areas. This is because the ontology of a scientific subject area, as a rule


INTRODUCTION
Ontologies are extensively used to formalize knowledge in scientific subject areas. An ontology makes it possible to ensure a uniform and consistent description of such an area as well as a convenient presentation of all the required concepts of the modeled domain. A scientific subject area (SSA) is understood as a subject area (SA) that covers a specific scientific discipline or field of knowledge, including its objects and subjects of research, their characteristics, and the research methods used. Numerous strategies and approaches have been suggested to speed up the time-consuming process of developing an ontology for any subject area [1]-[4]. Among them, an approach based on the application of ontology design patterns (ODP) is being intensively developed [5]-[8]. According to this approach, ODPs are documented descriptions of proven solutions to typical problems of ontology modeling [9]. They are developed to assist and streamline the creation of ontologies and to help developers avoid common mistakes in ontology modeling. Although using ontology design patterns reduces the need for human resources and raises the quality of the ontologies being created, currently only one methodology for building ontologies, the eXtreme design methodology [10] suggested within the NeOn project [11], openly declares the use of ODP. Note also that few ontology development tools support the use of ODP. These include a plugin for the NeOn project ontology development toolkit and a plugin for the WebProtégé ontology editor [12], [13]. However, these tools cover only part of the possible problems associated with patterns. There are no tools supporting the search, construction, and extraction of patterns from ontologies, and very few tools supporting the collection, discussion, and dissemination of patterns. To some extent, the latter include the catalogs of ontology design patterns [14]-[17], which are also being actively developed.
The paper considers an approach to implementing one kind of ontology design pattern, content patterns [18], which play an important role in the development of the ontology of modern methods of automatic text processing (ATP) proposed by the authors. The ontology of modern ATP methods includes both classical ATP methods and methods using machine learning. In [19]-[21], existing ontologies containing ATP methods were analyzed. At the moment, there is an ontology of machine learning [22], [23] that contains a small set of ATP methods based on machine learning. However, existing ontologies cannot give an idea of the whole variety of methods of this type. In addition, many new methods and models have recently appeared that are not yet reflected in previously developed ontologies. To systematize data and information resources and to organize meaningful access to them, the ontology of the subject area "automatic text processing" developed within the framework of this paper will be used, with the standards and tools of semantic web technologies serving as the software basis [24].

PROBLEM STATEMENT AND ONTOLOGY MAIN DEFINITIONS
Let the following be given: the SA ontology, the rules for replenishing this ontology, the syntactic and semantic model of the SA language, a dictionary of terms, and input data in the form of a finite natural language text containing information for replenishing the ontology. We consider that the ontology $O$ of a subject area includes the following elements: i) a finite non-empty set of classes $C_O$ that describe the concepts of the subject area; ii) a finite set of data domains $D_O$; and iii) a finite set of attributes with names from the set $Dat_O \cup Rel_O$, where the data attributes from $Dat_O$ take values from some data domain in $D_O$, and the values of the relationship attributes from $Rel_O$, which model relationships between classes, are instances of classes from $C_O$.

DEVELOPMENT OF ONTOLOGIES SUBJECT AREA "AUTOMATIC TEXT PROCESSING"
The ontology of the "automatic text processing" subject area, shown in Figure 1, includes a systematization of modern ATP methods and a specification of their properties, the relationships between them, their techniques, publications, and areas of application. Information on these methods can be systematized on the following bases: by purpose (the types of applied problems solved) and by areas of use. The core of the ATP ontology is formed by the ATP class, which defines the main properties of ATP methods, and its subclasses, which represent the types of problems solved using the methods. Such classes are machine translation, abstracting, annotation, sentiment analysis, rubrication, text classification and clustering, and building knowledge bases. To build the ontology and its initial content, a technique was used that develops ontologies from basic ontologies, which include only the most general entities that do not depend on a specific subject area, and from ODP [25], [26], which are documented descriptions of proven solutions to typical problems of ontology modeling. The use of such patterns not only improves quality but also greatly facilitates ontology development, since it makes it possible to involve experts in the modeled area who do not have ontology modeling skills. To assess the quality of the ontology, a methodology was developed [27], on the basis of which the involved experts carried out an experimental assessment of the created ontology, including an assessment of the degree of agreement among the experts. Metrics for evaluating various ontology properties that do not require expert work are also considered. As a result of the research, we propose a methodology for the development of an intelligent information resource of automatic text processing (IIR ATP); it offers the architecture and an algorithm for the development of IIR ATP. The principles and approaches underlying the methodology determine its main features: i) focus on semi-formalized software; ii) independence from software; iii) focus on the maximum use of ready-made developments (both the authors' own and third-party); iv) use of semantic web technologies, a service-oriented approach, and the development technologies of the information system supporting scientific and educational activities (ISSEA); v) use of the ISSEA shell as a framework for the future IIR ATP; and vi) openness and scalability of the proposed tools, together with their convenience and low entry threshold. The format for describing ATP methods is supplemented with elements that describe the context of the development and use of ODP. For these purposes, the ontology of ATP methods includes the following classes: scope, activity, task, publication, person, organization, and information resource. To associate methods with instances of these classes, the ontology of ATP includes relations that link ATP methods with subject areas, persons, organizations, and the projects in which they are used, as well as with the publications and information resources where they are described. The ontology describes most fully the ATP methods implemented in the proposed IIR ATP system [28] using the following ODP types: structural logical patterns, content patterns, presentation patterns, and lexico-syntactic patterns (LSP) [29]-[31].
The need for structural logical patterns arose from the lack of expressive means in the web ontology language (OWL) [32] for representing complex entities and constructions that are relevant in building ontologies, in particular, n-ary and attributed relations (binary relations with attributes), as well as ranges of valid values defined by the ontology developer. Pattern specialization can consist of renaming the pattern and specifying the names and values of its properties (attributes and relations). Figure 2 shows pattern specialization using the example of the structural logical pattern "binary attributed relation". The central place in this pattern is occupied by the auxiliary class "Relation with attributes", with which the base classes that model the arguments of the binary relation are associated through the relationships "is an argument" and "has an argument". The link labels in the pattern indicate that there must be exactly one argument of each kind. The attributes of a binary attributed relation are modeled by the properties "has an attribute" and "has an attribute from domain" of the class "Relation with attributes". In general, such a relation may have no attributes, as reflected in the labels of the links that represent those properties. The concretization (instantiation) of the pattern consists in substituting specific property values into it.
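The binary attributed relation pattern, together with attributes whose values come from a fixed domain, can be illustrated outside OWL with a short Python sketch. It is only an illustration under stated assumptions, not the OWL implementation used in the paper: the names `RelationPattern`, `AttributedRelation`, `specialize`, and `instantiate`, as well as the example relation "participates in" with its `role` and `since` attributes, are hypothetical and are not taken from Figure 2.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedRelation:
    """One instance of the auxiliary class "Relation with attributes"."""
    pattern: str
    source: str   # the single argument linked through "is an argument"
    target: str   # the single argument linked through "has an argument"
    attributes: dict = field(default_factory=dict)  # may be empty


@dataclass
class RelationPattern:
    """Binary attributed relation pattern (hypothetical Python rendering)."""
    name: str
    source_class: str
    target_class: str
    # attribute name -> None (free value, "has an attribute") or a frozenset
    # of allowed values ("has an attribute from domain")
    attributes: dict = field(default_factory=dict)

    def specialize(self, name, source_class, target_class, attributes=None):
        """Specialization: rename the pattern and fix its property names."""
        return RelationPattern(name, source_class, target_class,
                               dict(attributes or {}))

    def instantiate(self, source, target, **values):
        """Concretization: substitute specific property values."""
        for attr, value in values.items():
            if attr not in self.attributes:
                raise ValueError(f"unknown attribute: {attr}")
            domain = self.attributes[attr]
            if domain is not None and value not in domain:
                raise ValueError(f"{value!r} is outside the domain of {attr}")
        return AttributedRelation(self.name, source, target, dict(values))


# Generic pattern and a hypothetical specialization of it.
generic = RelationPattern("binary attributed relation", "Class1", "Class2")
participation = generic.specialize(
    "participates in", "Person", "Project",
    {"role": frozenset({"leader", "executor"}), "since": None},
)

rel = participation.instantiate("person-1", "project-1", role="leader")
bare = participation.instantiate("person-2", "project-1")  # no attributes
print(rel)
print(bare)
```

Reifying the relation as its own object mirrors the role of the auxiliary class in the pattern: the two arguments are ordinary fields, so exactly one of each is always present, while the attribute dictionary may stay empty.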

Figure 2. Binary attributed relationship pattern and its specialization
The pattern "area of allowable values" is intended to set the possible values of a class property when the whole set of values (usually strings) is known in advance and can be fixed at the development stage. Content patterns are designed to provide a uniform and consistent representation of the concepts used in ATP and their properties. Such patterns were developed for concepts that are typical for most SSA: subject of research, object of study, section of science, task, method, scientific result, project, activity, organization, person, publication, and information resource. For each of these patterns, a set of competency questions is defined. These questions are used to identify the mandatory and optional ontology elements of the pattern and to describe the requirements for them, which are expressed as restrictions and axioms. For each pattern representing an SA concept, a set of key attributes has been compiled that uniquely identifies a specific instance of the concept. Figure 3 shows the pattern for representing the concept "ATP methods". The elements of the pattern description are the mandatory ontology classes task, science section, organization, and person; the optional classes activity and scientific result; and the relations "solves", "used in", "implemented in", and "has an author". The pattern representing the concept "ATP methods" has one key attribute, "name". Examples of competency questions for the content pattern representing ATP methods are: "what is the name of the method?", "who is the author of the method?", "when was the method proposed?", "what problems are solved using the method?", "what activity uses the method?", "in what scientific results is the method implemented?", "who is using the method?", and "what organizations use the method?".
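How a content pattern constrains the description of a concept can be sketched in Python as a simple check of key attributes and mandatory links. This is a minimal sketch under assumptions, not the OWL restrictions and axioms of the paper. The pairing of relations with classes ("solves" with task, "has an author" with person, "used in" with activity, "implemented in" with scientific result) is inferred from the competency questions; the relation names "belongs to section" and "used by organization", the class `ContentPattern`, and the sample description are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContentPattern:
    concept: str
    key_attributes: tuple
    mandatory: dict   # relation name -> class of the linked instances
    optional: dict    # relation name -> class of the linked instances
    competency_questions: tuple = ()

    def check(self, description: dict) -> list:
        """Return the list of violations; an empty list means a valid description."""
        problems = []
        for key in self.key_attributes:
            if not description.get(key):
                problems.append(f"missing key attribute: {key}")
        links = description.get("links", {})
        for relation in self.mandatory:
            if not links.get(relation):
                problems.append(f"missing mandatory relation: {relation}")
        allowed = set(self.mandatory) | set(self.optional)
        for relation in links:
            if relation not in allowed:
                problems.append(f"unknown relation: {relation}")
        return problems


atp_method = ContentPattern(
    concept="ATP method",
    key_attributes=("name",),
    mandatory={
        "solves": "task",
        "has an author": "person",
        "belongs to section": "science section",   # hypothetical relation name
        "used by organization": "organization",    # hypothetical relation name
    },
    optional={"used in": "activity", "implemented in": "scientific result"},
    competency_questions=(
        "what is the name of the method?",
        "what problems are solved using the method?",
    ),
)

# Hypothetical description of one method, filled in by an expert.
description = {
    "name": "method A",
    "links": {
        "solves": ["sentiment analysis"],
        "has an author": ["person-1"],
        "belongs to section": ["computational linguistics"],
        "used by organization": ["organization-1"],
    },
}
print(atp_method.check(description))                   # no violations expected
print(atp_method.check({"links": {"solves": ["x"]}}))  # lists what is missing
```

Returning a list of violations rather than a boolean lets a data editor show an expert exactly which mandatory elements of the pattern are still unfilled.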

ARCHITECTURE OF AN INTELLIGENT RESOURCE BASED ON MODERN ATP METHODS
IIR ATP consists of the following components, as shown in Figure 4: an ontology of ATP methods, a repository of ATP methods, a repository of basic ontologies, a dictionary of scientific lexicon, data and ontology editors, and a subsystem for automatic ontology replenishment based on LSP. The repository of ATP methods is built on the basis of the ontology of ATP methods and includes implementations of ODP. Structural logical patterns, presentation patterns, and content patterns are implemented by means of the OWL language, while LSPs are represented in a specialized pattern description language [33].
The automated ontology building system (AOBS) supports methods of building an SSA ontology on the basis of basic ontologies that contain the most general concepts typical for most SSA. For this reason, the system includes a repository of basic ontologies: the scientific knowledge ontology, the scientific activity ontology, the basic ontology of problems, and the basic ontology of information resources [34]. All basic ontologies are described in the OWL language. Content patterns have been developed for the most important concepts of the basic ontologies and included in the ATP repository. The developed ontology model was implemented in the Protégé 5.5.0 ontology editor (Figure 5). For convenient use of ATP methods, the system includes a data editor that enables replenishing the SSA ontology by concretizing the content patterns included in the repository of ATP methods. The dictionary of scientific lexicon contains semantically marked terms used in scientific texts to describe the essence of various ATP methods. It is used to extract subject vocabulary from texts and automatically generate an SA dictionary, as well as for subsequent automatic text analysis using LSP. The subsystem of automatic ontology replenishment is intended to enter information extracted from natural language texts into the SSA ontology. For this purpose, it uses LSPs built on the basis of content patterns and the dictionary of general scientific lexicon. The intelligent information resource is designed to systematize information about modern methods of automatic text processing and to provide meaningful access to it. The work of the resource is organized on the basis of the ATP ontology, which is its conceptual basis.
The left side of Figure 6 shows the class hierarchy of the ATP ontology. The right side shows a description of an ATP method, which includes the name of the method, a description of its purpose, a link to the OWL view, a link to a graphical representation, a set of competency questions, and links to the projects in which it was developed and used. In addition, IIR ATP serves as the AOBS user interface that provides users with access to all repositories and editors that support the development of the SSA ontology, as well as to the subsystem of automatic ontology replenishment based on LSP.
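The idea of replenishing the ontology from text with lexico-syntactic patterns can be illustrated by a deliberately simplified Python sketch. The paper represents LSPs in a specialized description language [33] that is not shown here, so a regular expression is used as a stand-in; the pattern itself, the function `replenish`, the dictionary layout of the ontology, and the sample sentences are all assumptions made for illustration.

```python
import re

# Stand-in for one LSP: "The <method> method|model|algorithm is used for <task>"
# followed by a punctuation mark. A real LSP would also rely on the dictionary
# of scientific lexicon and on grammatical features.
LSP_SOLVES = re.compile(
    r"\bThe (?P<method>[\w-]+(?: [\w-]+)*?) (?:method|model|algorithm) "
    r"is used for (?P<task>[\w-]+(?: [\w-]+)*?)(?=[.,;])"
)


def replenish(text: str, ontology: dict) -> int:
    """Add Method and Task instances and "solves" links found in the text.

    Returns the number of new "solves" links added to the ontology.
    """
    added = 0
    for match in LSP_SOLVES.finditer(text):
        method, task = match["method"], match["task"]
        ontology["Task"].add(task)
        # "name" is the key attribute, so it identifies the method instance.
        instance = ontology["Method"].setdefault(
            method, {"name": method, "solves": set()}
        )
        if task not in instance["solves"]:
            instance["solves"].add(task)
            added += 1
    return added


ontology = {"Method": {}, "Task": set()}
text = (
    "The TextRank algorithm is used for abstracting. "
    "The BERT model is used for sentiment analysis, among other tasks."
)
new_links = replenish(text, ontology)
print(new_links, ontology)
```

Keying method instances by their name reflects the role of key attributes in the content pattern: running the same text through `replenish` a second time finds the existing instances and adds no duplicate links.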

CONCLUSION
This paper describes the ontology model of an intelligent information resource on modern methods of automatic text processing developed by the authors. The ontology systematizes information about the area of knowledge "automatic text processing" and provides IIR ATP with a single conceptual basis. The ontology design patterns used in this approach emerged from solving the ontology modeling problems that the authors encountered while developing ontologies for various scientific subject areas. The use of ontology design patterns makes it possible to provide a uniform and consistent representation of all the entities of the ontology of a scientific subject area, to reduce the number of errors in ontology modeling, to make the ontology easier for developers to understand, and thus to enable the collective development of ontologies. Since the use of ontology design patterns greatly simplifies the development of the ontology of a scientific subject area, it makes it possible to involve experts in that area who do not have ontology modeling skills, which can significantly speed up the development of the ontology. Our further research is aimed at the full-scale implementation of the subsystem for automatic ontology replenishment based on lexico-syntactic patterns.

Figure 1. Ontology of the subject area "automatic text processing"

Ontology engineering of automatic text processing methods (Zhanna Sadirmekova)

Figure 4. Architecture of the automated ontology building system

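Every class $c \in C_O$ is determined by a set of attributes: $c = (Dat_c, Rel_c)$, where each data attribute $\alpha \in Dat_c \subseteq Dat_O$ is mapped to a domain $d_\alpha \in D_O$ with values in $V_{d_\alpha}$, and every relationship attribute $\rho \in Rel_c \subseteq Rel_O$ takes as values instances of classes from $C_\rho \subseteq C_O$. The set of all attributes of class $c$ is denoted as $Atr_c = Dat_c \cup Rel_c$. For an attribute $\gamma$, its class is denoted as $c_\gamma$ and its set of values as $D_\gamma$. Among the class attributes, a non-empty set of key attributes $Atr^{k}_c \subseteq Atr_c$ is singled out, which can include both data attributes and relationship attributes. A tuple $a = (c_a, Dat_a, Rel_a)$ is an instance of class $c_a = (Dat_{c_a}, Rel_{c_a})$ ($a \in IC_O$) if and only if every data attribute in $Dat_a$ has a name $\alpha \in Dat_{c_a}$ with values $V_\alpha$ from $V_{d_\alpha}$, and every relationship attribute in $Rel_a$ has a name $\rho \in Rel_{c_a}$ with values $O_\rho$ that are instances of classes from $C_\rho$. Key data attributes are always single-valued, i.e. every key attribute in each ontology instance can have only one value. Key relationship attributes correspond to bijective relations. We consider ontologies without synonymous classes and data attributes, i.e. $\forall c_1, c_2 \in C_O : Atr_{c_1} \neq Atr_{c_2}$ and $\forall \alpha_1, \alpha_2 \in Dat_O : d_{\alpha_1} \neq d_{\alpha_2}$. Class $c_2$ inherits class $c_1$ if and only if $\forall \gamma \in c_2 : \gamma \in c_1$. The information content $IC_O$ of ontology $O$ is a set of instances of the classes of this ontology. The ontology replenishment problem is the computation of the information content of a given ontology from given input data. Here we define a set $I$ of information objects (i-objects) that are retrieved from the input data and correspond to ontology instances. Every information object $x \in I$ has the form $(c_x, Dat_x, Rel_x, G_x, P_x)$, where:
a. $c_x \in C_O$ is a class;
b. $Dat_x$ is a set of data attributes $\alpha_x = (\alpha, V_\alpha)$, where the name $\alpha \in Dat_{c_x}$;
c. $Rel_x$ is a set of relationship attributes $\rho_x = (\rho, O_\rho)$, where the name $\rho \in Rel_{c_x}$ and $O_\rho$ is a set of i-objects of a class $\bar{c} \in C_\rho$;
d. $G_x$ is grammatical information (morphological and syntactic features);
e. $P_x$ is structural information (a set of positions in the input data).
The set of all attributes of i-object $x$ is denoted as $Atr_x = Dat_x \cup Rel_x$. Every i-object corresponds in a natural way to some ontology instance: if $x = (c_x, Dat_x, Rel_x, G_x, P_x)$ is an i-object, then the corresponding ontology instance is $a' = (c_x, Dat', Rel')$. Every attribute $\alpha' \in Dat'$ has values $V_\alpha$. Every attribute $\rho \in Rel'$ has values $O_\rho$.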
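The definitions of classes, instances, and i-objects can be made concrete with a short Python sketch. It is an illustration under assumptions rather than the authors' implementation: domains are modeled as membership predicates, key attributes are additionally required to be present (the text only states that they are single-valued), and all names (`OntologyClass`, `Instance`, `IObject`, `is_instance_of_class`, `to_instance`) and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class OntologyClass:
    """A class c = (Dat_c, Rel_c) with its key attributes."""
    name: str
    data_attrs: dict[str, Callable[[object], bool]]  # name -> domain membership test
    rel_attrs: dict[str, set[str]]                   # name -> names of admissible classes
    key_attrs: set[str] = field(default_factory=set)


@dataclass
class Instance:
    """An instance a = (c_a, Dat_a, Rel_a)."""
    cls: OntologyClass
    data: dict[str, set] = field(default_factory=dict)   # name -> set of values
    rels: dict[str, list] = field(default_factory=dict)  # name -> list of Instance


@dataclass
class IObject:
    """An i-object x = (c_x, Dat_x, Rel_x, G_x, P_x)."""
    cls: OntologyClass
    data: dict[str, set] = field(default_factory=dict)
    rels: dict[str, list] = field(default_factory=dict)  # name -> list of IObject
    grammar: dict = field(default_factory=dict)          # grammatical information
    positions: list = field(default_factory=list)        # positions in the input data


def is_instance_of_class(a: Instance) -> bool:
    """Check the 'if and only if' condition for a being an instance of a.cls."""
    c = a.cls
    for name, values in a.data.items():
        if name not in c.data_attrs or not all(c.data_attrs[name](v) for v in values):
            return False
    for name, targets in a.rels.items():
        if name not in c.rel_attrs:
            return False
        if any(t.cls.name not in c.rel_attrs[name] for t in targets):
            return False
    # Assumption: each key attribute is present and has exactly one value.
    for key in c.key_attrs:
        if len(a.data.get(key, ())) + len(a.rels.get(key, ())) != 1:
            return False
    return True


def to_instance(x: IObject) -> Instance:
    """Map an i-object to its ontology instance, dropping G_x and P_x."""
    return Instance(
        cls=x.cls,
        data={n: set(v) for n, v in x.data.items()},
        rels={n: [to_instance(t) for t in ts] for n, ts in x.rels.items()},
    )


def is_str(value: object) -> bool:
    return isinstance(value, str)


task_cls = OntologyClass("Task", {"name": is_str}, {}, {"name"})
method_cls = OntologyClass("Method", {"name": is_str}, {"solves": {"Task"}}, {"name"})

task_obj = IObject(task_cls, {"name": {"sentiment analysis"}}, positions=[(10, 28)])
method_obj = IObject(method_cls, {"name": {"method A"}}, {"solves": [task_obj]},
                     positions=[(0, 8)])
method_inst = to_instance(method_obj)
print(is_instance_of_class(method_inst))
```

The conversion in `to_instance` keeps only the class, data attributes, and relationship attributes, which matches the correspondence between an i-object and its ontology instance described above.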