Building a Domain Ontology in the Process of Linguistic Analysis of Text Resources

An ontology is a formalized representation of a problem area (PrA). A representation of the PrA in the form of a domain ontology is often used as a knowledge base in the development of intelligent software systems. Building an ontology is a complex process that requires an expert in the PrA. A large number of researchers are working to solve this problem. The basis of our approach is a pipeline of different linguistic methods of text analysis. A set of rules that we developed is used to build an ontology based on the content analysis of a text resource. This article describes the method of building a domain ontology based on the linguistic analysis of the content of text resources, presents an example of the proposed approach, and presents the architecture of our pipeline.


Introduction
Currently, methods of artificial intelligence are used to solve various problems in the field of business process automation. They allow intelligent systems to solve intellectual tasks at a level close to that of a human. To solve such tasks successfully, intelligent systems must have knowledge about the PrA. The methods of knowledge engineering make it possible to describe the features of the PrA in the form of a domain ontology [1][2][3][4][5][6].
At present, ontologies are formed by experts in the problem area (PrA). The expert must have skills in ontology engineering and a good understanding of the specifics of the particular PrA. Building an ontology is a long and complex process.
The main drawback of domain ontologies is that they must be developed further and updated whenever the PrA changes. Knowledge extraction is carried out to extend the ontology, using semi-automatic methods for transforming unstructured, semi-structured and structured data into conceptual structures. There are currently several directions for building an ontology:
• extraction of knowledge from Internet resources (in particular, wiki-resources);
• analysis of dictionaries and thesauri;
• merging of different ontological structures;
• extraction of terminology in the process of text processing using statistical and linguistic methods.
Thus, the task of automatically building ontologies based on the analysis of the contents of text resources is currently relevant.
A large number of studies are devoted to the automatic building of a domain ontology based on the analysis of the content of wiki-resources. A wiki-resource is a website whose structure and content can be modified by using a special markup language. Users do not need additional tools or IT skills to work with wiki-resources. Different wiki-resources may therefore be used as data sources for building ontologies, as they contain knowledge of various PrAs and are freely available for use. One approach is the formation of classes and relations of the ontology on the basis of an analysis of the structure of wiki-resources [7][8][9][10][11].
For example, the YAGO project used data from Wikipedia and from the WordNet semantic network for the automatic building of a domain ontology. The ontology was built on the basis of the hierarchy of Wikipedia pages and information from infoboxes, and then expanded based on WordNet data. As can be seen, the contents of the pages of wiki-resources are almost not taken into account; instead, various widely available thesauri are used.
We believe that analyzing the content of wiki-resources will increase the completeness of the description of the PrA in the form of a domain ontology. An ontology can also be built on the basis of an analysis of the contents of a set of text documents. The idea of our approach is to use existing methods of linguistic analysis to construct a syntactic tree of a sentence. Then, using a set of rules, the syntactic tree can be translated into a semantic tree. A semantic representation of a text in natural language (NL) is the most complete representation that can be achieved by linguistic methods alone.
The domain ontology can be built from the semantic trees extracted from the content of text resources. This requires a method of translating a syntactic tree into a semantic tree.

A Method of Translating a Syntactic Tree Into a Semantic Tree
To construct the semantic tree, it is necessary to determine the syntactic structure of the sentence in NL. There are several parsing tools for texts in Russian, for example [21][22][23][24]:
• Lingo-Master;
• Treeton;
• DictaScope Syntax;
• ETAP-3;
• ABBYY Compreno;
• Tomita-parser;
• AOT, etc.
In our work, the results of the AOT project were used for constructing a syntactic tree. Consider the application of the algorithm of translating a syntactic tree into a semantic tree using the following test sentence in Russian: "Онтология в информатике - это попытка всеобъемлющей и подробной формализации некоторой области знаний с помощью концептуальной схемы".
An English translation of the test sentence is used to make the algorithm easier to follow: "Ontology in informatics is an attempt at comprehensive and detailed formalization of a certain field of knowledge with the help of a conceptual scheme".
The resulting syntactic tree of the test sentence is shown in Figure 1.
Formally, the function of translating a syntactic tree into a semantic tree takes the nodes $N^{Synt}_{li}$ of the syntactic tree and the translation rules $P_j$ and produces the set $\{N^{Sem}, R^{Sem}\}$, where:
• $N^{Synt}_{li}$ is the $i$-th node of the $l$-th level of the syntactic tree. For example, in the syntactic tree in Figure 1 the first node of the first level is the node "ontology", the second is "pg", the third is "is", etc. A node of the syntactic tree can be a member of the sentence, for example the node "ontology", or a syntactic label that defines the constituent members of the sentence, for example "pg" (the prepositional group);
• $P_j$ is the $j$-th rule for translating the nodes of the syntactic tree. The nodes of the syntactic tree are translated into nodes and relations of the semantic tree. A rule is a collection of several words (units) united according to the principle of semantic-grammatical-phonetic compatibility. Formally, a rule is a set of $K$ units corresponding to a set of nodes of the syntactic tree, where $K$ is the number of units in the rule. The rule is applied only if all of its units match. Examples of rules and the results of their use are presented in Table 1;
• $\{N^{Sem}, R^{Sem}\}$ is the set of nodes $N^{Sem}$ and relations $R^{Sem}$ of the semantic tree obtained as a result of translating the syntactic tree into a semantic tree.
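The requirement that a rule is applied only when all of its units match can be illustrated with a minimal Python sketch. The `Rule` class, its field names and the `adj_noun` example are illustrative assumptions and not the authors' implementation; only the relation name `hasAttribute` and the "*" prefix of syntactic labels are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject node, relation, object node)


@dataclass
class Rule:
    """Hypothetical rule P_j: K units, each a predicate over one syntactic node."""
    name: str
    units: List[Callable[[str], bool]]
    build: Callable[[List[str]], List[Triple]]

    def apply(self, nodes: List[str]) -> Optional[List[Triple]]:
        # The rule fires only if there are exactly K nodes and every unit matches.
        if len(nodes) != len(self.units):
            return None
        if not all(unit(node) for unit, node in zip(self.units, nodes)):
            return None
        return self.build(nodes)


def is_word(node: str) -> bool:
    # Syntactic labels start with "*"; anything else is treated as a word.
    return not node.startswith("*")


# Assumed adjective-noun rule: the noun is the main node and receives the attribute.
adj_noun = Rule(
    name="adj_noun",
    units=[is_word, is_word],
    build=lambda n: [(n[1], "hasAttribute", n[0])],
)

print(adj_noun.apply(["conceptual", "scheme"]))
print(adj_noun.apply(["*homo_adj", "scheme"]))
```

In this sketch a rule that does not match returns `None`, so a caller can try the next rule in a rule set.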
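Table 1. Examples of rules for translating nodes of a syntactic tree into nodes of a semantic tree and the results of their application.
Columns: Initial data; Rule; Result.

The set of relations $R^{Sem}$ of the semantic tree comprises five types of relations: isA, partOf, associateWith, dependsOn and hasAttribute.

The Algorithm of Translating a Syntactic Tree into a Semantic Tree
The algorithm of translating a syntactic tree into a semantic tree consists of the following steps:
1. Go to the first level of the syntactic tree.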
2. Select the next node of the current tree level. If there are no unprocessed nodes, go to step 12.
3. If the node is marked as processed, go to step 2.
4. If the node is not a syntax label (does not start with "*"), go to step 10.
5. If the node is a syntax label (starts with "*") and has no child elements, go to step 10.
6. If the node is a syntax label (starts with "*") and none of its child nodes is a syntax label, go to step 10.
7. If there is a temporary parent node, replace it; otherwise create a temporary node.
8. If there is no connection between the nodes, create a temporary relationship between them and go to step 2.
9. If neither node is temporary and there is no connection between them, create an "associateWith" relationship between them and go to step 2.
10. Apply the translation rule.
11. Mark the nodes as processed and go to step 2.
12. Go to the next level of the syntactic tree and then go to step 2. If there is no next level, the algorithm terminates.
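The level-by-level control flow of the steps above can be sketched in Python. This is a simplified reading under stated assumptions: the sketch only decides, for each unprocessed node, whether a translation rule would be applied (steps 4 to 6 and 10) or a temporary node would be created (step 7), and it assumes that the word children of a handled node are consumed together with it. The names `SyntNode`, `classify` and `walk` are hypothetical, and no semantic relations are built.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SyntNode:
    text: str
    children: List["SyntNode"] = field(default_factory=list)
    processed: bool = False

    @property
    def is_label(self) -> bool:
        # Syntactic labels start with "*".
        return self.text.startswith("*")


def n(text: str, *children: SyntNode) -> SyntNode:
    return SyntNode(text, list(children))


def classify(node: SyntNode) -> str:
    # Steps 4-6: words, childless labels and labels over plain words go to a rule.
    if not node.is_label or not node.children:
        return "apply_rule"
    if all(not child.is_label for child in node.children):
        return "apply_rule"
    return "temporary_node"  # step 7


def walk(roots: List[SyntNode]) -> List[Tuple[int, str, str]]:
    trace = []
    level, depth = list(roots), 1
    while level:  # stop when there is no next level
        for node in level:
            if node.processed:  # step 3
                continue
            trace.append((depth, node.text, classify(node)))
            node.processed = True  # step 11
            for child in node.children:
                if not child.is_label:  # assumed: word children are consumed here
                    child.processed = True
        level = [child for node in level for child in node.children]  # step 12
        depth += 1
    return trace


sentence = [
    n("ontology"),
    n("*pg", n("in"), n("informatics")),
    n("is"),
    n("*genit_pair",
      n("*genit_pair", n("attempt"),
        n("*adj_noun", n("*homo_adj", n("comprehensive"), n("detailed")),
          n("formalization"))),
      n("field of knowledge")),
    n("*pg", n("with the help"), n("*adj_noun", n("conceptual"), n("scheme"))),
]
trace = walk(sentence)
for entry in trace:
    print(entry)
```

For the test sentence, the last node handled by this sketch is "*homo_adj" on the fourth level, so the fifth level contains nothing left to process.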

Example of the Algorithm of Translating a Syntactic Tree into a Semantic Tree
Let us consider an example of translating the syntactic tree of the test sentence presented above into a semantic tree. The following nodes of the syntactic tree (syntactic units) were identified in the first level of the syntactic tree of the test sentence (see Figure 1):
• ontology;
• *pg(in, informatics);
• is;
• *genit_pair(*genit_pair(attempt, *adj_noun(*homo_adj(comprehensive, detailed), formalization)), field of knowledge);
• *pg(with the help, *adj_noun(conceptual, scheme)).
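The bracket notation used for the syntactic units above can be read into a nested structure with a small recursive-descent parser. The sketch below is an illustration only; the function name `parse_unit` and the tuple representation `(label, children)` are assumptions and not part of the AOT tools.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, List["Tree"]]]


def parse_unit(text: str) -> Tree:
    """Parse e.g. '*pg(in, informatics)' into ('*pg', ['in', 'informatics'])."""
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        start = pos
        # Read a name up to the next structural character.
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        name = text[start:pos].strip()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children: List[Tree] = []
            while True:
                children.append(parse())
                while pos < len(text) and text[pos].isspace():
                    pos += 1
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
            return (name, children)
        return name

    tree = parse()
    if text[pos:].strip():
        raise ValueError("trailing characters after the unit")
    return tree


print(parse_unit("*pg(in, informatics)"))
print(parse_unit("*pg(with the help, *adj_noun(conceptual, scheme))"))
```

Words such as "with the help" are kept as single leaves because only parentheses and commas are treated as structural characters.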
Figure 2 shows the semantic tree of the test sentence at the beginning of the algorithm. Figure 3 shows the semantic tree of the test sentence at the first iteration of the algorithm. As can be seen from Figure 3, all syntactic units of the first level of the syntactic tree of the test sentence were processed. After applying the translation rules:
• the syntactic unit "ontology" was included in the semantic tree;
• from the syntactic unit "is", the relation "isA" was formed between the node "ontology" and the temporary node "*genit_pair(...)";
• from the syntactic unit "*pg(in, informatics)", the node "informatics" and the relation "dependsOn" between the nodes "informatics" and "ontology" were formed;
• from the syntactic unit "*genit_pair(*genit_pair(...), field of knowledge)", the temporary node "*genit_pair(...)" and the node "field of knowledge" were formed, connected by the relation "associateWith";
• from the syntactic unit "*pg(with the help, ...)", the temporary node "*adj_noun(conceptual, scheme)" and the relation "dependsOn" between that node and the temporary node "*genit_pair(...)" were formed.
All syntactic units of the first level and all syntactic units of the second level that are related to the syntactic units of the first level were marked as processed in the syntactic tree of the test sentence.
Figure 4 shows the semantic tree of the test sentence at the second iteration of the algorithm. As can be seen from Figure 4, all syntactic units of the second level of the syntactic tree of the test sentence that were not marked as processed were processed. After applying the translation rules:
• from the syntactic unit "*genit_pair(attempt, *adj_noun(*homo_adj(comprehensive, detailed), formalization))", the node "attempt" and the temporary node "*adj_noun(...)" were formed, connected by the relation "associateWith". In a genitive pair, the second node is the main node, so the existing relationships refer to the second node;
• from the syntactic unit "*adj_noun(conceptual, scheme)", the nodes "conceptual" and "scheme" and the relation "hasAttribute" between them were formed.
All syntactic units of the second level and all syntactic units of the third level that are related to the syntactic units of the second level were marked as processed in the syntactic tree of the test sentence.
Figure 5 shows the semantic tree of the test sentence at the third iteration of the algorithm. As can be seen from Figure 5, all syntactic units of the third level of the syntactic tree of the test sentence that were not marked as processed were processed. After applying the translation rules:
• from the syntactic unit "*adj_noun(*homo_adj(comprehensive, detailed), formalization)", the node "formalization" and the temporary node "*homo_adj(...)" were formed, connected by the relation "hasAttribute". In an adjective-noun pair, the noun is the main node, so the existing relationships refer to the noun;
• a relation "associateWith" was also created between the nodes "attempt" and "formalization".
All syntactic units of the third level and all syntactic units of the fourth level that are related to the syntactic units of the third level were marked as processed in the syntactic tree of the test sentence.
Figure 6 shows the semantic tree of the test sentence at the fourth iteration of the algorithm. As can be seen from Figure 6, all syntactic units of the fourth level of the syntactic tree of the test sentence that were not marked as processed were processed. After applying the translation rules, the nodes "comprehensive" and "detailed" of the semantic tree were formed from the syntactic unit "*homo_adj(comprehensive, detailed)"; each is connected by the relation "hasAttribute" with the node "formalization".
All syntactic units of the fourth level and all syntactic units of the fifth level that are related to the syntactic units of the fourth level were marked as processed in the syntactic tree of the test sentence. At the fifth iteration of the algorithm, the process of building the semantic tree of the test sentence is complete. The resulting semantic tree for the test fragment is shown in Figure 6. The resulting semantic tree can be merged with other semantic trees in a text resource. In addition, this semantic tree can be merged with the domain ontology created by the expert.
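One simple way to picture the merging of semantic trees is to store each tree as a set of (subject, relation, object) triples and to take their union. The sketch below assumes that nodes with the same label denote the same concept, which the text does not state; the function names and the second tree are hypothetical, while the triples of the first tree follow the example above.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def merge_trees(*trees: Iterable[Triple]) -> Set[Triple]:
    """Union of triples; identical triples from different trees collapse into one."""
    merged: Set[Triple] = set()
    for tree in trees:
        merged |= set(tree)
    return merged


def nodes(tree: Iterable[Triple]) -> Set[str]:
    return {name for subj, _, obj in tree for name in (subj, obj)}


# Part of the semantic tree of the test sentence, as described in the example.
tree_a = {
    ("formalization", "hasAttribute", "comprehensive"),
    ("formalization", "hasAttribute", "detailed"),
    ("attempt", "associateWith", "formalization"),
    ("scheme", "hasAttribute", "conceptual"),
}
# Hypothetical tree from another sentence, sharing the node "formalization".
tree_b = {
    ("formalization", "hasAttribute", "detailed"),
    ("formalization", "dependsOn", "language"),
}

merged = merge_trees(tree_a, tree_b)
print(len(merged), sorted(nodes(merged)))
```

A real merge would also need to reconcile synonyms and conflicting relations, which this sketch does not attempt.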

The Architecture of the Pipeline of Translating a Syntactic Tree into a Semantic tree
In our work, the results of the AOT project were used for constructing a syntactic tree. The AOT project was started in Moscow at the Dialing company. Its Russian website is www.aot.ru [24]. The AOT Russian Linguistic Environment (AOT RML) is distributed under the GNU Library General Public License (LGPL), which makes it possible to use the AOT RML freely for solving tasks of automated text processing in NL. The AOT RML was written by Alexey Sokirko, Igor Nozhov, Lev Gershenzon, Andrey Putrin and many other people.
The source code of the AOT RML is available at https://sourceforge.net/p/seman/svn/HEAD/tree/trunk/. The installation and configuration manual is available at https://sourceforge.net/p/seman/svn/HEAD/tree/trunk/readme.
Our system is split into two parts:
• The AOT RML Environment is a third-party tool that is used for graphematic, morphological and syntactic analysis.
• The Semantic Environment is used for semantic analysis.
The following pipeline of linguistic methods is used to build an ontology based on the analysis of the contents of text resources [21,24]:
1. Graphematic analysis is the initial analysis of the text in NL. It presents the input text data in a format convenient for further analysis (separation of the input text into words, delimiters, etc.).
2. Morphological analysis is used to construct a morphological interpretation of the words of the input text. The morphological dictionary of the AOT RML for Russian is based on the grammatical dictionary by A. A. Zaliznyak and includes 161 thousand lemmas. In the process of lemmatization, a set of morphological interpretations is generated for each word of the input text: the lemma, the morphological part of speech, and a set of elementary morphological descriptors relating the word form to some morphological class.
3. Syntactic analysis is used to construct a syntactic tree from the extracted syntactic groups of simple sentences. Simple sentences can be part of complex ones.
As a result of the algorithm, a set of semantic trees is formed. The resulting semantic trees thus form the initial (raw, unprocessed) version of the domain ontology. The resulting domain ontology is saved as an OWL file.
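The first stage of the pipeline, graphematic analysis, separates the input text into words and delimiters. The sketch below shows this idea with Python's standard `re` module; it is a stand-in for illustration and does not reproduce the behaviour of the AOT RML module. The function name and the "word"/"delimiter" tags are assumptions.

```python
import re
from typing import List, Tuple

# A run of word characters is one token; any other non-space character is a delimiter.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def graphematic_analysis(text: str) -> List[Tuple[str, str]]:
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        is_word = token[0].isalnum() or token[0] == "_"
        tokens.append(("word" if is_word else "delimiter", token))
    return tokens


for kind, token in graphematic_analysis("Онтология в информатике - это попытка."):
    print(kind, token)
```

Because Python 3 patterns on `str` are Unicode-aware by default, `\w` matches Cyrillic letters without extra flags.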

Conclusions
We have described a modular pipeline that can be used for translating a syntactic tree of a sentence into a semantic tree. This approach can be used to build a domain ontology automatically. Building an ontology manually is a long and complex process. The main drawback of domain ontologies is that they must be developed further and updated whenever the PrA changes. The idea of our approach is to use existing methods of linguistic analysis to construct a syntactic tree of a sentence. Then, using a set of rules, the syntactic tree can be translated into a semantic tree. A semantic representation of a text in natural language (NL) is the most complete representation that can be achieved by linguistic methods alone. The domain ontology can be built from the semantic trees extracted from the content of text resources. We have also described the algorithm of translating a syntactic tree into a semantic tree. An example of translating the syntactic tree of the test sentence into a semantic tree is considered in detail.
In our work, the results of the AOT project were used for constructing a syntactic tree. Our system is split into two parts: the AOT RML Environment and the Semantic Environment. The AOT RML Environment is a third-party tool that is used for graphematic, morphological and syntactic analysis. The Semantic Environment is used for semantic analysis. The pipeline of linguistic methods is used to build an ontology based on the analysis of the contents of text resources. The pipeline includes blocks of graphematic, morphological, syntactic and semantic analysis. As a result of the algorithm, a set of semantic trees is formed. The resulting semantic trees thus form the initial (raw, unprocessed) version of the domain ontology. The resulting domain ontology is saved in an OWL file. The generated OWL file can be modified by an expert.

Future Work
In future work we plan to use methods of deep learning to translate the syntactic tree of a sentence into a semantic tree. A comparison of the two approaches to the problem of automatically building a domain ontology will allow us to understand when the semantic approach should be used and when the methods of deep learning should be used.
Also, we plan to extend the set of rules for translating the syntactic tree into a semantic tree to cover a greater number of types of semantic relationships between objects of PrA.
In addition, we plan to develop an algorithm for evaluating the quality of the resulting ontology.
There are various approaches to the automatic generation of ontologies based on the analysis of the contents of wiki-resources.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 1 February 2018 | doi:10.20944/preprints201802.0001.v1
© 2018 by the author(s). Distributed under a Creative Commons CC BY license.

Figure 1. Example of a syntactic tree of the test sentence.

Figure 2. Example of a semantic tree of the test sentence at the beginning of the algorithm.

Figure 3. Example of a semantic tree of the test sentence at the first iteration of the algorithm.

Figure 4. Example of a semantic tree of the test sentence at the second iteration of the algorithm.

Figure 5. Example of a semantic tree of the test sentence at the third iteration of the algorithm.

Figure 6. Example of a semantic tree of the test sentence at the fourth iteration of the algorithm.

Figure 7 shows the architecture of the pipeline for translating a syntactic tree into a semantic tree.

Figure 7. The architecture of the pipeline for translating a syntactic tree into a semantic tree.

Formally, $R^{Sem} = \{R^{Sem}_{isA}, R^{Sem}_{partOf}, R^{Sem}_{associateWith}, R^{Sem}_{dependsOn}, R^{Sem}_{hasAttribute}\}$, where
• $R^{Sem}_{isA}$ is the set of transitive relations of hyponymy;
• $R^{Sem}_{partOf}$ is the set of transitive "part/whole" relations;
• $R^{Sem}_{associateWith}$ is the set of symmetric relations of association;
• $R^{Sem}_{dependsOn}$ is the set of asymmetric relations of associative dependence;
• $R^{Sem}_{hasAttribute}$ is the set of asymmetric relations describing the attributes of nodes.
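The stated properties of the relations (transitivity of isA and partOf, symmetry of associateWith) determine which additional relations can be inferred from a semantic tree. The following Python sketch computes such a closure over a set of triples; it is an illustration under the assumption that relations are stored as (subject, relation, object) triples, and the first two example triples are hypothetical rather than taken from the figures.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

TRANSITIVE = ("isA", "partOf")
SYMMETRIC = ("associateWith",)


def close_relations(triples: Iterable[Triple]) -> Set[Triple]:
    closed = set(triples)
    # Symmetric relations: add the reversed triple.
    closed |= {(o, r, s) for s, r, o in closed if r in SYMMETRIC}
    # Transitive relations: repeat until no new triple can be derived.
    changed = True
    while changed:
        changed = False
        for rel in TRANSITIVE:
            pairs = [(s, o) for s, r, o in closed if r == rel]
            for a, b in pairs:
                for c, d in pairs:
                    if b == c and (a, rel, d) not in closed:
                        closed.add((a, rel, d))
                        changed = True
    return closed


example = {
    ("domain ontology", "isA", "ontology"),       # hypothetical
    ("ontology", "isA", "knowledge model"),       # hypothetical
    ("attempt", "associateWith", "formalization"),
    ("informatics", "dependsOn", "ontology"),
}
closed = close_relations(example)
print(sorted(closed - example))
```

The asymmetric relations dependsOn and hasAttribute are left untouched, since no inference follows from asymmetry alone.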

4. Semantic analysis is used for building the semantic structure of one sentence in Russian. The semantic structure consists of semantic nodes and semantic relations. A set of rules is used for semantic analysis. Each rule uses the results of the syntactic analysis.