An Approach to Determining Software Projects with Similar Functionality and Architecture Process Based on Artificial Intelligence Methods

Software engineers all over the world independently solve many similar problems. Under these conditions, the reuse of code or, even better, of architecture becomes a pressing issue. In this paper a two-phase approach to determining the functional and structural similarity of software projects is proposed. The approach combines two artificial intelligence methods: natural language processing techniques and a novel method for comparing software projects based on an ontological representation of their architecture, obtained automatically from the projects' source code. Additionally, several metrics are proposed to estimate the similarity between projects.


Introduction
It is well known that human resources are the most valuable asset in modern software development. Nevertheless, the same tasks are often solved multiple times by independent engineers, which leads to an inefficiently organized software development process. There are approaches for reusing source code at various stages of development. Basically, these approaches allow only certain functions and classes to be reused and do not reveal the similarity of projects on the basis of their subject area. Knowledge of architecture gained from already implemented projects in the same subject area could allow much larger parts of projects to be borrowed and reused, and conceptually incorrect solutions to be avoided in the future. Quite often, when the project's development team changes, the implemented software solutions are forgotten and not reused.
A tool capable of determining similarities between projects can be very useful in software development. Such a tool cannot be replaced by a version control system, because it has a completely different purpose. A version control system stores all versions of a project with comments and makes it possible to compare file versions, whereas the proposed tool could store, explain and compare the structure of projects. Such an instrument could work not only with projects from a single organization, but also with projects from open repositories owned by authors from all over the world.
Search in open repositories is carried out on the basis of keywords. Such a search can return several thousand projects, which cannot be processed by hand. Choosing a set of projects on the basis of project subject matter and internal architectural solutions is a promising approach. The project subject matter can be obtained by analyzing the project description, which is usually given in the "readme.txt" file in the version control system repository, and by scanning the issue forums of the project. This task requires natural language processing (NLP), which is widely used in artificial intelligence and data science, and there are many tools for performing NLP procedures in such popular programming languages as Python, R, Java and C#. An interesting approach to forum analysis can be found in [1]. The state-of-the-art technique in this area is word2vec models of natural language [2,3]. Nowadays many pre-trained word2vec models are available [4,5], so the only thing left to do is to additionally tune a word2vec model for the specific task.
Further, to compare projects on the basis of their structure, a tool for extracting architectural concepts is needed. Commonly, the software project architecture is built at the design stage, which precedes the development stage. The UML language was developed to describe the project architecture at the required level of abstraction. Based on the results of our previous research [6], it can be concluded that developers use many different types of structural elements. That is why an ontology, as a knowledge storage system, can serve well as a reference for the project analysis tool. Attempts to integrate ontologies into software development have been made at different levels: technical documents [7][8][9][10][11], maintenance and testing of the source code [12], and UML diagrams [13][14][15].
The minimal structural elements of UML diagrams, such as classes, interfaces and objects, by themselves weakly convey the semantics and architectural solutions of the project, but combinations of such elements describe the architecture much better. Stable combinations of structural elements are known as design patterns; this term has existed in information technology for a long time but is still relevant. Design patterns are actively used by the developer community and thus represent a reliable benchmark in software project analysis. In addition, it makes sense to create local design patterns that solve a specific task in a given subject area. A design pattern based on a specific subject area loses its main advantage, universality, but its greater semantic weight becomes a more important characteristic when constructing a tool for searching for and measuring similarities between projects.
There are many works devoted to the integration of software development with ontologies. There is also a complete approach to development built around the domain, known as subject-area-based development [16][17][18].
The rest of this paper is organized as follows. In Section 2 the detailed problem formulation is presented. Section 3 is dedicated to the preliminary filtering of software projects. Then, in Section 4, the technique for constructing the software design ontology is presented step by step. The experimental results are discussed in Section 5. Section 6 concludes the paper.

Formulation of the problem
The results presented in this paper are based on the work described in [6]. The system described in that research made it possible to extract information from conceptual models and save it as an ontology of a certain format. However, the life cycle and development practices of IT companies show that conceptual models are mostly created once at the beginning of the project and are only rarely updated at the beginning of each project stage.
The best description of the project state is its source code. Developers try to keep the source code in good condition, create documentation, provide comments and refactor the code.
Another advantage of using source code as a source of project state information is the widespread use of version control systems. Tracking all versions of software products makes it possible to manage the software development process effectively and generates a huge amount of information available for analysis.
Comparing information obtained from the conceptual models of a new project at the design stage with information obtained from the source code of already implemented projects could make it possible to determine the structural similarity of projects.
In order to analyze and measure the structural similarity of projects, it is necessary to transform information about projects from different sources into a single format. The most convenient way to present the extracted information is as an ontology in the OWL format. The OWL ontology format makes it possible to preserve the semantics of complex architectural solutions, to modify already existing data and to perform logical operations on statements.
The search for structural similarity is one part of the project comparison method. The other part is the filtering of projects based on subject areas. Since information about software projects is collected from open sources and there is no common tag system or commonly used classifier, the only way to group projects is to perform a clustering procedure. This part is performed using a combination of the approaches suggested in [19,20] for short text clustering based on a semantic similarity measure obtained from a pre-trained word2vec model.
If a comparison is made for projects of the same enterprise in the same subject area, then the comparison should be performed at the level of the processes and the components of the subject area.
When the comparison is carried out among projects hosted in an open repository, the structural similarity of the projects can be a more important metric than the specific subject area; that is why NLP-based filtering is used only for the preliminary selection of possibly relevant software projects.
Under these conditions the precision of the filtering method is not very important, but a convenient presentation of the result is vital for further expert analysis. The suggested approach is explained in detail in the next section.

Fuzzy hierarchical classifier construction
In order to construct the fuzzy hierarchical classifier, the approach from [20] was adopted. This algorithm takes as input a set of sentences (short text fragments); for this task, project descriptions, forum discussions of the project, sets of comments from the source code, etc. can be taken as sentences.
For the experiment we selected projects by a set of keywords ("api", "java", "mobile", "sdk") from source code repositories such as GitHub, GitLab and others. In total there were 490 projects in the input dataset, and the general project descriptions were taken as sentences. Since it is not possible to show all the input data, the most demonstrative samples, translated into English, are given in Table 1. Some of them are obviously appropriate and will be searched for design patterns after the clustering procedure, while the others are obviously not appropriate and will be moved to different clusters and thus filtered out.

Feature construction
The obtained classifier can be treated as a data source for constructing feature vectors. The exact algorithm is described in [19]. The function described below was used to transform sentences into vector form,
where SL is the lemmatized set of sentences, R is the membership degree of a sentence in the current vertex of the fuzzy classifier,
and HV is the obtained classifier. The parameter settings are σ = 1 and µ = 2.4, chosen so that the global maximum of the function corresponds to 4 repetitions of a word in the dataset. The next step is to find the most appropriate groups of similar software projects.

Projects clustering
To perform the clustering procedure, the HDBSCAN [21] algorithm was chosen. This clustering algorithm, combined with features constructed from the hierarchical classifier, was chosen for its ability to return quite accurate and pure clusters, mainly at the expense of precision.
The precision, accuracy and purity metrics are defined for each class as follows, where $CM = [cm_{i,j}]$ with $cm_{i,j} = (\omega_i \cap c_j)$, $\omega_i$ is the class with number $i$ and $c_j$ is the cluster with number $j$.
General precision, accuracy and purity are calculated as average values.
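The averaging step can be illustrated with a short Python sketch. Since the per-class formulas are not reproduced in this text, the sketch assumes that each entry of CM counts the samples shared by a class and a cluster, and it uses the textbook definition of cluster purity; both are assumptions and may differ from the definitions used in [19]. The function names and the matrix values are hypothetical.

```python
def cluster_purities(cm):
    """Purity of each cluster: share of its samples that belong to its dominant class."""
    n_clusters = len(cm[0])
    purities = []
    for j in range(n_clusters):
        column = [row[j] for row in cm]  # counts of every class inside cluster j
        total = sum(column)
        purities.append(max(column) / total if total else 0.0)
    return purities


def average(values):
    """General (averaged) value of a per-cluster metric."""
    return sum(values) / len(values)


# Hypothetical matrix: cm[i][j] = number of samples of class i placed in cluster j.
cm = [
    [8, 0, 1],
    [2, 5, 0],
    [0, 0, 4],
]
purities = cluster_purities(cm)
general_purity = average(purities)
```

A cluster that mixes several expert classes lowers the average, which is why homogeneous clusters score close to 1.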
This method has shown quite good performance on a similar task, which is discussed in [19].
During the clustering process 15 clusters were determined; the main clusters with general descriptions are presented in Figure 2. The HDBSCAN algorithm was run with the parameter min_cluster_size = 5. As a true clustering algorithm, HDBSCAN always forms a 'noise cluster' with number -1, to which all samples that could not be grouped are moved.

Figure 2. Main clusters and their shares of the dataset: Noise cluster (#-1) 31%; VK APIs (#1) 32%; VK Bots (#2) 13%; VK Players (#3) 9%; VK Crawlers (#4) 5%; Other 10%.

In our example the 'noise cluster' is quite big; the main reasons are overly poor project descriptions and high-dimensional clustering features. A more detailed discussion can be found in [19]. Future studies will be dedicated to solving this problem.
Clustering labels for the previously shown sentences can be found in Table 2. As can be seen, the projects are grouped quite accurately. The quality metrics calculated on the basis of expert evaluation are as follows: precision = 0.7, accuracy = 0.98, purity = 0.98. A high level of accuracy and purity is important in cases where expert evaluation follows the clustering process and homogeneous clusters are preferred to precise ones. In our case, clustering was performed to facilitate the filtering of software projects, but the final decision rests with the expert. The strategy for selecting software projects for the next phase can vary. The first strategy is to choose one example from each cluster in order to obtain a good variety of possible architectures. The opposite strategy is to choose a few entire clusters when some clusters generally satisfy the business purposes of the new software project. In the experimental part of this paper the second strategy was chosen: sentences from clusters 1, 2 and 3 were selected for further processing.
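The two selection strategies can be sketched in Python as follows. This is only an illustration: the function names, the label list and the choice of the first member as the representative of a cluster are assumptions, since the text leaves the choice of the example to the expert.

```python
from collections import defaultdict


def group_by_cluster(labels, noise_label=-1):
    """Map each cluster label to the indices of its projects, skipping the noise cluster."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        if label != noise_label:
            clusters[label].append(index)
    return dict(clusters)


def one_per_cluster(clusters):
    """Strategy 1: one project per cluster (here simply the first one) for variety."""
    return [members[0] for _, members in sorted(clusters.items())]


def whole_clusters(clusters, chosen):
    """Strategy 2: every project from the clusters picked by the expert."""
    return [index for label in chosen for index in clusters.get(label, [])]


labels = [1, 1, 2, -1, 3, 2, 4, 1]  # hypothetical cluster labels, -1 marks noise
clusters = group_by_cluster(labels)
variety = one_per_cluster(clusters)
focused = whole_clusters(clusters, chosen=[1, 2, 3])
```

With the second strategy, as used in the experiment, the expert only names the clusters; every project inside them moves on to the structural analysis.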

Software design ontology
If a class diagram for the software was built during the design stage, structural analysis can be performed.
To complete this task, an ontology design approach was used; it is described below.

UML meta-model based ontology
The OWL ontology format was chosen as the target for storing knowledge from UML class diagrams. OWL was chosen because it is the most expressive format for representing knowledge in complex subject areas. The class diagram elements should be translated into the ontology as concepts, taking their semantics into consideration. The semantics of the whole diagram is formed from the semantics of the diagram elements and the semantics of their relations. That is why the ontology was built on the basis of the UML meta-scheme, and not as a formal set of translated elements.
To solve the problem of intelligent analysis of the project diagrams that were included in the project documentation, it is necessary to have knowledge about the construction of formalized diagrams.
The ontology contains concepts that describe the most basic elements of the class diagram, but it can be expanded if necessary. During translation of the UML meta-scheme the following notation was applied. Formally, the ontology of project diagrams is represented as a set of three components, $\{C^{prj}, R^{prj}, F^{prj}\}$, where $C^{prj} = \{c^{prj}_1, \ldots, c^{prj}_i\}$ is the set of concepts that define the main UML diagram elements, such as "Class", "Object", "Interface", "Relationship" and others; $R^{prj}$ is the set of connections between ontology concepts. These relationships make it possible to describe the rules of UML notation correctly.
$F^{prj}$ is the set of interpretation functions defined on the relationships $R^{prj}$.
For a design pattern element that is a class, the class name is written after the design pattern name; for an element that is a relationship, the names of the connected elements are written with an underscore between them. One of the most commonly used design patterns is the Builder [22].
Builder is a creational design pattern. It separates the algorithm for the step-by-step construction of a complex object from its external representation, so that the same algorithm can produce different representations of the object.
In order to preserve this design pattern in the developed ontology, the following individuals belonging to relevant concepts are required.
Ontological representation of the design pattern:
In fact, the ontological representation of a single design pattern is a set of individuals of concepts and relations from the ontology of project diagrams.
To calculate the structural similarity of projects based on the developed ontology, the following evaluation functions were proposed. The first metric gives priority to the single most strongly expressed design pattern in both diagrams, where $dc_\gamma$ and $dc_\delta$ are the project class diagrams represented as ABox expressions of the UML meta-model ontology, and $\mu_{dc_\gamma, dc_\delta}(tmp)$ is the measure of expression of the design pattern in the project diagram.
The second metric considers the coincidence of all design patterns in equal proportions and does not consider design patterns with a measure of expression less than 0.3, where $N$ is the number of design patterns with a measure of expression greater than 0.3 for both projects.
The third metric works in the same way as the second one, but the contribution of a design pattern to the evaluation depends on the number of elements in it (a design pattern with 20 elements counts for more than a design pattern with 5 elements), where $\tilde{\mu}_{dc_\gamma \cap dc_\delta}$ is the weighted measure of expression.
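The behaviour of the three metrics can be illustrated with a Python sketch. The exact formulas are not reproduced in this text, so the sketch is one plausible reading and not the authors' definition: it assumes that the joint measure of expression of a pattern in two projects is the minimum of the two individual degrees (a fuzzy intersection), that the second metric is a plain mean, and that the third metric weights each pattern by its number of elements. All names and numbers are hypothetical.

```python
THRESHOLD = 0.3  # cut-off for the measure of expression stated in the text


def shared_expression(expr_a, expr_b):
    """Assumed joint measure: minimum of the two degrees for every common pattern."""
    return {p: min(expr_a[p], expr_b[p]) for p in expr_a.keys() & expr_b.keys()}


def metric_max(expr_a, expr_b):
    """First metric: the single most strongly expressed shared pattern."""
    return max(shared_expression(expr_a, expr_b).values(), default=0.0)


def metric_mean(expr_a, expr_b):
    """Second metric: equal-weight mean over patterns above the threshold in both projects."""
    kept = [v for v in shared_expression(expr_a, expr_b).values() if v > THRESHOLD]
    return sum(kept) / len(kept) if kept else 0.0


def metric_weighted(expr_a, expr_b, sizes):
    """Third metric: like the second, but weighted by the number of pattern elements."""
    shared = shared_expression(expr_a, expr_b)
    kept = {p: v for p, v in shared.items() if v > THRESHOLD}
    total = sum(sizes[p] for p in kept)
    return sum(sizes[p] * v for p, v in kept.items()) / total if total else 0.0


project_a = {"Builder": 0.9, "Singleton": 0.2, "Observer": 0.6}
project_b = {"Builder": 0.5, "Singleton": 1.0, "Observer": 0.8}
sizes = {"Builder": 20, "Singleton": 5, "Observer": 10}  # elements per pattern

m1 = metric_max(project_a, project_b)
m2 = metric_mean(project_a, project_b)
m3 = metric_weighted(project_a, project_b, sizes)
```

In this toy data the larger Builder pattern pulls the weighted value towards its own degree, which is the effect described for the third metric.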

Searching design patterns in projects
To determine the measure of similarity between two projects, it is necessary to calculate the expression degree of each design pattern in each project. The expression measure of a design pattern in a project can be calculated by mapping the project ontology ABox onto the design pattern ontology ABox. Table 3 contains the expression degree of each design pattern in each project.
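A minimal Python sketch of such a mapping is given below. The mapping procedure itself is not detailed here, so the sketch makes its own assumptions: an ABox is reduced to a dictionary of individuals with their concepts plus a set of (subject, relation, object) triples, and the expression degree is taken as the largest fraction of pattern triples that can be reproduced in the project under a one-to-one, concept-preserving assignment of individuals. The simplified Builder pattern and the project data are hypothetical.

```python
def expression_degree(pattern_types, pattern_triples, project_types, project_triples):
    """Best fraction of pattern triples found in the project (brute-force search)."""
    individuals = list(pattern_types)
    best = 0

    def search(position, mapping, used):
        nonlocal best
        if position == len(individuals):
            # Count pattern statements whose mapped image exists in the project.
            matched = sum(
                (mapping.get(s), p, mapping.get(o)) in project_triples
                for s, p, o in pattern_triples
            )
            best = max(best, matched)
            return
        current = individuals[position]
        # Option 1: leave this pattern individual unmapped.
        search(position + 1, mapping, used)
        # Option 2: map it to an unused project individual of the same concept.
        for candidate, concept in project_types.items():
            if concept == pattern_types[current] and candidate not in used:
                mapping[current] = candidate
                used.add(candidate)
                search(position + 1, mapping, used)
                used.discard(candidate)
                del mapping[current]

    search(0, {}, set())
    return best / len(pattern_triples) if pattern_triples else 0.0


# Simplified, hypothetical Builder pattern ABox.
pattern_types = {"Director": "Class", "Builder": "Interface",
                 "ConcreteBuilder": "Class", "Product": "Class"}
pattern_triples = {("Director", "association", "Builder"),
                   ("ConcreteBuilder", "realization", "Builder"),
                   ("ConcreteBuilder", "dependency", "Product")}

# Hypothetical project ABox extracted from source code.
project_types = {"MealPlanner": "Class", "MealBuilder": "Interface",
                 "VeganMealBuilder": "Class", "Meal": "Class", "Logger": "Class"}
project_triples = {("MealPlanner", "association", "MealBuilder"),
                   ("VeganMealBuilder", "realization", "MealBuilder"),
                   ("MealPlanner", "dependency", "Logger")}

degree = expression_degree(pattern_types, pattern_triples, project_types, project_triples)
```

Here two of the three pattern statements can be matched, so the pattern is only partially expressed. The exhaustive search is exponential and is meant for illustration on small patterns only.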

Results of searching structurally similar software projects by different metrics
These estimates are normalized from 0 to 1. For the first metric the estimates are always equal to 1. This is easily explained, because the first metric chooses the most expressed design pattern in

Conclusions
In this paper a two-phase artificial intelligence approach to determining the functional and structural similarity of software projects is presented. NLP analysis and ontology construction make it possible to find and investigate projects with similar purposes and architecture. In the experimental part the proposed method was applied to different projects, and several metrics were proposed to measure the similarity between projects. Moreover, the work presented in this paper has great potential for further research. The number of projects could be expanded. New design patterns could be included in consideration. Ontologies obtained at the intermediate stages could be used separately in the Protege editor. The results of this research belong to the field of artificial intelligence and can be used to create intelligent systems. Expanding the system with ontologies of subject areas can significantly increase the relevance of the selection of similar projects.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 January 2018 | doi:10.20944/preprints201801.0290.v1

Each sentence was tokenized and lemmatized. The resulting terms were organized in a fuzzy graph based on a semantic similarity measure obtained from a pre-trained word2vec model. Finally, with the help of a hierarchical fuzzy graph clustering algorithm, a fuzzy hierarchical classifier was obtained (see Figure 1). Only two pieces of the hierarchy are presented in this figure, for reasons of space and legibility. These two sub-hierarchies show two semantically related groups of words, the first for programming and APIs and the second for music. The clustering results show that these two groups lead to the formation of clusters dedicated to VK APIs and VK Players, respectively.
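The construction of the fuzzy graph of terms can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three-dimensional toy vectors stand in for a pre-trained word2vec model, cosine similarity is assumed as the semantic similarity measure, and the `min_degree` cut-off is hypothetical. Tokenization, lemmatization and the hierarchical fuzzy graph clustering from [20] are not reproduced.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuzzy_term_graph(vectors, min_degree=0.5):
    """Return {(term_a, term_b): membership degree} for sufficiently similar term pairs."""
    graph = {}
    for a, b in combinations(sorted(vectors), 2):
        degree = cosine(vectors[a], vectors[b])
        if degree >= min_degree:
            graph[(a, b)] = degree
    return graph


# Toy embeddings (hypothetical values) in place of a pre-trained model.
vectors = {
    "api": (0.9, 0.1, 0.0),
    "sdk": (0.8, 0.2, 0.1),
    "music": (0.0, 0.1, 0.9),
    "player": (0.1, 0.2, 0.8),
}
graph = fuzzy_term_graph(vectors)
```

With these toy vectors only the programming pair and the music pair remain connected, mirroring the two groups of words described above.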

Figure 1. Extract from the hierarchical classifier of software project terms

4.2. Design patterns as structural parts of software projects
Design patterns are inserted into the ontology as a set of individuals based on the ontology concepts described above. Semantic constraints and properties of design patterns are specified by the ObjectProperties and DatatypeProperties of the OWL ontology. Since many design patterns are stored in the ontology at the same time, it is necessary to define a naming convention for their elements to avoid duplicate names. The name of a design pattern element begins with the design pattern name.
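The naming convention can be illustrated with a small Python sketch. It follows the rule given in this paper (the design pattern name first, then the class name, or for a relationship the names of the connected elements joined by an underscore). The underscore between the pattern name and the rest of the identifier, the helper names and the sample elements are assumptions made for illustration.

```python
def class_element_name(pattern, class_name):
    """Name of a class element: pattern name followed by the class name."""
    return f"{pattern}_{class_name}"


def relationship_element_name(pattern, source, target):
    """Name of a relationship element: pattern name, then both connected elements."""
    return f"{pattern}_{source}_{target}"


names = [
    class_element_name("Builder", "Director"),
    class_element_name("Builder", "Product"),
    class_element_name("Observer", "Subject"),
    relationship_element_name("Builder", "Director", "Product"),
]

# Prefixing with the pattern name keeps elements of different patterns distinct.
all_unique = len(names) == len(set(names))
```

Because every identifier carries its pattern name, two patterns may both contain, for example, a class called Product without clashing in the ontology.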

Table 1. A set of sentences for software projects preliminary filtering
Project | Description
(name not recovered) | Java library for VK API interaction, includes OAuth 2.0 authorization and API methods. Full VK API features documentation can be found here. This library has been created using the VK API JSON Schema. It can be found here. It uses VK API version 5.69.
dewarder/HoldingButton | Button which is visible while user holds it. Main use case is controlling audio recording state (like in Telegram, Viber, VK).
korobitsyn/VKOpenRobot | VK Open Bot is a library for bot creation for VK social network. Main features: mass friends collection, mass group searching and aggregation, user detailed information, user status detection.
gleb-kosteiko/vkb | Script allows you to automate the searching and participation in random reposts competitions in vk.com. Check friends walls for competition posts and repost these posts (also joined all needed communities and added all needed users to friends). Do reposts for simulation of the real user behavior. Search competition posts in VK.
(name not recovered) | Small Java API used for work with VK. Example of using is in VK Example.java. Authorization. Counters of new messages, friends, answers and groups. Total count of friends. Loading basic data about friends: id, name, photo. Friends' status. Send private message to chat or user. Load list of dialogs. Load list of groups.
akveo/cordova-vk | You can use this plugin to authenticate user via VK application rather than via webview. It makes use of official VkSDKs for iOS and Android. This project is based on another github project https://github.com/DrMoriarty/cordova-social-vk . But Api was made a bit more generic to fit our needs.

Table 2. Sample of clustering results