ARTICLE | doi:10.20944/preprints202209.0309.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning Keywords: machine learning; natural language processing; commit messages; change prediction model
Online: 20 September 2022 (14:52:49 CEST)
Version Control and Source Code Management Systems, such as GitHub, contain large amount ofunstructured historical information of software projects. Recent studies have introduced Natural Language Processing (NLP) to help software engineers retrieve information from very large collection of unstructured data. In this study, we have extended our previous study by increasing our datasets and ML and clustering techniques. Method: We have followed a complex methodology made up of various steps. Starting from the raw commit messages we have employed NLP techniques to build a structured database. We have extracted their main features and used as input of different clustering algorithms. Once labelled each entry, we have applied supervised machine learning techniques to build a prediction and classification model. Results: We have developed a machine learning-based model to automatically classify commit messages of a software project. Our model exploits a ground-truth dataset which includes commit messages obtained from various GitHub projects belonging to the HEP context. Conclusions: The contribution of this paper is two-fold: it proposes a ground-truth database; it provides a machine learning prediction model. They automatically identify the more change-proneness areas of code. Our model has obtained a very high average precision, recall and F1-score.