ARTICLE | doi:10.20944/preprints202211.0005.v1
Subject: Medicine And Pharmacology, Epidemiology And Infectious Diseases
Keywords: Epidemics; Twitter; Natural Language Processing; Topic Modelling; Sentiment Analysis; ARI; Cholera; Ebola; HIV/AIDS; Influenza; Malaria; Spanish influenza; Swine flu; Tuberculosis; Typhus; Yellow fever; Zika
Online: 1 November 2022 (01:17:14 CET)
At the end of 2019, while the world was being hit by the COVID-19 pandemic and was consequently living through a global health crisis, many other epidemics were also endangering humankind. Social media play a role of paramount importance in these contexts, since they help health systems cope with emergencies by supporting activities such as the identification of public concerns, the detection of infection symptoms, and the tracing of virus diffusion. In this paper, we have analyzed comments on events related to cholera, ebola, HIV/AIDS, influenza, malaria, Spanish influenza, swine flu, tuberculosis, typhus, yellow fever, and zika, collecting 369,472 tweets from the 3rd of March to the 15th of September, 2022. Our analysis started with the collection of comments composed of unstructured texts, to which we applied natural language processing solutions. Afterward, we employed topic modelling and sentiment analysis techniques to obtain a picture of people's concerns and attitudes toward these diseases. According to our findings, people's discussions were mostly about malaria, influenza, and tuberculosis, and the focus was on the diseases themselves. As regards emotions, the most prevalent were fear, trust, and disgust, with trust mainly associated with HIV/AIDS tweets.
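The emotion-detection step described in this abstract can be sketched as a lexicon-based counter, in the spirit of emotion lexicons such as NRC EmoLex. The tiny lexicon and sample tweets below are purely illustrative assumptions, not the resources or data used in the paper:

```python
import re
from collections import Counter

# Toy emotion lexicon (illustrative subset, NOT the real lexicon
# used in the study): maps a word to the emotions it evokes.
EMOTION_LEXICON = {
    "outbreak": ["fear"],
    "deadly": ["fear"],
    "vaccine": ["trust"],
    "treatment": ["trust"],
    "dirty": ["disgust"],
}

def emotion_profile(tweets):
    """Count emotion-bearing words across a collection of tweets."""
    counts = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-z']+", tweet.lower()):
            for emotion in EMOTION_LEXICON.get(token, []):
                counts[emotion] += 1
    return counts

tweets = [
    "New malaria outbreak reported, deadly season ahead",
    "HIV vaccine trial shows promise, treatment access improving",
]
print(emotion_profile(tweets))  # Counter({'fear': 2, 'trust': 2})
```

In practice each tweet's profile would be aggregated per disease, which is how a dominant emotion (e.g. trust for HIV/AIDS tweets) can be attributed to a topic.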
ARTICLE | doi:10.20944/preprints202209.0309.v1
Subject: Computer Science And Mathematics, Artificial Intelligence And Machine Learning
Keywords: machine learning; natural language processing; commit messages; change prediction model
Online: 20 September 2022 (14:52:49 CEST)
Version Control and Source Code Management Systems, such as GitHub, contain a large amount of unstructured historical information about software projects. Recent studies have introduced Natural Language Processing (NLP) to help software engineers retrieve information from very large collections of unstructured data. In this study, we have extended our previous work by enlarging our datasets and broadening the set of ML and clustering techniques. Method: We have followed a multi-step methodology. Starting from the raw commit messages, we employed NLP techniques to build a structured database. We then extracted the main features of the messages and used them as input to different clustering algorithms. Once each entry was labelled, we applied supervised machine learning techniques to build a prediction and classification model. Results: We have developed a machine learning-based model that automatically classifies the commit messages of a software project. Our model exploits a ground-truth dataset of commit messages obtained from various GitHub projects belonging to the HEP context. Conclusions: The contribution of this paper is two-fold: it proposes a ground-truth database, and it provides a machine learning prediction model; together, they automatically identify the most change-prone areas of code. Our model has obtained very high average precision, recall, and F1-score.
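The first step of the pipeline above, turning raw commit messages into structured records, can be sketched as follows. The stopword list and feature set here are hypothetical stand-ins for whatever features the study actually extracted:

```python
import re
from collections import Counter

# Minimal stopword list (illustrative only).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "for", "and"}

def preprocess(commit_message):
    """Turn a raw commit message into a structured feature record."""
    tokens = [t for t in re.findall(r"[a-z]+", commit_message.lower())
              if t not in STOPWORDS]
    return {
        "tokens": tokens,                 # cleaned token list
        "length": len(tokens),            # message length feature
        "term_counts": Counter(tokens),   # bag-of-words vector
        # Simple keyword flag; a real pipeline would learn such
        # labels via clustering rather than hard-code them.
        "is_fix": any(t in {"fix", "bug", "patch"} for t in tokens),
    }

record = preprocess("Fix memory leak in the event loop")
print(record["is_fix"], record["tokens"])
# True ['fix', 'memory', 'leak', 'event', 'loop']
```

Records like these would then feed the clustering stage, whose cluster labels serve as the ground truth for the supervised classifier.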
ARTICLE | doi:10.20944/preprints202206.0205.v1
Subject: Computer Science And Mathematics, Information Systems
Keywords: log analysis; monitoring data; anomaly detection; natural language processing; topic modeling; clustering technique; time series anomaly detection
Online: 14 June 2022 (11:10:15 CEST)
Context: Anomaly detection in a data center is a challenging task, as it must consider different services running on various resources. The current literature shows the application of artificial intelligence techniques to either log files or monitoring data: the former are created by services at run time, while the latter are produced by specific sensors directly on the physical or virtual machine. Objectives: We propose a model that exploits the information in both log files and monitoring data to identify patterns and detect anomalies over time. Methods: The key idea is to use, on one side, natural language processing solutions to detect problems at the service level, extracting words that represent anomalies; clustering and topic modeling techniques are then used to identify patterns and group them by topic. On the other side, a time-series anomaly detection technique is applied to the sensor data, so that problems found in the log files can be combined with problems recorded in the monitoring data. Results: We have tested our approach on a real data center whose log files and monitoring data characterize the behaviour of physical and virtual resources in production. We observed a correspondence between anomalies in the log files and in the monitoring data, e.g. an increase in memory usage or machine load. The results are extremely promising. Conclusion: Our model requires the integration of site administrators' expertise in order to cover all critical scenarios in the data center and to interpret the results properly.
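The monitoring-data side of this model can be sketched with a simple trailing-window z-score detector; the window size, threshold, and synthetic memory-usage trace below are assumptions for illustration, not the actual detector or data used in the study:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the trailing-window mean."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows (sigma == 0) to avoid division by zero.
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic memory-usage trace (in %) with a spike at index 8.
usage = [50, 51, 49, 50, 52, 50, 51, 49, 95, 50]
print(detect_anomalies(usage))  # [8]
```

Timestamps of such flagged points could then be cross-referenced with the anomaly topics extracted from the log files, which is the correlation the abstract reports (e.g. log errors coinciding with a jump in memory usage or machine load).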