Version 1
: Received: 8 March 2021 / Approved: 10 March 2021 / Online: 10 March 2021 (13:22:18 CET)
Version 2
: Received: 9 April 2021 / Approved: 12 April 2021 / Online: 12 April 2021 (12:20:31 CEST)
Lino Ferreira da Silva Barros, M.H.; Oliveira Alves, G.; Morais Florêncio Souza, L.; da Silva Rocha, E.; Lorenzato de Oliveira, J.F.; Lynn, T.; Sampaio, V.; Endo, P.T. Benchmarking Machine Learning Models to Assist in the Prognosis of Tuberculosis. Informatics 2021, 8, 27.
Abstract
Tuberculosis (TB) is an airborne infectious disease caused by organisms in the Mycobacterium tuberculosis (Mtb) complex. In many low- and middle-income countries, TB remains a major cause of morbidity and mortality. This work benchmarks machine learning models on SINAN-TB, a Brazilian health database of confirmed TB cases and deaths. The goal is to predict the probability of death by TB, in order to assist TB prognosis and the decision-making process. The database originally has 130 features, many of which have missing data, contain incorrect notification dates or birth dates, or are unrelated to the clinical and laboratory data. After these data are treated in a preprocessing step, a new database with 38 features and 24,015 records is generated, comprising 22,876 TB cases and 1,139 deaths by TB. We design two experiments to investigate how data imbalance affects the models' performance. Using the F1-macro metric, we verify that the best result is achieved on the imbalanced database by an ensemble model composed of gradient boosting (GB), random forest (RF), and multi-layer perceptron (MLP) models.
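As an illustration of why a macro-averaged F1 score is informative for such an imbalanced database, the following minimal sketch computes it for a hypothetical baseline that never predicts death, using only the class counts stated in the abstract. The label encoding (0 for a TB case without death, 1 for death by TB), the helper name `macro_f1`, and the baseline itself are assumptions made for illustration; none of them come from the paper, and the sketch does not reproduce any of its models or results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly gets an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Class counts from the abstract; the 0/1 encoding is an assumption.
y_true = [0] * 22876 + [1] * 1139

# Hypothetical baseline: always predict the majority class (no death).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

This baseline reaches an accuracy of about 0.953 while its macro F1 is only about 0.488, because the F1 score of the minority class is zero. Macro averaging weights both classes equally, so a model cannot score well by ignoring the deaths.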
Computer Science and Mathematics, Algebra and Number Theory
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.