ARTICLE | doi:10.20944/preprints202201.0333.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics
Keywords: Covid-19; Ensemble; Genome sequencing; Machine learning; Variant
Online: 21 January 2022 (15:17:58 CET)
Covid-19 has caused infections and deaths worldwide. Although data science research has produced good predictions of positive Covid-19 case numbers, our literature review shows that little work has used variants of the virus in those predictions. We set out to define and evaluate novel variant features. We find that features relating to variant trends, thresholds and amino acid substitutions are especially powerful in two tasks. In the first task, predicting Covid-19 case numbers, accuracy improved from 71.53% without variant features to 82.12% with them. In the second task, classifying the transmission severity of variants into two classes, we developed a method that builds various ensembles by selecting appropriate models generated with variant features. In testing, our ensembles proved more accurate and reliable; one ensemble of 14 models correctly classified 90.91% of variants, outperforming other models including the popular Random Forest ensemble. In addition, because the variant features capture more of the underlying information about Covid-19 pathophysiology, our ensemble methods need only a few data samples to make an accurate prediction. The ensemble of 14 models uses only 50 cases of each variant, an ability that could be exploited for early detection of highly infectious variants. These findings may benefit public health professionals, policy makers, and the research community in the collective effort to overcome this disease.