Preprint, Technical Note, Version 2. Preserved in Portico. This version is not peer-reviewed.

On the Inapplicability of Supervised Machine Learning to Studying the Driving Forces of Evolution

Version 1 : Received: 8 December 2020 / Approved: 9 December 2020 / Online: 9 December 2020 (09:53:18 CET)
Version 2 : Received: 28 January 2021 / Approved: 28 January 2021 / Online: 28 January 2021 (11:31:23 CET)

How to cite: Elhaik, E.; Graur, D. On the Inapplicability of Supervised Machine Learning to Studying the Driving Forces of Evolution. Preprints 2020, 2020120214 (doi: 10.20944/preprints202012.0214.v2).

Abstract

Supervised machine learning (SML) is a powerful method for predicting a small number of well-defined output groups (e.g., potential buyers of a certain product) by taking as input a large number of known well-defined measurements (e.g., past purchases, income, ethnicity, gender, credit record, age, favorite color, favorite chewing gum). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known to be true. SML has had enormous success in the world of commerce, and this success may have prompted a few scientists to employ it in the study of molecular and genome evolution. Here, we list the properties of SML that make it an unsuitable tool in certain evolutionary studies. In particular, we argue that SML cannot be used in an evolutionary exploratory context for the simple reason that training datasets that are known to be a priori true do not exist. As a case study, we use an SML study in which it was concluded that most human genomes evolve by positive selection through soft selective sweeps (Schrider and Kern 2017). We show that in the absence of legitimate training datasets, Schrider and Kern (2017) used (1) simulations that employ many manipulatable variables and (2) a system of cherry-picking data that would put to shame most modern evangelical exegeses of the Bible. These two factors, in addition to the lack of methodological detail and negative controls, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S-HIC) should be taken with a huge shovel of salt.

Subject Areas

machine learning; evolution; discoal; SML

Comments (1)

Comment 1
Received: 28 January 2021
Commenter: Eran Elhaik
Commenter's Conflict of Interests: Author
Comment: Change of title to make it more accurate, adding references, correcting small errors.