Version 1: Received: 13 May 2022 / Approved: 17 May 2022 / Online: 17 May 2022 (03:10:48 CEST)
Version 2: Received: 7 July 2022 / Approved: 7 July 2022 / Online: 7 July 2022 (10:49:33 CEST)
Version 3: Received: 5 January 2023 / Approved: 6 January 2023 / Online: 6 January 2023 (01:55:19 CET)
Esnault, C.; Rollot, M.; Guilmin, P.; Zucker, J.-D. Qluster: An Easy-to-Implement Generic Workflow for Robust Clustering of Health Data. Frontiers in Artificial Intelligence 2023, 5, doi:10.3389/frai.2022.1055294.
Abstract
Exploring health data with clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn strengthens medical knowledge, whether about a disease or about a targeted real-world population. Nevertheless, in contrast to conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains little discussed. As a result, data science projects vary considerably in how they are carried out, in terms of the algorithms used and of the reliability and credibility of the designed approach. Making parsimonious and judicious choices of both algorithms and implementations at each stage, this paper proposes Qluster, a practical workflow for performing clustering tasks. This workflow strikes a compromise between (1) genericity, as it is suitable regardless of data volume (small/big) and of the nature of the variables (continuous/qualitative/mixed), (2) ease of implementation, as it relies on a small number of easy-to-use software packages, and (3) robustness, through stability evaluation of the final clusters and the use of recognized algorithms and implementations. The workflow can be easily automated and/or routinely applied to a wide range of clustering projects. A synthesis of the literature on data clustering and the scientific rationale supporting the proposed workflow are also provided. Finally, a detailed application of the workflow to a concrete use case is presented, along with a practical discussion for data scientists. An implementation on the Dataiku platform is available from the authors upon request.
Computer Science and Mathematics, Data Structures, Algorithms and Complexity
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.