ARTICLE | doi:10.20944/preprints201908.0100.v1
Subject: Medicine & Pharmacology, Oncology & Oncogenics Keywords: cancer biomarker discovery; gene expression data; Ingenuity Knowledge Base (IKB); transfer learning; interpretable classification rules
Online: 8 August 2019 (05:28:26 CEST)
Background: Ongoing molecular profiling studies enabled by advances in biomedical technologies are producing vast amounts of ‘omic’ data for early detection, monitoring, and prognosis of diverse diseases. A major common limitation is the scarcity of biological samples, necessitating integrative modeling frameworks that can make optimal use of available data for disease classification tasks. Related data sets are often available from different studies, but may have been generated using different technology platforms. Thus, there is a critical need for flexible modeling methods that can handle data from diverse sources to facilitate the discovery of robust biomarkers that underlie disease regulatory processes. Results: In this paper, we introduce a novel framework called Knowledge Augmented Rule Learning (KARL), which incorporates two sources of knowledge, domain and data, for pattern discovery from small, high-dimensional datasets such as transcriptomic data. We propose KARL as a transfer rule learning framework in which knowledge of the domain is transferred to the learning process on data in order to 1) improve the reliability of the discovered patterns, and 2) study the knowledge of the domain when used along with data for modeling. In this work, we generated KARL models on gene expression datasets for five types of cancer: brain, breast, colon, lung, and prostate. As our source of domain knowledge, we used the Ingenuity Knowledge Base (IKB) to extract genes related to hallmarks of cancer and annotated these prior relationships before learning classifiers from these datasets. Conclusions: Our results show that KARL produces, on average, rule models that are more robust classifiers than the baseline without such background knowledge, for our tasks of cancer prediction using 25 publicly available gene expression datasets.
Moreover, KARL helped us gain insights into previously known relationships in these gene expression datasets, along with new relationships not supplied as prior knowledge, enabling informed biomarker discovery for cancer prediction tasks. KARL can be applied to modeling similar data from any other domain and classification task. Future work will extend KARL to handle hierarchical knowledge and derive more general hypotheses to drive biomedicine.
ARTICLE | doi:10.20944/preprints201612.0031.v1
Subject: Life Sciences, Other Keywords: helminth infection; microbiota; 16S rRNA Gene; taxonomic tree; classification; SMART-scan method; transfer learning
Online: 6 December 2016 (08:23:46 CET)
Human microbiome data from genomic sequencing technologies is accumulating fast, giving us insights into bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data for classifying human infection by parasitic worms such as helminths can help in detection and management across global populations. Real-world datasets from microbiome experiments are typically sparse, containing hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens that are analyzed. This sparsity means that more observations are needed for accurate predictive modeling, a challenge that has previously been addressed with different methods of feature reduction. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is by using a meaningful mapping among the features of these datasets. In this paper, we claim that such a mapping exists among the members of each cluster of taxa, where clusters are formed based on the phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space result in no performance deterioration for the given classification task. We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. In the next step, we add taxonomic modeling using the SMART-scan module to group the data, and learn classifiers using the same four methods, to test the validity of the resulting groupings.
We observed performance improvements of 6% to 23% in Area Under the ROC Curve (AUC) and 7% to 26% in Balanced Accuracy (Bacc) over 10 runs of 10-fold cross-validation. These results show that using phylogenetic dependency to group our microbiota data results in a noticeable improvement in classification performance for helminth infection detection. These promising results from this feasibility study demonstrate that methods such as SMART-scan can be utilized in the future for knowledge transfer from different but related microbiome datasets by phylogenetically related functional mapping, to enable novel integrative biomarker discovery.
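For readers unfamiliar with the two measures reported above, the following is a minimal Python sketch of how AUC and balanced accuracy are commonly computed for a binary task. It is not the authors' evaluation code; the function names and the toy labels and scores are hypothetical, and labels are assumed to be 1 (infected) or 0 (not infected) with at least one sample of each class.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of positives correctly detected
    specificity = tn / (tn + fp)  # fraction of negatives correctly rejected
    return (sensitivity + specificity) / 2


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 2 positive and 4 negative samples.
    labels = [1, 1, 0, 0, 0, 0]
    predictions = [1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.4, 0.3, 0.2, 0.1, 0.8]
    print(balanced_accuracy(labels, predictions))
    print(auc(labels, scores))
```

Balanced accuracy is a natural companion to AUC when the classes are imbalanced, because a classifier that always predicts the majority class scores only 0.5 on it.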
ARTICLE | doi:10.20944/preprints201612.0078.v1
Subject: Mathematics & Computer Science, General & Theoretical Computer Science Keywords: missing value imputation; machine learning; decision tree imputation; k-nearest neighbors imputation; self-organizing map imputation
Online: 15 December 2016 (08:27:13 CET)
Many clinical research datasets have a large percentage of missing values, which directly reduces their usefulness for training high-accuracy classifiers with supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem-specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine whether increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models with that of unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
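Two of the four imputation methods named above can be illustrated with a short Python sketch. This is a generic illustration under stated assumptions, not the procedure used in the study: missing values are assumed to be encoded as None, every column is assumed to have at least one observed value, the distance is computed only over features both rows have observed, and the names mean_impute and knn_impute are hypothetical.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def knn_impute(rows, k=2):
    """Replace each missing value with the mean over the k nearest donor rows."""

    def distance(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have an observed value in column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


if __name__ == "__main__":
    # Hypothetical toy data: the second row is missing its second feature.
    data = [[1.0, 2.0], [2.0, None], [3.0, 6.0], [10.0, 20.0]]
    print(mean_impute(data)[1])
    print(knn_impute(data, k=2)[1])
```

On the toy data the two methods disagree noticeably: the column mean is pulled upward by the outlying fourth row, while the k-nearest-neighbors estimate uses only the two rows most similar to the incomplete one.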
ARTICLE | doi:10.20944/preprints201612.0060.v1
Subject: Mathematics & Computer Science, Other Keywords: neonatal MRI; brain structure segmentation; volume extraction
Online: 10 December 2016 (08:44:55 CET)
1) Introduction: Brain parcellation is an important processing step in the analysis of structural brain MRI. Existing software implementations are optimized for fully developed adult brains, and provide inadequate results when applied to neonatal brain imaging. 2) Methods: We developed a semi-automated pipeline, NeBSS, for extracting 50 discrete brain structures from neonatal brain MRI, using an atlas registration method that leverages the existing ALBERT neonatal atlas. 3) Results: We demonstrate a simple linear workflow for neonatal brain parcellation. NeBSS is robust to variation in imaging acquisition protocol and magnetic field strength. 4) Conclusion: NeBSS is a robust pipeline capable of parcellating neonatal brain MRIs using a simple processing workflow. NeBSS fills a need in clinical translational research in neonatal imaging, where existing automated or semi-automated implementations are too rigid to be successfully applied to multi-center neuroprotection studies and clinically heterogeneous cohorts. The software is open source and freely available.
ARTICLE | doi:10.20944/preprints201612.0077.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: rule based models; gene expression data; bayesian networks; parsimony
Online: 15 December 2016 (08:21:24 CET)
The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, grows combinatorially with the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS yields a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using Area Under the ROC Curve (AUC) and Accuracy. We measure model parsimony by the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS, and the state-of-the-art C4.5 decision tree algorithm across 10-fold cross-validation on ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance, while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance.
We also conduct a feasibility study to demonstrate the general applicability of our BRL methods on the newer RNA sequencing gene-expression data.
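The difference in rule counts between a global and a local structure can be made concrete with a small Python sketch. This is only an illustration of the counting argument, not the BRL-GSS or BRL-LSS algorithms themselves: it assumes that a global structure yields one rule per joint assignment of the predictors and that a local structure is a decision tree yielding one rule per leaf. The gene names, the nested-dictionary tree encoding, and the function names are hypothetical.

```python
from math import prod


def count_global_rules(cardinalities):
    """One rule per joint assignment: the product of the predictors' value counts."""
    return prod(cardinalities.values())


def count_local_rules(tree):
    """One rule per leaf of a decision tree.

    An internal node is a dict {"var": name, "children": {value: subtree}};
    anything else (here, a class label string) is treated as a leaf.
    """
    if not isinstance(tree, dict):
        return 1
    return sum(count_local_rules(child) for child in tree["children"].values())


if __name__ == "__main__":
    # Three hypothetical binary predictors (gene expression: low or high).
    genes = {"geneA": 2, "geneB": 2, "geneC": 2}

    # A tree that stops splitting as soon as one gene determines the class.
    tree = {
        "var": "geneA",
        "children": {
            "high": "tumor",
            "low": {
                "var": "geneB",
                "children": {
                    "high": "tumor",
                    "low": {
                        "var": "geneC",
                        "children": {"high": "tumor", "low": "normal"},
                    },
                },
            },
        },
    }
    print(count_global_rules(genes))
    print(count_local_rules(tree))
```

With n binary predictors the global count is 2 to the power n, whereas a tree that exploits context-specific structure, as in the toy example, can need as few as n + 1 leaves.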