Search | Preprints.org

Background: Ongoing molecular profiling studies enabled by advances in biomedical technologies are producing vast amounts of ‘omic’ data for early detection, monitoring, and prognosis of diverse diseases. A major common limitation is the scarcity of biological samples, necessitating integrative modeling frameworks that can make optimal use of available data for disease classification tasks. Related data sets are often available from different studies, but may have been generated using different technology platforms. Thus, there is a critical need for flexible modeling methods that can handle data from diverse sources to facilitate the discovery of robust biomarkers that underlie disease regulatory processes. Results: In this paper, we introduce a novel framework called Knowledge Augmented Rule Learning (KARL), which incorporates two sources of knowledge, domain and data, for pattern discovery from small and high-dimensional datasets, such as transcriptomic data. We propose KARL as a transfer rule learning framework in which knowledge of the domain is transferred to the learning process on data in order to 1) improve the reliability of the discovered patterns, and 2) study the knowledge of the domain when it is used along with data for modeling. In this work, we generated KARL models on gene expression datasets for five types of cancer: brain, breast, colon, lung, and prostate. As our knowledge of the domain, we used the Ingenuity Knowledge Base (IKB) to extract genes related to hallmarks of cancer and annotated these prior relationships before learning classifiers from these datasets. Conclusions: Our results show that KARL produces, on average, rule models that are more robust classifiers than the baseline without such background knowledge, for our tasks of cancer prediction using 25 publicly available gene expression datasets. Moreover, KARL helped us gain insights into previously known relationships in these gene expression datasets, along with new relationships not input as known, to enable informed biomarker discovery for cancer prediction tasks. KARL can be applied to modeling similar data from any other domain and classification task. Future work would involve extending KARL to handle hierarchical knowledge in order to derive more general hypotheses to drive biomedicine.

Preprint ARTICLE | doi:10.20944/preprints201612.0031.v1

Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations

Mahbaneh Eshaghzadeh Torbati, Makedonka Mitreva, Vanathi Gopalakrishnan

Subject: Biology And Life Sciences, Immunology And Microbiology Keywords: helminth infection; microbiota; 16S rRNA Gene; taxonomic tree; classification; SMART-scan method; transfer learning

Online: 6 December 2016 (08:23:46 CET)

Human microbiome data from genomic sequencing technologies are accumulating rapidly, giving us insights into bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data for classification of human infection by parasitic worms such as helminths can help in detection and management across global populations. Real-world datasets from microbiome experiments are typically sparse, containing hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens that are analyzed. This sparsity means that more observations are needed for accurate predictive modeling, and it has previously been dealt with using different methods of feature reduction. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is by using a meaningful mapping among features of these datasets. In this paper, we claim that such a mapping exists among the members of each cluster, where clusters are formed based on phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space result in no performance deterioration for the given classification task. We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. In the next step, we add taxonomic modeling using the SMART-scan module to group the data, and learn classifiers using the same four methods, to test the validity of the achieved groupings. We observed a 6% to 23% and 7% to 26% performance improvement based on the Area Under the ROC Curve (AUC) and Balanced Accuracy (Bacc) measures, respectively, over 10 runs of 10-fold cross-validation. These results show that using phylogenetic dependency for grouping our microbiota data results in a noticeable improvement in classification performance for helminth infection detection. These promising results from this feasibility study demonstrate that methods such as SMART-scan can be utilized in the future for knowledge transfer from different but related microbiome datasets by phylogenetically related functional mapping, to enable novel integrative biomarker discovery.
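
As a rough illustration of two ideas in this abstract, the following Python sketch sums sparse species-level counts into a coarser taxonomic group and computes Balanced Accuracy as the mean of sensitivity and specificity. It is not the SMART-scan method, which also uses association with the phenotype when forming groups; the taxon names, the `parent_of` mapping, and the labels are hypothetical values chosen only for the example.

```python
from collections import defaultdict


def group_counts_by_taxon(sample_counts, parent_of):
    """Sum species-level counts into their parent taxon.

    Taxa without an entry in parent_of are kept under their own name.
    This is plain aggregation, not SMART-scan (assumption for illustration).
    """
    grouped = defaultdict(int)
    for taxon, count in sample_counts.items():
        grouped[parent_of.get(taxon, taxon)] += count
    return dict(grouped)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = infected)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical sparse sample: most species are absent (count 0).
sample = {
    "Bacteroides fragilis": 3,
    "Bacteroides vulgatus": 0,
    "Prevotella copri": 5,
}
parent_of = {
    "Bacteroides fragilis": "Bacteroides",
    "Bacteroides vulgatus": "Bacteroides",
    "Prevotella copri": "Prevotella",
}
grouped = group_counts_by_taxon(sample, parent_of)

# Hypothetical labels: 2 infected and 4 uninfected samples.
bacc = balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Grouping reduces the number of features and makes each one less sparse, which is the property the abstract relies on. Balanced Accuracy is less sensitive to class imbalance than plain accuracy, since each class contributes equally to the score.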

Preprint ARTICLE | doi:10.20944/preprints201612.0078.v1

An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data

Yuzhe Liu, Vanathi Gopalakrishnan

Subject: Computer Science And Mathematics, Computer Science Keywords: missing value imputation; machine learning; decision tree imputation; k-nearest neighbors imputation; self-organizing map imputation

Online: 15 December 2016 (08:27:13 CET)

Preprint ARTICLE | doi:10.20944/preprints201612.0060.v1

NeBSS: Semi-Automated Parcellation of Neonatal Structural Brain MRI

Rafael Ceschin, Alexandria Zahner, Vanathi Gopalakrishnan, Ashok Panigrahy

Subject: Computer Science And Mathematics, Software Keywords: neonatal MRI; brain structure segmentation; volume extraction

Online: 10 December 2016 (08:44:55 CET)

Preprint ARTICLE | doi:10.20944/preprints201612.0077.v1

Learning Parsimonious Classification Rules from Gene Expression Data Using Bayesian Networks with Local Structure

Jonathan Lyle Lustgarten, Jeya Balaji Balasubramanian, Shyam Visweswaran, Vanathi Gopalakrishnan

Subject: Computer Science And Mathematics, Data Structures, Algorithms And Complexity Keywords: rule based models; gene expression data; bayesian networks; parsimony

Online: 15 December 2016 (08:21:24 CET)

The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, is combinatorial in the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using Area Under the ROC Curve (AUC) and Accuracy. We measure model parsimony by noting the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS, and the state-of-the-art C4.5 decision tree algorithm, across 10-fold cross-validation using ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance, while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. We also conduct a feasibility study to demonstrate the general applicability of our BRL methods to the newer RNA sequencing gene-expression data.
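
The difference in rule counts between a global and a local structure can be sketched in a few lines of Python. This is only an illustration of the counting argument, not the BRL-GSS or BRL-LSS algorithms: there is no Bayesian scoring or search, and the gene names, discretized values, class labels, and the nested-tuple tree encoding are all hypothetical.

```python
from itertools import product


def global_rules(variables):
    """Enumerate one rule antecedent per joint assignment of all variables.

    variables maps a variable name to its list of discrete values, so the
    number of antecedents is the product of the value counts.
    """
    names = list(variables)
    return [dict(zip(names, combo))
            for combo in product(*(variables[n] for n in names))]


def tree_rules(node, path=()):
    """Enumerate one (antecedent, class) rule per root-to-leaf path.

    A leaf is a class label string; an internal node is a tuple
    (variable_name, {value: child_node}).
    """
    if not isinstance(node, tuple):
        return [(path, node)]
    variable, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_rules(child, path + ((variable, value),)))
    return rules


def format_rule(antecedent, label):
    """Render a rule as an IF-THEN string."""
    conditions = " AND ".join(f"{var} = {val}" for var, val in antecedent)
    return f"IF {conditions} THEN class = {label}"


# Hypothetical discretized genes, two values each.
variables = {"geneA": ["low", "high"], "geneB": ["low", "high"]}

# Hypothetical local structure: geneB is only consulted when geneA is low.
tree = ("geneA", {
    "high": "tumor",
    "low": ("geneB", {"high": "tumor", "low": "normal"}),
})

n_global = len(global_rules(variables))  # 2 * 2 = 4 antecedents
local = tree_rules(tree)                 # 3 root-to-leaf rules
rule_text = [format_rule(a, c) for a, c in local]
```

With two binary variables the global structure needs four rules, while the tree needs three because the `geneA = high` branch never consults `geneB`. The gap widens quickly as more variables are added, which is the parsimony effect the abstract describes.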
