Introduction
Respiratory diseases account for over 5 million deaths yearly and are a huge burden to health-care systems worldwide [
1]. Recent advances in high-throughput technologies have provided access to multiomics biological data including genomics, epigenomics, transcriptomics, proteomics, metabolomics and immunomics and provide a holistic view of pathophysiology in lung disease [
2]. Biological insights gained from multi-omics can be integrated with clinical and social data and applied in the clinical setting to improve health outcomes. Single-omics analysis is limited in that it provides only associations, whereas multi-omic integration yields a clearer mechanistic picture and thus generates testable hypotheses. State-of-the-art machine-learning methods can integrate these datasets to predict short- and long-term health trajectories and enable early interventions that alter the health course towards better outcomes.
High-dimensional data from multiple sources can be integrated using machine learning tools, including deep learning with neural networks, to build holistic models that predict mortality, morbidity, or other complications in lung disease. Large datasets such as omics datasets are well suited to ‘deep learning’, which is based on neural networks loosely modeled after the neurons of the brain. The insights gained by deep learning of multi-omic datasets can inform personalized healthcare decision making (precision medicine) and biomarker discovery.
The NIH defines precision medicine as ‘an innovative approach that takes into account individual differences in patients’ genes, environments, and lifestyles’ [
3] (
https://www.nih.gov/about-nih/what-we-do/nih-turning-discovery-into-health/promise-precision-medicine). There is an urgent need to shift from traditional reactive medicine, based on prior literature and data, to a more proactive precision medicine (PM) approach in which the trajectory towards health or disease can be predicted in advance, so that interventions to improve survival and decrease morbidity can be instituted earlier. Machine learning tools have already been applied within a holistic, systems biology approach in oncology for prediction of survival, disease severity, and biomarker development. This proactive approach needs to be adopted in other fields and disciplines. In this review, we assess the multi-omics strategy as applied to human, animal, and organoid models to provide insight into lung health and disease.
Insights into cell biology using multi-omics
Quantitative omics technologies enable cost-effective and high-throughput profiling of numerous properties of cell biology. Genomics can be profiled using whole-exome or whole-genome sequencing. Transcriptomics is assessed using RNA sequencing (RNA-Seq). Protein expression, in its total form or as post-translational modifications, is measured using mass spectrometry or antibody-based proteomics assays. Measurable epigenomic properties of cells span DNA methylation, assessed via whole-genome bisulfite sequencing or probe-based microarrays; microRNAs, measured using small RNA-Seq; histone modifications, measured using chromatin immunoprecipitation followed by sequencing (ChIP-Seq); and open chromatin, measured using the assay for transposase-accessible chromatin with sequencing (ATAC-Seq). Getting closer to cell biology, metabolomics and lipidomics can be measured using mass spectrometry-based techniques. Because humans live in symbiosis with a rich microbial and fungal community, the microbiome or mycobiome can be measured using whole-genome shotgun sequencing.
The advent of single cell technologies, including single cell RNA-Seq (scRNA-Seq) and single cell ATAC-Seq (scATAC-Seq), has provided further insight into cell biology in the last decade. A single-cell multi-omics study related to SARS-CoV2 employed single nuclei RNA-Seq and single nuclei ATAC-Seq in phenotypically healthy lungs of donors aged 30 weeks gestation, 3 years, and 30 years[
4]. It aimed to decipher the cell-specific landscape of expression and candidate cis-regulatory elements (cCREs) of key host entry genes for SARS-CoV2 infection, including ACE2 and TMPRSS2, and to explore how these change with age, given the documented age-associated risks of SARS-CoV2 infection and outcome. ACE2 transcript was found in under 100 cells, almost half of them AT2 cells. TMPRSS2 expression was detected particularly in AT1, AT2, club, ciliated, and goblet cells. ATAC signal, i.e., accessible chromatin, was found primarily in gene bodies for ACE2 and TMPRSS2 in AT1, AT2, club, ciliated, and basal cells. cCREs are regions of ATAC peaks, determined within each cell type; association of a cCRE with a nearby gene is inferred from its co-accessibility with the promoter of that gene. Using the Cicero software, 15 cCREs co-accessible with the promoter were found for ACE2, whereas 73 were found for TMPRSS2[
5]. Given the marked differences in SARS-CoV2 infection, symptoms, and outcome risk across the age spectrum, the study quantified the expression and cCREs for ACE2 and TMPRSS2 across neonate, infant, and adult lung. A larger proportion of AT2 cells expressed ACE2 and TMPRSS2 in adults compared to neonates and infants. In the ATAC data, two cCRE clusters for TMPRSS2 in AT2 cells, comprising nine cCREs, showed increased accessibility with age. These clusters were associated with genes involved in response to viral infection, immune response, and injury repair, and also overlapped with genes found in mouse models to be associated with lung epithelial necrosis and chronic inflammation. Overall, this epigenomic and transcriptomic map of SARS-CoV2 host genes can serve as a reference for studies using lungs from donors with SARS-CoV2 infection or animal models.
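The co-accessibility idea used in this study to link cCREs to promoters can be illustrated with a deliberately simplified sketch. Cicero itself fits a more elaborate regularized model; the version below swaps in a plain Pearson correlation of peak accessibility across groups of cells. All peak names, accessibility values, and the 0.5 threshold are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long accessibility profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-accessibility signal
    return cov / sqrt(var_x * var_y)


def coaccessible_peaks(promoter, peaks, threshold=0.5):
    """Return names of peaks whose accessibility tracks the promoter."""
    return sorted(
        name for name, profile in peaks.items()
        if pearson(promoter, profile) >= threshold
    )


# Hypothetical accessibility of one promoter across six groups of cells.
promoter_accessibility = [0.1, 0.4, 0.5, 0.9, 0.8, 0.2]

# Hypothetical candidate peaks near that promoter.
candidate_peaks = {
    "peak_A": [0.2, 0.5, 0.6, 1.0, 0.9, 0.3],  # rises and falls with the promoter
    "peak_B": [0.9, 0.6, 0.5, 0.1, 0.2, 0.8],  # opposite pattern
    "peak_C": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # constant
}

linked = coaccessible_peaks(promoter_accessibility, candidate_peaks)
print(linked)
```

In this toy setting only the peak that rises and falls with the promoter is retained as a candidate regulatory element; real analyses additionally account for genomic distance and data sparsity.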
A study utilizing single cell RNA-Seq, single cell ATAC-Seq, and spatial transcriptomics generated an atlas of human fetal lungs spanning 5-22 post conception weeks (PCW)[
6]. This study identified 144 cell types. Overall, cell clusters grouped by age into 5-6, 9-11, and 15-22 PCW. Many of the fetal cells were matched to their adult lung counterparts, such as fetal airway progenitors with adult secretory club cells, and proximal secretory fetal cells with adult goblet cells. Interestingly, fetal AT1 and AT2 cells had transcriptomic profiles highly concordant with those of the adult cells. This study identified multiple cell types specific to the developing lung, such as progenitors of secretory cells and transition populations. Epithelial progenitor cells were stratified into tip, stalk, airway progenitor, and proximal secretory progenitor cells; using spatial transcriptomics, they were spatially localized and placed on a trajectory of differentiation. The software CellPhoneDB was utilized to elucidate cell-cell communication in distinct lung niches[
7]. The airway niche comprised airway fibroblasts, late airway smooth muscle cells (SMCs), and airway epithelial cells. Cell-cell communication analysis between airway fibroblasts and airway epithelium recovered known signaling, such as via TGFB and BMP4, as well as novel interactions, such as FGF7/18 to FGFR2/3 and non-canonical WNT5A to FZD/ROR. These results were validated by tissue staining and by using distal tip-based lung organoids; when grown in media with FGF, those organoids showed robust airway differentiation into secretory, basal, and ciliated cells. Using scATAC-Seq, transcriptional regulation was assessed in each cell type; the analysis recapitulated known results, such as TCF21 in fibroblasts, KLF in secretory and AT1/AT2 cells, and TP63 in basal cells. A novel observation was the enrichment of TCF4 in pulmonary neuroendocrine cells. Overall, this study showed how multi-omic lung profiling can provide rich references for disease models, which can be integrated with, and used to interpret, data generated from human, in vitro, or in vivo models of lung disease.
Integration of Multiomics Data
Whereas substantial knowledge can be derived from applying a single omics at scale, in either human samples or model organisms, additional and refined insight can be obtained via
multi-omics integration. Multi-omics data may be integrated via early, intermediate and late integration approaches (
Figure 1)[
8,
9].
Early integration, or early concatenation, is simple to implement but yields a vast number of features relative to the number of available data points, a problem known as the "
curse of dimensionality"[
10]. Multi-omics datasets may contain more than 50,000 features when the genome, transcriptome, and proteome are combined, whereas the number of available patient samples may be relatively small (hundreds or fewer). The heterogeneity of omics datasets can also be a serious issue, as omic datasets can have different data types and distributions (e.g., numerical, categorical, continuous, discrete) and differ significantly in size (number of features). A frequent step in multi-omics analysis is
dimensionality reduction, which is the process of reducing the number of variables in order to decrease the dimensionality and noise of a dataset. It is in principle an optional simplification step, but some strategies (early and intermediate integration) often require prior dimensionality reduction to be effective.
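As a rough illustration of early integration, the sketch below first applies a crude dimensionality reduction (keeping only the most variable features of one block), then standardizes every feature and concatenates the blocks sample by sample. The variance filter stands in for more sophisticated reduction methods, and the matrices, feature counts, and function names are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def top_variable_features(matrix, k):
    """Keep the k columns (features) with the highest variance across samples."""
    columns = list(zip(*matrix))
    ranked = sorted(range(len(columns)),
                    key=lambda j: pvariance(columns[j]), reverse=True)
    keep = sorted(ranked[:k])  # restore the original feature order
    return [[row[j] for j in keep] for row in matrix]


def zscore_columns(matrix):
    """Standardize each feature to mean 0 and unit variance."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sd = mean(column), pstdev(column)
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def early_integrate(blocks):
    """Concatenate standardized omics blocks feature-wise, one row per sample."""
    scaled = [zscore_columns(block) for block in blocks]
    n_samples = len(scaled[0])
    return [[v for block in scaled for v in block[i]] for i in range(n_samples)]


# Hypothetical data: rows are the same four samples in both blocks.
transcriptome = [
    [5.0, 1.0, 0.2],
    [7.0, 1.0, 0.4],
    [6.0, 1.0, 0.1],
    [9.0, 1.0, 0.3],
]
proteome = [
    [100.0, 3.0],
    [120.0, 3.5],
    [90.0, 2.5],
    [130.0, 4.0],
]

# Drop the uninformative (constant) transcript before concatenation.
reduced_transcriptome = top_variable_features(transcriptome, 2)
integrated = early_integrate([reduced_transcriptome, proteome])
n_samples, n_features = len(integrated), len(integrated[0])
print(n_samples, n_features)
```

Standardizing before concatenation keeps a block with large raw values (here the hypothetical proteome) from dominating the combined matrix. With real data the number of features would exceed the number of samples by orders of magnitude, which is the imbalance described above.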
The intermediate integration strategy transforms each omics dataset independently into a simpler representation, thus overcoming some issues of the early integration strategy. The transformation converts each dataset into a lower-dimensional and less noisy one, which decreases heterogeneity and facilitates integration and analysis.
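One way to picture intermediate integration is to compress each omics block separately and only then join the compressed representations. The sketch below assumes a single principal component per block, computed by power iteration, as the "simpler representation"; this is one possible choice among many, and the data and function names are hypothetical.

```python
from math import sqrt


def center(matrix):
    """Subtract the per-feature mean from every sample."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_pc_scores(matrix, iterations=200):
    """Project samples onto the first principal component (power iteration).

    The sign of a principal component is arbitrary; a uniform positive
    starting vector is used here, which would fail only if it happened to
    be exactly orthogonal to the leading component.
    """
    x = center(matrix)
    p = len(x[0])
    w = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, w)) for row in x]   # X w
        w_new = [sum(s * row[j] for s, row in zip(scores, x))        # X^T (X w)
                 for j in range(p)]
        norm = sqrt(sum(v * v for v in w_new))
        if norm == 0:
            break  # no variance left to explain
        w = [v / norm for v in w_new]
    return [sum(a * b for a, b in zip(row, w)) for row in x]


def intermediate_integrate(blocks):
    """Reduce each omics block on its own, then join the reduced features."""
    per_block = [first_pc_scores(block) for block in blocks]
    return [list(row) for row in zip(*per_block)]


# Hypothetical data: rows are the same four samples in both blocks.
transcriptome = [
    [5.0, 1.0, 0.2],
    [7.0, 1.0, 0.4],
    [6.0, 1.0, 0.1],
    [9.0, 1.0, 0.3],
]
proteome = [
    [100.0, 3.0],
    [120.0, 3.5],
    [90.0, 2.5],
    [130.0, 4.0],
]

joint = intermediate_integrate([transcriptome, proteome])
print(len(joint), len(joint[0]))  # samples x one component per omics block
```

Each sample is now described by one value per omics layer instead of the full feature set, so a downstream model sees a compact and more homogeneous input.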
In late integration, each omics layer or dataset is analyzed separately by machine learning tools (or manually), and the resulting predictions are combined at the end[
8,
11]. Since each omics dataset is analyzed by omic-specific machine learning tools, the problems of noise and heterogeneity found in the other strategies are largely avoided. However, the downside of the late integration strategy is that it cannot capture inter-omics interactions: the machine learning models for the different omics datasets do not share knowledge or exploit the complementary information between omics[
11]. Combining predictions alone is therefore not enough to fully exploit multi-omics data and understand the underlying biological mechanisms of diseases.
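The late integration strategy can be sketched as follows: one model is trained per omics layer, and only the resulting class probabilities are averaged. The nearest-centroid classifier, the distance-based probability, the equal-weight averaging, and all data values are assumptions made for illustration rather than a method taken from the cited work.

```python
from math import dist, exp


def fit_centroids(samples, labels):
    """Compute one mean feature vector (centroid) per class."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_proba(centroids, sample):
    """Turn distances to the class centroids into probabilities (softmax)."""
    weights = {label: exp(-dist(sample, c)) for label, c in centroids.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def late_integrate(per_omics_probs):
    """Average the class probabilities produced by the omics-specific models."""
    n_models = len(per_omics_probs)
    return {
        label: sum(p[label] for p in per_omics_probs) / n_models
        for label in per_omics_probs[0]
    }


# Hypothetical training data for four subjects measured in two omics layers.
labels = ["control", "control", "disease", "disease"]
transcriptome = [[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]]
proteome = [[0.5], [0.7], [2.0], [2.2]]

# Each model sees only its own omics layer, so no cross-omics
# interaction can be learned at this stage.
transcriptome_model = fit_centroids(transcriptome, labels)
proteome_model = fit_centroids(proteome, labels)

# Hypothetical new subject.
combined = late_integrate([
    predict_proba(transcriptome_model, [2.9, 3.0]),
    predict_proba(proteome_model, [1.9]),
])
predicted = max(combined, key=combined.get)
print(predicted, combined)
```

Because the two models never exchange features, the sketch also makes the stated limitation concrete: an effect that is visible only when a transcript and a protein are considered together cannot influence the final prediction.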
The potential applications are numerous; we enumerate a subset that has been reported in the literature for lung-related disease models. Based on lessons from The Cancer Genome Atlas, molecular disease endotypes can be inferred for lung diseases. Disease drivers and biomarkers of disease presence or response to treatment can be refined using multi-omics. Further, new therapeutic vulnerabilities can be determined and exploited by drug repurposing. Finally, multi-omics can be extended to surrogate sites, such as blood, skin, gut, saliva, or nasal cavity. In addition to numerous technologies, researchers have access to a trove of public data, comprising both reference and disease model datasets, in repositories such as LungMap, ENCODE, NIH Epigenome Roadmap, NCBI Gene Expression Omnibus (GEO), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC)[
12,
13,
14,
15].
Lung multiomics Models
Respiratory disease is a common cause of morbidity and mortality worldwide[
16]. With the ongoing global pandemic caused by SARS-CoV2[
17], respiratory diseases remain a leading cause of death and disability. In almost all respiratory diseases, the epithelium, a monolayer of cells lining the conducting and respiratory airways, is damaged, which in turn impairs the ability of the proximal airways to warm, humidify, and cleanse inhaled air and of the distal airways to facilitate gas exchange. As a result, health and quality of life are severely impacted by the impaired lung function that occurs in respiratory disease. Human models, such as primary cells and organoids, and animal studies involving integrated multi-omics will allow differences in markers and biologic processes between disease and non-disease models to be elucidated (
Figure 2). These differences will provide insight into lung disease including pathways that result in regeneration and repair. Understanding these pathways will be a critical factor in the development of preventive treatments and therapeutic modalities to treat lung diseases and can eventually be harnessed to develop a personalized approach to treating respiratory diseases.
Primary cells and transformed or tumor cell lines have been used for the last half century to understand lung diseases. The cells in these models retain many of the donor tissue characteristics and recapitulate markers and functions that are present in vivo [
18,
19,
20]. These models have the advantage that they are amenable to genetic engineering allowing the dissection of the role of individual molecules and pathways in disease[
21]. Additionally, the ease of genetic engineering in these systems has allowed testing function via inducible gene expression[
22]. Because of their wide use, many of these cell lines are well characterized, providing a foundation that can only enhance multi-omics studies in which they are used. Studies in cell lines are well suited for high throughput drug screening and evaluation of drug response[
23] and are particularly valuable in studying lung cancer[
24,
25]. However, these models are not without limitations. First and foremost, they fall short of replicating the complex nature of many respiratory diseases. Many lack the multiple cell types and cellular polarity present in the proximal and distal respiratory epithelium, as well as the morphology and structural features that play a significant role in lung biology. Additionally, these cell lines lack an immune cell component, which plays a significant role in the etiology of many lung diseases[
26]. Coupled with questions that have now arisen as to the relevance of findings obtained with these models, technical issues including a requirement for tissue donors, a finite lifespan, and limited expansion capacity have contributed to a reduced focus on using multi-omics in these models to study many respiratory diseases.
More recently, organoid models have come to the forefront as systems in which multi-omics is used to understand respiratory disorders. Organoids can be derived from either induced pluripotent stem cells or embryonic stem cells (hereafter referred to as iPSC organoids) or established from tissue-derived multipotent stem cells (referred to as ASC organoids)[
27]. Differentiation of iPSC organoids occurs in a multistep process that proceeds through a definitive endoderm stage and an anterior foregut stage to NKX2-1+ lung epithelial progenitors[
28,
29]. ASC organoids are established following mechanical and enzymatic isolation of either conducting airway or respiratory epithelium stem cells from lavage, small amounts of native tissue, or biopsy specimens[
30,
31,
32,
33]. Both iPSC and ASC organoids rely heavily on manipulation of exogenously added growth factors to induce differentiation of the mature polarized airway epithelium, and on the presence of an extracellular matrix such as Matrigel®, synthetic matrices, or decellularized tissue scaffolds. Both organoid models require cultivation on transwells under air-liquid interface (ALI) conditions, where the basal side of the epithelium is in contact with media and the apical side is exposed to air, to achieve maximum differentiation potential[
32,
34]. iPSC and ASC organoids can give rise to alveolar organoids that recapitulate respiratory epithelium; nasal, tracheal, or bronchial organoids that recapitulate conducting airway epithelium; and lung organoids that are a mixture of the two[
35,
36,
37]. Like primary and transformed cell models, organoids are amenable to genetic engineering and can be established from donors with genetic disorders that cause lung disease[
38]. Organoids are well suited for drug screening and as models for infectious disease research[
39,
40,
41,
42] and recapitulate many aspects of other chronic lung diseases such as idiopathic pulmonary fibrosis [
43] and cancer[
32]. However, iPSC derived airway cells do not seem to achieve the maturation levels observed in human lung[
44] and, although ASC organoids seem to contain mature epithelial cells, they lack stromal and immune components that play a significant role in most lung diseases. Nevertheless, the increased cellular complexity and the modeling of human epithelium, combined with a forward-thinking multi-omics approach, offer an opportunity to advance knowledge of respiratory illness with significant translational potential.
There are many animal models of lung diseases including chronic diseases such as cystic fibrosis [
45], idiopathic pulmonary fibrosis[
46,
47], viral and bacterial infections[
48], and cancer [
49]. Animal models of respiratory diseases offer several advantages including reproducibility, control of environmental factors, large numbers of replicates, genetic phenotyping, and access to lung tissue. Multi-omics approaches can be readily used to provide insight into the relationship between environmental stressors and respiratory disease. The information gained can reveal detailed physiologic and pathologic pathways that contribute to disease pathogenesis. Animal models of lung disease are particularly useful in assessing how genetic mutations predispose to a specific disease and provide a model in which interactions between components of the whole system can also be examined[
50]. Animal models are limited by the fact that, in most cases, there are significant differences between human and animal lung tissue[
12,
51]. In addition, many human respiratory diseases are not recapitulated in animal models, and clinical manifestations are difficult to assess. However, comparisons between human and animal multi-omics analyses can provide validation for animal models, and multi-omic approaches that combine data collected from both human in vitro and animal in vivo models will provide robustness, rigor, and reproducibility to support the conclusions drawn.