Identification of biomarkers and enriched pathways involved in lung cancer

Objective: The aim of this study is to find key genes and enriched pathways associated with lung cancer. Participants and Methods: Gene expression data for 54674 genes, categorized by stage, tumor and status of lung cancer, were taken from 66 patients of African American (AAs) origin. Screening at p<0.05 yielded 2392 differentially expressed genes (DEGs) based on stage, 13502 DEGs based on tumor and 2927 DEGs based on status. Results: A total of 33 common DEGs were found across stage, tumor and status of lung cancer. Gene ontology (GO) and KEGG pathway enrichment analysis was performed and 49 significant pathways were obtained, of which 10 were found to be exclusively involved in lung cancer development. Protein-protein interaction (PPI) network analysis found 69 nodes and 324 edges and identified 10 hub genes based on their highest degrees. Module analysis of the PPI network found that 'viral carcinogenesis', 'pathways in cancer', 'Notch signaling pathway' and 'AMPK signaling pathway' had a close association with lung cancer. Conclusion: The identified DEGs regulate other genes that play an important role in the growth of lung cancer. The key genes and enriched pathways identified can thus help in better identification and prediction of lung cancer.


Introduction
Worldwide mortality from lung cancer rose from 3.5 million in 1990 to 4.2 million in 2015, 1 and an estimated 2.1 million new lung cancer cases and 1.8 million deaths were expected in 2018, representing 18.4% of cancer-related mortality. 2 Lung cancer is a heterogeneous disease, and various factors including genetic mutations, environmental factors and individual habits can contribute to cancer incidence, progression and metastasis. 3 According to histological type, lung cancer is divided into non-small cell lung cancer (NSCLC) and small-cell lung cancer (SCLC); NSCLC represents roughly 85% of cases, and about 30% of NSCLC cases are classified as lung squamous cell carcinoma. 4 A number of genes and biological, cellular and molecular pathways are reported to take part in these processes.
Hence, it is crucial to understand the mechanisms that lead to the onset and development of lung cancer in order to develop diagnostic and therapeutic strategies. Past research on gene expression profiling in cancer has used microarray tools, 5 and some of these studies have focused on lung cancer with comparative analysis of the DEGs, 6 but a reliable biomarker profile distinguishing cancerous tissues from normal ones remains to be discovered.
In the present study, gene expression data of mRNAs and miRNAs were taken from 66 patients of AAs origin. A total of 54674 genes were screened on the basis of stage (I or II), tumor (present or absent) and status (dead or alive). Student's t-test for difference of means assuming unequal variances was applied to the datasets, and a two-tailed p<0.05 was considered statistically significant. This yielded 2392 DEGs for stage of lung cancer, 13502 DEGs for tumor and 2979 DEGs for status, of which 33 were common to stage, tumor and status. These 33 DEGs were screened further for gene ontology (GO) terms using the DAVID database. The genes were analysed in the STRING database for PPI network analysis, and KEGG pathway analysis was performed to identify the pathways enriched among these genes. The PPI network was visualized using Cytoscape software. Using the MCODE plug-in of Cytoscape, module analysis was performed and the top 3 modules involved in lung cancer were identified, which depicted the top 6 pathways with the genes involved in them. Using the cytoHubba plug-in of Cytoscape, the top 10 hub genes involved in lung cancer were identified along with their respective ranks and scores. For survival analysis, a Kaplan-Meier (KM) curve was drawn to represent the survival of lung cancer patients. The aim of this study is to find DEGs and related pathways in the development of lung cancer and to identify possible gene biomarkers for the identification and prognosis of lung cancer.

Materials and methods
Gene expression data: The mRNA and miRNA data of 66 patients of AAs origin were used for the analysis. The DEG data were based on stage, tumor and status of the lung cancer patients.
The data were obtained from the Gene Expression Omnibus (GEO) website, https://www.ncbi.nlm.nih.gov/geo/, under accession number GSE102287.
Student's t-test for identification of significant genes: We took expression data for 54647 genes in 66 patients of AAs origin and categorized them on the basis of stage, tumor and status.
Student's t-test for difference of means with unequal variances was applied to each gene, and genes were screened on the basis of the resulting p-value. This procedure identifies the DEGs with p<0.05 (Table 2).
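The screening step can be sketched as follows. This is a minimal illustration in standard-library Python, not the authors' code: the probe IDs, sample IDs, expression values and the helper names (`welch_t`, `two_sided_p`, `screen_degs`) are all hypothetical. It computes the unequal-variance (Welch) t statistic with Welch-Satterthwaite degrees of freedom and obtains the two-tailed p-value by numerically integrating the t density.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Assumes at least two values per group and non-zero variance overall.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # Simpson's rule; steps must be even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def screen_degs(expr, group_a, group_b, alpha=0.05):
    """Return {probe_id: p} for probes whose two-tailed p-value is below alpha."""
    hits = {}
    for probe, values in expr.items():
        t, df = welch_t([values[s] for s in group_a],
                        [values[s] for s in group_b])
        p = two_sided_p(t, df)
        if p < alpha:
            hits[probe] = p
    return hits


# Hypothetical toy data: two probes measured in six samples.
expr = {
    "probe_A": {"s1": 5.1, "s2": 5.3, "s3": 4.9, "s4": 8.0, "s5": 8.2, "s6": 7.9},
    "probe_B": {"s1": 6.0, "s2": 6.4, "s3": 5.8, "s4": 6.1, "s5": 6.3, "s6": 5.9},
}
stage_1 = ["s1", "s2", "s3"]  # hypothetical stage I samples
stage_2 = ["s4", "s5", "s6"]  # hypothetical stage II samples
hits = screen_degs(expr, stage_1, stage_2)
print(sorted(hits))
```

Running the same screen once per grouping (stage, tumor, status) and intersecting the three resulting sets of probe IDs would give the genes common to all three comparisons.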
Heat map: A heat map is used to represent the expression levels of genes across comparable samples. Using R software, we created heat maps showing the gene expression levels of the DEGs obtained for stage 1 versus stage 2, and likewise for patients classified as tumor present versus tumor absent and as dead versus alive. Gene expression levels are shown in yellow, orange and red, with gene Affy IDs and patient IDs along the x-axis and y-axis respectively.
GO term enrichment analysis: GO analysis of these 205 DEGs was done using the DAVID database, available at https://david.ncifcrf.gov/. GO is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. Its aims are to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products; and 3) provide tools for easy access to all aspects of the data. 7
Establishment of PPI Network: The Search Tool for the Retrieval of Interacting Genes (STRING) online database, available at https://string-db.org/, was used for representation of PPI networks.
A systems-level understanding of cell function requires knowledge of all functional interactions between the expressed proteins. The STRING database collects and integrates this information, including predicted protein-protein interactions (PPI), for a large number of organisms. 8 Investigating the predicted interaction networks can suggest new directions for future computational research and provide cross-species predictions for efficient interaction mapping. 9 The STRING database gave the list of the most significantly enriched pathways by KEGG pathway analysis. In the PPI network, the nodes belonging to pathways exclusively involved in lung cancer were highlighted in various colors. These pathways showed the genes that were involved in NSCLC, together with their false discovery rates.
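To make the notion of a "significantly enriched pathway with a false discovery rate" concrete, the sketch below shows a generic over-representation test: a hypergeometric upper-tail p-value per pathway, followed by Benjamini-Hochberg adjustment. This is an assumption about the general technique, not a description of what STRING or DAVID compute internally, and the pathway names and counts are hypothetical.

```python
from math import comb


def enrichment_p(overlap, list_size, pathway_size, background):
    """Hypergeometric P(X >= overlap) for a gene list drawn from a background."""
    top = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background - pathway_size, list_size - i)
        for i in range(overlap, top + 1)
    )
    return favourable / comb(background, list_size)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (FDR) in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (genes from our list in the pathway, pathway size).
pathways = {"pathway_A": (8, 120), "pathway_B": (3, 150), "pathway_C": (1, 200)}
list_size, background = 69, 20000  # background size is an assumed value

raw = {name: enrichment_p(k, list_size, size, background)
       for name, (k, size) in pathways.items()}
fdr = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for name in pathways:
    print(name, raw[name], fdr[name])
```

The adjusted values are never smaller than the raw p-values, which is why a pathway can pass a raw p<0.05 cut-off and still fail an FDR threshold.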
Cytoscape: This is an open source software platform for visualizing molecular interaction networks and biological pathways and for integrating these networks with annotations, gene expression profiles and other state data; it can be downloaded from https://cytoscape.org/. Cytoscape provides a basic set of features for data integration, analysis and visualization. The STRING network was saved in .tsv format and imported into Cytoscape. Using the MCODE (Molecular Complex Detection) plug-in of Cytoscape, the top 3 modules of protein-protein interactions seen to be involved in lung cancer were visualized. Using the cytoHubba plug-in of Cytoscape, we found the top 10 hub genes that are highly involved in lung cancer.
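Ranking hub genes by degree amounts to counting, for every node, how many distinct partners it has in the edge list. The sketch below illustrates that idea in plain Python; it is not the cytoHubba implementation. The edge list contains only the interactions explicitly named in the Discussion of this paper (for example, KMT2A with CREB1, EP300, TP53, HDAC1 and SIRT1), not the full 69-node network, and the function names are hypothetical.

```python
from collections import Counter

# Interactions named in the Discussion; NOT the full STRING network.
edges = [
    ("KMT2A", "CREB1"), ("KMT2A", "EP300"), ("KMT2A", "TP53"),
    ("KMT2A", "HDAC1"), ("KMT2A", "SIRT1"),
    ("CISH", "TP53"), ("CISH", "EP300"),
    ("PARN", "TP53"), ("P2RY1", "CREB1"),
]


def node_degrees(edge_list):
    """Count distinct neighbours per node in an undirected edge list."""
    degree = Counter()
    # Sorting each pair makes (a, b) and (b, a) the same edge.
    for a, b in {tuple(sorted(e)) for e in edge_list}:
        if a != b:  # ignore self-loops
            degree[a] += 1
            degree[b] += 1
    return degree


def top_hubs(degree, k):
    """Return the k highest-degree nodes, ties broken alphabetically."""
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def average_degree(n_nodes, n_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


degree = node_degrees(edges)
print(top_hubs(degree, 2))                 # [('KMT2A', 5), ('TP53', 3)]
print(round(average_degree(69, 324), 2))   # 9.39
```

The last line is a consistency check on the reported network: 69 nodes and 324 edges imply an average node degree of 2 × 324 / 69 ≈ 9.39, which agrees with the value given in the Results.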
TCGA database: The TCGA database offers various computational tools that can be used to analyze data. One such tool is cBioPortal for Cancer Genomics (http://www.cbioportal.org/), which provides visualization, analysis and download of large-scale cancer genomics data sets. This tool was used to find the role of the hub genes in NSCLC. The oncoprint and cancer type summary were studied for all the hub genes.
Survival analysis: Survival analysis is used to analyse the time until one or more events happen. The KM curve is used to estimate the survival of patients from time-to-event data. In medical sciences, it is often used to find the fraction of patients living for a certain time after treatment. Here, we plotted the KM curve using R software for the stage-wise survival of lung cancer patients. 23
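The KM estimate itself is a running product: at each time where a death occurs, the current survival probability is multiplied by (1 − deaths / number at risk), while censored patients simply leave the risk set. The study used R for this; the sketch below is a plain-Python illustration under that definition, with entirely hypothetical follow-up times and a hypothetical function name.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time per patient
    events : 1 if the patient died at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths + censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data (months) grouped by stage.
by_stage = {
    "stage 1": ([12, 30, 42, 60, 60], [0, 1, 0, 0, 0]),
    "stage 2": ([8, 15, 15, 36, 48], [1, 1, 0, 1, 0]),
}
for stage, (times, events) in by_stage.items():
    print(stage, kaplan_meier(times, events))

# Small worked case: survival drops to about 0.8, 0.6 and 0.3.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Computing one curve per stage and plotting them as step functions gives the stage-wise comparison described above.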

Results
After applying Student's t-test for unequal variances to 54647 genes with their gene expression values, we obtained 33 common DEGs. The selected genes had p<0.05 (Table 2). The description of the cancer patients is shown in Table 1. The genes were significantly enriched in biological process (BP) terms, including 'regulation of receptor activity' and 'anterior/posterior pattern specification', and in molecular function (MF) terms, including 'poly(A) RNA binding' and 'protein binding' (Table 3).
KEGG pathway analysis was used to identify the pathways involving these genes. A total of 24 significantly enriched pathways were identified (Table 4). The most significantly enriched pathways related to lung cancer were 'AMPK signaling pathway', 'PPAR signaling pathway', 'pathways in cancer', 'PI3K-Akt signaling pathway', 'Notch signaling pathway', 'viral carcinogenesis', 'microRNAs in cancer', 'HIF-1 signaling pathway', 'Valine, leucine and isoleucine degradation' and 'Wnt signaling pathway' (Figure 2 and Table 5).
The PPI network was constructed to identify the most important proteins and genetic modules that may serve critical roles in the growth of lung cancer. A total of 69 nodes and 324 edges were screened from the PPI network (Figure 2). The average node degree was 9.39, the average local clustering coefficient was 0.694 and the PPI enrichment p-value was <0.01. Each gene was assigned a degree, on which the hub genes were ranked (Table 6 and Figure 3). EP300 has the highest degree, 29. The high degree of these hub genes indicates that they play an important role in maintaining the entire PPI network.
In addition, to find the significant DEGs, the top 3 significant modules were selected and the functional annotation of the genes in these modules was analyzed (Figure 4 and Table 7). The results showed that these modules contained pathways that play a critical role in lung cancer. Module 1 was associated with viral carcinogenesis, pathways in cancer, Notch signaling pathway, microRNAs in cancer and Wnt signaling pathway. Module 2 was associated with AMPK signaling pathway, PPAR signaling pathway, PI3K-Akt signaling pathway and HIF-1 signaling pathway. Module 3 was associated with AMPK signaling pathway, pathways in cancer and Wnt signaling pathway.
cBioPortal is a computational tool present in the TCGA database that provides visualization, analysis and download of cancer genomics data. This tool was used to evaluate the oncoprint (Figure 5) and the lung cancer type summary, which depicts the mutations, fusions, amplifications etc. in the genes. It was found that out of the 10 hub genes, only 4 were exclusively involved in lung cancer: EP300, TP53, KMT2A and KMT2C. In these 4 genes, the alterations were largely mutations.
KM curves were plotted for the stage-wise survival of the lung cancer AAs patients. Stage 3 clearly shows the lowest survival rate among the 3 stages (Figure 6).

Discussion
Cancer is essentially a genetic disease, and different genetic alterations accumulate during the multistep process of carcinogenesis, which finally leads to abnormal excessive cell proliferation and a malignant phenotype. 10 Lung cancer is the principal pulmonary malignant tumor in terms of incidence and mortality. 11 Early identification and efficient treatment of lung cancer is the need of the hour, and it can be achieved by identifying significant genes and understanding the molecular mechanisms through which they contribute to lung cancer. DEG data can be used for further functional analysis and to screen biomarkers that can serve as targets for early identification and therapy. Such biomarkers may therefore help in detecting lung cancer in its early stages and can be used for the development of targeted treatment.
In the present study, statistical and bioinformatics methods were applied to identify new candidate genes that may serve critical roles in the development of lung cancer. The data used here contain gene expression values of 54674 genes for 66 patients, categorized on the basis of stage, tumor and status of lung cancer. A total of 33 common DEGs across stage, tumor and status were obtained based on their p-values, calculated by t-test for difference of means with unequal variances. GO and KEGG pathway analyses were then performed to find the associations of these significant genes. Finally, a PPI network was constructed, which showed that the identified DEGs do not directly play a role in causing lung cancer but interact with and regulate neighboring genes that play a very important role in the development of lung cancer (Figure 2). GO analysis is helpful for annotating genes and gene products. GO analysis in the present study showed that these significant genes are involved in biological processes such as 'regulation of receptor activity' and 'anterior/posterior pattern specification', and in molecular functions such as 'poly(A) RNA binding' and 'protein binding'. Defective functioning of biological processes and altered body system status are important causes of tumor growth and progression. Hence, monitoring the expression of these genes may help in uncovering tumor mechanisms. The KEGG pathway database supports systematic analysis of gene functions, linking genomic and functional information.
Enrichment analysis was used to find the most significant KEGG pathways related to lung cancer and its growth, which were 'AMPK signaling pathway', 'PPAR signaling pathway', 'pathways in cancer', 'PI3K-Akt signaling pathway', 'Notch signaling pathway', 'viral carcinogenesis', 'microRNAs in cancer', 'HIF-1 signaling pathway', 'Valine, leucine and isoleucine degradation' and 'Wnt signaling pathway' (Figure 2 and Table 5). Taking these pathways in turn, AMPK plays a central role in the control of cell growth, proliferation and autophagy through the regulation of mTOR activity, which is consistently deregulated in cancer cells. Targeting AMPK/mTOR is thus a strategy in the development of therapeutic agents against NSCLC. 12 The PI3K pathway is frequently deregulated in lung cancer due to genetic alterations affecting its components, resulting in increased PI3K signaling. 13 PPAR-γ induces and promotes changes associated with differentiation, as well as apoptosis, in different lung carcinoma cell lines. 14 Thus, defects in the PPAR signaling pathway can promote tumor growth. In the case of the Notch signaling pathway, Dang et al. found that overexpression of Notch3 was observed in 40% of patients with NSCLC and that this overexpression was associated with a translocation involving 19p. 15 In the HIF-1 signaling pathway, hypoxia-inducible factor-1α (HIF-1α) is overexpressed in human lung diseases, especially in NSCLC, and is closely associated with advanced tumor grade, increased angiogenesis, and resistance to chemotherapy and radiotherapy. 16 In the case of the Wnt signaling pathway, overexpression of Wnt-1, -2, -3 and -5a and of the Wnt-pathway components Frizzled-8, Dishevelled, Porcupine and TCF-4 is common in NSCLC and is associated with poor prognosis. 17 p53 is the most frequently mutated gene in lung cancer. 18 Most clinical studies suggest that NSCLC with TP53 alterations carries a worse prognosis and may be relatively more resistant to chemotherapy and radiation. 19 Inactivation of TP53 function or of its attendant pathway is a common feature of human tumors that often correlates with increased malignancy, poor patient survival and resistance to treatment. [20][21][22]
Many genes, though not among our 33 common DEGs, come into the picture because they are regulated by genes present in our initial DEG list, such as PPP1R3C, ACAA2, TRIM5, PCSK9, P2RY1, CISH, PARN and KMT2A (Figure 2). Hence, the 33 DEGs do not directly participate in the development of lung cancer, but some of them influence and regulate other genes that play a key role in its development. PPP1R3C is a predicted functional partner of GYS1 and GYS2. ACAA2 is a neighbor of ACADM. TRIM5 and PCSK9 are in a cluster network with APOA1 and APOA2. P2RY1 is connected to CREB1. CISH is connected to the two most crucial genes, TP53 and EP300. PARN is associated with TP53.
KMT2A is the gene with the highest degree among our 33 DEGs. It is connected to CREB1, EP300, TP53, HDAC1 and SIRT1.
The STRING file was imported into Cytoscape and, using the cytoHubba plug-in, the top 10 hub genes based on their degree were found. The gene with the highest score was EP300, followed by TP53 and KAT2B (Figure 3 and Table 6). These 10 hub genes play an important role in the growth of lung cancer. Using the MCODE plug-in of Cytoscape, the top 3 modules of this network were identified, which were again observed to take part in pathways involved in lung cancer (Figure 4 and Table 7). The oncoprint and cancer type summary study done with cBioPortal of the TCGA database shows that TP53 is the most mutated gene among the top 10 hub genes. Also, among the 10 hub genes, only 4 are exclusively involved in lung cancer, viz. EP300, TP53, KMT2C and KMT2A. The cancer type summary is depicted in Figure 6. Survival analysis was performed and the KM plot demonstrated that Stage 3 clearly has the lowest survival rate among the 3 stages.
Hence, this study leads us to conclude that DEGs may be directly involved in the pathways that lead to the development of cancer, or may be indirectly involved by influencing and regulating other genes and pathways that play a crucial role in the development of a tumor.

Conclusion
Overall, through identification and functional analysis of DEGs, we identified regulation of receptor activity, anterior/posterior pattern specification and protein binding as significant terms for lung cancer. The initial 33 DEGs found in this study trigger or influence neighboring genes that may be directly involved in the onset or development of lung cancer. Although it may be too early to suggest that these DEGs are ready for clinical trials, this is clearly a direction that warrants further attention. These results may help in better diagnosis and prognosis of lung cancer and may pave the way for better treatment of the disease.

Ethical statements:
The current study is based on a secondary data source. The data were obtained from the Gene Expression Omnibus website, https://www.ncbi.nlm.nih.gov/geo/. No ethical approval was needed for this study.