1. Introduction
The gene is the most basic unit of hereditary information; it contains the blueprints required for the development of all final bioactive products in the cell, from miRNAs and other small RNAs to messenger RNAs that are translated into proteins. Besides these products, the gene carries further information pertaining to its own expression. Importantly, regulation is an integral part of expression, a process controlled through elements found in the gene, particularly sequences in the promoter region [
1,
2].
Gene expression is regulated by transcription factors, which activate or repress transcription by binding to specific DNA motifs through their DNA-binding domains [
3]. DNA motifs can be seen as short conserved sequences (ranging from 2 to 20 bp) to which sets of transcription factors (or entire families) bind. One example of a DNA motif is the CACCC-box, which serves as a binding site for several transcription factors, including the Krüppel-like factor (KLF) family [
4], the Specificity Protein (SP) family [
5], and the Wilms tumor gene (WT1)[
6].
As a regulatory element, the CACCC-box works together with other cis-elements, such as the TATA and CAAT boxes, in promoters. One example is the cardiac B-type natriuretic peptide (BNP) gene, in which deletions result in reduced transcriptional activity in postnatal cardiomyocytes, an activity regulated by KLF-13 [
7,
8,
9,
10].
Among the transcription factor families mentioned above that bind to the CACCC-box, much research has been devoted to the KLFs for their roles in diverse heart-related processes, such as the maturation of cardiac myocytes. Their dysregulation is associated with several cardiovascular diseases (CVDs), including cardiomyopathies, infarctions, left ventricular hypertrophy, and diabetic cardiomyopathies [
4,
11,
12,
13,
14]. CVDs remain the leading cause of death worldwide. According to a recent publication by the American College of Cardiology, “CVDs are a persistent challenge that led to an enormous number of premature and preventable deaths”. Specifically, ischemic heart disease ranks first in mortality, causing 108 deaths per 100,000, and overall CVD-related deaths amounted to 19.8 million in 2022 [
15]. As stated, KLFs have been shown to play a crucial role in the progression and control of CVDs[
4]. Examples of KLF involvement in CVDs can be seen in atherosclerosis, where the expression of KLF-5 switches vascular smooth muscle cells to a proliferative phenotype, while the repression of KLF-2 leads to an inflammatory state and vascular remodeling [
16,
17,
18]. During myocardial infarction, there is a noted elevation of KLF-4, which induces the secretion of collagen type I and III via the TGF-β1/Smad3 pathway, contributing to myofibroblast differentiation[
19,
20,
21]. In diabetic cardiomyopathy, an increase in KLF-5 has been shown to upregulate NADPH oxidase 4, a primary cause of cardiomyocyte superoxide accumulation. Moreover, KLF-5 activates serine palmitoyltransferase (SPT) long-chain base subunits 1 and 2, enzymes involved in ceramide synthesis; this in turn changes the lipid landscape of the heart, further contributing to the cardiomyopathy phenotype [
11,
22]. Interestingly, in diabetic patients, mutations in KLF-15 such as rs9838915 are associated with an increased risk of heart failure, as rs9838915 is linked to increased left ventricular (LV) mass and thickening of the septal wall [
23]. KLF-13 loss-of-function variants in the heart have been associated with double-outlet right ventricle and ventricular septal defects [
10], and reduce the activity of the GATA-6, GATA-4, and ANP promoters. Additional KLF-13 mutations have also been connected to congenital heart defects: although KLF-13 is an activator of VEGF-A and ANP, its mutants are linked to bicuspid aortic valve, patent ductus arteriosus, and ventricular septal defect [
8]. Finally, in stroke, the expression of KLF-2 is reduced, which promotes proinflammatory effects by permitting activation of the NF-κB/p65 pathway [
24].
Other regulatory effects mediated by the CACCC-box are exerted by the SP family [
25]. Among SP members, the transcription factor Sp1 has been reported to work with KLF-4 and even to overlap its binding site. In particular, in esophageal carcinoma, Sp1 levels rise and disrupt the regulation of KRT19 by KLF-4, which leads to malignant transformation and metastasis [
25].
As mentioned, gene expression is a tightly controlled process, facilitated in part by the accessibility of the DNA and regulated by the elements to which transcription factors bind. We therefore took a bioinformatic approach to investigate gene enrichment, specifically targeting diseases and metabolic pathways, by establishing associations between genes containing the CACCC-box upstream of their promoter. Furthermore, we investigated their relation to the KLF family as regulators or key players in CVDs, to comprehensively understand the interplay between these genetic elements.
2. Materials and Methods
Raw Data Collection & Web Scraping
The genome annotation of H. sapiens GRCh38/hg38 (2013) from the Ensembl project series was utilized in this study, employing the Signal Search Analysis Server (SSA) software package (
https://epd.expasy.org/ssa/findm.php) with the promoter prediction tool. The -CACCC- motif was sought within 150 base pairs upstream of the TATA-Box / Goldberg-Hogness Box functional site, and the best options were selected.
The sequences containing the TATA-Box motif were obtained, and an algorithm was used to search for the -CACCC- motif within the first 150 bases. Subsequently, these sequences underwent multiple sequence alignment analyses using BLAST. Given the large number of sequences, an algorithm utilizing Web Scraping and Python code with the Selenium WebDriver was employed to extract the gene names associated with each sequence.
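The motif search described above can be illustrated with a minimal Python sketch. It is not the code used in this study; the function names, the forward-strand-only search, and the dictionary layout of the input records are assumptions made for illustration.

```python
def find_motif_positions(sequence, motif="CACCC", window=150):
    """Return 0-based start positions of `motif` fully contained in the
    first `window` bases of `sequence` (forward strand only, an assumption)."""
    region = sequence[:window].upper()
    positions = []
    start = region.find(motif)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping matches are also reported.
        start = region.find(motif, start + 1)
    return positions


def filter_sequences(records, motif="CACCC", window=150):
    """Keep only the records whose first `window` bases contain the motif.

    `records` is a hypothetical mapping of sequence name -> sequence string.
    """
    kept = {}
    for name, sequence in records.items():
        positions = find_motif_positions(sequence, motif, window)
        if positions:
            kept[name] = positions
    return kept
```

A motif that starts inside the window but ends beyond base 150 is not counted in this sketch; whether such boundary cases should be kept is a design choice that the text does not specify.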
Using the Selenium Python library and its web driver functionality, the program automates the process of entering the search query on the NCBI BLAST website. The web driver interacts with the website's input field, allowing the script to enter the desired query and start the search. Once the search is performed, the web scraper identifies the specific result page containing the relevant information. The required data are extracted from the result page by analyzing the website's HTML structure and applying techniques such as HTML parsing and text pattern matching.
The extracted data were saved as a CSV file so that they could be stored and easily processed in further bioinformatic analysis.
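As an illustration of the text pattern matching and CSV export steps, the following sketch extracts a gene symbol from a hit description and writes the result as CSV. The description format (gene symbol in parentheses), the regular expression, and the column names are assumptions, not the exact patterns used in this study.

```python
import csv
import io
import re

# Assumed convention: the gene symbol appears in parentheses in the hit
# description, e.g. "Homo sapiens KLF transcription factor 4 (KLF4), RefSeqGene".
GENE_SYMBOL = re.compile(r"\(([A-Z][A-Z0-9-]*)\)")


def extract_gene_symbol(description):
    """Return the first parenthesized gene symbol, or None if absent."""
    match = GENE_SYMBOL.search(description)
    return match.group(1) if match else None


def hits_to_csv(hits):
    """Convert (sequence_id, description) pairs into CSV text.

    Hits without a recognizable gene symbol are skipped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence_id", "gene_symbol"])
    for sequence_id, description in hits:
        symbol = extract_gene_symbol(description)
        if symbol is not None:
            writer.writerow([sequence_id, symbol])
    return buffer.getvalue()
```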
Implementation with Python and Selenium
The Python programming language and the Selenium library (v. 4.9.0) were used to automate web scraping on the NCBI BLAST website. Using the Selenium web driver, a web bot was created to complete the search form and crawl through the links on the main page. Specific parameters were set to narrow the search to Homo sapiens (taxid:9606) and to (RefSeqGene[Title]) OR gene[Title] in the Entrez query. The Python code uses XPath expressions and the EC.presence_of_element_located condition to target and extract the desired elements, ensuring that only relevant data are included. The extracted gene information can be printed for further processing.
MEME
Using Python, 40 sequences were randomly sampled from the dataset of 3044 sequences to observe the filtered motif. The most representative motifs were selected with XSTREME (5.5.5), a motif discovery and enrichment analysis tool, using a p-value threshold of 0.05 and motif widths in the range of 8 to 15 nucleotides. Motifs similar to those found were compared within the same platform using Tomtom (5.5.5), a motif comparison tool.
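A random subsample of this kind could be drawn as in the sketch below. The fixed seed, the function names, and the (name, sequence) record layout are assumptions added for reproducibility and illustration; the text does not state how the sampling was seeded.

```python
import random


def sample_sequences(records, k=40, seed=0):
    """Draw k records without replacement; the seed is an assumption
    added here so that the same subsample can be reproduced."""
    rng = random.Random(seed)
    return rng.sample(list(records), k)


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text, a format accepted
    by motif discovery tools."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence in records)
```

For example, `to_fasta(sample_sequences(records))` would give a 40-sequence FASTA text that can be saved to a file and uploaded for motif discovery.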
STRING
The genes were categorized into two groups, direct and indirect, using STRING (11.5); the interactions were exported in TSV format and visualized in Excel. The direct group comprised genes exhibiting any level of interaction, from weak to strong, with members of the KLF family. The remaining genes were assigned to the indirect group.
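The direct/indirect grouping can be sketched as follows. The column names `#node1` and `node2` are assumed to match the STRING TSV export, and the rule that any gene sharing an edge with a KLF member counts as direct is a simplified reading of the description above.

```python
import csv
import io
import re

# Assumption: KLF family members are named KLF followed by a number (e.g. KLF4).
KLF_NAME = re.compile(r"^KLF\d+$")


def split_direct_indirect(tsv_text, all_genes):
    """Split `all_genes` into genes with at least one KLF edge (direct)
    and the remaining genes (indirect)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    direct = set()
    for row in reader:
        a, b = row["#node1"], row["node2"]
        if KLF_NAME.match(a) and not KLF_NAME.match(b):
            direct.add(b)
        elif KLF_NAME.match(b) and not KLF_NAME.match(a):
            direct.add(a)
    genes = set(all_genes)
    direct &= genes
    return direct, genes - direct
```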
ShinyGO
We employed ShinyGO (version 0.77), with Homo sapiens selected as the target species, to analyze the gene sets. We set a statistical significance threshold of 0.05 (false discovery rate, FDR).
To focus on relevant findings, we applied filtering criteria by considering only pathways with a minimum of 10 genes. This step aimed to reduce noise and present meaningful results. Ultimately, the analysis yielded a total of 30 metabolic pathways that were statistically significant and enriched with genes related to KLFs.
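The two filtering criteria can be expressed in a few lines of Python. This is only a sketch of the thresholds stated above applied to an exported result table; the field names `fdr` and `n_genes`, and the choice to include values equal to the cutoff, are assumptions.

```python
def filter_pathways(pathways, fdr_cutoff=0.05, min_genes=10):
    """Keep pathways that pass both the FDR threshold and the minimum
    gene count, sorted from most to least significant.

    `pathways` is a hypothetical list of dicts with keys
    "name", "fdr" and "n_genes".
    """
    kept = [
        p for p in pathways
        if p["fdr"] <= fdr_cutoff and p["n_genes"] >= min_genes
    ]
    return sorted(kept, key=lambda p: p["fdr"])
```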
The pathways analysis provided valuable insights into the biological processes, cellular components, molecular functions, and diseases associated with the gene sets. This approach facilitated the interpretation and analysis of the results, contributing to a comprehensive understanding of the interaction and functional implications of genes with KLF transcription factors.
Using the same dataset, relevant information regarding cardiovascular diseases could also be extracted through the representative KEGG pathway maps.
Enrichr & Appyter
Using the same dataset, we conducted a similar analysis with Enrichr. Specifically, we focused on disease analysis, using the Jensen database as a resource. In addition, the Appyter platform provided various visualization options for the results. This approach allowed us to identify the genes associated with the presented cardiovascular diseases and facilitated the creation of corresponding tables for further exploration.
3. Results
The CACCC-box motif was initially searched using the Signal Search Analysis Server (SSA) at 150 bp upstream and 300 bp downstream of transcription start sites (TSS) carrying the motif 5’-TATA(A/T)A(A/T)-3’ (CACCC-regions); Genome Reference Consortium Human Build 38 (GRCh38, Homo sapiens) was used as reference.
The search yielded a total of 3044 hits for the CACCC-box, with 40 diverse matrix regions.
Figure 1 shows the three most representative core matrix regions encompassing the core CACCC-box. The sequence CCCCCACCCCCAC(C/T) was found in 67.5% of the CACCC-regions with a score value of 7.8e-012; CTCCCCCTCCT was found in 72.5% of the regions with a score value of 6.9e-009; and CCCCTCC(C/T)(C/T)CCTC was found in 62.5% of the regions with a score value of 3.2e-006. Additional sequences can be found in
supplemental Table S1. Moreover,
Figure 2 shows a representative diagram for each chromosome, denoting CACCC-box binding (red marks) for identifiable genes. Of the 3044 potential hits, which included nonsense repeats as well as non-transcription starting points, 1174 were related to identifiable genes.
Our initial objective was to resolve the involvement of the KLF family; after processing, our data registered 95 hits confirmed by STRING for KLF interactions.
Table 1 shows the interacting genes related to their corresponding KLF members. Surprisingly, KLF-16 did not show any gene interaction. Moreover, most confirmed gene interactions are linked to the activator group (Group 2), wherein KLF-4 had the most interactions with nearly 50 hits. Conversely, Group 1 had the fewest interactions, with a total of 8 hits, mostly related to the Homeobox NK family. It is important to note that while the CACCC-box is a direct binding motif for KLFs, other important factors involved in cardiovascular regulation can potentially bind to it. Using the same methodology as previously mentioned, we found potential motif sites for WT1, PATZ1, KMT2A, Specificity Protein (SP), SALL4, and VEZF1.
Supplemental Figure S1 shows both the confirmed binding motifs of these factors and the theoretical motifs found through our analysis.
Supplemental Table S2 shows interactions between non-members of the KLF family and their target genes, including several genes that seem to be regulated either by KLFs and non-KLFs in combination or in alternative modes. These genes include IGF1, MEF2C, MYH11, MYOD, and TCF7L2, amongst others.
Taking into consideration all 1174 genes, we ran a ShinyGO analysis focused on diseases using the Jensen database. Overall, there were 30 known diseases with an overrepresentation of genes carrying the CACCC-motif at the TSS. Of these diseases, 4 were directly linked to cancer and 7 to cardiovascular disease, the focus of this manuscript. Other minor representations included neurological conditions, pancreatitis, lung, and congenital diseases (
Figure 3A). In addition, we ran a second ShinyGO analysis using only the 95 genes confirmed to have KLF interaction. This second analysis showed 5 similar diseases in which the CACCC-box is overrepresented (
Figure 3B).
The resulting information was then taken to Enrichr for gene identification.
Table 2A shows the relation between the identified disease and the set of CACCC-box related genes for all 1174 potential genes. Meanwhile,
Table 2B denotes the relation between the identified disease and the set of CACCC-box related genes for the 95 confirmed genes related to the KLF family. Interestingly, both tables show similar gene enrichment for Holt-Oram syndrome, hypertension, cerebrovascular disease, and heart conduction disease. Meanwhile, cardiomyopathy and coronary artery disease were only present in
Table 2A, while diabetes and diabetic retinopathy were present in
Table 2B. Of all the genes present, troponin I isoform 3 (TNNI3) was the most prevalent, appearing in hypertension, cardiomyopathy, cerebrovascular disease, and coronary artery disease. Other highly represented genes were ICAM and IL6, both related to inflammation. Additionally, there was an important presence of the TBX family, particularly members 3, 5, and 6, and of troponin T isoform 2 (TNNT2).
Further ShinyGO analysis of the general 1174 genes revealed their involvement in major (cardiac) metabolic pathways such as PI3K-Akt and MAPK signaling (
Figure 4A), as well as in diabetic and hypertrophic cardiomyopathy-related pathways. Moreover, the analysis specific to the 95 genes revealed enrichment in MAPK signaling, AGE-RAGE, RAS signaling, and HIF-1 signaling, amongst other pathways (
Figure 4B).
The ShinyGO analysis links directly to the KEGG database, which allows gene interactions in the cell (cardiomyocyte) to be visualized. As a representation of cardiac disease involvement, we present in
Figure 5 a schematic depiction of hypertrophic cardiomyopathy and
Figure 6 of diabetic cardiomyopathy. Blue-colored genes represent KLF-confirmed interacting genes, while red-colored genes represent the general 1174 (CACCC-motif) genes. Furthermore, to show direct protein-protein (KLF) interaction, a STRING (12.0) analysis was run for both diseases (
supplemental Figure S2). For hypertrophic cardiomyopathy, the analysis showed a direct interaction, specifically co-expression and text mining, between IL-6 and KLF-2, KLF-4, and KLF-6. Similarly, IL-6 also directly interacts with TNNT2, TNNI3, and ACE, suggesting the potential for indirect interactions mediated by IL-6, KLFs, and genes related to the sarcomere.
For diabetic cardiomyopathy, KLF-4 exhibits a robust interaction with COL1A1, which, in turn, shows a co-expression interaction with CD36. Similarly, ACE, NCF1, REN, and SLC2A1 also display co-expression interactions with CD36. Genes related to the mitochondrial cell component, such as NDUFB9, NDUFA6, NDUFA12, ATP5MC3, COX7B, NDUFA7, and ATP5P, which specifically function in mitochondrial proton transport, are also observed with a strong correlation.
4. Discussion
The CACCC-box motif, a well-recognized cis-regulatory element, plays a fundamental role in a diverse range of developmental processes and diseases, including CVDs. This research examines the CACCC-box and its connection to CVDs, focusing on the KLF family of transcription factors as the primary effectors.
Our findings illuminate significant activity among various KLFs, particularly those of the KLF subgroup 2 recognized as transcriptional activators (
Table 1). Notably, KLF-4 stands out as the KLF with the highest number of interactions, likely because of its well-established roles in pluripotency, cancer, and most importantly, cardiovascular development and disease (
Table 1, Table 2,
Figure 3,
Figure 4 and
supplemental Figure S2)[
4,
11,
26,
27,
28]. This is further confirmed by several crucial KLF-4 interactions in our work, including those with PAX9, PAX6, TBX5, and TBX3 (
Table 1 and
supplemental Figure S2). Previous research has established a link between KLF-4 and PAX9 (
Table 1 and
supplemental Figure S2) during cardiovascular development, specifically regarding the crucial WNT signaling pathway. In the pharyngeal endoderm, PAX9 interacts with TBX1 and GBX2, tightly controlling the intricate process of pharyngeal arch morphogenesis [
29]. Similarly, PAX6 and NKX2.2 (
Table 1) have been shown to orchestrate the differentiation of neural tube ventral progenitors by mediating the sonic hedgehog signaling pathway[
30].
Our exploration of the motif landscape yielded intriguing insights into pathological hypertrophy, revealing striking similarities to signaling pathways observed during early cardiac development (
Figure 3). This resulted in the identification of KLF-4 interactions with genes like GATA4, MEF2C, TBX5, NKX2.5, and SRF (
Table 1,
Table 2, and
supplemental Table S2, supplemental Figure S2), all of which are known regulators of cardiac hypertrophy and pro-hypertrophic pathways[
31,
32,
33]. Interestingly, KLF-15 presents a contrasting role, as it was found to repress TCF7L2 expression specifically in the postnatal heart, influencing cardiac growth. TCF7L2 is associated with diabetes and exhibits dependence on KLF-14 (
Table 1 and
supplemental Table S2), where reduced expression increases pre-adipocyte proliferation but impairs lipogenesis[
34]. KLF-15 plays a vital role in regulating cardiac physiology, including circadian oscillations in cardiac cells, and regulating genes involved in cardiac metabolism. Notably, both KLF-15 and ARNTL work in parallel to control cardiac circadian rhythms, even though no evidence of co-regulation with the ARNTL gene was observed (
Table 1). KLF-15 plays a key role in binding circadian repressors REV-ERBα (NR1D1) and NCOR (
Table 1), ensuring stable expression of a subset of cardiac genes [
35,
36].
The significant impact of hypertrophic cardiomyopathy and diabetic cardiomyopathy on cardiovascular health cannot be overstated. While hypertrophic cardiomyopathy affects approximately 1 in 500 individuals globally, with a significant proportion of cases remaining undiagnosed, diabetic cardiomyopathy emerges as a serious complication of diabetes mellitus, significantly increasing the risk of heart failure and premature death [
37,
38]. Both conditions highlight the critical importance of early detection, accurate diagnosis, and implementing effective management strategies to minimize the burden of cardiovascular disease on the population.
As diabetes is a condition linked to increased cardiovascular risk (
Table 2 and
Figure 3), it was unsurprising that our work revealed gene interaction between KLFs, particularly KLF-14, and genes associated with diabetic risk factors, such as CAMK1D, HHEX, JAZF1, and TCF7L2 (
Table 1 and
supplemental Table S2). Notably, TCF7L2’s ability to modulate PIK3R1 expression suggests its involvement in insulin signaling pathways, potentially mediating diabetes risk through interactions with KLF-14 [
39,
40].
As illustrated in
Figure 4, metabolic pathways such as MAP kinase and PI3K/Akt signaling were overrepresented, with approximately 40 genes associated with these processes through participation or regulation. KLF-4 emerged as a regulator of genes such as FN1, IGF1, and COL1A1 (
Table 1). Upon binding to the α5β1 membrane receptor, FN1 activates the PI3K/Akt signaling pathway [
29]. Additionally, the PI3K/Akt signaling pathway induces the proliferation of cardiac fibroblasts and collagen accumulation, thus contributing to fibrosis, a process associated with collagen (COL1A1) (
Table 1 and
supplemental Table S2) [
41]. This process is a consequence of various diseases identified in the enrichment analysis (
Figure 3 and
Table 2), such as hypertension and its vascular remodeling process, which alters the composition of the vascular wall, leading to stiffness and loss of elasticity [
42].
Furthermore, our analysis identified other factors, including members of the SP family, WT1, and others (
supplemental Table S2 and supplemental Figure S1), that potentially bind to the CACCC-box motif. Notably, earlier studies showed that SP members 1 and 3 are further involved in contraction, as they play critical roles in calcium management by regulating the Ca2+ ATPase pump SERCA2 [
43]. Other SP members have been shown to participate in other cardiac-related pathways; for example, SP6 can directly interact with β-catenin and modulate the expression of TCF7L2, a result supported by the presence of the CACCC-box in the promoter region of TCF7L2 (
Table 1). These findings implicate all these members as potential regulators in electrically related CVDs.
Moreover, our results also showed regulatory effects of SP7 over COL1A1, members of the FGF and TNF families, as well as DKK-1 and IL-6 (
supplemental Table S2). This implicates SP7 in potential remodeling and inflammatory processes. Specifically, in the heart, SP7 has previously been shown to play a detrimental role that favors calcification by permitting the activity of osteopontin and matrix Gla protein. Downregulation of SP7 in vascular smooth muscle cells (VSMCs) leads to a loss of contractile markers and paves the way for the upregulation of RUNX2 [
44,
45]. RUNX2 is also regulated by another CACCC-box motif binding protein KLF-5, which regulates the activity of SP7. Overall, the process of calcification further induces remodeling via inflammation, to compensate for the reduction in contractile force[
45,
46].
Furthermore, our results revealed that the CACCC-box can serve as a reverse-complement binding site for PATZ1 and WT1, suggesting a broader transcriptional regulation network. In addition, other potential binding proteins found included VEZF1 and SALL4 (
supplemental Table S2). The VEZF1 gene encodes a nuclear protein highly conserved among vertebrates, regulating genes related to cardiac muscle contraction and potentially regulating dilated cardiomyopathy[
47]. Regarding the interaction between SALL4 and KLFs in the heart, SALL4 interacts directly with T-box family members 3 and 5 and regulates the expression of gap junction alpha-5 (GJA5) and KLF-4. This regulation of KLF-4 is also seen in stem cells, where SALL4 has an important role in proliferation and survival, mediated by the Bmi-1 pathway [
48]. Meanwhile, in cardiac development, another KLF, KLF-5, is involved in regulating TBX5, as KLF-5 absence reduces expression of TBX5 (
Table 1,
Table 2, and
supplemental Table S2), contrasting with the interactions described between SALL4 and TBX5 in cardiac patterning[
49].
From a clinical perspective, our study holds value at a diagnostic and potentially prognostic level. As the molecular mechanisms that govern gene expression become better understood, new potential biomarkers for diverse CVDs can be found and used to predict the development of these pathologies. Notably, with a deeper understanding of the dysregulation of specific genes and pathways, targets such as the downstream effectors of KLF-4, KLF-5, and KLF-15, which are involved in hypertrophy and fibrosis, may offer new avenues for treatment in conditions such as diabetic cardiomyopathy and heart failure. In addition, our findings point to pathways such as PI3K/Akt which, when properly targeted, could attenuate the development of cardiovascular complications in diabetic patients.
By identifying and understanding the interactions between KLFs and other CACCC-box binding factors, regulatory effects can be better explained. Additionally, further exploration of adjacent motifs and potential binding partners sets up future studies to build a more developed picture of a particular pathology. Targeting these interactions may offer new opportunities for therapeutic intervention, either through small molecule inhibitors or through gene editing approaches aimed at modulating the activity of specific transcription factors involved in cardiovascular gene expression.
From a diagnostic perspective, the identification of gene signatures associated with cardiovascular pathology may help identify novel biomarkers for risk stratification, early detection, and monitoring of disease progression in patients at risk of CVDs. Biomarkers derived from the transcriptional profiles of CACCC-box-regulated genes, especially those under the control of KLF family members, may offer improved sensitivity and specificity compared to existing biomarkers, enabling more accurate diagnosis and prognosis of CVDs. By drawing on the molecular mechanisms underlying CVDs, clinicians may in the future tailor therapeutic interventions to the specific pathways or gene networks dysregulated in each patient, thereby optimizing treatment efficacy and minimizing adverse effects.