Preprint
Review

This version is not peer-reviewed.

Unannotated Genes in Genomics: Challenges, Opportunities, and AI Solutions

Submitted:

09 April 2026

Posted:

13 April 2026


Abstract
The rapid proliferation of next-generation sequencing (NGS) technologies has generated an unprecedented volume of genomic data, yet a substantial fraction of these sequenced genomes remains functionally uncharacterized, a phenomenon collectively termed “genomic dark matter.” Unannotated genes, including hypothetical proteins (HPs), orphan and de novo genes, small open reading frames (smORFs), and non-canonical ORFs (ncORFs), constitute 40–60% of bacterial genomes, approximately 30–35% of the human proteome, and up to 43% of metagenomic protein clusters. These uncharacterized sequences represent a critical bottleneck in translating genomic data into biological insight and biotechnological innovation. This review provides a comprehensive examination of the categories of unannotated genes, the systemic challenges that perpetuate the annotation gap, and the diverse biotechnological opportunities these sequences harbor across plant, animal, microbial, medical, and industrial domains. Critically, we evaluate the transformative role of artificial intelligence (AI) in bridging this gap, encompassing protein structure prediction tools such as AlphaFold2 and ESMFold, protein and genome language models including ESM2 and DNABERT-2, deep learning-based functional inference frameworks, and high-throughput experimental validation platforms such as CRISPR perturbomics and transposon-insertion sequencing (TIS). We argue that an integrative, AI-driven approach to functional genomics is not merely advantageous but essential for realizing the full potential of the genomic revolution.

1. Introduction

The sequencing of the first bacterial genome, Haemophilus influenzae, in 1995 marked the dawn of the genomics era [1]. Since then, the development of next-generation sequencing (NGS) technologies has driven an exponential increase in genomic data, reducing the cost of sequencing a megabase of DNA to fractions of a dollar [2]. As of 2025, the Universal Protein Knowledgebase (UniProt) contains over 200 million protein sequences, and more than two million assembled bacterial genomes are publicly available [2,3]. Despite this remarkable progress, the functional characterization of these genomes has not kept pace with sequence generation. The result is a widening gulf between what is known at the sequence level and what is understood at the functional level, a gap that represents one of the most pressing challenges in contemporary biology [4].
This uncharacterized genetic space is often described metaphorically as “genomic dark matter,” drawing an analogy to the dark matter of the universe: it is abundant, pervasive, and its properties remain largely unknown [3]. Even in Escherichia coli K-12, one of the most comprehensively studied organisms in the history of microbiology, approximately 35% of genes remain completely or partially uncharacterized more than two decades after its genome was first sequenced [3]. In bacterial genomes more broadly, it is estimated that 40% to 60% of genes in any given genome have no assigned function [4]. In large-scale metagenomic surveys of environmental samples, the proportion of uncharacterized sequences is even higher, with up to 43% of high-quality protein clusters identified from diverse biomes classified as unknowns [5].
The uncharacterized genome is not a monolithic entity. It encompasses several distinct categories of genetic elements, each with unique biological origins and annotation challenges. These include hypothetical proteins (HPs), sequences predicted from open reading frames (ORFs) but lacking experimental evidence of expression; orphan and de novo genes that are lineage-specific and lack detectable homologs in other species; small open reading frames (smORFs) that encode peptides shorter than 100 amino acids; and non-canonical ORFs (ncORFs) translated from regions previously annotated as non-coding, such as long non-coding RNAs (lncRNAs) and untranslated regions (UTRs) [4,6,7,8].
The persistence of unannotated genes carries significant consequences for all branches of genome biotechnology. In plant and agricultural genomics, unannotated genes in the “dispensable” portion of crop pangenomes may harbor critical alleles for stress tolerance and yield improvement [9]. In medical genomics, the dark proteome is increasingly recognized as a source of novel cancer biomarkers and therapeutic targets [10]. In microbial and industrial biotechnology, the vast reservoir of uncharacterized enzymes in environmental metagenomes represents an untapped source of novel biocatalysts [5]. Addressing the annotation gap is therefore not merely an academic exercise but a prerequisite for translating genomic data into real-world applications.
Historically, the primary tool for genome annotation has been homology-based inference, which assigns function to a new sequence based on its similarity to experimentally characterized sequences in reference databases. While powerful for conserved genes, this approach fails when sequence divergence is high, as is the case for orphan genes, highly divergent HPs, and novel protein families [6]. Moreover, the propagation of annotation errors through databases has created a “circular bias” that compounds over time. Today, the field is undergoing a paradigm shift driven by artificial intelligence (AI). Breakthroughs in deep learning, protein structure prediction, and large language models are providing transformative new tools for functional discovery [11,12].
This review aims to provide a comprehensive, critical, and forward-looking synthesis of the current state of knowledge on unannotated genes. We systematically examine the major categories of the dark genome, the multifaceted challenges in their characterization, the biotechnological opportunities they present, and the AI-driven solutions that are reshaping functional genomics. By integrating perspectives from plant, animal, microbial, medical, and industrial biotechnology, we seek to establish a unified framework for understanding and exploiting the genomic dark matter.

2. The Landscape of Unannotated Genes: Categories and Origins

The genomic dark matter is not a uniform entity (Figure 1). It comprises several distinct classes of unannotated sequences, each arising through different evolutionary mechanisms and presenting unique challenges for functional discovery. Understanding these categories is a prerequisite for developing appropriate annotation strategies.

2.1. Hypothetical Proteins (HPs)

Hypothetical proteins represent the largest and most heterogeneous category of unannotated genes. They are defined as sequences predicted to be protein-coding based on the presence of an ORF, but for which there is no experimental evidence of translation, expression, or in vivo function [4]. In many bacterial species, HPs constitute a significant proportion of the total coding sequences. A comprehensive analysis of genome annotation coverage across the bacterial tree of life found that only 52% to 79% of the average bacterial proteome could be functionally annotated using protein and domain-based homology searches, leaving a substantial fraction as HPs [13].
HPs are particularly prevalent in newly sequenced genomes, in organisms with limited experimental tractability, and in environmental metagenomes. A landmark study that clustered 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes into 2,940,257 high-quality clusters found that 43% of all clusters were classified as unknowns [5]. Intriguingly, these unknown families tend to exhibit narrower taxonomic and ecological distributions compared with known families, suggesting that they may encode functions important for niche adaptation rather than core metabolism. Furthermore, unknown protein families are conserved in archaeal groups, suggesting their importance in the emergence and diversification of these ancient lineages [7].
The primary challenge with HPs is the absence of detectable sequence homology to any experimentally characterized protein. Approximately one-third of all bacterial proteins are not similar enough to any characterized protein to allow for reliable homology-based functional prediction [6]. This means that standard annotation tools such as BLAST, which rely on sequence similarity, are fundamentally incapable of assigning functions to this substantial fraction of the proteome.

2.2. Orphan and De Novo Genes

Orphan genes are defined as genes that lack recognizable homologs outside a given taxonomic unit [14]. They represent a fascinating evolutionary paradox: if genes arise through duplication and divergence of existing genes, how can species-specific genes with no detectable homologs exist? The resolution lies in the process of de novo gene origination, whereby entirely new protein-coding genes emerge from ancestrally non-coding DNA sequences [15]. This mechanism has been documented across the tree of life, from bacteria and yeast to Drosophila, plants, and humans [14,15].
De novo genes are thought to originate through a stepwise process. Initially, a non-coding genomic region acquires a transcription start site and becomes transcribed as a long non-coding RNA (lncRNA). Subsequently, the lncRNA acquires an ORF that can be translated, producing a proto-protein. Over evolutionary time, selection can act on this proto-protein if it provides a fitness benefit, leading to the fixation of a new gene [15]. In humans, de novo genes can originate from neutral lncRNA loci and are evolutionarily significant, with some acquiring roles in the central nervous system [15].
In plants, population genetic evidence increasingly supports the functional importance of de novo genes in environmental adaptation and stress responses [16]. These lineage-specific genes may encode novel functions that allow species to adapt to their specific ecological niches, and their absence from other genomes means they are systematically missed by standard annotation pipelines that rely on cross-species homology [17]. Orphan genes are now recognized as drivers of evolutionary innovation, contributing to species-specific adaptations through diverse functions [14].

2.3. Small Open Reading Frames (smORFs) and Micropeptides

For decades, genome annotation algorithms imposed a minimum length threshold, typically 100 amino acids, to distinguish genuine protein-coding ORFs from random, spurious ORFs that arise by chance in any long DNA sequence. This practical heuristic, while reducing false positives, systematically excluded a vast class of functional sequences: small open reading frames (smORFs) that encode peptides shorter than 100 amino acids [8].
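The effect of such a length cutoff can be illustrated with a minimal sketch. The function `find_orfs`, the toy sequence, and the relaxed cutoff of 10 codons are all hypothetical choices made for illustration; real gene finders use statistical models and scan both strands, which this forward-strand example does not.

```python
# Toy illustration of how a minimum-length cutoff hides smORFs.
# find_orfs and the example sequence are hypothetical, not taken from any annotation tool.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_aa=100):
    """Return (start, end, length_in_aa) for forward-strand ORFs with at least min_aa codons.

    An ORF is taken to be ATG ... stop within one reading frame.
    The reported length counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3        # codons before the stop
                if n_aa >= min_aa:
                    orfs.append((start, i + 3, n_aa))
                start = None                   # close the candidate
    return orfs

# A 30-codon smORF (ATG + 29 GCT codons + stop) flanked by filler bases.
smorf = "ATG" + "GCT" * 29 + "TAA"
sequence = "CCC" + smorf + "CCC"

print(find_orfs(sequence, min_aa=100))  # [] : the smORF is discarded
print(find_orfs(sequence, min_aa=10))   # [(3, 96, 30)] : the smORF is reported
```

Under the conventional 100-codon threshold the 30-codon ORF is never reported, so it cannot enter any downstream annotation step; lowering the threshold recovers it, at the cost of admitting many chance ORFs in real genomes.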
Recent advances in ribosome profiling (Ribo-seq) and proteogenomics have demonstrated that thousands of smORFs are actively translated in diverse organisms, including humans, Drosophila, and plants. These micropeptides are now recognized as crucial regulators of diverse biological processes. For example, the DWORF micropeptide in mammals regulates the sarcoplasmic reticulum calcium pump SERCA and plays a role in muscle function. Proteogenomic discovery of novel small ORFs in human and mouse tissues has revealed hundreds of previously unannotated functional peptides. The functional and translational profiles of smORFs suggest an evolutionary process that produces new peptides from non-coding sequences through a mechanism analogous to de novo gene origination [17].

2.4. Non-Canonical ORFs (ncORFs)

Non-canonical ORFs represent a category of unannotated sequences that are translated from genomic regions previously annotated as non-coding. These include ORFs embedded within lncRNAs, upstream ORFs (uORFs) in the 5’ UTRs of mRNAs, downstream ORFs (dORFs) in 3’ UTRs, and ORFs in antisense strands [18]. The discovery of widespread translation from these regions has fundamentally challenged the classical distinction between “coding” and “non-coding” genomes.
Non-canonical ORFs are particularly relevant in the context of medical genomics. A growing body of evidence implicates ncORF-encoded micropeptides in cancer biology, where they can regulate tumor progression, immune evasion, and cellular metabolism [18,23]. The identification of non-annotated ORFs has led to the construction of databases of novel unannotated ORFs (nuORFs), which are being actively mined for novel biomarkers and drug targets for cancer diagnosis and therapy [18]. This emerging field represents a paradigm shift in our understanding of the functional genome and highlights the importance of moving beyond canonical annotation frameworks.

3. Challenges in Genome Annotation

The persistence of the genomic dark matter is not merely a consequence of insufficient data; it reflects deep-seated methodological, systemic, and biological challenges that have proven difficult to overcome.

3.1. The Fundamental Limitations of Homology-Based Inference

The dominant paradigm in automated genome annotation is homology-based inference: the assumption that sequence similarity implies functional similarity. This approach, implemented in tools such as BLAST, HMMER, and their derivatives, has been enormously successful for annotating conserved core metabolic genes that are shared across broad phylogenetic ranges [3]. However, it has an inherent and fundamental limitation: it can only annotate what is already known. For sequences that are highly divergent, lineage-specific, or represent genuinely novel protein families, homology-based methods provide no information whatsoever [3].
This limitation is not trivial. As discussed in Section 2, approximately one-third of bacterial proteins lack sufficient similarity to any characterized protein for homology-based prediction [6]. For orphan genes, by definition, no homologs exist in other species, making homology-based annotation entirely inapplicable [14]. For smORFs and ncORFs, the short length of the encoded peptides means that even genuine homologs may not achieve statistical significance in sequence similarity searches [17]. The result is a systematic and unavoidable blind spot in conventional annotation pipelines.
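The short-peptide problem can be made concrete with the standard BLAST relation between a normalized bit score S′ and the expected number of chance hits, E = m · n · 2^(−S′), where m is the query length and n the database length in residues. The sketch below is illustrative only: the database size and the figure of one bit per aligned residue for a diverged homolog are hypothetical assumptions, not measured values.

```python
import math

def evalue(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def bits_needed(query_len, db_len, e_threshold=1e-3):
    """Bit score a hit must reach for its E-value to equal e_threshold."""
    return math.log2(query_len * db_len / e_threshold)

DB_RESIDUES = 1e11        # hypothetical database size, in residues
BITS_PER_RESIDUE = 1.0    # hypothetical average for a diverged but genuine homolog

for length in (15, 30, 100, 300):
    attainable = BITS_PER_RESIDUE * length       # score of a full-length alignment
    required = bits_needed(length, DB_RESIDUES)  # score needed for E <= 1e-3
    print(length, round(required, 1), attainable, attainable >= required)
```

Under these assumptions, full-length alignments of 15 or 30 residues cannot accumulate enough score to pass the threshold, whereas the same per-residue similarity over 100 or 300 residues passes comfortably.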

3.2. Annotation Error Propagation and Database Contamination

A particularly insidious challenge in genome annotation is the propagation of errors through public databases. The process of annotation transfer, whereby the function of a newly sequenced gene is inferred from a homologous sequence in a reference database, creates a pathway for errors to cascade. If the reference sequence is itself misannotated, the error is perpetuated and amplified across all subsequently annotated genomes [19,20].
Schnoes et al. demonstrated that misannotation is a pervasive problem in public databases, particularly for complex enzyme superfamilies where different members catalyze different reactions despite sharing sequence similarity [19]. Their analysis found that a substantial proportion of sequences in the non-redundant (NR) database were misannotated, and that these errors were propagated to new genomes through annotation transfer [19]. A subsequent analysis by Goudey et al. confirmed that propagation is a major source of annotation error and proposed computational methods for its detection and correction [20]. This “circular bias” means that bioinformatics tools trained on existing database annotations may learn and perpetuate errors, rather than providing independent validation.
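A toy model makes the cascade explicit. In the sketch below, every identifier, label, and nearest-neighbor relationship is hypothetical; it is not the method of the cited studies. It shows only how one erroneous reference label spreads when each new sequence copies the label of its closest already-annotated relative.

```python
def transfer_annotations(reference, queries, nearest):
    """Annotate queries in order by copying the label of their nearest annotated sequence.

    reference: dict of sequence ID -> label already in the database
    queries:   list of new sequence IDs, in order of submission
    nearest:   dict of query ID -> ID of its most similar annotated sequence
    """
    database = dict(reference)
    for query in queries:
        database[query] = database[nearest[query]]  # label is inherited, errors included
    return database

# Hypothetical ground truth and a database with a single misannotated entry (refB).
truth = {"refA": "kinase", "refB": "racemase",
         "g1": "racemase", "g2": "racemase", "g3": "kinase"}
reference = {"refA": "kinase", "refB": "dehydratase"}
nearest = {"g1": "refB", "g2": "g1", "g3": "refA"}

database = transfer_annotations(reference, ["g1", "g2", "g3"], nearest)
wrong = sorted(seq for seq, label in database.items() if label != truth[seq])
print(wrong)  # ['g1', 'g2', 'refB']
```

One incorrect seed yields three incorrect entries, and g2 inherits the error from g1 without ever being compared to the original reference, which is what makes such errors hard to trace.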

3.3. Experimental Intractability and the “Streetlight Effect”

Even when a gene is correctly identified as being of unknown function, its experimental characterization presents formidable practical challenges. Traditional biochemical approaches to protein characterization, including protein purification, structural determination by X-ray crystallography, and functional assays, are labor-intensive, time-consuming, and often require prior knowledge of the protein’s likely function to design appropriate assays [4]. Many HPs may encode membrane proteins, intrinsically disordered proteins, or proteins that only function in the context of specific cellular conditions or protein complexes, making them particularly difficult to study in isolation.
Furthermore, the allocation of research resources is profoundly skewed toward a small subset of well-characterized proteins. This phenomenon, sometimes called the “streetlight effect,” reflects the tendency of researchers and funding agencies to focus on familiar, well-studied targets where the probability of success is higher. The consequence is a self-reinforcing cycle in which well-known proteins are studied ever more intensively, while the vast majority of the proteome remains in the dark. Initiatives such as the Structural Genomics Consortium, COMBREX, and the Understudied Protein Initiative have been established specifically to counteract this bias, but the challenge remains substantial [4].

3.4. Biological Complexity: Multifunctionality and Context-Dependence

A further challenge is the inherent biological complexity of gene function. Many proteins are multifunctional, performing different roles in different cellular contexts, developmental stages, or environmental conditions [4]. A protein may have a primary biochemical function that is well-characterized, but its regulatory, interaction, or disease-relevant roles may be entirely unknown. This contextual incompleteness means that even “annotated” genes may harbor significant unknown functional dimensions [4].
For orphan and de novo genes, the challenge is compounded by the absence of any evolutionary context. Because these genes have no homologs, there are no comparative genomics clues to guide functional inference. Their functions may be entirely novel, with no precedent in the existing biochemical literature [14]. Similarly, for ncORFs, the functional relevance of translation may be context-specific, occurring only under particular stress conditions or in specific cell types, making systematic characterization particularly challenging [18].

4. Biotechnological Opportunities in the Dark Genome

Despite the formidable challenges in their characterization, unannotated genes represent an enormous reservoir of biological novelty with profound implications across the full spectrum of genome biotechnology (Table 1). The following sections explore the specific opportunities presented by the dark genome in key application areas.

4.1. Plant Genetics, Biotechnology, and Agricultural Applications

In plant genomics, the functional characterization of unknown genes is a major frontier with direct implications for food security and sustainable agriculture. Wang et al. have systematically classified unknown/uncharacterized (U/U) genes in plants into two types: those with conserved structural domains but unknown functions, and those with no recognizable domains at all. Both types have been shown to play important roles in plant growth, development, and stress resistance [8].
Numerous examples demonstrate the agricultural value of characterizing these genes. Unknown members of the MYB transcription factor family, such as MbMYBC1 and MbMYB108 from Malus baccata, have been shown to significantly enhance cold and drought resistance when overexpressed in transgenic Arabidopsis [21]. Similarly, unknown NAC family genes such as CaNAC46 from pepper and SlNAC10 from Suaeda liaotungensis confer salt and drought resistance [8]. These discoveries underscore the principle that unknown genes with conserved domains can harbor novel functional variants with significant biotechnological potential.
Beyond individual gene characterization, the construction of crop pangenomes has revealed that the “dispensable” genome, the portion of the genome present in some but not all individuals of a species, contains a disproportionately high fraction of unannotated, lineage-specific genes [22]. These genes are thought to encode functions important for local adaptation to specific environmental conditions, including soil type, temperature extremes, and pathogen pressure. Mining the dispensable genome of major crops such as rice, maize, sorghum, and wheat for novel stress tolerance genes is therefore a high-priority research direction for climate-smart agriculture [22].
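The core/dispensable partition can be expressed as simple set operations on a gene family presence/absence table. The sketch below assumes a hypothetical three-accession pangenome with invented family identifiers and an invented set of annotated families; it illustrates the bookkeeping only and says nothing about real crop data.

```python
def partition_pangenome(presence):
    """Split gene families into core (present in all accessions) and dispensable.

    presence: dict of accession name -> set of gene family IDs found in it
    """
    family_sets = list(presence.values())
    pan = set().union(*family_sets)          # every family seen anywhere
    core = set.intersection(*family_sets)    # families shared by all accessions
    return core, pan - core

def unannotated_fraction(families, annotated):
    """Fraction of the given families that lack a functional annotation."""
    return len(families - annotated) / len(families)

# Hypothetical presence/absence data and annotation status.
presence = {
    "accession1": {"F1", "F2", "F3", "F4"},
    "accession2": {"F1", "F2", "F5"},
    "accession3": {"F1", "F2", "F3", "F6"},
}
annotated = {"F1", "F2", "F3"}

core, dispensable = partition_pangenome(presence)
print(sorted(core), unannotated_fraction(core, annotated))
print(sorted(dispensable), unannotated_fraction(dispensable, annotated))
```

In this toy table the unannotated families fall entirely in the dispensable set, which is the pattern described above; in practice the same comparison would be run on orthogroup tables from a pangenome build.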

4.2. Animal Genetics and Biotechnology

In animal genomics, unannotated genes have significant implications for understanding livestock adaptation, disease resistance, and production traits. The genomes of indigenous livestock breeds, such as the recently assembled Guyuan cattle genome, contain a substantial proportion of unannotated sequences that may encode breed-specific adaptations to local environments [14]. Characterizing these genes could provide novel genetic resources for breeding programs aimed at improving animal welfare and productivity under climate change.
In the context of animal biotechnology, the identification of lineage-specific genes in model organisms such as Drosophila has provided fundamental insights into the mechanisms of evolutionary innovation [14]. Orphan genes in insects, for example, have been shown to play roles in behavioral adaptations and immunity, demonstrating that novel genes can rapidly acquire essential functions [14]. These findings suggest that the dark genome of livestock and other production animals may harbor similarly important, unexplored functional diversity.

4.3. Medical Biotechnology and Genomic Medicine

The medical implications of the dark proteome are perhaps the most immediately compelling. Proteins of unknown function (PUFs) are increasingly recognized as playing crucial roles in disease biology, particularly in cancer and infectious diseases [9]. The non-canonical proteome, comprising micropeptides and ncORF-encoded proteins, is emerging as a novel contributor to cancer biology, with implications for tumor progression, immune evasion, and therapeutic resistance [24].
Genome-wide CRISPR screens have proven to be a powerful tool for identifying unannotated genes with disease relevance. A recent study employing an in vivo genome-wide CRISPR screen in syngeneic colorectal cancer models identified C9orf50, a gene previously lacking any functional annotation, as a critical dependency for tumor growth [24]. This discovery exemplifies the potential of high-throughput functional genomics to illuminate the dark proteome in medically relevant contexts.
Furthermore, the Human Proteome Project (HPP) has identified a category termed “uPE1” proteins: proteins of unknown function for which experimental evidence of expression exists but whose molecular roles remain uncharacterized. These proteins represent high-priority targets for functional characterization, as their expression in human tissues suggests biological relevance. Plasma proteomics studies have demonstrated that some of these uncharacterized proteins correlate with disease states, suggesting their potential as biomarkers for conditions such as Parkinson’s disease [9].

4.4. Microbial, Industrial, and Environmental Biotechnology

The microbial dark matter, the vast reservoir of uncharacterized genes in environmental metagenomes, represents an extraordinary opportunity for industrial and environmental biotechnology. The recovery of thousands of novel metagenome-assembled genomes (MAGs) from extreme environments such as deep-sea hydrothermal vents, acidic hot springs, and permafrost soils has revealed entirely new branches of the tree of life, each carrying a complement of novel, uncharacterized genes [7].
These uncharacterized genes are a promising source of novel biocatalysts for industrial applications. Extremophilic enzymes, proteins that function optimally under extreme temperatures, pH, or salinity, are highly sought after for industrial processes such as precision fermentation, biofuel production, and the synthesis of fine chemicals [25]. By characterizing the unknown protein clusters from extreme environment metagenomes, researchers can potentially discover enzymes with properties superior to those currently in industrial use.
For environmental biotechnology, the characterization of unknown genes in environmental microbiomes is critical for developing next-generation bioremediation strategies. Novel degradative enzymes capable of breaking down recalcitrant pollutants such as per- and polyfluoroalkyl substances (PFAS) and microplastics may exist within the uncharacterized fraction of environmental metagenomes [25]. Identifying and engineering these enzymes could provide sustainable solutions to some of the most pressing environmental contamination challenges of our time.

5. AI Solutions for Functional Annotation: A Multi-Tiered Framework

The scale and complexity of the genome annotation challenge necessitate computational solutions that can operate at the scale of millions of sequences while capturing the nuanced relationships between sequence, structure, and function, as illustrated in Figure 2. Artificial intelligence, in its various forms, is rapidly emerging as the primary engine for functional discovery in the dark genome.

5.1. Tier 1: Sequence-Level AI Using Genome and Protein Language Models

The first tier of the AI annotation pipeline operates directly at the level of raw sequences, leveraging the power of transformer-based language models. Protein language models (PLMs) are trained on massive datasets of unaligned protein sequences using a self-supervised learning objective, typically masked language modeling, where the model learns to predict masked amino acids based on their context [30]. By training on hundreds of millions of sequences, these models learn the “grammar” of protein evolution, the statistical patterns of amino acid co-occurrence that reflect structural and functional constraints [30].
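The masked language modeling objective can be sketched in a few lines. The example below is a simplified illustration, not the training code of any published model: the 15% masking rate, the `#` mask symbol, the example sequence, and the uniform stand-in "model" are all assumptions chosen for clarity. A real PLM would replace `uniform_model` with a transformer that uses the unmasked context.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues; return the masked string and the hidden targets."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(mask_rate * len(seq)))
    positions = rng.sample(range(len(seq)), n_mask)
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]   # the residue the model must recover
        masked[pos] = MASK
    return "".join(masked), targets

def mlm_loss(predict, masked, targets):
    """Mean negative log-likelihood of the true residues at the masked positions."""
    total = 0.0
    for pos, true_aa in targets.items():
        probs = predict(masked, pos)          # dict: amino acid -> probability
        total += -math.log(probs[true_aa])
    return total / len(targets)

def uniform_model(masked, pos):
    """Stand-in model that ignores context and guesses every amino acid equally."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, targets = mask_sequence(sequence)
loss = mlm_loss(uniform_model, masked, targets)
print(masked)
print(round(loss, 3), round(math.log(20), 3))
```

The context-free baseline scores ln 20 per masked residue; training lowers this loss only to the extent that the model learns which residues are compatible with their sequence context, which is where the structural and functional signal enters.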
The most prominent PLM is ESM2, developed by Meta AI, which was trained on 250 million protein sequences and has demonstrated remarkable ability to capture structural and functional information in its learned representations [30]. ESMFold, which builds upon ESM2, can predict protein structures at the scale of hundreds of millions of metagenomic sequences, as demonstrated by the ESM Metagenomic Atlas [30]. For unannotated proteins, PLM-derived embeddings can serve as rich, information-dense representations that encode implicit functional information, even in the absence of any homology to known proteins.
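One simple way such embeddings are used is nearest-neighbor annotation transfer in embedding space. The sketch below assumes that per-protein embedding vectors have already been computed by some PLM; the three-dimensional vectors, protein names, labels, and the 0.8 similarity cutoff are hypothetical placeholders (real embeddings have hundreds to thousands of dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_annotated(query_vec, annotated, min_sim=0.8):
    """Return (label, similarity) of the closest annotated protein, or (None, similarity)
    when even the best match falls below min_sim."""
    best_label, best_sim = None, -1.0
    for vec, label in annotated.values():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < min_sim:
        return None, best_sim
    return best_label, best_sim

# Hypothetical embeddings of two characterized proteins.
annotated = {
    "P1": ([0.9, 0.1, 0.0], "hydrolase"),
    "P2": ([0.0, 0.2, 0.9], "transporter"),
}

print(nearest_annotated([0.8, 0.2, 0.1], annotated))    # close to P1
print(nearest_annotated([0.5, -0.8, 0.3], annotated))   # no confident match
```

Returning no label when the best similarity is low matters here: a forced assignment for a genuinely novel protein would recreate the over-annotation problem that homology transfer already suffers from.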
At the genome level, DNA foundation models such as DNABERT-2 and the Nucleotide Transformer are trained on raw nucleotide sequences from diverse species [31,32]. The Nucleotide Transformer, with models ranging from 50 million to 2.5 billion parameters, has demonstrated strong performance across a wide range of genomic tasks, including the identification of regulatory elements, splice sites, and non-coding functional regions [31]. These models are particularly relevant for annotating smORFs and ncORFs, where the genomic context of the sequence provides crucial information about its likely function [31].

5.2. Tier 2: Structure-Level AI for Protein Structure Prediction

The second tier leverages the principle that protein structure is more conserved than sequence across evolutionary time. Two proteins may have diverged beyond the point of detectable sequence similarity while retaining a similar three-dimensional fold and, consequently, a similar function. Structure-based annotation can therefore reveal functional homologies that are invisible to sequence-based methods [28].
The landmark development in this area was AlphaFold2, published by DeepMind in 2021 [12]. AlphaFold2 uses a deep learning architecture, incorporating multiple sequence alignments and pairwise residue relationships, to predict the 3D coordinates of all heavy atoms in a protein with near-experimental accuracy [12]. The AlphaFold Protein Structure Database, which provides open access to over 200 million predicted structures, has effectively democratized structural biology and placed structural data for virtually the entire known protein universe in the hands of researchers [27]. Critically, a substantial proportion of these structures correspond to hypothetical proteins, providing a structural foundation for functional inference.
Building on AlphaFold2, the 3DFI pipeline automates the process of structural comparison to infer the functionality of hypothetical proteins by searching predicted structures against databases of proteins with known structures and functions [29]. The CATH structural classification database enables searches based on protein fold, allowing the identification of structural homologs even when sequence similarity is negligible [29]. Furthermore, AlphaFold-Multimer extends structure prediction to protein complexes, enabling the prediction of protein-protein interactions for uncharacterized proteins, which can provide powerful clues about their cellular roles [29].
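Structural comparison of this kind ultimately reduces to a similarity score between a predicted structure and a reference structure. The sketch below computes one widely used measure, the TM-score, from the distances between aligned residue pairs after superposition. It is offered as an assumption-laden illustration: the source does not state which metric any particular pipeline uses, and the alignment and superposition steps, which are the hard part, are taken as given.

```python
def tm_d0(target_len):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score.

    The 0.5 floor for very short chains is a common practical convention, assumed here.
    """
    if target_len <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_len - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, target_len):
    """TM-score from per-pair distances (angstroms) of aligned residues.

    distances:  one distance per aligned residue pair, after superposition
    target_len: length of the reference protein used for normalization
    """
    d0 = tm_d0(target_len)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_len

# Hypothetical cases for a 100-residue reference.
print(tm_score([0.0] * 100, 100))   # perfect superposition of every residue
print(tm_score([0.0] * 50, 100))    # only half of the reference is aligned
print(tm_score([8.0] * 100, 100))   # everything aligned, but poorly
```

Because the score is normalized by the reference length and each pair is down-weighted by its distance, both partial coverage and poor superposition reduce it; values above roughly 0.5 are commonly read as indicating the same overall fold.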

5.3. Tier 3: Functional Inference AI Using Deep Learning and Large Language Models

The third tier of the pipeline directly targets the prediction of protein function, typically expressed in terms of Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, or pathway memberships. Deep learning models for protein function prediction have advanced rapidly, moving from simple sequence-based classifiers to sophisticated architectures that integrate sequence, structure, and evolutionary information.
DPFunc is a recent deep learning framework that integrates domain-guided structural information for accurate protein function prediction [33]. By explicitly incorporating domain boundaries and structural features derived from AlphaFold2 predictions, DPFunc achieves state-of-the-art performance on standard benchmarks [33]. The Two-model Adaptive Weight Fusion Network (TAWFN) takes a complementary approach, combining convolutional neural networks (CNNs) for local sequence feature extraction with graph convolutional networks (GCNs) for capturing global structural relationships [34].
Large language models (LLMs) are also being applied to the annotation problem in novel ways. GeneWhisperer is an LLM-based agent designed to assist manual genome annotation by synthesizing statistical data and biomedical literature to generate functional hypotheses for genes of unknown function [35]. Gene-LLMs, a class of transformer-based models specifically designed for genomic applications, have been developed for tasks including regulatory element annotation, variant calling, and motif discovery [35]. These models represent a new paradigm in which AI systems can reason about biological function in a manner analogous to an expert biologist, rather than simply pattern-matching against known sequences.

5.4. Tier 4: Experimental Validation Using High-Throughput Functional Genomics

While AI models generate powerful and testable hypotheses, experimental validation remains the gold standard for functional annotation. The fourth tier of the pipeline integrates high-throughput experimental platforms that can validate AI predictions at genome scale.
Transposon-Insertion Sequencing (TIS), encompassing methods such as TraDIS, Tn-seq, INSeq, and HITS, is a powerful approach for linking genotype to phenotype in bacteria [6]. By generating highly saturated transposon mutant libraries and subjecting them to selective conditions, TIS can simultaneously assess the fitness contribution of every gene in a bacterial genome, including those with no prior annotation [6]. This approach has been used to reveal predicted functions for hundreds of genes previously representing genomic dark matter, providing empirical evidence for AI-generated functional hypotheses [6].
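The core computation behind a TIS fitness estimate can be sketched as a per-gene log ratio of insertion read counts before and after selection. The function name, the pseudocount, and the read counts below are hypothetical; published TIS pipelines add normalization for insertion density, replicate handling, and statistical testing that this minimal example omits.

```python
import math

def gene_fitness(input_counts, selected_counts, pseudocount=1.0):
    """Per-gene log2 ratio of library-normalized insertion reads (selected / input).

    Strongly negative values mean that mutants with insertions in the gene were
    depleted under selection, i.e. the gene contributes to fitness in that condition.
    """
    input_total = sum(input_counts.values())
    selected_total = sum(selected_counts.values())
    fitness = {}
    for gene, n_input in input_counts.items():
        n_selected = selected_counts.get(gene, 0)
        freq_input = (n_input + pseudocount) / input_total
        freq_selected = (n_selected + pseudocount) / selected_total
        fitness[gene] = math.log2(freq_selected / freq_input)
    return fitness

# Hypothetical read counts summed over all insertions within each gene.
input_counts = {"geneA": 1000, "hypX": 1000, "geneB": 1000}
selected_counts = {"geneA": 1200, "hypX": 20, "geneB": 1180}

fitness = gene_fitness(input_counts, selected_counts)
for gene, value in sorted(fitness.items(), key=lambda item: item[1]):
    print(gene, round(value, 2))
```

Here the unannotated gene `hypX` is the only one whose mutants are lost under selection, which links it to the selective condition without requiring any prior annotation.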
In eukaryotic systems, CRISPR-Cas9 genome-wide screens, sometimes termed “perturbomics,” provide an analogous capability [36]. By systematically knocking out every gene in a cell line and measuring the effect on a phenotype of interest (e.g., cell viability, drug sensitivity, or immune evasion), CRISPR screens can identify the functional roles of unannotated genes in complex biological contexts [24,36]. The combination of AI-driven functional prediction with CRISPR-based experimental validation creates a powerful, iterative cycle of discovery that is accelerating the characterization of the dark proteome. The major artificial intelligence frameworks and experimental validation approaches discussed across the four tiers are summarized in Table 2.
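For pooled CRISPR screens, a common first-pass analysis aggregates the log fold changes of several guide RNAs targeting the same gene. The sketch below uses the median across guides as the gene-level score; the gene names, guide counts, and pseudocount are hypothetical, and dedicated screen-analysis tools apply more rigorous statistics than this illustration.

```python
import math
from collections import defaultdict
from statistics import median

def gene_scores(guides, pseudocount=1.0):
    """Gene-level score as the median log2 fold change of its guides.

    guides: list of (gene, count_before, count_after) tuples, one per guide RNA.
    Counts are normalized by the total reads of each time point.
    """
    total_before = sum(before for _, before, _ in guides)
    total_after = sum(after for _, _, after in guides)
    per_gene = defaultdict(list)
    for gene, before, after in guides:
        freq_before = (before + pseudocount) / total_before
        freq_after = (after + pseudocount) / total_after
        per_gene[gene].append(math.log2(freq_after / freq_before))
    return {gene: median(values) for gene, values in per_gene.items()}

# Hypothetical screen: three guides per gene; one orfU guide is ineffective.
guides = [
    ("ctrl", 500, 520), ("ctrl", 480, 470), ("ctrl", 510, 530),
    ("orfU", 500, 60), ("orfU", 490, 45), ("orfU", 505, 400),
]

scores = gene_scores(guides)
print({gene: round(score, 2) for gene, score in scores.items()})
```

Taking the median rather than the mean keeps a single ineffective guide, such as the third `orfU` guide here, from masking a consistent depletion signal in the others.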

6. Integrative Strategies and Future Perspectives

The most effective approaches to illuminating the dark genome are not those that rely on a single AI tool or experimental method, but rather those that integrate multiple complementary strategies in a coherent, iterative pipeline. As illustrated in Figure 2, the path from an unannotated sequence to a validated functional annotation involves a progression from sequence-level inference to structure-level analysis, functional prediction, and ultimately experimental confirmation.
Several key principles should guide the development of integrative annotation strategies. First, the use of complementary tools is essential, as different methods capture different aspects of protein function and have different failure modes [10]. A protein that yields no signal in a sequence-based search may reveal functional homologs through structural comparison, and a protein that lacks structural homologs may be characterized through guilt-by-association in a co-expression network [3]. Second, the integration of multi-omics data—including transcriptomics, proteomics, metabolomics, and phenomics—provides crucial biological context that can constrain functional hypotheses and prioritize candidates for experimental validation [3].
Third, community efforts and data sharing are critical for accelerating progress. Initiatives such as the Structural Genomics Consortium, the Human Proteome Project, and the MorPhiC Consortium are building the shared experimental and computational infrastructure needed to systematically characterize the dark proteome [3]. The development of standardized benchmarks for evaluating AI annotation tools, and the creation of curated databases of experimentally validated functions for previously unknown proteins, will be essential for driving the field forward.
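The guilt-by-association idea mentioned under the first principle can be sketched in a few lines: an unannotated gene inherits a candidate function from the annotated genes whose expression profiles correlate most strongly with its own. The example below is a minimal illustration under stated assumptions; the expression values, gene names, labels, and the thresholds `top_k` and `min_r` are all hypothetical, and real co-expression analyses involve normalization, many more samples, and statistical testing.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def guilt_by_association(query_profile, annotated, top_k=3, min_r=0.8):
    """Suggest a label by majority vote among the most correlated annotated genes."""
    ranked = sorted(
        (
            (pearson(query_profile, profile), gene, label)
            for gene, (profile, label) in annotated.items()
        ),
        reverse=True,
    )
    votes = {}
    for r, _gene, label in ranked[:top_k]:
        # Only strongly co-expressed neighbours are allowed to vote.
        if r >= min_r:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None  # no confident hypothesis
    return max(votes, key=votes.get)


# Hypothetical expression profiles across five conditions (illustrative only).
annotated = {
    "geneA": ([1, 2, 3, 4, 5], "heat shock response"),
    "geneB": ([2, 4, 6, 8, 11], "heat shock response"),
    "geneC": ([5, 4, 3, 2, 1], "ribosome biogenesis"),
    "geneD": ([3, 3, 3, 3, 3], "cell wall synthesis"),
}
hypothesis = guilt_by_association([1.1, 2.0, 2.9, 4.2, 5.1], annotated)
```

The returned label is a hypothesis to be prioritized for experimental validation, not an annotation in its own right; a query with no strongly correlated neighbours yields `None`.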
Looking ahead, several emerging technologies promise to further accelerate the characterization of unannotated genes. The continued development of protein language models with larger training datasets and more sophisticated architectures will improve the accuracy of functional inference for highly divergent sequences [30]. The integration of AlphaFold2 predictions with molecular dynamics simulations and cryo-electron microscopy data will provide increasingly detailed mechanistic insights into the functions of structurally characterized HPs [12]. And the development of AI systems capable of reasoning about biological function in a more holistic, systems-level manner, integrating knowledge from genomics, biochemistry, cell biology, and evolutionary biology, will ultimately be required to fully illuminate the dark genome.

7. Conclusions

The genomic dark matter, comprising hypothetical proteins, orphan and de novo genes, small ORFs, and non-canonical ORFs, represents one of the most significant remaining frontiers in biology and biotechnology. Constituting 40–60% of bacterial genomes, approximately 30–35% of the human proteome, and up to 43% of metagenomic protein clusters, these unannotated sequences are not genomic junk but a vast, largely untapped reservoir of biological novelty. Their characterization is essential for realizing the full potential of genomic data across all domains of genome biotechnology, from engineering climate-resilient crops and discovering novel industrial biocatalysts to identifying new therapeutic targets for cancer and infectious diseases.
The challenges in characterizing these sequences are real and multifaceted, encompassing the fundamental limitations of homology-based annotation, the propagation of errors through public databases, experimental intractability, and the inherent biological complexity of gene function. However, the field is undergoing a transformative shift driven by artificial intelligence. The convergence of protein structure prediction (AlphaFold2, ESMFold), protein and genome language models (ESM2, DNABERT-2, Nucleotide Transformer), deep learning-based functional inference (DPFunc, TAWFN), and LLM-assisted annotation (GeneWhisperer) is providing an unprecedented toolkit for functional discovery at scale. When integrated with high-throughput experimental validation platforms such as CRISPR perturbomics and transposon-insertion sequencing, these AI-driven approaches create a powerful, iterative cycle of hypothesis generation and experimental confirmation.
The era of AI-driven functional genomics is not a distant prospect; it is already underway. As these tools mature and are applied systematically to the dark genome, the proportion of unannotated sequences will diminish, and the functional landscape of life will come into ever-sharper focus. The implications for genome biotechnology, spanning plant improvement, animal breeding, medical genomics, industrial biotechnology, and environmental sustainability, are profound and far-reaching.

Supplementary Materials

The following supporting information can be downloaded at: Preprints.org.

Author Contributions

Conceptualization, A.F.; methodology, A.F. and A.R.; software, A.F.; validation, A.F. and A.R.; formal analysis, A.F.; investigation, A.F.; resources, A.R.; data curation, A.F., E.H., and A.R.; writing – original draft preparation, A.F.; writing – review and editing, A.F., E.H., and A.R.; visualization, A.F. and A.R.; supervision, A.F.; project administration, A.F.; funding acquisition, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI Artificial Intelligence
AMR Antimicrobial Resistance
CNN Convolutional Neural Network
dORF Downstream Open Reading Frame
EC Enzyme Commission
GCN Graph Convolutional Network
GO Gene Ontology
HITS High-throughput Insertion Tracking by Deep Sequencing
HP Hypothetical Protein
HPP Human Proteome Project
INSeq Insertion Sequencing
LLM Large Language Model
lncRNA Long Non-coding RNA
MAG Metagenome-Assembled Genome
ML Machine Learning
ncORF Non-canonical Open Reading Frame
NGS Next-Generation Sequencing
NR Non-Redundant
nuORF Novel Unannotated Open Reading Frame
ORF Open Reading Frame
PFAS Per- and Polyfluoroalkyl Substances
PLM Protein Language Model
PUF Protein of Unknown Function
Ribo-seq Ribosome Profiling
smORF Small Open Reading Frame
TAWFN Two-model Adaptive Weight Fusion Network
TIS Transposon-Insertion Sequencing
Tn-seq Transposon Sequencing
TraDIS Transposon-Directed Insertion-Site Sequencing
uORF Upstream Open Reading Frame
UTR Untranslated Region
U/U Unknown/Uncharacterized

References

  1. Fleischmann, R. D.; et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5223), 496-512. [CrossRef]
  2. UniProt Consortium. (2025). UniProt: The Universal Protein Knowledgebase in 2025. Nucleic Acids Research, 53(D1), D609-D617. https://academic.oup.com/nar/article/53/D1/D609/7902999.
  3. Nolan, L. M., Webber, M. A., & Filloux, A. (2025). Throwing a spotlight on genomic dark matter: The power and potential of transposon-insertion sequencing. Journal of Biological Chemistry, 301(6), 110231. [CrossRef]
  4. Moitra, T., & Larrouy-Maumus, G. (2026). Integrated approaches for discovery and functional annotation of proteins of unknown function. Trends in Biochemical Sciences, 51(1), 80-92. [CrossRef]
  5. Vanni, C.; et al. (2022). Unifying the known and unknown microbial coding sequence space. eLife, 11, e67667. [CrossRef]
  6. Rocha, J. J.; et al. (2023). Functional unknomics: Systematic screening of conserved genes of unknown function. PLoS Biology, 21(8), e3002222. [CrossRef]
  7. Rodríguez del Río, Á.; et al. (2024). Functional and evolutionary significance of unknown genes. Nature, 626, 104-111. [CrossRef]
  8. Wang, X., Wang, B., & Yuan, F. (2023). Deciphering the roles of unknown/uncharacterized genes in plant development and stress responses. Frontiers in Plant Science, 14, 1276559. [CrossRef]
  9. Ge, A., Chan, C., & Yang, X. (2024). Exploring the dark matter of human proteome: The emerging role of non-canonical open reading frame (ncORF) in cancer diagnosis, biology, and therapy. Cancers, 16(15), 2660. [CrossRef]
  10. Vincent, A. T. (2024). Bacterial hypothetical proteins may be of functional interest. Frontiers in Bacteriology, 3, 1334712. [CrossRef]
  11. Jumper, J.; et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583-589. [CrossRef]
  12. Zhang, Y.; et al. (2025). Predicting functions of uncharacterized gene products from microbial communities. Nature Biotechnology. [CrossRef]
  13. Lobb, B.; et al. (2020). An assessment of genome annotation coverage across the bacterial tree of life. Microbial Genomics, 6(5), mgen000341. [CrossRef]
  14. Casola, C. (2025). De Novo Genes: Current Status and Future Goals. Genome Biology and Evolution, 17(12), evaf230. [CrossRef]
  15. Grandchamp, A.; et al. (2025). De Novo Gene Emergence: Summary, Classification, and Challenges of Current Methods. Genome Biology and Evolution, 17(11), evaf197. [CrossRef]
  16. Luo, M.; et al. (2025). Rethinking de novo genes in plants: Mechanisms, methodological progress, and future prospects. Frontiers in Plant Science, 16, 1724832. [CrossRef]
  17. Baena-Angulo, C.; et al. (2025). Cis to trans: Small ORF functions emerging through evolution. Trends in Genetics, 41(2), 119-131. [CrossRef]
  18. Ruiz-Orera, J.; et al. (2025). The non-canonical proteome: A novel contributor to cancer biology. Nature Reviews Cancer. https://pmc.ncbi.nlm.nih.gov/articles/PMC11909265/.
  19. Schnoes, A. M.; et al. (2009). Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Computational Biology, 5(12), e1000605. [CrossRef]
  20. Goudey, B.; et al. (2022). Propagation, detection and correction of errors using the sequence database network. Briefings in Bioinformatics, 23(6), bbac416. [CrossRef]
  21. Yao, C.; et al. (2022). Overexpression of a Malus baccata MYB Transcription Factor Gene MbMYB4 Increases Cold and Drought Tolerance in Arabidopsis thaliana. International Journal of Molecular Sciences, 23(3), 1794. [CrossRef]
  22. Kumar, B.; et al. (2023). Orphan crops: A genetic treasure trove for hunting stress tolerance genes. Food and Energy Security, 12(1), e436. [CrossRef]
  23. Fierro-Monti, I. (2025). Tiny proteins, great impacts: Non canonical ORFs in cancer. Academia Molecular Biology and Genomics, 2(2). [CrossRef]
  24. Park, B. S.; et al. (2025). Perturbomics: CRISPR–Cas screening-based functional genomics approach for drug target discovery. Experimental & Molecular Medicine, 57, 1-12. [CrossRef]
  25. Saggu, S. K., Kumar, M., & Kumar, S. (2026). Metagenomics and its impact on environmental and therapeutic microbiology. Archives of Microbiology, 208, 1-18. [CrossRef]
  26. Rennie, M. L., & Oliver, M. R. (2025). Emerging frontiers in protein structure prediction following the AlphaFold revolution. Journal of the Royal Society Interface, 22(225), 20240886. [CrossRef]
  27. Varadi, M.; et al. (2022). AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1), D439-D444. [CrossRef]
  28. Sousounis, K.; et al. (2012). Protein function and structure: A systems biology perspective. Briefings in Bioinformatics, 13(5), 527-538.
  29. Julian, T.; et al. (2021). 3DFI: A pipeline for structural-based functional annotation of proteins. Bioinformatics, 37(18), 3028-3030.
  30. Lin, Z.; et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130. [CrossRef]
  31. Dalla-Torre, H.; et al. (2024). Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. Nature Methods, 21, 1-10. [CrossRef]
  32. Zhou, Z.; et al. (2024). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. arXiv preprint, arXiv:2306.15006. https://arxiv.org/abs/2306.15006.
  33. Wang, W.; et al. (2025). DPFunc: Accurately predicting protein function via deep learning with domain-guided structure information. Nature Communications, 16, 1-15. [CrossRef]
  34. Meng, L.; et al. (2024). TAWFN: A deep learning framework for protein function prediction. Bioinformatics, 40(10), btae571. [CrossRef]
  35. Balakrishnan, P.; et al. (2025). Gene-LLMs: A comprehensive survey of transformer-based genomic language models for regulatory and clinical genomics. Frontiers in Genetics, 16, 1634882. [CrossRef]
  36. Przybyla, L., & Gilbert, L. A. (2022). A new era in functional genomics screens. Nature Reviews Genetics, 23(2), 89-103. [CrossRef]
Figure 1. The landscape of unannotated genes across sequenced genomes. The diagram illustrates the major categories of the genomic dark matter including hypothetical proteins (HPs), orphan/de novo genes, small ORFs (smORFs), and non-canonical ORFs (ncORFs), and summarizes the key challenges in their annotation and the biotechnological opportunities they represent. Data sources: Nolan et al. [3], Moitra & Larrouy-Maumus [4], Vanni et al. [5].
Figure 2. The multi-tiered AI-driven pipeline for the functional annotation of unannotated genes. The pipeline integrates four tiers: (1) sequence-level AI using genome and protein language models; (2) structure-level AI using deep learning structure predictors; (3) functional inference AI using deep learning predictors and LLM-assisted annotation; and (4) experimental validation using CRISPR screens and transposon-insertion sequencing. The output is a putative functional annotation that can be updated in databases and confirmed experimentally.
Table 1. Summary of biotechnological opportunities presented by unannotated genes across key application domains.
Application Domain | Opportunity | Key Example | Reference
Plant & Agricultural Biotechnology | Novel stress tolerance genes in dispensable genome | MbMYBC1 enhancing drought resistance | [8,21]
Medical Biotechnology | ncORFs as cancer biomarkers and drug targets | C9orf50 in colorectal cancer | [9,24]
Microbial Biotechnology | Novel biocatalysts from environmental metagenomes | Extremophilic enzymes for precision fermentation | [7,25]
Industrial Biotechnology | Undiscovered metabolic pathways for bioproduction | Novel biosynthetic gene clusters | [7]
Environmental Biotechnology | Novel degradative enzymes for bioremediation | PFAS and microplastic degradation | [25]
Animal Genomics | Lineage-specific genes in livestock adaptation | Selective sweeps in cattle populations | [14]
Table 2. Summary of key AI tools and experimental methods for the annotation of unannotated genes, their categories, capabilities, and applications.
AI Tool/Method | Category | Key Capability | Application to Dark Genome | Reference
AlphaFold2 | Structure Prediction | Atomic-level 3D structure from sequence | Structural annotation of HPs | [12]
ESMFold / ESM2 | Protein Language Model | Structure & function from sequence | Metagenomic protein annotation | [30]
DNABERT-2 | Genome Language Model | Regulatory element & ORF annotation | smORF and ncORF discovery | [32]
Nucleotide Transformer | Genome Language Model | Multi-species genomic task performance | Cross-species annotation | [31]
DPFunc | Deep Learning Predictor | GO term prediction with domain guidance | HP function inference | [33]
TAWFN | Deep Learning Predictor | CNN + GCN fusion for function prediction | Multi-scale functional annotation | [34]
GeneWhisperer | LLM-Assisted Annotation | Literature-guided gene curation | Automated HP annotation | [35]
TraDIS / Tn-seq (TIS) | Experimental Validation | Genome-wide fitness profiling in bacteria | Bacterial HP characterization | [6]
CRISPR Perturbomics | Experimental Validation | Genome-wide knockout screens | Eukaryotic dark proteome discovery | [36]
3DFI Pipeline | Structure-Function Mapping | Automated structural comparison | HP function inference via CATH | [29]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.