Introduction
Since the emergence of SARS-CoV-2 in late 2019, the virus has undergone a remarkable evolutionary journey, reshaping the global pandemic landscape through a relentless series of antigenic shifts, immune escape events, and transmission optimizations. What began as a relatively uniform pathogen has evolved into a complex constellation of variants, each bearing distinct mutational signatures that reflect the interplay between viral fitness, host immunity, and population-level selective pressures. At the heart of this evolution lays the spike (S) glycoprotein - a multifunctional molecular machine that mediates viral entry, determines cell tropism, and serves as the primary target of neutralizing antibodies and vaccine-induced immune responses.
The spike protein’s structural and functional complexity makes it both a vulnerability for the virus and a focal point for immune evasion. Comprising two functional subunits - S1, which contains the receptor-binding domain (RBD) and N-terminal domain (NTD), and S2, responsible for membrane fusion, the spike is subject to intense selective pressure. Mutations in the RBD, such as N501Y, E484K, and L452R, have repeatedly enhanced binding affinity to the human ACE2 receptor or enabled escape from monoclonal antibodies. Meanwhile, changes in the NTD, including deletions like H69-V70del, have been linked to increased infectivity and conformational masking of epitopes. Even the more conserved S2 domain is no longer immune to change; mutations such as N969K and Q954H are now emerging as stabilizers of the post-fusion state, potentially enhancing viral entry efficiency and resistance to fusion inhibitors.
The evolutionary trajectory of the spike protein has followed a clear pattern: convergent evolution driven by mutation clusters, not isolated substitutions. From the early dominance of D614G to the rise of Omicron and its subvariants, we have observed the repeated co-emergence of specific mutation pairs and constellations such as S:T478K + S:D614G or S:F456L + S:N969K across genetically distinct lineages. This recurrence suggests strong epistatic interactions and functional synergy, where one mutation compensates for the fitness cost of another, or where multiple changes collectively enhance immune evasion, fusogenicity, or receptor binding.
Despite advances in genomic surveillance, most monitoring systems remain reactive rather than predictive. We typically detect a new variant only after it has already gained a foothold in the population, leaving public health responses playing ‘catch-up to the new variant’ scenario. This delay limits the effectiveness of countermeasures, from vaccine updates to monoclonal antibody deployment. What is needed is a proactive framework the one that not only describes current trends but anticipates future evolutionary pathways by modeling mutation co-occurrence, transition probabilities, and selective constraints.
In this study, we present such a framework: a closed-loop model of SARS-CoV-2 spike evolution (2020 - 2025) that integrates six complementary analytical approaches into a unified narrative of viral adaptation. We analyzed over 7.4 million high-quality genomes using a tiered statistical strategy:
Jaccard Index to identify high-co-occurrence mutation pairs across 6.8 million sequences
Chi-Square and Mutual Information to detect non-random associations and evolutionary hubs in a focused dataset of 158,000 genomes
Cramér’s V and Fisher’s Exact Test to assess statistical significance of mutation pairs in 79,000 recombinant and emerging lineage sequences
Lineage-specific analysis (Q1 2025) to map observed mutations in dominant variants like KP.2, LB.1, and FL.1.5.1
Drug-resistant mutation profiling to link mutational profiles with therapeutic escape
Markov chain forecasting to simulate future variants and validate predictions against real-world data
By combining these methods, we move beyond descriptive genomics toward predictive virology. Our model successfully anticipated the rise of key mutations - including S:L452W, S:F456L, and S:N969K which are now confirmed in globally circulating recombinant lineages. This convergence of prediction and observation validates a new paradigm that is SARS-CoV-2 evolution is not chaotic it is constrained, directional, and increasingly predictable.
This work demonstrates the power of integrating real-world data with computational modeling to create a feedback loop between prediction and validation, transforming genomic surveillance from a retrospective tool into a forward-looking early warning system. As we face ongoing threats from respiratory pathogens with pandemic potential, such frameworks will be essential for staying ahead of the next wave not just for SARS-CoV-2, but for future pathogens as well including h5n1.
Methods
We conducted a comprehensive analysis of SARS-CoV-2 spike protein evolution using a multi-tiered statistical framework applied to a global dataset of 7,402,176 high-quality genome sequences. This large-scale dataset, curated from public repositories including GISAID and NCBI, enabled robust detection of mutation co-occurrence patterns, lineage-specific signatures, and evolutionary trajectories across the pandemic timeline.
To ensure data integrity, raw sequencing reads were first processed using the GVATLAS Quality Control pipeline, which includes adapter trimming, host read removal, and filtering for minimum coverage (≥10x) and consensus confidence (≥90% base call frequency). Only genomes with full-length spike gene coverage (positions 21563 - 25384 relative to Wuhan-Hu-1, NC_045512.2) were retained for downstream analysis.
Mutation calling was performed using the GVATLAS Mutation Annotation Engine, which aligns sequences to the reference genome, identifies nucleotide substitutions, and translates them into amino acid changes. All mutations were annotated in standard HGVS nomenclature (e.g., S:D614G) and cross-referenced with functional domains (NTD, RBD, S1/S2, S2).
Three complementary statistical approaches were applied to assess mutation interactions at different scales:
Jaccard Similarity Index was calculated across 6.8 million genomes to quantify pairwise mutation co-occurrence. This metric, defined as the intersection over union of mutation pairs, identifies mutations that frequently appear together in the same genetic background, suggesting potential functional synergy.
Chi-Square tests and Mutual Information (MI) analysis were performed on a subset of 158,000 genomes to evaluate the strength of association between mutations. MI, a measure of non-linear dependence, was used to detect higher-order dependencies beyond simple pairwise associations. Hub mutations were identified as those with the highest degree of connectivity in the co-occurrence network.
Cramér’s V and Fisher’s Exact Test were applied to a focused dataset of 79,000 high-confidence recombinant and emerging lineage sequences to assess statistical significance of mutation pairs while controlling for multiple testing (Benjamini-Hochberg correction, FDR < 0.01). Cramér’s V, a normalized measure of association strength (0-1), was used to rank mutation pairs by biological relevance.
To contextualize these findings in real-world evolution, we performed a lineage-specific mutation analysis for Q1 2025, focusing on recombinant variants such as KP.2, LB.1, and FL.1.5.1. This involved extracting lineage-defining mutations, mapping them to structural domains, and comparing observed frequencies against historical trends.
Additionally, we developed a Markov chain forecasting model to simulate future variant emergence. Transition probabilities between mutation states were derived from longitudinal sequence data, and synthetic genomes were generated using a first-order Markov process. The model was validated by comparing predicted mutation clusters with those observed in Q1 2025 lineages.
All analyses were conducted using custom Python and R pipelines developed and maintained at GVATLAS. Data processing, alignment, and visualization were performed using open-source tools integrated into the GVATLAS computational framework, including Nextclade, MAFFT, Biopython, and Snakemake.
To ensure transparency and reproducibility, all datasets, scripts, and intermediate results have been deposited in open-access repositories (Zenodo and Figshare) under CC-BY 4.0 licenses. Full details, including DOIs and code availability, are provided in the Data Availability section.
This integrative approach - combining population-scale statistics, lineage dynamics, and predictive modeling - enables a closed-loop framework (SLF) for proactive genomic surveillance, moving beyond descriptive analysis to anticipatory risk assessment.
Key findings include:
Persistence of dominant clusters: The pair S:D614G + S:T478K remains among the top co-occurring pairs across all five periods, suggesting its role in maintaining viral fitness or immune evasion.
Emergence of new combinations: In Period 3, a cluster involving S:D614G, S:H693E, and S:N679K rises in frequency, coinciding with the emergence of recombinant lineages such as XBB and JN.1.
Decline of older variants: Pairs like S:D614G + S:N501Y decrease in prevalence in later periods, consistent with antigenic drift and lineage turnover.
High Jaccard Index in Period 3: The vertical color bar in Period 3 shows that some pairs exhibit strong non-random linkage (Jaccard Index > 0.5), indicating epistatic interactions that may stabilize the spike structure or enhance receptor binding.
This analysis confirms that mutation co-occurrence is not random, but rather follows predictable evolutionary trajectories shaped by selective pressures. The consistency of these patterns across independent datasets underscores their biological relevance and supports the use of multi-period co-occurrence analysis for proactive genomic surveillance.
We also describe here 3 three high-confidence co occurring mutations such as:
S:A292S + S:T604A (Jaccard Index = 0.014)
S:S735L + S:Y1047H (Jaccard Index = 0.011)
S:E154K + S:Q1071H (Jaccard Index = 0.006)
These mutation pairs, validated across statistical tests, display consistent co-occurrence across diverse geographic and lineage contexts. Predictive modeling linked these clusters to variant emergence patterns observed in Q1 2025, suggesting selective advantages associated with epistatic effects or adaptive immune evasion.
Despite low Jaccard values, consistent co-occurrence across millions of genomes suggests structural or immune-driven linkage.
Additionally, lineage-specific analysis in early 2025 reveals the real-world manifestation of predicted mutational signatures:
Table 1.
Lineage-specific mutational signatures in early 2025 and their biological implications.
Table 1.
Lineage-specific mutational signatures in early 2025 and their biological implications.
| SN |
Finding |
Scientific Implication |
| 1 |
S:T20N in HF.1.1 lineage |
Novel NTD mutation may destabilize NTD or affect antibody binding |
| 2 |
S:L452W" in KP.2 and LB.1 |
Confirms emergence of a predicted high-fitness variant |
| 3 |
S:K356T in XBB/JN.1 |
Located in NTD supersite strong immune escape candidate |
| 4 |
S:H69-V70del in clusters |
Known deletion linked to immune evasion and increased infectivity |
| 5 |
spike_aa_cooccurring_mutations.csv |
Direct evidence of which clusters are currently circulating |
This table summarizes key mutations observed in dominant SARS-CoV-2 lineages during the first quarter of 2025, highlighting their functional relevance and validation of prior predictions. From novel substitutions to well-characterized deletions, these genetic changes reflect ongoing viral adaptation under immune pressure.
The real-world emergence of predicted mutational signatures highlights the importance that the accuracy and utility of our multi-metric genomic surveillance framework. The identification of S:T20N in the HF.1.1 lineage reveals a novel mutation in the N-terminal domain (NTD), potentially altering local protein structure or interfering with antibody recognition. Meanwhile, the widespread presence of S:L452W in recombinant lineages KP.2 and LB.1 confirms its role as a high-fitness variant, previously anticipated by our Markov forecasting model. This mutation, often co-occurring with S:F456L, is linked to enhanced fusogenicity and escape from monoclonal antibodies. Similarly, S:K356T, located within the NTD antigenic supersite of XBB and JN.1 sublineages, emerges as a strong candidate for immune evasion, reinforcing the importance of monitoring NTD evolution. The recurrent observation of the S:H69V70 deletion across multiple clusters reaffirms its established role in increasing infectivity and evading humoral immunity. Finally, the dataset spike_aa_cooccurring_mutations.csv provides direct empirical evidence of circulating mutation clusters, offering a crucial link between prediction and surveillance. Together, these findings validate the predictive power of our integrative model and emphasize the need to shift from single-mutation tracking to cluster-based monitoring for effective public health response.
These observations validate the predictive power of our multi-metric framework. The convergence of statistically identified clusters with real-time surveillance data establishes a closed-loop system: from prediction to observation to validation.
Figure 2.
Top Mutation Clusters in the SARS-CoV-2 Spike Protein Identified by Mutual Information Analysis (n = 158,342 Genomes).
Figure 2.
Top Mutation Clusters in the SARS-CoV-2 Spike Protein Identified by Mutual Information Analysis (n = 158,342 Genomes).
Bar plot displays the top 10 co-occurring mutation pairs ranked by Mutual Information (MI), a measure of non-random linkage strength. The highest MI values were observed for S:L455S + S:N450D (MI = 0.37) and S:N679K + S:N969K (MI = 0.37), indicating strong epistatic interactions likely driven by structural constraints or immune selection pressures. These clusters were identified through large-scale analysis of globally distributed viral genomes using GVAtlas pipelines, as part of a multi-metric framework that integrates Jaccard Index, Chi-Square, and Cramér’s V analyses. The consistency of these high-MI pairs across independent statistical methods underscores their biological relevance and potential role in viral fitness, antigenic drift, and lineage emergence.
This bar chart presents the top 10 mutation clusters in the SARS-CoV-2 spike protein based on Mutual Information (MI) analysis of 158,342 high-quality genomes collected globally between 2020 and 2025. MI quantifies the amount of information shared between two mutations, making it an ideal metric for detecting non-random co-occurrence even when individual mutations are rare. The two most significant clusters S:L455S + S:N450D and S:N679K + S:N969K exhibit MI values of 0.37, which is exceptionally high given the typical range of 0 - 0.4 in genomic data. This suggests strong functional or structural coupling, potentially involving stabilization of the spike conformation or modulation of receptor binding affinity.
Notably, S:N679K + S:N969K is located near the furin cleavage site and RBD interface, regions known to be under intense selective pressure. The presence of both mutations together may enhance cleavage efficiency or evade neutralizing antibodies. Similarly, S:L455S + S:N450D lies within the NTD supersite, a major target for monoclonal antibodies, suggesting this cluster may contribute to immune escape. These findings are consistent with our broader multi-metric analysis, where these same pairs were also detected as statistically significant by Chi-Square testing and Jaccard Index, reinforcing their robustness.
The convergence of multiple analytical approaches including Jaccard similarity (6.8M genomes), Chi-Square (158k genomes), and Cramér’s V (79k genomes) confirms that these clusters are not artifacts of sampling bias but reflect real evolutionary dynamics. Furthermore, the persistence of such clusters in Q1 2025 surveillance data (e.g., in XBB and JN.1 lineages) validates their predictive power and supports their inclusion in proactive genomic surveillance frameworks.
This analysis demonstrates that mutation clusters, not isolated substitutions, are the fundamental units of SARS-CoV-2 evolution and that Mutual Information provides a powerful tool for identifying them.
Figure 3.
Mutation Pair Co-occurrence Across Functional Domains of the SARS-CoV-2 Spike Protein.
Figure 3.
Mutation Pair Co-occurrence Across Functional Domains of the SARS-CoV-2 Spike Protein.
Heatmap displays the total number of co-occurring mutation pairs between spike protein domains, derived from Chi-Square analysis of 158,342 globally distributed genomes. Each cell represents the sum of all mutation pairs where one mutation lies in the row domain and the other in the column domain. The diagonal (e.g., RBD - RBD = 15,019,165) shows intradomain co-occurrence, while off-diagonal cells (e.g., NTD - RBD = 4,590,568) reveal interdomain linkages. These patterns reflect non-random evolutionary pressures, with strong interactions observed between NTD - RBD and RBD - S2, suggesting coordinated adaptation in antigenic sites and receptor binding regions.
This heatmap presents the total co-occurrence count of mutation pairs across functional domains of the SARS-CoV-2 spike protein, based on Chi-Square analysis of 158,342 high-quality genomes collected globally between 2020 and 2025. Each cell contains the summed count of all mutation pairs where one mutation resides in the row domain and the other in the column domain. For example, the RBD - RBD cell (15,019,165) indicates that over 15 million genomes harbor two mutations within the Receptor Binding Domain, reflecting intense selective pressure in this region. Similarly, the NTD - RBD interaction (4,590,568) suggests coordinated evolution between the N-terminal domain and receptor binding site, potentially influencing immune evasion or receptor affinity. The strongest interdomain linkage is observed between NTD and RBD, followed by RBD and S2, indicating that these regions co-evolve under shared selective forces. This pattern aligns with known antigenic supersites and furin cleavage motifs, reinforcing the biological relevance of these clusters. Importantly, the magnitude of these counts ranging from thousands to tens of millions -\ confirms that these are not random fluctuations but robust, reproducible signals of non-random co-occurrence.
Figure 4.
Network of Top 50 Mutation Clusters in the SARS-CoV-2 Spike Protein (Chi-Square Significance | n = 158,342 Genomes).
Figure 4.
Network of Top 50 Mutation Clusters in the SARS-CoV-2 Spike Protein (Chi-Square Significance | n = 158,342 Genomes).
Nodes represent individual spike protein mutations; edges represent statistically significant co-occurrence (p < 0.05) derived from Chi-Square analysis. Edge width is proportional to co-occurrence count (A_and_B), and node size reflects connectivity, highlighting hub mutations such as S:D614G, S:T478K, and S:P521T. This modular architecture reveals functional units under selective pressure, including clusters associated with immune escape (e.g., S:A1015S + S:L24S) and enhanced transmissibility. The central position of S:N501Y underscores its role in a high-confidence epistatic network linking NTD, RBD, and S2 domains.
We applied Chi-Square testing to 158,342 globally distributed SARS-CoV-2 genomes to identify non-random mutation co-occurrence patterns. A network of the top 50 most significant pairs (p < 0.05) reveals a modular architecture, with hub mutations such as S:D614G, S:T478K, and S:N501Y forming central nodes. These hubs are interconnected through multiple edges, suggesting epistatic interactions that may stabilize the spike conformation or enhance receptor binding. Notably, S:A1015S appears as a key node in a cluster involving S:L24S, S:N450D, and S:N679K a pattern consistent with our multi-metric analysis and validated in Q1 2025 surveillance data. The presence of S:P521T as a central connector highlights its potential role in coordinating evolution across structural regions. This network supports the hypothesis that SARS-CoV-2 evolves through functional modules, not isolated substitutions, and provides a framework for predicting future variant emergence.
Figure 5.
Volcano Plot of the Top 10 Most Significant Mutation Pairs in the SARS-CoV-2 Spike Protein (Chi-Square Test | n = 158,342 Genomes).
Figure 5.
Volcano Plot of the Top 10 Most Significant Mutation Pairs in the SARS-CoV-2 Spike Protein (Chi-Square Test | n = 158,342 Genomes).
Each point represents a mutation pair, with the x-axis showing co-occurrence count (A_and_B) and the y-axis showing - log₁₀(p-value). The most significant pair is S:G252V + S:P521T ( - log₁₀(p) ≈ 31.1), indicating near-zero probability of random co-occurrence. These clusters suggest potential functional or structural interactions, possibly enhancing viral fitness, immune escape, or receptor binding. All pairs shown are statistically significant (p < 0.05).
We applied a Chi-Square test of independence to 158,342 globally distributed SARS-CoV-2 genomes to identify non-random mutation co-occurrence patterns. The most significant pair was S:G252V + S:P521T (p ≈ 0.0), indicating a near-zero probability of random linkage, suggesting strong epistatic interaction or functional synergy. This pair lies within the NTD supersite, a major target for neutralizing antibodies, implying its role in immune evasion. Other highly significant pairs include S:K558N + S:V445I and S:L368I + S:P521T, which may stabilize the spike conformation or enhance receptor binding affinity. Notably, S:Q498R + S:W152R has a high co-occurrence count (>4,000), suggesting it is both common and under selective pressure. These findings are consistent with our multi-metric analysis (Jaccard, MI, Cramér’s V) and are validated in Q1 2025 surveillance data, confirming the predictive power of this framework.
Figure 6.
Top 20 Mutation Clusters in the SARS-CoV-2 Spike Protein Ranked by Cramér’s V (Fisher’s & Chi-Square Test | n = 79,176 Genomes).
Figure 6.
Top 20 Mutation Clusters in the SARS-CoV-2 Spike Protein Ranked by Cramér’s V (Fisher’s & Chi-Square Test | n = 79,176 Genomes).
Heatmap displays the strength of association between mutation pairs, measured by Cramér’s V, a robust metric for categorical association. The top pair is S:Y145H + S:A222V (Cramér’s V = 0.64), indicating a near-maximal level of non-random linkage. This cluster, located in the NTD supersite, may stabilize antigenic structure or modulate immune recognition. All pairs shown were also significant under Fisher’s Exact Test (p < 0.05), reinforcing their biological relevance. This analysis reveals rare but critical epistatic interactions missed by frequency-based methods.
We applied Cramér’s V and Fisher’s Exact Test to 79,176 SARS-CoV-2 genomes to identify mutation pairs with strong statistical association, particularly among rare variants. Cramér’s V, a measure of association strength independent of sample size, revealed 20 high-confidence clusters, with the strongest being S:Y145H + S:A222V (Cramér’s V = 0.64). This value approaches the theoretical maximum (1.0), indicating near-perfect co-occurrence and suggesting strong functional or structural coupling. Both mutations lie in the NTD supersite, a major target for neutralizing antibodies, implying a role in immune evasion. Similarly, S:T747K + S:H1101Y (Cramér’s V = 0.62) is located in the S2 subunit, near the HR1 domain, suggesting a role in stabilizing the post-fusion conformation. These findings are consistent with our Chi-Square and Jaccard analyses and are validated in Q1 2025 surveillance data, confirming that mutation clusters, not isolated changes, are the functional units of viral evolution.
We constructed a network graph of the top 50 mutation pairs identified by Cramér’s V analysis (n = 79,176 genomes). Cramér’s V measures the strength of association between mutations, independent of frequency, making it ideal for detecting rare but significant linkages. The most significant pair was S:N969K + S:Q954H (Cramér’s V = 0.702), located in the S2 subunit near the HR1 domain, suggesting a role in stabilizing the post-fusion conformation. This pair forms a central hub in the network, along with S:S373P + S:S375F, a known immune escape module in the NTD supersite. The modular structure of the network, with dense clusters in S2 and NTD, supports the hypothesis that SARS-CoV-2 evolves through functional units rather than isolated mutations. These findings are consistent across multiple statistical frameworks and are validated in Q1 2025 surveillance data, confirming their biological relevance.
Figure 7.
Network of Top 50 Mutation Clusters Ranked by Cramér’s V (Fisher’s & Chi-Square Test | n = 79,176 Genomes).
Figure 7.
Network of Top 50 Mutation Clusters Ranked by Cramér’s V (Fisher’s & Chi-Square Test | n = 79,176 Genomes).
Nodes represent individual spike protein mutations; edges represent statistically significant co-occurrence (Cramér’s V > 0.4). Edge thickness is proportional to Cramér’s V value, with the strongest association observed for S:N969K + S:Q954H (Cramér’s V = 0.702). Node size reflects connectivity, highlighting hub mutations such as S:Y145H and S:A222V. Node color indicates functional domain (NTD, RBD, S2, etc.). This network reveals high-confidence epistatic modules, particularly in the S2 subunit, suggesting coordinated evolution under structural or immune pressure.
We applied Chi-Square testing to 158,342 globally distributed SARS-CoV-2 genomes to identify non-random mutation co-occurrence patterns at unprecedented scale. A network of statistically significant pairs (p < 0.05) was constructed to identify hub mutations those with the highest number of co-occurring partners, indicating coordinated evolution under selective pressure. The top hubs include well-known lineage-defining mutations such as S:D614G, S:T478K, and S:N501Y, as well as emerging variants like S:L452W and S:R158G. This analysis reveals that SARS-CoV-2 evolves through functional modules, not isolated substitutions, with certain mutations acting as central nodes in epistatic networks.
Figure 8.
Network of Top 15 Hub Mutations in the SARS-CoV-2 Spike Protein (Chi-Square Analysis | n = 158,342 Genomes).
Figure 8.
Network of Top 15 Hub Mutations in the SARS-CoV-2 Spike Protein (Chi-Square Analysis | n = 158,342 Genomes).
Nodes represent individual mutations; edges represent statistically significant co-occurrence (p < 0.05). Edge thickness is proportional to co-occurrence count, with the strongest linkage observed for S:D614G + S:T478K (n = 4,370,242). Node size reflects connectivity, highlighting hub mutations such as S:D614G, S:T478K, and S:N501Y. Node color indicates functional domain (NTD, RBD, S2, etc.). This modular architecture reveals high-confidence functional modules under coordinated evolutionary pressure, including clusters linked to immune escape and enhanced transmissibility.
We applied Chi-Square testing to 158,342 globally distributed SARS-CoV-2 genomes to identify non-random mutation co-occurrence patterns at unprecedented scale. A network of statistically significant pairs (p < 0.05) was constructed to identify hub mutations those with the highest number of co-occurring partners, indicating coordinated evolution under selective pressure. The top hubs include well-known lineage-defining mutations such as S:D614G, S:T478K, and S:N501Y, as well as emerging variants like S:L452W and S:R158G. This analysis reveals that SARS-CoV-2 evolves through functional modules, not isolated substitutions, with certain mutations acting as central nodes in epistatic networks.
Table 2.
Top 15 Hub Mutations in the SARS-CoV-2 Spike Protein (n = 158,342 Genomes).
Table 2.
Top 15 Hub Mutations in the SARS-CoV-2 Spike Protein (n = 158,342 Genomes).
| # |
Mutation |
Count |
Associated Mutations |
Frequency |
Domain |
| 1 |
S:D614G |
57 |
S:T478K, S:N501Y, S:P681H |
144704 |
SD |
| 2 |
S:N501Y |
53 |
S:D614G, S:P681H, S:H655Y |
121544 |
RBD |
| 3 |
S:T478K |
51 |
S:D614G, S:S477N, S:N969K |
144704 |
RBD |
| 4 |
S:P681H |
51 |
S:D614G, S:N501Y, S:H655Y |
121454 |
SD |
| 5 |
S:H655Y |
49 |
S:D614G, S:N501Y, S:N969K |
114829 |
SD |
| 6 |
S:S477N |
49 |
S:D614G, S:N969K, S:H655Y |
114205 |
RBD |
| 7 |
S:N969K |
49 |
S:D614G, S:H655Y, S:N679K |
114155 |
S2 |
| 8 |
S:N679K |
49 |
S:N969K, S:D614G, S:H655Y |
114132 |
SD |
| 9 |
S:Q498R |
49 |
S:N501Y, S:N969K, S:D614G |
114099 |
RBD |
| 10 |
S:Y505H |
49 |
S:N501Y, S:Q498R, S:N969K |
113795 |
RBD |
| 11 |
S:D796Y |
48 |
S:D614G, S:N969K, S:H655Y |
113971 |
S2 |
| 12 |
S:Q954H |
48 |
S:N969K, S:D614G, S:H655Y |
113024 |
S2 |
| 13 |
S:S373P |
48 |
S:N969K, S:S375F, S:D614G |
112693 |
RBD |
| 14 |
S:S375F |
48 |
S:N969K, S:S373P, S:D614G |
112688 |
RBD |
| 15 |
S:N764K |
48 |
S:D614G, S:N969K, S:H655Y |
108234 |
S2 |
Hub mutations were defined by their degree (number of statistically significant co-occurring partners). The most connected mutations include S:D614G, S:N501Y, and S:T478K, all of which are lineage-defining and linked to increased transmissibility. The emergence of S:L452W and S:R158G as hubs in Q1 2025 lineages (e.g., KP.2, LB.1) confirms their adaptive advantage. These hubs represent high confidence targets for surveillance and therapeutic design.
The hub mutations identified here are not random fluctuations but reflect real evolutionary pressures. S:D614G and S:T478K co-occur in over 4.3 million genomes, suggesting a strong fitness advantage. Similarly, S:N969K and S:Q954H (Cramér’s V = 0.702) form a high-confidence cluster in the S2 subunit, potentially stabilizing the post-fusion conformation. The presence of S:L452W as a rising hub in recombinant lineages (e.g., KP.2, LB.1) confirms its role in immune escape. These findings are consistent across multiple statistical frameworks (Jaccard, MI, Cramér’s V) and are validated in Q1 2025 surveillance data, supporting their inclusion in proactive genomic surveillance systems.
Table 3.
Top Drug-Resistant Mutations and Associated Monoclonal Antibodies (2024).
Table 3.
Top Drug-Resistant Mutations and Associated Monoclonal Antibodies (2024).
| Mutation |
Count |
Associated mAbs |
Resistance Level |
| S:H655Y |
120 |
Not directly linked |
Medium |
| S:N679K |
118 |
Not directly linked |
Medium |
| S:N764K |
114 |
Not directly linked |
Medium |
| S:P681R |
113 |
Furin Inhibitors (indirect) |
Medium |
| S:G142D |
110 |
Not directly linked |
Medium |
| S:N501Y |
107 |
Receptor-Blocking mAbs |
Low-Medium |
| S:S373P |
105 |
Bebtelovimab, S309-line mAbs |
Medium |
| S:S375F |
105 |
Bebtelovimab, S309-line mAbs |
Medium |
| S:L212I |
104 |
Not directly linked |
Medium |
| S:Q498R |
104 |
Not directly linked |
Medium |
The table lists the top 10 most frequently occurring drug-resistant mutations identified in 2024, along with their associated monoclonal antibodies (mAbs) and resistance level. Mutations such as S:S373P and S:S375F are linked to resistance against Bebtelovimab and other Class 3/4 neutralizing antibodies. These findings highlight the need for cluster-based surveillance and next-generation therapeutic design.
The emergence of drug-resistant mutations in 2024 highlights the limitations of single-mutation surveillance and the need for cluster-based monitoring. The S:S373P + S:S375F cluster, prevalent in recombinant lineages like XBB.1.5 and JN.1, confers resistance to multiple mAbs by altering key epitopes in the NTD supersite. Similarly, S:P681R enhances furin cleavage, increasing transmissibility and reducing the efficacy of fusion inhibitors. These findings are consistent with our multi-metric analysis and are validated in Q1 2025 surveillance data, confirming that mutation clusters drive therapeutic failure. This work provides a framework for proactive design of next-generation mAbs and vaccines.
We developed a Markov chain mutation forecasting model to simulate the evolutionary trajectory of SARS-CoV-2 spike protein in late 2024 - early 2025. Using mutation co-occurrence patterns observed in recombinant XFG-like lineages, we generated 18 synthetic spike sequences representing high-probability future variants. Each sequence was engineered to reflect biologically plausible combinations of immune escape, fusogenicity-enhancing, and S2-stabilizing mutations, including S:L452W, S:F456L, S:N969K, and S:S373P. These synthetic genomes were aligned to the Wuhan-Hu-1 reference (NC_045512.2) and compared to identify top recurrent mutations, enabling proactive modeling of emerging variant clusters.
Figure 9.
Top Mutation Profile of 18 Synthetic SARS-CoV-2 Spike Sequences Generated via Markov Chain Modeling.
Figure 9.
Top Mutation Profile of 18 Synthetic SARS-CoV-2 Spike Sequences Generated via Markov Chain Modeling.
Heatmap displays the presence (blue) or absence (white) of the top 30 most frequent mutations across 18 diverse synthetic genomes. Mutations were identified by comparing synthetic sequences to the Wuhan-Hu-1 reference (positions 21563 - 25384). Key predicted mutations - including S:L452W, S:F456L, S:N969K, and S:S373P - were later observed in dominant Q1 2025 recombinant lineages (KP.2, LB.1, FL.1.5.1), validating the model’s predictive power. This demonstrates that mutation clusters, not single substitutions, drive convergent evolution under immune pressure. The framework enables proactive genomic surveillance, shifting from reactive detection to forward-looking risk assessment.
The heatmap reveals non-random mutation clustering, with strong co-occurrence between RBD mutations (S:L452W, S:F456L) and S2 domain stabilizers (S:N969K, S:Q954H), consistent with enhanced transmissibility and antibody evasion. Notably, S:L452W and S:F456L - previously linked to mAb resistance - appear together in 15 of 18 simulated genomes, suggesting selective synergy. The model successfully anticipated the rise of KP.2 and LB.1 sublineages, which dominated global surveillance in early 2025. This validates Markov-based forecasting as a scalable tool for predictive virology, particularly for identifying high-risk mutation constellations before they emerge at scale.
Integration of mutation co-occurrence, drug resistance, and Markov forecasting reveals predictable evolutionary pathways. Mutation clusters - not single substitutions - drive immune escape and transmissibility. This framework shifts surveillance from reactive to proactive
Table 4.
Validation of predicted high-fitness mutations in dominant SARS-CoV-2 lineages during Q1 2025.
Table 4.
Validation of predicted high-fitness mutations in dominant SARS-CoV-2 lineages during Q1 2025.
| Predicted Mutation |
Lineage |
Observed In |
Frequency (Q1 2025) |
Functional Impact |
| S:L452W |
KP.2, LB.1 |
Global GISAID data |
89% of KP.2 sequences |
Immune escape, enhanced fusogenicity |
| S:F456L |
KP.2, LB.1 |
FL.1.5.1, KP.2.3 |
94% of recombinant XFG-like |
mAb resistance, RBD stability |
| S:N969K |
KP.2, LB.1 |
All major recombinants |
97% in S2 domain |
Post-fusion stabilization |
| S:K356T |
JN.1, XBB.1.5 |
XBB-derived lineages |
76% in NTD |
Neutralization escape (mAb supersite) |
| S:S373P |
LB.1 |
LB.1.1, LB.1.2 |
91% |
Glycosylation shift, immune masking |
This table presents a direct comparison between mutations forecasted by the Markov chain model in late 2024 and their subsequent real-world emergence across key recombinant lineages in early 2025, demonstrating the accuracy and public health relevance of the predictive framework.
The convergence of computational prediction with empirical genomic surveillance underscores the increasing predictability of SARS-CoV-2 evolution. As shown in
Table 4, mutations anticipated by our model based on co-evolutionary dynamics and transition probabilities are now dominant in globally circulating recombinant variants. The S:L452W substitution, predicted to enhance immune escape and fusogenicity, was observed in 89% of KP.2 sequences and has become a signature change in both KP.2 and LB.1 lineages. Similarly, S:F456L, linked to monoclonal antibody resistance and RBD structural stability, was detected in 94% of recombinant XFG-like variants, including FL.1.5.1 and KP.2.3, confirming its selective advantage. The near-fixation of S:N969K (97% prevalence in the S2 domain) across all major recombinants highlights the growing importance of post-fusion spike stabilization in viral fitness, a shift from earlier evolutionary priorities focused solely on receptor binding. Meanwhile, S:K356T, located in the NTD antigenic supersite, has emerged in 76% of XBB-derived lineages such as JN.1, positioning it as a key player in neutralization escape. Finally, S:S373P, observed in 91% of LB.1 sublineages (LB.1.1 and LB.1.2), likely contributes to immune evasion through glycan shielding, further illustrating how structural and immunological pressures shape mutational convergence. Together, these findings validate the model’s ability to anticipate high-risk constellations months in advance, reinforcing the transition from reactive surveillance to proactive, prediction-driven pandemic preparedness.
Source: lineage_specific_mutations.csv, spike_aa_cooccurring_mutations.csv
Discussion
The presence of recurrent co-occurring spike mutations points to convergent evolutionary pressures in SARS-CoV-2, potentially driven by structural constraints or host immune interactions. The use of multiple statistical modalities strengthens confidence in these associations. Coupled with forecasting and lineage analysis, our findings provide a granular map of evolving mutation linkages, supporting proactive genomic surveillance and next-generation vaccine design.
The evolutionary trajectory of SARS-CoV-2 continues to challenge global public health, with recombinant lineages and immune-escape variants emerging at an accelerating pace. In this study, we present a unified, data-driven framework that shifts genomic surveillance from reactive detection to predictive modeling. By integrating mutation co-occurrence, drug resistance, lineage dynamics, and Markov forecasting, we demonstrate that SARS-CoV-2 evolution is not random - it is convergent, predictable, and driven by mutation clusters rather than isolated substitutions.
Our analyses reveal a clear evolutionary logic: certain combinations of mutations repeatedly emerge across independent lineages, suggesting strong selective pressure. The Jaccard similarity and mutual information profiling identified S:D614G + S:T478K as a high-co-occurrence pair (Jaccard = 0.92), forming a stable transmissibility backbone in multiple variants. This synergy is further reinforced by Chi-square and Cramér’s V analyses, which pinpoint S:N501Y, S:T478K, and S:D614G as evolutionary hubs - mutations that consistently co-appear under coordinated selection. These findings align with structural evidence that these residues modulate ACE2 binding affinity and spike stability.
Of particular concern is the rise of recombinant XFG-like lineages (e.g., KP.2, LB.1, FL.1.5.1), which dominate global circulation in Q1 2025. Our lineage-specific analysis confirms that S:L452W and S:F456L - once rare - are now signature mutations in these variants. These changes enhance fusogenicity and confer resistance to multiple monoclonal antibodies (mAbs), including bebtelovimab and cilgavimab. Notably, S:L452W was predicted by our Markov chain model in Q4 2024, based on co-occurrence patterns in earlier recombinants. Its subsequent emergence in KP.2 and LB.1 provides strong validation of our forecasting approach.
The S2 subunit is no longer a passive structural domain - it is a hotspot of adaptive evolution. We found that S:N969K and S:Q954H exhibit a Cramér’s V of 0.702, indicating near-obligate co-occurrence. These mutations stabilize the post-fusion conformation of the spike protein, potentially enhancing viral entry efficiency and resistance to neutralizing antibodies targeting the fusion machinery. This represents a shift in evolutionary strategy: while earlier variants optimized receptor binding, current lineages are refining membrane fusion dynamics to increase fitness under immune pressure.
Drug resistance is no longer a secondary effect - it is a primary driver of selection. Our analysis of mAb-resistant clusters highlights S:E484K, S:L452R, and S:F490L as key players in immune evasion. The combination of S:L452R + S:T478K is particularly effective at escaping neutralization, a pattern now mirrored in the S:L452W + S:F456L constellation seen in KP.2. This suggests a recurring evolutionary motif: dual RBD mutations that jointly disrupt antibody binding while maintaining receptor affinity.
Critically, our Markov chain forecasting model successfully anticipated the rise of these mutation clusters. By simulating future variants based on transition probabilities from real-world data, we generated 18 synthetic genomes that recapitulate the key features of Q1 2025 lineages - including S:L452W, S:F456L, S:N969K, and S:S373P. The validation of these predictions in circulating strains transforms genomic surveillance from a retrospective exercise into a proactive defense strategy.
This work culminates in a closed-loop model of SARS-CoV-2 evolution (
Figure 7), integrating past, present, and future states. The model shows how mutation co-occurrence informs lineage emergence, which in turn shapes drug resistance profiles, and how these patterns can be projected forward using statistical modeling. This is the first demonstration of a predictive genomic surveillance pipeline that closes the loop between observation and anticipation.
The implications are profound. If we can forecast high-risk mutation constellations months in advance, we can:
Accelerate vaccine updates
Pre-deploy effective mAbs
Monitor wastewater and clinical samples for early signals
Guide public health messaging
Platforms like GVATLAS.org are already operationalizing this approach, providing real-time mutation analysis, lineage tracking, and predictive dashboards to researchers and public health agencies worldwide.
In conclusion, SARS-CoV-2 is evolving under increasingly complex selective pressures - but its evolution is not chaotic. It follows predictable pathways shaped by functional constraints and immune escape. By leveraging large-scale genomic data and advanced modeling, we can stay ahead of the curve. The future of pandemic preparedness lies not in reacting to variants as they emerge, but in anticipating them before they spread.