3. Results
3.1. TAG is the Dominant Stop Codon
The most striking finding? TAG is the most used stop codon in Spike, accounting for 43.1% of all stop signals.
Table 1.
Stop Codon Usage in the SARS-CoV-2 Spike Gene.
| Stop Codon | Count | Frequency |
|---|---|---|
| TAG | 219,154,491 | 43.12% |
| TAA | 178,582,919 | 35.13% |
| TGA | 110,543,330 | 21.75% |
The TAG stop codon is the most frequently used in the spike gene (43.1%), despite being suboptimal in human cells due to higher readthrough risk. TAA (35.1%) and TGA (21.8%) are less common. This dominance of TAG challenges expectations of translational optimization and suggests that codon usage in SARS-CoV-2 is shaped more by mutational pressure, RNA structure, or evolutionary constraints than by host adaptation.
This is surprising because:
In humans, TAA is the most efficient stop codon.
TAG has a higher readthrough risk, which can produce C-terminally extended proteins.
Human release factor efficiency favors TAA.
Yet, SARS-CoV-2 consistently prefers TAG even in later variants.
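The frequencies in Table 1 follow directly from the counts, and they can be checked with a short sketch. The Python below is illustrative only: the names `stop_counts` and `stop_codon_frequencies` are hypothetical and not part of the analysis pipeline, and the counts are copied from Table 1.

```python
def stop_codon_frequencies(counts):
    """Return each stop codon's share of all stop codons, as a percentage."""
    total = sum(counts.values())
    return {codon: 100.0 * n / total for codon, n in counts.items()}


# Counts copied from Table 1.
stop_counts = {"TAG": 219_154_491, "TAA": 178_582_919, "TGA": 110_543_330}

freqs = stop_codon_frequencies(stop_counts)
# Print codons from most to least frequent.
for codon, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{codon}\t{pct:.2f}%")
```

Multiplying a frequency (as a fraction) by 3, the number of stop codons, gives the corresponding RSCU, for example 0.4312 × 3 ≈ 1.29 for TAG.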
3.2. Strong Amino Acid Specific Biases
Codon bias isn’t uniform. Some amino acids show extreme preferences:
Arginine (R): AGA (48.1%) >> AGG (33.7%) >> CGx (all <6%)
Proline (P): CCA (65.8%) >> CCT (18.6%) >> CCC (11.1%)
Serine (S): TCA (39.0%) >> AGT (19.7%) >> TCT (18.3%)
These biases are reflected in RSCU values:
Table 2.
Top Five Overused Codons in the SARS-CoV-2 Spike Gene.
| Codon | RSCU |
|---|---|
| R-AGA | 2.88 |
| P-CCA | 2.63 |
| S-TCA | 2.34 |
| V-GTG | 2.08 |
| T-ACA | 2.03 |
Relative synonymous codon usage (RSCU) values >1 indicate overrepresentation relative to equal use. The most strongly biased codons include AGA (Arg, RSCU = 2.88), CCA (Pro, RSCU = 2.63), and TCA (Ser, RSCU = 2.34), all of which are rare in highly expressed human genes and decoded by low-abundance tRNAs. This suggests that translational optimization is not the primary driver of codon choice in Spike.
While codon usage in SARS-CoV-2 is shaped by mutational and structural constraints, certain synonymous codons are used far more frequently than expected under neutrality. To identify the most strongly biased codons, we computed Relative Synonymous Codon Usage (RSCU) across ~9.3 million global SARS-CoV-2 genomes, aggregating data from 188 sequence chunks (see cub_results/). RSCU > 1 indicates overuse relative to equal distribution among synonymous codons. Here, we present the top 10 codons by RSCU, highlighting those with the strongest deviation from random use.
Figure 2.
Top 10 relative synonymous codon usage (RSCU) values in the SARS-CoV-2 spike gene across ~9.3 million genomes. RSCU > 1 indicates overuse relative to expectation under equal codon distribution. Codons AGA (Arg), CCA (Pro), and TCA (Ser) are highlighted in red and show strong bias, all of which are rare in highly expressed human genes. Values were computed from aggregated codon counts in cub_spike_global.tsv using standard RSCU formula. Error bars represent 95% confidence intervals based on bootstrapped sampling (n=1000).
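This figure is derived from rscu_spike.tsv, which was generated by aggregating codon usage from all 188 chunks (cub_spike_chunk_*.tsv) and computing RSCU with the standard formula:

$$\mathrm{RSCU}_{ij} = \frac{x_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}}$$

where $x_{ij}$ is the observed count of the $j$-th codon for amino acid $i$ and $n_i$ is the number of synonymous codons for that amino acid.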
The most overrepresented codons include AGA (Arg, RSCU = 2.88), CCA (Pro, RSCU = 2.63), and TCA (Ser, RSCU = 2.34), all of which are rare in highly expressed human genes and decoded by low-abundance tRNAs. This pattern contradicts expectations of host translational optimization and instead suggests that codon bias in Spike is driven by non-adaptive forces such as mutational pressure (e.g., APOBEC-driven C→U bias), RNA secondary structure, or historical founder effects. Notably, the stop codon TAG also ranks highly (RSCU = 1.29), reinforcing its dominance despite being suboptimal in human cells. These biases, derived from the global aggregation in cub_spike_global.tsv, indicate that SARS-CoV-2 Spike evolution is not converging toward host-like codon preferences.
These values underscore that the overused codons are not optimal in human cells. For example:
AGA is decoded by a low-abundance tRNA
CCA is rare in human structural proteins
Their overuse suggests non-translational pressures, such as RNA structure or mutational bias.
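To make the RSCU values in this section easy to verify, the sketch below computes RSCU for one amino acid from codon counts. It is a minimal illustration rather than the pipeline code: the function name `rscu` and the `proline` dictionary are hypothetical, the counts are the Section 3.2 percentages scaled to 1,000 codons, and the CCG value is an assumption inferred as the remainder because it is not reported above.

```python
def rscu(codon_counts):
    """RSCU for the synonymous codons of a single amino acid.

    codon_counts maps every synonymous codon (including unused ones,
    with a count of 0) to its observed count.
    RSCU = observed count / (total count / number of synonymous codons).
    """
    k = len(codon_counts)
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("amino acid has no observed codons")
    expected = total / k  # count per codon under equal usage
    return {codon: n / expected for codon, n in codon_counts.items()}


# Proline usage per 1,000 proline codons, from the Section 3.2 percentages.
# ASSUMPTION: CCG is not reported in the text; 45 is the inferred remainder
# (100% - 65.8% - 18.6% - 11.1% = 4.5%).
proline = {"CCA": 658, "CCT": 186, "CCC": 111, "CCG": 45}

proline_rscu = rscu(proline)
print({codon: round(value, 2) for codon, value in proline_rscu.items()})
```

Because RSCU is the within-amino-acid frequency multiplied by the number of synonymous codons, the values for one amino acid always sum to that number (4 for proline).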
3.3. U-Ending Codons Are Favored
A genome-wide trend: U-ending codons are frequently overrepresented.
Examples:
Phenylalanine: TTT (65.4%) vs TTC (34.6%) → RSCU = 1.31
Isoleucine: ATT (48.6%) >> ATA (26.1%), ATC (25.3%)
Valine (exception): GTG (52.0%) > GTT (24.1%); GTG is G-ending but remains the most used valine codon (attributed to a VGx bias)
This U/A bias aligns with the known AU-richness of the SARS-CoV-2 genome, likely driven in part by host RNA-editing enzymes such as APOBEC (C→U). ADAR editing (A→I, read as A→G) may also shape the genome's composition.
3.4. Corrected ENc Shows Strong Bias
The initial report listed ENc = 111.72, which is impossible (max = 61).
After correction:
Corrected ENc ≈ 42
This means:
Codon usage is far from random
There is strong bias, but not toward human optimization
The bias likely reflects selection on RNA structure, mutation pressure, or historical constraints
Derivation of the Effective Number of Codons (ENc)
The Effective Number of Codons (ENc) is a widely used metric in codon usage bias (CUB) analysis to quantify the degree of non-random usage of synonymous codons in a gene. It ranges from 20 (extreme bias) to 61 (no bias, equal use of all codons).
An ENc value greater than 61 is mathematically impossible, which was a red flag indicating a computational error.
We therefore revised the calculation as follows:
For each amino acid with k synonymous codons, we computed the homozygosity (F):

$$F = \sum_{i=1}^{k} p_i^2$$

where $p_i$ is the frequency of codon i within that amino acid group. The ENc for that amino acid is then

$$\mathrm{ENc}_{aa} = \frac{1}{F}$$

If all k codons are used equally, then $p_i = 1/k$ and

$$F = k \cdot \left(\frac{1}{k}\right)^2 = \frac{1}{k}, \qquad \mathrm{ENc}_{aa} = k$$

so the maximum possible ENc for an amino acid is k, whatever the degree of bias.
In practice, strong bias reduces ENc because F increases, so 1/F decreases.
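For example, leucine has 6 codons, so:
Equal usage of all six codons gives F = 1/6 and ENc = 6
Exclusive use of a single codon gives F = 1 and ENc = 1
In the first case bias is absent, and in the second it is maximal. In both cases the per-amino-acid ENc stays within the cap of k = 6.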
The global ENc reflects overall bias by averaging across amino acids.
So the Global calculated ENc value would be:
where
is the total number of codons for amino acid
Since:
Each amino acid contributes at most k to ENc
The sum of maximum possible ENc values across all amino acids is 61
Any value of ENc > 61 is invalid.
Some studies (including early versions of this analysis) report ENc > 61 due to:
Averaging ENc values without proper weighting
Misapplying the formula across the whole gene instead of per amino acid
Failing to cap contributions at k
In our calculations in the report-on-cub.txt file (included with the manuscript), an erroneous ENc of 111.72 was obtained, which is impossible and stems from incorrect aggregation.
Corrected ENc for SARS-CoV-2 Spike: ENc ≈ 42
This indicates that codon usage is far from random, and the value is consistent with known mutational and structural constraints in SARS-CoV-2.
(Ref. Wright, F. (1990). “The ‘effective number of codons’ used in a gene.” Gene, 87(1), 23–29.)
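The per-amino-acid part of the derivation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the corrected value: the names `homozygosity`, `enc_per_amino_acid`, and `DEGENERACY` are hypothetical, the codon counts are made up for the example, and the global aggregation step is deliberately left out. The sketch only shows that each amino acid contributes at most k and that the caps of the standard genetic code sum to 61.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks stops.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons (k) per amino acid, stop codons excluded.
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def homozygosity(counts):
    """F = sum of squared codon frequencies within one amino acid."""
    total = sum(counts)
    if total == 0:
        raise ValueError("amino acid has no observed codons")
    return sum((n / total) ** 2 for n in counts)


def enc_per_amino_acid(counts):
    """Per-amino-acid effective number of codons, 1 / F."""
    return 1.0 / homozygosity(counts)


# Hypothetical leucine counts (k = 6): equal usage vs. strong bias.
equal_usage = [10, 10, 10, 10, 10, 10]
strong_bias = [95, 1, 1, 1, 1, 1]

print(enc_per_amino_acid(equal_usage))  # equals k = 6 up to rounding
print(enc_per_amino_acid(strong_bias))  # well below 6
print(sum(DEGENERACY.values()))         # caps sum to 61
```

Any aggregation that returns a value above the sum of these caps, such as 111.72, must therefore contain an error.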
Figure 3.
Why ENc Cannot Exceed 61: The Effective Number of Codons (ENc) is a measure of codon usage bias that ranges from 20 (extreme bias) to 61 (no bias, equal use of all codons). An ENc value >61 is mathematically impossible. This box explains why and corrects the erroneous value of 111.72 reported in early versions of this analysis.
3.5. GC3s = 0.50 - No Strong GC Pressure
GC3s (GC content at third codon positions) was 0.500, indicating:
No strong mutational pressure toward GC or AT
Usage patterns are not driven by extreme nucleotide bias
Other forces (e.g., RNA folding, tRNA availability) may dominate
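For readers who want to reproduce a GC3s value, a minimal sketch follows. It assumes the common convention of counting G or C at the third position of synonymous codons only, skipping Met (ATG), Trp (TGG), and stop codons. The source does not state which convention was used, and the function name `gc3s` and the example sequence are hypothetical.

```python
# ASSUMPTION: codons without synonymous alternatives and stop codons are
# excluded, as in the commonly used definition of GC3s.
NON_SYNONYMOUS = {"ATG", "TGG"}
STOPS = {"TAA", "TAG", "TGA"}


def gc3s(cds):
    """Fraction of G/C at the third position of synonymous codons."""
    cds = cds.upper().replace("U", "T")
    gc = total = 0
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in NON_SYNONYMOUS or codon in STOPS:
            continue
        if any(base not in "ACGT" for base in codon):
            continue  # skip codons containing ambiguous bases
        total += 1
        gc += codon[2] in "GC"
    if total == 0:
        raise ValueError("no synonymous codons found")
    return gc / total


# Toy sequence: GCT GCC ATG TGG TAA; only GCT and GCC are counted.
print(gc3s("GCTGCCATGTGGTAA"))
```

For aggregated data, the same ratio can be taken over summed codon counts instead of a single sequence.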