3.2. Dinucleotide Composition Resolves into Five Universal Grammar Classes—The Panchamahābhūta States
To characterize the compositional grammar within the GPI framework, we computed dinucleotide profiles for 829,145 non-overlapping 200 bp windows across chromosome 17. Chromosome 17 was selected as the reference chromosome for grammar inference because of its intermediate gene density, its comprehensive ClinVar variant coverage at disease loci (BRCA1, TP53, NF1), and its use as the reference chromosome in our prior analyses.
Principal component analysis of the 16-dimensional dinucleotide frequency space revealed a stable five-dimensional structure. The first five principal components capture the dominant axes of compositional variation across the 829,145 chr17 windows, with the PC1–PC2, PC1–PC3, and PC4–PC5 projections showing progressive geometric separation of compositional classes (
Figure 2A–C). Gaussian mixture modelling of this five-dimensional space identifies two biologically meaningful solutions: k=3 (Tridosha;
Figure 2D) and k=5 (Panchamahābhūta;
Figure 2E). BIC model selection confirmed five as the optimal number of Gaussian mixture components, with the BIC elbow at k=5 (
Figure 2F). The 200 bp window was selected as optimal across window sizes from 50 bp to 5,000 bp: the eigenvalue elbow at PC5 is stable at 200 bp (Extended Data
Figure 1A), PC1 dominance reaches 26.8% at 200 bp, the smallest window at which the grammar signal is resolvable from compositional noise (Extended Data
Figure 1B), six PCs explain 80% of dinucleotide variance at 200 bp, satisfying the Panchamahābhūta dimensionality threshold (Extended Data
Figure 1C), and BIC model selection confirms k = 3 (Tridosha) and k = 5 (Panchamahābhūta) as the two biologically meaningful solutions (Extended Data
Figure 1D). We term these five classes the Panchamahābhūta grammar states and designate them S1 (Jala), S2 (Vāyu), S3 (Agni), S4 (Prithvi), and S5 (Ākāśa).
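As a minimal sketch of the feature extraction described above, the following computes a 16-dimensional dinucleotide frequency vector for each non-overlapping window. The function names, the use of overlapping dinucleotide counts within a window, and the handling of ambiguous bases are assumptions made for illustration; the source does not specify these details.

```python
from itertools import product

# The 16 features, in a fixed order AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(window):
    """Return the 16 dinucleotide frequencies of one window (overlapping counts)."""
    window = window.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        if pair in counts:  # assumption: pairs containing N or other codes are skipped
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def window_profiles(sequence, size=200):
    """Profiles for consecutive non-overlapping windows; a trailing partial window is dropped."""
    return [
        dinucleotide_profile(sequence[start:start + size])
        for start in range(0, len(sequence) - size + 1, size)
    ]
```

The resulting matrix (one row per window, 16 columns) is the kind of input on which a principal component analysis and a Gaussian mixture model could then be fitted.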
The five states were validated as universal across chromosomes: when the grammar was independently inferred on six additional chromosomes (chr1, chr7, chr11, chr19, chr22, chrX) without reference to chr17, the resulting states showed mean cosine similarity of 0.9596 ± 0.032 to the chr17 reference (p = 4.65×10⁻⁸;
Figure S2). The lowest similarity was observed on chrX (0.896), where the LINE-rich, AT-rich repeat-dominant class is markedly over-represented (54.3% vs. 12.2% on chr17), consistent with the known enrichment of LINE elements on the inactive X chromosome for dosage compensation spreading [
16]. This biologically motivated deviation validates rather than undermines the framework: the grammar correctly identifies the unusual repeat architecture of chrX as a deviation from the autosomal norm.
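The cross-chromosome comparison can be illustrated with a short sketch of cosine similarity between state centroids. The matching rule used here (each reference state is paired with its most similar inferred state) is an assumption for illustration; the source does not state how states were matched across chromosomes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_best_match_similarity(reference_states, inferred_states):
    """Average, over reference centroids, of the best similarity to any inferred centroid.

    Assumption: best-match pairing, which is not guaranteed to be one-to-one.
    """
    best = [
        max(cosine_similarity(ref, cand) for cand in inferred_states)
        for ref in reference_states
    ]
    return sum(best) / len(best)
```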
State proportions vary systematically with chromosomal biology (
Figure S2). The gene-dense chromosome 19 shows marked enrichment of the promoter-associated Agni-like class (37.5% vs. 3.7% on chr17), consistent with its high promoter density. Chromosome 22 is dominated by the Ākāśa class (50.8%), reflecting its compact, gene-rich structure. These chromosome-specific distributions confirm that the grammar captures genuine compositional biology rather than chr17-specific artifacts.
Each state maps to a distinct biological identity without prior annotation. State abundance on chr17 is shown in
Figure 3A: S1 (Jala) 14.0%, S2 (Vāyu) 35.0%, S3 (Agni) 4.3%, S4 (Prithvi) 20.4%, S5 (Ākāśa) 26.2%. The repeat composition of each state (
Figure 3B) and gene feature enrichment (
Figure 3C) confirm distinct functional identities: S3 (Agni) is enriched 7.73-fold at promoters in the gene feature analysis (
Figure 3C), 13.1-fold at ENCODE promoter elements (
Figure 3D), and 3.72-fold at GPID breaks, identifying it as the state that marks sites of transcription initiation. S4 (Prithvi), occupying 20.4% of chr17 windows, is LINE-rich (32.3%), AT-rich, and depleted at promoters (0.42-fold), marking structurally rigid scaffold sequence (
Figure 3B–C) [
17]. S2 (Vāyu) and S5 (Ākāśa), together comprising 61.2% of chr17 windows, are Alu-rich and characterize regulatory domain boundary sequence (
Figure 3B) [
18]. S1 (Jala) marks intergenic structural intervals. Physical dinucleotide fingerprints — AT frequency, GC frequency, CpG density, and unique sequence proportion — further distinguish each state (
Figure 3E), and the complete Grammar→Biology correspondence is summarised in
Figure 3F. The spatial distribution of all five states across chr17 shows focal enrichment of Agni at the TP53, NF1, and BRCA1 disease loci (
Figure 3G). These biological identities are consistent across all chromosomes tested (
Figure S2), confirming that the grammar states are universal rather than chr17-specific. Condensing the five Panchamahābhūta states to the three Tridosha classes (Pitta = Agni; Vāta = Vāyu + Ākāśa; Kapha = Jala + Prithvi), chromosome 17 is 4.3% Pitta, 61.2% Vāta, and 34.5% Kapha — a Vāta-dominant chromosome with focal Pitta concentrated at functional GPID break positions, as described in
Section 3.7.
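The condensation from five states to three classes is simple addition, sketched below using the chr17 state abundances quoted above. The dictionary names and the ASCII spellings of the state names are illustrative choices only.

```python
# chr17 state abundances (%) as reported in the text.
STATE_ABUNDANCE = {
    "Jala": 14.0,
    "Vayu": 35.0,
    "Agni": 4.3,
    "Prithvi": 20.4,
    "Akasha": 26.2,
}

# Tridosha mapping as defined in the text.
TRIDOSHA = {
    "Pitta": ("Agni",),
    "Vata": ("Vayu", "Akasha"),
    "Kapha": ("Jala", "Prithvi"),
}


def condense(abundance, mapping=TRIDOSHA):
    """Sum state percentages into their Tridosha classes, rounded to one decimal."""
    return {
        dosha: round(sum(abundance[state] for state in states), 1)
        for dosha, states in mapping.items()
    }


tridosha = condense(STATE_ABUNDANCE)
```

Summing the rounded per-state values gives 4.3% Pitta, 61.2% Vāta, and 34.4% Kapha; the 0.1-point gap from the reported 34.5% Kapha is the size one would expect from rounding of the individual state percentages.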
Extended Data Figure 1 | Window size validation — the grammar is sharpest at 200 bp.
Figure 1.
Genomic Periodicity Index (GPI) — A Physical Grammar of the Human Genome. (A) GPI dominant period per chromosome (kb). Chromosomes are ordered 1–22 and X. Dashed line = genome mean. Outliers chr4 (13.5 kb) and chr7 (12.0 kb) reflect large segmental duplication blocks. (B) Repeat fraction per chromosome (%). Mean = 77.7 ± 6.4% across all 23 chromosomes, consistent with a conserved nucleosome packaging requirement independent of gene content. (C) GPID functional overlap per chromosome. GPID breaks in 17 of 23 chromosomes yielding reliable autocorrelation signal overlapped known functional elements at 60–93% (mean 81.4%), compared to a 30% genome-wide random baseline (p = 3×10⁻¹²⁷, binomial test; enrichment = 2.71-fold). Six chromosomes (chr9, chr13–chr15, chr21–chr22) showed high autocorrelation noise consistent with centromeric repeat masking and were excluded. (D) Human-specific GPI period elongation at disease loci vs. chimpanzee and gorilla. APOE: 98.5 kb human vs. 5.0 kb chimp (20-fold); HBB: 48.0 kb vs. 2.5 kb (19-fold); CFTR: 42.5 kb vs. 13.0 kb (3-fold). TP53 (control): no elongation (~6 kb across all species).

The 200 bp window was selected as optimal across window sizes from 50 bp to 5,000 bp: the eigenvalue elbow at PC5 is stable at 200 bp (Extended Data
Figure 1A), PC1 dominance reaches 26.8% at 200 bp, the smallest window at which the grammar signal is resolvable from compositional noise (Extended Data
Figure 1B), six PCs explain 80% of dinucleotide variance at 200 bp, satisfying the Panchamahābhūta dimensionality threshold (Extended Data
Figure 1C), and BIC model selection confirms k = 3 (Tridosha) and k = 5 (Panchamahābhūta) as the two biologically meaningful solutions (Extended Data
Figure 1D).
Extended Data Figure 2 | Two-layer predictor: grammar state × evolutionary conservation.
Figure 2.
Dinucleotide composition resolves into five universal Panchamahābhūta grammar states. (A–C) PCA of dinucleotide profiles for 829,145 non-overlapping 200 bp windows across chr17. (A) PC1 vs. PC2 density hexplot; (B) PC1 vs. PC3; (C) PC4 vs. PC5. Five natural clusters are visible in raw PCA space without prior annotation. (D) k = 3 (Tridosha) Gaussian mixture model. (E) k = 5 (Panchamahābhūta) Gaussian mixture model — the biologically optimal solution. States are colour-coded: S1 Jala (blue), S2 Vāyu (green), S3 Agni (red), S4 Prithvi (purple), S5 Ākāśa (orange). (F) BIC model selection over k = 2–10; the elbow at k = 5 confirms the Panchamahābhūta solution.
(A) Mean phyloP100way conservation score per grammar state for pathogenic vs. benign variants. (B) High-conservation variants are more pathogenic within the same grammar state. (C) Conservation-pathogenicity relationship within each grammar state (quartile analysis). (D) Combined predictor: grammar state × conservation. Prithvi+high conservation = 65% pathogenic; Vāyu+high conservation = 63%; low conservation = 35%. (E) gnomAD validation (requires genome-wide analysis — in preparation). (F) Summary: grammar state provides a sequence-based prior; conservation provides the evolutionary constraint layer.
Figure 3.
Panchamahābhūta grammar states — structural and functional identity. (A) State abundance on chr17: S1 Jala 14.0%, S2 Vāyu 35.0%, S3 Agni 4.3%, S4 Prithvi 20.4%, S5 Ākāśa 26.2%. (B) Repeat landscape per state. Percentage of each repeat class within each grammar state. S3 Agni is dominated by unique sequence (66%); S4 Prithvi by LINE elements (32%). (C) Gene feature enrichment per state (log₂ fold change). S3 Agni shows 7.73-fold promoter enrichment. (D) ENCODE promoter enrichment per state. S3 Agni shows 13.1-fold enrichment — identifying it as the site of transcriptional ignition. (E) Physical fingerprints: AT dinucleotide, GC dinucleotide, CpG frequency (×10), and unique sequence proportion per state. (F) Grammar→Biology summary table. (G) Grammar state distribution across chr17 (0–83.5 Mb). Each point represents a 200 bp window assigned to the indicated state. Red dashed vertical lines mark TP53 (7.8 Mb), NF1 (29.4 Mb), and BRCA1 (43 Mb). S3 Agni (red) is enriched at known disease gene loci, consistent with its promoter-associated identity; S2 Vāyu (green) forms a continuous background across the entire chromosome.

Extended Data Figure 3 | Grammar states predict cis-regulatory element identity.
Analysis of ENCODE cCREs on chromosome 17 by grammar state and GPID context. (A) Grammar state enrichment (odds ratio) at each cCRE class: Agni enriched 35.1× at promoters (PLS), 8.2× at proximal enhancers (pELS), 1.9× at distal enhancers (dELS); Agni depleted 0.66× at CTCF/insulator elements; Ākāśa enriched 1.34× at CTCF sites. (B) GPID break fraction per cCRE class: promoters 27.8% (3.38× baseline); CTCF sites 5.0% (0.61×, grammar-stable), demonstrating that insulators are grammar-stable positions defined by the absence of rhythm violation. (C) Agni+GPID windows by TSS distance: within 500 bp, 32.8% overlap promoter elements; beyond 2 kb, 38.6% overlap distal enhancers — demonstrating that the same grammar state predicts element type from genomic context alone. (D) Grammar state composition per cCRE class (stacked bar). (E) Grammar signature summary table and prediction rules mapping Agni+GPID+TSS distance to regulatory element type.
3.3. Sandhi Transition Rules Define a Formal Positional Grammar
We computed the 5×5 transition matrix between adjacent grammar windows across chr17 and compared observed to expected transition frequencies under independence. The resulting log-odds matrix reveals a formal Sandhi grammar with three properties characteristic of natural language grammars (
Figure 4A–B).
Extended Data Figure 4 | Tridosha segmentation of the human genome and Prakriti of the regulome.
Tridosha condensation (Pitta = Agni; Vāta = Vāyu + Ākāśa; Kapha = Jala + Prithvi) applied to chromosome 17. (A) Tridosha genome map (0–83.5 Mb): log₂ enrichment of Pitta (red), Vāta (blue), and Kapha (green) relative to chromosomal mean, with TP53, NF1, and BRCA1 landmarks. (B) Chromosomal Tridosha composition: 4.3% Pitta, 61.2% Vāta, 34.5% Kapha. (C) GPID break fraction per Dosha: Pitta 30.6% (3.73× baseline); Kapha 5.6% (depleted). (D) Ternary plot of regulatory element Prakriti: each cCRE class occupies a distinct position in Pitta–Vāta–Kapha space; trajectory arrow marks the Pitta depletion gradient from Promoter to CTCF/Insulator. (E) Tridosha stacked bar per cCRE class. (F) Prakriti classification table: element type, Dosha fractions, and constitutional classification.
First, certain transitions are strongly forbidden. The Agni→Prithvi transition (and its reverse) shows log-odds of −4.67, the most negative in the matrix (
Figure 4B), consistent with the biochemical incompatibility of CpG-rich, nucleosome-depleted promoter sequence with AT-rich, LINE-dense scaffold sequence: the two most compositionally distinct classes in the genome are the least likely to occur in adjacent 200 bp windows [
19]. Second, state persistence follows a gradient of structural rigidity visible on the diagonal of the log-odds matrix (
Figure 4B). Agni shows the highest self-persistence (log-odds = +3.83), consistent with the maintenance of open chromatin at active promoters across multiple nucleosome positions [
20]. Jala shows the lowest self-persistence (log-odds = +1.72), consistent with the well-established higher compositional heterogeneity and nucleotide diversity of intergenic sequence relative to genic regions — intergenic DNA is under the least selective constraint and shows the highest rate of chromatin state transitions between adjacent genomic positions [
21].
Third, the grammar is context-dependent at gene body boundaries. Metagene profiles across transcription start sites, transcription end sites, and splice junctions reveal systematic grammar modulation (
Figure 5A–D, 5F): Agni probability crescendos from 0.30 to 0.65 across the TSS (
Figure 5A); at the transcription end site, Agni declines and Jala rises entering the downstream intergenic interval, marking the handoff from active transcription to structural sequence (
Figure 5B); at splice donor sites, Agni relaxes from 0.106 to 0.084 entering the intron (
Figure 5C); at splice acceptor sites, Agni rises from 0.087 to 0.101 in anticipation of the exon (
Figure 5D) — a forward-context grammar rule with no parallel in current computational models of splice site recognition. The composite gene body profile confirms that Agni is the dominant grammar signal at all gene body boundaries, with Vāyu carrying the signal across introns (
Figure 5F).
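The observed-versus-expected comparison behind the transition matrix in this section can be sketched as follows. The base-2 logarithm, the pseudocount, and the function name are assumptions for illustration; the source reports log-odds values without stating the base or any smoothing.

```python
import math
from collections import Counter


def transition_log_odds(states, pseudocount=0.5):
    """log2(observed / expected) for each ordered pair of adjacent states.

    Expected counts assume independence between the first and second
    window of each adjacent pair.
    """
    labels = sorted(set(states))
    pairs = Counter(zip(states, states[1:]))
    n_pairs = len(states) - 1
    first = Counter(states[:-1])   # how often each state starts a pair
    second = Counter(states[1:])   # how often each state ends a pair
    log_odds = {}
    for a in labels:
        for b in labels:
            observed = pairs[(a, b)] + pseudocount
            expected = first[a] * second[b] / n_pairs + pseudocount
            log_odds[(a, b)] = math.log2(observed / expected)
    return log_odds
```

Under this convention, positive diagonal entries indicate state persistence and strongly negative off-diagonal entries indicate transitions that occur far less often than composition alone would predict.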
Extended Data Figure 5 | Tridosha grammar predicts evolutionary constraint (gnomAD v4.1).
gnomAD v4.1 constraint analysis of 2,171 chromosome 17 genes classified by Tridosha grammar at their TSS. (A) pLI score distribution per Tridosha class (violin plots): Pitta median pLI = 0.001, highest of the three Doshas. (B) LOEUF per Tridosha class: Pitta = 0.844; Vāta = 1.018; Kapha = 1.071 (Kruskal–Wallis p = 5.4×10⁻¹⁰). (C) Fraction of highly constrained genes (pLI ≥ 0.9): Pitta 28.0%, Kapha 18.2%, Vāta 12.7% (Pitta vs Vāta OR = 2.64×, p = 7.6×10⁻¹¹). (D) Continuous Pitta fraction vs pLI across all 2,171 genes; binned medians (black line) show a monotonic relationship. (E) GPID break × Tridosha × pLI: Pitta+GPID genes show highest median pLI (0.005); Vāta+GPID genes remain unconstrained (median pLI = 0.000). Summary of constraint metrics by Dosha and GPID context.
3.4. Grammatically Constrained Positions Are Enriched for Pathogenic Variants
The combination of GPID context (Level 1) and grammar class (Level 2) identifies positions of maximal structural constraint — where the rhythm demands function and the local composition is most rigid. We tested whether ClinVar single-nucleotide variants on chromosome 17 (8,550 pathogenic, 14,845 benign) are non-uniformly distributed across this two-dimensional grammar landscape [
22]. Variants falling at Sandhi junctions — boundaries between adjacent grammar states — show elevated pathogenicity relative to state-stable positions, consistent with the structural instability predicted at grammar transition points (
Figure 6A).
The overall pathogenic fraction across all states and contexts was 36.5%. Within GPID-active positions (sites where structural periodicity is locally violated), the pathogenic fraction varied systematically by sequence class (
Figure 6B): the Prithvi class (LINE-rich, AT-rich structurally rigid scaffold sequence) showed the highest above-baseline pathogenicity in state interiors (53.3%); within GPID-active positions specifically, Prithvi-class variants reached 69.0% pathogenic (OR = 2.64, p = 10⁻³²;
Figure 7E), reflecting the positional entropy model — structurally committed positions at functional sites tolerate few sequence configurations. The compositional state transition boundary heatmap (
Figure 6C) confirms that transitions involving the Prithvi class (Sandhi junctions) show the highest pathogenicity across all state pairs. Prithvi-class junctions show significantly higher pathogenicity than non-Prithvi junctions (
Figure 6D). Conversely, the Agni class (CpG-rich, nucleosome-depleted promoter-associated sequence) combined with GPID breaks showed below-baseline pathogenicity (29.8%; OR = 0.88, p = 0.03), consistent with the functional tolerance of promoter sequence for compositional variation.
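The baseline and odds-ratio quantities used in this comparison can be sketched from a 2×2 table of variant counts. Only the chr17 totals (8,550 pathogenic, 14,845 benign) come from the text; the function names are illustrative, and per-class counts are not given in this section, so none are assumed here.

```python
def pathogenic_fraction(pathogenic, benign):
    """Fraction of classified variants that are pathogenic."""
    return pathogenic / (pathogenic + benign)


def odds_ratio(path_in, benign_in, path_out, benign_out):
    """Odds of pathogenicity inside a grammar class relative to outside it."""
    return (path_in / benign_in) / (path_out / benign_out)


# chr17 ClinVar totals quoted in the text.
baseline = pathogenic_fraction(8550, 14845)
```

The chr17 totals reproduce the 36.5% overall pathogenic fraction stated above. An odds ratio above 1 for a class would indicate enrichment of pathogenic variants in that class, and a value below 1 would indicate depletion.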
This chr17 finding is supported by per-chromosome analysis (ED
Figure 5B): Prithvi+GPID positions show above-baseline pathogenic enrichment on 10 of 17 chromosomes with sufficient data, with the strongest signals on chromosomes enriched for structural disease gene clusters (chr16: +30%, OR = 3.56, p = 4.87 × 10⁻¹³; chr20: +22%, OR = 2.69, p = 4.42 × 10⁻⁷). The effect is absent on gene-dense chromosomes where the promoter-associated Agni class dominates over the rigid scaffold Prithvi class (chr19: 9% below chromosomal baseline, OR = 0.57, p = 9.54 × 10⁻¹¹), consistent with the prediction that the constraint signal requires structurally rigid sequence context. The relationship between structural rigidity and pathogenicity follows a consistent gradient across all five sequence classes (
Figure 6E).
Non-B DNA structure analysis independently supports the positional entropy model (
Figure S6). G-quadruplex (G4) positions — where guanine-rich sequences fold into four-stranded secondary structures at promoters and replication origins — showed markedly reduced pathogenic variant fraction (7.1%; p = 0.006). This is biologically expected: G4 structures can form from multiple alternative sequence configurations, are dynamically resolved by specialized helicases (FANCJ, DHX36) [
23], and are subject to positive selection for sequence variability in regulatory contexts.
The genome tolerates — and in some cases exploits — sequence variation at G4 positions precisely because the functional output (structural folding) does not depend on a single committed sequence. Microsatellite positions tell the same story from a different angle: tandem repeat length polymorphism at these sites is itself a normal regulatory mechanism, with STR length variation at promoters modulating transcription factor binding and gene expression levels across individuals [
24]. Here, the genome’s tolerance for variation is not incidental but functional.
Both G4 and microsatellite positions showed pathogenic fractions (7.1% and 27.6% respectively) well below the 42.6% chromosome-wide baseline, consistent with the positional entropy model: positions with high configurational freedom — where multiple sequence states are compatible with function — tolerate variation, whereas positions of low configurational freedom — where a single structural commitment is required, as at Prithvi+GPID sites — are intolerant of it. These findings are consistent with, but do not in themselves establish, a causal link between grammar class and variant intolerance; establishing the direction of this relationship requires genome-wide analysis with allele frequency data, which is addressed in the Discussion. Grammar displacement analysis confirms that pathogenic variants move grammar windows toward forbidden Sandhi zones more frequently than benign variants (OR = 0.81×, p = 0.04;
Figure S7), providing a continuous, vector-valued measure of grammar violation that complements the discrete state-based predictor.
Extended Data Figure 6 | Tridosha grammar of human-specific transposable element insertions.
Grammar state and Tridosha composition at human-specific transposable element insertion sites, derived using the Genome Presence/Absence Compiler (GPAC) [
25]. Human-specific elements were defined as present in human (hg19/hg38) but absent at the syntenic position in chimpanzee (panTro6) and gorilla (gorGor6). (A) Grammar state distribution at human-specific Alu insertion sites (n = 786) compared to genome baseline. Human-specific Alus fall predominantly in Vāyu (48%) and Ākāśa (50%) states — 98% combined Vāta Dosha — a 1.4× and 1.9× enrichment over baseline respectively (both p < 0.01). (B) Vāta fraction comparison across three contexts: genome baseline (61%), distal enhancers/dELS (70%), and human-specific Alus (98%), demonstrating that human-specific Alu insertions share the Vāta grammar signature of distal enhancers and expanded the distal regulatory architecture of the human genome. Human-specific LINE elements (L1HS, L1PA2; n = 450) show the complementary Pitta pattern: 55% Agni-class (12.7× baseline enrichment, p < 0.001), indicating that lineage-specific LINE insertions preferentially targeted promoter-associated sequence.
3.5. Human-Specific GPI Reorganization Marks Evolutionary Innovation at Disease and Trait-Divergence Loci
The GPI framework provides a quantitative measure of regulatory architecture scale: longer GPI periods reflect more distributed, long-range regulatory organization. We compared GPI periods at five genomic regions across four primate species (human, chimpanzee, gorilla, rhesus macaque) using orthologous RepeatMasker data from syntenic assemblies (hg38, panTro6, gorGor6, rheMac10).
Three loci of human-specific disease burden show dramatically elevated GPI periods in human compared to all other primates (
Figure 7A, S4): APOE (human 98,500 bp; primate mean 6,167 bp; 16-fold); HBB (human 48,000 bp; primate mean 3,167 bp; 15-fold); CFTR (human 42,500 bp; primate mean 6,833 bp; 6-fold). The control locus TP53, whose cancer biology is conserved across mammals, showed no human-specific period elevation (human 6,000 bp; primate mean 4,000 bp).
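The fold elongations quoted above follow directly from the reported periods, as this short check shows. The dictionary and function names are illustrative.

```python
# (human GPI period, mean period of the other primates), in bp, from the text.
GPI_PERIOD_BP = {
    "APOE": (98_500, 6_167),
    "HBB": (48_000, 3_167),
    "CFTR": (42_500, 6_833),
    "TP53": (6_000, 4_000),  # control locus
}


def fold_elongation(human_bp, primate_mean_bp):
    """Ratio of the human GPI period to the non-human primate mean."""
    return human_bp / primate_mean_bp


folds = {locus: fold_elongation(*periods) for locus, periods in GPI_PERIOD_BP.items()}
```

Rounded to the nearest integer, the ratios are 16 for APOE, 15 for HBB, and 6 for CFTR, against 1.5 for the TP53 control.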
Conversely, nine of 42 McLean 2011 hCONDELs on chromosome 7 show GPID breaks present in the chimpanzee genome at syntenic positions that are absent in the human genome (
Figure 7B). The genes flanking these lost grammar breaks include OR6V1 and TAS2R41 (olfactory and taste receptors), MEST (an imprinted growth regulator), ZNF804B (a schizophrenia GWAS locus), and POT1 (telomere maintenance). As with the gains at APOE/HBB/CFTR, the spatial grammar changed while the compositional inventory was preserved.
Crucially, the repeat composition is nearly identical across species at these loci; only the spatial organization differs — GPI gains and losses therefore reflect grammar reorganization of existing repeat elements, not compositional change (
Figure 7C). Together, the human-specific GPI gains and losses identify a grammar reorganization programme that corresponds to known biological differences between humans and other primates. The consequence of this reorganization is a larger target for pathogenic variation at human disease loci and a reduced regulatory grammar at sites of human trait divergence (
Figure 7D). Within the GPID break regions created by these grammar gains, Prithvi-class variants show 69% pathogenicity compared to 29.1% for Agni-class variants at the same positions, consistent with the positional entropy model (
Figure 7E). Together, the three-level grammar framework — GPI periodicity, Panchamahābhūta state class, and Sandhi transition rules — provides a unified sequence-based account of human-specific regulatory innovation and disease vulnerability (
Figure 7F).
The molecular mechanism of grammar gain is consistent with the Tridosha composition of human-specific transposable elements. Human-specific Alu elements (n = 786) fall predominantly in Vāyu- and Ākāśa-class sequence (48% and 50% respectively; 98% combined Vāta), the same two grammar states that characterise distal enhancers (dELS: 70% Vāta;
Section 3.6). Human-specific LINE elements (L1HS, L1PA2; n = 450) show the complementary pattern: 55% fall in Agni-class sequence, a 12.7-fold enrichment over the genome baseline (p < 0.001), consistent with preferential insertion into promoter-proximal, CpG-rich regions. Together, these data indicate that human-specific Alu insertions expanded the distal enhancer grammar of the human genome, while LINE insertions were enriched at promoter-associated positions.
This compositional shift — absent in chimpanzee and gorilla at syntenic positions — provides a candidate sequence-level mechanism for the human-specific elongation of the GPI period observed at APOE, HBB, and CFTR (Extended Data
Figure 2). To test whether these grammar-class Alu insertions create functional enhancer-promoter contacts in the human brain, we intersected the 786 human-specific Alu positions with published chromatin loop calls from four human brain tissues: fetal cortical plate (CP; post-mitotic neurons), germinal zone (GZ; neural progenitors), fetal cerebral cortex, and adult anterior temporal cortex. Of 786 Alu elements, 193–219 (24.6–27.9%) overlapped Hi-C loop anchors across the four tissues; 53.5–66.5% of these contacts connected to gene promoters (p ≈ 0, binomial test versus 2% random expectation).
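A one-sided binomial test against a 2% expectation produces p-values too small for ordinary floating-point arithmetic, which is why such results are reported as p ≈ 0. The sketch below sums the upper tail in log space. The example counts (103 promoter contacts out of 193 anchors, roughly 53.5%) are an illustrative reading of the ranges above, not figures taken from the source.

```python
import math


def log10_binomial_upper_tail(k, n, p):
    """log10 of P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)
    # log-sum-exp: factor out the largest term to avoid underflow
    log_tail = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_tail / math.log(10)


# Illustrative counts only (assumed from the quoted percentages).
example_log10_p = log10_binomial_upper_tail(103, 193, 0.02)
```

Reporting the base-10 logarithm keeps the magnitude of the result visible rather than collapsing it to zero.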
Critically, 68 genes were contacted by human-specific Alu-anchored loops across all four brain tissue types, including DISC1 and TSNAX-DISC1 (Disrupted in Schizophrenia 1 locus), NRG3 (Neuregulin-3; schizophrenia and bipolar disorder GWAS), GRIA4 (glutamate receptor; autism and epilepsy), NCAM1 (neural cell adhesion molecule; cognitive function), and PBX1 (transcription factor; autism and intellectual disability). Because the Alu insertions are human-specific — absent at syntenic positions in chimpanzee and gorilla — these enhancer-promoter contacts are also human-specific. The genes themselves are conserved across primates; what changed in the human lineage is the regulatory architecture connecting them to distal enhancers. The grammar framework provides a sequence-level description of this architectural change: human-specific Alu insertions occupy Vāyu- and Ākāśa-class sequence (98% Vāta), the grammar states that define distal enhancer positions and are enriched at GPID breaks. Whether this regulatory expansion causally contributes to human cognitive complexity or psychiatric disease risk requires functional validation beyond the scope of the present analysis; the observation is consistent with, but does not establish, such a causal relationship. [
20,
21].
3.7. Tridosha Segmentation of the Genome and the Prakriti of the Regulome
Applying the Tridosha condensation across chromosome 17, the landscape is predominantly Vāta (61.2%), with Kapha forming the structural background (34.5%) and Pitta occupying only 4.3% of the chromosome, concentrated at GPID break positions (Extended Data
Figure 4A–B). Of all Pitta-class positions on chromosome 17, 30.6% overlap GPID breaks — 3.7-fold above the 8.2% genome baseline (OR = 15.2×, p ≈ 0) — confirming that Pitta-class sequence is not uniformly distributed but is specifically recruited at positions of functional grammar demand. Kapha is reciprocally depleted: only 5.6% of Kapha-class positions overlap GPID breaks, below the genome baseline, consistent with the structural commitment of Kapha-class sequence being incompatible with functional flexibility (Extended Data
Figure 4C).
The Tridosha framework reveals the Prakriti (inherent constitution) of each class of cis-regulatory element (Extended Data
Figure 4D–F). The promoter is the only Pitta-dominant element in the regulome (61.4% Pitta), consistent with its role as the primary site of transcriptional ignition. All other regulatory elements are Vāta-dominant, but differ systematically in their Pitta-to-Kapha ratio: proximal enhancers show a Vāta-Pitta Dvandva (dual constitution; 61.2% Vāta, 26.2% Pitta), reflecting their residual ignition capacity; distal enhancers and CTCF insulators both show Vāta-Kapha Dvandva (67% Vāta, 7–26% Kapha), consistent with their roles as long-range mobile signals and structural boundary elements respectively. Protein-coding exons occupy an intermediate position (66% Vāta, 18.4% Pitta), reflecting the functional but non-ignition character of coding sequence. The Prakriti gradient from promoter to insulator thus follows a continuous Pitta depletion along a Vāta carrier, with Kapha increasing as genomic sequence transitions from functional to structural.
The low classification accuracy of a grammar-based element classifier (33.4% overall,
Section 3.6) is therefore not a failure of the framework but an accurate reflection of the biology: promoters are Pitta-dominant (61.4% Agni), while distal enhancers are Vāta-dominant (70% Vāyu+Ākāśa); both contain Pitta, but in different dosages and genomic contexts. The distinction between promoter and enhancer is not solely a sequence property; it is also a positional property determined by TSS distance and three-dimensional chromatin contact, which no sequence grammar alone can resolve. What the Tridosha framework provides instead is a continuous measure of regulatory potential: Pitta fraction quantifies ignition capacity, Vāta fraction quantifies regulatory mobility, and Kapha fraction quantifies structural commitment. Together these three axes constitute the sequence-based Prakriti of any genomic position.
The Tridosha grammar predicts evolutionary constraint independently of annotation. We assigned a Tridosha class to each of 2,171 chromosome 17 genes based on TSS grammar alone and cross-referenced these assignments with gnomAD v4.1 constraint metrics [26]. Pitta-class genes show markedly elevated intolerance to loss-of-function variation: 28.0% of Pitta-class genes have pLI ≥ 0.9 (highly constrained), compared to 12.7% of Vāta-class genes (OR = 2.64, p = 7.6×10⁻¹¹, Mann-Whitney U test) and 18.2% of Kapha-class genes (Extended Data Figure 5A–C). Median LOEUF scores, for which lower values indicate stronger constraint, likewise identify Pitta-class genes as the most constrained: Pitta 0.844, Vāta 1.018, Kapha 1.071 (Kruskal-Wallis p = 5.4×10⁻¹⁰). The elevated constraint of Pitta-class genes is therefore consistent across both gnomAD constraint metrics. The association is strengthened at GPID break positions: Pitta genes whose TSS overlaps a GPID break show the highest median pLI of any group (0.005), while Vāta genes at GPID breaks remain unconstrained (median pLI = 0.000; Extended Data Figure 5D–E). A continuous analysis confirms a monotonic relationship between Pitta fraction at the TSS and pLI score across all 2,171 genes (Extended Data Figure 5D). These results indicate that the Tridosha grammar encodes evolutionary constraint from sequence alone: Pitta marks the sites of transcriptional ignition that evolution cannot afford to vary.
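The odds ratio for high constraint in Pitta-class versus Vāta-class genes can be approximately reproduced from the rounded percentages reported above. The sketch below is a sanity check under that assumption, not the original calculation; the function name is hypothetical. Working from rounded percentages gives approximately 2.67, close to the reported 2.64, and a small difference of this kind would be expected if the reported value was computed from raw gene counts.

```python
def odds_ratio_from_proportions(p_group, p_reference):
    """Odds ratio comparing two proportions, each strictly between 0 and 1."""
    if not (0.0 < p_group < 1.0 and 0.0 < p_reference < 1.0):
        raise ValueError("proportions must lie strictly between 0 and 1")
    odds_group = p_group / (1.0 - p_group)
    odds_reference = p_reference / (1.0 - p_reference)
    return odds_group / odds_reference


# Rounded proportions of genes with pLI >= 0.9, as reported in the text.
or_pitta_vs_vata = odds_ratio_from_proportions(0.280, 0.127)
print(round(or_pitta_vs_vata, 2))  # 2.67
```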
3.9. Genome-Wide Grammar Validation: Tridosha Biochemical Hierarchy and Disease Associations
To address the possibility that the five-state grammar is a chr17-specific artefact, we trained a genome-wide GMM on 1,116,212 non-overlapping 200 bp windows sampled uniformly across all 23 human chromosomes (50,000 windows per chromosome, hg38 local FASTA). The genome-wide model independently recovers the same five-state structure (BIC minimum at k=5, all 10 initialisations converging), with consistent state compositions across chromosomes (Figure S5A–B).
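For readers wishing to reproduce the input to the mixture model, the following minimal sketch shows one way to cut a sequence into non-overlapping 200 bp windows and compute a 16-dimensional dinucleotide frequency vector for each. It makes several assumptions that are not specified in the text: dinucleotides are counted at every offset within a window (overlapping pairs), pairs containing N or other ambiguity codes are skipped, and a trailing partial window is dropped. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def non_overlapping_windows(sequence, size=200):
    """Yield consecutive windows of `size` bp; a trailing partial window is dropped."""
    for start in range(0, len(sequence) - size + 1, size):
        yield sequence[start:start + size]


def dinucleotide_profile(window):
    """Return the 16 dinucleotide frequencies of a window (ASSUMED overlapping counts)."""
    window = window.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        if pair in counts:  # skips pairs containing N or other ambiguity codes
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


# Toy 400 bp sequence, giving two windows.
toy_sequence = "ACGT" * 100
profiles = [dinucleotide_profile(w) for w in non_overlapping_windows(toy_sequence)]
print(len(profiles), len(profiles[0]))
```

The resulting matrix of profiles (one row per window) is the kind of input on which a PCA and Gaussian mixture model would then be fitted.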
The genome-wide grammar states recapitulate the Tridosha biochemical hierarchy without any prior annotation. The Pitta state (S4) is characterised by the highest CpG frequency (0.038) and GC content (0.307) across all states, and is concentrated on gene-dense chromosomes: chr19 (18.1%), chr22 (14.9%), and chr17 (11.3%), exactly the chromosomes predicted to be Pitta-dominant by the Āyurvedic framework (Figure S5B). The Kapha states (S2+S3) show the highest AT content (0.434) and lowest CpG (0.042), concentrated on gene-poor chromosomes (chr4, chr21). The Vāta states (S0+S1) show intermediate composition and are enriched on chrX and chr18 (Figure S5A).
Strikingly, the disease associations of each Tridosha class, derived entirely from ClinVar pathogenic variant positions matched to genome-wide grammar states, recapitulate classical Āyurvedic clinical taxonomy. Pitta-class positions (n = 9,593 pathogenic variants) are most strongly associated with metabolic and cardiovascular disorders: familial hypercholesterolaemia (n=708), tuberous sclerosis (n=531), long QT syndrome (n=292), and Von Hippel-Lindau syndrome (n=166). Vāta-class positions (n = 2,650) are enriched for neural and movement disorders: primary ciliary dyskinesia (n=100), glycine encephalopathy (n=82), RASopathy (n=51), and spastic paraplegia (n=49). Kapha-class positions (n = 6,682) are enriched for structural tumour suppressor and DNA repair disorders: Li-Fraumeni syndrome/TP53 (n=582), Lynch syndrome (n=295), and beta-thalassaemia (n=141). This correspondence (Pitta with metabolic/fire disorders, Vāta with neural/movement disorders, Kapha with structural/growth disorders) was described in the Charaka Samhita approximately 300 BCE and is here recovered from raw DNA sequence without clinical annotation (Figure S5C).
Per-chromosome Pitta+GPID enrichment for pathogenic variants (Figure S5D) identifies chromosomes where structural disease gene clusters drive the positional constraint signal: chr16 (OR=2.83, p=1.0×10⁻¹³), chr20 (OR=2.18, p=2.7×10⁻⁷), chr12 (OR=1.62, p=2.3×10⁻⁵), and chr17 (OR=1.56, p=1.1×10⁻⁶). Chromosomes dominated by promoter-associated Agni grammar show Pitta+GPID depletion, consistent with the prediction that the positional constraint signal requires structurally rigid sequence context rather than promoter-associated grammar.
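A per-chromosome enrichment of this kind can be expressed as a 2×2 contingency table (pathogenic versus background positions, inside versus outside Pitta+GPID windows). The sketch below computes the sample odds ratio and a one-sided Fisher exact p-value for such a table. The statistical test behind the reported p-values is not stated in this section, so the use of Fisher's exact test is an assumption, and the counts in the example are hypothetical rather than taken from our data.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined when b or c is zero")
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value: P(top-left cell >= a) with margins fixed."""
    row1 = a + b      # e.g. positions inside Pitta+GPID windows
    col1 = a + c      # e.g. pathogenic variant positions
    n = a + b + c + d
    upper = min(row1, col1)
    # Hypergeometric tail; comb returns 0 for impossible configurations.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# HYPOTHETICAL counts: rows = inside/outside Pitta+GPID windows,
# columns = pathogenic/background positions.
a, b, c, d = 12, 8, 5, 15
print(odds_ratio(a, b, c, d))          # 4.5
p_value = fisher_exact_greater(a, b, c, d)
print(p_value)
```

An odds ratio below one with the same construction would correspond to the Pitta+GPID depletion described for chromosomes dominated by promoter-associated Agni grammar.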