Experimental Setup.
We evaluate CoMT on natural language processing tasks via language modeling with the LLaMA model family, ranging from 130M to 1B parameters, on the C4 dataset [24], using the experimental scripts and settings provided by [11]. We employ the Adam optimizer [25], with one learning rate for models up to 350M parameters and a separate one for the 1B-parameter model. CoMT introduces a single additional hyperparameter: the hidden dimension of the modulator. Throughout the experiments shown in Table 3, we set the hidden dimension to 2. We also conduct a sensitivity analysis on the hidden dimension, shown in Table 5, and observe that while increasing the hidden size yields marginal improvements, the primary performance gains of CoMT stem from the introduction of contextual modulation rather than from larger hidden dimensions.
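For concreteness, the modulator can be pictured as a tiny bottleneck network that produces a per-token gate. The sketch below is an illustrative assumption rather than the paper's exact architecture: the class name ContextualModulator, the SiLU bottleneck, and the 1 + tanh gating form are our own choices. Its only purpose is to show how a hidden dimension as small as 2 keeps the modulator lightweight.

```python
import torch
import torch.nn as nn

class ContextualModulator(nn.Module):
    """Tiny bottleneck MLP producing a per-token scalar gate centered around 1.

    Illustrative sketch only; the exact form of CoMT's modulator is not
    restated in this section.
    """
    def __init__(self, d_model: int, hidden_dim: int = 2):
        super().__init__()
        self.down = nn.Linear(d_model, hidden_dim)  # compress token features
        self.act = nn.SiLU()
        self.up = nn.Linear(hidden_dim, 1)          # map to one scalar per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> gate: (batch, seq_len, 1) in (0, 2),
        # so each token can amplify (>1) or suppress (<1) its transformation
        return 1.0 + torch.tanh(self.up(self.act(self.down(x))))
```

Under this assumed form, each modulator adds only on the order of 2·d_model parameters, which is consistent with the observation in Table 5 that larger hidden sizes bring only marginal additional gains.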
Main Results for Language Modeling.
Table 3 compares perplexities achieved by various normalization strategies on LLaMA models of different scales. Across all settings, our proposed CoMT consistently outperforms prior approaches, achieving the lowest perplexity on every model size.
Table 3.
Perplexity (↓) of different layer normalization methods across LLaMA model sizes. The original LLaMA model uses Pre-LN, which we term the baseline. The best result in each column is highlighted in bold.
| | LLaMA-130M | LLaMA-250M | LLaMA-350M | LLaMA-1B |
| --- | --- | --- | --- | --- |
| Training Tokens | 2.2B | 3.9B | 6.0B | 8.9B |
| Post-LN | 26.95 | 1409.79 | 1368.33 | 1390.75 |
| DeepNorm | 27.17 | 22.77 | 1362.59 | 1409.08 |
| Post-LN + CoMT | 26.69 | 21.45 | 19.11 | 17.68 |
| Mix-LN | 26.07 | 21.39 | 1363.21 | 1414.78 |
| Pre-LN (baseline) | 26.73 | 21.92 | 19.58 | 17.02 |
| Pre-LN + LayerNorm Scaling | 25.76 | 20.35 | 18.20 | 15.71 |
| Pre-LN + CoMT | **23.77** (11.07%↓) | **19.74** (9.95%↓) | **17.98** (8.17%↓) | **15.08** (11.40%↓) |
Moreover, we implement CoMT in both Pre-LN and Post-LN settings. Notably, traditional Post-LN exhibits severe degradation as model size increases, reflecting known instability issues in deep transformers when normalization is applied after the residual addition [11, 15]. DeepNorm, which likewise assigns scaling weights to the residual sublayers, also collapses on larger models such as LLaMA-350M and LLaMA-1B. The likely reason is that, despite its careful initialization, its scaling remains input-agnostic and static, limiting its capacity for adaptive feature modulation.
In contrast, CoMT directly addresses these limitations by introducing fine-grained, token-specific modulation of the functional residual pathways, enabling dynamic adaptation to input semantics without disrupting the identity-preserving residual flow. By learning contextual scaling gates jointly with the network parameters, CoMT allows the model to amplify or suppress feature transformations based on the content of each input, resulting in more effective gradient flow and better feature utilization throughout depth. Consequently, even when combined with Post-LN, CoMT stabilizes training as the network grows deeper. Empirically, CoMT achieves substantial perplexity reductions over the original Post-LN and Pre-LN models, and it outperforms the strongest prior baseline at every scale.
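To make the mechanism concrete, the minimal sketch below wraps the two functional branches of a Pre-LN decoder block with per-token gates, reusing the ContextualModulator sketched earlier. It is a simplified illustration under our own assumptions: the actual method modulates individual projections inside each block (see Table 4) rather than whole branches, and the Post-LN variant is analogous.

```python
import torch
import torch.nn as nn

class ModulatedPreLNBlock(nn.Module):
    """Pre-LN block whose functional branches are scaled by per-token gates.

    Simplified sketch; gates are applied per branch here, whereas the full
    method applies them at a finer, per-projection granularity.
    """
    def __init__(self, d_model: int, n_heads: int, mod_hidden: int = 2):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )
        # one modulator per functional branch (ContextualModulator defined above)
        self.mod_attn = ContextualModulator(d_model, mod_hidden)
        self.mod_mlp = ContextualModulator(d_model, mod_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # gate is computed from the sublayer input; the identity path is untouched
        x = x + self.mod_attn(h) * attn_out
        h2 = self.ln2(x)
        x = x + self.mod_mlp(h2) * self.mlp(h2)
        return x
```

The key design point the sketch illustrates is that only the transformed branch is rescaled; the identity shortcut passes through unchanged, which is why the gating does not interfere with residual signal propagation.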
Table 4.
Ablation study of modulators applied to different functional modules in CoMT.
| Modules | LLaMA-60M | LLaMA-130M |
| --- | --- | --- |
| Baseline | 33.28 | 26.07 |
| self-attention | 34.86 | 25.21 |
| mlp | 33.35 | 24.40 |
| only_last | 32.96 | 24.47 |
| only_first | 32.71 | 24.10 |
| only_qk | 33.67 | 24.74 |
| all | 32.69 | 23.77 |
Table 5.
Ablation study on modulator hidden dimension for CoMT.
| Hidden Size | LLaMA-60M | LLaMA-130M |
| --- | --- | --- |
| Baseline | 33.28 | 26.07 |
| 2 | 32.69 | 23.83 |
| 4 | 32.58 | 23.76 |
| 8 | 32.21 | 23.72 |
| 16 | 32.56 | 23.82 |
| 32 | 32.36 | 23.59 |
Contribution of Functional Components.
To investigate how different components contribute to the functional flow, we selectively apply CoMT modulators to individual modules within each decoder block. Table 4 summarizes the ablation results on the LLaMA-60M and LLaMA-130M models. As shown in Table 4, applying modulators to all components (“all”) consistently yields the best perplexity on both LLaMA-60M and LLaMA-130M, indicating that full-path modulation provides the most comprehensive control over residual transformation dynamics.
Interestingly, selectively modulating only the query and key projections (“only_qk”) or applying modulation solely to the self-attention block (“self-attention”) yields inconsistent results across model scales and, in the case of LLaMA-60M, even underperforms the baseline. This suggests that partial modulation within isolated components may disrupt the balance of residual information flow. In contrast, modulating only the output layers (“only_last,” including o_proj and down_proj) or only the input layers (“only_first,” including q_proj, k_proj, v_proj, up_proj, and gate_proj) leads to moderate improvements, demonstrating that both early and late transformations contribute to effective modulation. Nevertheless, the best performance is consistently achieved when modulators are applied to all layers within each block. These findings emphasize that CoMT’s strength lies in its comprehensive, token-aware modulation across the full residual transformation path, encompassing both the attention and feedforward modules.
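As a concrete reading of the ablation settings, the mapping below lists which LLaMA-style projections would carry a modulator under each variant. The “only_first,” “only_last,” and “only_qk” groups follow the description above; the exact membership of the “self-attention” and “mlp” variants is our assumption, not a released configuration.

```python
# Illustrative mapping from the Table 4 ablation settings to LLaMA-style
# projection names whose outputs receive a modulator. The "self-attention"
# and "mlp" groupings are assumptions inferred from the module names.
ATTENTION_PROJECTIONS = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_PROJECTIONS = ["up_proj", "gate_proj", "down_proj"]

MODULATED_PROJECTIONS = {
    "self-attention": ATTENTION_PROJECTIONS,
    "mlp": MLP_PROJECTIONS,
    "only_qk": ["q_proj", "k_proj"],
    "only_first": ["q_proj", "k_proj", "v_proj", "up_proj", "gate_proj"],
    "only_last": ["o_proj", "down_proj"],
    "all": ATTENTION_PROJECTIONS + MLP_PROJECTIONS,
}

def should_modulate(setting: str, projection_name: str) -> bool:
    """Return True if the given projection is wrapped with a modulator under `setting`."""
    return projection_name in MODULATED_PROJECTIONS.get(setting, [])
```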