Submitted: 21 July 2025
Posted: 30 July 2025
Abstract

Keywords:
1. Introduction
- We introduce Zarvan, a novel architecture with linear complexity that uses a hybrid gating mechanism informed by dual context vectors.
- We detail its core components: the Holistic Context Extractor for capturing the sequence's essence and the Associative Context Extractor for focused, long-range memory.
- We provide a comprehensive empirical evaluation across five distinct benchmarks, demonstrating that Zarvan is accurate, fast, and versatile, making it a viable alternative to the Transformer.
2. Related Work
2.1. The Transformer Architecture
2.2. Efficient Transformer Alternatives
- Linear Attention: This family of models argues that the full self-attention matrix is redundant and can be effectively approximated. For instance, Linformer [6] projects the Key (K) and Value (V) matrices into a smaller, fixed dimension k before the matrix multiplication, reducing complexity from O(S²) to O(S·k). Other methods use kernel functions to approximate the softmax attention without explicitly forming the full matrix (an illustrative sketch of the projection idea appears after this list). Comparison with Zarvan: While these methods are effective, they are fundamentally approximations of the original attention mechanism. Zarvan takes a conceptually different path: it does not approximate attention; it replaces it entirely. Zarvan operates on the hypothesis that, for many tasks, direct pairwise token interactions are unnecessary. Instead, it posits that explicit, global summary vectors can provide sufficient context for updating token representations, thereby avoiding the approximation errors inherent in low-rank or kernel-based methods.
- State Space Models (SSMs): A highly successful line of work, including S4 [2] and the more recent Mamba [1], models a sequence as a structured linear state-space system; Mamba additionally makes the state updates input-dependent ("selective"). These models achieve linear-time inference and near-linear-time parallelized training while maintaining a "memory" of the sequence in a compact state that evolves over time. Comparison with Zarvan: Zarvan is architecturally distinct from SSMs. It is not based on a state-space formalism or recurrence. Instead of compressing history into an evolving hidden state, Zarvan's core block performs a stateless, parallel computation over the full sequence at each layer to derive its global context vectors, which are then used in a local, token-wise update. While both achieve linear complexity, Zarvan offers a non-recurrent, conceptually simpler framework for global information aggregation.
- Recurrent Architectures with Linear Complexity: Recent models such as RWKV [3] blend the strengths of RNNs and Transformers. They achieve O(S) complexity by formulating attention in a recurrent manner, which allows both parallelizable training (similar to Transformers) and efficient, constant-memory inference (similar to RNNs). Comparison with Zarvan: Although Zarvan's gating mechanism is inspired by RNNs, it is a non-recurrent architecture during both training and inference. The computation within a Zarvan block does not depend on the output of the previous token in the same layer; the entire sequence is processed in parallel to compute the global contexts, making its operational paradigm fully parallel, much like the original Transformer.
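To make the contrast with low-rank methods concrete, the following is a minimal, illustrative PyTorch sketch of the projection idea behind Linformer-style linear attention. The function name, the fixed projection length `k_dim`, and all shapes are assumptions for illustration, not the implementation of [6].

```python
import torch
import torch.nn.functional as F

def linformer_style_attention(q, k, v, proj_k, proj_v):
    """Illustrative low-rank attention: K and V are projected from sequence
    length S down to a fixed length k_dim, so the score matrix has shape
    (S x k_dim) instead of (S x S)."""
    # q, k, v: (batch, S, d); proj_k, proj_v: (k_dim, S) learnable projections
    k_low = torch.einsum('ks,bsd->bkd', proj_k, k)    # (batch, k_dim, d)
    v_low = torch.einsum('ks,bsd->bkd', proj_v, v)    # (batch, k_dim, d)
    scores = torch.einsum('bsd,bkd->bsk', q, k_low) / (q.size(-1) ** 0.5)
    attn = F.softmax(scores, dim=-1)                   # (batch, S, k_dim)
    return torch.einsum('bsk,bkd->bsd', attn, v_low)   # (batch, S, d)

# Toy usage: a length-512 sequence compressed to 64 projected "landmarks".
B, S, d, k_dim = 2, 512, 128, 64
q, k, v = (torch.randn(B, S, d) for _ in range(3))
proj_k, proj_v = torch.randn(k_dim, S), torch.randn(k_dim, S)
print(linformer_style_attention(q, k, v, proj_k, proj_v).shape)  # (2, 512, 128)
```

Because `k_dim` is fixed, the cost grows linearly in S; Zarvan, by contrast, avoids pairwise scores altogether and relies on global summary vectors, as described in Section 3.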
3. The Zarvan Architecture

3.1. Dual Context Extractors
- Project the input sequence X into query-like scores Q and values V: Q = X·W_q and V = X·W_v, where W_q and W_v are learnable weight matrices.
- Reshape Q and V to accommodate multiple heads (H), so that each head h operates on its own scores q^(h) and values v^(h).
- Compute weights and aggregate values for each head: w^(h) = softmax(q^(h)), with the softmax taken over the sequence dimension, and c^(h) = Σ_{i=1..S} w_i^(h) · v_i^(h).
- Combine the head outputs to form the final context vector: c = Concat(c^(1), ..., c^(H)) (an illustrative PyTorch sketch of this extractor follows this list).
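The following is a minimal PyTorch sketch of the context-extractor computation described above, written as attention-style pooling over the sequence. The module and layer names (`ContextExtractor`, `score_proj`, `value_proj`), the one-score-per-head choice, and the output projection are assumptions for illustration, not the reference implementation; the Associative Context Extractor would follow the same pattern with its own projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextExtractor(nn.Module):
    """Sketch of a multi-head context extractor: project tokens to per-head
    scores and values, softmax the scores over the sequence dimension, and
    take the weighted sum as one global context vector per head."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.score_proj = nn.Linear(d_model, num_heads)  # W_q: one score per head
        self.value_proj = nn.Linear(d_model, d_model)    # W_v
        self.out_proj = nn.Linear(d_model, d_model)      # combines head outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, S, d_model)
        B, S, _ = x.shape
        scores = self.score_proj(x)                                # (B, S, H)
        values = self.value_proj(x).view(B, S, self.num_heads, self.head_dim)
        weights = F.softmax(scores, dim=1)                         # softmax over the sequence
        # Weighted sum over the sequence gives one context vector per head.
        context = torch.einsum('bsh,bshd->bhd', weights, values)   # (B, H, head_dim)
        return self.out_proj(context.reshape(B, -1))               # (B, d_model)

x = torch.randn(2, 256, 128)
print(ContextExtractor(128, 4)(x).shape)  # torch.Size([2, 128])
```

The key property is that the output is a single, global summary vector per sequence (one per head before concatenation), which is exactly what the gated update in Section 3.2 consumes.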
3.2. The Gated Update Mechanism
- Expand Contexts: Expand the global context vectors to match the sequence length S, so that every token position receives the same holistic and associative summaries.
- Form Gate Input: For each token x_t, form the gate network's input z_t by combining the token's representation with the expanded global context vectors.
- Compute Gates: Generate the input and forget gates from the GateNet: i_t, f_t = σ(GateNet(z_t)).
- Apply Gated Update: Modulate the information flow for each token: y_t = i_t ⊙ x_t + f_t ⊙ h_t, with h_t = W_h·z_t, where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_h is a learnable linear projection. This gated update mechanism is inspired by similar concepts in recurrent architectures such as LSTMs [7] and GRUs [8]. The input gate (i_t) dynamically controls how much of the original token representation (x_t) is passed through, while the forget gate (f_t) modulates the flow of the contextually transformed information (h_t). This allows each token to decide whether to preserve its local identity or to incorporate the richer, sequence-aware perspective provided by the global contexts.
- Apply FFN and Residual Connection: The final output of the block is computed through a standard feed-forward network (FFN) and a residual connection with layer normalization, a common practice for stabilizing deep networks (a PyTorch sketch of this step, together with the gated update, follows this list).
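Below is a minimal PyTorch sketch of the gated update and the subsequent FFN/residual step as described above. The module name `GatedUpdate`, the concatenation of token and contexts, the FFN width, and the exact placement of layer normalization are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedUpdate(nn.Module):
    """Sketch of the token-wise gated update: each token sees the expanded
    global contexts, computes input/forget gates, and mixes its original
    representation with a context-transformed one, followed by an FFN."""
    def __init__(self, d_model: int, ctx_dim: int):
        super().__init__()
        gate_in = d_model + ctx_dim                        # token + concatenated contexts
        self.gate_net = nn.Linear(gate_in, 2 * d_model)    # produces i_t and f_t
        self.transform = nn.Linear(gate_in, d_model)       # W_h: contextual transform
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, contexts: torch.Tensor) -> torch.Tensor:
        # x: (B, S, d_model); contexts: (B, ctx_dim) global summary vectors
        B, S, _ = x.shape
        ctx = contexts.unsqueeze(1).expand(B, S, -1)        # expand to sequence length S
        z = torch.cat([x, ctx], dim=-1)                     # gate input z_t
        i_t, f_t = torch.sigmoid(self.gate_net(z)).chunk(2, dim=-1)
        h_t = self.transform(z)                             # context-aware transform W_h * z_t
        y = i_t * x + f_t * h_t                             # gated mix of local and global
        y = self.norm1(y + x)                               # residual + layer norm
        return self.norm2(y + self.ffn(y))                  # FFN block with residual

x = torch.randn(2, 256, 128)
ctx = torch.randn(2, 256)  # e.g., holistic + associative contexts concatenated
print(GatedUpdate(128, 256)(x, ctx).shape)  # torch.Size([2, 256, 128])
```

Note that i_t and f_t are computed per token from the same input z_t, so every position independently trades off its local representation against the context-transformed one, with no dependence on other positions' outputs within the layer.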
4. Experiments and Results
4.0. Experimental Setup
4.1. Scalability on IMDb Sentiment Classification

4.2. Generalization to Vision as a Sequence (MNIST)

4.3. Information Retrieval on MS MARCO
4.4. Long-Range Dependency Benchmarks
4.5. Advanced Component Analysis and Sensitivity on Synthetic Tasks
4.5.1. The Role of Associative Context in Information Retrieval: The Selective Copy Task
4.5.2. Decoupling Memory and Reasoning: The Stateful Parity Check Task
4.5.3. Multi-Step Reasoning and Sensitivity Analysis: The Categorical Sum Task
| Model | Ranking Accuracy (%) | Total Time (s) |
|---|---|---|
| Zarvan | 63.85 | 217.93 |
| Transformer | 64.40 | 235.04 |
| Task | Model | Final Accuracy (%) |
|---|---|---|
| Selective Copy | Zarvan | 100.00 |
| Adding Problem | Zarvan | 98.80 |
| Adding Problem | Transformer | 99.00 |
| Model | Best Validation Accuracy (%) |
|---|---|
| Zarvan (Full) | 99.88 |
| Transformer | 99.92 |
| Zarvan (No Gating) | 99.12 |
| Zarvan (No Holistic) | 98.62 |
| Zarvan (No Associative) | 3.52 |
5. Discussion
5.1. Implications and Strengths
5.2. Limitations
5.3. Future Work
6. Conclusion
References
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752. [CrossRef]
- Gu, A., Goel, K., & Re, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations (ICLR) 2022.
- Peng, B., Alcaide, E., Anthony, Q., Al-Ghamdi, A., Fan, W., & Wang, L. (2023). RWKV: Reinventing RNNs for the Transformer Era. arXiv preprint arXiv:2305.13048. [CrossRef]
- Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., & Metzler, D. (2020). Long Range Arena: A Benchmark for Efficient Transformers. arXiv preprint arXiv:2011.04006. [CrossRef]
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
- Wang, S., Li, B.Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768. [CrossRef]
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780. [CrossRef]
- Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078. [CrossRef]




| Hyperparameter | IMDb | MNIST | MS MARCO | Adding Problem | Selective Copy |
|---|---|---|---|---|---|
| Sequence Length | 128, 256, 512 | 784 | 128 | 200 | 256 |
| Embedding Dim | 128 | 128 | 128 | 128 | 128 |
| Hidden Dim (Zarvan) | 256 | 256 | 256 | 256 | 256 |
| FF Dim (Transformer) | 512 | 512 | 512 | 256 | 256 |
| Num Layers | 2 | 2 | 2 | 4 | 2 |
| Num Heads | 4 | 4 | 4 | 4 | 4 |
| Batch Size | 128 | 128 | 128 | 128 | 64 |
| Num Epochs | 5 | 10 | 3 | 5 | 10 |
| Learning Rate | 1e-4 | 1e-3 | 5e-5 | 2e-4 | 1e-4 |
| Weight Decay | 1e-2 | 1e-2 | 0 | 0 | 0 |
| Loss Function | CrossEntropy | CrossEntropy | TripletMargin | CrossEntropy | CrossEntropy |
| Model (parameter count) | IMDb | MNIST | MS MARCO |
|---|---|---|---|
| Zarvan | ~3.5M | ~0.5M | ~4.2M |
| Transformer | ~3.9M | ~0.8M | ~4.5M |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).