Submitted:
16 February 2025
Posted:
28 February 2025
Abstract
Transformers have become the backbone of numerous advancements in deep learning, excelling across domains such as natural language processing, computer vision, and scientific modeling. Despite their remarkable performance, the high computational and memory costs of the standard Transformer architecture pose significant challenges, particularly for long sequences and resource-constrained environments. In response, a wealth of research has been dedicated to improving the efficiency of Transformers, resulting in a diverse array of innovative techniques. This survey provides a comprehensive overview of these efficiency-driven advancements. We categorize existing approaches into four major areas: (1) approximating or sparsifying the self-attention mechanism, (2) reducing input or intermediate representation dimensions, (3) leveraging hierarchical and multiscale architectures, and (4) optimizing hardware utilization through parallelism and quantization. For each category, we discuss the underlying principles, representative methods, and the trade-offs involved. We also identify key challenges in the field, including balancing efficiency with performance, scaling to extremely long sequences, addressing hardware constraints, and mitigating the environmental impact of large-scale models. To guide future research, we highlight promising directions such as unified frameworks, dynamic and sparse architectures, energy-aware designs, and cross-domain adaptations. By synthesizing the latest advancements and providing insights into unresolved challenges, this survey aims to serve as a valuable resource for researchers and practitioners seeking to develop or apply efficient Transformer models. Ultimately, the pursuit of efficiency is crucial for ensuring that the transformative potential of Transformers can be realized in a sustainable, accessible, and impactful manner.
Keywords:
1. Introduction
2. Background
2.1. The Standard Transformer Architecture
2.2. Computational Challenges of Transformers
2.3. Applications Driving the Need for Efficiency
3. Approaches to Efficiency
3.1. Approximating or Sparsifying the Self-Attention Mechanism
- Low-Rank Factorization: Techniques such as Linformer and Performer approximate full attention through lower-dimensional representations: Linformer projects keys and values down to a fixed, shorter length, while Performer replaces the softmax with random feature maps. Both exploit the approximately low-rank structure of attention matrices to reduce the quadratic cost to linear in sequence length (a minimal sketch follows this list).
- Sparse Attention: Models like Longformer and Big Bird compute only a subset of token interactions, using sparse patterns that combine sliding local windows, a small number of global tokens, and dilated or random connections, which offers flexibility and scalability for long sequences (a windowed-attention sketch also follows this list).
- Clustering and Grouping: Methods like Reformer use locality-sensitive hashing to bucket similar tokens so that attention is computed only within each bucket, reducing the number of pairwise comparisons from quadratic to roughly O(n log n) [31]. Cluster-based approaches improve efficiency without significantly compromising accuracy.
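To make the low-rank idea concrete, the following PyTorch sketch projects keys and values down to a fixed length before the softmax, in the spirit of Linformer. The projection length, random (rather than learned) projection matrices, and tensor shapes are illustrative assumptions, not the published implementation.

```python
import torch

def lowrank_attention(q, k, v, proj_len=64):
    """Linformer-style attention sketch: project keys/values from length n
    down to proj_len before the softmax, so cost is O(n * proj_len) rather
    than O(n^2). Shapes: q, k, v are (batch, n, d)."""
    batch, n, d = q.shape
    # Learned in the real model; random here purely for illustration.
    E = torch.randn(proj_len, n) / n ** 0.5      # key projection
    F = torch.randn(proj_len, n) / n ** 0.5      # value projection
    k_proj = torch.einsum('kn,bnd->bkd', E, k)   # (batch, proj_len, d)
    v_proj = torch.einsum('kn,bnd->bkd', F, v)   # (batch, proj_len, d)
    scores = q @ k_proj.transpose(1, 2) / d ** 0.5   # (batch, n, proj_len)
    attn = torch.softmax(scores, dim=-1)
    return attn @ v_proj                          # (batch, n, d)

# Usage: 8 sequences of 1024 tokens, model dimension 64.
q = k = v = torch.randn(8, 1024, 64)
print(lowrank_attention(q, k, v).shape)  # torch.Size([8, 1024, 64])
```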
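Similarly, a minimal sketch of the sliding-window pattern used by Longformer-style sparse attention is shown below. For clarity it builds a dense band mask, which still materializes the full score matrix; real implementations use banded or blocked kernels so that only the windowed entries are ever computed. The window size and shapes are arbitrary choices for illustration.

```python
import torch

def sliding_window_attention(q, k, v, window=128):
    """Local-attention sketch: each token attends only to neighbours within
    `window` positions, so the useful work grows linearly with n. A dense
    mask is used here for clarity only."""
    batch, n, d = q.shape
    idx = torch.arange(n)
    band = (idx[None, :] - idx[:, None]).abs() <= window   # (n, n) band mask
    scores = q @ k.transpose(1, 2) / d ** 0.5
    scores = scores.masked_fill(~band, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

out = sliding_window_attention(torch.randn(2, 512, 64),
                               torch.randn(2, 512, 64),
                               torch.randn(2, 512, 64))
print(out.shape)  # torch.Size([2, 512, 64])
```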
3.2. Reducing Input or Intermediate Representations
- Token Pruning: Early-exit strategies stop computation at intermediate layers once a confident prediction is available, while token pruning drops tokens that contribute little to the final output, effectively shortening the sequence as it moves through the network (see the pruning sketch after this list).
- Pooling and Compression: Funnel Transformer progressively pools the sequence into shorter hidden representations, while PoolFormer replaces attention with simple pooling operators; both introduce compression stages that reduce computational requirements [33].
- Dimensionality Reduction: Reducing the dimensionality of embeddings or intermediate feature maps can also decrease memory usage and computation without significant performance degradation [34].
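As a rough illustration of token pruning, the sketch below scores tokens by the attention they receive and keeps only the top fraction for subsequent layers. The scoring rule, keep ratio, and shapes are simplifying assumptions; published methods typically use learned predictors or accumulated attention statistics.

```python
import torch

def prune_tokens(hidden, attn_weights, keep_ratio=0.5):
    """Token-pruning sketch: score each token by how much attention it
    receives (column sums of the attention matrix) and keep only the top
    fraction, shortening the sequence for all later layers.
    hidden: (batch, n, d); attn_weights: (batch, n, n)."""
    batch, n, d = hidden.shape
    importance = attn_weights.sum(dim=1)              # (batch, n)
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = importance.topk(n_keep, dim=-1).indices.sort(dim=-1).values
    # Gather the surviving tokens, preserving their original order.
    return torch.gather(hidden, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

hidden = torch.randn(4, 256, 64)
attn = torch.softmax(torch.randn(4, 256, 256), dim=-1)
print(prune_tokens(hidden, attn).shape)  # torch.Size([4, 128, 64])
```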
3.3. Hierarchical and Multiscale Architectures
- Hierarchical Transformers: These models, such as Swin Transformer, process information at multiple levels of granularity, capturing both local and global patterns. This structure is especially effective in computer vision tasks.
- Progressive Token Merging: Hierarchical models progressively merge neighbouring tokens at higher layers, reducing sequence length and focusing computational resources on salient features [35] (a patch-merging sketch follows this list).
- Multiscale Attention: Attention mechanisms that operate across multiple scales capture global dependencies while keeping the per-scale computation manageable.
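The following sketch shows the kind of progressive token merging used in hierarchical vision models such as Swin Transformer: each 2x2 block of spatial tokens is concatenated and linearly projected, halving the token grid at every stage. The dimensions and the single linear layer are illustrative assumptions rather than the exact published module.

```python
import torch

def merge_tokens(x, proj):
    """Patch-merging sketch: group each 2x2 block of spatial tokens,
    concatenate their features, and project 4*d -> 2*d, so the number of
    tokens drops by 4x at every hierarchy stage. x: (batch, H, W, d)."""
    b, h, w, d = x.shape
    x = x.reshape(b, h // 2, 2, w // 2, 2, d)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * d)
    return proj(x)   # (batch, H/2 * W/2, 2*d)

d = 96
proj = torch.nn.Linear(4 * d, 2 * d, bias=False)
tokens = torch.randn(2, 56, 56, d)       # e.g. a 56x56 feature map
print(merge_tokens(tokens, proj).shape)  # torch.Size([2, 784, 192])
```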
3.4. Optimizing Hardware Utilization
- Model and Pipeline Parallelism: Model parallelism partitions a network's layers or tensors across devices, while pipeline parallelism streams micro-batches through those partitions, enabling the training of models too large for a single accelerator [37] (see the sketch after this list).
- Quantization: Quantization stores parameters (and often activations) at reduced precision, decreasing memory usage and accelerating computation [38]. Post-training quantization and quantization-aware training are the most common approaches (a minimal int8 sketch follows this list).
- Hardware-Aware Designs: Inference and training frameworks such as TensorRT and DeepSpeed, together with custom hardware accelerators, optimize performance by tailoring computations (kernel fusion, memory layout, mixed precision) to specific hardware.
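A naive model-parallel sketch follows: the layer stack is split into two stages placed on different GPUs, with activations transferred between them; pipeline parallelism extends this by streaming micro-batches through the stages so both devices stay busy. The layer sizes and the assumption of two available CUDA devices are illustrative only.

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive model-parallel sketch: the first half of the layers lives on
    one GPU and the second half on another, so a model too large for a
    single device can still run. Assumes two CUDA devices are available."""
    def __init__(self, d=512, n_layers=8):
        super().__init__()
        half = n_layers // 2
        self.stage0 = nn.Sequential(*[nn.Linear(d, d) for _ in range(half)]).to('cuda:0')
        self.stage1 = nn.Sequential(*[nn.Linear(d, d) for _ in range(n_layers - half)]).to('cuda:1')

    def forward(self, x):
        x = self.stage0(x.to('cuda:0'))
        return self.stage1(x.to('cuda:1'))   # activations move between devices

# Usage (requires two GPUs):
# model = TwoStageModel(); out = model(torch.randn(16, 512))
```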
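Finally, a minimal post-training quantization sketch: weights are mapped to int8 with a single symmetric per-tensor scale, cutting memory roughly 4x relative to float32. Production toolchains add calibration data, per-channel scales, and fused int8 kernels; this shows only the core rounding step.

```python
import torch

def quantize_int8(weight):
    """Symmetric per-tensor post-training quantization sketch:
    float32 weights -> int8 values plus one scale factor."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(768, 768)
q, s = quantize_int8(w)
err = (dequantize(q, s) - w).abs().max()
print(q.dtype, float(err))   # torch.int8 and a small reconstruction error
```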
4. Challenges and Future Directions
4.1. Challenges
4.1.1. Trade-offs Between Efficiency and Performance
4.1.2. Scalability to Extremely Long Sequences
4.1.3. Lack of Standard Benchmarks for Efficiency
4.1.4. Hardware Constraints
4.1.5. Environmental and Energy Concerns
4.2. Future Directions
4.2.1. Unified Frameworks for Efficiency
4.2.2. Learning Sparse and Dynamic Structures
4.2.3. Energy-Aware Model Design
4.2.4. Cross-Domain Applications
4.2.5. Neuroscience-Inspired Architectures
4.2.6. Better Integration with Hardware
4.2.7. Open-Source Contributions and Collaboration
4.3. Conclusion
5. Conclusion
References
- Akyürek, E.; Schuurmans, D.; Andreas, J.; Ma, T.; Zhou, D. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661 2022.
- Bellman, R. The theory of dynamic programming. Bulletin of the American Mathematical Society 1954, 60, 503–515.
- Xie, S.M.; Raghunathan, A.; Liang, P.; Ma, T. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080 2021.
- Garg, S.; Tsipras, D.; Liang, P.S.; Valiant, G. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems 2022, 35, 30583–30598.
- Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 2019.
- Garg, S.; Tsipras, D.; Liang, P.; Valiant, G. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. In Proceedings of the Advances in Neural Information Processing Systems, 2022.
- Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. arXiv preprint arXiv:2009.06732 2022.
- Pérez, J.; Marinković, J.; Barceló, P. On the turing completeness of modern neural network architectures. arXiv preprint arXiv:1901.03429 2019.
- Sanford, C.; Hsu, D.; Telgarsky, M. Representational Strengths and Limitations of Transformers. arXiv preprint arXiv:2306.02896 2023.
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393.
- Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 2020.
- Alman, J.; Song, Z. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214 2023.
- Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; Metzler, D. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006 2020.
- Dai, D.; Sun, Y.; Dong, L.; Hao, Y.; Ma, S.; Sui, Z.; Wei, F. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. In Proceedings of the ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.
- Tay, Y.; Bahri, D.; Yang, L.; Metzler, D.; Juan, D.C. Sparse sinkhorn attention. In Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 9438–9447.
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 2019.
- Furst, M.; Saxe, J.B.; Sipser, M. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory 1984, 17, 13–27. [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in neural information processing systems, 2020, Vol. 33, pp. 1877–1901.
- Merrill, W.; Sabharwal, A.; Smith, N.A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics 2022, 10, 843–856.
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems, 2022.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.
- Hahn, M.; Goyal, N. A Theory of Emergent In-Context Learning as Implicit Structure Induction. arXiv preprint arXiv:2303.07971 2023.
- Hao, Y.; Angluin, D.; Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics 2022, 10, 800–810. [CrossRef]
- Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
- Olsson, C.; Elhage, N.; Nanda, N.; Joseph, N.; DasSarma, N.; Henighan, T.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; et al. In-context Learning and Induction Heads. Transformer Circuits Thread 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. In Proceedings of the Deep Learning for Code Workshop, 2022.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 2023.
- Luo, S.; Li, S.; Cai, T.; He, D.; Peng, D.; Zheng, S.; Ke, G.; Wang, L.; Liu, T.Y. Stable, fast and accurate: Kernelized attention with relative positional encoding. In Proceedings of the Advances in Neural Information Processing Systems, 2021, Vol. 34, pp. 22795–22807.
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 2023.
- Li, Y.; Ildiz, M.E.; Papailiopoulos, D.; Oymak, S. Transformers as Algorithms: Generalization and Stability in In-context Learning. arXiv preprint arXiv:2301.07067 2023.
- Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics 2020, 8, 156–171. [CrossRef]
- Liu, B.; Ash, J.T.; Goel, S.; Krishnamurthy, A.; Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749 2022.
- Wei, J.; Wei, J.; Tay, Y.; Tran, D.; Webson, A.; Lu, Y.; Chen, X.; Liu, H.; Huang, D.; Zhou, D.; et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846 2023.
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 2020.
- Choromanski, K.M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.Q.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the International Conference on Learning Representations, 2021.
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150 2020.
- Qiu, J.; Ma, H.; Levy, O.; Yih, W.t.; Wang, S.; Tang, J. Blockwise Self-Attention for Long Document Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2555–2565.
- Alberti, S.; Dern, N.; Thesing, L.; Kutyniok, G. Sumformer: Universal Approximation for Efficient Transformers. In Proceedings of the Topological, Algebraic and Geometric Learning Workshops 2023. PMLR, 2023, pp. 72–86.
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to algorithms; MIT press, 2022.
- Weiss, G.; Goldberg, Y.; Yahav, E. Thinking like transformers. In Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 11080–11090.
- Peng, H.; Pappas, N.; Yogatama, D.; Schwartz, R.; Smith, N.A.; Kong, L. Random feature attention. arXiv preprint arXiv:2103.02143 2021.
- Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv preprint arXiv:1711.05101 2017.
- Liu, B.; Ash, J.T.; Goel, S.; Krishnamurthy, A.; Zhang, C. Transformers Learn Shortcuts to Automata. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024.
- Wei, C.; Chen, Y.; Ma, T. Statistically meaningful approximation: a case study on approximating turing machines with transformers. Advances in Neural Information Processing Systems 2022, 35, 12071–12083.
- Tarzanagh, D.A.; Li, Y.; Thrampoulidis, C.; Oymak, S. Transformers as Support Vector Machines. arXiv preprint arXiv:2308.16898 2023.
- Goel, S.; Kakade, S.M.; Kalai, A.T.; Zhang, C. Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.H.; Agarwal, A.; Belgrave, D.; Cho, K., Eds., 2022.
- Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621 2023.
- OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 2023.
- Von Oswald, J.; Niklasson, E.; Randazzo, E.; Sacramento, J.; Mordvintsev, A.; Zhmoginov, A.; Vladymyrov, M. Transformers learn in-context by gradient descent. In Proceedings of the International Conference on Machine Learning. PMLR, 2023, pp. 35151–35174.
- Keles, F.D.; Wijewardena, P.M.; Hegde, C. On the computational complexity of self-attention. In Proceedings of the International Conference on Algorithmic Learning Theory. PMLR, 2023, pp. 597–619.
- Yao, S.; Peng, B.; Papadimitriou, C.; Narasimhan, K. Self-Attention Networks Can Process Bounded Hierarchical Languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3770–3785.
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 2016.
- Chan, S.C.; Santoro, A.; Lampinen, A.K.; Wang, J.X.; Singh, A.K.; Richemond, P.H.; McClelland, J.; Hill, F. Data Distributional Properties Drive Emergent In-Context Learning in Transformers. In Proceedings of the Advances in Neural Information Processing Systems, 2022.
- Eldan, R.; Shamir, O. The power of depth for feedforward neural networks. In Proceedings of the Conference on learning theory. PMLR, 2016, pp. 907–940.
- Feng, G.; Gu, Y.; Zhang, B.; Ye, H.; He, D.; Wang, L. Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective. Advances in Neural Information Processing Systems 2023.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 2023.
- Prystawski, B.; Goodman, N.D. Why think step-by-step? Reasoning emerges from the locality of experience. arXiv preprint arXiv:2304.03843 2023.
- Vyas, A.; Katharopoulos, A.; Fleuret, F. Fast transformers with clustered attention. In Proceedings of the Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 21665–21674.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 2023.
- Merrill, W.; Sabharwal, A. The Parallelism Tradeoff: Limitations of Log-Precision Transformers. Transactions of the Association for Computational Linguistics 2023.
