Submitted: 09 March 2025
Posted: 10 March 2025
Abstract
The increasing complexity and scale of modern machine learning models have led to growing computational demands, raising concerns about efficiency, scalability, and adaptability. Traditional deep learning architectures often struggle to balance computational cost with model expressiveness, particularly in tasks requiring specialization across diverse data distributions. One promising solution is the use of modular architectures that allow selective activation of parameters, enabling efficient resource allocation while maintaining high performance. Mixture of Experts (MoE) is a widely adopted modular approach that partitions the model into multiple specialized experts, dynamically selecting a subset of them for each input. This technique has demonstrated remarkable success in large-scale machine learning applications, including natural language processing, computer vision, speech recognition, and recommendation systems. By leveraging sparse activation, MoE architectures achieve significant computational savings while scaling to billions of parameters. This survey provides a comprehensive overview of MoE, covering its fundamental principles, architectural variations, training strategies, and key applications. Additionally, we discuss the major challenges associated with MoE, including training stability, expert imbalance, interpretability, and hardware constraints. Finally, we explore potential future research directions aimed at improving efficiency, fairness, and real-world deployability. As machine learning continues to advance, MoE is poised to play a crucial role in the development of scalable and adaptive AI systems.
Keywords:
1. Introduction
2. Background and Theoretical Foundations
2.1. Historical Perspective
2.2. Mathematical Formulation
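For reference, the classic (dense) MoE combines expert outputs through a learned gating network; a standard formulation, with notation chosen here for illustration, is:

```latex
% Dense MoE: every expert contributes, weighted by a softmax gate.
y(x) = \sum_{i=1}^{N} G(x)_i \, E_i(x), \qquad G(x) = \mathrm{softmax}(W_g x)

% Sparse (top-k) variant: only the k largest gate logits are kept before the softmax.
G(x) = \mathrm{softmax}\big(\mathrm{TopK}(W_g x,\, k)\big)
```

Here $E_i$ denotes the $i$-th expert network, $G$ the gating function, and $W_g$ the gating weights; in the sparse variant, logits outside the top $k$ are effectively set to $-\infty$ so that the corresponding experts receive zero weight.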
2.3. Training Strategies
- Load Balancing: To ensure that all experts contribute meaningfully, auxiliary loss terms are often added to encourage balanced expert usage [18] (a minimal sketch of such a loss follows this list).
- Gradient Routing: Efficient gradient flow through the gating mechanism is crucial for stable training and for avoiding expert underutilization.
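As a concrete illustration of the load-balancing idea, the sketch below implements an auxiliary loss in the style of Switch Transformers, minimized when both the dispatch fractions and the mean router probabilities are uniform across experts; the function and variable names are assumptions for illustration, not taken verbatim from any surveyed implementation.

```python
# Hedged sketch of a Switch-Transformer-style load-balancing loss.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """router_logits: [tokens, num_experts]; expert_index: [tokens], chosen expert per token."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)            # router probabilities per token
    dispatch = F.one_hot(expert_index, num_experts).float()
    f = dispatch.mean(dim=0)                             # fraction of tokens sent to each expert
    p = probs.mean(dim=0)                                # mean router probability per expert
    # The product is minimized when both f and p are uniform (1/num_experts each).
    return num_experts * torch.sum(f * p)
```

In practice this term is added to the task loss with a small coefficient (commonly on the order of 0.01), so that it nudges routing toward balance without dominating the main objective.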
2.4. Comparison with Other Architectures
2.5. Challenges and Limitations
- Expert Specialization and Mode Collapse: Some experts may become over-specialized or completely inactive, reducing the model’s diversity.
- Computational Complexity: Although sparse activation reduces computation, routing decisions introduce overhead.
- Scalability Issues: Training large-scale MoE models requires careful optimization strategies to prevent communication bottlenecks in distributed environments [22].
3. Architectural Variants of Mixture of Experts
3.1. Classic MoE Architecture
3.2. Hierarchical MoE
Advantages:
- Improved model capacity and flexibility through hierarchical decision-making [30] (a two-level gating formulation is sketched after this list).
- Better scalability, as experts at different levels specialize in finer-grained subproblems.
- Potential for more structured, and therefore more interpretable, representations.
Limitations:
- Increased training complexity due to multiple gating functions.
- Potential vanishing-gradient issues in deep hierarchies.
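A two-level hierarchical gate can be written as follows (a sketch in the spirit of hierarchical mixtures of experts; the notation is assumed here for illustration):

```latex
y(x) = \sum_{i=1}^{M} G^{\text{top}}(x)_i \sum_{j=1}^{N_i} G^{(i)}(x)_j \, E_{ij}(x)
```

where $G^{\text{top}}$ routes among $M$ expert groups and each group-level gate $G^{(i)}$ routes among the $N_i$ experts within group $i$.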
3.3. Sparse MoE with Top-K Gating
Advantages:
- Significant reduction in computational cost, since only a subset of experts is used per input (a minimal top-k gating sketch follows this list).
- Improved efficiency for large-scale models, particularly in distributed environments.
- Reduced risk of expert overfitting, as sparsity encourages better generalization.
Limitations:
- Load-balancing issues, where certain experts may receive disproportionately more inputs than others [32].
- Increased variance in training, as fewer experts contribute to each update.
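The following minimal PyTorch sketch shows the top-k selection step; the class and variable names are illustrative rather than drawn from any specific implementation in the surveyed literature.

```python
# Minimal top-k gating sketch (illustrative names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        """x: [tokens, d_model] -> (weights, indices) for the k selected experts per token."""
        logits = self.w_gate(x)                            # [tokens, num_experts]
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # keep only the k best experts
        weights = F.softmax(topk_vals, dim=-1)             # renormalize over the selected experts
        return weights, topk_idx
```

In a full layer, each token's output would be the weighted sum of the k selected experts' outputs, with all other experts skipped entirely, which is where the computational savings come from.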
3.4. Soft MoE vs. Hard MoE [33]
Soft MoE:
- Uses a smooth softmax gating function to assign continuous weights to all experts.
- Allows multiple experts to contribute to each prediction [34].
- Generally results in more stable training but requires more computation.
Hard MoE:
- Selects a discrete subset (often a single expert) per input, making inference more efficient [35].
- Can be implemented using techniques such as hard thresholding or reinforcement learning-based routing.
- More challenging to train due to non-differentiability, often requiring techniques such as the Gumbel-Softmax trick (see the sketch after this list).
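To make the non-differentiability point concrete, the sketch below uses the straight-through Gumbel-Softmax estimator to obtain a hard (one-hot) routing decision while still allowing gradients to flow to the gate; this is one assumed realization, not the only way hard MoE is trained.

```python
# Sketch of hard (discrete) routing trained with the straight-through Gumbel-Softmax trick.
import torch
import torch.nn.functional as F

def hard_route(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: [tokens, num_experts] -> one-hot routing matrix with usable gradients."""
    # hard=True returns a one-hot sample in the forward pass, but backpropagates
    # through the underlying soft sample (straight-through estimator).
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# Soft MoE, by contrast, would simply use F.softmax(logits, dim=-1) and mix all experts.
```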
3.5. Recent Advances in MoE Architectures
- Routing Networks: Dynamic routing strategies, such as reinforcement learning-based gating, have been explored to optimize expert selection based on long-term learning objectives.
- Distillation-Augmented MoE: Some architectures leverage knowledge distillation to enhance expert generalization and reduce redundancy among experts.
3.6. Comparison of MoE Architectures
3.7. Summary
4. Training Methodologies and Optimization Techniques
4.1. Standard Training Procedure
4.2. Challenges in Training MoE Models
- Expert Imbalance: Some experts may receive significantly more data than others, leading to inefficient learning and underutilized model capacity.
- Mode Collapse: The gating function may learn to favor only a subset of experts, resulting in reduced diversity among experts.
- Gradient Routing Issues: Sparse activation of experts can lead to unstable gradients, making optimization difficult.
- Scalability Concerns: Large-scale MoE models require careful coordination across multiple GPUs or TPUs to avoid communication bottlenecks [43].
4.3. Load Balancing and Regularization Techniques
4.3.1. Entropy-Based Regularization
4.3.2. Load Balancing Loss
4.3.3. Noisy Gating
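A gating network with input-dependent noise, in the spirit of the sparsely-gated MoE layer, can be sketched as follows; the module names and the exact noise parameterization are assumptions made here for illustration.

```python
# Sketch of noisy gating (illustrative; exact formulations vary across implementations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Returns noisy gate logits; a top-k truncation and softmax would typically follow."""
        logits = self.w_gate(x)
        if self.training:
            # Input-dependent Gaussian noise encourages exploration of otherwise underused experts.
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))
        return logits
```

The noise perturbs which experts land in the top k from step to step, which helps prevent the router from collapsing onto a small subset of experts early in training.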
4.4. Sparse and Efficient Training Strategies
4.4.1. Top-K Expert Selection
4.4.2. Gradient Routing and Backpropagation
- Sparse Gradient Updates: Only the active experts receive weight updates, reducing computational overhead [49] (see the sketch after this list).
- Gradient Clipping: Large MoE models often suffer from unstable gradients. Clipping gradient norms prevents exploding gradients [50].
- Auxiliary Gradient Losses: Additional losses on expert selection can encourage better gradient flow through the gating network.
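The sketch below (an assumed minimal training step, not code from any of the surveyed systems) illustrates the first two points: experts that were not selected do not appear in the computation graph and therefore receive no gradient for that batch, while global norm clipping bounds the size of each update.

```python
# Minimal illustration: sparse gradient updates plus gradient clipping.
import torch
import torch.nn as nn

experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
optimizer = torch.optim.Adam(experts.parameters(), lr=1e-3)

x = torch.randn(16, 8)
active = [0, 2]                                    # pretend the gate selected experts 0 and 2
out = sum(experts[i](x) for i in active) / len(active)
loss = out.pow(2).mean()                           # stand-in for the task (+ auxiliary) loss

loss.backward()                                    # experts 1 and 3 receive no gradient this step
torch.nn.utils.clip_grad_norm_(experts.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```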
4.4.3. Efficient Parallelization
- Expert Parallelism: Experts are distributed across different GPUs/TPUs, with each device handling only a subset of experts [52].
- Model Parallelism: Instead of replicating experts on every device, different devices specialize in different subsets of experts.
- Token-Level Routing: Instead of assigning an entire batch to a single expert, token-level routing assigns individual tokens to different experts, improving efficiency in sequence models [53].
4.5. Advanced Optimization Techniques
4.5.1. Knowledge Distillation for MoE
4.5.2. Reinforcement Learning-Based Gating
4.5.3. Multi-Task Learning with MoE
4.6. Summary
5. Applications of Mixture of Experts
5.1. Natural Language Processing
5.1.1. Large Language Models
- GShard [24]: A framework for training large-scale multilingual models using MoE, enabling efficient parallelism across distributed hardware.
5.1.2. Multilingual and Cross-Lingual Models
5.2. Computer Vision
5.2.1. Efficient Image Classification
- Vision MoE [38]: A scalable MoE-based vision transformer that achieves state-of-the-art results on ImageNet while reducing computational overhead.
5.2.2. Object Detection and Segmentation
5.3. Speech Processing
5.3.1. Automatic Speech Recognition (ASR)
5.3.2. Text-to-Speech (TTS)
5.4. Recommendation Systems
5.4.1. Personalized Recommendations
5.4.2. Google’s YouTube and Google Play Recommendations
5.5. Scientific Computing and Simulation
5.5.1. Climate Modeling
5.5.2. Protein Folding and Drug Discovery
5.6. Robotics and Control Systems
5.6.1. Autonomous Vehicles
5.6.2. Reinforcement Learning
5.7. Summary
6. Challenges and Future Directions
6.1. Scalability and Computational Efficiency
6.1.1. Memory and Communication Overhead
- Sparse Communication Strategies: Reducing inter-GPU communication by updating only a subset of experts per step [78].
- Efficient Parameter Sharing: Using techniques like expert merging or weight tying to reduce memory footprint [79].
- Hierarchical MoE Architectures: Introducing multi-level expert selection to balance computational cost [80].
6.1.2. Training Stability and Convergence
- Improved Load Balancing Mechanisms: Advanced regularization techniques to encourage even usage of experts.
- Adaptive Expert Routing: Dynamically adjusting expert selection criteria based on training progress [83].
- Gradient Clipping and Normalization: Preventing extreme weight updates that cause instability.
6.2. Interpretability and Explainability
6.2.1. Understanding Expert Behavior
- Visualization Techniques: Using attention maps or activation patterns to understand expert specialization (see the diagnostic sketch after this list).
- Explainable AI (XAI) Methods: Applying techniques like SHAP values to analyze gating decisions.
- Explicit Expert Constraints: Encouraging experts to learn distinct, non-overlapping feature representations.
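One simple starting point for such analyses is to log routing statistics; the diagnostic below (illustrative names, not an established tool) reports per-expert usage frequencies and the mean entropy of the routing distribution, which together give a first picture of how specialized or skewed the gate has become.

```python
# Sketch of a routing diagnostic for expert-behavior analysis.
import torch
import torch.nn.functional as F

def routing_stats(router_logits: torch.Tensor):
    """router_logits: [tokens, num_experts] -> (usage frequency per expert, mean routing entropy)."""
    probs = F.softmax(router_logits, dim=-1)
    usage = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).float().mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return usage, entropy
```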
6.3. Fairness and Bias in Expert Selection
- Diversity Regularization: Enforcing constraints that encourage diverse expert activation across different demographic groups.
- Bias Detection in Expert Assignments: Analyzing whether specific user groups consistently get routed to the same experts.
- Fairness-Aware MoE Models: Designing MoE architectures that explicitly account for fairness objectives [85].
6.4. Robustness and Generalization
6.4.1. Generalization to Unseen Data
- Meta-Learning for MoE: Training experts to generalize across a broader range of tasks [86].
- Uncertainty-Aware Gating Mechanisms: Introducing probabilistic gating models to handle unseen scenarios more effectively [87].
- Data Augmentation for Expert Diversity: Ensuring each expert is exposed to a more diverse set of inputs [88].
6.5. Hardware and Deployment Challenges
6.5.1. Efficient Inference
- Edge-Friendly MoE Architectures: Developing lightweight MoE variants optimized for edge and mobile devices [64].
- Distilled MoE Models: Using knowledge distillation to create smaller, more efficient models without sacrificing performance.
- Latency-Aware Expert Selection: Adapting the gating mechanism to prioritize faster experts under real-time constraints [89].
6.6. Future Research Directions
- Self-Supervised and Unsupervised MoE: Exploring how MoE models can be effectively trained without labeled data [90].
- Neuroscientific Inspiration for MoE Design: Drawing insights from biological neural networks to develop more efficient and adaptive expert selection mechanisms.
- Hybrid MoE Architectures: Combining MoE with other architectural paradigms, such as reinforcement learning or meta-learning, to improve adaptability [20].
- Theoretical Understanding of MoE: Developing a deeper mathematical foundation to explain MoE’s efficiency and performance [91].
6.7. Summary
7. Conclusion
7.1. Key Insights
- Scalability and Efficiency: MoE models allow the training and deployment of large-scale models with billions of parameters while maintaining manageable computational costs through sparse activation [96].
- Domain-Specific Advantages: MoE has demonstrated remarkable success in NLP, particularly in large-scale language models, as well as in computer vision, recommendation systems, and scientific simulations.
- Challenges in Training and Deployment: Despite its advantages, MoE introduces issues related to training stability, expert imbalance, and increased memory and communication overhead.
- Interpretability and Fairness Considerations: Understanding expert behavior and ensuring fairness in expert routing are critical concerns for deploying MoE in real-world applications [97].
- Future Research Directions: Promising areas for future work include improving load balancing, developing fairness-aware MoE architectures, exploring unsupervised and self-supervised MoE, and optimizing MoE for real-time inference and deployment on edge devices [98].
7.2. Broader Implications
7.3. Final Remarks
References
- Rosenbaum, C.; Klinger, T.; Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239 2017.
- Chi, Z.; Dong, L.; Huang, S.; Dai, D.; Ma, S.; Patra, B.; Singhal, S.; Bajaj, P.; Song, X.; Mao, X.L.; et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems 2022, 35, 34600–34613.
- Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
- Qwen Team. Introducing Qwen1.5, 2024.
- Wang, H.; Polo, F.M.; Sun, Y.; Kundu, S.; Xing, E.; Yurochkin, M. Fusing Models with Complementary Expertise. In Proceedings of the The Twelfth International Conference on Learning Representations, 2023.
- Chen, C.; Li, M.; Wu, Z.; Yu, D.; Yang, C. Ta-moe: Topology-aware large scale mixture-of-expert training. Advances in Neural Information Processing Systems 2022, 35, 22173–22186.
- Gale, T.; Narayanan, D.; Young, C.; Zaharia, M. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems 2023, 5.
- Muqeeth, M.; Liu, H.; Raffel, C. Soft merging of experts with adaptive routing. arXiv preprint arXiv:2306.03745 2023.
- Li, D.; Ma, Y.; Wang, N.; Cheng, Z.; Duan, L.; Zuo, J.; Yang, C.; Tang, M. MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv preprint arXiv:2404.15159 2024.
- Roller, S.; Sukhbaatar, S.; Weston, J.; et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems 2021, 34, 17555–17566.
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural computation 1991, 3, 79–87. [CrossRef]
- Shi, S.; Pan, X.; Chu, X.; Li, B. PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining. In Proceedings of the IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE, 2023, pp. 1–10.
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 2019, 32.
- Zhang, X.; Shen, Y.; Huang, Z.; Zhou, J.; Rong, W.; Xiong, Z. Mixture of Attention Heads: Selecting Attention Heads Per Token. In Proceedings of the Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 4150–4162.
- Kim, Y.J.; Awan, A.A.; Muzio, A.; Salinas, A.F.C.; Lu, L.; Hendy, A.; Rajbhandari, S.; He, Y.; Awadalla, H.H. Scalable and efficient moe training for multitask multilingual models. arXiv preprint arXiv:2109.10465 2021.
- Wu, J.; Hu, X.; Wang, Y.; Pang, B.; Soricut, R. Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts. arXiv preprint arXiv:2312.00968 2023.
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 2017, 114, 3521–3526. [CrossRef]
- Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models are Zero-Shot Learners. In Proceedings of the International Conference on Learning Representations, 2021.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30.
- Lample, G.; Sablayrolles, A.; Ranzato, M.; Denoyer, L.; Jégou, H. Large memory layers with product keys. Advances in Neural Information Processing Systems 2019, 32.
- Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
- Du, Y.; Zhao, S.; Zhao, D.; Ma, M.; Chen, Y.; Huo, L.; Yang, Q.; Xu, D.; Qin, B. MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability. arXiv preprint arXiv:2405.14488 2024.
- Shen, Y.; Zhang, Z.; Cao, T.; Tan, S.; Chen, Z.; Gan, C. Moduleformer: Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640 2023.
- Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 2020.
- Dou, S.; Zhou, E.; Liu, Y.; Gao, S.; Zhao, J.; Shen, W.; Zhou, Y.; Xi, Z.; Wang, X.; Fan, X.; et al. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment. arXiv preprint arXiv:2312.09979 2023.
- Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
- Xue, F.; Shi, Z.; Wei, F.; Lou, Y.; Liu, Y.; You, Y. Go wider instead of deeper. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2022, Vol. 36, pp. 8779–8787.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 2018, 41, 423–443.
- Dua, D.; Bhosale, S.; Goswami, V.; Cross, J.; Lewis, M.; Fan, A. Tricks for Training Sparse Translation Models. In Proceedings of the Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 3340–3345.
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 2022, 23, 1–39.
- He, X.O. Mixture of A Million Experts. arXiv preprint arXiv:2407.04153 2024.
- Zheng, L.; Li, Z.; Zhang, H.; Zhuang, Y.; Chen, Z.; Huang, Y.; Wang, Y.; Xu, Y.; Zhuo, D.; Xing, E.P.; et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 559–578.
- Fedus, W.; Dean, J.; Zoph, B. A review of sparse expert models in deep learning. arXiv preprint arXiv:2209.01667 2022.
- Komatsuzaki, A.; Puigcerver, J.; Lee-Thorp, J.; Ruiz, C.R.; Mustafa, B.; Ainslie, J.; Tay, Y.; Dehghani, M.; Houlsby, N. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. In Proceedings of the The Eleventh International Conference on Learning Representations, 2022.
- Puigcerver, J.; Ruiz, C.R.; Mustafa, B.; Houlsby, N. From Sparse to Soft Mixtures of Experts. In Proceedings of the The Twelfth International Conference on Learning Representations, 2023.
- Shazeer, N. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 2020.
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv preprint arXiv:2401.04088 2024.
- Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 2021, 34, 8583–8595.
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 2023.
- Zhang, Z.; Liu, S.; Yu, J.; Cai, Q.; Zhao, X.; Zhang, C.; Liu, Z.; Liu, Q.; Zhao, H.; Hu, L.; et al. M3oE: Multi-Domain Multi-Task Mixture-of Experts Recommendation Framework. In Proceedings of the Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 893–902.
- Zhai, M.; He, J.; Ma, Z.; Zong, Z.; Zhang, R.; Zhai, J. SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 961–975.
- Antoniak, S.; Jaszczur, S.; Krutul, M.; Pióro, M.; Krajewski, J.; Ludziejewski, J.; Odrzygóźdź, T.; Cygan, M. Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation. arXiv preprint arXiv:2310.15961 2023.
- Zhou, Y.; Du, N.; Huang, Y.; Peng, D.; Lan, C.; Huang, D.; Shakeri, S.; So, D.; Dai, A.M.; Lu, Y.; et al. Brainformers: Trading simplicity for efficiency. In Proceedings of the International Conference on Machine Learning. PMLR, 2023, pp. 42531–42542.
- He, S.; Fan, R.Z.; Ding, L.; Shen, L.; Zhou, T.; Tao, D. Merging Experts into One: Improving Computational Efficiency of Mixture of Experts. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 14685–14691.
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 2017.
- Jacobs, S.A.; Tanaka, M.; Zhang, C.; Zhang, M.; Song, L.; Rajbhandari, S.; He, Y. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509 2023.
- Xiao, D.; Zhang, H.; Li, Y.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-GEN: an enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, 2021, pp. 3997–4003.
- Ren, X.; Zhou, P.; Meng, X.; Huang, X.; Wang, Y.; Wang, W.; Li, P.; Zhang, X.; Podolskiy, A.; Arshinov, G.; et al. Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing. arXiv preprint arXiv:2303.10845 2023.
- Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887 2024.
- Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C.A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 2022, 35, 1950–1965.
- Uppal, S.; Bhagat, S.; Hazarika, D.; Majumder, N.; Poria, S.; Zimmermann, R.; Zadeh, A. Multimodal research in vision and language: A review of current and emerging trends. Information Fusion 2022, 77, 149–171. [CrossRef]
- Gross, S.; Ranzato, M.; Szlam, A. Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6865–6873.
- Dai, D.; Deng, C.; Zhao, C.; Xu, R.; Gao, H.; Chen, D.; Li, J.; Zeng, W.; Yu, X.; Wu, Y.; et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066 2024.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2015.
- Zuo, S.; Liu, X.; Jiao, J.; Kim, Y.J.; Hassan, H.; Zhang, R.; Gao, J.; Zhao, T. Taming Sparsely Activated Transformer with Stochastic Experts. In Proceedings of the International Conference on Learning Representations, 2021.
- Xue, F.; Zheng, Z.; Fu, Y.; Ni, J.; Zheng, Z.; Zhou, W.; You, Y. Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 2024.
- Clark, A.; de Las Casas, D.; Guy, A.; Mensch, A.; Paganini, M.; Hoffmann, J.; Damoc, B.; Hechtman, B.; Cai, T.; Borgeaud, S.; et al. Unified scaling laws for routed language models. In Proceedings of the International conference on machine learning. PMLR, 2022, pp. 4057–4086.
- Shuster, K.; Xu, J.; Komeili, M.; Ju, D.; Smith, E.M.; Roller, S.; Ung, M.; Chen, M.; Arora, K.; Lane, J.; et al. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188 2022.
- Dat, D.H.; Mao, P.Y.; Nguyen, T.H.; Buntine, W.; Bennamoun, M. HOMOE: A Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture of Experts. arXiv preprint arXiv:2311.14747 2023.
- Chen, T.; Zhang, Z.; Jaiswal, A.K.; Liu, S.; Wang, Z. Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers. In Proceedings of the The Eleventh International Conference on Learning Representations, 2022.
- Li, Y.; Hui, B.; Yin, Z.; Yang, M.; Huang, F.; Li, Y. PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts. In Proceedings of the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13402–13416.
- Huang, C.; Liu, Q.; Lin, B.Y.; Pang, T.; Du, C.; Lin, M. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269 2023.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [CrossRef]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv preprint arXiv:2310.06825 2023.
- Bi, X.; Chen, D.; Chen, G.; Chen, S.; Dai, D.; Deng, C.; Ding, H.; Dong, K.; Du, Q.; Fu, Z.; et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954 2024.
- Xue, F.; He, X.; Ren, X.; Lou, Y.; You, Y. One student knows all experts know: From sparse to dense. arXiv preprint arXiv:2201.10890 2022.
- Yoo, K.M.; Han, J.; In, S.; Jeon, H.; Jeong, J.; Kang, J.; Kim, H.; Kim, K.M.; Kim, M.; Kim, S.; et al. HyperCLOVA X Technical Report. arXiv preprint arXiv:2404.01954 2024.
- Artetxe, M.; Bhosale, S.; Goyal, N.; Mihaylov, T.; Ott, M.; Shleifer, S.; Lin, X.V.; Du, J.; Iyer, S.; Pasunuru, R.; et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684 2021.
- Ainslie, J.; Lee-Thorp, J.; de Jong, M.; Zemlyanskiy, Y.; Lebron, F.; Sanghai, S. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4895–4901.
- Dai, D.; Dong, L.; Ma, S.; Zheng, B.; Sui, Z.; Chang, B.; Wei, F. StableMoE: Stable Routing Strategy for Mixture of Experts. In Proceedings of the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7085–7095.
- Zhao, H.; Qiu, Z.; Wu, H.; Wang, Z.; He, Z.; Fu, J. HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts. arXiv preprint arXiv:2402.12656 2024.
- Rajbhandari, S.; Ruwase, O.; Rasley, J.; Smith, S.; He, Y. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the Proceedings of the international conference for high performance computing, networking, storage and analysis, 2021, pp. 1–14.
- Mao, Y.; Mathias, L.; Hou, R.; Almahairi, A.; Ma, H.; Han, J.; Yih, S.; Khabsa, M. UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6253–6264.
- Shen, Y.; Guo, Z.; Cai, T.; Qin, Z. JetMoE: Reaching Llama2 Performance with 0.1M Dollars. arXiv preprint arXiv:2404.07413 2024.
- Zeng, Z.; Miao, Y.; Gao, H.; Zhang, H.; Deng, Z. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models. arXiv preprint arXiv:2406.13233 2024.
- Tang, H.; Liu, J.; Zhao, M.; Gong, X. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 269–278.
- Team, L.M. LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training, 2023.
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904 2022.
- Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; Maillard, J.; et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 2022.
- Nie, X.; Zhao, P.; Miao, X.; Zhao, T.; Cui, B. HetuMoE: An efficient trillion-scale mixture-of-expert distributed training system. arXiv preprint arXiv:2203.14685 2022.
- Shi, S.; Pan, X.; Wang, Q.; Liu, C.; Ren, X.; Hu, Z.; Yang, Y.; Li, B.; Chu, X. ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling. In Proceedings of the Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 236–249.
- Chen, Z.; Shen, Y.; Ding, M.; Chen, Z.; Zhao, H.; Learned-Miller, E.G.; Gan, C. Mod-squad: Designing mixtures of experts as modular multi-task learners. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11828–11837.
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. International Journal of Computer Vision 2022, 130, 2337–2348. [CrossRef]
- Dou, S.; Zhou, E.; Liu, Y.; Gao, S.; Zhao, J.; Shen, W.; Zhou, Y.; Xi, Z.; Wang, X.; Fan, X.; et al. The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment. arXiv preprint arXiv:2312.09979 2023.
- Almahairi, A.; Ballas, N.; Cooijmans, T.; Zheng, Y.; Larochelle, H.; Courville, A. Dynamic capacity networks. In Proceedings of the International Conference on Machine Learning. PMLR, 2016, pp. 2549–2558.
- Zhang, Q.; Zou, B.; An, R.; Liu, J.; Zhang, S. MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning. arXiv preprint arXiv:2312.02923 2023.
- Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1930–1939.
- Zhang, Z.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; Zhou, J. MoEfication: Transformer Feed-forward Layers are Mixtures of Experts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 877–890.
- Li, Z.; You, C.; Bhojanapalli, S.; Li, D.; Rawat, A.S.; Reddi, S.J.; Ye, K.; Chern, F.; Yu, F.; Guo, R.; et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. arXiv preprint arXiv:2210.06313 2022.
- Wang, Y.; Agarwal, S.; Mukherjee, S.; Liu, X.; Gao, J.; Awadallah, A.H.; Gao, J. AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. In Proceedings of the Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y.; Kozareva, Z.; Zhang, Y., Eds., Abu Dhabi, United Arab Emirates, 2022; pp. 5744–5760. [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022, 35, 27730–27744.
- Liu, Y.; Zhang, R.; Yang, H.; Keutzer, K.; Du, Y.; Du, L.; Zhang, S. Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning. arXiv preprint arXiv:2404.08985 2024.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 2019.
- Zhong, Z.; Xia, M.; Chen, D.; Lewis, M. Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training. arXiv preprint arXiv:2405.03133 2024.
- Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 2024.
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [CrossRef]
- Choi, J.Y.; Kim, J.; Park, J.H.; Mok, W.L.; Lee, S. SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts. In Proceedings of the The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Wu, S.; Luo, J.; Chen, X.; Li, L.; Zhao, X.; Yu, T.; Wang, C.; Wang, Y.; Wang, F.; Qiao, W.; et al. Yuan 2.0-M32: Mixture of Experts with Attention Router. arXiv preprint arXiv:2405.17976 2024.
- Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; Gai, K. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In Proceedings of the The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 1137–1140.
- Zhang, Y.; Cai, R.; Chen, T.; Zhang, G.; Zhang, H.; Chen, P.Y.; Chang, S.; Wang, Z.; Liu, S. Robust Mixture-of-Expert Training for Convolutional Neural Networks. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 90–101.
- Li, Y.; Jiang, S.; Hu, B.; Wang, L.; Zhong, W.; Luo, W.; Ma, L.; Zhang, M. Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts. arXiv preprint arXiv:2405.11273 2024. [CrossRef]
- Zhang, R.; Han, J.; Liu, C.; Zhou, A.; Lu, P.; Li, H.; Gao, P.; Qiao, Y. LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention. In Proceedings of the The Twelfth International Conference on Learning Representations, 2024.
Table: Comparison of MoE architectures (see Section 3.6).

| Architecture | Computational Cost | Specialization | Scalability |
|---|---|---|---|
| Classic MoE | High | Moderate | Limited |
| Hierarchical MoE | High | High | Moderate |
| Sparse MoE (Top-K) | Low | Moderate-High | High |
| Soft MoE | High | High | Moderate |
| Hard MoE | Low | Moderate | High |
| Switch Transformers | Very Low | Moderate | Very High |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
