Submitted: 14 April 2025
Posted: 16 April 2025
Abstract
Keywords:
1. Introduction
1.1. Scope and Contributions
- A unified taxonomy of Mixture of Experts models, covering both classical and contemporary formulations.
- A detailed analysis of gating mechanisms, including recent developments in differentiable routing, reinforcement learning-based gating, and top-k selection [10].
- A comparative study of training paradigms, highlighting trade-offs between sparse and dense expert activation, load balancing, and convergence behavior [11].
- An overview of prominent applications across NLP, vision, recommendation systems, and multi-modal learning.
- A critical discussion of current limitations and future research directions, with a particular focus on interpretability, robustness, and deployment challenges.
1.2. Organization of the Survey
Notation
2. Theoretical Foundations of Mixture of Experts
2.1. Probabilistic Formulation
2.2. Deterministic and Neural Formulation
2.3. Sparse Gating and Conditional Computation
2.4. Function Approximation and Universal Representation
2.5. Connections to Related Models
- Ensemble Learning: Unlike traditional ensembles that average over fixed experts, MoE employs input-dependent gating.
- Decision Trees: MoE can be viewed as a soft generalization of decision trees, where routing decisions are probabilistic and differentiable.
- Hierarchical Models: Hierarchical MoEs stack multiple levels of expert selection, enabling deep modular hierarchies.
- Meta-learning: The gating function can be interpreted as a meta-learner that decides which expert(s) to deploy.
2.6. Challenges in Theoretical Analysis
- Non-convex optimization: Jointly learning experts and gating functions introduces highly non-convex loss landscapes.
- Expert collapse: Without careful regularization, experts may converge to similar behaviors, reducing diversity [21].
- Load imbalance: Sparse activation can lead to uneven expert utilization, causing inefficiencies in training.
3. Architectures and Design Patterns for Mixture of Experts
3.1. Classical vs. Deep MoE Architectures
3.2. Flat vs. Hierarchical MoE
3.2.1. Flat MoE
3.2.2. Hierarchical MoE
3.3. Expert Types
- Linear experts: Lightweight and analytically tractable; useful for early-stage MoE research and toy problems [28].
- Multi-layer perceptrons (MLPs): Widely used in feed-forward MoE layers in modern transformer-based models.
- Convolutional experts: Suitable for image and video tasks; support spatial inductive biases [29].
- Task-specific modules: Experts designed for specific modalities or domains, such as visual reasoning, tabular data, or graph neural networks.
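To ground these options, the following is a minimal sketch of a dense MoE layer with small MLP experts and a softmax gate, assuming PyTorch; the class name `DenseMoE` and the dimensions are illustrative and do not correspond to any surveyed system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Minimal dense MoE layer: every expert processes every input,
    and a softmax gate mixes their outputs."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); gating weights: (batch, num_experts)
        weights = F.softmax(self.gate(x), dim=-1)
        # Stack expert outputs: (batch, num_experts, d_model)
        outputs = torch.stack([expert(x) for expert in self.experts], dim=1)
        # Weighted sum over the expert dimension
        return torch.einsum("be,bed->bd", weights, outputs)

layer = DenseMoE(d_model=16, d_hidden=32, num_experts=4)
y = layer(torch.randn(8, 16))  # (8, 16)
```

Sparse variants keep the same structure but evaluate only the experts selected by the gate (Section 3.5).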
3.4. Gating Configurations and Routing Granularity
- Token-level routing: Each token in a sequence selects its own experts independently. Common in NLP models with large batch sizes and variable input lengths.
- Sample-level routing: A single gating decision is made for the entire input sample. Useful when inputs are short or semantically cohesive (e.g., images).
- Feature-group routing: Experts are specialized based on feature subspaces (e.g., visual vs. textual features in multi-modal inputs).
- Layer-wise MoE: MoE layers are inserted at specific positions in a deep network, such as the feedforward layers in transformers.
3.5. Sparsity Mechanisms
- Top-k gating: The gating function selects the k experts with highest scores for each input and zeros out others.
- Sparsemax: Uses a differentiable projection onto the simplex to achieve sparsity [34].
- Routing via reinforcement learning: Treats expert selection as a discrete decision problem.
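A hedged sketch of the top-k mechanism listed above, assuming PyTorch; `top_k_gating` is an illustrative helper rather than a library function.

```python
import torch
import torch.nn.functional as F

def top_k_gating(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Keep the k highest-scoring experts per input, zero out the rest,
    and renormalize so the retained weights sum to one."""
    topk_vals, topk_idx = logits.topk(k, dim=-1)          # (batch, k)
    mask = torch.full_like(logits, float("-inf"))
    mask.scatter_(-1, topk_idx, topk_vals)                # -inf outside the top-k
    return F.softmax(mask, dim=-1)                        # exact zeros outside the top-k

gates = top_k_gating(torch.randn(4, 8), k=2)  # (4, 8), two non-zero entries per row
```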
3.6. Load Balancing and Expert Utilization
3.7. Scalability and Distributed Implementation
- Expert parallelism: Experts are distributed across devices, with each device hosting a subset of experts.
- Token dispatching: Input tokens are routed to appropriate devices based on gating decisions.
- All-to-all communication: Used to efficiently dispatch tokens and gather outputs in distributed settings.
- Caching and quantization: Employed to reduce memory and bandwidth requirements.
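The dispatch and combine steps can be illustrated with a single-process sketch that groups tokens by their assigned (top-1) expert; production systems replace the Python loop with all-to-all collectives across devices. The function names here are hypothetical.

```python
import torch

def dispatch_tokens(tokens: torch.Tensor, expert_idx: torch.Tensor, num_experts: int):
    """Group tokens by their assigned expert, mimicking the dispatch step
    that precedes all-to-all communication in expert-parallel systems."""
    buckets, positions = [], []
    for e in range(num_experts):
        pos = (expert_idx == e).nonzero(as_tuple=True)[0]
        positions.append(pos)
        buckets.append(tokens[pos])          # tokens destined for expert e
    return buckets, positions

def combine_tokens(buckets, positions, num_tokens, d_model):
    """Scatter expert outputs back into the original token order."""
    out = torch.zeros(num_tokens, d_model)
    for pos, bucket in zip(positions, buckets):
        out[pos] = bucket
    return out

tokens = torch.randn(10, 16)
expert_idx = torch.randint(0, 4, (10,))
buckets, positions = dispatch_tokens(tokens, expert_idx, num_experts=4)
restored = combine_tokens(buckets, positions, num_tokens=10, d_model=16)
assert torch.allclose(restored, tokens)
```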
3.8. Design Trade-offs
- Modularity vs. coordination: Modular experts facilitate specialization, but require coordination via gating [36].
- Sparsity vs. smoothness: Sparse activation improves efficiency but may lead to optimization instability [37].
- Scalability vs. communication overhead: Distributed execution enables scaling but incurs communication costs [38].
4. Gating Mechanisms and Routing Strategies
4.1. Overview of Gating Functions
- Soft routing (dense gating): All experts contribute to the output with non-zero weights [41].
- Sparse routing (hard gating): Only a subset of experts is activated per input, typically chosen via top-k selection.
- Stochastic routing: Expert selection involves sampling from a learned or fixed probability distribution.
- Learned discrete routing: Expert assignments are treated as latent variables and learned via REINFORCE, Gumbel-softmax, or variational methods.
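For reference, the dense and sparse variants above are commonly written as follows, using generic notation (E experts f_i, router logits h(x) = W_g x, and top-k index set T_k(x)) rather than the notation of any particular paper:

```latex
% Dense (soft) gating: every expert contributes with a non-zero weight.
y(x) = \sum_{i=1}^{E} g_i(x)\, f_i(x),
\qquad g(x) = \operatorname{softmax}\!\big(W_g x\big)

% Sparse (top-k) gating: only experts in the top-k set are evaluated.
g_i(x) =
\begin{cases}
\dfrac{\exp\!\big(h_i(x)\big)}{\sum_{j \in \mathcal{T}_k(x)} \exp\!\big(h_j(x)\big)}, & i \in \mathcal{T}_k(x),\\[6pt]
0, & \text{otherwise,}
\end{cases}
\qquad h(x) = W_g x .
```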
4.2. Softmax and Dense Gating
4.3. Top-k Sparse Gating
4.4. Noisy Top-k Routing
4.5. Gumbel-Softmax and Differentiable Sampling
4.6. REINFORCE-Based Routing
4.7. Fixed and Data-Independent Routing
4.8. Token-Level vs. Sample-Level Routing
- Token-level gating: Each token (or feature) in the input independently chooses its experts [48]. This is prevalent in sequence models like Transformers, where each token can exploit a different subset of expert knowledge.
- Sample-level gating: A single gating decision is made per input sample [49]. This reduces routing overhead but may limit expressivity.
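The practical difference is largely the shape of the gating tensor; a minimal PyTorch sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, d_model, num_experts = 2, 5, 16, 4
x = torch.randn(batch, seq_len, d_model)
gate = nn.Linear(d_model, num_experts)

# Token-level: one routing distribution per token -> (batch, seq_len, num_experts)
token_gates = F.softmax(gate(x), dim=-1)

# Sample-level: pool the sequence first, one distribution per sample -> (batch, num_experts)
sample_gates = F.softmax(gate(x.mean(dim=1)), dim=-1)
```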
4.9. Auxiliary Losses for Routing Stability
4.10. Routing in Multi-Modal and Multi-Task Settings
4.11. Challenges in Routing Design
- Optimization instability: Discrete and sparse routing leads to non-smooth loss surfaces [51].
- Expert underutilization: Without proper regularization, experts may remain idle or collapse to similar behaviors.
- Inference overhead: Routing introduces computational and communication bottlenecks during deployment.
- Routing bias: Gating functions may overfit to training patterns, reducing generalization.
5. Training Challenges and Optimization Techniques
5.1. Expert Collapse and Load Imbalance
- Entropy regularization: Maximizing the entropy of the gating distribution to prevent early convergence to degenerate solutions [56].
- KL divergence to uniform: Penalizing divergence from uniform routing to spread load more evenly.
- Temperature annealing: Starting with high-temperature (soft) routing and gradually reducing it to allow smoother transitions in expert utilization.
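The entropy and KL-to-uniform penalties, for example, reduce to a few lines given the per-token gate probabilities; the sketch below assumes PyTorch, a (tokens, experts) probability matrix, and illustrative loss coefficients.

```python
import torch

def entropy_regularizer(gate_probs: torch.Tensor) -> torch.Tensor:
    """Negative mean entropy of the per-token routing distributions;
    minimizing it encourages higher-entropy (less peaked) gating."""
    entropy = -(gate_probs * (gate_probs + 1e-9).log()).sum(dim=-1)
    return -entropy.mean()

def kl_to_uniform(gate_probs: torch.Tensor) -> torch.Tensor:
    """KL divergence between the batch-averaged expert load and the
    uniform distribution; penalizes uneven expert utilization."""
    load = gate_probs.mean(dim=0)                        # average probability mass per expert
    uniform = torch.full_like(load, 1.0 / load.numel())
    return (load * ((load + 1e-9) / uniform).log()).sum()

gate_probs = torch.softmax(torch.randn(32, 8), dim=-1)   # (tokens, experts)
aux_loss = 0.01 * entropy_regularizer(gate_probs) + 0.01 * kl_to_uniform(gate_probs)
```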
5.2. Gradient Sparsity and Routing Discontinuities
- Stochastic routing: Adding noise to gating logits (e.g., Noisy Top-k) encourages broader exploration and smoother gradients.
- Soft or continuous approximations: Methods like Gumbel-softmax allow for differentiable approximations to discrete routing decisions.
- Warm-up schedules: Begin training with dense or soft routing and gradually increase sparsity to allow experts to develop meaningful representations [57].
- Straight-through estimators (STE): Use hard routing in the forward pass and a soft surrogate for the backward pass to stabilize learning [58].
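As one concrete instance of the soft-approximation and STE ideas above, the sketch below uses PyTorch's `F.gumbel_softmax` with `hard=True`, which applies a hard one-hot assignment in the forward pass while letting gradients flow through the soft relaxation; it is an illustration, not a reproduction of any cited estimator.

```python
import torch
import torch.nn.functional as F

def gumbel_top1_route(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable top-1 routing: hard one-hot assignment in the forward
    pass, soft Gumbel-softmax gradients in the backward pass (STE-style)."""
    return F.gumbel_softmax(logits, tau=tau, hard=True)

logits = torch.randn(4, 8, requires_grad=True)
assignments = gumbel_top1_route(logits)       # one-hot rows, shape (4, 8)
expert_values = torch.randn(8)                # stand-in per-expert outputs
loss = (assignments * expert_values).sum()
loss.backward()                               # non-zero gradients reach the routing logits
```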
5.3. Training Instability and Routing Bias
- Slow gating updates: Apply lower learning rates to gating network parameters to reduce oscillations [60].
- Batch-level balancing: Normalize routing scores within mini-batches to reduce variance across training steps [61].
- Frozen routing phases: Temporarily freeze gating decisions to stabilize learning within expert subnetworks.
- Expert normalization: Use batch norm or layer norm within experts to mitigate shifts in input distributions [62].
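Slow gating updates are typically realized with per-parameter-group learning rates; a minimal sketch with stand-in modules and illustrative values:

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be an MoE layer's experts and gate.
experts = nn.ModuleList(nn.Linear(16, 16) for _ in range(4))
gate = nn.Linear(16, 4)

optimizer = torch.optim.AdamW(
    [
        {"params": experts.parameters(), "lr": 1e-3},
        {"params": gate.parameters(), "lr": 1e-4},   # slower updates for the router
    ]
)
```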
5.4. Expert Capacity and Token Dropping
5.5. Parallelism and Distributed Training
- Expert parallelism: Experts are sharded across devices, enabling scalable computation with local memory constraints [68].
- All-to-all token dispatching: After routing, tokens must be sent to the appropriate device holding the selected experts.
- Communication-efficient routing: Use compact routing indices and fused communication kernels to reduce latency [69].
- Gradient accumulation: Accumulate gradients locally before global synchronization to minimize network traffic.
5.6. Regularization and Generalization
- Dropout at expert level: Randomly deactivate entire experts during training [70].
- Expert dropout / switch dropout: Randomly mask or shuffle expert assignments to encourage robustness.
- Weight sharing across experts: Partially share parameters (e.g., input/output layers) to reduce overparameterization.
- Inter-expert consistency losses: Penalize disagreement among experts for similar inputs to enforce coherent behavior.
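Expert-level dropout can be approximated by masking whole experts out of the routing distribution during training; the following sketch is illustrative and does not reproduce any specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def expert_dropout_gating(logits: torch.Tensor, drop_prob: float = 0.1,
                          training: bool = True) -> torch.Tensor:
    """Randomly disable whole experts during training by masking their logits
    before the softmax, forcing the router to spread load across survivors."""
    if training and drop_prob > 0:
        keep = torch.rand(logits.shape[-1], device=logits.device) > drop_prob
        if not keep.any():                                   # never drop every expert
            keep[torch.randint(0, logits.shape[-1], (1,))] = True
        logits = logits.masked_fill(~keep, float("-inf"))
    return F.softmax(logits, dim=-1)

gates = expert_dropout_gating(torch.randn(32, 8), drop_prob=0.2)
```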
5.7. Curriculum Learning and Progressive Sparsification
5.8. Monitoring and Debugging MoE Training
- Expert utilization histograms: Visualize token assignments across experts.
- Routing entropy metrics: Quantify the diversity and sharpness of gating outputs [74].
- Token routing traces: Track token flow through experts over time [75].
- Gradient heatmaps: Identify sparsity and imbalance in expert updates.
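Utilization histograms and routing entropy, in particular, can be computed directly from the per-token gate probabilities; a sketch with illustrative shapes:

```python
import torch

def routing_diagnostics(gate_probs: torch.Tensor):
    """Basic MoE health metrics from per-token routing probabilities
    of shape (tokens, experts)."""
    top1 = gate_probs.argmax(dim=-1)
    utilization = torch.bincount(top1, minlength=gate_probs.shape[-1]).float()
    utilization /= utilization.sum()                                     # fraction of tokens per expert
    entropy = -(gate_probs * (gate_probs + 1e-9).log()).sum(-1).mean()   # routing sharpness
    return {"utilization": utilization, "mean_routing_entropy": entropy}

stats = routing_diagnostics(torch.softmax(torch.randn(256, 8), dim=-1))
print(stats["utilization"], stats["mean_routing_entropy"])
```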
5.9. Summary of Training Strategies
6. Applications of Mixture of Experts
6.1. Natural Language Processing (NLP)
Language Modeling and Pretraining.
Multilingual Models.
Fine-tuning and Adaptation.
6.2. Computer Vision
Vision Transformers (ViTs) with MoE.
Multi-Scale and Region-Based Routing.
Challenges in Vision.
6.3. Speech and Audio Processing
Speaker and Dialect Specialization.
Conditional Generative Models.
6.4. Multi-Modal Learning
Modality-Specific Experts.
Cross-Modal Routing.
Alignment and Synchronization.
6.5. Personalization and Recommendation Systems
User/Item-Specific Experts.
Scalability and Latency.
6.6. Reinforcement Learning and Robotics
Policy Composition.
Exploration and Diversity.
6.7. Emerging Use Cases and Research Directions
- Federated learning: Experts are trained locally on decentralized data, and gating coordinates knowledge sharing across clients.
- Continual learning: MoEs offer a framework for allocating new experts to novel tasks without catastrophic forgetting [97].
- Neurosymbolic systems: Experts implement differentiable modules for logic, planning, or rule-based inference.
6.8. Summary and Deployment Considerations
7. Open Challenges and Future Directions
7.1. Scalability Limits and Inference Trade-offs
- Communication overhead: All-to-all dispatching between devices becomes a bottleneck at extreme scales [100].
- Expert duplication: Memory footprint grows linearly with the number of experts unless sharing mechanisms are introduced [101].
- Batch fragmentation: Sparse routing can lead to uneven mini-batch sizes per expert, impacting throughput and utilization [102].
- Latency variance: Routing-dependent compute paths introduce variability in inference latency, which is problematic in real-time systems [103].
Future Directions.
7.2. Generalization vs. Specialization Trade-off
Research Questions.
- What is the optimal number of experts per input distribution?
- How does expert specialization impact robustness to distribution shift?
- Can we learn interpretable expert functions without explicit supervision [107]?
Directions Forward.
7.3. Routing Interpretability and Trustworthiness
Key Challenges.
Possible Solutions.
7.4. Sparse Training and Inference Synergy
Research Goals.
- Align sparse training dynamics with efficient inference pipelines [113].
- Develop sparsity-aware hardware primitives and compilers.
- Train MoEs with explicit cost-aware objectives (e.g., latency, FLOPs) [114].
7.5. Foundation Models and Modular Scaling
- Expert reuse across tasks: How should experts be reused, adapted, or frozen across tasks [117]?
- Backward compatibility: Can expert modules be updated independently without retraining the full model?
- Composable modularity: Can experts be composed dynamically at runtime based on user intent or context [118]?
7.6. Continual, Lifelong, and Federated Learning
Open Problems.
- Expert proliferation: How can we limit unbounded expert growth over time?
- Dynamic capacity allocation: Can the model adaptively allocate compute to evolving distributions [119]?
- Federated MoEs: Can experts be learned locally across devices and aggregated without centralization [120]?
7.7. Neurosymbolic and Compositional Reasoning
Challenges and Questions.
- Can routing networks learn to compose experts to simulate algorithmic pipelines [123]?
- How do we supervise or regularize symbolic behavior in experts?
- What are the limits of compositionality achievable with MoEs [124]?
7.8. Unified Evaluation Benchmarks
Proposal.
- Diverse tasks across NLP, vision, speech, and RL.
- Routing diagnostics and expert efficiency metrics [127].
- Stress tests for robustness, sparsity, and capacity overload.
7.9. Summary
8. Conclusion
References
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural computation 1991, 3, 79–87. [Google Scholar] [CrossRef]
- Shahbaba, B.; Neal, R. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research 2009, 10. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 2022, 23, 1–39. [Google Scholar]
- Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar]
- Shuster, K.; Xu, J.; Komeili, M.; Ju, D.; Smith, E.M.; Roller, S.; Ung, M.; Chen, M.; Arora, K.; Lane, J.; et al. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv 2022, arXiv:2208.03188. [Google Scholar]
- Gross, S.; Ranzato, M.; Szlam, A. Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6865–6873.
- Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.M.; Le, Q.V.; Laudon, J.; et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 2022, 35, 7103–7114. [Google Scholar]
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
- Tan, S.; Shen, Y.; Chen, Z.; Courville, A.; Gan, C. Sparse Universal Transformer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 169–179.
- Xu, J.; Lai, J.; Huang, Y. MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models. arXiv 2024, arXiv:2405.13053. [Google Scholar]
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Shazeer, N. Glu variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
- Singh, S.; Ruwase, O.; Awan, A.A.; Rajbhandari, S.; He, Y.; Bhatele, A. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training. In Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 203–214.
- Zheng, N.; Jiang, H.; Zhang, Q.; Han, Z.; Ma, L.; Yang, Y.; Yang, F.; Zhang, C.; Qiu, L.; Yang, M.; et al. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 331–347.
- Artetxe, M.; Bhosale, S.; Goyal, N.; Mihaylov, T.; Ott, M.; Shleifer, S.; Lin, X.V.; Du, J.; Iyer, S.; Pasunuru, R.; et al. Efficient large scale language modeling with mixtures of experts. arXiv 2021, arXiv:2112.10684. [Google Scholar]
- Qwen Team. Introducing Qwen1.5, 2024.
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 2018, 41, 423–443. [Google Scholar] [CrossRef]
- Wu, J.; Hu, X.; Wang, Y.; Pang, B.; Soricut, R. Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts. arXiv 2023, arXiv:2312.00968. [Google Scholar]
- Shazeer, N.; Cheng, Y.; Parmar, N.; Tran, D.; Vaswani, A.; Koanantakool, P.; Hawkins, P.; Lee, H.; Hong, M.; Young, C.; et al. Mesh-tensorflow: Deep learning for supercomputers. Advances in neural information processing systems 2018, 31. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Tang, H.; Liu, J.; Zhao, M.; Gong, X. Progressive Layered Extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 331–34.
- Gou, Y.; Liu, Z.; Chen, K.; Hong, L.; Xu, H.; Li, A.; Yeung, D.Y.; Kwok, J.T.; Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv 2023, arXiv:2312.12379. [Google Scholar]
- Chen, T.; Zhang, Z.; Jaiswal, A.K.; Liu, S.; Wang, Z. Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers. In Proceedings of the The Eleventh International Conference on Learning Representations; 2023. [Google Scholar]
- Zhang, B.; Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems 2019, 32. [Google Scholar]
- Puigcerver, J.; Ruiz, C.R.; Mustafa, B.; Houlsby, N. From Sparse to Soft Mixtures of Experts. In Proceedings of the The Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
- Lialin, V.; Deshpande, V.; Rumshisky, A. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv 2023, arXiv:2303.15647. [Google Scholar]
- Lin, B.; Tang, Z.; Ye, Y.; Cui, J.; Zhu, B.; Jin, P.; Zhang, J.; Ning, M.; Yuan, L. Moe-llava: Mixture of experts for large vision-language models. arXiv 2024, arXiv:2401.15947. [Google Scholar]
- Jiang, C.; Tian, Y.; Jia, Z.; Zheng, S.; Wu, C.; Wang, Y. Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping. arXiv 2024, arXiv:2404.19429. [Google Scholar]
- Huang, C.; Liu, Q.; Lin, B.Y.; Pang, T.; Du, C.; Lin, M. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv 2023, arXiv:2307.13269. [Google Scholar]
- Wang, Y.; Agarwal, S.; Mukherjee, S.; Liu, X.; Gao, J.; Awadallah, A.H.; Gao, J. AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y.; Kozareva, Z.; Zhang, Y., Eds., Abu Dhabi, United Arab Emirates, 2022; pp. 5744–5760. [CrossRef]
- Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C.A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 2022, 35, 1950–1965. [Google Scholar]
- Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1930–1939.
- Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE; 2020; pp. 1–16. [Google Scholar]
- Theis, L.; Bethge, M. Generative image modeling using spatial lstms. Advances in neural information processing systems 2015, 28. [Google Scholar]
- Zheng, Y.; Wang, D.X. A survey of recommender systems with multi-objective optimization. Neurocomputing 2022, 474, 141–153. [Google Scholar] [CrossRef]
- Rosenbaum, C.; Klinger, T.; Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv 2017, arXiv:1711.01239. [Google Scholar]
- Li, Y.; Hui, B.; Yin, Z.; Yang, M.; Huang, F.; Li, Y. PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13402–13416.
- Zuo, S.; Zhang, Q.; Liang, C.; He, P.; Zhao, T.; Chen, W. MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 1610–1623.
- Zhou, Y.; Du, N.; Huang, Y.; Peng, D.; Lan, C.; Huang, D.; Shakeri, S.; So, D.; Dai, A.M.; Lu, Y.; et al. Brainformers: Trading simplicity for efficiency. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 42531–42542. [Google Scholar]
- He, J.; Qiu, J.; Zeng, A.; Yang, Z.; Zhai, J.; Tang, J. Fastmoe: A fast mixture-of-expert training system. arXiv 2021, arXiv:2103.13262. [Google Scholar]
- Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
- Wu, S.; Luo, J.; Chen, X.; Li, L.; Zhao, X.; Yu, T.; Wang, C.; Wang, Y.; Wang, F.; Qiao, W.; et al. Yuan 2.0-M32: Mixture of Experts with Attention Router. arXiv 2024, arXiv:2405.17976. [Google Scholar]
- Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar]
- Zuo, S.; Liu, X.; Jiao, J.; Kim, Y.J.; Hassan, H.; Zhang, R.; Gao, J.; Zhao, T. Taming Sparsely Activated Transformer with Stochastic Experts. In Proceedings of the International Conference on Learning Representations; 2021. [Google Scholar]
- Hwang, C.; Cui, W.; Xiong, Y.; Yang, Z.; Liu, Z.; Hu, H.; Wang, Z.; Salas, R.; Jose, J.; Ram, P.; et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems 2023, 5. [Google Scholar]
- Zoph, B.; Bello, I.; Kumar, S.; Du, N.; Huang, Y.; Dean, J.; Shazeer, N.; Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv 2022, arXiv:2202.08906. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [Google Scholar]
- Diao, S.; Xu, T.; Xu, R.; Wang, J.; Zhang, T. Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models’ Memories. In Proceedings of the The 61st Annual Meeting Of The Association For Computational Linguistics; 2023. [Google Scholar]
- Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
- Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv 2021, arXiv:2103.03874. [Google Scholar]
- Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597.
- Ren, J.; Rajbhandari, S.; Aminabadi, R.Y.; Ruwase, O.; Yang, S.; Zhang, M.; Li, D.; He, Y. {Zero-offload}: Democratizing {billion-scale} model training. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21); 2021; pp. 551–564. [Google Scholar]
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. International Journal of Computer Vision 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning. PMLR; 2019; pp. 2790–2799. [Google Scholar]
- Mustafa, B.; Riquelme, C.; Puigcerver, J.; Jenatton, R.; Houlsby, N. Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems 2022, 35, 9564–9576. [Google Scholar]
- Gao, Z.F.; Liu, P.; Zhao, W.X.; Lu, Z.Y.; Wen, J.R. Parameter-efficient mixture-of-experts architecture for pre-trained language models. arXiv 2022, arXiv:2203.01104. [Google Scholar]
- Choi, J.Y.; Kim, J.; Park, J.H.; Mok, W.L.; Lee, S. SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts. In Proceedings of the The 2023 Conference on Empirical Methods in Natural Language Processing; 2023. [Google Scholar]
- Muqeeth, M.; Liu, H.; Raffel, C. Soft merging of experts with adaptive routing. arXiv 2023, arXiv:2306.03745. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Dou, S.; Zhou, E.; Liu, Y.; Gao, S.; Zhao, J.; Shen, W.; Zhou, Y.; Xi, Z.; Wang, X.; Fan, X.; et al. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment. arXiv 2023, arXiv:2312.09979. [Google Scholar]
- Zhu, Y.; Wichers, N.; Lin, C.C.; Wang, X.; Chen, T.; Shu, L.; Lu, H.; Liu, C.; Luo, L.; Chen, J.; et al. Sira: Sparse mixture of low rank adaptation. arXiv 2023, arXiv:2311.09179. [Google Scholar]
- Nie, X.; Miao, X.; Cao, S.; Ma, L.; Liu, Q.; Xue, J.; Miao, Y.; Liu, Y.; Yang, Z.; Cui, B. Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate. arXiv 2021, arXiv:2112.14397. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
- Liu, Q.; Wu, X.; Zhao, X.; Zhu, Y.; Xu, D.; Tian, F.; Zheng, Y. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv 2023, arXiv:2310.18339. [Google Scholar]
- Chen, Z.; Shen, Y.; Ding, M.; Chen, Z.; Zhao, H.; Learned-Miller, E.G.; Gan, C. Mod-squad: Designing mixtures of experts as modular multi-task learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4582–4597.
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- Shen, L.; Wu, Z.; Gong, W.; Hao, H.; Bai, Y.; Wu, H.; Wu, X.; Bian, J.; Xiong, H.; Yu, D.; et al. Se-moe: A scalable and efficient mixture-of-experts distributed training and inference system. arXiv 2022, arXiv:2205.10034. [Google Scholar]
- Shen, S.; Hou, L.; Zhou, Y.; Du, N.; Longpre, S.; Wei, J.; Chung, H.W.; Zoph, B.; Fedus, W.; Chen, X.; et al. Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv 2023, arXiv:2305.14705. [Google Scholar]
- Dat, D.H.; Mao, P.Y.; Nguyen, T.H.; Buntine, W.; Bennamoun, M. HOMOE: A Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture of Experts. arXiv 2023, arXiv:2311.14747. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30. [Google Scholar]
- Tan, S.; Shen, Y.; Panda, R.; Courville, A. Scattered Mixture-of-Experts Implementation. arXiv 2024, arXiv:2403.08245. [Google Scholar]
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv 2022, arXiv:2203.06904. [Google Scholar]
- Li, D.; Ma, Y.; Wang, N.; Cheng, Z.; Duan, L.; Zuo, J.; Yang, C.; Tang, M. MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv 2024, arXiv:2404.15159. [Google Scholar]
- Zhu, J.; Zhu, X.; Wang, W.; Wang, X.; Li, H.; Wang, X.; Dai, J. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems 2022, 35, 2664–2678. [Google Scholar]
- Li, Z.; You, C.; Bhojanapalli, S.; Li, D.; Rawat, A.S.; Reddi, S.J.; Ye, K.; Chern, F.; Yu, F.; Guo, R.; et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. arXiv 2022, arXiv:2210.06313. [Google Scholar]
- Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. Glam: Efficient scaling of language models with mixture-of-experts. In Proceedings of the International Conference on Machine Learning. PMLR; 2022; pp. 5547–5569. [Google Scholar]
- Chen, G.; Liu, F.; Meng, Z.; Liang, S. Revisiting Parameter-Efficient Tuning: Are We Really There Yet? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y.; Kozareva, Z.; Zhang, Y., Eds., Abu Dhabi, United Arab Emirates; 2022; pp. 2612–2626. [Google Scholar] [CrossRef]
- OpenAI. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt, 2022.
- Bahng, H.; Jahanian, A.; Sankaranarayanan, S.; Isola, P. Exploring visual prompts for adapting large-scale models. arXiv 2022, arXiv:2203.17274. [Google Scholar]
- Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research 2021, 22, 1–48. [Google Scholar]
- Xue, F.; He, X.; Ren, X.; Lou, Y.; You, Y. One student knows all experts know: From sparse to dense. arXiv 2022, arXiv:2201.10890. [Google Scholar]
- Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 2021, 34, 8583–8595. [Google Scholar]
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5579–5588.
- Ostapenko, O.; Caccia, L.; Su, Z.; Le Roux, N.; Charlin, L.; Sordoni, A. A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts. In Proceedings of the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following; 2023. [Google Scholar]
- Zhang, Z.; Zeng, Z.; Lin, Y.; Xiao, C.; Wang, X.; Han, X.; Liu, Z.; Xie, R.; Sun, M.; Zhou, J. Emergent Modularity in Pre-trained Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 4066–4083. [Google Scholar]
- Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv 2024, arXiv:2406.00515. [Google Scholar]
- Zhao, H.; Qiu, Z.; Wu, H.; Wang, Z.; He, Z.; Fu, J. HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts. arXiv 2024, arXiv:2402.12656. [Google Scholar]
- Hu, E.J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations; 2021. [Google Scholar]
- Qi, P.; Wan, X.; Huang, G.; Lin, M. Zero Bubble Pipeline Parallelism. In Proceedings of the The Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [Google Scholar]
- Li, M.; Gururangan, S.; Dettmers, T.; Lewis, M.; Althoff, T.; Smith, N.A.; Zettlemoyer, L. Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models. In Proceedings of the First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022, 2022. [Google Scholar]
- Yang, A.; Lin, J.; Men, R.; Zhou, C.; Jiang, L.; Jia, X.; Wang, A.; Zhang, J.; Wang, J.; Li, Y.; et al. M6-t: Exploring sparse expert models and beyond. arXiv 2021, arXiv:2105.15082. [Google Scholar]
- Sukhbaatar, S.; Golovneva, O.; Sharma, V.; Xu, H.; Lin, X.V.; Rozière, B.; Kahn, J.; Li, D.; Yih, W.t.; Weston, J.; et al. Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM. arXiv 2024, arXiv:2403.07816. [Google Scholar]
- Yoo, K.M.; Han, J.; In, S.; Jeon, H.; Jeong, J.; Kang, J.; Kim, H.; Kim, K.M.; Kim, M.; Kim, S.; et al. HyperCLOVA X Technical Report. arXiv 2024, arXiv:2404.01954. [Google Scholar]
- Guo, Y.; Cheng, Z.; Tang, X.; Lin, T. Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models. arXiv 2024, arXiv:2405.14297. [Google Scholar]
- Shi, S.; Pan, X.; Chu, X.; Li, B. PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining. In Proceedings of the IEEE INFOCOM 2023-IEEE Conference on Computer Communications. IEEE; 2023; pp. 1–10. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 2019, 32. [Google Scholar]
- Zhong, Z.; Xia, M.; Chen, D.; Lewis, M. Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training. arXiv 2024, arXiv:2405.03133. [Google Scholar]
- Chen, S.; Jie, Z.; Ma, L. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv 2024, arXiv:2401.16160. [Google Scholar]
- Komatsuzaki, A.; Puigcerver, J.; Lee-Thorp, J.; Ruiz, C.R.; Mustafa, B.; Ainslie, J.; Tay, Y.; Dehghani, M.; Houlsby, N. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. In Proceedings of the The Eleventh International Conference on Learning Representations; 2022. [Google Scholar]
- Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 1992, 8, 229–256. [Google Scholar] [CrossRef]
- Wang, H.; Polo, F.M.; Sun, Y.; Kundu, S.; Xing, E.; Yurochkin, M. Fusing Models with Complementary Expertise. In Proceedings of the The Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
- He, S.; Dong, D.; Ding, L.; Li, A. Demystifying the Compression of Mixture-of-Experts Through a Unified Framework. arXiv 2024, arXiv:2406.02500. [Google Scholar]
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 2019, 32. [Google Scholar]
- Almahairi, A.; Ballas, N.; Cooijmans, T.; Zheng, Y.; Larochelle, H.; Courville, A. Dynamic capacity networks. In Proceedings of the International Conference on Machine Learning. PMLR; 2016; pp. 2549–2558. [Google Scholar]
- Liu, Y.; Zhang, R.; Yang, H.; Keutzer, K.; Du, Y.; Du, L.; Zhang, S. Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning. arXiv 2024, arXiv:2404.08985. [Google Scholar]
- Zhang, Z.; Xia, Y.; Wang, H.; Yang, D.; Hu, C.; Zhou, X.; Cheng, D. MPMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism. IEEE Transactions on Parallel and Distributed Systems 2024. [Google Scholar] [CrossRef]
- Nie, X.; Zhao, P.; Miao, X.; Zhao, T.; Cui, B. HetuMoE: An efficient trillion-scale mixture-of-expert distributed training system. arXiv 2022, arXiv:2203.14685. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 2023, 24, 1–113. [Google Scholar]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar]
- Raposo, D.; Ritter, S.; Richards, B.; Lillicrap, T.; Humphreys, P.C.; Santoro, A. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. arXiv 2024, arXiv:2404.02258. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models are Zero-Shot Learners. In Proceedings of the International Conference on Learning Representations; 2021. [Google Scholar]
- Nie, X.; Miao, X.; Wang, Z.; Yang, Z.; Xue, J.; Ma, L.; Cao, G.; Cui, B. Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement. Proceedings of the ACM on Management of Data 2023, 1, 1–19. [Google Scholar] [CrossRef]
- Shen, Y.; Guo, Z.; Cai, T.; Qin, Z. JetMoE: Reaching Llama2 Performance with 0.1 M Dollars. arXiv 2024, arXiv:2404.07413. [Google Scholar]
- DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434.
- Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; Gai, K. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval; 2018; pp. 1137–1140. [Google Scholar]
- Wu, X.; Huang, S.; Wei, F. Mixture of LoRA Experts. In Proceedings of the The Twelfth International Conference on Learning Representations; 2024. [Google Scholar]
- Gururangan, S.; Lewis, M.; Holtzman, A.; Smith, N.A.; Zettlemoyer, L. DEMix Layers: Disentangling Domains for Modular Language Modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 5557–5576.
- Mao, Y.; Mathias, L.; Hou, R.; Almahairi, A.; Ma, H.; Han, J.; Yih, S.; Khabsa, M. UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6253–626.
- Li, S.; Xue, F.; Baranwal, C.; Li, Y.; You, Y. Sequence parallelism: Long sequence training from system perspective. arXiv 2021, arXiv:2105.13120. [Google Scholar]
- Roller, S.; Sukhbaatar, S.; Weston, J.; et al. Hash layers for large sparse models. Advances in Neural Information Processing Systems 2021, 34, 17555–17566. [Google Scholar]
- Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432. [Google Scholar]
- Chen, W.; Zhou, Y.; Du, N.; Huang, Y.; Laudon, J.; Chen, Z.; Cui, C. Lifelong language pretraining with distribution-specialized experts. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 5383–5395. [Google Scholar]
- Shi, C.; Yang, C.; Zhu, X.; Wang, J.; Wu, T.; Li, S.; Cai, D.; Yang, Y.; Meng, Y. Unchosen Experts Can Contribute Too: Unleashing MoE Models’ Power by Self-Contrast. arXiv 2024, arXiv:2405.14507. [Google Scholar]
- Uppal, S.; Bhagat, S.; Hazarika, D.; Majumder, N.; Poria, S.; Zimmermann, R.; Zadeh, A. Multimodal research in vision and language: A review of current and emerging trends. Information Fusion 2022, 77, 149–171. [Google Scholar] [CrossRef]
| Challenge | Mitigation Strategies |
|---|---|
| Expert collapse | Load-balancing loss, entropy regularization, capacity limits |
| Sparse gradients | Gumbel-softmax, stochastic routing, STE, warm-up |
| Routing instability | Learning rate schedules, frozen routing, batch normalization |
| Overloaded experts | Token dropping, backup routing, re-ranking fallback |
| Communication bottlenecks | All-to-all dispatching, gradient accumulation, fused kernels |
| Overfitting | Expert dropout, weight sharing, inter-expert consistency losses |
| Domain | MoE Advantages | Challenges and Constraints |
|---|---|---|
| NLP | Scalability, language specialization | Load balancing, routing overhead |
| Vision | Token-wise routing, region specialization | Patch alignment, noise sensitivity |
| Speech | Speaker adaptability, temporal gating | Capacity planning, modality shifts |
| Multi-modal | Modality-specific routing, fusion flexibility | Synchronization, expert sharing |
| Personalization | User-tailored modeling, CTR boost | Latency, memory footprint |
| RL/Robotics | Skill modularity, transfer learning | Non-stationarity, exploration bias |
| Challenge | Future Research Directions |
|---|---|
| Scalability limits | Expert caching, sparse communication graphs, memory sharing |
| Specialization vs. generalization | Meta-routing, expert consistency regularization |
| Interpretability | Symbolic gating, routing visualization tools |
| Training-inference mismatch | Cost-aware training, RL-based routing policies |
| Modular scaling for FMs | Compositional experts, backward-compatible updates |
| Continual/federated learning | Dynamic expert allocation, local/global parameter isolation |
| Neuro-symbolic MoEs | Programmatic experts, differentiable interpreters |
| Benchmarking | Unified metrics, routing entropy, utilization variance |
