Submitted:
03 August 2025
Posted:
05 August 2025
Abstract
Keywords:
1. Introduction
1.1. Motivation for Model Compression
1.2. Model Compression as a Solution
Pruning
Quantization
Knowledge Distillation
Low-Rank Factorization
1.3. Challenges in Model Compression
1.4. Contributions of This Survey
- We present an in-depth analysis of key model compression methods, including pruning, quantization, knowledge distillation, and low-rank factorization, with a focus on their applicability to LLMs.
- We highlight state-of-the-art advancements in compression techniques, detailing their theoretical foundations, practical implementations, and empirical results.
- We explore emerging trends in model compression, such as lottery ticket hypothesis frameworks, sparse training techniques, and hardware-aware optimizations.
- We discuss practical challenges and trade-offs encountered in deploying compressed models for real-world NLP applications.
- We outline key open research questions and future directions, aiming to inspire continued progress in the field of efficient LLM inference [14].
2. Background
2.1. The Transformer Architecture
1. Multi-Head Self-Attention
2. Feedforward Networks
3. Positional Encoding
4. Layer Normalization and Residual Connections
5. Transformer Blocks
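To make these components concrete, the following minimal PyTorch sketch assembles them into a single pre-norm encoder block. The dimensions (`d_model`, `n_heads`, `d_ff`) are illustrative defaults rather than those of any specific LLM, and positional encoding is assumed to be applied before the block.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block: self-attention + FFN with residuals."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention with a residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise feedforward network with a residual connection.
        x = x + self.ffn(self.norm2(x))
        return x

x = torch.randn(2, 16, 512)          # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)   # torch.Size([2, 16, 512])
```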
2.2. Scaling Laws and the Growth of LLMs
- Memory Footprint: Large models require substantial memory capacity to store parameters, activations, and optimizer states, often demanding specialized hardware such as GPUs with extensive VRAM or TPUs (a back-of-the-envelope estimate follows this list).
- Inference Latency: Larger models require more computations per forward pass, resulting in slower inference speeds that hinder real-time applications.
- Energy Consumption: The computational demands of LLMs result in increased power consumption, raising environmental and economic concerns [24].
- Deployment Constraints: Deploying billion-parameter models on edge devices, mobile platforms, or environments with limited resources becomes highly impractical without substantial compression and optimization [25].
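As a rough illustration of the memory burden, the sketch below estimates weight-storage requirements at several numerical precisions. The parameter counts are illustrative, and the estimate covers weights only, excluding activations, KV caches, and optimizer states.

```python
def param_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to store the model weights."""
    return n_params * bytes_per_param / 1e9

for n_params in (7e9, 70e9):                         # illustrative model sizes
    for name, nbytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{n_params/1e9:.0f}B params @ {name}: "
              f"{param_memory_gb(n_params, nbytes):.1f} GB")
# 7B @ fp16 is already ~14 GB of weights alone, before activations or caches.
```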
2.3. The Need for Model Compression
1. Efficient Deployment
2. Reduced Latency
3. Lower Power Consumption
4. Cost Reduction
2.4. Categories of Model Compression Techniques
- Pruning: Removing redundant weights, neurons, or attention heads to reduce model size and computation.
- Quantization: Reducing the numerical precision of model parameters to minimize memory usage and accelerate computation.
- Knowledge Distillation: Training a smaller model (student) to replicate the behavior of a larger model (teacher), transferring knowledge to improve efficiency.
- Low-Rank Factorization: Decomposing large weight matrices into smaller components using techniques such as singular value decomposition (SVD) to reduce parameter redundancy.
2.5. Trade-Offs in Model Compression
1. Accuracy Degradation
2. Stability Challenges
3. Hardware Compatibility
2.6. Scope of This Survey
3. Compression Techniques for Large Language Models
3.1. Pruning
3.1.1. Types of Pruning
1. Unstructured Pruning
2. Structured Pruning
3. Dynamic Pruning
3.1.2. Pruning Strategies
- Post-training Pruning: Pruning is applied to a fully trained model, followed by fine-tuning to recover performance (a minimal sketch follows this list).
- During-training Pruning: Parameters are pruned progressively throughout the training process, often improving convergence stability.
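As a minimal illustration of post-training pruning, the sketch below applies unstructured L1 (magnitude) pruning to a single linear layer using PyTorch's built-in pruning utilities. The layer size and sparsity level are illustrative, and a real pipeline would prune an entire pretrained model and fine-tune afterwards.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in for one Transformer feedforward layer (illustrative sizes).
layer = nn.Linear(1024, 4096)

# Post-training unstructured pruning: zero the 50% of weights with the
# smallest magnitude (L1 criterion), then make the mask permanent.
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Layer sparsity after pruning: {sparsity:.1%}")
# In practice this step is followed by brief fine-tuning to recover accuracy.
```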
3.1.3. Notable Advancements in Pruning
3.2. Quantization
3.2.1. Types of Quantization
1. Post-Training Quantization (PTQ)
2. Quantization-Aware Training (QAT)
3. Mixed-Precision Quantization
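The sketch below illustrates the core arithmetic behind post-training quantization using a per-tensor symmetric int8 scheme. Production PTQ pipelines typically add calibration data, per-channel scales, and activation quantization; this is only the weight-rounding step.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-tensor symmetric post-training quantization to int8."""
    scale = w.abs().max() / 127.0                       # map max |w| to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                 # illustrative weight matrix
q, scale = quantize_int8(w)
err = (w - dequantize(q, scale)).abs().mean()
# int8 stores one byte per weight versus four bytes for fp32.
print(f"int8 storage: {q.numel() / 1e6:.0f} MB vs fp32 {w.numel() * 4 / 1e6:.0f} MB")
print(f"mean absolute quantization error: {err:.5f}")
```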
3.2.2. Hardware Compatibility
3.3. Knowledge Distillation
3.3.1. Types of Knowledge Distillation
1. Logit-Based Distillation
2. Feature-Based Distillation
3. Response-Based Distillation
3.3.2. Distillation Strategies
- Offline Distillation: The teacher is pre-trained, and its knowledge is transferred to the student through standard supervised learning (the loss typically used in this setting is sketched after this list).
- Online Distillation: Both teacher and student models are trained simultaneously, allowing the teacher to adapt dynamically during training [46].
- Self-Distillation: A single model trains itself by distilling knowledge from its earlier epochs into its later epochs [47].
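A minimal sketch of the loss used in offline, logit-based distillation is shown below: a temperature-scaled KL term between teacher and student logits combined with the standard cross-entropy on ground-truth labels. The temperature and mixing weight are illustrative hyperparameters, not values prescribed by any particular method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Logit-based KD: soft targets from the teacher + hard-label cross-entropy."""
    # KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)            # standard scaling for soft targets
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

student = torch.randn(8, 32000)       # (batch, vocab) -- illustrative sizes
teacher = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
print(distillation_loss(student, teacher, labels))
```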
3.4. Low-Rank Factorization
3.4.1. Matrix Decomposition Techniques
- Singular Value Decomposition (SVD): Decomposes a weight matrix into three factors (U, Σ, and Vᵀ) whose leading singular values approximate the original matrix with a lower-rank structure (see the sketch after this list).
- CP Decomposition: Factorizes tensors into a sum of component rank-one tensors, ideal for compressing Transformer weights [52].
- Tensor Train Decomposition: Factorizes tensors into smaller core tensors connected in a chain, achieving higher compression rates for extremely large models [53].
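The sketch below shows how truncated SVD can replace one linear layer with two smaller ones. The layer size and target rank are illustrative; in practice the rank is chosen per layer and the factorized model is fine-tuned to recover accuracy.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace W (out x in) with two rank-r factors via truncated SVD."""
    W = layer.weight.data                                  # shape (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].sqrt().unsqueeze(1)        # (rank, in)
    B = U[:, :rank] * S[:rank].sqrt()                      # (out, rank)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = A
    second.weight.data = B
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(4096, 4096)
compressed = factorize_linear(layer, rank=256)
orig = sum(p.numel() for p in layer.parameters())
new = sum(p.numel() for p in compressed.parameters())
print(f"parameters: {orig:,} -> {new:,}")   # roughly an 8x reduction at rank 256
```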
3.4.2. Challenges in Low-Rank Factorization
3.5. Hybrid Approaches
3.6. Summary
| Technique | Compression Ratio | Accuracy Impact | Hardware Efficiency |
|---|---|---|---|
| Pruning | High (Structured) | Moderate | Requires Sparse Support |
| Quantization | Moderate to High | Low (with QAT) | High |
| Knowledge Distillation | Moderate to High | Low (if well-trained) | High |
| Low-Rank Factorization | Moderate | Moderate | Requires Optimization |
4. Applications of Compressed Language Models
4.1. Real-Time Conversational Agents
4.1.1. Benefits of Compression
4.1.2. Case Study: Chatbot Deployment
4.2. Edge AI and Mobile Applications
4.2.1. Benefits of Compression
4.2.2. Case Study: Mobile Translation Models
4.3. Search Engines and Recommendation Systems
4.3.1. Benefits of Compression
4.3.2. Case Study: E-commerce Recommendations
4.4. Healthcare and Biomedical NLP
4.4.1. Benefits of Compression
4.4.2. Case Study: Medical QA Systems
4.5. Autonomous Systems and Robotics
4.5.1. Benefits of Compression
4.5.2. Case Study: Autonomous Vehicle Commands
4.6. Content Moderation and Social Media Analysis
4.6.1. Benefits of Compression
4.6.2. Case Study: Content Moderation on Facebook
4.7. Financial Services and Fraud Detection
4.7.1. Benefits of Compression
4.7.2. Case Study: Fraud Detection Models
4.8. Trade-Offs and Practical Considerations
- Accuracy vs. Efficiency: Aggressive compression may introduce performance degradation, requiring fine-tuning to restore accuracy.
- Hardware Constraints: Compressed models must be optimized for target platforms to ensure efficient execution on CPUs, GPUs, or TPUs.
- Data Distribution Shifts: Compressed models may be less robust to distributional shifts, necessitating periodic re-evaluation and retraining.
4.9. Summary
5. Emerging Trends in Model Compression
5.1. Hardware-Aware Compression
5.1.1. Techniques for Hardware Optimization
1. Block Sparse Pruning
2. Mixed-Precision Quantization
3. Operator Fusion
5.1.2. Case Study: NVIDIA TensorRT
5.2. Sparse Training Techniques
5.2.1. Popular Sparse Training Strategies
- Sparse MoE (Mixture of Experts): MoE architectures activate only a subset of expert layers during inference, reducing computational overhead while maintaining model flexibility.
- Dynamic Sparse Training (DST): DST gradually prunes and regrows model connections during training, identifying optimal sparse subnetworks without compromising performance (one prune-and-regrow step is sketched after this list).
- Rigged Lottery Ticket Training: This technique identifies sparse subnetworks early in training, directly optimizing the reduced model without additional pruning stages.
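As a rough sketch of the prune-and-regrow step at the heart of dynamic sparse training, the code below drops the weakest active weights by magnitude and regrows connections where gradient magnitudes are largest. It operates on a single weight tensor under simplified assumptions; real DST schedules also anneal the update fraction and apply the step across all layers.

```python
import torch

def prune_and_regrow(weight: torch.Tensor, grad: torch.Tensor,
                     mask: torch.Tensor, fraction: float = 0.1) -> torch.Tensor:
    """One DST update: drop the weakest active weights, regrow where
    gradients are largest among currently inactive positions."""
    active = mask.bool()
    k = int(fraction * active.sum().item())
    # Drop: remove the k active weights with the smallest magnitude.
    active_mags = weight.abs().masked_fill(~active, float("inf"))
    drop_idx = torch.topk(active_mags.flatten(), k, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0
    # Grow: re-enable the k inactive positions with the largest gradient.
    inactive_grads = grad.abs().masked_fill(active, float("-inf")).flatten()
    grow_idx = torch.topk(inactive_grads, k, largest=True).indices
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)

w = torch.randn(256, 256)
g = torch.randn(256, 256)
mask = (torch.rand(256, 256) < 0.2).float()      # start at ~20% density
mask = prune_and_regrow(w, g, mask)
print(f"density after update: {mask.mean():.2%}")  # density stays constant
```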
5.3. Neural Architecture Search (NAS) for Efficient Models
5.3.1. Techniques in NAS for Compression
- Weight-Sharing NAS: This method shares parameters across multiple candidate architectures to efficiently explore large design spaces [82].
- One-Shot NAS: A single supernet is trained, from which sub-networks can be extracted for efficient inference [83].
- Hardware-Aware NAS: This technique tailors model architectures to meet specific latency, power, or memory constraints on target hardware.
5.3.2. Case Study: Efficient Transformers
5.4. Low-Rank Adaptation (LoRA) and Parameter-Efficient Fine-Tuning (PEFT)
5.4.1. Key Characteristics of LoRA and PEFT
- Reduced Parameter Overhead: LoRA introduces trainable low-rank matrices into specific model layers, minimizing the number of updated parameters [85] (a minimal layer sketch follows this list).
- Fast Adaptation: PEFT methods enable efficient fine-tuning by freezing most of the original model parameters, accelerating training and reducing memory requirements.
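A minimal sketch of a LoRA-augmented linear layer is shown below: the base weight is frozen and only the low-rank factors A and B are trained, with the update scaled by α/r. The rank, scaling, and initialization are illustrative; real implementations (e.g., the `peft` library) add dropout, weight merging, and per-module targeting.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze W and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scaling * x A^T B^T  (low-rank update on top of W)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # well under 1% trainable
```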
5.4.2. Case Study: GPT Fine-Tuning with LoRA
5.5. Knowledge Distillation with Reinforcement Learning
5.5.1. RL-Based Distillation Strategies
- Reward-Guided Learning: Distillation loss functions are augmented with RL-inspired reward signals to guide the student toward robust decision-making.
- Adaptive Sampling: RL techniques dynamically adjust the sampling strategy for distillation data, prioritizing harder examples that maximize learning efficiency.
5.5.2. Case Study: RLHF in Distilled Models
5.6. Data-Free Model Compression
5.6.1. Popular Data-Free Methods
- Zero-Shot Pruning: This method prunes models based on intrinsic properties such as weight magnitude or gradient statistics without requiring labeled data.
- Synthetic Data Generation: Using generative models to create data distributions that mimic the original training set, enabling effective compression even without the original data.
5.6.2. Case Study: Privacy-Conscious NLP Systems
5.7. Continual Learning and Adaptive Compression
5.7.1. Adaptive Compression Techniques
- Elastic Parameter Growth: Dynamically allocates or prunes model parameters based on task complexity and evolving data distributions.
- Lifelong Distillation: Periodically distills knowledge from evolving teacher models to maintain student model robustness [90].
5.8. Summary and Future Directions
- Enhancing compression strategies for multimodal models that combine text, vision, and audio.
- Developing robust evaluation metrics that balance model size, inference speed, and predictive accuracy.
- Integrating compression techniques into mainstream NLP pipelines to enable scalable and sustainable AI deployment.
6. Evaluation Metrics and Benchmarking
6.1. Performance Metrics
6.1.1. Accuracy and Task Performance
- Perplexity (PPL): Commonly used in language modeling tasks, perplexity measures how well a model predicts unseen text; lower values indicate better performance [98] (a short computation sketch follows this list).
- F1 Score: Widely used for classification tasks, the F1 score balances precision and recall, ensuring compressed models retain predictive accuracy.
- Exact Match (EM): In QA systems and text generation tasks, EM measures the percentage of predictions that exactly match the expected output.
- BLEU, ROUGE, and METEOR: These metrics assess text generation quality by comparing generated text with reference outputs.
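For concreteness, the sketch below computes perplexity directly from next-token logits as the exponential of the mean token-level cross-entropy. The vocabulary and sequence sizes are illustrative, and random logits are used as a stand-in for a real causal language model.

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """PPL = exp(mean cross-entropy over predicted tokens)."""
    # Shift so that each position predicts the *next* token.
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = token_ids[:, 1:].reshape(-1)
    nll = F.cross_entropy(shifted_logits, targets)
    return torch.exp(nll).item()

vocab, seq = 32000, 128                          # illustrative sizes
logits = torch.randn(1, seq, vocab)              # stand-in for model output
tokens = torch.randint(0, vocab, (1, seq))
print(f"perplexity: {perplexity(logits, tokens):.1f}")
```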
6.1.2. Efficiency Metrics
- Model Size (MB/GB): The compressed model’s total size, which directly impacts storage and memory requirements (measured, together with parameter count and latency, in the sketch after this list).
- Parameter Count: The number of trainable model parameters, serving as a key indicator of model complexity [99].
- FLOPs (Floating Point Operations): FLOPs measure the computational cost of inference, providing insights into latency and energy efficiency [100].
- Inference Speed (ms/query): This metric directly reflects model responsiveness in real-world applications.
- Throughput (Tokens/Sec): Throughput measures the number of tokens processed per second, a critical factor for high-traffic NLP systems.
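Several of these metrics can be collected with a few lines of instrumentation, as in the sketch below, which reports parameter count, serialized size, and average wall-clock latency for a stand-in PyTorch model. FLOPs and token throughput are omitted for brevity, and the model and input sizes are illustrative.

```python
import os
import tempfile
import time
import torch
import torch.nn as nn

def efficiency_report(model: nn.Module, example: torch.Tensor, runs: int = 20):
    # Parameter count and serialized checkpoint size.
    n_params = sum(p.numel() for p in model.parameters())
    path = os.path.join(tempfile.gettempdir(), "compressed_model.pt")
    torch.save(model.state_dict(), path)
    size_mb = os.path.getsize(path) / 1e6
    # Average wall-clock latency per forward pass (CPU, single query).
    model.eval()
    with torch.no_grad():
        model(example)                                   # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
        latency_ms = (time.perf_counter() - start) / runs * 1e3
    return {"params": n_params, "size_mb": size_mb, "latency_ms": latency_ms}

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
print(efficiency_report(model, torch.randn(1, 1024)))
```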
6.1.3. Robustness and Generalization
- Out-of-Distribution (OOD) Accuracy: Measures how well a compressed model performs on data that deviates from the original training distribution [102].
- Adversarial Robustness: Evaluates the model’s resistance to crafted adversarial examples that exploit weaknesses in language understanding.
- Calibration Error: Compressed models may become overconfident or underconfident in their predictions. Calibration metrics assess how well model probabilities reflect true likelihoods [103].
6.1.4. Energy and Environmental Impact
- Energy Consumption (kWh): Measures the energy consumed during model inference, reflecting deployment efficiency.
- Carbon Footprint (CO2 Emissions): This metric estimates the environmental impact of running compressed models on large-scale infrastructures.
6.2. Benchmarking Frameworks
6.2.1. Popular NLP Benchmarks
- GLUE (General Language Understanding Evaluation): A comprehensive benchmark suite for evaluating NLP models across diverse tasks such as sentiment analysis, entailment, and paraphrase detection.
- SuperGLUE: An advanced version of GLUE designed for challenging NLP tasks that require deep reasoning and contextual understanding.
- Eloquence Benchmark (ELOQ): A specialized benchmark that evaluates compressed models for text generation fluency and coherence.
- MLPerf Inference Benchmark: This framework evaluates compressed models in real-time inference scenarios, measuring latency, throughput, and energy consumption.
6.2.2. Custom Benchmarking Pipelines
- Edge-AI Benchmarks: Tailored for mobile and IoT devices, these frameworks measure compressed model performance in constrained environments.
- Latency-Aware NLP Pipelines: Designed to evaluate the real-time responsiveness of compressed models in production environments.
6.3. Trade-Off Analysis in Model Compression
6.3.1. Accuracy vs. Latency Trade-Off
6.3.2. Energy vs. Throughput Trade-Off
6.3.3. Memory Footprint vs. Model Depth Trade-Off [109]
6.4. Best Practices for Model Evaluation
- Conduct evaluations on diverse datasets that capture real-world data distributions.
- Report multiple metrics, including both task performance and resource efficiency indicators.
- Compare results against strong baselines, such as uncompressed models or popular efficient architectures.
- Provide open-source code and evaluation pipelines to encourage transparency and reproducibility.
6.5. Summary
7. Deployment Strategies for Compressed Language Models
7.1. Infrastructure Considerations
7.1.1. Cloud-Based Deployment
Best Practices for Cloud Deployment
- Leverage auto-scaling features to dynamically adjust resource allocation based on request volume.
- Optimize compressed models using GPU-accelerated inference frameworks such as NVIDIA TensorRT, or dedicated accelerators such as AWS Inferentia, for improved latency [115].
- Employ model sharding techniques to distribute large language models across multiple cloud instances.
7.1.2. Edge and Mobile Deployment
Best Practices for Edge Deployment
- Use quantization techniques to reduce model precision for efficient execution on low-power processors [116] (see the sketch after this list).
- Deploy models with frameworks like TensorFlow Lite, ONNX Runtime, or Core ML, which are optimized for mobile and embedded environments.
- Implement caching mechanisms to reduce redundant computation and minimize latency.
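As one possible first step before conversion to a mobile runtime, the sketch below applies PyTorch's built-in dynamic quantization to the linear layers of a stand-in model. The architecture is illustrative; a real edge pipeline would continue with TensorFlow Lite, ONNX Runtime, or Core ML conversion as noted above.

```python
import torch
import torch.nn as nn

# A small stand-in model; a real deployment would start from a
# distilled or pruned Transformer checkpoint.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
model.eval()

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time -- well suited to CPU-bound targets.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)
# The quantized module can then be scripted (e.g., with TorchScript) or
# exported for a mobile-oriented runtime.
```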
7.1.3. On-Premises and Enterprise Systems
Best Practices for On-Premises Deployment
- Optimize compressed models for CPU inference by applying operator fusion and vectorized computations.
- Utilize containerization tools such as Docker or Kubernetes to streamline deployment and ensure scalability [117].
- Integrate monitoring tools to track model performance, detect data drift, and ensure stable inference behavior.
7.2. Serving Frameworks and Model APIs
7.2.1. Popular Serving Frameworks
- FastAPI: A lightweight Python framework ideal for serving compressed models with minimal latency overhead.
- NVIDIA Triton Inference Server: Optimized for multi-GPU and multi-model environments, Triton is effective for deploying compressed LLMs in scalable systems.
- TorchServe: Developed by the PyTorch team, TorchServe simplifies model packaging, inference, and scaling for production systems.
7.2.2. Model API Design Best Practices
- Implement batch inference to maximize throughput by processing multiple requests simultaneously [118] (a combined batching and async-serving sketch follows this list).
- Utilize asynchronous inference pipelines to manage concurrent user requests effectively [119].
- Optimize data serialization formats (e.g., Protocol Buffers, MessagePack) to minimize network overhead [120].
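The sketch below combines the first two practices in a minimal FastAPI service that batches concurrent requests over a short window before invoking the model. The `run_model` function, batch size, and batching window are hypothetical placeholders for illustration, not a production design.

```python
import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue: asyncio.Queue = asyncio.Queue()

class GenerateRequest(BaseModel):
    text: str

def run_model(texts: list[str]) -> list[str]:
    # Hypothetical placeholder for real (batched) model inference.
    return [f"echo: {t}" for t in texts]

async def batch_worker(max_batch: int = 8, window_s: float = 0.01) -> None:
    """Collect concurrent requests for a short window and run them as one batch."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]              # wait for the first request
        deadline = loop.time() + window_s
        while len(batch) < max_batch and loop.time() < deadline:
            try:
                timeout = max(deadline - loop.time(), 0.0)
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        outputs = run_model([text for text, _ in batch])
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

@app.on_event("startup")
async def start_worker() -> None:
    asyncio.create_task(batch_worker())

@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((req.text, fut))
    return {"output": await fut}
```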
7.3. Latency Optimization Techniques
7.3.1. Techniques for Reducing Latency
- Model Batching: Efficiently groups multiple inference requests, improving GPU utilization and reducing overhead.
- Pipeline Parallelism: Splits model layers across multiple devices to maximize parallel execution in large-scale models.
- Distillation-Aware Caching: Stores frequently queried outputs from a distilled student model to accelerate common predictions.
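A minimal sketch of distillation-aware caching is shown below, using an in-process LRU cache in front of a placeholder student model. A production system would normalize queries, bound staleness, and typically use an external cache such as Redis rather than a per-process cache.

```python
from functools import lru_cache

def student_model(query: str) -> str:
    """Placeholder for a (comparatively cheap) distilled student model."""
    return f"answer for: {query}"

@lru_cache(maxsize=10_000)
def cached_predict(query: str) -> str:
    # Identical queries after the first are served from the cache,
    # skipping model inference entirely.
    return student_model(query)

cached_predict("return policy?")      # cache miss -> runs the student model
cached_predict("return policy?")      # cache hit  -> no inference
print(cached_predict.cache_info())    # hits=1, misses=1, ...
```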
7.3.2. Case Study: Real-Time Translation Systems
7.4. Resource Management and Scaling
7.4.1. Strategies for Resource Optimization
- Dynamic Scaling: Automatically adjusts computing resources based on inference workload fluctuations [123].
- Load Balancing: Distributes model requests across multiple servers to ensure even resource utilization.
- Memory-Efficient Deployment: Offloads inactive model components to disk and loads them dynamically during inference.
7.4.2. Case Study: E-Commerce Product Search
7.5. Security and Privacy Considerations
7.5.1. Techniques for Secure Model Deployment
- Model Encryption: Encrypts model weights to prevent unauthorized access during deployment.
- Secure Enclaves: Hardware-based security solutions such as Intel SGX enable encrypted model execution in trusted environments [124].
- Adversarial Defense Mechanisms: Apply adversarial training or robust optimization techniques to improve resistance against malicious inputs [125].
7.5.2. Case Study: Healthcare NLP Systems
7.6. Continuous Integration and Deployment (CI/CD) Pipelines
7.6.1. Key CI/CD Strategies for Compressed Models
- Model Versioning: Track compressed model versions to enable seamless rollbacks in case of performance degradation [126].
- Canary Deployment: Gradually deploy compressed models to a small subset of users before full rollout.
- Performance Monitoring Tools: Tools like Prometheus and Grafana provide real-time insights into latency, throughput, and resource utilization.
7.7. Post-Deployment Evaluation and Maintenance
7.7.1. Best Practices for Post-Deployment Monitoring
- Regularly monitor accuracy and latency metrics to detect performance drift.
- Implement alerting systems to respond to unusual spikes in resource consumption or model errors [127].
- Periodically retrain compressed models to adapt to evolving data distributions.
7.8. Summary
8. Challenges and Open Research Directions
8.1. Preserving Model Performance Post-Compression
8.1.1. Catastrophic Performance Degradation
Research Directions:
- Develop adaptive compression algorithms that automatically balance model sparsity with performance preservation.
- Investigate hybrid approaches that combine knowledge distillation with selective pruning to minimize information loss.
- Explore techniques to incorporate uncertainty estimation in compressed models to improve their robustness under ambiguous inputs.
8.1.2. Fine-Tuning Challenges
Research Directions:
- Design lightweight fine-tuning strategies tailored for compressed architectures [134].
- Develop methods for transfer learning in compressed models to improve generalization without extensive retraining.
8.2. Trade-Offs Between Efficiency and Model Robustness
8.2.1. Adversarial Vulnerability
Research Directions:
8.2.2. Robustness to Data Drift
Research Directions:
- Design adaptive compression frameworks that update model parameters incrementally to respond to distribution shifts.
- Develop self-correcting mechanisms that detect and mitigate drift-induced errors during inference.
8.3. Scalability and Large-Scale Model Compression
8.3.1. Scaling Compression Algorithms
Research Directions:
- Design scalable pruning frameworks that exploit distributed computing for compressing massive models.
- Develop low-rank approximation algorithms that adaptively scale with increasing model size [137].
8.3.2. Efficient Distributed Training for Compressed Models
Research Directions:
- Investigate decentralized training strategies that reduce communication overhead during compression.
- Explore federated learning techniques tailored for compressed models to enhance scalability and privacy.
8.4. Hardware and Deployment Limitations
8.4.1. Hardware-Aware Compression
Research Directions:
- Develop hardware-aware compression algorithms that account for the memory hierarchy, cache structure, and computational constraints of modern hardware.
- Investigate compiler-level optimizations that improve the inference efficiency of compressed models.
8.4.2. Edge Device Constraints
Research Directions:
8.5. Fairness, Bias, and Ethical Considerations
8.5.1. Bias Amplification in Compressed Models
Research Directions:
8.5.2. Privacy Risks in Compressed Models
Research Directions:
- Explore privacy-preserving compression techniques that mitigate information leakage while reducing model size.
- Investigate differentially private compression methods that enhance data confidentiality during training and deployment.
8.6. Lack of Standardized Evaluation Frameworks
8.6.1. Evaluation Benchmark Gaps
Research Directions:
- Develop comprehensive benchmarking frameworks that integrate task accuracy, latency, energy efficiency, and robustness metrics.
- Design dynamic evaluation pipelines that assess compressed models in real-time production environments [145].
8.7. Summary
9. Conclusion and Future Directions
9.1. Key Insights
- Compression Trade-Offs: Effective compression requires balancing model size reduction with minimal performance degradation [149]. Techniques such as structured pruning and mixed-precision quantization have shown promising results in preserving model accuracy while significantly reducing computational overhead [150].
- Distillation as a Core Strategy: Knowledge distillation has proven highly effective in transferring the capabilities of large teacher models to smaller, student models. Combining distillation with other compression techniques has emerged as a powerful strategy for improving efficiency without sacrificing task performance [151].
- Hardware-Aware Optimization: Efficient deployment requires tailoring compression techniques to the capabilities of target hardware, including GPUs, TPUs, and edge devices. Hardware-aware pruning and quantization frameworks have demonstrated considerable improvements in latency and energy efficiency.
- Evaluation Beyond Accuracy: While task accuracy remains essential, comprehensive evaluation frameworks that include latency, throughput, and robustness metrics are necessary to assess compressed models’ real-world performance [152].
- Security and Fairness Concerns: As compressed models are increasingly deployed in sensitive domains, ensuring fairness, privacy, and robustness against adversarial threats is critical to their responsible use.
9.2. Future Directions
9.2.1. Adaptive and Dynamic Compression Techniques
Potential Research Directions:
- Develop compression frameworks that automatically adapt model sparsity based on workload characteristics.
- Explore dynamic quantization techniques that modify precision levels in response to changing resource constraints.
9.2.2. Energy-Efficient Compression Strategies
Potential Research Directions:
- Design compression-aware training algorithms that incorporate energy consumption as an optimization criterion.
- Develop model-serving frameworks that intelligently allocate hardware resources to minimize power usage.
9.2.3. Robust and Fair Compression Algorithms
Potential Research Directions:
- Develop compression techniques that explicitly preserve model fairness and mitigate bias.
- Investigate adversarially robust compression algorithms to ensure model security in high-risk environments [154].
9.2.4. Scalable Compression for Large-Scale Models
Potential Research Directions:
- Develop distributed compression frameworks that leverage parallel processing for scalable model reduction [156].
- Investigate low-rank approximation methods capable of compressing extremely large transformer architectures.
9.2.5. Cross-Modal Compression Techniques
Potential Research Directions:
- Design unified compression frameworks that efficiently compress multimodal representations.
- Develop task-aware compression strategies that adapt compression techniques to specific cross-modal tasks [158].
9.2.6. Benchmarking and Evaluation Frameworks
Potential Research Directions:
- Develop end-to-end benchmarking pipelines that assess compressed model performance under real-world conditions.
- Establish community-driven leaderboards to track advancements in compression research [160].
9.3. Final Remarks
References
- Chen, Z.; Gao, Q.; Bosselut, A.; Sabharwal, A.; Richardson, K. Disco: distilling counterfactuals with large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5514–5528.
- Kaushal, A.; Vaidhya, T.; Rish, I. Lord: Low rank decomposition of monolingual code llms for one-shot compression. arXiv preprint arXiv:2309.14021 2023.
- Sanh, V.; Wolf, T.; Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems 2020, 33, 20378–20389.
- Wu, X.; Yao, Z.; He, Y. ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. arXiv preprint arXiv:2307.09782 2023.
- Li, B.; Kong, Z.; Zhang, T.; Li, J.; Li, Z.; Liu, H.; Ding, C. Efficient transformer-based large scale language representations using hardware-friendly block structured pruning. arXiv preprint arXiv:2009.08065 2020.
- Gu, Y.; Zhang, S.; Usuyama, N.; Woldesenbet, Y.; Wong, C.; Sanapathi, P.; Wei, M.; Valluri, N.; Strandberg, E.; Naumann, T.; et al. Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events. arXiv preprint arXiv:2307.06439 2023.
- Li, X.; Meng, Y.; Zhou, M.; Han, Q.; Wu, F.; Li, J. Sac: Accelerating and structuring self-attention via sparse adaptive connection. Advances in Neural Information Processing Systems 2020, 33, 16997–17008.
- Dasgupta, S.; Cohn, T.; Baldwin, T. Cost-effective Distillation of Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023.
- Lee, J.H.; Kim, J.; Kwon, S.J.; Lee, D. FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization. arXiv preprint arXiv:2306.00317 2023.
- Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 2021.
- Liu, J.; Gong, R.; Wei, X.; Dong, Z.; Cai, J.; Zhuang, B. Qllm: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041 2023.
- Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020, Vol. 34, pp. 8815–8821.
- Kim, Y.J.; Henry, R.; Fahim, R.; Awadalla, H.H. Finequant: Unlocking efficiency with fine-grained weight-only quantization for llms. arXiv preprint arXiv:2308.09723 2023.
- Zhao, Z.; Liu, Y.; Chen, L.; Liu, Q.; Ma, R.; Yu, K. An investigation on different underlying quantization schemes for pre-trained language models. In Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October 14–18, 2020, Proceedings, Part I. Springer, 2020, pp. 359–371.
- Magister, L.C.; Mallinson, J.; Adamek, J.; Malmi, E.; Severyn, A. Teaching small language models to reason. arXiv preprint arXiv:2212.08410 2022.
- Prasanna, S.; Rogers, A.; Rumshisky, A. When BERT plays the lottery, all tickets are winning. arXiv preprint arXiv:2005.00561 2020.
- Chee, J.; Cai, Y.; Kuleshov, V.; De Sa, C. QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv preprint arXiv:2307.13304 2023.
- Srinivasan, V.; Gandhi, D.; Thakker, U.; Prabhakar, R. Training Large Language Models Efficiently with Sparsity and Dataflow. arXiv preprint arXiv:2304.05511 2023.
- Clark, A.; De Las Casas, D.; Guy, A.; Mensch, A.; Paganini, M.; Hoffmann, J.; Damoc, B.; Hechtman, B.; Cai, T.; Borgeaud, S.; et al. Unified scaling laws for routed language models. In Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 4057–4086.
- So, D.; Le, Q.; Liang, C. The evolved transformer. In Proceedings of the International conference on machine learning. PMLR, 2019, pp. 5877–5886.
- Zhu, Y.; Wichers, N.; Lin, C.C.; Wang, X.; Chen, T.; Shu, L.; Lu, H.; Liu, C.; Luo, L.; Chen, J.; et al. SiRA: Sparse Mixture of Low Rank Adaptation. ArXiv 2023, abs/2311.09179.
- Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang, Z.; Carbin, M. The lottery ticket hypothesis for pre-trained bert networks. Advances in neural information processing systems 2020, 33, 15834–15846.
- Cui, B.; Li, Y.; Zhang, Z. Joint structured pruning and dense knowledge distillation for efficient transformer model compression. Neurocomputing 2021, 458, 56–69. [CrossRef]
- Wu, S.; Chen, H.; Quan, X.; Wang, Q.; Wang, R. AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression. arXiv preprint arXiv:2305.10010 2023.
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 2017.
- Zoph, B.; Bello, I.; Kumar, S.; Du, N.; Huang, Y.; Dean, J.; Shazeer, N.; Fedus, W. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906 2022.
- Wang, L.; Yang, N.; Wei, F. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164 2023.
- Xin, J.; Tang, R.; Lee, J.; Yu, Y.; Lin, J.J. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020.
- Zniyed, Y.; Nguyen, T.P.; et al. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [CrossRef]
- Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3531–3539.
- Zadeh, A.H.; Edo, I.; Awad, O.M.; Moshovos, A. Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824.
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive mixtures of local experts. Neural computation 1991, 3, 79–87. [CrossRef]
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418 2019.
- Correia, G.M.; Niculae, V.; Martins, A.F. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015 2019.
- Mishra, A.; Latorre, J.A.; Pool, J.; Stosic, D.; Stosic, D.; Venkatesh, G.; Yu, C.; Micikevicius, P. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378 2021.
- Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 2016.
- Zhang, Y.; Zhao, L.; Cao, S.; Wang, W.; Cao, T.; Yang, F.; Yang, M.; Zhang, S.; Xu, N. Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models. arXiv preprint arXiv:2305.12356 2023.
- Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
- Anonymous. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity. Submitted to The Twelfth International Conference on Learning Representations, 2023; under review.
- Frantar, E.; Alistarh, D. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems 2022, 35, 4475–4488.
- Goyal, S.; Choudhury, A.R.; Raje, S.; Chakaravarthy, V.; Sabharwal, Y.; Verma, A. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 3690–3699.
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International conference on machine learning. PMLR, 2020, pp. 5156–5165.
- Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. Advances in neural information processing systems 2016, 29.
- Team, T.A.B. RayLLM. https://github.com/ray-project/ray-llm.
- Li, R.; Murray, G.; Carenini, G. Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023.
- Kim, Y.J.; Awan, A.A.; Muzio, A.; Salinas, A.F.C.; Lu, L.; Hendy, A.; Rajbhandari, S.; He, Y.; Awadalla, H.H. Scalable and efficient moe training for multitask multilingual models. arXiv preprint arXiv:2109.10465 2021.
- Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10734–10742.
- Chai, Y.; Gkountouras, J.; Ko, G.G.; Brooks, D.; Wei, G.Y. INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation. arXiv preprint arXiv:2306.08162 2023.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024.
- Narang, S.; Undersander, E.; Diamos, G. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782 2017.
- Fan, A.; Grave, E.; Joulin, A. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556 2019.
- Kim, S.; Gholami, A.; Yao, Z.; Mahoney, M.W.; Keutzer, K. I-bert: Integer-only bert quantization. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 5506–5518.
- Scao, T.L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilic, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. CoRR 2022, abs/2211.05100.
- Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision; Chapman and Hall/CRC, 2022; pp. 291–326.
- Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems 2020, 33, 5776–5788.
- Li, Y.; Niu, L.; Zhang, X.; Liu, K.; Zhu, J.; Kang, Z. E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity. arXiv preprint arXiv:2310.15929 2023.
- Zhao, C.; Hua, T.; Shen, Y.; Lou, Q.; Jin, H. Automatic mixed-precision quantization search of BERT. arXiv preprint arXiv:2112.14938 2021.
- Wu, M.; Waheed, A.; Zhang, C.; Abdul-Mageed, M.; Aji, A.F. Lamini-lm: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402 2023.
- Williams, M.; Aletras, N. How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models? arXiv preprint arXiv:2311.09755 2023.
- Li, Y.; Yu, Y.; Liang, C.; He, P.; Karampatziakis, N.; Chen, W.; Zhao, T. Loftq: Lora-fine-tuning-aware quantization for large language models. arXiv preprint arXiv:2310.08659 2023.
- Jiang, Y.; Chan, C.; Chen, M.; Wang, W. Lion: Adversarial distillation of closed-source large language model. arXiv preprint arXiv:2305.12870 2023.
- Radosavovic, I.; Kosaraju, R.P.; Girshick, R.B.; He, K.; Dollár, P. Designing Network Design Spaces. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR, 2020, pp. 10425–10433.
- Shao, H.; Liu, B.; Qian, Y. One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models. arXiv preprint arXiv:2310.09499 2023.
- Guo, S.; Xu, J.; Zhang, L.L.; Yang, M. Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models. arXiv preprint arXiv:2310.05015 2023.
- Zuo, S.; Zhang, Q.; Liang, C.; He, P.; Zhao, T.; Chen, W. Moebert: from bert to mixture-of-experts via importance-guided adaptation. arXiv preprint arXiv:2204.07675 2022.
- Kurtic, E.; Campos, D.; Nguyen, T.; Frantar, E.; Kurtz, M.; Fineran, B.; Goin, M.; Alistarh, D. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259 2022.
- Lin, Z.; Qu, G.; Chen, Q.; Chen, X.; Chen, Z.; Huang, K. Pushing large language models to the 6g edge: Vision, challenges, and opportunities. arXiv preprint arXiv:2309.16739 2023.
- Ren, S.; Zhu, K.Q. Low-Rank Prune-And-Factorize for Language Model Compression. arXiv preprint arXiv:2306.14152 2023.
- Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355 2019.
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 2018.
- Jaiswal, A.K.; Liu, S.; Chen, T.; Ding, Y.; Wang, Z. Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. In Proceedings of the International Conference on Machine Learning. PMLR, 2023, pp. 14691–14701.
- Liu, K.; Wu, T.; Liu, C.; Guo, G. Dynamic group transformer: A general vision transformer backbone with dynamic group attention. arXiv preprint arXiv:2203.03937 2022.
- Team, T.W.J. Saxml. https://github.com/google/saxml.
- Agarwal, R.; Vieillard, N.; Zhou, Y.; Stanczyk, P.; Ramos, S.; Geist, M.; Bachem, O. Generalized knowledge distillation for auto-regressive language models. arXiv preprint arXiv:2306.13649 2023.
- Xu, Z.; Liu, Z.; Chen, B.; Tang, Y.; Wang, J.; Zhou, K.; Hu, X.; Shrivastava, A. Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt. arXiv preprint arXiv:2305.11186 2023.
- Yang, Z.; Cui, Y.; Yao, X.; Wang, S. Gradient-based Intra-attention Pruning on Pre-trained Language Models. arXiv preprint arXiv:2212.07634 2022.
- Wang, Y.; Agarwal, S.; Mukherjee, S.; Liu, X.; Gao, J.; Awadallah, A.H.; Gao, J. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. arXiv preprint arXiv:2210.17451 2022.
- Anonymous. Pushing Gradient towards Zero: A Novel Pruning Method for Large Language Models, 2024.
- Kim, M.; Lee, S.; Lee, J.; Hong, S.; Chang, D.S.; Sung, W.; Choi, J. Token-Scaled Logit Distillation for Ternary Weight Generative Language Models. arXiv preprint arXiv:2308.06744 2023.
- Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339 2022.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 2021.
- Park, G.; Park, B.; Kwon, S.J.; Kim, B.; Lee, Y.; Lee, D. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557 2022.
- Shao, W.; Chen, M.; Zhang, Z.; Xu, P.; Zhao, L.; Li, Z.; Zhang, K.; Gao, P.; Qiao, Y.; Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137 2023.
- Csordás, R.; Irie, K.; Schmidhuber, J. Approximating Two-Layer Feedforward Networks for Efficient Transformers. ArXiv 2023, abs/2310.10837.
- Xue, F.; Zheng, Z.; Fu, Y.; Ni, J.; Zheng, Z.; Zhou, W.; You, Y. OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, 2024, [arXiv:cs.CL/2402.01739].
- Sahu, G.; Vechtomova, O.; Bahdanau, D.; Laradji, I.H. Promptmix: A class boundary augmentation method for large language model distillation. arXiv preprint arXiv:2310.14192 2023.
- Lee, C.; Jin, J.; Kim, T.; Kim, H.; Park, E. OWQ: Lessons learned from activation outliers for weight quantization in large language models. arXiv preprint arXiv:2306.02272 2023.
- Yvinec, E.; Dapogny, A.; Cord, M.; Bailly, K. REx: Data-Free Residual Quantization Error Expansion. arXiv preprint arXiv:2203.14645 2022.
- Park, M.; You, J.; Nagel, M.; Chang, S. Quadapter: Adapter for GPT-2 Quantization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 2510–2517.
- Gu, Y.; Dong, L.; Wei, F.; Huang, M. Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2306.08543 2023.
- Zhang, Z.; Gu, Y.; Han, X.; Chen, S.; Xiao, C.; Sun, Z.; Yao, Y.; Qi, F.; Guan, J.; Ke, P.; et al. Cpm-2: Large-scale cost-effective pre-trained language models. AI Open 2021, 2, 216–224. [CrossRef]
- Zhang, Q.; Zuo, S.; Liang, C.; Bukharin, A.; He, P.; Chen, W.; Zhao, T. Platon: Pruning large transformer models with upper confidence bound of weight importance. In Proceedings of the International Conference on Machine Learning. PMLR, 2022, pp. 26809–26823.
- An, Y.; Zhao, X.; Yu, T.; Tang, M.; Wang, J. Fluctuation-based Adaptive Structured Pruning for Large Language Models. arXiv preprint arXiv:2312.11983 2023. [CrossRef]
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Xia, M.; Gao, T.; Zeng, Z.; Chen, D. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. arXiv preprint arXiv:2310.06694 2023.
- Yang, N.; Jang, Y.; Lee, H.; Jeong, S.; Jung, K. Task-specific Compression for Multi-task Language Models using Attribution-based Pruning. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 582–592.
- Zhu, X.; Qi, B.; Zhang, K.; Long, X.; Zhou, B. PaD: Program-aided Distillation Specializes Large Models in Reasoning. arXiv preprint arXiv:2305.13888 2023.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 2021.
- Ye, D.; Lin, Y.; Huang, Y.; Sun, M. TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics, 2021.
- Valipour, M.; Rezagholizadeh, M.; Kobyzev, I.; Ghodsi, A. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558 2022.
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 2017.
- Xiong, Y.; Zeng, Z.; Chakraborty, R.; Tan, M.; Fung, G.; Li, Y.; Singh, V. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021, Vol. 35, pp. 14138–14148.
- Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems 2021, 34, 24261–24272.
- Khetan, A.; Karnin, Z. schuBERT: Optimizing elements of BERT. arXiv preprint arXiv:2005.06628 2020.
- Yang, A.; Lin, J.; Men, R.; Zhou, C.; Jiang, L.; Jia, X.; Wang, A.; Zhang, J.; Wang, J.; Li, Y.; et al. M6-t: Exploring sparse expert models and beyond. arXiv preprint arXiv:2105.15082 2021.
- Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
- Hou, L.; Huang, Z.; Shang, L.; Jiang, X.; Chen, X.; Liu, Q. Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems 2020, 33, 9782–9793.
- Riquelme, C.; Puigcerver, J.; Mustafa, B.; Neumann, M.; Jenatton, R.; Susano Pinto, A.; Keysers, D.; Houlsby, N. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems 2021, 34, 8583–8595.
- McCarley, J.; Chakravarti, R.; Sil, A. Structured pruning of a bert-based question answering model. arXiv preprint arXiv:1910.06360 2019.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 2020.
- Pan, H.; Wang, C.; Qiu, M.; Zhang, Y.; Li, Y.; Huang, J. Meta-KD: A meta knowledge distillation framework for language model compression across domains. arXiv preprint arXiv:2012.01266 2020.
- Aminabadi, R.Y.; Rajbhandari, S.; Awan, A.A.; Li, C.; Li, D.; Zheng, E.; Ruwase, O.; Smith, S.; Zhang, M.; Rasley, J.; et al. DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15.
- Diao, S.; Xu, T.; Xu, R.; Wang, J.; Zhang, T. Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models’ Memories. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023.
- Lingle, L.D. Transformer-VQ: Linear-Time Transformers via Vector Quantization. arXiv preprint arXiv:2309.16354 2023.
- Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.M.; Le, Q.V.; Laudon, J.; et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 2022, 35, 7103–7114.
- Fu, Y.; Peng, H.; Ou, L.; Sabharwal, A.; Khot, T. Specializing Smaller Language Models towards Multi-Step Reasoning. arXiv preprint arXiv:2301.12726 2023.
- Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886 2020.
- Boža, V. Fast and Optimal Weight Update for Pruned Large Language Models, 2024, [arXiv:cs.CL/2401.02938].
- Zaken, E.B.; Ravfogel, S.; Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199 2021.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 2020, 21, 5485–5551.
- Baines, M.; Bhosale, S.; Caggiano, V.; Goyal, N.; Goyal, S.; Ott, M.; Lefaudeux, B.; Liptchinsky, V.; Rabbat, M.; Sheiffer, S.; et al. Fairscale: A general purpose modular pytorch library for high performance and large scale training, 2021. https://github.com/facebookresearch/fairscale.
- Rasley, J.; Rajbhandari, S.; Ruwase, O.; He, Y. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
- Gholami, S.; Omar, M. Can pruning make Large Language Models more efficient? arXiv preprint arXiv:2310.04573 2023.
- Yao, Z.; Wu, X.; Li, C.; Youn, S.; He, Y. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, 2023, [arXiv:cs.LG/2303.08302]. [CrossRef]
- Wang, Z.; Huang, S.; Liu, Y.; Wang, J.; Song, M.; Zhang, Z.; Huang, H.; Wei, F.; Deng, W.; Sun, F.; et al. Democratizing reasoning ability: Tailored learning from large language model. arXiv preprint arXiv:2310.13332 2023.
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research 2022, 23, 5232–5270.
- Xie, Y.; Huang, S.; Chen, T.; Wei, F. MoEC: Mixture of Expert Clusters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023, Vol. 37, pp. 13807–13815.
- Wan, A.; Dai, X.; Zhang, P.; He, Z.; Tian, Y.; Xie, S.; Wu, B.; Yu, M.; Xu, T.; Chen, K.; et al. Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12965–12974.
- Hassibi, B.; Stork, D.G.; Wolff, G.J. Optimal brain surgeon and general network pruning. In Proceedings of the IEEE international conference on neural networks. IEEE, 1993, pp. 293–299.
- Tay, Y.; Bahri, D.; Yang, L.; Metzler, D.; Juan, D.C. Sparse sinkhorn attention. In Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 9438–9447.
- Wang, J.; Chen, K.; Chen, G.; Shou, L.; McAuley, J. SkipBERT: Efficient Inference with Shallow Layer Skipping. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022.
- Peng, B.; Li, C.; He, P.; Galley, M.; Gao, J. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277 2023.
- Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics 2021, 9, 53–68. [CrossRef]
- Pfeiffer, J.; Kamath, A.; Rücklé, A.; Cho, K.; Gurevych, I. AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247 2020.
- Yang, K.; Liu, Z.; Cheng, P. MOSEC: Model Serving made Efficient in the Cloud, 2021.
- Zhu, M.; Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 2017.
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent Abilities of Large Language Models. Trans. Mach. Learn. Res. 2022, 2022.
- Zuo, S.; Liu, X.; Jiao, J.; Kim, Y.J.; Hassan, H.; Zhang, R.; Zhao, T.; Gao, J. Taming sparsely activated transformer with stochastic experts. arXiv preprint arXiv:2110.04260 2021.
- Zhou, A.; Ma, Y.; Zhu, J.; Liu, J.; Zhang, Z.; Yuan, K.; Sun, W.; Li, H. Learning N:M fine-grained structured sparse neural networks from scratch. arXiv preprint arXiv:2102.04010 2021.
- Li, S.; Liu, H.; Bian, Z.; Fang, J.; Huang, H.; Liu, Y.; Wang, B.; You, Y. Colossal-AI: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
- Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? Advances in neural information processing systems 2019, 32.
- Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 2019.
- Pytorch. Pytorch JIT. https://github.com/pytorch/torchdynamo.
- Louizos, C.; Welling, M.; Kingma, D.P. Learning sparse neural networks through L_0 regularization. arXiv preprint arXiv:1712.01312 2017.
- Liang, C.; Jiang, H.; Li, Z.; Tang, X.; Yin, B.; Zhao, T. Homodistil: Homotopic task-agnostic distillation of pre-trained transformers. arXiv preprint arXiv:2302.09632 2023.
- Frantar, E.; Alistarh, D. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot 2023.
- Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; Poon, H. Universalner: Targeted distillation from large language models for open named entity recognition. arXiv preprint arXiv:2308.03279 2023.
- Xia, M.; Zhong, Z.; Chen, D. Structured pruning learns compact and accurate models. arXiv preprint arXiv:2204.00408 2022.
- Team, T.M.M. composer. https://github.com/mosaicml/composer/, 2021.
- Turc, I.; Chang, M.W.; Lee, K.; Toutanova, K. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962 2019.
- Rahman, M.W.U.; Abrar, M.M.; Copening, H.G.; Hariri, S.; Shao, S.; Satam, P.; Salehi, S. Quantized Transformer Language Model Implementations on Edge Devices. arXiv preprint arXiv:2310.03971 2023.
- Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Deng, H.; Ju, Q. FastBERT: a Self-distilling BERT with Adaptive Inference Time. ArXiv 2020, abs/2004.02178.
- Jiang, X.; Wang, H.; Chen, Y.; Wu, Z.; Wang, L.; Zou, B.; Yang, Y.; Cui, Z.; Cai, Y.; Yu, T.; et al. Mnn: A universal and efficient inference engine. Proceedings of Machine Learning and Systems 2020, 2, 1–13.
- Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; Lin, J. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136 2019.
- Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 2020.
- Kim, J.; Lee, J.H.; Kim, S.; Park, J.; Yoo, K.M.; Kwon, S.J.; Lee, D. Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization. arXiv preprint arXiv:2305.14152 2023.
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314 2023.
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A Survey of Transformers. CoRR 2021, abs/2106.04554.
- Ma, X.; Fang, G.; Wang, X. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv preprint arXiv:2305.11627 2023.
- team, M. MLC-LLM, 2023.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).