Submitted: 27 March 2025
Posted: 27 March 2025
Abstract
As deep learning models are deployed across an ever-wider range of domains, a central challenge is compressing them while preserving their reasoning capability. Knowledge distillation (KD), an effective model compression technique, has been widely used to boost the performance of lightweight models, but traditional distillation methods fall short on complex reasoning tasks. Reinforcement learning (RL) offers a novel approach to distillation: by optimizing the teacher model's reasoning strategies, it can generate more efficient decision paths and thereby provide more valuable learning signals for the student model. This paper reviews the latest advances in combining RL with KD, focusing on policy distillation, value function distillation, and dynamic reward-guided distillation. It then discusses the challenges facing RL-driven distillation, such as simplifying complex policies, handling temporal dependencies, and balancing exploration against exploitation, and suggests possible solutions. Finally, it explores applications of RL-driven distillation in fields such as game AI, robotic control, and dialogue systems, and outlines future research directions, including automated distillation, multimodal distillation, and distillation under federated learning.
Keywords: knowledge distillation; reinforcement learning; model compression; policy distillation
1. Introduction
This survey examines how reinforcement learning can drive knowledge distillation, covering:
- The main techniques and approaches used in RL-driven KD.
- The challenges and limitations of RL-driven KD.
- The applications of RL-driven KD in various domains.
- Future research directions and potential breakthroughs.
2. Background and Problem Definition
3. RL-Driven Knowledge Distillation Techniques
A. Policy Distillation
B. Value Function Distillation
C. Dynamic Reward-Guided Distillation
4. Challenges and Solutions
A. Capacity Mismatch
B. Temporal Dependency
C. Reward Design
5. Applications
A. Large Language Model Compression
B. Autonomous Driving
C. DeepSeek
6. Conclusions
Table 1. Comparison of RL-driven knowledge distillation techniques.

| Technique | Core Mechanism | Advantages | Limitations | Applicable Scenarios |
|---|---|---|---|---|
| Policy Distillation | Optimizes the teacher's policy network via RL and extracts a lightweight policy for the student to imitate (see the code sketch after this table). | Preserves complex decision logic; suitable for sequential tasks. | Low training efficiency with large policy spaces; relies on teacher quality. | Game AI, robotic path planning |
| Value Function Distillation | Transfers state-action evaluations from the teacher’s value function to guide student optimization for long-term rewards. | Reduces exploration costs; improves stability. | Poor adaptability to dynamic environments; requires precise value estimation. | Autonomous driving, resource scheduling |
| Dynamic Reward-Guided Distillation | Designs dynamic reward functions to adjust distillation based on environmental feedback, balancing imitation and exploration. | Adapts to complex tasks; avoids overfitting. | Complex reward design; high training convergence difficulty. | Dialogue systems, multimodal interaction |
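To make the first two mechanisms in Table 1 concrete, the sketch below shows how their objectives are commonly written in PyTorch: a temperature-softened KL divergence that pushes the student's action distribution toward the teacher's (policy distillation), and a mean-squared error that transfers the teacher's state-action value estimates (value function distillation). This is a minimal illustration; the function names, temperature, and tensor shapes are assumptions for the example, not details taken from the surveyed methods.

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    action distributions -- the core objective of policy distillation."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # 'batchmean' matches the mathematical definition of KL divergence;
    # scaling by t^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def value_distillation_loss(student_q, teacher_q):
    """Mean-squared error pulling the student's Q-value estimates toward
    the teacher's state-action evaluations."""
    return F.mse_loss(student_q, teacher_q)

# Illustrative usage: a batch of 32 states with 4 discrete actions.
student_logits = torch.randn(32, 4, requires_grad=True)
teacher_logits = torch.randn(32, 4)  # teacher outputs are treated as fixed targets
loss = policy_distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

In practice these losses are typically combined with a task loss using tunable weights; the right mix depends on how much the student should trust the teacher versus its own experience.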
Table 2. Key challenges in RL-driven distillation and candidate solutions.

| Challenge | Specific Issues | Solutions |
|---|---|---|
| Simplifying Complex Policies | Overly intricate teacher strategies hinder student model compression. | Introduce hierarchical RL (HRL) to decompose policies into subtasks. |
| Handling Temporal Dependencies | Capturing long-term decision dependencies (e.g., dialogue context). | Integrate attention mechanisms or Transformer architectures. |
| Balancing Exploration & Exploitation | Students over-rely on teachers, limiting autonomous exploration. | Design hybrid rewards combining imitation (teacher) and environmental feedback (see the sketch after this table). |
| Heterogeneous Model Compatibility | Structural mismatches (e.g., CNN→Transformer) impede knowledge transfer. | Use adapter layers or feature mapping networks to align representation spaces. |
| Training Efficiency & Stability | High complexity and slow convergence from combining RL and KD. | Apply offline RL pretraining with curriculum learning. |
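The hybrid reward from the exploration-versus-exploitation row can be sketched compactly. Below is a minimal illustration that blends an imitation bonus, here taken to be the teacher policy's log-probability of the student's chosen action, with the environment's own reward; the linear annealing schedule and all names are assumptions for the example, not a prescription from the literature.

```python
def hybrid_reward(env_reward: float,
                  teacher_logprob_of_action: float,
                  step: int,
                  total_steps: int) -> float:
    """Blend environmental feedback with an imitation bonus.

    The imitation weight alpha anneals linearly from 1 to 0, so the student
    leans on the teacher early in training (exploiting distilled knowledge)
    and shifts toward the environment's reward (exploration) later.
    """
    alpha = max(0.0, 1.0 - step / total_steps)  # assumed linear schedule
    return alpha * teacher_logprob_of_action + (1.0 - alpha) * env_reward

# Early in training the blended reward tracks the imitation term...
print(hybrid_reward(env_reward=1.0, teacher_logprob_of_action=-0.2,
                    step=100, total_steps=10_000))
# ...late in training it tracks the environmental reward.
print(hybrid_reward(env_reward=1.0, teacher_logprob_of_action=-0.2,
                    step=9_900, total_steps=10_000))
```

Nonlinear schedules (e.g., exponential decay) or schedules driven by the student's measured performance are equally valid; the key design choice is that the imitation term must fade so the student is eventually graded by the environment rather than by the teacher.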
Table 3. Representative applications of RL-driven knowledge distillation.

| Domain | Case Study | Technique | Outcome | Future Directions |
|---|---|---|---|---|
| Game AI | Lightweight deployment of AlphaGo-style models | Policy Distillation + Monte Carlo Tree Search (MCTS) | 90% fewer parameters; 5x faster inference. | Automated distillation frameworks, multi-agent collaboration. |
| Robotic Control | Dynamic grasping for robotic arms | Dynamic Reward-Guided Distillation + Imitation Learning | 20% higher success rate; adapts to unseen objects. | Sim-to-real transfer learning. |
| Dialogue Systems | Personalized dialogue model compression | Value Function Distillation + RL dialogue policies | 75% lower memory usage; preserves response quality. | Multimodal distillation (text + speech + vision). |
| Healthcare | Lightweight medical imaging diagnosis models | Hierarchical Policy Distillation + Uncertainty-aware rewards | 10x smaller model; 95% diagnostic accuracy. | Privacy-preserving distillation in federated learning. |
| Autonomous Driving | Real-time edge-side path planning | Value Function Distillation + Safety-constrained RL | <50ms planning delay; 40% lower accident rate. | Vehicle-road collaborative distillation. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).