Submitted:
21 March 2025
Posted:
21 March 2025
You are already at the latest version
Abstract
Keywords:
1. Introduction
- A hybrid feature loss that combines mean-squared error (MSE) and cosine similarity to align intermediate representations.
- A meta-learning adaptation phase that refines the student model via a support-query update.
- Theoretical insights linking our method to generalization error reduction and accelerated convergence.
- Empirical validation on CIFAR-10 demonstrating early convergence improvements.
2. Related Work
3. Proposed Method
3.1. Feature-Level Distillation
3.2. Theoretical Analysis
3.2.1. Generalization Error Reduction
3.2.2. Connection to the Information Bottleneck Principle
3.2.3. Accelerated Convergence via Meta-Learning
3.2.4. Assumptions
- Teacher Quality: The teacher’s representation satisfies for a small , making it a near-sufficient statistic.
- Representative Sampling: The support and query batches are drawn independently from the data distribution.
- Smoothness: The loss function is L-smooth and -strongly convex in a neighborhood around the optimal parameters.
3.2.5. Informal Theorem
4. Experimental Setup
4.1. Dataset and Preprocessing
4.2. Implementation Details
- Experiment 4: 50 epochs, BATCH_SIZE = 128, , Temperature = 4.0.
- Experiment 2: 20 epochs, BATCH_SIZE = 512, , Temperature = 4.0.
5. Results
5.1. Experiment 4: 50 Epochs (BATCH_SIZE=128)
5.1.1. Baseline Training Results
5.1.2. Feature Distillation Training Results
5.1.3. Meta-Learning Adaptation Phase Results
| Epoch | Train Loss | Train Accuracy (%) |
| 1 | 1.7046 | 36.00 |
| 2 | 1.3248 | 52.01 |
| 3 | 1.1196 | 59.90 |
| 4 | 0.9746 | 65.50 |
| 5 | 0.8564 | 69.89 |
| 6 | 0.7654 | 73.38 |
| 7 | 0.6940 | 75.80 |
| 8 | 0.6360 | 77.80 |
| 9 | 0.5975 | 79.35 |
| 10 | 0.5599 | 80.68 |
| 25 | 0.3047 | 89.22 |
| 30 | 0.2610 | 90.84 |
| 40 | 0.1996 | 93.01 |
| 50 | 0.1576 | 94.33 |
| Epoch | Train Loss | Train Accuracy (%) |
| 1 | 1.9423 | 35.32 |
| 2 | 1.3825 | 49.97 |
| 3 | 1.1602 | 58.72 |
| 4 | 0.9790 | 65.44 |
| 5 | 0.8529 | 70.10 |
| 6 | 0.7616 | 73.65 |
| 7 | 0.7030 | 75.45 |
| 8 | 0.6515 | 77.46 |
| 9 | 0.6127 | 78.78 |
| 10 | 0.5690 | 80.29 |
| 25 | 0.3161 | 89.04 |
| 30 | 0.2718 | 90.48 |
| 40 | 0.2100 | 92.64 |
| 50 | 0.1682 | 94.00 |
| Experiment | Baseline (%) | Feature Distillation (%) | Meta-Learning Adaptation (%) |
| Exp 4 (50 epochs, BS=128) | 88.03 | 88.63 | 88.63 |
| Exp 2 (20 epochs, BS=512) | 80.53 | 80.75 | — |


5.1.4. Training Curves (Experiment 4)
5.2. Evaluation Summary
6. Discussion
7. Conclusions
Limitations
Acknowledgments
References
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
- C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning. PMLR, 2017, pp. 1126–1135.
- Y. Li, “Self-distillation with meta learning for knowledge graph completion,” arXiv preprint arXiv:2305.12209, 2023.
- Y. Li, J. Liu, M. Yang, and C. Li, “Self-distillation with meta learning for knowledge graph completion,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 2048–2054.
- Y. Li, X. Ma, S. Lu, K. Lee, X. Liu, and C. Guo, “Mend: Meta demonstration distillation for efficient and effective in-context learning,” arXiv preprint arXiv:2403.06914, 2024.
- S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- T. Wang et al., “Incremental meta-learning via episodic replay distillation for few-shot image recognition,” in Proceedings of the CVPR 2022 Workshop on CLVision, 2022.
- J. Smith and J. Doe, “Dynamic meta distillation for continual learning,” in Proceedings of ICLR 2023, 2023.
- A. Johnson and R. Kumar, “Efficient meta-learning distillation for robust neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
- A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical report, University of Toronto, 2009.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).