Submitted: 05 January 2025
Posted: 07 January 2025
Abstract
Keywords:
1. Introduction
1.1. Scope and Objectives
- To review methods for compressing LLMs into SLMs, including state-of-the-art techniques such as quantization, pruning, knowledge distillation, and low-rank factorization. By examining these methods, we aim to provide a comprehensive understanding of how model size can be reduced without compromising functionality (a brief illustrative sketch of these core operations follows this list).
- To evaluate trade-offs between model size and performance, focusing on critical metrics such as accuracy, latency, memory usage, and energy efficiency. This evaluation includes an analysis of the contexts in which these trade-offs are most significant, such as edge computing, mobile devices, and domain-specific applications.
- To discuss the practical deployment of SLMs in real-world scenarios, including their application in under-resourced languages, low-power IoT environments, and industries with stringent computational constraints. This objective also encompasses an exploration of case studies where SLMs have successfully addressed specific challenges.
- To identify gaps and opportunities in current research, highlighting areas where further innovation is needed. This includes the development of automated tools for transitioning LLMs to SLMs and novel approaches to enhance the interpretability and fairness of smaller models.
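As a concrete illustration of the first objective, the following minimal PyTorch sketch applies three of the compression techniques named above to a toy model: post-training dynamic quantization, magnitude-based pruning, and a Hinton-style distillation loss. The model architecture, layer sizes, sparsity ratio, and temperature are illustrative assumptions only, not a prescription from any specific system reviewed in this survey.

```python
# Minimal sketch of three LLM-to-SLM compression techniques on a toy model.
# Assumptions: PyTorch is available; the "teacher"/"student" are stand-ins for
# real transformers, and all sizes/hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

# Toy "teacher" (large) and "student" (small) language-model stand-ins.
teacher = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 32000))
student = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 32000))

# 1) Quantization: post-training dynamic quantization converts Linear weights
#    to int8 for inference, shrinking memory and often speeding up CPU serving.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

# 2) Pruning: zero out the 30% smallest-magnitude weights in each Linear layer
#    (unstructured L1 pruning); structured variants remove whole rows or heads.
for module in student.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# 3) Knowledge distillation: train the student to match the teacher's softened
#    output distribution via a temperature-scaled KL-divergence loss.
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

x = torch.randn(4, 512)                # a batch of dummy input features
with torch.no_grad():
    teacher_logits = teacher(x)        # teacher is frozen during distillation
loss = distillation_loss(student(x), teacher_logits)
loss.backward()                        # gradients flow only to the student
print(f"distillation loss: {loss.item():.4f}")
```

In practice these steps are combined with calibration data, structured sparsity patterns, low-rank factorization or adapters, and task-specific fine-tuning; the sketch shows only the core operations each technique performs.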
2. Background
2.1. Early Language Models and Statistical Approaches
2.2. Neural Networks and Word Embeddings
2.3. The Rise of Transformer Models
2.4. The Emergence of Large Language Models (LLMs)
2.5. The Emergence of Small Language Models (SLMs)
2.6. Challenges and Trade-Offs in Model Selection
2.7. Current Landscape and Future Directions
3. Key Differences Between Large Language Models (LLMs) and Small Language Models (SLMs)
3.1. Model Size and Parameters
3.2. Computational Requirements
3.3. Training Data and Generalization
3.4. Performance and Accuracy
3.5. Energy Efficiency and Sustainability
3.6. Deployment Scenarios and Use Cases
3.7. Trade-Offs and Hybrid Approaches
3.8. Future Directions and Ongoing Research
4. Conclusion
References
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).