Version 1: Received: 23 October 2023 / Approved: 24 October 2023 / Online: 24 October 2023 (07:51:13 CEST)
Version 2: Received: 24 May 2024 / Approved: 27 May 2024 / Online: 27 May 2024 (08:34:18 CEST)
How to cite:
Zhang, Y.; Bai, H.; Lin, H.; Zhao, J.; Hou, L.; Cannistraci, C. V. An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models. Preprints 2023, 2023101487. https://doi.org/10.20944/preprints202310.1487.v1
APA Style
Zhang, Y., Bai, H., Lin, H., Zhao, J., Hou, L., & Cannistraci, C. V. (2023). An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models. Preprints. https://doi.org/10.20944/preprints202310.1487.v1
Chicago/Turabian Style
Zhang, Y., H. Bai, H. Lin, J. Zhao, Lu Hou, and Carlo Vittorio Cannistraci. 2023. "An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models." Preprints. https://doi.org/10.20944/preprints202310.1487.v1
Abstract
The rapid growth of large language models (LLMs) has brought an increasing demand for memory and computation. Recent efforts on post-training pruning of LLMs aim to reduce model size and computation, yet their performance remains sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs. The proposed solution has two innovative components: 1) **Relative Importance and Activations** (RIA), a new pruning metric that jointly and efficiently considers weights and activations in LLMs; and 2) **Channel Permutation**, a new approach to maximally preserve important weights under N:M sparsity. The two components can be readily combined to further enhance N:M structured pruning of LLMs. Our empirical experiments show that RIA alone already surpasses all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M structured pruning with channel permutation can even outperform the original LLaMA2 70B on zero-shot tasks, while delivering practical speed-ups on specific hardware.
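To make the two components concrete, the sketch below shows one plausible way to compute an RIA-style score (relative weight importance within its row and column, scaled by the activation norm of the input channel) and to apply an N:M (here 2:4) mask per output row. The exact scoring formula, the exponent on the activation norm, and the helper names `ria_score` and `nm_mask` are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def ria_score(W, X, a=0.5):
    """Toy RIA-style score (assumed form): a weight's magnitude relative to
    its row and column, scaled by the l2 norm of its input-channel activation."""
    absW = np.abs(W)                                   # (out, in)
    row_sum = absW.sum(axis=1, keepdims=True) + 1e-12  # per-output-row l1 norm
    col_sum = absW.sum(axis=0, keepdims=True) + 1e-12  # per-input-column l1 norm
    act_norm = np.linalg.norm(X, axis=0) ** a          # per-input-channel ||x_j||_2^a
    return (absW / row_sum + absW / col_sum) * act_norm

def nm_mask(score, n=2, m=4):
    """Keep the n highest-scoring weights in every group of m consecutive
    input channels (N:M sparsity), independently for each output row."""
    out_dim, in_dim = score.shape
    groups = score.reshape(out_dim, in_dim // m, m)
    order = np.argsort(groups, axis=-1)                # ascending per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., m - n:], True, axis=-1)
    return mask.reshape(out_dim, in_dim)

# Minimal usage with random stand-ins for a linear layer's weight
# and a batch of calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))       # (out_features, in_features)
X = rng.normal(size=(32, 16))      # (calibration tokens, in_features)
mask = nm_mask(ria_score(W, X), n=2, m=4)
W_pruned = W * mask                # exactly 2 of every 4 weights kept per row
print(mask.sum(axis=1))            # each row keeps in_features // 2 weights
```

In this framing, channel permutation would reorder the input channels before the m-wide grouping so that high-score weights do not compete within the same group; that combinatorial step is omitted from the sketch.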
Keywords
post-training pruning; combinatorial optimization; large language models; inference acceleration
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.