Submitted:
18 July 2024
Posted:
19 July 2024
Abstract
Keywords:
1. Introduction
- We demonstrated that a large fraction of the weights in the Llama-2 7B model (a sparsity of 0.6 in the examples of Appendix A) could be pruned without a notable decrease in language modeling capability, as evidenced by WikiText perplexity.
- We examined the user experience of daily dialogue and query handling at this sparsity limit to assess the feasibility of on-device AI assistants. The pruned model generated fluent and helpful answers but with factual inaccuracies, raising intriguing theoretical questions about the nature of pruning or sparsification. We hypothesized that these inaccuracies might result from forgetting or hallucination, proposed a simple protocol to distinguish between the two mechanisms, and discussed future directions to improve the pruned models.
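To make the notion of sparsity concrete, the following is a minimal sketch of unstructured magnitude pruning on a toy weight list. It is an illustration only: the pruning criterion (smallest absolute value first), the function names `magnitude_prune` and `measured_sparsity`, and the toy weights are assumptions made here, and this is not necessarily the pruning method used in this work.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries so that about `sparsity` of them are zero.

    Assumption: plain magnitude pruning, chosen only for illustration.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical toy weights, not taken from any real model.
toy_weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.15, 0.6, 0.02, -0.4]
pruned = magnitude_prune(toy_weights, sparsity=0.6)
print(pruned)
print(measured_sparsity(pruned))
```

In a real model the same idea would be applied per weight matrix (or per layer) rather than to a single flat list, and the choice of criterion can matter considerably at high sparsity.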
2. Methods
3. Results
3.1. The sparsity limit to prune LLMs, evaluated by wikitext perplexity
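Perplexity is the exponential of the mean negative log-likelihood per token, so a lower value means the model assigns higher probability to the held-out text. The sketch below computes it from per-token log-probabilities. The probabilities are made-up toy values used only to show the arithmetic; they are not results from this work, and the function name `perplexity` is an assumption.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities that two models assign to the same four reference tokens.
dense_probs = [0.5, 0.25, 0.5, 0.25]
sparse_probs = [0.25, 0.25, 0.25, 0.25]

ppl_dense = perplexity([math.log(p) for p in dense_probs])
ppl_sparse = perplexity([math.log(p) for p in sparse_probs])
print(ppl_dense, ppl_sparse)
```

Comparing the perplexity of a pruned model with that of the full model on the same text, across increasing sparsity levels, is one way to locate the point at which language modeling capability starts to degrade.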
3.2. Pruned sparse LLM as AI assistants: user experience examination
- “Tell me about Boston”;
- “Describe the Python programming language, in terms of its syntax, history, user experience, and popularity”.
4. Discussion
Appendix A


| Prompt: | Tell me about Boston. |
| 0.6-sparsity model: | Boston is a city in the United States of America. Boston is the capital of the state of Massachusetts. The city of Boston was founded in 1630 by a group of settlers who came from Plymouth in England. The city has a population of about 650,00 people and a total area of 47 square kilometers. It is located on the east coast of North America and is surrounded by the Atlantic Ocean. In the city there are a lot of museums, such as the Museum of Fine Arts and the Boston Museum. There is also a university in Boston, which is called the Massachusetts Institute of Technology (MIT). There are many places to visit in this city. For example, you can visit the John F. Kennedy (JFK) Memorial. |
| Full model: | I’ve never been to Boston, but I hear it’s a great place to visit. The people are friendly, the food is delicious, and the history is fascinating. Boston is also home to some of the best universities in the country, so if you’re looking for a top-notch education, this is the place for you. Boston is a city with a rich history and a vibrant culture. There are plenty of things to see and do in this city, from exploring the Freedom Trail to taking in a Red Sox game at Fenway Park. No matter what your interests are, you’re sure to find something to enjoy in Boston. So what are you waiting for? Come on down and experience all that this great city has to offer! |
| Prompt: | Describe the Python programming language, in terms of its syntax, history, user experience, and popularity. |
| 0.6-sparsity model: | Python is a general-purpose scripting language. It was created in 1990 by Guido van Rosenberg. Python is one of the most popular programming languages in the world. Its syntax is similar to that of C++ and Java, but it is much easier to learn and use than those languages. The syntax of Python can be learned in a matter of minutes, even by those who have never programmed in any language before. This makes Python a great language for beginners, as well as for experienced programmers who are looking for a language that is easy to use and has a lot of flexibility. There are many websites that are dedicated to Python, such as Python.org and Python Wiki. These websites are a good source of information about Python and how to program in it. They also have a forum where you can ask questions. |
| Full model: | Python is a high-level, general-purpose, interpreted, interactive, object-oriented scripting language. It was created by Guido van Rossum and first released in 1991. Python’s design philosophy emphasizes code readability with its notable use of significant whitespace. Its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. |






© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).