Submitted: 05 February 2024
Posted: 06 February 2024
Abstract
Keywords:
1. Introduction
1.1. Background
1.2. Static Embeddings
1.3. Dynamic Embeddings
1.4. Task Dependent Architectures
1.5. Task Agnostic Architecture
2. Language Models and Attention Mechanism
2.1. Language Models
2.2. Attention Layer
2.3. Multi-Head Attention
2.4. Attention-based RNN Models
2.5. Attention-based Transformer Models
3. Transformer
3.1. Encoder-Decoder based Model
3.2. Encoder-only based Model
3.3. Decoder-only (Causal) based Model
3.4. Prefix (Non-Causal) Language Model
3.5. Mask Types
4. Pretraining - Strategies & Objectives
4.1. Objectives
4.1.1. Left-To-Right (LTR) Language Model Objective
4.1.2. Prefix Language Model Objective
4.1.3. Masked Language Model Objective
4.1.4. General Language Model Objective
4.1.5. Span Corruption Objective
4.1.6. Deshuffle Objective
4.1.7. Next Sentence Prediction (NSP) Objective
4.2. Learning Strategies
4.2.1. Multi-Task Pretraining
4.2.2. Multilingual Pretraining
4.2.3. Mixture of Experts (MoE) based Pretraining
4.2.4. Knowledge Enhanced Pretraining
4.2.5. Mixture of Denoisers (MoD) based Pretraining
Extreme Denoising
Sequential Denoising
Regular Denoising
4.2.6. Prompt Pretraining
4.2.7. Information Retrieval based Pretraining
5. Transfer Learning Strategies
5.1. Fine Tuning
5.2. Adapter Tuning
5.3. Gradual Unfreezing
5.4. Prefix Tuning
5.5. Prompt Tuning
5.5.1. Prompt Engineering
5.5.2. Continuous Prompt Tuning
5.6. Multilingual Fine Tuning
5.7. Reinforcement Learning from Human Feedback (RLHF) Fine Tuning
- In the first step, the model is fine-tuned with supervision on a dataset of prompts paired with their desired output behavior.
- In the second step, a dataset of comparisons between model outputs is collected: for a given input, labelers indicate which output they prefer. This comparison data is then used to train a Reward Model to predict the human-preferred output.
- In the third step, the policy generates an output and the reward model assigns it a reward. This reward is then used to update the policy with the Proximal Policy Optimization (PPO) algorithm so that the reward is maximized.
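The second step above can be illustrated with a minimal sketch of a pairwise preference loss, one common way to train a reward model from comparison data. The function names and the toy reward scores are assumptions made for illustration; they are not taken from the source, and a real reward model would produce these scores from a neural network rather than from hard-coded numbers.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a score difference to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the preferred output outranks the rejected one.

    The loss is small when the reward model scores the labeler-preferred
    output well above the rejected output, and large when it does the opposite.
    """
    return -math.log(sigmoid(reward_preferred - reward_rejected))


def mean_preference_loss(comparisons):
    """Average the pairwise loss over (preferred, rejected) reward pairs."""
    losses = [pairwise_preference_loss(rp, rr) for rp, rr in comparisons]
    return sum(losses) / len(losses)


# Hypothetical reward-model scores for three labeled comparisons.
toy_comparisons = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(mean_preference_loss(toy_comparisons))
```

Minimizing this loss pushes the reward of the preferred output above that of the rejected one; the trained reward model then supplies the scalar reward used in the PPO step.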
5.8. Instruction Tuning
5.9. Code based Fine Tuning
6. In-Context Learning
6.1. Few-Shot learning
6.2. One-Shot learning
6.3. Zero-shot learning
6.4. Chain-of-Thought learning
7. Scalability
7.1. Model Width (Parameter Size)
7.2. Training Tokens & Data Size
7.3. Model Depth (Network Layers)
7.4. Architecture - Parallelism
7.4.1. Data parallelism
7.4.2. Tensor parallelism (op-level model parallelism)
7.4.3. Pipeline parallelism
7.5. Miscellaneous
7.5.1. Training Steps
7.5.2. Checkpoints
7.5.3. Ensembling
8. LLM Challenges
8.1. Biases
8.2. Toxic Content
8.3. Hallucination
- sample multiple outputs rather than one and check their consistency to determine which statements are factual and which are hallucinated
- validate the correctness of the model output against an external knowledge source
- check whether the generated named entities or <subject, relation, object> tuples appear in the ground-truth knowledge source
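The checks listed above can be sketched in a few lines. This is a toy illustration under strong assumptions: statements are assumed to be already extracted from each sampled output as <subject, relation, object> tuples, and the knowledge source is a plain in-memory set. All names and data are hypothetical.

```python
from collections import Counter


def consistency_scores(samples):
    """Return, for each tuple, the fraction of sampled outputs that contain it.

    Tuples that appear in only a few samples are candidates for hallucination.
    """
    counts = Counter()
    for statements in samples:
        counts.update(set(statements))
    return {statement: count / len(samples) for statement, count in counts.items()}


def unsupported_tuples(generated, knowledge_source):
    """Return the generated tuples that are absent from the knowledge source."""
    return [t for t in generated if t not in knowledge_source]


# Hypothetical tuples extracted from three sampled outputs for one prompt.
samples = [
    [("Paris", "capital_of", "France"), ("Paris", "located_in", "Asia")],
    [("Paris", "capital_of", "France")],
    [("Paris", "capital_of", "France"), ("Paris", "located_in", "Europe")],
]
# Hypothetical ground-truth knowledge source.
knowledge_source = {
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "Europe"),
}

scores = consistency_scores(samples)
flagged = unsupported_tuples(samples[0], knowledge_source)
print(scores)
print(flagged)
```

In practice the tuple extraction and the knowledge lookup are the hard parts; the sketch only shows how the two signals (agreement across samples and presence in a knowledge source) could be combined.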
8.4. Cost & Carbon Footprints
- Computational cost for pre-training: a very large model requires several weeks of pre-training on thousands of GPUs.
- Storage cost for fine-tuned models: a large language model usually takes hundreds of gigabytes (GBs) to store, and one copy must be stored for each downstream task.
- Equipment cost for inference: serving a large language model typically requires multiple GPUs.
- report energy consumed and CO2e explicitly;
- reward improvements in efficiency, as well as traditional metrics, at ML conferences; and
- include the training time and number of processors to help everyone understand the cost.
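The storage and reporting points above lend themselves to back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per parameter (16-bit weights) for storage, and estimates energy as processors × hours × average power × datacenter PUE. The helper names and every numeric input in the energy example are illustrative assumptions, not measurements from the source.

```python
def checkpoint_size_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Storage needed for one model copy, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def training_energy_kwh(num_processors: int, hours: float,
                        avg_power_kw: float, pue: float = 1.0) -> float:
    """Energy estimate: processors x hours x average power x datacenter PUE."""
    return num_processors * hours * avg_power_kw * pue


def co2e_kg(energy_kwh: float, kg_co2e_per_kwh: float) -> float:
    """Emissions estimate: energy multiplied by the grid's carbon intensity."""
    return energy_kwh * kg_co2e_per_kwh


# A 175B-parameter model stored in 16-bit precision (assumed).
one_copy_gb = checkpoint_size_gb(175e9)
# One fine-tuned copy per downstream task; 10 tasks is an assumed count.
ten_copies_gb = 10 * one_copy_gb
print(one_copy_gb, ten_copies_gb)  # 350.0 3500.0

# Purely illustrative inputs for reporting energy and CO2e.
energy = training_energy_kwh(num_processors=1000, hours=24, avg_power_kw=0.3, pue=1.1)
print(energy, co2e_kg(energy, kg_co2e_per_kwh=0.4))
```

The storage figure shows why full fine-tuning per task becomes expensive, which is part of the motivation for parameter-efficient methods such as adapter, prefix, and prompt tuning.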
8.5. Open source & Low Resource
9. Future Directions & Development Trends
9.1. Interpretability & Explainability
9.2. Fairness
9.3. Robustness & Adversarial Attacks
9.4. Multimodal LLMs
9.5. Energy Efficiency & Environmental Impact
9.6. Different Languages & Domains
9.7. Privacy-Preserving Models
9.8. Continual Learning & Adaptability
9.9. Ethical Use & Societal Impact
9.10. Real-world Applications & Human-LLM Collaboration
10. Conclusions
References
- Harris, Z. S. Distributional structure, Word, 10(2-3), pp. 146-162, 1954.
- Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., Mercer, R. L., and Roossin, P. S. A statistical approach to machine translation, Computational linguistics, 16(2), pp. 79-85, 1990.
- Salton, G., and Lesk, M. E. Computer evaluation of indexing and text processing, Journal of the ACM (JACM), 15(1), pp. 8-36, 1968.
- Jones, K. S. A statistical interpretation of term specificity and its application in retrieval, Journal of documentation, 1972.
- Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing, Communications of the ACM, 18(11), pp. 613-620, 1975.
- Tang, B., Shepherd, M., Milios, E., and Heywood, M. I. Comparing and combining dimension reduction techniques for efficient text clustering, In Proceedings of SIAM international workshop on feature selection for data mining, pp. 17-26, 2005.
- Hyvärinen, A., and Oja, E. Independent component analysis: algorithms and applications, Neural networks, 13(4-5), pp. 411-430, 2000.
- Vilnis, L., and McCallum, A. Word representations via gaussian embedding, arXiv preprint arXiv:1412.6623, 2014.
- Athiwaratkun, B., and Wilson, A. G. Multimodal word distributions, arXiv preprint arXiv:1704.08424, 2017.
- Le, Q., and Mikolov, T. Distributed representations of sentences and documents, In International conference on machine learning, pp. 1188-1196, PMLR, 2014.
- Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781, 2013.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems, 26, 2013.
- Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation, In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014.
- Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information, Transactions of the association for computational linguistics, 5, pp. 135-146, 2017.
- Melamud, O., Goldberger, J., & Dagan, I. context2vec: Learning generic context embedding with bidirectional lstm, In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp. 51-61, 2016.
- McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors, Advances in neural information processing systems, 30, 2017.
- Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep Contextualized Word Representations, In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1, pp. 2227-2237, 2018.
- Howard, J., & Ruder, S. Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146, 2018.
- Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., & Hon, H. W. Unified language model pre-training for natural language understanding and generation, Advances in Neural Information Processing Systems, 32, 2019.
- Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.
- Sutskever, I., Vinyals, O., & Le, Q. V. Sequence to sequence learning with neural networks, Advances in neural information processing systems, 27, 2014.
- Bahdanau, D., Cho, K., & Bengio, Y. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473, 2014.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I. Attention is all you need, Advances in neural information processing systems, 30, 2017.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. Improving language understanding by generative pre-training, URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research, 21(1), pp. 5485-5551, 2020.
- Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. Retrieval augmented language model pre-training, In International conference on machine learning, pp. 3929-3938, PMLR, 2020.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. Language models are unsupervised multitask learners, OpenAI blog, 1(8), 9, 2019.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. & Agarwal, S. Language models are few-shot learners, Advances in neural information processing systems, 33, pp. 1877-1901, 2020.
- Lieber, O., Sharir, O., Lenz, B., & Shoham, Y. Jurassic-1: Technical details and evaluation, White Paper. AI21 Labs, 1, 2021.
- Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z. & Tang, J., GPT understands, too, AI Open, 2023.
- Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A. and Raffel, C., mT5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934, 2020.
- Zeng, W., Ren, X., Su, T., Wang, H., Liao, Y., Wang, Z., ... & Tian, Y. Pangu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation, arXiv preprint arXiv:2104.12369, 2021.
- Zhang, Z., Gu, Y., Han, X., Chen, S., Xiao, C., Sun, Z., Yao, Y., Qi, F., Guan, J., Ke, P. and Cai, Y., Cpm-2: Large-scale cost-effective pre-trained language models, AI Open, 2, pp. 216-224, 2021.
- Wu, S., Zhao, X., Yu, T., Zhang, R., Shen, C., Liu, H., Li, F., Zhu, H., Luo, J., Xu, L. and Zhang, X., Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning, arXiv preprint arXiv:2110.04725, 2021.
- Kim, B., Kim, H., Lee, S.W., Lee, G., Kwak, D., Jeon, D.H., Park, S., Kim, S., Kim, S., Seo, D. and Lee, H., What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers, arXiv preprint arXiv:2109.04650, 2021.
- Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M. and Le, Q.V., Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652, 2021.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems, 35, pp. 27730-27744, 2022.
- Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.T., Jin, A., Bos, T., Baker, L., Du, Y. and Li, Y., Lamda: Language models for dialog applications, arXiv preprint arXiv:2201.08239, 2022.
- Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z. and Tang, J., Glm: General language model pretraining with autoregressive blank infilling, arXiv preprint arXiv:2103.10360, 2021.
- Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X. and Tam, W.L., Glm-130b: An open bilingual pre-trained model, arXiv preprint arXiv:2210.02414, 2022.
- Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., Ashok, A., Dhanasekaran, A.S., Naik, A., Stap, D. and Pathak, E., Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, arXiv preprint arXiv:2204.07705, 2022.
- Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M., Multitask prompted training enables zero-shot task generalization, arXiv preprint arXiv:2110.08207, 2021.
- Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J. and Pieler, M., Gpt-neox-20b: An open-source autoregressive language model, arXiv preprint arXiv:2204.06745, 2022.
- Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V. and Mihaylov, T., Opt: Open pre-trained transformer language models, URL https://arxiv.org/abs/2205.01068, 2022.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. and Rodriguez, A., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971, 2023.
- Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O. and Zoph, B., Glam: Efficient scaling of language models with mixture-of-experts, In International Conference on Machine Learning, pp. 5547-5569, PMLR, 2022.
- Soltan, S., Ananthakrishnan, S., FitzGerald, J., Gupta, R., Hamza, W., Khan, H., Peris, C., Rawls, S., Rosenbaum, A., Rumshisky, A. and Prakash, C.S., Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model, arXiv preprint arXiv:2208.01448, 2022.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G. and Ray, A., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374, 2021.
- So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N., & Le, Q. V. Primer: Searching for efficient transformers for language modeling, arXiv preprint arXiv:2109.08668, 2021.
- Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D. and Hajishirzi, H., Self-instruct: Aligning language model with self generated instructions, arXiv preprint arXiv:2212.10560, 2022.
- Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T. and Wu, Y., Solving quantitative reasoning problems with language models, Advances in Neural Information Processing Systems, 35, pp. 3843-3857, 2022.
- Su, H., Zhou, X., Yu, H., Chen, Y., Zhu, Z., Yu, Y., & Zhou, J. Welm: A well-read pre-trained language model for chinese, arXiv preprint arXiv:2209.10372, 2022.
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S. and Schuh, P., Palm: Scaling language modeling with pathways, arXiv preprint arXiv:2204.02311, 2022.
- Scao, T.L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., Gallé, M. and Tow, J., Bloom: A 176b-parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100, 2022.
- Tay, Y., Dehghani, M., Tran, V.Q., Garcia, X., Wei, J., Wang, X., Chung, H.W., Bahri, D., Schuster, T., Zheng, S. and Zhou, D., Ul2: Unifying language learning paradigms, In The Eleventh International Conference on Learning Representations, 2022.
- Iyer, S., Lin, X.V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P.S. and Li, X., Opt-iml: Scaling language model instruction meta learning through the lens of generalization, arXiv preprint arXiv:2212.12017, 2022.
- Muennighoff, N., Wang, T., Sutawika, L., Roberts, A., Biderman, S., Scao, T.L., Bari, M.S., Shen, S., Yong, Z.X., Schoelkopf, H. and Tang, X., Crosslingual generalization through multitask finetuning, arXiv preprint arXiv:2211.01786, 2022.
- Lin, X.V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J. and Pasunuru, R., Few-shot Learning with Multilingual Generative Language Models, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 9019-9052, 2022.
- Tay, Y., Wei, J., Chung, H.W., Tran, V.Q., So, D.R., Shakeri, S., Garcia, X., Zheng, H.S., Rao, J., Chowdhery, A. and Zhou, D., Transcending scaling laws with 0.1% extra compute, arXiv preprint arXiv:2210.11399, 2022.
- Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S. and Webson, A., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416, 2022.
- Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A.J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L. and Kolesnikov, A., Pali: A jointly-scaled multilingual language-image model, URL https://arxiv.org/abs/2209.06794, 2022.
- Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V. and Stojnic, R., Galactica: A large language model for science, arXiv preprint arXiv:2211.09085, 2022.
- Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A. and Hubert, T., Competition-level code generation with alphacode, Science, 378(6624), pp. 1092-1097, 2022.
- Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S. and Xiong, C., Codegen: An open large language model for code with multi-turn program synthesis, arXiv preprint arXiv:2203.13474, 2022.
- Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Wang, Z., Shen, L., Wang, A., Li, Y. and Su, T., Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, arXiv preprint arXiv:2303.17568, 2023.
- Costa-jussà, M.R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J. and Sun, A., No language left behind: Scaling human-centered machine translation, arXiv preprint arXiv:2207.04672, 2022.
- Biderman, S., Schoelkopf, H., Anthony, Q.G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M.A., Purohit, S., Prashanth, U.S., Raff, E. and Skowron, A., Pythia: A suite for analyzing large language models across training and scaling, In International Conference on Machine Learning, pp. 2397-2430, PMLR, 2023.
- Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D., Christiano, P. and Irving, G., Fine-tuning language models from human preferences, arXiv preprint arXiv:1909.08593, 2019.
- Wu, J., Ouyang, L., Ziegler, D.M., Stiennon, N., Lowe, R., Leike, J. and Christiano, P., Recursively summarizing books with human feedback, arXiv preprint arXiv:2109.10862, 2021.
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P.F., Learning to summarize with human feedback, Advances in Neural Information Processing Systems, 33, pp. 3008-3021, 2020.
- Madaan, A., Tandon, N., Clark, P. and Yang, Y., Memory-assisted prompt editing to improve gpt-3 after deployment, arXiv preprint arXiv:2201.06009, 2022.
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J., Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538, 2017.
- Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N. and Chen, Z., GShard: Scaling giant models with conditional computation and automatic sharding, arXiv preprint arXiv:2006.16668, 2020.
- Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N. and Fedus, W., St-moe: Designing stable and transferable sparse expert models, arXiv preprint arXiv:2202.08906, 2022.
- Fedus, W., Zoph, B. and Shazeer, N., Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, The Journal of Machine Learning Research, 23(1), pp. 5232-5270, 2022.
- Artetxe, M., Bhosale, S., Goyal, N., Mihaylov, T., Ott, M., Shleifer, S., Lin, X.V., Du, J., Iyer, S., Pasunuru, R. and Anantharaman, G., Efficient large scale language modeling with mixtures of experts, arXiv preprint arXiv:2112.10684, 2021.
- Rae, J.W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S. and Rutherford, E., Scaling language models: Methods, analysis & insights from training gopher, arXiv preprint arXiv:2112.11446, 2021.
- Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556, 2022.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., Scaling laws for neural language models, arXiv preprint arXiv:2001.08361, 2020.
- Zhao, Z., Wallace, E., Feng, S., Klein, D. and Singh, S., Calibrate before use: Improving few-shot performance of language models, In International Conference on Machine Learning, pp. 12697-12706, PMLR, 2021.
- Tay, Y., Dehghani, M., Rao, J., Fedus, W., Abnar, S., Chung, H.W., Narang, S., Yogatama, D., Vaswani, A. and Metzler, D., Scale efficiently: Insights from pre-training and fine-tuning transformers, arXiv preprint arXiv:2109.10686, 2021.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682, 2022.
- Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M. and Liu, Q., ERNIE: Enhanced language representation with informative entities, arXiv preprint arXiv:1905.07129, 2019.
- Peters, M.E., Neumann, M., Logan IV, R.L., Schwartz, R., Joshi, V., Singh, S. and Smith, N.A., Knowledge enhanced contextual word representations, arXiv preprint arXiv:1909.04164, 2019.
- Xiong, W., Du, J., Wang, W.Y. and Stoyanov, V., Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model, arXiv preprint arXiv:1912.09637, 2019.
- Zhou, W., Lee, D.H., Selvam, R.K., Lee, S., Lin, B.Y. and Ren, X., Pre-training text-to-text transformers for concept-centric common sense, arXiv preprint arXiv:2011.07956, 2020.
- Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J. and Tang, J., KEPLER: A unified model for knowledge embedding and pre-trained language representation, Transactions of the Association for Computational Linguistics, 9, pp. 176-194, 2021.
- Sun, T., Shao, Y., Qiu, X., Guo, Q., Hu, Y., Huang, X. and Zhang, Z., Colake: Contextualized language and knowledge embedding, arXiv preprint arXiv:2010.00309, 2020.
- Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Cao, G., Jiang, D. and Zhou, M., K-adapter: Infusing knowledge into pre-trained models with adapters, arXiv preprint arXiv:2002.01808, 2020.
- Sun, Y., Wang, S., Feng, S., Ding, S., Pang, C., Shang, J., Liu, J., Chen, X., Zhao, Y., Lu, Y. and Liu, W., Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv:2107.02137, 2021.
- Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., Parameter-efficient transfer learning for NLP, In International Conference on Machine Learning, pp. 2790-2799, PMLR, 2019.
- Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E. and Singh, S., Autoprompt: Eliciting knowledge from language models with automatically generated prompts, arXiv preprint arXiv:2010.15980, 2020.
- Li, X.L. and Liang, P., Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190, 2021.
- Han, X., Zhao, W., Ding, N., Liu, Z. and Sun, M., Ptr: Prompt tuning with rules for text classification, AI Open, 3, pp. 182-192, 2022.
- Lester, B., Al-Rfou, R. and Constant, N., The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691, 2021.
- Mosbach, M., Pimentel, T., Ravfogel, S., Klakow, D. and Elazar, Y., Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation, arXiv preprint arXiv:2305.16938, 2023.
- Wang, T., Roberts, A., Hesslow, D., Le Scao, T., Chung, H.W., Beltagy, I., Launay, J. and Raffel, C., What language model architecture and pretraining objective works best for zero-shot generalization?, In International Conference on Machine Learning, pp. 22964-22984, PMLR, 2022.
- Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L. and Chen, W., What Makes Good In-Context Examples for GPT-3?, arXiv preprint arXiv:2101.06804, 2021.
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V., Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116, 2019.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems, 35, pp. 24824-24837, 2022.
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A. and Zhou, D., Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171, 2022.
- Lin, S., Hilton, J. and Evans, O., Truthfulqa: Measuring how models mimic human falsehoods, arXiv preprint arXiv:2109.07958, 2021.
- Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.M., Rothchild, D., So, D., Texier, M. and Dean, J., Carbon emissions and large neural network training, arXiv preprint arXiv:2104.10350, 2021.
- Gehman, S., Gururangan, S., Sap, M., Choi, Y. and Smith, N.A., Realtoxicityprompts: Evaluating neural toxic degeneration in language models, arXiv preprint arXiv:2009.11462, 2020.
- Ung, M., Xu, J. and Boureau, Y.L., Saferdialogues: Taking feedback gracefully after conversational safety failures, arXiv preprint arXiv:2110.07518, 2021.
- Dinan, E., Abercrombie, G., Bergman, A.S., Spruit, S., Hovy, D., Boureau, Y.L. and Rieser, V., Anticipating safety issues in e2e conversational ai: Framework and tooling, arXiv preprint arXiv:2107.03451, 2021.
- Rudinger, R., Naradowsky, J., Leonard, B. and Van Durme, B., Gender bias in coreference resolution, arXiv preprint arXiv:1804.09301, 2018.
- Nangia, N., Vania, C., Bhalerao, R. and Bowman, S.R., CrowS-pairs: A challenge dataset for measuring social biases in masked language models, arXiv preprint arXiv:2010.00133, 2020.
- Nadeem, M., Bethke, A. and Reddy, S., StereoSet: Measuring stereotypical bias in pretrained language models, arXiv preprint arXiv:2004.09456, 2020.
- Levine, Y., Wies, N., Sharir, O., Bata, H. and Shashua, A., Limits to depth efficiencies of self-attention, Advances in Neural Information Processing Systems, 33, pp. 22640-22651, 2020.
- Lester, B., Al-Rfou, R. and Constant, N., The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691, 2021.






| Model | Param-Size | Layers | d-model | Attention heads | Hardware |
|---|---|---|---|---|---|
| Transformer-base [23] | - | 6E, 6D | 512 | 8 | 8 NVIDIA P100 GPUs |
| Transformer-big [23] | - | 6E, 6D | 1024 | 16 | 8 NVIDIA P100 GPUs |
| BERT-base [25] | 110M | 12E | 768 | 12 | 4 Cloud TPUs |
| BERT-large [25] | 340M | 24E | 1024 | 16 | 16 Cloud TPUs (64 TPU chips) |
| GPT-1 [24] | 117M | 12D | 768 | 12 | - |
| GPT-2 [28] | 117M to 1.5B | 24D to 48D | 1600 | 48 | - |
| GPT-3 [29] | 175B | 96 | 12288 | 96 | V100 GPUs (285K CPU cores, 10K GPUs) |
| T5 [26] | 220M-11B | (12E, 12D) | - | - | 1024 TPU v3 |
| REALM [27] | 330M | - | - | - | 64 Google Cloud TPUs, 12GB GPU |
| Jurassic-1 [30] | 178B | 76 | 13824 | 96 | - |
| mT5 [32] | 13B | - | - | - | - |
| Pangu-Alpha [33] | 207B | 64 | 16384 | 128 | 2048 Ascend 910 AI processors |
| CPM-2 [34] | 198B | 24 | 4096 | 64 | - |
| Yuan 1.0 [35] | 245B | - | - | - | - |
| HyperClova [36] | 82B | 64 | 10240 | 80 | 128 DGX servers with 1024 A100 GPUs |
| GLaM [47] | 1.2T (96.6B activated) | 64 MoE | 8192 | 128 | 1024 Cloud TPU-V4 chips (Single System) |
| ERNIE 3.0 [91] | 10B | 48, 12 | 4096, 768 | 64, 12 | 384 NVIDIA V100 GPU cards |
| Gopher [78] | 280B | 80 | 16384 | 128 | 4 DCN-connected TPU v3 Pods (each with 1024 TPU v3 chips) |
| Chinchilla [79] | 70B | 80 | 8192 | 64 | - |
| AlphaCode [64] | 41.1B | 8E, 56D | 6144 | 48, 16 | - |
| CodeGEN [65] | 16.1B | 34 | 256 | 24 | - |
| CodeGeeX [66] | 13B | 39 | 5120 | 40 | 1,536 Ascend 910 AI Processors |
| FLAN [37] | 137B | - | - | - | TPUv3 with 128 cores |
| InstructGPT [38] | 175B | 96 | 12288 | 96 | V100 GPUs |
| LaMDA [39] | 137B | 64 | 8192 | 128 | 1024 TPU-v3 chips |
| T0 [43] | 11B | 12 | - | - | - |
| GPT NeoX 20B [44] | 20B | 44 | 6144 | 64 | 12 AS-4124GO-NART servers (each with 8 NVIDIA A100-SXM4-40GB GPUs) |
| OPT [45] | 175B | 96 | 12288 | 96 | 992 80GB A100 GPUs |
| MINERVA [52] | 540.35B | 118 | 18432 | 48 | - |
| AlexaTM 20B [48] | 20B (19.75B) | 46E, 32D | 4096 | 32 | 128 A100 GPUs |
| GLM-130B [41] | 130B | 70 | 12288 | 96 | 96 NVIDIA DGX-A100 (8x40G) |
| XGLM [59] | 7.5B | 32 | 4096 | - | - |
| PaLM [54] | 540.35B | 118 | 18432 | 48 | 6144 TPU v4 chips (2 Pods) |
| Galactica [63] | 120B | 96 | 10240 | 80 | 128 NVIDIA A100 80GB nodes |
| Pali [62] | 16.9B (~17B) | - | - | - | - |
| LLaMA [46] | 65B | 80 | 8192 | 64 | 2048 A100 GPU (80GB RAM) |
| UL2 [56] | 20B | 32E, 32D | 4096 | 16 | 64 to 128 TPUv4 chips |
| Pythia [68] | 12B | 36 | 5120 | 40 | - |
| WeLM [53] | 10B | 32 | 5120 | 40 | 128 A100-SXM4-40GB GPUs |
| BLOOM [55] | 176B | 70 | 14336 | 112 | 48 nodes having 8 NVIDIA A100 80GB GPUs (384 GPUs) |
| GLM [40] | 515M | 30 | 1152 | 18 | 64 V100 GPUs |
| GPT-J | 6B | 28 | 4096 | 16 | TPU v3-256 pod |
| YaLM | 100B | - | - | - | 800 A100 GPUs |
| Alpaca | 7B | - | - | - | 8 80GB A100s |
| Falcon | 40B | - | - | - | - |
| (Xmer) XXXL [82] | 30B | 28 | 1280 | 256 | 64 TPU-v3 chips |
| [77] | 1.1T | 32 | 4096 | 512 (experts) | - |
| XLM-R [100] | 550M | 24 | 1024 | 16 | - |

| Model | Architecture | Objectives | Pretraining Dataset | Tokens, Corpus Size |
|---|---|---|---|---|
| Transformer-base [23] | Encoder-Decoder | Supervised sequence-to-sequence (translation) | WMT 2014 | - |
| Transformer-big [23] | Encoder-Decoder | Supervised sequence-to-sequence (translation) | WMT 2014 | - |
| BERT-base [25] | Encoder-only | MLM, NSP | BooksCorpus, English Wikipedia | 137B, - |
| BERT-large [25] | Encoder-only | MLM, NSP | BooksCorpus, English Wikipedia | 137B, - |
| GPT-1 [24] | Decoder-only | Causal/LTR-LM | BooksCorpus, 1B Word Benchmark | - |
| GPT-2 [28] | Decoder-only | Causal/LTR-LM | Reddit, WebText | -, 40GB |
| GPT-3 [29] | Decoder-only | Causal/LTR-LM | Common Crawl, WebText, English-Wikipedia, Books1, Books2 | 300B, 570GB |
| T5 [26] | Encoder-Decoder | MLM, Span Correction | C4 | (1T tokens) 34B, 750GB |
| REALM [27] | Retriever + Encoder | Salient Span Masking | English Wikipedia (2018) | - |
| Jurassic-1 [30] | Decoder-only | Causal/LTR-LM | Wikipedia, OWT, Books, C4, PileCC | 300B, - |
| mT5 [32] | Encoder-Decoder | MLM, Span Corruption | mC4 | - |
| Pangu-Alpha [33] | Decoder + Query Layer | LM | Public datasets (e.g., BaiDuQA, CAIL2018, Sogou-CA), Common Crawl, encyclopedia, news, and e-books | 1.1TB (80TB raw) |
| CPM-2 [34] | Encoder-Decoder | MLM | Encyclopedia, novels, QA, scientific literature, e-books, news, and reviews | -, 2.3TB Chinese data and 300GB English data |
| Yuan 1.0 [35] | Decoder-only | LM, PLM | Common Crawl, Public Datasets, Encyclopedia, Books | -, 5TB |
| HyperClova [36] | Decoder-only | LM | Blog, Cafe, News, Comments, KiN, Modu, WikiEn, WikiJp, Others | 561B |
| GLaM [47] | Sparse/MoE Decoder-only | LM | Web Pages, Wikipedia, Forums, Books, News, Conversations | 1.6T tokens, - |
| ERNIE 3.0 [91] | Transformer-XL structure | UKTP | Plain texts and a large-scale knowledge graph | 375B, 4TB |
| Gopher [78] | Decoder-only | LM | MassiveText (MassiveWeb, Books, C4, News, GitHub, Wikipedia) | 300B |
| Chinchilla [79] | - | - | MassiveText | 1.4T |
| AlphaCode [64] | Encoder-Decoder | MLM, LM | GitHub, CodeContests | 967B |
| CodeGEN [65] | Decoder-only | LM | THEPILE, BIGQUERY, and BIGPYTHON | 505.5B |
| CodeGeeX [66] | Decoder-only | LM | The Pile, CodeParrot Collected | 850B |
| FLAN [37] | Decoder-only | LM | Web documents, dialog data, and Wikipedia | 2.49T tokens, - |
| InstructGPT [38] | Decoder-only | LTR-LM | Common Crawl, WebText, English-Wikipedia, Books1, Books2, Prompt Dataset (SFT, RM, PPO) | 300B, 570GB |
| LaMDA [39] | Decoder-only | LM | Public dialog data and web text | 168B (2.97B documents, 1.12B dialogs, and 13.39B dialog utterances; 1.56T words), - |
| T0 [43] | Encoder-Decoder | MLM + LM | C4 | 1T tokens + 100B |
| GPT NeoX 20B [44] | Decoder-only | LM | The Pile | -, 825GB |
| OPT [45] | Decoder-only | - | BookCorpus, Stories, the Pile, and PushShift.io Reddit | 180B tokens |
| MINERVA [52] | Decoder-only + Parallel Layers | LM | Technical content dataset (containing scientific and mathematical data), questions from MIT’s OpenCourseWare, in addition to the PaLM pretraining dataset | 38.5B tokens (math content), - |
| AlexaTM 20B [48] | seq2seq (Encoder-Decoder) | Mix of denoising and Causal Language Modeling (CLM) tasks | Wikipedia and mC4 datasets | 1T tokens, - |
| GLM-130B [41] | Bidirectional encoder & unidirectional decoder | GLM, MIP (Multi-Task Instruction Pre-Training) | - | 400B tokens, - |
| XGLM [59] | Decoder-only | Causal LM | CC100-XL | - |
| PaLM [54] | Decoder-only + Parallel Layers | LM | Social media conversations, filtered webpages, Wikipedia (multilingual), books, GitHub, news (English) | 780B tokens, - |
| Galactica [63] | Decoder-only | - | Papers, code, reference material, knowledge bases, filtered CommonCrawl, prompts, GSM8k, OneSmallStep, Khan Problems, Workout, Other | 106B |
| Pali [62] | Encoder-Decoder and Vision Transformers | Mixture of 8 pretraining tasks | WebLI (10B images and texts in over 100 languages), 29 billion image-OCR pairs | - |
| LLaMA [46] | Transformer | LM | CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange | 1.4T tokens, - |
| UL2 [56] | Encoder-Decoder, Decoder-only Prefix LM | R, S, X denoising | C4 | 32B tokens, - |
| Pythia [68] | Decoder-only | LM | The Pile | 300B tokens, - |
| WeLM [53] | - | - | Common Crawl, news, books, forums, academic writings | - |
| BLOOM [55] | Causal Decoder-only | LM | ROOTS corpus (46 natural languages and 13 programming languages) | 366B, 1.61TB |
| GLM [40] | Bidirectional encoder & unidirectional decoder | GLM | - | - |
| GPT-J | Mesh Transformer JAX | LM | The Pile | 402B |
| YaLM | - | - | Online texts, The Pile, books, other resources (in English, Russian) | -, 1.7TB |
| Alpaca | - | - | - | - |
| Falcon | Decoder-only | LM | RefinedWeb, Reddit | 1T |
| (Xmer) XXXL [82] | Encoder-Decoder | MLM | C4 | 1T tokens |
| [77] | Decoder-only (MoE) | LM | BookCorpus, English Wikipedia, CC-News, OpenWebText, CC-Stories, CC100 | 300B |
| XLM-R [100] | Encoder-only | Multilingual MLM | CommonCrawl (CC-100) | -, 2.5TB |

| Model | PT, FT Batch Size | Context Size | PT, FT Epochs | Activation, Optimizer | Fine Tuning Methods |
|---|---|---|---|---|---|
| Transformer-base [23] | - | - | 100,000 steps | -, Adam | Feature Based |
| Transformer-big [23] | - | - | 300,000 steps | -, Adam | Feature Based |
| BERT-base [25] | 256, 32 | 128, 512 | 40, 4 | GELU, Adam | FT |
| BERT-large [25] | 256, 32 | 128, 512 | 40, 4 | GELU, Adam | FT |
| GPT-1 [24] | 64 | 512 | 100 | GELU, Adam | FT, zero-shot |
| GPT-2 [28] | 512 | 1024 | - | GELU, Adam | zero-shot |
| GPT-3 [29] | 3.2M | 2048 | - | - | few-shot, one, zero-shot |
| T5 [26] | 128, 128 | 512 | , steps | ReLU, AdaFactor | FT |
| REALM [27] | 512, 1 | - | 200k steps, 2 epochs | - | - |
| Jurassic-1 [30] | 3.2M tokens | 2048 | - | - | few-shot, zero-shot |
| mT5 [32] | - | - | - | GeGLU, - | FT, zero-shot |
| Pangu-Alpha [33] | - | 1024 | 130K 260K | GeLU | few-shot, one, zero-shot |
| CPM-2 [34] | - | - | - | - | FT, PT |
| Yuan 1.0 [35] | - | - | - | - | few-shot, zero-shot |
| HyperClova [36] | 1024,- | - | - | -, AdamW | few-shot, zero-shot, PT |
| GLaM [47] | - | 1024 | - | -, Adafactor | zero, one and few-shot |
| ERNIE 3.0 [91] | 6144 | 512 | - | GeLU, Adam | FT, zero and few-shot |
| Gopher [78] | - | 2048 | - | Adam | FT, few-shot, zero-shot |
| Chinchilla [79] | - | - | - | AdamW | FT, zero-shot |
| AlphaCode [64] | 2048 | - | 205K | - | fine tuning |
| CodeGEN [65] | 2M | 2048 | - | - | zero-shot |
| CodeGeeX [66] | 3072 | - | - | FastGELU, Adam | fine tuning |
| FLAN [37] | -, 8192 | 1024 | , 30K | -, Adafactor | Instruction Tuning, zero-shot |
| InstructGPT [38] | 3.2M | 2048 | - | - | RLHF |
| LaMDA [39] | - | - | - | gated-GELU, - | FT |
| T0 [43] | - | - | - | ReLU, AdaFactor | FT, zero-shot |
| GPT NeoX 20B [44] | 3.15M tokens | 2048 | 150K steps | -, AdamW with ZeRO | few-shot |
| OPT [45] | 2M tokens | 2048 | - | ReLU, AdamW | few-shot, zero-shot |
| MINERVA [52] | - | 1024 | 399K | SwiGLU, Adafactor | few-shot, chain-of-thought context |
| AlexaTM 20B [48] | 2M tokens | - | - | -, Adam | Fine Tuning, few-shot, one-shot, zero-shot |
| GLM-130B [41] | 4224 | 2048 | - | GeGLU, - | zero-shot, few (5) shots |
| XGLM [59] | - | - | - | - | - |
| PaLM [54] | 512, 1024, 2048 (1, 2, 4M tokens), - | 2048 | 1 (255k steps) | SwiGLU, Adafactor | few-shot, Chain of Thought, finetuning |
| Galactica [63] | 2M | 2048 | 4 epochs | GeLU | zero-shot |
| Pali [62] | - | - | - | - | - |
| LLaMA [46] | 4M tokens | - | - | SwiGLU, AdamW | zero-shot, few-shot, Instruction Tuning |
| UL2 [56] | 128 | 512 | 500K steps | SwiGLU, Adafactor | in-context learning, zero-shot, one-shot, fine-tuning, instruction tuning |
| Pythia [68] | 1024 | 2048 | 1.5 Epochs | Adam | zero-shot |
| WeLM [53] | 2048 | 2048 | - | - | zero-shot, few-shot |
| BLOOM [55] | 2048, 2048 | 2048 | - | GELU, - | zero-shot, few-shot, multitask prompted (fine) tuning |
| GLM [40] | 1024 | - | 200K steps | - | FT |
| GPT-J | - | 2048 | 383,500 steps | - | FT |
| YaLM | - | - | - | - | - |
| Alpaca | - | - | - | - | FT, IT (Instruction Tuning) |
| Falcon | - | 2048 | - | - | - |
| (Xmer) XXXL [82] | - | - | - | - | FT |
| [77] | - | 2048 | - | - | FT, zero-shot, few-shot |
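
The batch-size column above mixes two units: some rows count sequences (for example PaLM's 512, 1024, 2048), while others count tokens (for example Jurassic-1's 3.2M tokens or OPT's 2M tokens). Assuming every sequence in a batch is packed to the full context size, the two are related by tokens per batch = sequences × context size. The sketch below applies that relation to the PaLM row; the helper name `tokens_per_batch` is hypothetical and the full-packing assumption is ours, not a statement from the surveyed papers.

```python
def tokens_per_batch(num_sequences: int, context_size: int) -> int:
    """Tokens processed per optimizer step, assuming full-length sequences."""
    return num_sequences * context_size


# Values copied from the PaLM row: batch sizes in sequences, context in tokens.
palm_batch_sizes = [512, 1024, 2048]
palm_context = 2048

for n in palm_batch_sizes:
    tokens = tokens_per_batch(n, palm_context)
    # 2**20 tokens is the "1M" used loosely in the table.
    print(f"{n} sequences x {palm_context} tokens = {tokens:,} tokens (~{tokens / 2**20:.0f}M)")
```

Under this assumption the three PaLM batch sizes correspond to about 1M, 2M, and 4M tokens, which agrees with the parenthetical in that row and gives a way to compare rows reported in different units.
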

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).