Submitted:
11 April 2024
Posted:
12 April 2024
Abstract
Keywords:
1. Design Goals
2. Introduction
3. Source and Curation
3.1. Source of Dataset
3.2. Synthetic Data Generation
3.3. Curation of Intellecta
4. Dataset Description
5. Evaluation and Results
6. Conclusion
Appendix A
Appendix A.1. Prompt for Thought data


Appendix A.2. Prompt for Textbook data


References
- Gunasekar, Suriya and Zhang, Yi and Aneja, Jyoti and Mendes, Caio César Teodoro and Del Giorno, Allie and Gopi, Sivakanth and Javaheripi, Mojan and Kauffmann, Piero and de Rosa, Gustavo and Saarikivi, Olli and others. Textbooks Are All You Need. arXiv preprint arXiv:2306.11644, 2023. [CrossRef]
- Soldaini, Luca and Kinney, Rodney and Bhagia, Akshita and Schwenk, Dustin and Atkinson, David and Authur, Russell and Bogin, Ben and Chandu, Khyathi and Dumas, Jennifer and Elazar, Yanai and others. Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint arXiv:2402.00159, 2024. [CrossRef]
- Longpre, Shayne and Yauney, Gregory and Reif, Emily and Lee, Katherine and Roberts, Adam and Zoph, Barret and Zhou, Denny and Wei, Jason and Robinson, Kevin and Mimno, David and others. A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity. arXiv preprint arXiv:2305.13169, 2023. [CrossRef]
- Lee, Alycia and Miranda, Brando and Koyejo, Sanmi. Beyond Scale: The Diversity Coefficient as a Data Quality Metric Demonstrates LLMs Are Pre-trained on Formally Diverse Data. arXiv preprint arXiv:2306.13840, 2023. [CrossRef]
- Nikhil Kandpal, Eric Wallace, and Colin Raffel. Enhancing mathematical capabilities through ChatGPT and similar generative artificial intelligence: Roles and challenges in solving mathematical problems. Available at SSRN 4603237, 2023. [CrossRef]
- Li, Junyi and Tang, Tianyi and Zhao, Wayne Xin and Nie, Jian-Yun and Wen, Ji-Rong. Pretrained Language Models for Text Generation: A Survey. arXiv preprint arXiv:2201.05273, 2022. [CrossRef]
- Hosseini, Hossein and Kannan, Sreeram and Zhang, Baosen and Poovendran, Radha. Deceiving Google's Perspective API Built for Detecting Toxic Comments. arXiv preprint arXiv:1702.08138, 2017. [CrossRef]
- Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and others. Data-Juicer: A One-Stop Data Processing System for Large Language Models. arXiv preprint arXiv:2309.02033, 2023. [CrossRef]
- Schubert, Erich and Sander, Jörg and Ester, Martin and Kriegel, Hans Peter and Xu, Xiaowei. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems (TODS) 2017, 42(2), 1–21; ACM: New York, NY, USA.
- Kandpal, Nikhil and Wallace, Eric and Raffel, Colin. Deduplicating Training Data Mitigates Privacy Risks in Language Models. International Conference on Machine Learning, 2022; PMLR, pp. 10697–10707.
- Sadowski, Caitlin and Levin, Greg. SimHash: Hash-based Similarity Detection. Technical report, Google, 2007.
- Chaudhuri, Arindam and Mandaviya, Krupa and Badelia, Pratixa and Ghosh, Soumya K. Optical Character Recognition Systems. Springer, 2017.




| Model | Parameters | Training Tokens | ARC | HellaSwag | MMLU | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| EleutherAI/pythia-1b-deduped | 1.1B | - | 29.10 | 49.65 | 24.27 | 53.59 | 1.14 |
| facebook/opt-1.3b | 1.3B | 180B | 29.52 | 54.53 | 24.96 | 59.75 | 0.15 |
| Qwen/Qwen1.5-0.5B | 620M | - | 31.48 | 49.05 | 39.35 | 57.22 | 16.30 |
| HuggingFaceTB/cosmo-1b | 1.8B | 30B | 38.57 | 55.13 | 26.69 | 55.49 | 5.53 |
| TinyLlama/TinyLlama-1.1B-Chat-v0.6 | 1.1B | 3T | 31.66 | 55.79 | 25.98 | 59.35 | 2.12 |
| boomer-634m | 634M | 11.5B | 29.86 | 39.24 | 25.91 | 50.61 | 1.67 |
| EleutherAI/gpt-neo-1.3B | 1.3B | 380B | 31.23 | 48.47 | 24.82 | 56.91 | 0.45 |
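As a rough way to compare the models above on a single number, the per-model mean across the five benchmarks can be computed directly from the table. This is only an illustrative sketch: equal weighting of the five tasks is an assumption, not an aggregate metric defined in the paper.

```python
# Mean of the five benchmark scores (ARC, HellaSwag, MMLU, Winogrande, GSM8K)
# for each model, taken verbatim from the table above.
scores = {
    "EleutherAI/pythia-1b-deduped": [29.10, 49.65, 24.27, 53.59, 1.14],
    "facebook/opt-1.3b": [29.52, 54.53, 24.96, 59.75, 0.15],
    "Qwen/Qwen1.5-0.5B": [31.48, 49.05, 39.35, 57.22, 16.30],
    "HuggingFaceTB/cosmo-1b": [38.57, 55.13, 26.69, 55.49, 5.53],
    "TinyLlama/TinyLlama-1.1B-Chat-v0.6": [31.66, 55.79, 25.98, 59.35, 2.12],
    "boomer-634m": [29.86, 39.24, 25.91, 50.61, 1.67],
    "EleutherAI/gpt-neo-1.3B": [31.23, 48.47, 24.82, 56.91, 0.45],
}

# Unweighted mean per model, rounded to two decimals like the table entries.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}

# Print models from highest to lowest mean score.
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{avg:6.2f}  {model}")
```

Under this (assumed) equal weighting, Qwen1.5-0.5B's strong MMLU and GSM8K scores pull its mean above the larger models, which is why a single aggregate number should be read with care.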
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).