Submitted: 04 December 2025
Posted: 08 December 2025
Abstract
Keywords:
1. The Data-Blindness of Current LLM Architectures
2. Representing Data Provenance, Density, and Diversity Inside Models
- a standard parametric core (transformer + MoE, etc.) that learns task performance,
- plus an attached meta-representation over data regions, implemented as a small learned table or graph whose nodes correspond to regions and whose entries track statistics such as total tokens seen, number of distinct sources, distribution over time, and estimated label noise.
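The meta-representation described in the list above can be illustrated with a minimal sketch. It assumes the simplest possible form: a plain table keyed by region name and updated by counting, rather than the learned table or graph the text proposes. The names `RegionStats`, `DataMetaTable`, `observe`, and `summary` are hypothetical and chosen only for illustration, and the per-batch noise estimate is assumed to come from some external estimator.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RegionStats:
    """Statistics for one data region (one node of the meta-table)."""
    total_tokens: int = 0
    sources: set = field(default_factory=set)              # distinct source ids
    tokens_by_year: Counter = field(default_factory=Counter)  # distribution over time
    noise_sum: float = 0.0                                 # token-weighted noise total

    @property
    def est_label_noise(self) -> float:
        # Token-weighted mean of the per-batch noise estimates seen so far.
        return self.noise_sum / self.total_tokens if self.total_tokens else 0.0


class DataMetaTable:
    """Hypothetical counted (not learned) table attached to a parametric core."""

    def __init__(self):
        self.regions = {}  # maps region name -> RegionStats

    def observe(self, region, n_tokens, source, year, noise_estimate):
        """Record one training batch attributed to `region`."""
        stats = self.regions.setdefault(region, RegionStats())
        stats.total_tokens += n_tokens
        stats.sources.add(source)
        stats.tokens_by_year[year] += n_tokens
        stats.noise_sum += noise_estimate * n_tokens

    def summary(self, region):
        """Return the tracked statistics for `region`, or None if never seen."""
        stats = self.regions.get(region)
        if stats is None:
            return None
        return {
            "total_tokens": stats.total_tokens,
            "num_sources": len(stats.sources),
            "tokens_by_year": dict(stats.tokens_by_year),
            "est_label_noise": stats.est_label_noise,
        }
```

In this sketch a region that was never observed returns `None`, which distinguishes "no data seen" from "data seen with low noise". How regions are defined and how batches are assigned to them is left open here, as it is in the list above.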
3. Inference-Time Access to Meta-Data: Querying One’s Own Epistemic State
4. Active Data Acquisition and Applications to Robustness and Scientific Discovery
5. Governance of Data Meta-Layers: Privacy, Standards, and Long-Term Challenges
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).