Submitted: 04 December 2025
Posted: 04 December 2025
Abstract
Keywords:
1. Introduction
2. Why Tokens and Parameters Misalign with Capabilities
3. Representing Task Space as Capability Graphs
4. Capability-Centric Training: Curricula, Data, and Loss Functions
5. System-Level Architectures: Multi-Agent Specialization on Capability Subgraphs
- A literature & retrieval agent specialized on “information seeking” and “evidence synthesis” nodes;
- A theory & reasoning agent covering “symbolic manipulation”, “dimensional analysis”, and “scaling-law derivation” capabilities;
- A simulation & coding agent focusing on “numerical setup”, “code generation”, and “debugging” subgraphs;
- A safety & governance agent that monitors high-risk capabilities and enforces domain-specific constraints.
- Router + specialists (MRKL style). The capability graph is partitioned into regions handled by different specialists (calculator for arithmetic, retriever for factual recall, LLM for open-ended reasoning). A router agent learns a mapping from user queries (and intermediate states) onto these regions.
- Planner + workers (AutoGen / task-decomposition style). A planner agent operates at a higher level of the capability graph, decomposing a macro-task into a sequence of sub-tasks (nodes/edges). Worker agents specialize on narrower subgraphs (e.g., “code and run simulations”, “summarize experimental literature”), with the planner deciding which worker to invoke when.
- Hierarchical controllers. Multiple layers of agents correspond to different abstraction layers of the capability graph: top-level agents reason about high-level strategy (which capabilities to invoke in what order), mid-level agents handle specific domains (fluid mechanics vs. materials vs. data analysis), and low-level agents interface with concrete tools (CFD solvers, lab robots, databases).
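The router and planner patterns above can be illustrated with a minimal sketch. It assumes a fixed lookup table in place of the learned router the text describes, and the names `Specialist`, `route`, `dispatch_plan`, and the capability node labels are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    """A specialist agent and the capability-graph nodes it is responsible for."""
    name: str
    capabilities: frozenset


# Hypothetical partition of the capability graph (node names are illustrative).
SPECIALISTS = (
    Specialist("calculator", frozenset({"arithmetic"})),
    Specialist("retriever", frozenset({"factual_recall"})),
    Specialist("llm", frozenset({"open_ended_reasoning"})),
)


def route(capability, specialists=SPECIALISTS):
    """Router: return the name of the specialist whose region contains `capability`."""
    for specialist in specialists:
        if capability in specialist.capabilities:
            return specialist.name
    raise KeyError(f"no specialist covers capability {capability!r}")


def dispatch_plan(plan, specialists=SPECIALISTS):
    """Planner + workers: assign each (sub_task, capability) step to a worker, in order."""
    return [(sub_task, route(capability, specialists)) for sub_task, capability in plan]


# A planner's decomposition of one macro-task into ordered sub-tasks.
plan = [
    ("look up the fluid's viscosity", "factual_recall"),
    ("compute the Reynolds number", "arithmetic"),
    ("interpret the flow regime", "open_ended_reasoning"),
]
assignments = dispatch_plan(plan)
```

In a real system the `route` function would presumably be replaced by a trained classifier over queries and intermediate states; the static table here only shows how a graph partition determines which specialist handles each step.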
- Modularity and maintainability. Because each agent is responsible for a subset of capabilities, we can update or replace it independently. If a new, better simulation engine is available, only the simulation agent and its graph region need to be retrained and re-certified.
- Interpretability. When a failure occurs (e.g., an incorrect scaling law for droplet spreading or an unsafe experimental suggestion), it can often be attributed to a specific agent/subgraph: the theory agent’s reasoning, the planner’s decomposition, or the safety agent’s oversight. This makes post-mortems and fixes more targeted.
- Safety and capability isolation. High-risk capabilities (bio-lab planning, chemical synthesis, security-relevant code) can be isolated into specialized agents wrapped in strong guardrails and access controls. Other agents can be prohibited from calling them directly, forcing requests through an oversight layer. MRKL-style routers and AutoGen-style orchestrators can implement policy checks at routing time [41,42,45].
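As a rough illustration of a routing-time policy check, the sketch below redirects any request for a high-risk capability to an oversight agent unless the oversight agent itself is the caller. The agent names, the `HIGH_RISK` set, the `OWNERS` table, and the `resolve_target` function are all assumptions made for this example, not part of MRKL, AutoGen, or any other framework.

```python
# Hypothetical labels for the high-risk capability nodes named in the text.
HIGH_RISK = frozenset({"bio_lab_planning", "chemical_synthesis", "security_relevant_code"})
OVERSIGHT_AGENT = "safety_governance"  # hypothetical name for the oversight layer


def resolve_target(caller, capability, owners):
    """Return the agent that should receive a request for `capability`.

    Requests for high-risk capabilities are redirected to the oversight agent
    unless the oversight agent is the one making the call.
    """
    if capability not in owners:
        raise KeyError(f"no agent owns capability {capability!r}")
    if capability in HIGH_RISK and caller != OVERSIGHT_AGENT:
        return OVERSIGHT_AGENT
    return owners[capability]


# Hypothetical ownership table: capability node -> responsible agent.
OWNERS = {
    "chemical_synthesis": "synthesis_specialist",
    "code_generation": "simulation_coding",
}

# A planner asking for a high-risk capability is sent to oversight first.
redirected = resolve_target("planner", "chemical_synthesis", OWNERS)
# The oversight agent may forward the request to the isolated specialist.
forwarded = resolve_target(OVERSIGHT_AGENT, "chemical_synthesis", OWNERS)
# Low-risk capabilities are dispatched directly.
direct = resolve_target("planner", "code_generation", OWNERS)
```

Placing the check in the routing function means every call passes through the same policy, so no individual agent needs to implement it separately.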
6. Redefining Evaluation, Red-Teaming, and Safety in Capability Terms
References
- Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S. Language models are few-shot learners. Advances in neural information processing systems. 2020;33:1877-901.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J. and Amodei, D., 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.D.L., Hendricks, L.A., Welbl, J., Clark, A. and Hennigan, T., 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, pp.24824-24837.
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A.A.M., Abid, A., Fisch, A., Brown, A.R., Santoro, A., Gupta, A., Garriga-Alonso, A. and Kluska, A., 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on machine learning research.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, pp.27730-27744.
- Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610-623).
- Wiggins, W.F. and Tejani, A.S., 2022. On the opportunities and risks of foundation models for natural language processing in radiology. Radiology: Artificial Intelligence, 4(4), p.e220119.
- Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A. and Newman, B., 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S. and Nori, H., 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H. and Farhadi, A., 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A.A.M., Abid, A., Fisch, A., Brown, A.R., Santoro, A., Gupta, A., Garriga-Alonso, A. and Kluska, A., 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on machine learning research.
- Meng, K., Bau, D., Andonian, A. and Belinkov, Y., 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35, pp.17359-17372.
- Meng, K., Sharma, A.S., Andonian, A., Belinkov, Y. and Bau, D., 2022. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.
- Tam, D., Bansal, M. and Raffel, C., 2023. Merging by matching models in task parameter subspaces. arXiv preprint arXiv:2312.04339.
- Wiggins, W.F. and Tejani, A.S., 2022. On the opportunities and risks of foundation models for natural language processing in radiology. Radiology: Artificial Intelligence, 4(4), p.e220119.
- Bommasani, R., 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R. and Peng, W., 2024. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253.
- Wang, Z., Chen, E. and Zhao, Y., 2018. The effect of surface anisotropy on contact angles and the characterization of elliptical cap droplets. Science China Technological Sciences, 61(2), pp.309-316.
- Wang, Z. and Zhao, Y.P., 2017. Wetting and electrowetting on corrugated substrates. Physics of Fluids, 29(6).
- Wang, Z., Lin, K. and Zhao, Y.P., 2019. The effect of sharp solid edges on the droplet wettability. Journal of colloid and interface science, 552, pp.563-571.
- Wang, Z., Wang, X., Miao, Q., Gao, F. and Zhao, Y.P., 2021. Spontaneous motion and rotation of acid droplets on the surface of a liquid metal. Langmuir, 37(14), pp.4370-4379.
- Wang, Z., Wang, X., Miao, Q. and Zhao, Y.P., 2021. Realization of self-rotating droplets based on liquid metal. Advanced Materials Interfaces, 8(3), p.2001756.
- Wang, Z.L. and Lin, K., 2023. The multi-lobed rotation of droplets induced by interfacial reactions. Physics of Fluids, 35(2).
- Hu, J., Zhao, H., Xu, Z., Hong, H. and Wang, Z.L., 2024. The effect of substrate temperature on the dry zone generated by the vapor sink effect. Physics of Fluids, 36(6).
- Hu, J. and Wang, Z.L., 2024. The effect of hygroscopic liquids on the spatial controlling of condensation on low-temperature surfaces. Surfaces and Interfaces, 55, p.105430.
- Hu, J. and Wang, Z.L., 2024. Analysis of fluid flow in fractal microfluidic channels. arXiv preprint arXiv:2409.12845.
- Ba, Y., Mancenido, M.V. and Pan, R., 2024. Fill in the gaps: Model calibration and generalization with synthetic data. arXiv preprint arXiv:2410.10864.
- Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N.A. and Lewis, M., 2023, December. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 5687-5711).
- Bengio, Y., Louradour, J., Collobert, R. and Weston, J., 2009, June. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning (pp. 41-48).
- Hu, J. and Wang, Z.L., 2025. Dynamic Wetting and Spreading of High-Viscosity Liquids on Grooved Substrates.
- Hu, J. and Wang, Z.L., 2024. Crystallization morphology and self-assembly of polyacrylamide solutions during evaporation. arXiv preprint arXiv:2403.20191.
- Hu, J. and Wang, Z.L., 2023. Inhibition of water vapor condensation by dipropylene glycol droplets on hydrophobic surfaces via vapor sink strategy. arXiv preprint arXiv:2311.03930.
- Graves, A., Bellemare, M.G., Menick, J., Munos, R. and Kavukcuoglu, K., 2017, July. Automated curriculum learning for neural networks. In international conference on machine learning (pp. 1311-1320).
- Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M.E. and Stone, P., 2020. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181), pp.1-50.
- Wang, Z.L., Zhao, H., Xu, Z. and Hong, H., 2023. Suppression of water vapor condensation by glycerol droplets on hydrophobic surfaces. arXiv preprint arXiv:2311.03068.
- Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D. and Hajishirzi, H., 2023, July. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) (pp. 13484-13508).
- Settles, B., 2009. Active learning literature survey. Univ. Wisconsin-Madison Tech. Rep. 1648.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J. and Awadallah, A.H., 2024, August. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling.
- Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K. and Muhlgay, D., 2022. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445.
- Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N. and Scialom, T., 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, pp.68539-68551.
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R. and Cao, Y., 2022, October. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations.
- Chin, S.Y. and Why, D.N.K., Comparative of Multi-Agent System Frameworks: Crewai, Langchain, and Autogen.
- Dibia, V., Chen, J., Bansal, G., Syed, S., Fourney, A., Zhu, E., Wang, C. and Amershi, S., 2024. Autogen studio: A no-code developer tool for building and debugging multi-agent systems. arXiv preprint arXiv:2408.15247.
- Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C. and Chen, C., 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
- Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K. and Jones, A., 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).