Abstract
Keywords:
1. Introduction: From Solitary Models to Artificial Communities
2. Why the Next Scaling Frontier Is Multi-Agent, Not Just Bigger Models
- Population scale refers to the number and diversity of agents in a system. A trivial multi-agent setup might simply duplicate the same base model several times; a more sophisticated one might include agents with different model sizes, training data, or alignment procedures, as well as specialist tools and simulators. Diversity in cognitive styles and priors—“optimistic vs skeptical”, “global vs local”—plays a role analogous to polydispersity in droplet ensembles or heterogeneity in porous media.
- Organizational scale captures the topology and hierarchy of interactions: are agents arranged in a simple planner–worker structure, a star topology with a central judge, a deep hierarchy with multiple levels of review, or a graph with local neighborhood communication? Just as the branching ratio and connectivity of fractal microchannels determine flow and mixing patterns,[25,36] different communication topologies in LLM societies can lead to dramatically different modes of convergence, exploration, and error propagation.[10,15,22,24]
- Institutional scale concerns the maturity of norms, protocols, and shared memories that govern the system over time. Human scientific communities rely on journals, peer review, standards, and archives to stabilize knowledge and coordinate activity.[12] Artificial communities can similarly maintain institutional memory in shared vector stores, version-controlled artefacts, and explicit procedural templates. Prior work on condensation control and pattern formation under repeated cycles of wetting and drying provides a physical analogue: history-dependent phenomena—such as hysteresis in contact angles or path-dependent crystallization—show that past interactions matter for present behavior.[32,33,34,35,37]
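As a concrete, purely illustrative sketch of these three scales, the snippet below represents a small LLM society as a population of heterogeneous agent specifications, a communication topology over them, and an institutional memory of commit-worthy artefacts. The names `AgentSpec`, `Society`, and the example roles are hypothetical and not drawn from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    """Population scale: who is in the society and how the members differ."""
    name: str
    base_model: str                      # e.g. a small vs. large checkpoint
    prior: str                           # e.g. "optimistic", "skeptical", "local", "global"
    tools: list[str] = field(default_factory=list)

@dataclass
class Society:
    """Organizational and institutional scale in one container."""
    agents: list[AgentSpec]
    # Organizational scale: directed edges defining who may talk to whom.
    topology: dict[str, list[str]]
    # Institutional scale: durable, commit-worthy artefacts shared over time.
    institutional_memory: list[dict] = field(default_factory=list)

society = Society(
    agents=[
        AgentSpec("planner", base_model="large-ckpt", prior="global"),
        AgentSpec("skeptic", base_model="medium-ckpt", prior="skeptical"),
        AgentSpec("executor", base_model="small-ckpt", prior="local", tools=["python"]),
    ],
    topology={"planner": ["skeptic", "executor"], "skeptic": ["planner"], "executor": ["planner"]},
)
society.institutional_memory.append(
    {"kind": "protocol", "content": "two independent reviews before any artefact is committed"}
)
```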
3. Interaction Regimes: Competition, Collaboration, and Coordination
3.1. Competitive Regimes: Debate, Self-Play, and Adversarial Search
- Diversity of priors and roles among agents, so that at least some are predisposed to challenge majority views;
- Debate protocols that reward novel critiques and counter-examples rather than repetition;
- Judging mechanisms that can recognize when minority arguments are epistemically stronger than majority ones.
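The following sketch shows one way the three ingredients above could be wired together, assuming hypothetical `debater(question, transcript)` and `judge(question, transcript)` callables that stand in for LLM calls; the novelty flag and the judge's access to the full transcript are the points where minority critiques can be recognized.

```python
import random

def run_debate(question, debaters, judge, rounds=3):
    """Debate loop with a novelty-aware transcript.
    debater(question, transcript) -> argument; judge(question, transcript) -> verdict."""
    transcript = []                 # shared record of all arguments so far
    seen = set()                    # crude novelty check over argument text
    for _ in range(rounds):
        for i, debater in enumerate(debaters):
            argument = debater(question, transcript)
            transcript.append({"agent": i, "argument": argument, "novel": argument not in seen})
            seen.add(argument)
    # The judge sees the full transcript, so a well-supported minority
    # argument can win even if most debaters converged on the same answer.
    return judge(question, transcript)

# Toy usage with stub agents standing in for LLM calls.
debaters = [lambda q, t: f"claim-{random.randint(0, 2)}" for _ in range(3)]
judge = lambda q, t: max(t, key=lambda turn: turn["novel"])
print(run_debate("Is X true?", debaters, judge))
```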
3.2. Collaborative Regimes: Division of Labor and Team Reasoning
- How to assign and evolve roles? Static hand-crafted roles may be a starting point, but over time the system should learn which agents are effective at which subtasks, and adjust division of labor dynamically.
- How to encourage information sharing without overload? If every agent broadcasts everything to everyone, communication becomes expensive and noisy. Conversely, if information stays siloed, the team cannot integrate its insights. This is closely analogous to balancing connectivity and mixing in fractal networks: too few connections and transport is inefficient; too many and flows interfere and recirculate.[25,36]
- How to prevent “free-riding” and over-reliance on a single strong agent? In human teams, social norms and incentives encourage each member to contribute. In LLM teams, one agent (often the largest or best-aligned) may end up doing most of the work. Training objectives and orchestration logic must explicitly value diverse contributions, not just final answers.[21,22]
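One hypothetical way to address the first and third questions jointly is to track per-agent success rates for each subtask type and route new subtasks accordingly, with a small exploration bonus so that work does not collapse onto a single strong agent. The `RoleRouter` class and its scoring rule below are illustrative assumptions, not a published method.

```python
from collections import defaultdict

class RoleRouter:
    """Routes subtasks to agents based on observed success rates, with an
    exploration bonus so a single 'hero agent' does not absorb all the work."""

    def __init__(self, agent_names, exploration=0.1):
        self.agents = list(agent_names)
        self.exploration = exploration
        self.successes = defaultdict(lambda: defaultdict(int))   # agent -> task type -> wins
        self.attempts = defaultdict(lambda: defaultdict(int))    # agent -> task type -> tries

    def score(self, agent, task_type):
        tries = self.attempts[agent][task_type]
        rate = self.successes[agent][task_type] / tries if tries else 0.5
        return rate + self.exploration / (1 + tries)             # untested agents still get chances

    def assign(self, task_type):
        return max(self.agents, key=lambda a: self.score(a, task_type))

    def record(self, agent, task_type, success):
        self.attempts[agent][task_type] += 1
        self.successes[agent][task_type] += int(success)

router = RoleRouter(["coder", "reviewer", "analyst"])
router.record("coder", "refactor", success=True)
print(router.assign("refactor"))
```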
3.3. Coordinated Regimes: Orchestration and Workflow Execution
- Task decomposition strategies: how the planner represents tasks, chooses subtasks, and decides when to stop decomposition;
- Scheduling and resource allocation: which agents or tools are invoked when, subject to latency and cost constraints;
- Failure handling and recovery: how the system detects when a subtask has failed or produced inconsistent results, and how it retries, escalates, or replans.
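These three concerns fit naturally into a single planner loop. The sketch below is a minimal, hypothetical illustration in which `decompose`, `pick_agent`, and `execute` stand in for a planner model, a scheduler, and tool-using workers respectively.

```python
def orchestrate(task, decompose, pick_agent, execute, max_retries=2):
    """Minimal planner-worker loop: decompose, schedule, and recover from failures.
    decompose(task) -> list of subtasks; pick_agent(subtask) -> agent;
    execute(agent, subtask) -> (ok, result). All three are hypothetical callables."""
    results = []
    for subtask in decompose(task):                     # task decomposition
        for attempt in range(max_retries + 1):
            agent = pick_agent(subtask)                 # scheduling / resource allocation
            ok, result = execute(agent, subtask)
            if ok:
                results.append(result)
                break
            if attempt == max_retries:                  # failure handling: escalate or replan
                results.append({"subtask": subtask, "status": "escalated"})
    return results

# Toy usage with stubs in place of real agents and tools.
print(orchestrate(
    "write and test a parser",
    decompose=lambda t: ["write parser", "write tests"],
    pick_agent=lambda s: "coder",
    execute=lambda agent, s: (True, f"{agent} finished '{s}'"),
))
```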
3.4. Regime–Task Alignment and Dynamic Regime Switching
4. Architectures for LLM Societies: Roles, Memories, and Communication
4.1. Role Specialization and Cognitive Diversity
4.2. Shared Memory and Institutional Knowledge
- Granularity. What events deserve to be written into institutional memory? Storing every intermediate thought is infeasible and undesirable; instead, systems must learn to extract “commit-worthy” artefacts: accepted hypotheses, vetted protocols, approved code patches.
- Structure. Should memory be predominantly vector-based (for flexible retrieval) or symbolic (for explicit constraints and traceability)? Hybrid approaches can, for example, index structured records (e.g. an experimental run of condensation-control conditions and outcomes) with both symbolic keys and learned embeddings.[32,33,34,35,36,37]
- Revision and forgetting. Scientific institutions constantly revise their knowledge: retractions, updated standards, superseded protocols. LLM societies need mechanisms for amending or retiring outdated entries, lest they be haunted by early mistakes—especially when systems autonomously generate synthetic data or self-imposed “norms”.
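A toy version of such a hybrid, revisable store is sketched below; the bag-of-words embedding stands in for a learned encoder, and `retire` marks superseded entries instead of deleting them, so that the provenance of past decisions is preserved. All names here are illustrative.

```python
import math
from collections import Counter

class InstitutionalMemory:
    """Hybrid store: symbolic keys for traceability, embeddings for flexible retrieval."""

    def __init__(self):
        self.records = []

    @staticmethod
    def _embed(text):
        # Toy bag-of-words vector; a real system would use a learned encoder.
        return Counter(text.lower().split())

    def commit(self, key, text):
        self.records.append({"key": key, "text": text, "vec": self._embed(text), "active": True})

    def retire(self, key):
        # Revision and forgetting: mark outdated entries rather than silently erasing them.
        for record in self.records:
            if record["key"] == key:
                record["active"] = False

    def retrieve(self, query, top_k=1):
        q = self._embed(query)
        def similarity(vec):
            dot = sum(vec[w] * q[w] for w in q)
            norm = math.sqrt(sum(v * v for v in vec.values())) * math.sqrt(sum(v * v for v in q.values()))
            return dot / norm if norm else 0.0
        active = [r for r in self.records if r["active"]]
        return sorted(active, key=lambda r: similarity(r["vec"]), reverse=True)[:top_k]

memory = InstitutionalMemory()
memory.commit(("protocol", "condensation-run-12"),
              "substrate at 5 C, glycerol droplet array, stable dry zone observed")
print(memory.retrieve("dry zone at low substrate temperature")[0]["key"])
```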
4.3. Communication Topologies and Substrates
4.4. Design Motifs and Failure Modes
- Committees plus executors, where a deliberative body evaluates options and an execution agent interacts with external systems.
5. Multi-Agent Training Objectives: Debate, Consensus, Peer Review, Bargaining
5.1. From Individual Loss to Collective Objectives
- Conflict resolution quality, e.g. whether minority but correct views can eventually overturn majority but wrong ones, as tested in hidden-profile tasks.[41]
5.2. Debate-Style Objectives
- Encourage novelty: reward agents for introducing new lines of evidence or alternative reasoning paths.
- Penalize redundancy: limit rewards for repeating already stated arguments, to avoid echo chambers.
- Incorporate meta-cognitive signals: allow agents to express uncertainty, defer judgement, or call for more evidence, and reward such caution when appropriate.
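These shaping terms can be combined into a single per-turn reward; the weights and the uncertainty signals in the sketch below are illustrative assumptions rather than values from any published training recipe.

```python
def debate_turn_reward(argument, transcript, judged_helpful, uncertainty_expressed,
                       caution_was_appropriate, w_novel=1.0, w_redundant=0.5, w_caution=0.3):
    """Shaped reward for one debate turn: base judge signal plus a novelty bonus,
    a redundancy penalty, and a bonus for appropriately expressed caution."""
    previous_arguments = {turn["argument"] for turn in transcript}
    reward = 1.0 if judged_helpful else 0.0
    if argument not in previous_arguments:
        reward += w_novel                    # encourage new evidence or reasoning paths
    else:
        reward -= w_redundant                # discourage echoing earlier arguments
    if uncertainty_expressed and caution_was_appropriate:
        reward += w_caution                  # reward deferring judgement when warranted
    return reward

print(debate_turn_reward("new counter-example", transcript=[], judged_helpful=True,
                         uncertainty_expressed=False, caution_was_appropriate=False))
```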
5.3. Consensus and Self-Consistency
5.4. Peer Review and Cross-Agent Critique
- Reward reviewers when their critiques lead to measurable improvements in subsequent revisions (e.g. lower error rates, higher robustness).
- Discourage spurious criticism by penalising reviewers whose comments do not improve or actively degrade performance.
- Teach authors to respond to critique by revising appropriately, providing justification when they reject comments, much like human authors in journal rebuttals.
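One hypothetical way to operationalize these incentives is to score each reviewer by the measured change in quality between the draft it critiqued and the revision that followed, so that unhelpful or harmful critiques earn negative reward. The `evaluate` function below is an assumed task-specific metric, not a fixed choice.

```python
def reviewer_reward(draft, revision, critique_applied, evaluate, penalty_scale=1.0):
    """Reward a reviewer by the quality change its critique produced.
    evaluate(text) -> float is a hypothetical task-specific metric
    (e.g. test pass rate for code, factuality score for prose)."""
    if not critique_applied:
        return 0.0                       # the author rejected the comment with justification
    delta = evaluate(revision) - evaluate(draft)
    if delta >= 0:
        return delta                     # credit measurable improvements
    return penalty_scale * delta         # spurious or harmful criticism is penalised

# Toy usage: quality metric = fraction of required elements present in the text.
required = {"hypothesis", "control", "error-bars"}
evaluate = lambda text: len(required & set(text.split())) / len(required)
print(reviewer_reward("hypothesis only", "hypothesis control error-bars",
                      critique_applied=True, evaluate=evaluate))
```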
5.5. Bargaining, Negotiation, and Resource Allocation
5.6. Towards Multi-Agent Pretraining
- Generating synthetic corpora of debates, peer reviews, and negotiations, possibly seeded by real scientific and engineering records (papers, reviews, code reviews, standards discussions).
- Training populations of agents jointly, sharing parameters where appropriate but allowing role-specific adapters or memory modules to diverge.
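For the first point, a minimal sketch of how synthetic interaction records might be laid out is given below; the stub agents stand in for LLM calls, and the record format (roles, turns, outcome) is an assumption chosen so that later fine-tuning can condition on interaction structure rather than only on final answers.

```python
import json

def synth_debate_record(topic, agents, judge, rounds=2):
    """Generate one synthetic debate transcript as a training record.
    agents maps role name -> callable(topic, turns) -> utterance (hypothetical LLM calls);
    judge(turns) -> label supplies the supervision signal."""
    turns = []
    for _ in range(rounds):
        for role, agent in agents.items():
            turns.append({"role": role, "utterance": agent(topic, turns)})
    return {"topic": topic, "roles": list(agents), "turns": turns, "outcome": judge(turns)}

# Toy usage; real corpora could be seeded from papers, reviews, or code-review threads.
agents = {"proponent": lambda topic, turns: f"support: {topic}",
          "skeptic": lambda topic, turns: f"doubt: {topic}"}
record = synth_debate_record("claim X generalizes", agents, judge=lambda turns: "unresolved")
print(json.dumps(record, indent=2))
```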
6. New Benchmarks: Measuring Collective, Not Individual, Intelligence
6.1. Why Single-Agent Benchmarks Are Insufficient
- Is the solution the product of genuine information integration, or did one strong agent dominate while others free-rode?
- Does the group remain robust when some agents are noisy, adversarial, or misaligned?
- Can a correct minority view eventually overturn an incorrect majority, as in hidden-profile experiments in social psychology?
6.2. Task Families for Collective Evaluation
- Distributed-information reasoning. Hidden-profile style problems, where no single agent has enough information to solve the task, but the group could in principle succeed through communication.[41] This probes whether agents can surface and integrate complementary evidence rather than amplifying shared priors.
- Long-horizon projects. Tasks that require maintaining goals, plans, and artefacts over many steps and time scales—e.g. multi-stage codebase refactoring, iterative scientific experiment design, or multi-day project management. Generative-agent environments, in which agents inhabit persistent simulation towns, offer testbeds for institutional memory and norm formation.[12]
- Safety- and governance-sensitive scenarios. Benchmarks where different agents play regulators, developers, auditors, and affected stakeholders, negotiating policies or red-teaming decisions. Multi-agent referee systems such as ChatEval, which use agent committees to evaluate generated text, already show how LLM panels can outperform single models in judgement tasks.[43]
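As an illustration of the first family, a hidden-profile instance can be constructed by splitting the decision-critical clues across agents so that no single agent's share is sufficient while the union is; the generator below is a schematic toy, not a published benchmark.

```python
import random

def make_hidden_profile(clues, n_agents, shared_fraction=0.4, seed=0):
    """Split clues across agents: a fraction is common knowledge, the rest are
    unique to individual agents, so only communication can reassemble the full picture."""
    rng = random.Random(seed)
    clues = list(clues)
    rng.shuffle(clues)
    n_shared = int(len(clues) * shared_fraction)
    shared, unique = clues[:n_shared], clues[n_shared:]
    views = [list(shared) for _ in range(n_agents)]
    for i, clue in enumerate(unique):            # deal the unique clues round-robin
        views[i % n_agents].append(clue)
    return views

views = make_hidden_profile(
    ["A was on site", "B had a motive", "C has an alibi", "D left early", "E saw nothing"],
    n_agents=3,
)
for i, view in enumerate(views):
    print(f"agent {i}: {view}")
```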
6.3. Metrics for Collective Performance
- Task performance: final accuracy, solution quality, and resource usage (latency, tokens, tool calls).
- Division of labor: how evenly are contributions distributed? Do specialized agents carry out the subtasks they are suited for, or does one “hero agent” dominate?
- Institutional memory and reproducibility: can the group reconstruct its past decisions, rationales, and experimental conditions from its logs and shared artefacts?
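The division-of-labor question can be quantified with a simple balance score, for example the normalized entropy of per-agent contribution counts (1.0 for a perfectly even split, values near 0 for a single dominant agent). This is one plausible metric among many, not a standard.

```python
import math
from collections import Counter

def contribution_balance(contributions):
    """Normalized entropy of contribution counts.
    contributions is a list of agent names, one entry per accepted contribution."""
    counts = Counter(contributions)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    probabilities = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return entropy / math.log(len(counts))       # 1.0 = perfectly even division of labor

print(contribution_balance(["planner", "coder", "coder", "reviewer"]))   # fairly balanced team
print(contribution_balance(["coder"] * 10 + ["reviewer"]))               # 'hero agent' pattern
```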
6.4. Controlled Environments and “Wind Tunnels”
6.5. Reporting Standards and Transparency
- Communication topology and agent roles.
- Training regime (single- vs multi-agent, debate-style fine-tuning, synthetic interaction data).
- Memory structures and access policies.
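One hypothetical way to make such reporting uniform is a small machine-readable manifest attached to every benchmark run; the field names and example values below are illustrative only.

```python
import json

# Illustrative run manifest covering the reporting items listed above.
run_manifest = {
    "topology": {"type": "star", "hub": "judge", "agents": ["debater_a", "debater_b", "debater_c"]},
    "roles": {"judge": "scores arguments", "debaters": "argue assigned positions"},
    "training_regime": {
        "multi_agent": True,
        "debate_style_finetuning": True,
        "synthetic_interaction_data": True,
    },
    "memory": {"structure": "hybrid symbolic + vector", "write_policy": "committee-approved commits only"},
    "budget": {"max_tokens_per_task": 50_000, "max_tool_calls": 20},
}
print(json.dumps(run_manifest, indent=2))
```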
7. Risks and Opportunities in Delegating to Artificial Communities
7.1. New Risk Patterns: Collusion, Echo Chambers, and Power Asymmetries
- Collusion and jailbreaks. Multiple agents can conspire—intentionally or emergently—to bypass safety mechanisms. For example, one agent might rephrase disallowed content into innocuous-looking instructions that another agent executes, or a group might gradually normalize unsafe actions through repeated mutual reinforcement. Multi-agent debate protocols can, in principle, be repurposed to game evaluation metrics or persuade judges of incorrect conclusions.[11,13,15,22,43]
- Echo chambers and groupthink. As HiddenBench demonstrates, LLM groups can fail to integrate distributed information, instead amplifying shared but incomplete priors.[41] If communication topologies favor majority views and judges are insufficiently skeptical, groupthink becomes the default: the appearance of consensus hides a fragile epistemic base.
- Power asymmetries. In heterogeneous societies, some agents (larger models, those with privileged tool access, or those closer to human interfaces) may disproportionately shape outcomes. Over time, institutional memory and decision logs can entrench these asymmetries, much like path-dependent processes in physical and social systems.[12,32,33,34,35,36,37]
7.2. Accountability, Traceability, and Audit
- Which agents participated in which decisions.
- What information each agent saw and produced.
- How final outputs were synthesized from intermediate contributions.
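A minimal decision log that captures these three requirements might record, per decision, the participants, the information each agent saw and produced, and the synthesis step, chaining entries with hashes so the trail is tamper-evident. The schema below is a sketch under these assumptions, not a standardized audit format.

```python
import hashlib
import json
import time

class DecisionLog:
    """Append-only audit trail: who participated, what they saw and produced,
    and how the final output was assembled from intermediate contributions."""

    def __init__(self):
        self.entries = []

    def record(self, decision_id, participants, inputs_seen, outputs, synthesis):
        entry = {
            "decision": decision_id,
            "timestamp": time.time(),
            "participants": participants,      # which agents took part in the decision
            "inputs_seen": inputs_seen,        # agent -> information it was shown
            "outputs": outputs,                # agent -> intermediate contribution
            "synthesis": synthesis,            # how the final output was assembled
            "prev_hash": self.entries[-1]["hash"] if self.entries else None,
        }
        entry["hash"] = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.entries.append(entry)

log = DecisionLog()
log.record(
    "merge-change-42",
    participants=["coder", "reviewer", "judge"],
    inputs_seen={"reviewer": ["diff", "test results"]},
    outputs={"reviewer": "requested changes to error handling"},
    synthesis="judge accepted the revision after a second review",
)
print(log.entries[0]["hash"][:16])
```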
7.3. Opportunities: Robustness, Diversity, and Human–AI Co-Governance
- Robustness through redundancy. Multiple agents with diverse priors and tools can cross-check each other’s outputs, catching errors that a single model might overlook.[11,13,15,21,22,41,42,43] Analogous to redundant sensors in engineering or multiple droplets suppressing condensation in different zones of a cold surface, diversity can increase fault tolerance—if interaction protocols prevent herding.
- Bias mitigation via cognitive diversity. By embedding agents tuned to different normative frameworks (e.g. different fairness definitions, privacy preferences, or stakeholder perspectives), a system can surface tensions and trade-offs rather than silently optimising for one objective. Human overseers can then make more informed decisions.[4,5,12,18,23]
- Human–AI co-governance. Multi-agent architectures naturally accommodate human agents as first-class participants: reviewers, auditors, or domain experts who can join deliberations, veto decisions, or reshape protocols. In contrast to monolithic black-box models, artificial communities can be designed to expose interfaces at multiple levels of granularity: from individual debates to committee reports.[12,18,21,22,23]
8. Outlook: From Emergent Swarms to Engineered Scientific Communities
References
- Brown, T. B.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Kaplan, J.; et al. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
- Hoffmann, J.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556.
- Achiam, J.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Bubeck, S.; et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
- Touvron, H.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Mialon, G.; et al. Augmented language models: A survey. arXiv 2023, arXiv:2302.07842.
- Yao, S.; et al. ReAct: Synergizing reasoning and acting in language models. arXiv 2022, arXiv:2210.03629.
- Schick, T.; Dwivedi-Yu, J.; et al. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 38822–38839. [Google Scholar]
- Li, K.; et al. CAMEL: Communicative agents for “mind” exploration of large scale language model society. arXiv 2023, arXiv:2303.17760.
- Du, Y.; et al. Improving factuality and reasoning in language models through multiagent debate. arXiv 2023, arXiv:2305.14325.
- Malone, T. W. Superminds: The surprising power of people and computers thinking together. (Little, Brown and Company, 2018).
- Shinn, N.; Cassano, F.; Gopinath, A.; et al. Reflexion: Language agents with verbal reinforcement learning. arXiv 2023. [Google Scholar]
- Madaan, A.; et al. Self-Refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 24608–24628. [Google Scholar]
- Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17682–17690. [Google Scholar] [CrossRef]
- Pan, L.; et al. A survey of large language model based agents: Architectures, tasks, and challenges. arXiv 2024, arXiv:2401.09498.
- Mehandru, N.; et al. Evaluating large language model agents in real clinics: Opportunities and challenges. npj Digit. Med. 2024, 7, 178. [Google Scholar] [CrossRef] [PubMed]
- Dong, Y.; Mu, R.; Zhang, Y.; Sun, S.; Zhang, T.; Wu, C.; Jin, G.; Qi, Y.; Hu, J.; Meng, J.; et al. Safeguarding large language models: a survey. Artif. Intell. Rev. 2025, 58, 1–56. [Google Scholar] [CrossRef]
- Zheng, J.; Qiu, S.; Shi, C.; Ma, Q. Towards Lifelong Learning of Large Language Models: A Survey. ACM Comput. Surv. 2025, 57, 1–35. [Google Scholar] [CrossRef]
- Seshadri, A.; et al. A survey of large language model agents for question answering. arXiv 2025, arXiv:2503.19213.
- Wei, J.; et al. Reasoning with language models. Commun. ACM 2025, 68, 46–57. [Google Scholar]
- Ni, Y.; et al. Large language models as agents. Found. Trends Mach. Learn. 2024, 18, 1–194. [Google Scholar]
- OpenAI. OpenAI o1 system card. (OpenAI, 2024).
- Chen, F.; et al. Beyond scaling laws: Towards scientific reasoning-driven LLM architectures. Preprints 2025, 202504.2088.
- Whitesides, G.M. The origins and the future of microfluidics. Nature 2006, 442, 368–373. [Google Scholar] [CrossRef]
- Wang, Z.; Zhao, Y.-P. Wetting and electrowetting on corrugated substrates. Phys. Fluids 2017, 29, 067101. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, E.; Zhao, Y. The effect of surface anisotropy on contact angles and the characterization of elliptical cap droplets. Sci. China Technol. Sci. 2017, 61, 309–316. [Google Scholar] [CrossRef]
- Wang, Z.; Lin, K.; Zhao, Y.-P. The effect of sharp solid edges on the droplet wettability. J. Colloid Interface Sci. 2019, 552, 563–571. [Google Scholar] [CrossRef]
- Wang, Z.-L.; Lin, K. The multi-lobed rotation of droplets induced by interfacial reactions. Phys. Fluids 2023, 35, 021705. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, X.; Miao, Q.; Zhao, Y. Realization of Self-Rotating Droplets Based on Liquid Metal. Adv. Mater. Interfaces 2020, 8. [Google Scholar] [CrossRef]
- Wang, Z.; Wang, X.; Miao, Q.; Gao, F.; Zhao, Y.-P. Spontaneous Motion and Rotation of Acid Droplets on the Surface of a Liquid Metal. Langmuir 2021, 37, 4370–4379. [Google Scholar] [CrossRef] [PubMed]
- Hu, J.; Wang, Z.-L. Crystallization Morphology and Self-Assembly of Polyacrylamide Solutions During Evaporation. Fine Chem. Eng. 2024, 487–497. [Google Scholar] [CrossRef]
- Hu, J. Inhibition of water vapor condensation by dipropylene glycol droplets on hydrophobic surfaces via vapor sink strategy. Surf. Interfaces 2024. [Google Scholar]
- Wang, Z.-L.; et al. Suppression of water vapor condensation by glycerol droplets on hydrophobic surfaces. Phys. Fluids 2024, 36, 067106. [Google Scholar]
- Hu, J.; Zhao, H.; Xu, Z.; Hong, H.; Wang, Z.-L. The effect of substrate temperature on the dry zone generated by the vapor sink effect. Phys. Fluids 2024, 36. [Google Scholar] [CrossRef]
- Hu, J.; Wang, Z.-L. Analysis of fluid flow in fractal microfluidic channels. Phys. Fluids 2024, 36, 093603. [Google Scholar]
- Hu, J.; Wang, Z.-L. Effect of hygroscopic liquids on spatial control of vapor condensation patterns. Surf. Interfaces 2024. [Google Scholar]
- Xu, Y.; et al. Facet-dependent electrochemical behavior of Au–Pd core@shell nanorods for enhanced hydrogen peroxide sensing. ACS Appl. Nano Mater. 2023, 6, 18739–18747. [Google Scholar] [CrossRef]
- Zhuang, S.; Qi, H.; Wang, X.; Li, X.; Liu, K.; Liu, J.; Zhang, H. Advances in Solar-Driven Hygroscopic Water Harvesting. Glob. Challenges 2020, 5. [Google Scholar] [CrossRef] [PubMed]
- Ni, F.; et al. Tillandsia-inspired hygroscopic photothermal organogels for atmospheric water harvesting. Adv. Funct. Mater. 2020, 30, 2003268. [Google Scholar]
- Li, Y.; Naito, A.; Shirado, H. HiddenBench: Assessing collective reasoning in multi-agent LLMs via hidden profile tasks. arXiv 2025. [Google Scholar]
- Liang, P.; et al. Holistic Evaluation of Language Models. arXiv 2022, arXiv:2211.09110.
- Chan, C. M.; et al. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In Proc. Int. Conf. Learn. Represent. (ICLR), 2024.
- Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 2024, 7, 1–20. [Google Scholar] [CrossRef]