Submitted: 03 May 2026
Posted: 06 May 2026
Abstract
Keywords:
1. Introduction
1.1. Background
- Dynamic, multi-step task nature: Reasoning tasks unfold through autoregressive chain-of-thought generation, where each step produces a segment of the final answer and the accumulated context grows over time. The decision made at any single step not only impacts the immediate quality and latency but also changes the cost of subsequent decisions.
- Multi-user resource coupling: In a dynamic network with stochastically arriving users, all active devices compete for the same finite resources, such as the server’s maximum parallel processing units and the total communication bandwidth. An offloading decision for one user inevitably alters queuing states and bandwidth availability for all others, making the scheduling problem tightly coupled across tasks and time.
- The quality-latency trade-off: Using the server LLM improves reasoning accuracy but incurs a triple delay penalty: communication delay to upload the ever-growing context, queuing delay due to limited server concurrency, and the server’s own computation delay. Conversely, executing a step on the local SLM avoids these penalties but risks lower-quality outputs. Balancing this trade-off in a time-varying environment under strict resource constraints is the core challenge.
1.2. Related Work
1.2.1. Problem-Level Routing
1.2.2. Step- and Token-Level Collaboration
1.2.3. PRM-Guided Step-Level Methods
1.2.4. Multi-Agent Homogeneous Collaboration
1.3. Our Contributions
- We provide the first principled formulation of heterogeneous agent collaboration in dynamic edge networks as a constrained sequential decision problem. Our formulation explicitly captures stochastic user arrivals, per-step chain-of-thought generation with growing context, and coupled competition for limited server concurrency and communication bandwidth. Crucially, we build a unified latency model that expresses all sources of delay, including computation, communication, and queuing, in a single, real-time metric. The computation component is grounded in the actual LLM inference architecture: it separately accounts for the prefill and decode phases using FLOPs-level characterization, capturing the strong dependence of both the server-side and local inference time on context length and the number of generated tokens. This modeling avoids the coarse heuristics prevalent in prior work and enables consistent optimization of a composite objective that trades off reasoning quality against total end-to-end latency through a single, interpretable parameter.
- Building on the formulation, we design the PRADA framework, which rests on two essential and complementary ideas. The first is a two-stage architecture that decouples the global decision problem. At the edge, a compact screening network makes per-user, per-step binary nominations without uploading any context; only the nominated candidates are forwarded to the server, reducing the candidate set by an order of magnitude and eliminating unnecessary communication. At the server, a Lagrangian-based scheduler resolves the final actions under real-time resource constraints, yielding a threshold-structured scheduling policy and a closed-form rule for bandwidth allocation. The second idea is to use the PRM exclusively as an offline teacher. Instead of enduring its prohibitive online inference cost, we distill its forward-looking quality assessment into the lightweight screening policy during training. The reward model is never invoked at deployment time, yet the resulting policy preserves its ability to judge whether a step is likely to benefit from stronger reasoning. We further prove that, under mild conditions, a first-stage decision to stay local remains optimal even after communication and queuing penalties are added back, guaranteeing that valuable offloading candidates are never prematurely discarded.
- Extensive simulations across several reasoning benchmarks reveal structural properties of the system. We show that PRADA consistently retains most of the accuracy gains achievable by always using the large model while substantially reducing end-to-end delay. More importantly, we uncover clear threshold effects for both the server parallel processing capacity and the total communication bandwidth. As either resource increases, accuracy and latency improve rapidly up to a critical point, after which the gains saturate and the system bottleneck shifts from one resource to another, for example, from queuing to computation, or from communication to server contention. These findings provide concrete, actionable guidance for jointly provisioning computation and communication resources to achieve a desired operating point on the quality-latency frontier, without requiring per-benchmark tuning or detailed prior knowledge of the task mix.
2. System Model
2.1. System Slots and Reasoning Steps
2.1.1. System Slots
2.1.2. Reasoning Steps
2.1.3. Interleaving of the Two Scales
- At any system slot t, a user is said to be active if it is ready to initiate a new reasoning step. An active user competes for system resources (e.g., bandwidth, server admission) and requires a scheduling decision.
- Once a decision is made and the step begins execution, the user becomes inactive for the duration of that step’s execution. During this inactive period, the task’s internal context remains static, and it does not participate in any resource contention.
- Upon completion of the current step, the task transitions back to the active state, its context is updated with the newly generated content, and it becomes eligible for scheduling again in the next system slot.
- Let denote the reasoning-step index that task i is about to execute when it becomes active in slot t.
- The context associated with this decision epoch is then denoted by .
- For brevity, when the step index is clear from context, we may simply write with the understanding that it refers to the appropriate .
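The interleaving of system slots and reasoning steps described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `Task`, its fields, and the fixed step duration in slots are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical per-user task state; all names are illustrative."""
    context: list = field(default_factory=list)  # accumulated reasoning steps
    step_index: int = 0   # index of the reasoning step to execute next
    busy_until: int = 0   # first slot at which the task is active again

    def is_active(self, t: int) -> bool:
        # Active means ready to start a new reasoning step in slot t.
        return t >= self.busy_until

    def start_step(self, t: int, duration_slots: int) -> None:
        # While the step executes, the task leaves resource contention.
        self.busy_until = t + duration_slots

    def finish_step(self, new_text: str) -> None:
        # On completion the context grows and the step index advances.
        self.context.append(new_text)
        self.step_index += 1


def simulate(task: Task, num_slots: int, duration_slots: int) -> list:
    """Return the slots in which the task required a scheduling decision."""
    decision_slots = []
    running = False
    for t in range(num_slots):
        if running and task.is_active(t):
            # The step started earlier has just completed.
            task.finish_step(f"step-{task.step_index}")
            running = False
        if task.is_active(t):
            decision_slots.append(t)
            task.start_step(t, duration_slots)
            running = True
    return decision_slots
```

With a fixed duration of three slots per step, the task asks for a decision only every third slot and stays out of contention in between; in the actual system the duration would depend on where the step runs and on the latency components of Section 2.2.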
2.2. Latency Decomposition
2.2.1. Communication Latency
2.2.2. Queuing and Computational Latency
3. The Quality-Latency Trade-Off
3.1. The Global MDP
3.1.1. State Space
3.1.2. Action Space
3.1.3. Trajectory and Stepwise Latency
3.1.4. Why Direct Optimization Is Intractable
- First, the joint action space couples heterogeneous decisions such as per-slot binary scheduling (local vs. offload), server admission control (queue vs. immediate execution), and continuous bandwidth allocation. These factors are strongly coupled. A decision made for one user at one generation step may affect not only its own quality and delay, but also the queuing state and bandwidth availability experienced by other users in subsequent slots. This makes exact optimization infeasible.
- Second, even if a heuristic centralized solver were employed, it would necessitate that every active user uploads its complete context to the server at the beginning of each slot purely for the purpose of decision making. Given that context sizes grow with reasoning progress, this overhead would consume a significant portion of the uplink budget B before any actual offloading occurs, defeating the purpose of communication-efficient acceleration.
- Third, the resulting mixed-integer program with time-varying constraints offers poor scalability and high latency. The centralized scheduling quickly becomes computationally intractable as the number of users grows.
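A back-of-the-envelope sketch of the first two obstacles follows. The helper names and every numeric value in the usage note (user count, context length, bits per token, bandwidth) are hypothetical and chosen only to illustrate the scaling; they are not the paper's settings.

```python
def joint_binary_actions(num_users: int) -> int:
    """Number of joint local-vs-offload assignments in a single slot."""
    return 2 ** num_users


def decision_upload_time_ms(num_users: int, context_tokens: int,
                            bits_per_token: int, bandwidth_bps: float) -> float:
    """Time to upload every active user's full context once, in milliseconds."""
    total_bits = num_users * context_tokens * bits_per_token
    return 1000.0 * total_bits / bandwidth_bps
```

For example, with 20 active users, 500-token contexts, 16 bits per token and a 10 Mbit/s uplink, `decision_upload_time_ms(20, 500, 16, 1e7)` evaluates to 16 ms spent on uploads that serve only the decision itself, while `joint_binary_actions(20)` already exceeds one million combinations before admission control and bandwidth allocation are considered.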
3.2. Key Principles of PRADA
3.2.1. Two-Stage Decoupling
3.2.2. PRM as an Offline Supervisor
3.2.3. Precise Characterization of Computation Delay
- Prefill phase: The model processes the entire input context in parallel. During this phase, the key-value (KV) pairs for all input tokens are computed and cached, and the first token of the response is generated.
- Decode phase: Subsequent tokens are generated one by one in an autoregressive manner. Leveraging the stored KV cache, this phase computes attention only for the newest token, dramatically reducing the per-token computational cost.
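The two phases can be turned into a rough FLOPs-based latency estimate. The sketch below is not the paper's model: it uses a common rule of thumb (about 2 FLOPs per parameter per token for the weight multiplications, plus an attention term that grows with the number of cached positions), ignores memory-bandwidth effects, and all function and parameter names are hypothetical.

```python
def token_flops(position: int, params: float, n_layers: int, d_model: int) -> float:
    """Approximate forward FLOPs for one token that attends to `position` tokens."""
    dense = 2.0 * params                              # weight matmuls (assumed rule of thumb)
    attention = 4.0 * n_layers * d_model * position   # QK^T and AV over the cached keys/values
    return dense + attention


def prefill_flops(context_len: int, params: float, n_layers: int, d_model: int) -> float:
    """FLOPs to process the whole input context under causal attention."""
    return sum(token_flops(s, params, n_layers, d_model)
               for s in range(1, context_len + 1))


def decode_flops(context_len: int, new_tokens: int, params: float,
                 n_layers: int, d_model: int) -> float:
    """FLOPs to generate `new_tokens` tokens one by one on top of the KV cache."""
    return sum(token_flops(s, params, n_layers, d_model)
               for s in range(context_len + 1, context_len + new_tokens + 1))


def latency_ms(flops: float, flops_per_second: float) -> float:
    """Convert a FLOP count into milliseconds on a device of the given speed."""
    return 1000.0 * flops / flops_per_second
```

Under these assumptions both terms grow with context length: prefill cost is roughly quadratic in the context through the attention term, while each decoded token costs slightly more than the previous one because the cache it attends to keeps growing.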
4. Two-Stage Decoupled Collaboration
4.1. Decentralized Edge Screening (Stage 1)
4.1.1. Local Decision
4.1.2. Offline Training with PRM Supervision
4.1.3. Optimality of the Local Decision
4.2. Centralized Server Scheduling (Stage 2)
4.2.1. Refining the Value Estimates
- : the time the request has already spent waiting in the server queue before slot t (zero if the request is newly submitted);
- : the predicted server-side computation time , which is a function of the context length and the number of new tokens to be generated.
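One way to picture how these two quantities could enter a refined estimate is sketched below. The linear form, the function name `refined_value`, and the `tradeoff` weight are assumptions made for illustration; the paper's actual expression is not reproduced here.

```python
def refined_value(quality_gain: float, waited_ms: float,
                  server_compute_ms: float, tradeoff: float) -> float:
    """Hypothetical net value of running a request on the server.

    quality_gain      -- estimated benefit of the LLM over the local SLM (assumed given)
    waited_ms         -- time already spent in the server queue (zero if newly submitted)
    server_compute_ms -- predicted server-side computation time
    tradeoff          -- assumed weight converting milliseconds into quality units
    """
    return quality_gain - tradeoff * (waited_ms + server_compute_ms)
```

Under this assumed form, a request that has already waited longer, or whose context implies a longer server computation, has a lower estimated net value than an otherwise identical request.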
4.2.2. Optimization
4.2.3. Optimal Bandwidth Allocation for Immediate Admission
4.2.4. Threshold-Based Scheduling Policy
- 1. Competitive admission ():
- 2. Server fully occupied ():
- 3. Abundant server capacity ():
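The three regimes can be illustrated with a simple top-k admission rule. This is a sketch under assumptions, not the paper's Lagrangian scheduler: the conditions that separate the regimes are inferred from their names, the function `schedule` and its arguments are hypothetical, and what happens to deferred requests (queuing or local execution) is left open.

```python
def schedule(candidates: dict, free_slots: int) -> tuple:
    """Split nominated requests into (admitted, deferred) lists of request ids.

    candidates maps a request id to its estimated net value of server execution.
    """
    # Keep only requests whose estimated value is positive, best first.
    worthwhile = sorted(
        (rid for rid, value in candidates.items() if value > 0),
        key=lambda rid: candidates[rid],
        reverse=True,
    )
    if free_slots <= 0:
        # Server fully occupied: nothing can start in this slot.
        return [], worthwhile
    if len(worthwhile) <= free_slots:
        # Abundant capacity: every worthwhile request is admitted.
        return worthwhile, []
    # Competitive admission: the value of the last admitted request
    # acts as the admission threshold.
    return worthwhile[:free_slots], worthwhile[free_slots:]
```

For instance, with candidate values `{"a": 0.9, "b": 0.2, "c": 0.5, "d": -0.1}` and two free server slots, requests `a` and `c` are admitted, `b` is deferred, and `d` is never considered for offloading.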
4.2.5. Practical Implementation
5. Simulation Experiments and Discussions
5.1. Single-User Evaluation of the Policy Network
5.1.1. Experimental Setup and Training Details
5.1.2. Results and Analysis
- Across all three datasets, the policy network consistently achieves a markedly better trade-off than the all-SLM baseline: it substantially improves accuracy while incurring only a fraction of the computation required by the all-LLM strategy. This confirms that the learned policy can identify which reasoning steps genuinely profit from the stronger LLM and which can be handled efficiently by the local SLM.
- Compared with RSD, the policy network is especially advantageous on the more challenging gaokao2023en and mmlu_stem benchmarks, delivering higher accuracy with lower computational cost. On gsm8k the two methods are highly competitive, exhibiting comparable accuracy-cost trade-offs. These results demonstrate that the proposed screening policy is an effective routing mechanism, not merely a cheap alternative.
- Across all datasets, the policy network lies substantially closer to the all-LLM accuracy bound than to the all-SLM bound, while its computation cost remains far below the all-LLM level. This indicates that the policy network successfully captures high-value offloading opportunities without over-using the stronger model.
5.2. Multi-User Evaluation of PRADA
| Dataset | Metric | All SLM | Policy network | PRADA |
|---|---|---|---|---|
| gsm8k | Accuracy (%) | 85.2 | 93.9 | 90.3 |
| | Avg. Processing Delay (ms/task) | | 9.9 | 6.0 |
| | Avg. Communication Delay (ms/task) | – | – | |
| | Avg. Queuing Delay (ms/task) | – | – | 2.5 |
| gaokao2023en | Accuracy (%) | 66.2 | 68.1 | 67.3 |
| | Avg. Processing Delay (ms/task) | | 22.5 | 11.7 |
| | Avg. Communication Delay (ms/task) | – | – | |
| | Avg. Queuing Delay (ms/task) | – | – | 3.2 |
| mmlu_stem | Accuracy (%) | 59.3 | 70.4 | 65.2 |
| | Avg. Processing Delay (ms/task) | | 15.3 | 8.9 |
| | Avg. Communication Delay (ms/task) | – | – | |
| | Avg. Queuing Delay (ms/task) | – | – | 3.8 |

| Parameter | Description | Value |
|---|---|---|
| M | Maximum parallel capacity of server | 9 |
| B | Total available bandwidth (bit/s) | |
| | User arrival intensity | 3 |
| | Slot length (ms) | 1 |
| | LLM computation capability (FLOPs/s) | |
| | Maximum heuristic search iterations | 80 |
- PRADA preserves a large portion of the accuracy improvement brought by the policy network across all three benchmarks. On gsm8k, PRADA achieves 90.3% accuracy, compared with 93.9% for the policy network and 85.2% for the all-SLM baseline. On gaokao2023en, PRADA reaches 67.3%, remaining close to the 68.1% achieved by the policy network and still exceeding the all-SLM result of 66.2%. A similar trend is observed on mmlu_stem, where PRADA attains 65.2%, compared with 70.4% for the policy network and 59.3% for all-SLM. These results indicate that the second-stage multi-user scheduling mechanism does not substantially undermine the effectiveness of the first-stage routing policy.
- At the same time, PRADA significantly reduces the average processing delay relative to directly following the policy network. On gsm8k, the average processing delay decreases from 9.9 ms/task to 6.0 ms/task. On gaokao2023en, it is reduced from 22.5 ms/task to 11.7 ms/task, a reduction of 48%. On mmlu_stem, the processing delay drops from 15.3 ms/task to 8.9 ms/task. The resulting delay remains only moderately higher than that of the all-SLM baseline. This shows that PRADA translates the strong single-user routing capability of the policy network into a more practical multi-user strategy with substantially lower computation cost.
- Communication delay remains very small across all three datasets, ranging from only 0.4 to 0.7 ms/task. This suggests that under the current bandwidth setting, transmission overhead is well controlled and is not the dominant source of latency. By comparison, the queuing delay is larger than the communication delay, reaching 2.5 ms/task on gsm8k, 3.2 ms/task on gaokao2023en, and 3.8 ms/task on mmlu_stem. Thus, server contention contributes more to additional latency than pure communication overhead, although both remain within a relatively small range.
- Our PRADA framework generalizes well across datasets of varying difficulty: in all three cases it maintains a favorable accuracy-latency trade-off, preserving most of the performance gain of the policy network while significantly lowering system cost through resource-aware scheduling.
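The figures quoted above can be condensed into two derived ratios per dataset: the share of the policy network's accuracy gain over all-SLM that PRADA retains, and the relative cut in processing delay. The short script below only re-derives these ratios from the numbers reported in the text; the variable and function names are illustrative.

```python
# dataset: (all-SLM acc, policy acc, PRADA acc, policy delay ms, PRADA delay ms)
results = {
    "gsm8k": (85.2, 93.9, 90.3, 9.9, 6.0),
    "gaokao2023en": (66.2, 68.1, 67.3, 22.5, 11.7),
    "mmlu_stem": (59.3, 70.4, 65.2, 15.3, 8.9),
}


def summarize(row: tuple) -> tuple:
    """Return (fraction of accuracy gain retained, fraction of delay removed)."""
    slm_acc, policy_acc, prada_acc, policy_delay, prada_delay = row
    retained = (prada_acc - slm_acc) / (policy_acc - slm_acc)
    delay_cut = (policy_delay - prada_delay) / policy_delay
    return retained, delay_cut


summary = {name: summarize(row) for name, row in results.items()}
for name, (retained, delay_cut) in summary.items():
    print(f"{name}: gain retained {retained:.1%}, processing delay cut {delay_cut:.1%}")
```

Rounded to whole percentages, PRADA retains about 59%, 58%, and 53% of the accuracy gain on gsm8k, gaokao2023en, and mmlu_stem, while cutting processing delay by about 39%, 48%, and 42%, respectively.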
5.3. Impact of the Server Parallel Capacity M
5.4. Impact of Total Bandwidth B
6. Conclusions
Appendix A. Proof of Proposition 1
Appendix B. Proof of Theorem 2
| Parameter | Description | Value |
|---|---|---|
| N | Total training epochs | 100 |
| h | Training epochs per trajectory | 5 |
| | Learning rate of policy network | |
| | Learning rate of value network | |
| | PPO clipping value | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.