Submitted: 08 December 2025
Posted: 09 December 2025
Abstract
Keywords:
1. Introduction
2. Background
2.1. Prefill and Decode Characteristics
- Prefill. The model processes the entire prompt and builds the key–value (KV) cache for all input tokens across all layers. This phase is highly parallel and compute-intensive because self-attention must consider all previous tokens in the prompt. As a result, its computational demand and memory footprint are substantial.
- Decode. Leveraging the cached KV tensors, the model generates one token at a time in an autoregressive manner. Each decode step only processes the newly generated token while attending to the existing KV cache. Thus, per-step computation and memory usage are significantly smaller than in the prefill phase, but must be repeated many times.
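The contrast between the two phases can be made concrete with a little arithmetic. The sketch below is illustrative only: the `ModelShape` dimensions are hypothetical values that do not come from the paper, and the functions only count KV-cache bytes and causal query-key pairs per attention head and layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # e.g. fp16 or bf16


def kv_cache_bytes(shape: ModelShape, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = (2  # one key tensor and one value tensor
                 * shape.num_layers
                 * shape.num_kv_heads
                 * shape.head_dim
                 * shape.bytes_per_value)
    return per_token * num_tokens


def prefill_attention_pairs(prompt_len: int) -> int:
    """Causal attention: token i attends to tokens 0..i, all in one pass."""
    return prompt_len * (prompt_len + 1) // 2


def decode_attention_pairs(context_len: int) -> int:
    """One decode step: the new token attends to the cache and to itself."""
    return context_len + 1


if __name__ == "__main__":
    shape = ModelShape()
    print(kv_cache_bytes(shape, 2048))
    print(prefill_attention_pairs(2048), decode_attention_pairs(2048))
```

Under these assumptions a single decode step touches far fewer query-key pairs than a prefill over the same context, while the KV cache grows linearly with the number of tokens held in memory.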
2.2. KV-Cache and Memory Preemption Issue
2.3. Token-Budget: A Latency Control Parameter for Maintaining TTFT–ITL SLOs
- Small token-budget (ITL-friendly, TTFT-unfriendly). With fewer tokens per micro-batch, each decode iteration is lightweight, reducing ITL. Furthermore, when a prefill request arrives, the scheduler can insert its prefill chunk between decode batches with minimal disruption, mitigating Prefill–Decode interference. However, because a long prompt must be split across more micro-batches, its prefill takes more iterations to complete and the first token arrives later, increasing TTFT.
- Large token-budget (TTFT-friendly, ITL-unfriendly). Larger budgets enable more prefill tokens to be processed at once, improving TTFT for long-prompt requests. However, inserting large prefill chunks into an ongoing decode stream causes significant Prefill–Decode interference, increasing ITL for existing decoding sessions and potentially amplifying queueing delays.
- Real-time workloads fluctuate unpredictably. IoT and edge applications generate highly variable prompt lengths, decode lengths, and arrival patterns, causing sudden shifts in workload composition.
- TTFT and ITL must remain within strict SLO bounds. Allowing unrestricted batch growth destabilizes both TTFT and ITL, especially under bursty or prefill-heavy workloads.
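The trade-off described above can be illustrated with a toy batch builder. This is a minimal sketch of chunked prefill under a fixed token-budget, not vLLM's scheduler and not the method proposed in this paper; `Request`, `build_batch`, and `iterations_to_finish_prefill` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget):
    """Form one iteration's batch as a list of (req_id, num_tokens)."""
    batch = []
    remaining = token_budget
    # Decoding requests go first and cost one token each.
    for req in requests:
        if req.in_decode and remaining > 0:
            batch.append((req.req_id, 1))
            remaining -= 1
    # Whatever budget is left is filled with prefill chunks.
    for req in requests:
        if not req.in_decode and remaining > 0:
            chunk = min(remaining, req.prompt_len - req.prefilled)
            batch.append((req.req_id, chunk))
            req.prefilled += chunk
            remaining -= chunk
    return batch


def iterations_to_finish_prefill(prompt_len, num_decoding, token_budget):
    """Iterations until a newly arrived prompt is fully prefilled."""
    if token_budget <= num_decoding:
        raise ValueError("budget leaves no room for prefill tokens")
    # Model ongoing sessions as requests whose prefill is already done.
    requests = [Request(f"d{i}", 1, 1) for i in range(num_decoding)]
    new_request = Request("new", prompt_len)
    requests.append(new_request)
    steps = 0
    while not new_request.in_decode:
        build_batch(requests, token_budget)
        steps += 1
    return steps


if __name__ == "__main__":
    for budget in (64, 512):
        print(budget, iterations_to_finish_prefill(1000, 8, budget))
```

With eight ongoing decode sessions and a 1000-token prompt, the small budget needs many more iterations before the first token can be produced, while the large budget finishes prefill in a couple of much heavier iterations that every decoding session has to wait through.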
3. Related Work
3.1. IoT and Edge–Cloud Computing Architectures
3.2. Scheduling and Offloading in IoT and MEC Systems
3.3. LLM Inference Systems and Pipeline Parallelism
3.4. Throughput-Oriented Scheduling for LLM Inference
3.5. Summary and Research Gap
4. Proposed Method
4.1. Dynamic Token-Budget Estimation
- Each prefill generates KV-cache tensors that occupy GPU memory. Scheduling many small prefills concurrently increases memory pressure and raises the risk of preemption. A dynamic token-budget caps the number of prefill tokens in flight at any time, lowering this risk.
- Splitting a prefill into numerous small chunks can benefit decode-heavy workloads, but excessively small chunks inflate TTFT and make it difficult to saturate GPU compute. By dividing the overall prefill cost across exactly num_micro_batch micro-batches, we maintain a consistent TTFT/ITL balance while improving pipeline utilization.
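The idea of spreading the pending prefill work evenly over num_micro_batch micro-batches can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes that the token count is an adequate proxy for prefill cost, and the `min_budget` and `max_budget` clamps are hypothetical parameters added for the illustration.

```python
import math


def estimate_token_budget(pending_prefill_tokens, num_micro_batch,
                          min_budget=64, max_budget=8192):
    """Split the pending prefill tokens evenly across the micro-batches.

    pending_prefill_tokens: remaining prompt tokens per waiting request.
    min_budget / max_budget: hypothetical safety clamps, not from the paper.
    """
    if num_micro_batch <= 0:
        raise ValueError("num_micro_batch must be positive")
    if not pending_prefill_tokens:
        return min_budget
    total_tokens = sum(pending_prefill_tokens)
    # Round up so that num_micro_batch batches cover all pending tokens.
    even_share = math.ceil(total_tokens / num_micro_batch)
    return max(min_budget, min(max_budget, even_share))


if __name__ == "__main__":
    print(estimate_token_budget([1000, 600, 400], num_micro_batch=4))
```

Because the budget is recomputed from the current queue, it shrinks when little prefill work is pending and grows, up to the clamp, when long prompts arrive.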
| Algorithm 1: Dynamic Token-Budget Estimation |
4.2. Dynamic Micro-Batch Scheduling
- First-Token Generation Time, representing how long it takes for the first micro-batch to traverse the entire pipeline.
- Rank-0 GPU Total Time, representing how long the first GPU remains busy processing all micro-batches.
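Both quantities can be computed from a table of per-stage execution times. The sketch below assumes a simple model in which `stage_times[m][s]` is the time micro-batch `m` spends on pipeline stage `s` and communication is ignored. The `rank0_wait` helper is an illustrative addition, not taken from the paper: it estimates how long the first GPU would sit idle before the first micro-batch returns for its next decode step.

```python
def first_token_generation_time(stage_times):
    """Time for the first micro-batch to traverse every pipeline stage."""
    return sum(stage_times[0])


def rank0_total_time(stage_times):
    """Time the first GPU (stage 0) is busy processing all micro-batches."""
    return sum(micro_batch[0] for micro_batch in stage_times)


def rank0_wait(stage_times):
    """Hypothetical helper: idle time on rank 0 before micro-batch 0 returns."""
    gap = first_token_generation_time(stage_times) - rank0_total_time(stage_times)
    return max(0.0, gap)


if __name__ == "__main__":
    # Four stages, 2.0 s per stage, with two and then four micro-batches.
    for num_micro_batches in (2, 4):
        times = [[2.0] * 4 for _ in range(num_micro_batches)]
        print(num_micro_batches,
              first_token_generation_time(times),
              rank0_total_time(times),
              rank0_wait(times))
```

In this simplified model, too few micro-batches leave rank 0 waiting for the pipeline to drain, while enough micro-batches keep it busy until the first one comes back.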
| Algorithm 2: Dynamic Micro-Batch Scheduling |
5. Experiments
5.1. Experimental Setup
- Baseline. The stock implementation of vLLM v0.9.0.1, using its default static token-budget and micro-batch scheduling policy.
- Dynamic Token-Budget. Baseline augmented with Dynamic Token-Budget Estimation only.
- Dynamic Micro-Batch. The full proposed method, combining Dynamic Token-Budget Estimation and Dynamic Micro-Batch Scheduling.
5.2. Scenarios and Evaluation Metrics
- Offline scenario. Synthetic workloads with controlled request parameters are used to examine how the methods behave under different sequence lengths and concurrency levels.
- Online scenario. Requests arrive according to a Poisson process, mimicking real-time service conditions with heterogeneous input/output lengths and time-varying load.
- Completion Time – total wall-clock time from the arrival of the first request until all requests are completed. This directly reflects throughput.
- Processing Time (Proc) – aggregated time during which GPUs are actively executing kernels.
- Idle Time – cumulative time during which at least one GPU has no work to execute (i.e., pipeline bubbles).
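For the online scenario, Poisson arrivals can be generated by accumulating exponentially distributed inter-arrival gaps. The following is a minimal sketch using Python's standard library; the function name, the request rate, and the seed are hypothetical and do not reflect the paper's actual workload generator.

```python
import random


def poisson_arrival_times(rate_per_s, num_requests, seed=0):
    """Arrival timestamps (seconds) of a Poisson process with the given rate."""
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # seeded so the trace is reproducible
    now = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Inter-arrival gaps of a Poisson process are exponential.
        now += rng.expovariate(rate_per_s)
        arrivals.append(now)
    return arrivals


if __name__ == "__main__":
    for t in poisson_arrival_times(rate_per_s=2.0, num_requests=5):
        print(f"{t:.3f}")
```

Fixing the seed keeps the arrival trace identical across the compared methods, so differences in the measured metrics are not caused by different request timings.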
5.2.1. Offline Scenario
5.2.2. Online Scenario
- Dynamic Token-Budget alleviates excessive compute imbalance, shortens GPU Idle Time, and consequently lowers ITL while keeping TTFT within target.
- Dynamic Micro-Batch adapts the micro-batch count to rising network latency, sustaining the fraction of requests that meet the ITL SLO under congestion.
5.3. Results and Analysis
6. Discussion
6.1. Effectiveness for IoT-Scale Workloads
6.2. Throughput-First vs. Latency-First Optimization
6.3. Impact of Network Bandwidth
6.4. Generalization, Limitations, and Future Work
7. Conclusions
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- vLLM Contributors. vLLM: Open-source LLM Inference Library. https://github.com/vllm-project/vllm, 2025. Version v0.9.0.1, commit 5fbbfe9.
- Hamdan, S.; Ayyash, M.; Almajali, S. Edge-Computing Architectures for Internet of Things Applications: A Survey. Sensors 2020, 20, 6441.
- Andriulo, F.C.; Fiore, M.; Mongiello, M.; Traversa, E.; Zizzo, V. Edge Computing and Cloud Computing for Internet of Things: A Review. Informatics 2024, 11, 71.
- Lim, J. Latency-Aware Task Scheduling for IoT Applications Based on Artificial Intelligence with Partitioning in Small-Scale Fog Computing Environments. Sensors 2022, 22, 7326.
- Eang, C.; Ros, S.; Kang, S.; Song, I.; Tam, P.; Math, S.; Kim, S. Offloading Decision and Resource Allocation in Mobile Edge Computing for Cost and Latency Efficiencies in Real-Time IoT. Electronics 2024, 13, 1218.
- Saeik, F.; Avgeris, M.; Spatharakis, D.; Santi, N.; Dechouniotis, D.; Violos, J.; Leivadeas, A.; Athanasopoulos, N.; Mitton, N.; Papavassiliou, S. Task Offloading in Edge and Cloud Computing: A Survey on Mathematical, Artificial Intelligence and Control Theory Solutions. Computer Networks 2021, 195, 108177.
- Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP 2023). ACM, 2023, pp. 611–626.
- Yu, G.; Jeong, J.S.; Kim, G.; Kim, S.; Chun, B. Orca: A Distributed Serving System for Transformer-Based Generative Models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2022). USENIX Association, 2022, pp. 521–538.
- Agrawal, A.; Panwar, A.; Mohan, J.; Kwatra, N.; Gulavani, B.S.; Ramjee, R. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv 2023, arXiv:2308.16369.
- Agrawal, A.; Kedia, N.; Panwar, A.; Mohan, J.; Kwatra, N.; Gulavani, B.S.; Tumanov, A.; Ramjee, R. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024). USENIX Association, 2024, pp. 117–134.
- Zhong, Y.; Liu, S.; Chen, J.; Hu, J.; Zhu, Y.; Liu, X.; Jin, X.; Zhang, H. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024). USENIX Association, 2024, pp. 193–210.
- Patel, P.; Choukse, E.; Zhang, C.; Shah, A.; Goiri, Í.; Maleki, S.; Bianchini, R. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st ACM/IEEE International Symposium on Computer Architecture (ISCA 2024). IEEE, 2024, pp. 118–132.
- Hu, C.; Huang, H.; Xu, L.; Chen, X.; Xu, J.; Chen, S.; Feng, H.; Wang, C.; Wang, S.; Bao, Y.; et al. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. arXiv 2024, arXiv:2401.11181.
- Jiang, Y.; Yan, R.; Yao, X.; Zhou, Y.; Chen, B.; Yuan, B. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024). OpenReview.net, 2024.
- Jiang, Y.; Yan, R.; Yuan, B. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025). OpenReview.net, 2025.
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.X.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019, pp. 103–112.
- Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.W.; et al. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient Streaming Language Models with Attention Sinks. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net, 2024.
- Dao, T.; Fu, D.Y.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.
- Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net, 2024.
- Shah, J.; Bikshandi, G.; Zhang, Y.; Thakkar, V.; Ramani, P.; Dao, T. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. In Proceedings of the Advances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024.
- Kamath, A.K.; Prabhu, R.; Mohan, J.; Peter, S.; Ramjee, R.; Panwar, A. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS 2025). ACM, 2025, pp. 897–912.
- Oh, H.; Kim, K.; Kim, J.; Kim, S.; Lee, J.; Chang, D.; Seo, J. ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS 2024). ACM, 2024, pp. 369–384.
- Holmes, C.; Tanaka, M.; Wyatt, M.; Awan, A.A.; Rasley, J.; Rajbhandari, S.; Aminabadi, R.Y.; Qin, H.; Bakhtiari, A.; Kurilenko, L.; et al. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv 2024, arXiv:2401.08671.
- Chen, Z.; May, A.; Svirschevski, R.; Huang, Y.; Ryabinin, M.; Jia, Z.; Chen, B. Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding. arXiv 2024, arXiv:2402.12374.



| Workload (Req/In/Out) | Metric | Baseline | Only Dynamic Token-Budget | Dynamic Token-Budget & Micro-batch | Improvement |
|---|---|---|---|---|---|
| 64 / 256 / 256 | Completion | 130.3 s | 123.8 s | 111.6 s | 1.17× |
| | Proc | 63.3 s | 67.1 s | 62.4 s | 1% |
| | Idle | 66.5 s | 56.2 s | 47.7 s | 28% |
| 128 / 256 / 256 | Completion | 290.9 s | 226.4 s | 211.0 s | 1.38× |
| | Proc | 112.6 s | 110.8 s | 122.4 s | –9% |
| | Idle | 177.7 s | 115.0 s | 88.0 s | 50% |
| 256 / 256 / 256 | Completion | 616.2 s | 441.0 s | 381.9 s | 1.61× |
| | Proc | 206.3 s | 197.2 s | 197.9 s | 4% |
| | Idle | 409.3 s | 243.2 s | 182.6 s | 55% |
| 32 / 512 / 512 | Completion | 152.0 s | 147.5 s | 139.0 s | 1.09× |
| | Proc | 88.2 s | 91.8 s | 87.4 s | 1% |
| | Idle | 63.2 s | 55.2 s | 51.1 s | 19% |
| 64 / 512 / 512 | Completion | 319.3 s | 268.7 s | 256.5 s | 1.24× |
| | Proc | 158.8 s | 158.4 s | 170.7 s | –8% |
| | Idle | 160.0 s | 109.7 s | 85.2 s | 47% |
| 128 / 512 / 512 | Completion | 673.5 s | 499.9 s | 471.7 s | 1.43× |
| | Proc | 295.0 s | 266.0 s | 282.8 s | 4% |
| | Idle | 377.9 s | 233.4 s | 188.3 s | 50% |
| 32 / 1024 / 1024 | Completion | 420.6 s | 366.5 s | 349.6 s | 1.20× |
| | Proc | 255.7 s | 255.0 s | 241.6 s | 6% |
| | Idle | 164.4 s | 110.9 s | 107.4 s | 35% |
| 64 / 1024 / 1024 | Completion | 891.8 s | 708.8 s | 660.5 s | 1.35× |
| | Proc | 475.6 s | 447.5 s | 424.9 s | 11% |
| | Idle | 415.7 s | 260.7 s | 234.2 s | 44% |
| 32 / 2048 / 2048 | Completion | 1262.9 s | 1142.3 s | 1092.5 s | 1.16× |
| | Proc | 833.1 s | 782.9 s | 738.6 s | 11% |
| | Idle | 429.3 s | 358.8 s | 353.4 s | 18% |
| 64 / 2048 / 2048 | Completion | 2468.7 s | 2222.2 s | 2113.0 s | 1.17× |
| | Proc | 1684.4 s | 1567.0 s | 1451.1 s | 14% |
| | Idle | 783.8 s | 654.5 s | 660.4 s | 16% |
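The Improvement column appears to compare the baseline against the full method (Dynamic Token-Budget and Micro-batch): a speedup factor for Completion and a percentage reduction for Proc and Idle. That reading is inferred from the reported numbers rather than stated in the table, so the sketch below should be taken as a plausibility check on the first workload (64 / 256 / 256), not as the authors' definition.

```python
def speedup(baseline_s: float, proposed_s: float) -> float:
    """Speedup factor of the proposed method over the baseline."""
    return baseline_s / proposed_s


def reduction_pct(baseline_s: float, proposed_s: float) -> float:
    """Percentage reduction relative to the baseline (negative = increase)."""
    return 100.0 * (baseline_s - proposed_s) / baseline_s


if __name__ == "__main__":
    # Values from the 64 / 256 / 256 rows of the table.
    print(f"Completion: {speedup(130.3, 111.6):.2f}x")
    print(f"Proc: {reduction_pct(63.3, 62.4):.0f}%")
    print(f"Idle: {reduction_pct(66.5, 47.7):.0f}%")
```

Under this interpretation, most of the completion-time gain in that row comes from reduced Idle time, since Proc changes by only about one percent.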
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

