Preprint
Article

This version is not peer-reviewed.

Multi-Objective Scheduling for Large Language Model Inference with Prompt-Level Cost Prediction and SLO Awareness

Submitted: 18 April 2026

Posted: 20 April 2026


Abstract
Large language model (LLM) inference in multi-tenant clouds is becoming an increasingly important contributor to data-center carbon emissions, yet existing carbon-aware scheduling techniques target long-running training jobs and are ill-suited for the short, bursty, SLO-sensitive nature of online serving. We propose CAPS (Carbon–Aware Prompt Scheduling), an online bi-objective scheduler that jointly optimizes goodput and per-request carbon cost for multi-tenant LLM inference. CAPS first employs a lightweight prompt complexity predictor to estimate token generation cost and latency risk for each incoming request. It then combines real-time grid carbon intensity, GPU energy profiles, and per-tenant SLO tiers to route each request to one of three execution pools: a low-latency pool, a low-carbon pool, or a delay-tolerant batch pool. A composite reward function balances goodput, carbon emissions, and SLO violation rate. In trace-driven simulations using public conversation traces and regional carbon intensity data, CAPS reduces average carbon emissions per 1K generated tokens by 26.8% compared to round-robin scheduling while achieving an SLO attainment rate that matches or exceeds a dedicated SLO-aware baseline.
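The routing step described in the abstract can be sketched in code. The following is a minimal, hypothetical illustration of CAPS-style pool selection, not the paper's actual implementation: the pool energy profiles, SLO-tier weights, and the objective weights `alpha` and `beta` are all assumed values chosen for the example, and the composite score here is a simple weighted sum of estimated carbon cost and an SLO-risk penalty.

```python
from dataclasses import dataclass

@dataclass
class Request:
    predicted_tokens: int   # output of the prompt complexity predictor
    latency_risk: float     # predicted risk (0..1) of missing the SLO
    slo_tier: str           # "strict", "standard", or "besteffort"

# Hypothetical per-pool profiles: energy per generated token and
# expected queueing delay (values are illustrative, not measured).
POOLS = {
    "low_latency": {"j_per_token": 1.2, "delay_s": 0.05},
    "low_carbon":  {"j_per_token": 0.8, "delay_s": 0.50},
    "batch":       {"j_per_token": 0.6, "delay_s": 5.00},
}

# Assumed penalty weights per SLO tier.
SLO_WEIGHT = {"strict": 10.0, "standard": 3.0, "besteffort": 0.5}

def carbon_cost(req, pool, grid_gco2_per_kwh):
    """Estimated gCO2 to serve this request in the given pool."""
    joules = req.predicted_tokens * POOLS[pool]["j_per_token"]
    kwh = joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * grid_gco2_per_kwh

def route(req, grid_gco2_per_kwh, alpha=100.0, beta=2.0):
    """Pick the pool minimizing a weighted sum of estimated
    carbon cost and SLO-violation risk (illustrative objective)."""
    def score(pool):
        slo_penalty = (SLO_WEIGHT[req.slo_tier]
                       * req.latency_risk
                       * POOLS[pool]["delay_s"])
        return alpha * carbon_cost(req, pool, grid_gco2_per_kwh) \
               + beta * slo_penalty
    return min(POOLS, key=score)

# A strict-tier, high-risk request should land in the low-latency pool;
# a best-effort request can be deferred to the batch pool.
print(route(Request(512, 0.9, "strict"), grid_gco2_per_kwh=400))      # low_latency
print(route(Request(512, 0.1, "besteffort"), grid_gco2_per_kwh=400))  # batch
```

With this objective, high grid carbon intensity pushes marginal requests toward the low-carbon and batch pools, while a strict SLO tier or high predicted latency risk pulls them back to the low-latency pool, mirroring the trade-off the scheduler is designed to balance.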
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
