ICML 2026  ·  ML Systems  ·  Scheduling Theory

Beyond Prediction:
Tail-Aware Scheduling for LLM Inference

LLM serving is judged by its tail — yet today's schedulers chase the mean by predicting decode lengths. We stop predicting. One control knob, γ, smoothly shapes priority and tames P99 latency — beating SRPT even when it is handed perfect length oracles.

Yueying Li¹, Yuanfan Chen¹*, Jiayang Chen¹*, Esha Choukse², Haoran Qiu², G. Edward Suh³,⁵, Rodrigo Fonseca², Ziv Scully⁴, Udit Gupta⁵
¹Cornell CS · ²Microsoft Azure Systems Research · ³NVIDIA · ⁴Cornell ORIE · ⁵Cornell ECE · *equal contribution
Interactive illustration · One knob, the whole frontier. Drag γ to interpolate between SJF (γ→0) and FCFS (γ→∞) and watch P50 and P99 latency; the sweet spot lies between the two extremes.
TL;DR

Tail latency, not mean latency — and no length prediction needed

LLM serving exhibits extreme output-length variability. Existing schedulers approximate SJF/SRPT using predicted decode lengths — but these prediction-driven policies are fragile and, fundamentally, mean-optimal: they offer no guarantees on tail behavior. UniBoost replaces prediction with soft, continuous priority boosting and KV-cache-aware preemption. A single parameter γ interpolates between FCFS and SJF — and a mid setting beats both.

35–50% lower P99 TTLT than SRPT with perfect length prediction
34–97% lower P95 TTFT across reasoning- and chat-heavy workloads
2.9–8.7× wider SLO threshold at 99% attainment, with zero length prediction
Key Insight

No single LLM scheduling policy dominates. The optimal strategy depends on the job size distribution and arrival burstiness. Optimizing for tail latency requires a scheduler that adaptively interpolates between two extremes: HoL-blocking mitigation (preempt long jobs) and starvation protection (guard long jobs).

The Problem

You can't schedule what you can't predict

Size-based policies (SJF/SRPT) are mean-optimal only when job sizes are known. In LLM serving, they almost never are — and the mean is the wrong target anyway.

Decode length is fundamentally unpredictable. Run the same prompt through the same model 20 times and the output length still varies widely (coefficient of variation up to 0.47), purely from sampling stochasticity and numerical nondeterminism. Two requests with identical prefills can diverge by orders of magnitude in completion time, and the size distribution shifts across workloads and over time, especially for reasoning models that think, reflect, and call tools.

Reasoning workloads are heavy-tailed. Datasets like BigCodeBench and S1K exhibit extreme output length variance — a single long-thinking request can be 10–100× longer than the median. Under such distributions, SRPT causes priority inversion: a short request that arrives slightly after a long one gets starved, inflating TTFT and P99 TTLT simultaneously.

No single policy dominates. FCFS is fair but slow; SRPT minimizes mean latency but has no tail guarantees. Prediction-based hybrids (TRAIL, LTR) are fragile under distribution shift. The right answer depends on the workload — and workloads change.

Token length variance across workloads
Figure 1 · Output token length distributions. WildChat, BigCodeBench, and S1K at different load points. Reasoning workloads exhibit extreme heavy-tailed behavior — a single policy cannot handle all regimes.
FCFS convoy effect: long request blocks short ones
Figure 2a · HoL blocking under FCFS. A long request $A$ arrives at $t=0$, delaying short requests $B,C,D,E$. FCFS performs poorly; SRPT/LAS preempt $A$ to serve short jobs first.
SRPT starvation: burst of short jobs starves long request
Figure 2b · Starvation under SRPT. A burst of short jobs $B,C,D$ causes SRPT/LAS to repeatedly preempt long request $A$, inflating tail latency. FCFS maintains better fairness here.
Key Insight The optimal scheduler for tail latency must adaptively interpolate between FCFS (optimal for light-tailed workloads) and SRPT (optimal for heavy-tailed workloads) — without relying on explicit length prediction. A single continuous knob γ achieves this, and its sweet spot beats both extremes on P99 latency.
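The two failure modes in Figures 2a and 2b can be reproduced with a toy model. The sketch below is an illustration only: it assumes a single server that processes one unit of work per time slot, with job sizes known exactly, which is far simpler than real LLM serving (no batching, no KV cache). The job names follow the figures; the sizes and arrival times are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: int  # time slot at which the job becomes available
    size: int     # units of work needed (hypothetical, known exactly here)


def simulate(jobs, policy):
    """Run a single server in unit time slots; return {name: response_time}."""
    remaining = {j.name: j.size for j in jobs}
    arrival = {j.name: j.arrival for j in jobs}
    done = {}
    t = 0
    while len(done) < len(jobs):
        ready = [n for n in remaining if arrival[n] <= t and n not in done]
        if not ready:
            t += 1  # server idles until the next arrival
            continue
        if policy == "fcfs":
            # Earliest arrival first; the head job keeps the server until done.
            pick = min(ready, key=lambda n: arrival[n])
        elif policy == "srpt":
            # Least remaining work first; re-evaluated every slot (preemptive).
            pick = min(ready, key=lambda n: (remaining[n], arrival[n]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining[pick] -= 1
        t += 1
        if remaining[pick] == 0:
            done[pick] = t - arrival[pick]
    return done


# Long request A arrives first; short requests B..E arrive one slot apart.
jobs = [Job("A", 0, 10), Job("B", 1, 1), Job("C", 2, 1), Job("D", 3, 1), Job("E", 4, 1)]
for policy in ("fcfs", "srpt"):
    resp = simulate(jobs, policy)
    print(policy, resp, "mean:", sum(resp.values()) / len(resp), "max:", max(resp.values()))
```

In this toy instance FCFS makes every short job wait behind A (head-of-line blocking), while SRPT serves the short jobs immediately and pushes A's completion later, so the mean improves and the maximum gets worse.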

Comparison with Prior Work

Table 1 · Comparison of LLM Scheduling Approaches
Policy / Work Heavy-tailed Light-tailed No Size Pred. Preemption Overhead Tail-Aware by Design TTLT Evaluated
Shortest Prefix First (vLLM)
Rank-prediction SJF (LTR)
Prediction-based SRPT (TRAIL)
FCFS (vLLM default)
Skip-Join MLFQ / LAS
UniBoost (Ours)
✓ = supported; △ = partial; ✗ = not supported. UniBoost is the only scheduler that handles both heavy- and light-tailed workloads, requires no size prediction, accounts for preemption overhead, and is explicitly designed and evaluated for tail latency.
Method

A four-phase design for tail-aware LLM scheduling

Soft, continuous priority shaping — no length prediction required. One parameter γ controls the entire FCFS–SJF spectrum.

UniBoost is inspired by tail-optimal scheduling theory. Rather than hard size-based ranking (SJF/SRPT), it uses soft, continuous priority shaping: each request receives a smoothly varying score computed from lightweight signals, and requests are served in order of increasing arrival time minus boost. This boosting approach can provably suppress extreme delays under partial observability.

Applying boosting to LLM serving faces three practical challenges: (1) LLM inference is stateful and memory-coupled — KV caches make preemption expensive; (2) the optimal boost parameter depends on the workload distribution, which shifts across tasks; (3) prefill and decode phases have fundamentally different scheduling dynamics. UniBoost addresses all three through a staged design.

The Boost Priority Function

The γ-Boost priority — one knob, two extremes
$$b_\gamma(w) = \frac{1}{\gamma}\log\!\left(\frac{1}{1 - e^{-\gamma w}}\right)$$
$$\Phi(i,t) = a_i - b_\gamma\!\left(\tilde{w}_i(t)\right)$$
$$\tilde{w}_i(t) = \max\bigl(w_i(t),\, s_i^{\text{pre}}\bigr) \quad\cdot\quad \text{serve } \arg\min_i\, \Phi(i,t)$$

Reading the score. $a_i$ = arrival time (FCFS rank) · $w_i(t)$ = attained service (tokens done) · $s_i^{\text{pre}}$ = prefill length · $\tilde{w}_i = \max(w_i, s_i^{\text{pre}})$ = effective work in either phase. The boost $b_\gamma(\tilde{w}_i)$ is large when little work is done and decays to 0 as work accrues; the scheduler serves the request with the earliest boosted arrival $\Phi = a_i - b_\gamma(\tilde{w}_i)$. A barely-started request is pulled far ahead; as it runs, its boost fades and it drifts back toward its true rank $a_i$. Short jobs jump the queue, yet long jobs are never starved.

Why one knob spans the frontier. $b_\gamma$ is decreasing and convex in $w$ — steep near $w{=}0$, flat once a job has run a while — and $\gamma$ sets how fast attained work converts to priority. $\gamma \to \infty$: $b_\gamma \to 0 \Rightarrow \Phi \to a_i$, recovering FCFS (protect the tail). $\gamma \to 0$: $b_\gamma(w) \approx \tfrac{1}{\gamma}\log\tfrac{1}{\gamma w}$, so the boost gap between two jobs $\approx \tfrac{1}{\gamma}\log(w_2/w_1) \to \infty$ and the least-attained job always wins — SJF/LAS (chase the mean). Provably strongly tail-optimal ($K_\pi{=}1$) among blind policies [Yu & Scully 2024; Harlev et al. 2025].
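The score and its two limits can be checked numerically. The following is a minimal sketch of the formulas above, not the UniBoost implementation: the function names, the request dictionaries, and the example numbers are hypothetical, and arrival times and token counts are treated as sharing one unit, which is an assumption.

```python
import math


def boost(w, gamma):
    """b_gamma(w) = (1/gamma) * log(1 / (1 - exp(-gamma * w))), for w > 0."""
    if w <= 0 or gamma <= 0:
        raise ValueError("w and gamma must be positive")
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.log(-math.expm1(-gamma * w)) / gamma


def score(arrival, attained, prefill_len, gamma):
    """Phi(i, t) = a_i - b_gamma(max(w_i(t), s_i_pre)); the lowest score is served first."""
    return arrival - boost(max(attained, prefill_len), gamma)


def pick_next(requests, gamma):
    """Return the name of the request with the smallest score."""
    best = min(
        requests,
        key=lambda r: score(r["arrival"], r["attained"], r["prefill"], gamma),
    )
    return best["name"]


# Hypothetical queue: an early request that has already run a long time,
# and a later request that has not started decoding yet.
requests = [
    {"name": "old_long", "arrival": 0.0, "attained": 1000, "prefill": 100},
    {"name": "new_short", "arrival": 5.0, "attained": 0, "prefill": 100},
]
for gamma in (1e-3, 10.0):
    print(gamma, pick_next(requests, gamma))
```

With these numbers the small γ picks `new_short` (least attained work wins, the SJF/LAS end) and the large γ picks `old_long` (the boost vanishes and arrival order decides, the FCFS end).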

UniBoost four-phase design overview
Figure 3 · UniBoost system overview. A four-phase design that evolves from a strawman prefill-boosted scheduler (DistBoost) to a unified priority space (UniBoost-Base), then adds KV-aware preemption stability (MemGuard) and online adaptive parameter estimation (γ-Ada).
Phase 1

DistBoost — Prefill-Boosted Strawman

Treats prefill and decode as disjoint queues. Applies boost-based priority to prefill requests while serving decode requests in FCFS order. Simple but susceptible to inter-queue Head-of-Line blocking when long decode jobs delay short prefill requests.

Phase 2

UniBoost-Base — Unified Priority Space

Merges prefill and decode into a single priority queue using the effective work metric $\tilde{w}_i(t)$. Allows cross-phase preemption and promotion, eliminating inter-queue HoL blocking while maintaining distribution-aware priority ordering.

Phase 3

MemGuard — KV-Aware Hysteresis

Discretizes priority updates using geometric quantization aligned with KV block sizes. A request's priority is only reconsidered when its decoded token count crosses a threshold $k \cdot 2^{\lfloor \log_2(w_i/k) \rfloor}$, bounding preemption opportunities to $O(\log S_i)$ per request and preventing KV-swap thrashing.
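To make the logarithmic bound concrete, the sketch below computes the milestone $k \cdot 2^{\lfloor \log_2(w_i/k) \rfloor}$ and lists the token counts at which it changes. It assumes that priority is reconsidered exactly when the milestone value changes and that no milestone exists before the first $k$ tokens; both are readings of the description above, and the function names and the block size of 16 are hypothetical.

```python
def milestone(w, k):
    """Return k * 2**floor(log2(w / k)) for w >= k, else 0 (no milestone yet)."""
    if k <= 0 or w < 0:
        raise ValueError("k must be positive and w non-negative")
    if w < k:
        return 0
    # For integers, floor(log2(w / k)) == (w // k).bit_length() - 1.
    return k << ((w // k).bit_length() - 1)


def revision_points(total_tokens, k):
    """Token counts at which the milestone changes while decoding total_tokens tokens."""
    points = []
    last = 0
    for w in range(1, total_tokens + 1):
        m = milestone(w, k)
        if m != last:
            points.append(w)
            last = m
    return points


# Hypothetical request: 4096 decoded tokens with k = 16.
points = revision_points(4096, 16)
print(points)       # doubling sequence starting at k
print(len(points))  # grows like log2(S / k), not like S
```

Doubling the request length adds only one more revision point, which is the sense in which preemption opportunities are bounded by $O(\log S_i)$.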

Phase 4

γ-Ada — Adaptive Parameter Estimation

Estimates the optimal boost parameter $\gamma$ online by fitting a log-linear slope to the observed response time tail in the $[t_{95}, t_{99}]$ band. This allows the scheduler to automatically adapt to workload distribution shifts and varying arrival rates without manual tuning.
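A log-linear tail fit of this kind can be sketched as follows. The code fits a least-squares line to the log of the empirical survival function $\log P(T > t)$ over the samples between the 95th and 99th percentiles and returns the negated slope as a decay rate. How γ-Ada maps this slope to $\gamma$ is not specified above, so treating the decay rate as a candidate $\gamma$ is an assumption; the function name and the synthetic exponential data are hypothetical.

```python
import math
import random


def tail_decay_rate(response_times, lo=0.95, hi=0.99):
    """Negated least-squares slope of log P(T > t) versus t inside the [lo, hi] quantile band."""
    xs = sorted(response_times)
    n = len(xs)
    pts = []
    for i, t in enumerate(xs):
        q = (i + 1) / n  # empirical CDF value at t
        if lo <= q <= hi:
            pts.append((t, math.log(1.0 - q)))
    if len(pts) < 2:
        raise ValueError("need at least two samples inside the tail band")
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    if sxx == 0.0:
        raise ValueError("tail samples are all identical")
    return -sxy / sxx


# Synthetic check: exponential response times have a log-linear tail whose
# decay rate equals the rate parameter (2.0 here).
random.seed(0)
samples = [random.expovariate(2.0) for _ in range(50_000)]
print(round(tail_decay_rate(samples), 2))
```

Restricting the fit to the $[t_{95}, t_{99}]$ band keeps the estimate focused on the tail while avoiding the last 1% of samples, where the empirical survival function is noisiest.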

MemGuard geometric quantization of boost function
Figure 4 · MemGuard geometric quantization. Priority revisions are bounded to $O(\log S_i)$ opportunities per request, preventing thrashing under tight KV memory. Continuous vs. quantized boost functions at prefill lengths 256 and 1024.
Results

Consistent tail latency improvements across workloads

Evaluated on Qwen-72B, Llama-3-70B, TPA-CP4, and CodeLlama-34B across reasoning, chat, and mixed workloads.

Percentile Latency Improvements

Percentile latency improvement over Sarathi on reasoning workload
Figure 5 · Latency improvement on reasoning workload (Qwen-72B). Percent slowdown of each baseline compared to UniBoost across percentiles (P50–P99.9) for TTLT, TTFT, and TBT. UniBoost consistently dominates: TTFT gains accelerate in the upper tail due to prefill admission protection, while TTLT gains come from avoiding attained-service inversion under memory-coupled decoding.

Across all three latency dimensions, UniBoost’s advantage widens as we move into the upper tail. The TTFT improvement is particularly striking: because UniBoost boosts newly-arrived requests before they accumulate service, short prefill jobs are never blocked behind long decode sequences. The TTLT gains reflect a deeper structural win — by keeping priority aligned with attained work rather than predicted length, UniBoost avoids the “attained-service inversion” that causes SRPT to trigger cascading KV-cache evictions under memory pressure. The result is a scheduler that is simultaneously better at the mean and at the tail, without any length oracle.

Design Phase Ablation

Load-latency ablation across design phases
Figure 6 · Design-phase ablation (CodeLlama-34B, 1 GPU). Load–latency curves sweeping QPS. DistBoost (purple) destabilizes first at 0.24 QPS. Phase 2 (blue) delays the knee and halves P99 near 0.248 QPS. Adding MemGuard (green) cuts tails by another 1.5–2×. UniBoost (orange) performs best: at 0.255 QPS, P99 remains sub-second while alternatives are multi-second. Overall, UniBoost shifts the knee right by ~4–6% and reduces P90/P99/mean by up to 10× near saturation.

The ablation isolates the contribution of each design phase. Phase 1 (DistBoost) already improves over Sarathi for moderate loads, but its disjoint-queue design creates a new failure mode: long decode jobs block short prefill arrivals, causing the system to destabilize earlier. Phase 2’s unified priority space resolves this by allowing cross-phase preemption, shifting the saturation knee rightward. MemGuard’s geometric quantization then provides the decisive gain near saturation: by bounding preemption opportunities to $O(\log S_i)$ per request, it prevents the KV-swap thrashing that collapses P99 under tight memory. The final γ-Ada layer adds robustness to workload shifts without any manual tuning.

SLO Attainment

SLO attainment with chat+reasoning mixture
Figure 7a · SLO attainment (chat+reasoning mixture). UniBoost maintains highest attainment across the widest range of SLO scales. On average, 2.9–8.7× more stringent SLO can be achieved at 99% attainment.
SLO attainment with reasoning dataset
Figure 7b · SLO attainment (reasoning dataset). UniBoost achieves 1.7–4.3× gain in SLO threshold at 99% attainment for TTLT, while also dominating TTFT attainment across all scales.

SLO attainment captures what operators actually care about: what fraction of requests meet a latency budget? The 2.9–8.7× improvement in achievable SLO threshold means that a cluster running UniBoost can commit to a latency guarantee that is nearly an order of magnitude tighter than what Sarathi or SJF can sustain at the same 99% attainment level. Equivalently, at a fixed SLO, UniBoost can serve significantly higher load before attainment drops below the target — directly translating to higher revenue per GPU.
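The two quantities in this paragraph can be computed directly from a latency trace. The sketch below is illustrative: `attainment` is the fraction of requests within a budget, `tightest_slo` is the smallest budget that still reaches the target attainment, and `slo_gain` is the ratio of the two schedulers' tightest budgets. The function names and the synthetic traces are hypothetical and are not taken from the evaluation.

```python
import math


def attainment(latencies, slo):
    """Fraction of requests whose latency is within the SLO budget."""
    return sum(1 for x in latencies if x <= slo) / len(latencies)


def tightest_slo(latencies, target=0.99):
    """Smallest budget that at least a `target` fraction of requests meet."""
    xs = sorted(latencies)
    # Number of requests that must meet the budget; the small epsilon guards
    # against floating-point round-up in target * len(xs).
    k = max(math.ceil(target * len(xs) - 1e-9), 1)
    return xs[k - 1]


def slo_gain(baseline, candidate, target=0.99):
    """How many times tighter the candidate's achievable SLO is than the baseline's."""
    return tightest_slo(baseline, target) / tightest_slo(candidate, target)


# Hypothetical traces (seconds): the baseline is uniformly twice as slow.
candidate = [float(x) for x in range(1, 101)]
baseline = [2.0 * x for x in candidate]
print(tightest_slo(candidate), attainment(candidate, tightest_slo(candidate)))
print(slo_gain(baseline, candidate))
```

Reading the result the other way gives the fixed-SLO view: at a given budget, the scheduler with the smaller `tightest_slo` has headroom before attainment falls below the target.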

Preemption Efficiency

CDF of restarts per request at different load points
Figure 8 · CDF of restarts per request at ρ = 0.8, 0.9, 0.99 (Qwen-72B, reasoning workload). UniBoost (blue) achieves dramatically fewer preemptions than SRPT (gray) and Sarathi (orange), especially at high load, validating MemGuard's $O(\log S_i)$ preemption bound.

Preemption is the hidden cost of priority scheduling in LLM serving: every time a request is evicted, its KV-cache must be swapped to CPU memory and reloaded later, consuming PCIe bandwidth and stalling the GPU. SRPT’s aggressive size-based preemption triggers this cascade frequently, especially at high load (ρ = 0.99) where memory is scarce. MemGuard’s geometric quantization ensures that a request’s priority is only reconsidered at exponentially spaced token milestones, so the total number of preemption opportunities is logarithmic in the request length. The CDF confirms this: at ρ = 0.99, over 90% of UniBoost requests complete with zero restarts, compared to frequent multi-restart patterns under SRPT and Sarathi.

Performance vs. TRAIL Baseline

Table 2 · Relative Performance vs. TRAIL+ (Reasoning Workload)
| Scheduler | TTLT Mean | TTLT P95 | TTLT P99 | TTFT Mean | TTFT P95 | TTFT P99 | TBT Mean | TBT P95 | TBT P99 | Throughput (tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TRAIL+ (baseline) | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% |
| SJF | −11.1% | −12.5% | +11.2% | −74.6% | −70.5% | −9.7% | −5.2% | −7.1% | −6.3% | −7.9% |
| Sarathi | −12.3% | −13.2% | −7.6% | −49.9% | −40.5% | −8.1% | −6.8% | −8.4% | −7.9% | −10.9% |
| DistBoost | −14.0% | +6.0% | +37.1% | −22.8% | −35.0% | −3.4% | −14.1% | −10.2% | +18.3% | +1.1% |
| UniBoost (Ours) | −19.1% | +1.1% | +35.1% | +52.1% | +97.4% | +34.0% | −25.3% | −8.1% | +33.8% | +1.2% |
TTLT = end-to-end latency. Positive values are improvements over TRAIL+ and negative values are degradations. Original color coding: green = improvement; yellow = minor degradation (≤15%); orange = moderate degradation (≤50%); red = severe degradation (>50%). UniBoost delivers the most substantial gains across all major tail metrics, especially P99 TTLT (+35%) and P95 TTFT (+97%), without sacrificing throughput.

Compared to TRAIL+, the strongest published LLM scheduling baseline, UniBoost achieves a +35% improvement in P99 end-to-end latency and a +97% gain in P95 TTFT. The mean TTLT trade-off (−19%) reflects UniBoost’s deliberate choice to prioritize tail over mean: by protecting short requests from convoy delays, some long requests wait slightly longer at the mean. Crucially, throughput is essentially unchanged (+1.2%), confirming that tail improvement comes from smarter scheduling, not from sacrificing utilization. TBT P99 improves by +34%, indicating that even the decode phase benefits from reduced memory pressure and fewer preemption-induced stalls.

Discussion

Composability, extensions, and open problems

UniBoost is designed to drop into existing LLM serving stacks. This section covers how it interacts with common system optimizations and outlines directions for future work.

Prefix / Radix Caching

Caching is outside the scope of this work, but UniBoost extends straightforwardly to prefix and radix caching. Caching primarily reduces effective prefill work: if a request has $h_i$ cached tokens (e.g., from a shared system prompt or an earlier multi-turn context), we replace the effective-work formula with $$\tilde{w}_i(t) = \max\!\bigl(w_i(t),\; s_i^{\text{pre}} - h_i\bigr).$$ This single substitution is sufficient — the UniBoost implementation and its theoretical validity remain unchanged. In practice, the cache-hit count $h_i$ is known at admission time (it is the output of the radix-cache lookup), so no additional estimation is required.
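As a worked instance of the substitution, the sketch below evaluates the cache-aware effective work for a few hypothetical requests. The function name and the numbers are illustrative; the check that $h_i$ lies between 0 and the prefill length is an added assumption.

```python
def effective_work(attained, prefill_len, cached=0):
    """Cache-aware effective work: max(w_i(t), s_i_pre - h_i)."""
    if not 0 <= cached <= prefill_len:
        raise ValueError("cached tokens must lie between 0 and the prefill length")
    return max(attained, prefill_len - cached)


# Hypothetical 1024-token prompt with a 768-token shared system prompt in cache.
print(effective_work(attained=0, prefill_len=1024, cached=0))      # no cache hit
print(effective_work(attained=0, prefill_len=1024, cached=768))    # only the uncached suffix counts
print(effective_work(attained=500, prefill_len=1024, cached=768))  # decode progress dominates
```

One edge case is worth noting: a fully cached prompt with no decoded tokens gives an effective work of 0, where $b_\gamma$ diverges, so a practical implementation would likely need a small floor on the value. That floor is an assumption and is not part of the formula above.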

Chunked Prefill and Continuous Batching

UniBoost already adopts continuous batching and chunked prefill by design. Prefill is treated as incremental service: each chunk advances the effective work $\tilde{w}_i(t)$ and therefore updates priority consistently with decode. The chunk size is typically set based on the desired TBT/TPOT SLO; the table below shows that UniBoost’s gains are robust across a wide range of chunk sizes.

Table A1 · Sensitivity to chunk size (relative change vs. FCFS baseline)
| Chunk size | E2E Mean | E2E P95 | E2E P99 | TTFT Mean | TTFT P95 | TTFT P99 | TBT Mean | TBT P95 | TBT P99 | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | −17.8% | +2.4% | +38.6% | +12.3% | +18.5% | +8.2% | −22.1% | −5.9% | +36.4% | +0.6% |
| 512 | −18.7% | +1.5% | +36.0% | +35.8% | +58.1% | +22.3% | −24.5% | −7.5% | +34.8% | +1.0% |
| 1024 (default) | −19.1% | +1.1% | +35.1% | +52.1% | +97.4% | +34.0% | −25.3% | −8.1% | +33.8% | +1.2% |
| 2048 | −19.4% | +0.8% | +34.5% | +68.4% | +93.6% | +45.1% | −26.0% | −8.6% | +33.2% | +1.3% |
Larger chunks improve TTFT (fewer scheduling interruptions during prefill) at the cost of slightly higher mean TBT. P99 gains are robust across all chunk sizes.

P/D Disaggregation

In prefill-decode disaggregated deployments (e.g., Splitwise, Nexus), prefill and decode run on separate GPU pools. UniBoost applies independently within each pool: the prefill scheduler uses the prompt-length signal $s_i^{\text{pre}}$ to boost new arrivals, and the decode scheduler uses attained decode tokens $w_i^{\text{dec}}(t)$ as the effective work signal. The two schedulers share the same $\gamma$ parameter, which can be estimated jointly from the end-to-end latency tail observed at the global dispatcher.

Composability with Global Dispatch Policies

Cluster-level routers decide which instance a request is sent to; UniBoost decides how to schedule requests within that instance. When the routing policy is approximately load-oblivious, the per-instance scheduling guarantees carry over directly. The table below reports the relative improvement of UniBoost over the vLLM baseline under three routing strategies on 64 H100 GPUs.

Table A2 · UniBoost vs. vLLM under different global schedulers (64× H100)
| Global Scheduler | Avg E2E | P90 E2E | P99 E2E | Avg TBT | P90 TBT | P99 TBT |
| --- | --- | --- | --- | --- | --- | --- |
| Random | +2.1% | −3.7% | −13.9% | −68.8% | −17.0% | −69.5% |
| Round-robin | +2.7% | −0.9% | −13.5% | −48.0% | +5.5% | −60.5% |
| JSQ (Join-Shortest-Queue) | −23.0% | −17.2% | −23.5% | −73.5% | −52.6% | −77.5% |
With JSQ routing, TBT and E2E improvements become more pronounced, confirming that UniBoost composes favorably with stronger global dispatch policies. Negative values indicate lower latency than the vLLM baseline (shown in green in the original table).

Scalability with Cluster Size

The gains remain consistent across scales from 32× to 512×, indicating that UniBoost’s per-instance benefits are largely independent of cluster size.

Table A3 · Relative improvement of UniBoost over vLLM at different cluster scales
| Scale | Avg E2E | P99 E2E | Avg TTFT | P99 TTFT |
| --- | --- | --- | --- | --- |
| 32× | −2.3% | −13.9% | +8.4% | −8.7% |
| 64× | −2.8% | −24.2% | +3.6% | −10.3% |
| 128× | −1.9% | −12.7% | +7.8% | −11.9% |
| 512× | −0.4% | −23.2% | +1.2% | −13.5% |
P99 E2E and P99 TTFT improve consistently across all cluster sizes.

Limitations and Future Work

UniBoost targets single-replica, single-model scheduling. Several important directions remain open. First, integrating the cache-aware effective-work formula and measuring the combined effect with radix caching is a natural next step. Second, extending UniBoost to multi-model and multi-replica deployments requires new theoretical tools for how per-instance tail guarantees compose under load-aware routing. Third, the MemGuard bound of $O(\log S_i)$ preemptions per request assumes fixed KV-block sizes; incorporating variable-length KV blocks (as in speculative decoding) would require a more refined analysis. Finally, developing formal guarantees that capture preemption and eviction costs in the M/G/k model remains an important open problem.

Takeaway Stop predicting. Shape priorities softly, manage the cache jointly, and the tail takes care of itself. UniBoost drops into vLLM and SGLang continuous batching today.
Citation

Cite this work

@inproceedings{li2026beyondprediction,
  title     = {Beyond Prediction: Tail-Aware Scheduling for LLM Inference},
  author    = {Yueying Li and Yuanfan Chen and Jiayang Chen and Esha Choukse
               and Haoran Qiu and G. Edward Suh and Rodrigo Fonseca
               and Ziv Scully and Udit Gupta},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
}