Beyond Prediction: Tail-Aware Scheduling for LLM Inference

LLM serving is judged by its tail, yet today's schedulers chase the mean by predicting decode lengths. We stop predicting. One control knob, γ, smoothly shapes priority and tames P99 latency, beating SRPT even when SRPT is handed a perfect length oracle.

Yueying Li¹, Yuanfan Chen¹*, Jiayang Chen¹*, Esha Choukse², Haoran Qiu², G. Edward Suh³,⁵, Rodrigo Fonseca², Ziv Scully⁴, Udit Gupta⁵

¹Cornell CS, ²Microsoft Azure Systems Research, ³NVIDIA, ⁴Cornell ORIE, ⁵Cornell ECE
*Equal contribution

[Interactive illustration: "One knob, the whole frontier." Dragging the boost parameter γ (shown at γ = 1.0) interpolates between SJF (γ→0) and FCFS (γ→∞), with a sweet spot in between; the widget reports P50 and P99 latency.]

Decode length is fundamentally unpredictable. Run the same prompt through the same model 20 times and the output length still swings by more than 2× (coefficient of variation up to 0.47), purely from sampling stochasticity and numerical nondeterminism. Two requests with identical prefills can diverge by orders of magnitude in completion time, and the size distribution shifts across workloads and over time — especially for reasoning models that think, reflect, and call tools.
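The two variability figures quoted above (the max/min swing and the coefficient of variation) can be computed from repeated runs of one prompt as in the minimal sketch below. The sample lengths are made-up illustrative numbers, not measurements from the paper, and the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(lengths):
    """Population standard deviation divided by the mean (dimensionless)."""
    return statistics.pstdev(lengths) / statistics.fmean(lengths)


def swing(lengths):
    """Ratio between the longest and the shortest output."""
    return max(lengths) / min(lengths)


# Made-up output lengths (in tokens) for repeated runs of one prompt.
sample_lengths = [310, 420, 275, 640, 505, 390, 720, 298, 455, 560]

print(f"CV    = {coefficient_of_variation(sample_lengths):.2f}")
print(f"swing = {swing(sample_lengths):.2f}x")
```

A CV near 0.5 means the standard deviation is about half the mean, so a scheduler that trusts a single point estimate of decode length will often be wrong by a large margin.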

Reasoning workloads are heavy-tailed. Datasets like BigCodeBench and S1K exhibit extreme output-length variance: a single long-thinking request can be 10–100× longer than the median. Under such distributions, SRPT causes priority inversion: a short request that arrives slightly after a long one gets starved, inflating both time to first token (TTFT) and P99 time to last token (TTLT).

No single policy dominates. FCFS is fair but slow; SRPT minimizes mean latency but has no tail guarantees. Prediction-based hybrids (TRAIL, LTR) are fragile under distribution shift. The right answer depends on the workload — and workloads change.

Comparison with Prior Work

Policy / Work	Heavy-tailed	Light-tailed	No Size Pred.	Preemption Overhead	Tail-Aware by Design	TTLT Evaluated
Shortest Prefix First (vLLM)	△	△	✗	✗	✗	✗
Rank-prediction SJF (LTR)	✓	△	✗	△	✗	✗
Prediction-based SRPT (TRAIL)	✓	△	✗	△	✗	✓
FCFS (vLLM default)	✗	✓	✓	✓	✗	△
Skip-Join MLFQ / LAS	△	△	✓	△	✓	△
UniBoost (Ours)	✓	✓	✓	✓	✓	✓

UniBoost is inspired by tail-optimal scheduling theory. Rather than hard size-based ranking (SJF/SRPT), it uses soft, continuous priority shaping: each request receives a smoothly varying score computed from lightweight signals, and requests are served in order of increasing arrival time minus boost. This boosting approach can provably suppress extreme delays under partial observability.
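The ordering rule above (serve by increasing arrival time minus boost) can be sketched as follows. This is an illustration under stated assumptions, not UniBoost's actual priority function: the boost shape `(1/γ)·log(1/(1 − e^(−γ·s)))` is assumed here only because it reproduces the two limits the page describes (FCFS as γ→∞, SJF-like ordering as γ→0), and `size_hint` is a hypothetical stand-in for whatever lightweight signal the scheduler observes.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float    # arrival time, in seconds
    size_hint: float  # hypothetical lightweight signal, must be > 0


def boost(size_hint: float, gamma: float) -> float:
    """Assumed boost shape: (1/gamma) * log(1 / (1 - exp(-gamma * size_hint)))."""
    # -expm1(-x) evaluates 1 - exp(-x) accurately even when x is small.
    return -math.log(-math.expm1(-gamma * size_hint)) / gamma


def service_order(requests, gamma):
    """Return request names sorted by increasing (arrival time - boost)."""
    ranked = sorted(requests, key=lambda r: r.arrival - boost(r.size_hint, gamma))
    return [r.name for r in ranked]


requests = [
    Request("long", arrival=0.0, size_hint=8.0),
    Request("short", arrival=1.0, size_hint=1.0),
    Request("medium", arrival=2.0, size_hint=4.0),
]

# Large gamma: boosts vanish, so the order follows arrival time (FCFS-like).
print(service_order(requests, gamma=50.0))
# Small gamma: boosts dominate, so the smallest hint goes first (SJF-like).
print(service_order(requests, gamma=0.01))
```

Because the boost is a smooth function of the signal, intermediate values of γ trade off arrival order against size order continuously instead of switching between two hard rankings.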

Applying boosting to LLM serving faces three practical challenges: (1) LLM inference is stateful and memory-coupled — KV caches make preemption expensive; (2) the optimal boost parameter depends on the workload distribution, which shifts across tasks; (3) prefill and decode phases have fundamentally different scheduling dynamics. UniBoost addresses all three through a staged design.

Performance vs. TRAIL Baseline

Scheduler	TTLT Mean	TTLT P95	TTLT P99	TTFT Mean	TTFT P95	TTFT P99	TBT Mean	TBT P95	TBT P99	Throughput (tok/s)
TRAIL+ (baseline)	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%	+0.0%
SJF	−11.1%	−12.5%	+11.2%	−74.6%	−70.5%	−9.7%	−5.2%	−7.1%	−6.3%	−7.9%
Sarathi	−12.3%	−13.2%	−7.6%	−49.9%	−40.5%	−8.1%	−6.8%	−8.4%	−7.9%	−10.9%
DistBoost	−14.0%	+6.0%	+37.1%	−22.8%	−35.0%	−3.4%	−14.1%	−10.2%	+18.3%	+1.1%
UniBoost (Ours)	−19.1%	+1.1%	+35.1%	+52.1%	+97.4%	+34.0%	−25.3%	−8.1%	+33.8%	+1.2%

Chunked Prefill and Continuous Batching

Chunk size	E2E Mean	E2E P95	E2E P99	TTFT Mean	TTFT P95	TTFT P99	TBT Mean	TBT P95	TBT P99	Throughput (tok/s)
128	−17.8%	+2.4%	+38.6%	+12.3%	+18.5%	+8.2%	−22.1%	−5.9%	+36.4%	+0.6%
512	−18.7%	+1.5%	+36.0%	+35.8%	+58.1%	+22.3%	−24.5%	−7.5%	+34.8%	+1.0%
1024 (default)	−19.1%	+1.1%	+35.1%	+52.1%	+97.4%	+34.0%	−25.3%	−8.1%	+33.8%	+1.2%
2048	−19.4%	+0.8%	+34.5%	+68.4%	+93.6%	+45.1%	−26.0%	−8.6%	+33.2%	+1.3%

Composability with Global Dispatch Policies

Global Scheduler	Avg E2E	P90 E2E	P99 E2E	Avg TBT	P90 TBT	P99 TBT
Random	+2.1%	−3.7%	−13.9%	−68.8%	−17.0%	−69.5%
Round-robin	+2.7%	−0.9%	−13.5%	−48.0%	+5.5%	−60.5%
JSQ (Join-Shortest-Queue)	−23.0%	−17.2%	−23.5%	−73.5%	−52.6%	−77.5%
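The three global dispatch policies named in the table are standard load-balancing rules, and a minimal sketch of each is given below. The page does not say how its system measures queue length (requests, tokens, or KV-cache memory), so counting outstanding requests per replica is an assumption of this sketch, as are all function names.

```python
import itertools
import random


def pick_random(queue_lengths, rng=random):
    """Random: choose any replica uniformly, ignoring load."""
    return rng.randrange(len(queue_lengths))


def make_round_robin(num_replicas):
    """Round-robin: cycle through replicas in a fixed order, ignoring load."""
    counter = itertools.cycle(range(num_replicas))
    return lambda queue_lengths: next(counter)


def pick_jsq(queue_lengths):
    """Join-Shortest-Queue: choose the replica with the fewest outstanding requests."""
    # Ties resolve to the lowest replica index.
    return min(range(len(queue_lengths)), key=lambda i: queue_lengths[i])


def dispatch(num_requests, num_replicas, pick):
    """Assign requests one at a time; no request completes during the sketch."""
    queues = [0] * num_replicas
    for _ in range(num_requests):
        queues[pick(queues)] += 1
    return queues


print(dispatch(7, 3, pick_jsq))
print(dispatch(7, 3, make_round_robin(3)))
```

JSQ is the only one of the three that reacts to load, so it differs from the others once replicas drain at different rates, which this completion-free sketch does not model.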

Scalability with Cluster Size

Scale	Avg E2E	P99 E2E	Avg TTFT	P99 TTFT
32×	−2.3%	−13.9%	+8.4%	−8.7%
64×	−2.8%	−24.2%	+3.6%	−10.3%
128×	−1.9%	−12.7%	+7.8%	−11.9%
512×	−0.4%	−23.2%	+1.2%	−13.5%

Cite this work

@inproceedings{li2026beyondprediction,
  title     = {Beyond Prediction: Tail-Aware Scheduling for LLM Inference},
  author    = {Yueying Li and Yuanfan Chen and Jiayang Chen and Esha Choukse and Haoran Qiu and G. Edward Suh and Rodrigo Fonseca and Ziv Scully and Udit Gupta},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
}

Tail latency, not mean latency, and no length prediction needed
You can't schedule what you can't predict
    Comparison with Prior Work
A four-phase design for tail-aware LLM scheduling
    The Boost Priority Function
    DistBoost: Prefill-Boosted Strawman
    UniBoost-Base: Unified Priority Space
    MemGuard: KV-Aware Hysteresis
    γ-Ada: Adaptive Parameter Estimation
Consistent tail latency improvements across workloads
    Percentile Latency Improvements
    Design Phase Ablation
    SLO Attainment
    Preemption Efficiency
    Performance vs. TRAIL Baseline
Composability, extensions, and open problems
    Prefix / Radix Caching
    Chunked Prefill and Continuous Batching
    P/D Disaggregation
    Composability with Global Dispatch Policies
    Scalability with Cluster Size
    Limitations and Future Work
Cite this work