LLM serving is judged by its tail — yet today's schedulers chase the mean by predicting decode lengths. We stop predicting. One control knob, γ, smoothly shapes priority and tames P99 latency — beating SRPT even when it is handed perfect length oracles.
LLM serving exhibits extreme output-length variability. Existing schedulers approximate SJF/SRPT using predicted decode lengths — but these prediction-driven policies are fragile and, fundamentally, mean-optimal: they offer no guarantees on tail behavior. UniBoost replaces prediction with soft, continuous priority boosting and KV-cache-aware preemption. A single parameter γ interpolates between FCFS and SJF — and a mid setting beats both.
No single LLM scheduling policy dominates. The optimal strategy depends on the job size distribution and arrival burstiness. Optimizing for tail latency requires a scheduler that adaptively interpolates between two extremes: HoL-blocking mitigation (preempt long jobs) and starvation protection (guard long jobs).
Size-based policies (SJF/SRPT) are mean-optimal only when job sizes are known. In LLM serving, they almost never are — and the mean is the wrong target anyway.
Decode length is fundamentally unpredictable. Run the same prompt through the same model 20 times and the output length still swings by more than 2× (coefficient of variation up to 0.47), purely from sampling stochasticity and numerical nondeterminism. Two requests with identical prefills can diverge by orders of magnitude in completion time, and the size distribution shifts across workloads and over time — especially for reasoning models that think, reflect, and call tools.
Reasoning workloads are heavy-tailed. Datasets like BigCodeBench and S1K exhibit extreme output length variance — a single long-thinking request can be 10–100× longer than the median. Under such distributions, SRPT causes priority inversion: a short request that arrives slightly after a long one gets starved, inflating TTFT and P99 TTLT simultaneously.
No single policy dominates. FCFS is fair but slow; SRPT minimizes mean latency but has no tail guarantees. Prediction-based hybrids (TRAIL, LTR) are fragile under distribution shift. The right answer depends on the workload — and workloads change.
| Policy / Work | Heavy-tailed | Light-tailed | No Size Pred. | Preemption Overhead | Tail-Aware by Design | TTLT Evaluated |
|---|---|---|---|---|---|---|
| Shortest Prefix First (vLLM) | △ | △ | ✗ | ✗ | ✗ | ✗ |
| Rank-prediction SJF (LTR) | ✓ | △ | ✗ | △ | ✗ | ✗ |
| Prediction-based SRPT (TRAIL) | ✓ | △ | ✗ | △ | ✗ | ✓ |
| FCFS (vLLM default) | ✗ | ✓ | ✓ | ✓ | ✗ | △ |
| Skip-Join MLFQ / LAS | △ | △ | ✓ | △ | ✓ | △ |
| UniBoost (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Soft, continuous priority shaping — no length prediction required. One parameter γ controls the entire FCFS–SJF spectrum.
UniBoost is inspired by tail-optimal scheduling theory. Rather than hard size-based ranking (SJF/SRPT), it uses soft, continuous priority shaping: each request receives a smoothly varying score computed from lightweight signals, and requests are served in order of increasing arrival time minus boost. This boosting approach can provably suppress extreme delays under partial observability.
Applying boosting to LLM serving faces three practical challenges: (1) LLM inference is stateful and memory-coupled — KV caches make preemption expensive; (2) the optimal boost parameter depends on the workload distribution, which shifts across tasks; (3) prefill and decode phases have fundamentally different scheduling dynamics. UniBoost addresses all three through a staged design.
Reading the score. $a_i$ = arrival time (FCFS rank) · $w_i(t)$ = attained service (tokens done) · $s_i^{\text{pre}}$ = prefill length · $\tilde{w}_i = \max(w_i, s_i^{\text{pre}})$ = effective work in either phase. The boost $b_\gamma(\tilde{w}_i)$ is large when little work is done and decays to 0 as work accrues. The scheduler serves the request with the earliest boosted arrival time $\Phi = a_i - b_\gamma(\tilde{w}_i)$. A barely-started request is pulled far ahead; as it runs, its boost fades and it drifts back toward its true rank $a_i$. Short jobs jump the queue, yet long jobs are never starved.
Why one knob spans the frontier. $b_\gamma$ is decreasing and convex in $w$ — steep near $w{=}0$, flat once a job has run a while — and $\gamma$ sets how fast attained work converts to priority. $\gamma \to \infty$: $b_\gamma \to 0 \Rightarrow \Phi \to a_i$, recovering FCFS (protect the tail). $\gamma \to 0$: $b_\gamma(w) \approx \tfrac{1}{\gamma}\log\tfrac{1}{\gamma w}$, so the boost gap between two jobs $\approx \tfrac{1}{\gamma}\log(w_2/w_1) \to \infty$ and the least-attained job always wins — SJF/LAS (chase the mean). Provably strongly tail-optimal ($K_\pi{=}1$) among blind policies [Yu & Scully 2024; Harlev et al. 2025].
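As a rough illustration of the two limits, the sketch below assumes the boost takes the form $b_\gamma(w) = \tfrac{1}{\gamma}\log\tfrac{1}{1 - e^{-\gamma w}}$. That form is an assumption made here, not something spelled out above; it is chosen because it is decreasing and convex in $w$, vanishes as $\gamma \to \infty$, and reduces to $\tfrac{1}{\gamma}\log\tfrac{1}{\gamma w}$ as $\gamma \to 0$. The function names, the request tuples, and the mixing of time and token units are hypothetical and only for illustration.

```python
import math

def boost(w: float, gamma: float) -> float:
    """Assumed boost: b_gamma(w) = (1/gamma) * log(1 / (1 - exp(-gamma * w)))."""
    if w <= 0:
        return math.inf  # no work attained yet: maximal boost
    # -expm1(-x) evaluates 1 - exp(-x) accurately when x is small.
    return -math.log(-math.expm1(-gamma * w)) / gamma

def boosted_arrival(arrival: float, attained: float,
                    prefill_len: float, gamma: float) -> float:
    """Phi = a_i - b_gamma(max(w_i, s_i^pre)); the smallest Phi is served first."""
    effective_work = max(attained, prefill_len)
    return arrival - boost(effective_work, gamma)

def service_order(requests, gamma):
    """requests: iterable of (name, arrival, attained, prefill_len) tuples."""
    ranked = sorted(
        requests,
        key=lambda r: boosted_arrival(r[1], r[2], r[3], gamma),
    )
    return [name for name, *_ in ranked]

# Hypothetical workload: a long-running request and a later, barely-started one.
requests = [
    ("old_long", 0.0, 400.0, 100.0),
    ("new_short", 5.0, 0.0, 20.0),
]

print(service_order(requests, gamma=10.0))   # large gamma: arrival order (FCFS-like)
print(service_order(requests, gamma=0.001))  # small gamma: least effective work first
```

With a large $\gamma$ both boosts are effectively zero, so the older request keeps its place; with a small $\gamma$ the boost gap dominates the arrival gap and the newer request moves ahead.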
Phase 1 (DistBoost) treats prefill and decode as disjoint queues. It applies boost-based priority to prefill requests while serving decode requests in FCFS order. This is simple but susceptible to inter-queue Head-of-Line (HoL) blocking when long decode jobs delay short prefill requests.
Phase 2 merges prefill and decode into a single priority queue using the effective work metric $\tilde{w}_i(t)$. This allows cross-phase preemption and promotion, eliminating inter-queue HoL blocking while maintaining distribution-aware priority ordering.
MemGuard discretizes priority updates using geometric quantization aligned with KV block sizes. A request's priority is only reconsidered when its decoded token count crosses a threshold $k \cdot 2^{\lfloor \log_2(w_i/k) \rfloor}$, which bounds preemption opportunities to $O(\log S_i)$ per request and prevents KV-swap thrashing.
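The geometric milestones can be sketched as follows. This is a minimal illustration, assuming $k$ is the KV block size in tokens (the value 16 is hypothetical) and that token counts are integers; the function names are not from the original system.

```python
def quantized_work(w: int, k: int) -> int:
    """Largest milestone k * 2**j not exceeding w, i.e. k * 2**floor(log2(w / k)).

    Returns 0 while fewer than k tokens have been decoded.
    """
    if w < k:
        return 0
    # For integers, floor(log2(w / k)) equals bit_length(w // k) - 1.
    return k << ((w // k).bit_length() - 1)

def reconsideration_points(total_tokens: int, k: int) -> list[int]:
    """Token counts at which a request's priority would be re-evaluated."""
    points = []
    milestone = k
    while milestone <= total_tokens:
        points.append(milestone)
        milestone *= 2  # milestones are exponentially spaced
    return points

# A 4096-token decode with k = 16 crosses only 9 milestones,
# so the number of preemption opportunities grows like log2(S / k).
print(reconsideration_points(4096, 16))
print(quantized_work(100, 16))  # 64: the priority last changed at token 64
```

Between milestones the quantized work is constant, so the request's position in the queue cannot change because of its own progress, which is what limits KV-cache swaps.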
γ-Ada estimates the optimal boost parameter $\gamma$ online by fitting a log-linear slope to the observed response time tail in the $[t_{95}, t_{99}]$ band. This allows the scheduler to automatically adapt to workload distribution shifts and varying arrival rates without manual tuning.
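One way such a tail fit could look is sketched below. It assumes that $\gamma$ is read off as the negated least-squares slope of $\log P[T > t]$ against $t$ over the samples between the 95th and 99th percentiles; the text above only states that a log-linear slope is fitted, so this mapping, the function name, and the batch (rather than online) formulation are all assumptions.

```python
import math

def estimate_gamma(response_times, lo=0.95, hi=0.99):
    """Fit log(empirical survival) vs. t on the [t_lo, t_hi] band; return -slope."""
    ts = sorted(response_times)
    n = len(ts)
    # Empirical survival at the i-th order statistic: fraction of larger samples.
    points = [
        (t, math.log((n - 1 - i) / n))
        for i, t in enumerate(ts)
        if lo <= (i + 1) / n <= hi and i < n - 1
    ]
    if len(points) < 2:
        raise ValueError("not enough tail samples to fit a slope")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("tail samples are all identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    # An exponentially decaying tail exp(-gamma * t) has slope -gamma.
    return -sxy / sxx
```

In an online setting the same fit could be recomputed over a sliding window of recent completions; the window length would be a further tuning choice not covered here.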
Evaluated on Qwen-72B, Llama-3-70B, TPA-CP4, and CodeLlama-34B across reasoning, chat, and mixed workloads.
Across all three latency dimensions, UniBoost’s advantage widens as we move into the upper tail. The TTFT improvement is particularly striking: because UniBoost boosts newly-arrived requests before they accumulate service, short prefill jobs are never blocked behind long decode sequences. The TTLT gains reflect a deeper structural win — by keeping priority aligned with attained work rather than predicted length, UniBoost avoids the “attained-service inversion” that causes SRPT to trigger cascading KV-cache evictions under memory pressure. The result is a scheduler that is simultaneously better at the mean and at the tail, without any length oracle.
The ablation isolates the contribution of each design phase. Phase 1 (DistBoost) already improves over Sarathi for moderate loads, but its disjoint-queue design creates a new failure mode: long decode jobs block short prefill arrivals, causing the system to destabilize earlier. Phase 2’s unified priority space resolves this by allowing cross-phase preemption, shifting the saturation knee rightward. MemGuard’s geometric quantization then provides the decisive gain near saturation: by bounding preemption opportunities to $O(\log S_i)$ per request, it prevents the KV-swap thrashing that collapses P99 under tight memory. The final γ-Ada layer adds robustness to workload shifts without any manual tuning.
SLO attainment captures what operators actually care about: what fraction of requests meet a latency budget? The 2.9–8.7× improvement in achievable SLO threshold means that a cluster running UniBoost can commit to a latency guarantee that is nearly an order of magnitude tighter than what Sarathi or SJF can sustain at the same 99% attainment level. Equivalently, at a fixed SLO, UniBoost can serve significantly higher load before attainment drops below the target — directly translating to higher revenue per GPU.
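For concreteness, the two quantities in this paragraph can be computed as in the sketch below. The function names and the synthetic latencies are hypothetical; the snippet only illustrates the definitions and does not reproduce any measured result.

```python
def slo_attainment(latencies, budget):
    """Fraction of requests whose latency is within the budget."""
    return sum(1 for x in latencies if x <= budget) / len(latencies)

def tightest_slo(latencies, target=0.99):
    """Smallest latency budget that at least `target` of requests meet."""
    ordered = sorted(latencies)
    n = len(ordered)
    for i, budget in enumerate(ordered):
        # At least i + 1 requests have latency <= budget.
        if (i + 1) / n >= target:
            return budget
    return ordered[-1]

# Synthetic example: the ratio of tightest budgets at 99% attainment
# is the kind of "achievable SLO threshold" factor discussed above.
scheduler_a = [float(x) for x in range(1, 101)]
scheduler_b = [3.0 * x for x in scheduler_a]
print(tightest_slo(scheduler_b) / tightest_slo(scheduler_a))
```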
Preemption is the hidden cost of priority scheduling in LLM serving: every time a request is evicted, its KV-cache must be swapped to CPU memory and reloaded later, consuming PCIe bandwidth and stalling the GPU. SRPT’s aggressive size-based preemption triggers this cascade frequently, especially at high load (ρ = 0.99) where memory is scarce. MemGuard’s geometric quantization ensures that a request’s priority is only reconsidered at exponentially spaced token milestones, so the total number of preemption opportunities is logarithmic in the request length. The CDF confirms this: at ρ = 0.99, over 90% of UniBoost requests complete with zero restarts, compared to frequent multi-restart patterns under SRPT and Sarathi.
| Scheduler | TTLT Mean | TTLT P95 | TTLT P99 | TTFT Mean | TTFT P95 | TTFT P99 | TBT Mean | TBT P95 | TBT P99 | Throughput (tok/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| TRAIL+ (baseline) | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% | +0.0% |
| SJF | −11.1% | −12.5% | +11.2% | −74.6% | −70.5% | −9.7% | −5.2% | −7.1% | −6.3% | −7.9% |
| Sarathi | −12.3% | −13.2% | −7.6% | −49.9% | −40.5% | −8.1% | −6.8% | −8.4% | −7.9% | −10.9% |
| DistBoost | −14.0% | +6.0% | +37.1% | −22.8% | −35.0% | −3.4% | −14.1% | −10.2% | +18.3% | +1.1% |
| UniBoost (Ours) | −19.1% | +1.1% | +35.1% | +52.1% | +97.4% | +34.0% | −25.3% | −8.1% | +33.8% | +1.2% |
Compared to TRAIL+, the strongest published LLM scheduling baseline, UniBoost achieves a +35% improvement in P99 end-to-end latency and a remarkable +97% gain in P95 TTFT — meaning the 95th-percentile time-to-first-token is nearly halved. The mean TTLT trade-off (−19%) reflects UniBoost’s deliberate choice to prioritize tail over mean: by protecting short requests from convoy delays, some long requests wait slightly longer at the mean. Crucially, throughput is essentially unchanged (+1.2%), confirming that tail improvement comes from smarter scheduling, not from sacrificing utilization. TBT P99 improves by +34%, indicating that even the decode phase benefits from reduced memory pressure and fewer preemption-induced stalls.
UniBoost is designed to drop into existing LLM serving stacks. This section covers how it interacts with common system optimizations and outlines directions for future work.
Caching is outside the scope of this work, but UniBoost extends straightforwardly to prefix and radix caching. Caching primarily reduces effective prefill work: if a request has $h_i$ cached tokens (e.g., from a shared system prompt or an earlier multi-turn context), we replace the effective-work formula with $$\tilde{w}_i(t) = \max\!\bigl(w_i(t),\; s_i^{\text{pre}} - h_i\bigr).$$ This single substitution is sufficient — the UniBoost implementation and its theoretical validity remain unchanged. In practice, the cache-hit count $h_i$ is known at admission time (it is the output of the radix-cache lookup), so no additional estimation is required.
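A small worked example of the substitution, using hypothetical token counts and a hypothetical function name:

```python
def effective_work(attained: int, prefill_len: int, cached: int = 0) -> int:
    """Cache-aware effective work: max(w_i, s_i^pre - h_i)."""
    if not 0 <= cached <= prefill_len:
        raise ValueError("cached tokens must lie between 0 and the prefill length")
    return max(attained, prefill_len - cached)

# A 2000-token prompt with 1800 tokens already in the prefix cache
# is treated like a 200-token prefill until attained work exceeds that.
print(effective_work(attained=0, prefill_len=2000, cached=1800))    # 200
print(effective_work(attained=500, prefill_len=2000, cached=1800))  # 500
print(effective_work(attained=0, prefill_len=2000))                 # 2000
```

Because a smaller effective work means a larger boost, a request with a high cache-hit count is scheduled as if it were a short prefill.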
UniBoost already adopts continuous batching and chunked prefill by design. Prefill is treated as incremental service: each chunk advances the effective work $\tilde{w}_i(t)$ and therefore updates priority consistently with decode. The chunk size is typically set based on the desired TBT/TPOT SLO; the table below shows that UniBoost’s gains are robust across a wide range of chunk sizes.
| Chunk size | E2E Mean | E2E P95 | E2E P99 | TTFT Mean | TTFT P95 | TTFT P99 | TBT Mean | TBT P95 | TBT P99 | tok/s |
|---|---|---|---|---|---|---|---|---|---|---|
| 128 | −17.8% | +2.4% | +38.6% | +12.3% | +18.5% | +8.2% | −22.1% | −5.9% | +36.4% | +0.6% |
| 512 | −18.7% | +1.5% | +36.0% | +35.8% | +58.1% | +22.3% | −24.5% | −7.5% | +34.8% | +1.0% |
| 1024 (default) | −19.1% | +1.1% | +35.1% | +52.1% | +97.4% | +34.0% | −25.3% | −8.1% | +33.8% | +1.2% |
| 2048 | −19.4% | +0.8% | +34.5% | +68.4% | +93.6% | +45.1% | −26.0% | −8.6% | +33.2% | +1.3% |
In prefill-decode disaggregated deployments (e.g., Splitwise, Nexus), prefill and decode run on separate GPU pools. UniBoost applies independently within each pool: the prefill scheduler uses the prompt-length signal $s_i^{\text{pre}}$ to boost new arrivals, and the decode scheduler uses attained decode tokens $w_i^{\text{dec}}(t)$ as the effective work signal. The two schedulers share the same $\gamma$ parameter, which can be estimated jointly from the end-to-end latency tail observed at the global dispatcher.
Cluster-level routers decide which instance a request is sent to; UniBoost decides how to schedule requests within that instance. When the routing policy is approximately load-oblivious, the per-instance scheduling guarantees carry over directly. The table below reports the relative improvement of UniBoost over the vLLM baseline under three routing strategies on 64 H100 GPUs.
| Global Scheduler | Avg E2E | P90 E2E | P99 E2E | Avg TBT | P90 TBT | P99 TBT |
|---|---|---|---|---|---|---|
| Random | +2.1% | −3.7% | −13.9% | −68.8% | −17.0% | −69.5% |
| Round-robin | +2.7% | −0.9% | −13.5% | −48.0% | +5.5% | −60.5% |
| JSQ (Join-Shortest-Queue) | −23.0% | −17.2% | −23.5% | −73.5% | −52.6% | −77.5% |
The gains remain consistent across scales from 32× to 512×, indicating that UniBoost’s per-instance benefits are largely independent of cluster size.
| Scale | Avg E2E | P99 E2E | Avg TTFT | P99 TTFT |
|---|---|---|---|---|
| 32× | −2.3% | −13.9% | +8.4% | −8.7% |
| 64× | −2.8% | −24.2% | +3.6% | −10.3% |
| 128× | −1.9% | −12.7% | +7.8% | −11.9% |
| 512× | −0.4% | −23.2% | +1.2% | −13.5% |
UniBoost targets single-replica, single-model scheduling. Several important directions remain open. First, integrating the cache-aware effective-work formula and measuring the combined effect with radix caching is a natural next step. Second, extending UniBoost to multi-model and multi-replica deployments requires new theoretical tools for how per-instance tail guarantees compose under load-aware routing. Third, the MemGuard bound of $O(\log S_i)$ preemptions per request assumes fixed KV-block sizes; incorporating variable-length KV blocks (as in speculative decoding) would require a more refined analysis. Finally, developing formal guarantees that capture preemption and eviction costs in the M/G/k model remains an important open problem.
@inproceedings{li2026beyondprediction,
title = {Beyond Prediction: Tail-Aware Scheduling for LLM Inference},
author = {Yueying Li and Yuanfan Chen and Jiayang Chen and Esha Choukse
and Haoran Qiu and G. Edward Suh and Rodrigo Fonseca
and Ziv Scully and Udit Gupta},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026},
}