fix(inference): patch vLLM 0.22 O(B*L) sampler hot paths by joanvelja · Pull Request #2772 · PrimeIntellect-ai/prime-rl

joanvelja · 2026-06-11T08:14:25Z

Problem

Qwen3.5-family models ship with unusual recommended generation hyperparameters that vary by use case (see Best Practices on the model card).

Long-decode throughput was dominated by two Python-side costs that grow with generated length, rather than by model compute. On the target long-generation workload, the combination caused roughly a 5x decode slowdown; removing it recovers about 3x end-to-end decode throughput.

The main offender is vLLM 0.22 sampler penalties. With penalties enabled, every decode step rebuilds a padded [batch, output_len] tensor from Python lists. At B=128 and ~12k generated tokens, this alone was about 80 ms/step.
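For a sense of scale, the sketch below counts how many elements such a rebuild touches. It assumes every row has the same output length, and the function names are illustrative, not part of vLLM.

```python
def elements_per_step(batch_size: int, output_len: int) -> int:
    """Elements written when a padded [batch, output_len] tensor is rebuilt once."""
    return batch_size * output_len


def cumulative_elements(batch_size: int, final_len: int) -> int:
    """Total elements written if the tensor is rebuilt at every step 1..final_len."""
    # Sum of batch_size * t for t = 1..final_len (arithmetic series).
    return batch_size * final_len * (final_len + 1) // 2


print(elements_per_step(128, 12_000))    # 1536000
print(cumulative_elements(128, 12_000))  # 9216768000
```

The per-step cost is linear in output length, so the total over a whole generation grows quadratically with it.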

The second offender is thinking-token-budget accounting. It is the minor one: useful in my use case, but it matters less than the penalties, which the default parameters enable. Until the end marker is found, vLLM rescans the full generated output every step. That is also O(output_len) Python work inside the per-token loop.

Fixes

I haven't yet submitted patches to vLLM (past PRs of mine are still on hold, and who knows how long they'll take).
The patch I propose keeps the same semantics, but moves the hot path off per-token Python list reconstruction.

Capture the live InputBatch once, then build the penalties tensor from InputBatch.token_ids_cpu.

out_lens = np.fromiter(map(len, output_token_ids), np.int64, n)
max_len = int(out_lens.max())

buf, buf_idx = staging.get(n, max_len)
dst = buf.numpy()
dst.fill(vocab_size)

for i in range(n):
    length = out_lens[i]
    if length:
        start = num_prompt[i]
        dst[i, :length] = token_ids_cpu[i, start : start + length]

dst[dst == -1] = vocab_size
tensor = buf.to(device, non_blocking=True)

Make the fast path conditional on object identity with the live batch rows. If vLLM hands us combined/foreign rows, we do not guess; we fall back to upstream.

req_lists = input_batch.req_output_token_ids
if len(req_lists) < n:
    return None
for i in range(n):
    if output_token_ids[i] is not req_lists[i]:
        return None
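A minimal standalone sketch of why the check uses identity (`is`) rather than equality. The helper name `rows_match_live_batch` is hypothetical and returns a bool instead of `None` for readability. Two lists with equal contents are not necessarily the same row of the live batch, and comparing contents would itself be O(output_len) work.

```python
from typing import Optional, Sequence


def rows_match_live_batch(
    output_token_ids: Sequence[list],
    req_lists: Sequence[Optional[list]],
) -> bool:
    """True only if row i of the sampler input is the very same list object
    that the live batch holds for request i."""
    n = len(output_token_ids)
    if len(req_lists) < n:
        return False
    # `is` compares object identity in O(1); `==` would walk both lists.
    return all(output_token_ids[i] is req_lists[i] for i in range(n))


live = [[1, 2, 3], [4, 5]]
same_objects = [live[0], live[1]]
equal_copies = [list(live[0]), list(live[1])]

print(rows_match_live_batch(same_objects, live))  # True
print(rows_match_live_batch(equal_copies, live))  # False
```

When the check fails, the caller takes the upstream list-based path, so a false negative costs speed, not correctness.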

Keep token_ids_cpu authoritative under async scheduling by writing repaired sampled ids back into the numpy buffer when vLLM replaces trailing -1 placeholders.

req_output_token_ids[first_placeholder:] = new_ids
start = int(self.num_prompt_tokens[index]) + first_placeholder
self.token_ids_cpu[index, start : start + len(new_ids)] = new_ids
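The write-back can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the patched vLLM method: `token_row` stands in for one row of `token_ids_cpu`, the function name is hypothetical, and `new_ids` is assumed to cover exactly the trailing placeholders.

```python
from typing import List


def repair_trailing_placeholders(
    req_output_token_ids: List[int],
    token_row: List[int],
    num_prompt_tokens: int,
    new_ids: List[int],
) -> None:
    """Replace trailing -1 placeholders in the output list and mirror the
    repaired ids into the flat row (prompt tokens followed by output slots)."""
    # Locate the first of the trailing -1 placeholders.
    first_placeholder = len(req_output_token_ids)
    while first_placeholder > 0 and req_output_token_ids[first_placeholder - 1] == -1:
        first_placeholder -= 1

    # Same two writes as the excerpt: the Python list first, then the row buffer.
    req_output_token_ids[first_placeholder:] = new_ids
    start = num_prompt_tokens + first_placeholder
    token_row[start : start + len(new_ids)] = new_ids


output = [20, -1, -1]                # one real token, two async placeholders
row = [10, 11, 20, -1, -1, 0, 0, 0]  # 2 prompt tokens, then output slots
repair_trailing_placeholders(output, row, num_prompt_tokens=2, new_ids=[21, 22])
print(output)  # [20, 21, 22]
print(row)     # [10, 11, 20, 21, 22, 0, 0, 0]
```

Without the second write, the buffer would keep the stale -1 values and the fast path would read them as padding instead of real tokens.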

Replace full thinking-output rescans with a watermark scan over newly generated tokens, while still rescanning from scratch if output shrinks after spec rejection or KV-load discard.

hi = len(out)
while hi > 0 and out[hi - 1] == -1:
    hi -= 1
pos = state.get("_prime_scan_pos", 0)

lo = max(0, pos - (m - 1))
idx = find_last_in_window(out, self.think_end_token_ids, lo, hi)
state["end_thinking"] = idx if idx != -1 else -2
state["_prime_scan_pos"] = hi

The -2 sentinel means "scanned, not found". The original code only triggers the expensive full rescan on == -1; all downstream comparisons treat -2 like "not found" because they check > -1 / >= 0.
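A self-contained sketch of the watermark scan follows. `find_last_in_window` is not shown in this description, so the version here is an assumed implementation that returns the start index of the last match. The early return once the marker is found and the reset when the output shrinks are also assumptions about the surrounding logic.

```python
from typing import Dict, Sequence

NOT_SCANNED = -1         # upstream value that triggers a full rescan
SCANNED_NOT_FOUND = -2   # "scanned, not found"


def find_last_in_window(out: Sequence[int], pattern: Sequence[int], lo: int, hi: int) -> int:
    """Start index of the last occurrence of `pattern` inside out[lo:hi], or -1."""
    m = len(pattern)
    if m == 0:
        return -1
    for start in range(hi - m, lo - 1, -1):
        if list(out[start : start + m]) == list(pattern):
            return start
    return -1


def update_think_state(state: Dict[str, int], out: Sequence[int], end_ids: Sequence[int]) -> None:
    m = len(end_ids)
    hi = len(out)
    while hi > 0 and out[hi - 1] == -1:   # ignore trailing async placeholders
        hi -= 1
    pos = state.get("_prime_scan_pos", 0)
    if hi < pos:                          # output shrank: rescan from scratch
        pos = 0
        state["end_thinking"] = NOT_SCANNED
    if state.get("end_thinking", NOT_SCANNED) >= 0:
        state["_prime_scan_pos"] = hi     # marker already known; just advance
        return
    lo = max(0, pos - (m - 1))            # overlap catches a straddling marker
    idx = find_last_in_window(out, end_ids, lo, hi)
    state["end_thinking"] = idx if idx != -1 else SCANNED_NOT_FOUND
    state["_prime_scan_pos"] = hi


state: Dict[str, int] = {}
update_think_state(state, [1, 2, 7], [7, 8])
first = dict(state)
update_think_state(state, [1, 2, 7, 8, 5], [7, 8])
second = dict(state)
update_think_state(state, [1, 2, 7], [7, 8])  # shrink, e.g. after spec rejection
third = dict(state)
```

The `m - 1` overlap matters in the second call: the marker `[7, 8]` starts before the previous watermark at position 3, and scanning only the new tokens would miss it.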

Apply the patch through the vLLM general plugin so engine and worker processes all see the same behavior.

from prime_rl.inference.vllm.sampler_perf import apply_sampler_perf_patches

apply_sampler_perf_patches()
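vLLM discovers general plugins through the `vllm.general_plugins` entry-point group and runs them in each process it starts. The sketch below shows what such a registration function could look like with the version pin applied first. `register`, `ensure_pinned_version`, and `PINNED_VLLM_VERSION` are hypothetical names, and the body of `apply_sampler_perf_patches` is not shown in this description.

```python
from importlib import metadata

PINNED_VLLM_VERSION = "0.22.0"  # the version this PR says it is validated against


def ensure_pinned_version(installed: str, pinned: str = PINNED_VLLM_VERSION) -> None:
    """Raise instead of patching a vLLM build the vendored paths were not checked on."""
    if installed != pinned:
        raise RuntimeError(
            f"sampler perf patches are pinned to vLLM {pinned}, found {installed}"
        )


def register() -> None:
    """Hypothetical plugin entry point, listed under `vllm.general_plugins`."""
    ensure_pinned_version(metadata.version("vllm"))
    # Imported lazily so the version check runs before any patching code loads.
    from prime_rl.inference.vllm.sampler_perf import apply_sampler_perf_patches

    apply_sampler_perf_patches()
```

Strict string equality is the conservative choice in this sketch: a post-release such as `0.22.0.post1` would also be rejected until revalidated.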

Safety checks

Pinned to vLLM 0.22.0; other versions raise until the vendored paths are revalidated.
Fast penalties path falls back to upstream for unrecognized row objects.
Residual -1 placeholders are mapped to the pad bin, matching upstream masked_fill_(output_tokens_t == -1, vocab_size) behavior.
CPU-only staging avoids CUDA events; GPU staging uses double-buffered pinned memory and synchronizes before CPU overwrite.
Randomized equivalence tests compare the fast tensor builder and thinking-budget scan against upstream behavior.
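The equivalence property for the penalties tensor can be sketched in pure Python. This is a toy version, not the tests in this PR. `build_from_lists` mimics the upstream list-based padding, `build_from_buffer` mimics reading rows out of a `token_ids_cpu`-like buffer, and all names and sizes are illustrative.

```python
import random
from typing import List, Tuple


def build_from_lists(output_token_ids: List[List[int]], vocab_size: int) -> List[List[int]]:
    """Upstream-style: pad each list to max_len with vocab_size, map -1 to vocab_size."""
    max_len = max((len(r) for r in output_token_ids), default=0)
    return [
        [vocab_size if t == -1 else t for t in row] + [vocab_size] * (max_len - len(row))
        for row in output_token_ids
    ]


def build_from_buffer(rows, num_prompt, out_lens, vocab_size) -> List[List[int]]:
    """Fast-path-style: copy each row's output span out of the flat buffer."""
    max_len = max(out_lens, default=0)
    padded = []
    for row, start, length in zip(rows, num_prompt, out_lens):
        dst = [vocab_size] * max_len
        dst[:length] = row[start : start + length]
        padded.append([vocab_size if t == -1 else t for t in dst])
    return padded


def random_case(rng: random.Random, vocab_size: int) -> Tuple[list, list, list]:
    max_prompt, max_out = 5, 8
    outputs, rows, prompts = [], [], []
    for _ in range(4):
        prompt = [rng.randrange(vocab_size) for _ in range(rng.randint(1, max_prompt))]
        out = [rng.randrange(vocab_size) for _ in range(rng.randint(0, max_out))]
        if out and rng.random() < 0.3:
            out[-1] = -1  # residual async placeholder at the tail
        capacity = max_prompt + max_out
        rows.append(prompt + out + [0] * (capacity - len(prompt) - len(out)))
        outputs.append(out)
        prompts.append(len(prompt))
    return outputs, rows, prompts


def check_equivalence(trials: int = 200, seed: int = 0, vocab_size: int = 50) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        outputs, rows, prompts = random_case(rng, vocab_size)
        lens = [len(o) for o in outputs]
        if build_from_lists(outputs, vocab_size) != build_from_buffer(rows, prompts, lens, vocab_size):
            return False
    return True
```

Seeding the generator keeps a failing case reproducible, which matters when the two builders disagree only on rare shapes such as an empty output row.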

Validation

ruff check and ruff format via commit/push hooks.
uv run --no-sync python -m py_compile src/prime_rl/inference/vllm/sampler_perf.py tests/unit/inference/test_sampler_perf.py
Targeted pytest could not be run in this local macOS checkout because the lockfile supports Linux platforms only.

Note

Two pure-Python per-step costs collapse decode ~5x on long generations with Qwen3.5 default sampling (presence_penalty=1.5) + thinking_token_budget: penalties rebuild a padded [B, out_len] tensor from lists every step, and the thinking-budget holder rescans the whole output for </think> every step. Replace Sampler.apply_penalties with a vectorized numpy slice of InputBatch.token_ids_cpu (pinned double-buffered staging, identity-checked fallback to upstream for unrecognized rows), add async write-back so token_ids_cpu stays authoritative, and wrap _update_think_state with an incremental watermark scan (-2 sentinel skips upstream's full rescan). Equivalence-tested against upstream on randomized streams.

joanvelja marked this pull request as ready for review June 11, 2026 08:22

joanvelja wants to merge 1 commit into PrimeIntellect-ai:main from joanvelja:fix/vllm-sampler-hotpath-clean
