fix(inference): patch vLLM 0.22 O(B*L) sampler hot paths#2772

Open
joanvelja wants to merge 1 commit into
PrimeIntellect-ai:mainfrom
joanvelja:fix/vllm-sampler-hotpath-clean

@joanvelja joanvelja commented Jun 11, 2026

Problem

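Qwen3.5 family models ship with unusual recommended generation hyperparameters that vary by usage mode (see Best Practices in the model card). The defaults include presence_penalty=1.5, which turns on the sampler penalties path described below.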

Long decode throughput was being dominated by two Python-side costs that grow with generated length, not model compute. On the target long-generation workload, the combination caused roughly a 5x decode slowdown; removing it recovers about 3x end-to-end decode throughput.

The main offender is vLLM 0.22 sampler penalties. With penalties enabled, every decode step rebuilds a padded [batch, output_len] tensor from Python lists. At B=128 and ~12k generated tokens, this alone was about 80 ms/step.
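As a rough back-of-envelope check on those numbers (a sketch only: it treats "~12k" as exactly 12,000 and assumes the whole 80 ms goes to the rebuild), the per-element cost works out to tens of nanoseconds, which is the range one would expect from Python-level list handling rather than GPU work:

```python
# Back-of-envelope arithmetic from the figures quoted above.
# Assumption: "~12k generated tokens" is taken as exactly 12_000.
batch_size = 128
output_len = 12_000
step_cost_s = 0.080  # ~80 ms per decode step attributed to the rebuild

# Elements in the padded [batch, output_len] tensor rebuilt every step.
elements_per_step = batch_size * output_len

# Implied cost per element, in nanoseconds.
ns_per_element = step_cost_s / elements_per_step * 1e9

print(f"{elements_per_step:,} elements/step, ~{ns_per_element:.0f} ns/element")
```

Because the element count grows with every generated token, the per-step cost keeps rising over the course of a long decode.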

The second offender is thinking-token-budget accounting (a minor cost: it matters for my use case, but less than the penalties path that default params trigger). Until the end marker is found, vLLM rescans the full generated output every step. That is also O(output_len) Python work inside the per-token loop.

Fixes

I haven't yet submitted patches to vLLM (past PRs of mine are still on hold, and who knows how long they'll take).
The patch I propose keeps the same semantics, but moves the hot path off per-token Python list reconstruction.

  1. Capture the live InputBatch once, then build the penalties tensor from InputBatch.token_ids_cpu.
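# Per-row generated lengths, and the widest row in the batch.
out_lens = np.fromiter(map(len, output_token_ids), np.int64, n)
max_len = int(out_lens.max())

# Staging buffer, pre-filled with the pad bin (vocab_size).
buf, buf_idx = staging.get(n, max_len)
dst = buf.numpy()
dst.fill(vocab_size)

# Copy each row's generated tokens out of token_ids_cpu, skipping the prompt.
for i in range(n):
    length = out_lens[i]
    if length:
        start = num_prompt[i]
        dst[i, :length] = token_ids_cpu[i, start : start + length]

# Residual -1 placeholders go to the pad bin, as upstream does.
dst[dst == -1] = vocab_size
tensor = buf.to(device, non_blocking=True)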
  2. Make the fast path conditional on object identity with the live batch rows. If vLLM hands us combined/foreign rows, we do not guess; we fall back to upstream.
req_lists = input_batch.req_output_token_ids
if len(req_lists) < n:
    return None
for i in range(n):
    if output_token_ids[i] is not req_lists[i]:
        return None
  3. Keep token_ids_cpu authoritative under async scheduling by writing repaired sampled ids back into the numpy buffer when vLLM replaces trailing -1 placeholders.
req_output_token_ids[first_placeholder:] = new_ids
start = int(self.num_prompt_tokens[index]) + first_placeholder
self.token_ids_cpu[index, start : start + len(new_ids)] = new_ids
  4. Replace full thinking-output rescans with a watermark scan over newly generated tokens, while still rescanning from scratch if output shrinks after spec rejection or KV-load discard.
hi = len(out)
while hi > 0 and out[hi - 1] == -1:
    hi -= 1
pos = state.get("_prime_scan_pos", 0)

lo = max(0, pos - (m - 1))
idx = find_last_in_window(out, self.think_end_token_ids, lo, hi)
state["end_thinking"] = idx if idx != -1 else -2
state["_prime_scan_pos"] = hi

The -2 sentinel means "scanned, not found". The original code only triggers the expensive full rescan on == -1; all downstream comparisons treat -2 like "not found" because they check > -1 / >= 0.
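The watermark idea can be sketched in isolation as follows. This is a self-contained illustration, not the patched vLLM code: `update_think_state`, the `scan_pos` key, and this implementation of `find_last_in_window` are assumptions, as is the choice to return the marker's start index and to stop scanning once the marker has been found.

```python
def find_last_in_window(out, pattern, lo, hi):
    """Start index of the last occurrence of `pattern` in out[lo:hi], else -1."""
    m = len(pattern)
    for start in range(hi - m, lo - 1, -1):
        if out[start : start + m] == pattern:
            return start
    return -1


def update_think_state(state, out, think_end_token_ids):
    """Scan only tokens added since the last call (hypothetical helper)."""
    m = len(think_end_token_ids)

    # Ignore trailing -1 placeholders that have not been filled in yet.
    hi = len(out)
    while hi > 0 and out[hi - 1] == -1:
        hi -= 1

    pos = state.get("scan_pos", 0)
    if hi < pos:
        # Output shrank: the watermark is stale, so rescan from scratch.
        pos = 0
        state["end_thinking"] = -1

    if state.get("end_thinking", -1) >= 0:
        # Marker already located (assumed to need no further scanning).
        state["scan_pos"] = hi
        return

    # Back up m - 1 tokens so a marker straddling the watermark is still seen.
    lo = max(0, pos - (m - 1))
    idx = find_last_in_window(out, think_end_token_ids, lo, hi)
    state["end_thinking"] = idx if idx != -1 else -2  # -2: scanned, not found
    state["scan_pos"] = hi


# Usage: the two-token marker [7, 8] arrives split across two decode steps.
state = {}
END = [7, 8]
for out in ([1, 2, 3], [1, 2, 3, 7], [1, 2, 3, 7, 8, -1]):
    update_think_state(state, out, END)
```

Each call inspects at most the new tokens plus m - 1 tokens of overlap, so the per-step work no longer grows with the total output length.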

  5. Apply the patch through the vLLM general plugin so engine and worker processes all see the same behavior.
from prime_rl.inference.vllm.sampler_perf import apply_sampler_perf_patches

apply_sampler_perf_patches()
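Because a general plugin can be loaded in several processes, and possibly more than once in the same process, a patch hook is safest when it is idempotent and refuses versions it has not been validated against. A minimal sketch of such a guard follows; `apply_patches_guarded`, `SUPPORTED_VLLM_VERSION`, and `patch_fn` are hypothetical names for illustration and are not taken from `sampler_perf.py`.

```python
SUPPORTED_VLLM_VERSION = "0.22.0"  # the only version the vendored paths were checked against
_applied = False


def apply_patches_guarded(installed_version, patch_fn):
    """Run `patch_fn` at most once, and only on the exact supported version.

    Returns True if the patch was applied by this call, False if it had
    already been applied earlier in this process.
    """
    global _applied
    if installed_version != SUPPORTED_VLLM_VERSION:
        raise RuntimeError(
            f"sampler patches are pinned to vLLM {SUPPORTED_VLLM_VERSION}, "
            f"found {installed_version}; revalidate before enabling"
        )
    if _applied:
        return False
    patch_fn()
    _applied = True
    return True
```

Raising on a version mismatch, rather than silently skipping, makes an unpatched deployment visible instead of quietly falling back to the slow path.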

Safety checks

  • Pinned to vLLM 0.22.0; other versions raise until the vendored paths are revalidated.
  • Fast penalties path falls back to upstream for unrecognized row objects.
  • Residual -1 placeholders are mapped to the pad bin, matching upstream masked_fill_(output_tokens_t == -1, vocab_size) behavior.
  • CPU-only staging avoids CUDA events; GPU staging uses double-buffered pinned memory and synchronizes before CPU overwrite.
  • Randomized equivalence tests compare the fast tensor builder and thinking-budget scan against upstream behavior.
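For the tensor builder, a randomized equivalence check of the kind listed above might look like the sketch below. It uses plain NumPy only (no vLLM, no staging buffers), and `reference_padded`, `fast_padded`, and `random_case` are hypothetical stand-ins for the list-based upstream behavior and the buffer-slicing fast path, not the functions in the actual test file.

```python
import numpy as np


def reference_padded(output_token_ids, vocab_size):
    """List-based builder: pad with vocab_size, map -1 placeholders to the pad bin."""
    n = len(output_token_ids)
    max_len = max((len(row) for row in output_token_ids), default=0)
    ref = np.full((n, max_len), vocab_size, dtype=np.int64)
    for i, row in enumerate(output_token_ids):
        if row:
            ref[i, : len(row)] = row
    ref[ref == -1] = vocab_size
    return ref


def fast_padded(token_ids_cpu, num_prompt, out_lens, vocab_size):
    """Buffer-based builder: slice generated tokens out of a [n, max_total] array."""
    n = len(out_lens)
    max_len = int(out_lens.max()) if n else 0
    dst = np.full((n, max_len), vocab_size, dtype=np.int64)
    for i in range(n):
        length = int(out_lens[i])
        if length:
            start = int(num_prompt[i])
            dst[i, :length] = token_ids_cpu[i, start : start + length]
    dst[dst == -1] = vocab_size
    return dst


def random_case(rng, n=8, max_prompt=5, max_out=12, vocab_size=50):
    """Random batch whose list view and buffer view describe the same tokens."""
    num_prompt = rng.integers(0, max_prompt + 1, size=n)
    out_lens = rng.integers(0, max_out + 1, size=n)
    token_ids_cpu = np.zeros((n, max_prompt + max_out), dtype=np.int64)
    outputs = []
    for i in range(n):
        row = rng.integers(0, vocab_size, size=int(out_lens[i])).tolist()
        if row and rng.random() < 0.3:
            row[-1] = -1  # simulate an unresolved trailing placeholder
        outputs.append(row)
        p = int(num_prompt[i])
        token_ids_cpu[i, :p] = rng.integers(0, vocab_size, size=p)
        if row:
            token_ids_cpu[i, p : p + len(row)] = row
    return outputs, token_ids_cpu, num_prompt, out_lens


rng = np.random.default_rng(0)
mismatches = 0
for _ in range(200):
    outputs, buf, num_prompt, out_lens = random_case(rng)
    same = np.array_equal(
        reference_padded(outputs, 50), fast_padded(buf, num_prompt, out_lens, 50)
    )
    mismatches += 0 if same else 1
```

The cases deliberately include empty rows, zero-length prompts, and trailing -1 placeholders, since those are the edges where a sliced buffer and a list view are most likely to disagree.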

Validation

  • ruff check and ruff format via commit/push hooks.
  • uv run --no-sync python -m py_compile src/prime_rl/inference/vllm/sampler_perf.py tests/unit/inference/test_sampler_perf.py
  • Targeted pytest could not be run in this local macOS checkout because the lockfile supports Linux platforms only.


Two pure-Python per-step costs collapse decode ~5x on long generations with
Qwen3.5 default sampling (presence_penalty=1.5) + thinking_token_budget:
penalties rebuild a padded [B, out_len] tensor from lists every step, and the
thinking-budget holder rescans the whole output for </think> every step.

Replace Sampler.apply_penalties with a vectorized numpy slice of
InputBatch.token_ids_cpu (pinned double-buffered staging, identity-checked
fallback to upstream for unrecognized rows), add async write-back so
token_ids_cpu stays authoritative, and wrap _update_think_state with an
incremental watermark scan (-2 sentinel skips upstream's full rescan).
Equivalence-tested against upstream on randomized streams.
@joanvelja joanvelja marked this pull request as ready for review June 11, 2026 08:22