fix(inference): patch vLLM 0.22 O(B*L) sampler hot paths #2772
Open
joanvelja wants to merge 1 commit into
Two pure-Python per-step costs collapse decode ~5x on long generations with Qwen3.5 default sampling (presence_penalty=1.5) + thinking_token_budget: penalties rebuild a padded [B, out_len] tensor from lists every step, and the thinking-budget holder rescans the whole output for </think> every step. Replace Sampler.apply_penalties with a vectorized numpy slice of InputBatch.token_ids_cpu (pinned double-buffered staging, identity-checked fallback to upstream for unrecognized rows), add async write-back so token_ids_cpu stays authoritative, and wrap _update_think_state with an incremental watermark scan (-2 sentinel skips upstream's full rescan). Equivalence-tested against upstream on randomized streams.
Problem
Qwen3.5-family models recommend unusual generation hyperparameters that depend on the use case (see Best Practices on the model card).
Long decode throughput was being dominated by two Python-side costs that grow with generated length, not model compute. On the target long-generation workload, the combination caused roughly a 5x decode slowdown; removing it recovers about 3x end-to-end decode throughput.
The main offender is vLLM 0.22 sampler penalties. With penalties enabled, every decode step rebuilds a padded [batch, output_len] tensor from Python lists. At B=128 and ~12k generated tokens, this alone was about 80 ms/step.
The second offender is thinking-token-budget accounting (minor: useful in my use case, but less important than the default params). Until the end marker is found, vLLM rescans the full generated output every step. That is also O(output_len) Python work inside the per-token loop.
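To make the first cost concrete, here is a minimal sketch contrasting a per-step rebuild from Python lists with a vectorized gather out of a persistent [batch, max_len] numpy buffer. It is not the code in this PR and does not touch vLLM: the function names, the toy VOCAB_SIZE, and the buffer layout (prompt tokens followed by output tokens in each row) are assumptions made for illustration only.

```python
import numpy as np

VOCAB_SIZE = 8    # toy vocabulary; id VOCAB_SIZE serves as the pad bin (assumption)
PLACEHOLDER = -1  # sampled id that is not known yet


def padded_from_lists(output_ids, pad_id):
    """Rebuild the padded [B, out_len] array from Python lists (slow path)."""
    max_len = max((len(row) for row in output_ids), default=0)
    out = np.full((len(output_ids), max_len), pad_id, dtype=np.int64)
    for i, row in enumerate(output_ids):  # Python-level loop over every row
        out[i, : len(row)] = row
    out[out == PLACEHOLDER] = pad_id      # placeholders land in the pad bin
    return out


def padded_from_buffer(token_ids, prompt_lens, total_lens, pad_id):
    """Gather the output region of each row from a persistent buffer."""
    prompt_lens = np.asarray(prompt_lens)
    out_lens = np.asarray(total_lens) - prompt_lens
    max_out = int(out_lens.max()) if len(out_lens) else 0
    cols = np.arange(max_out)[None, :]
    valid = cols < out_lens[:, None]      # True where a real output token lives
    # Clamp so rows shorter than max_out never index past the buffer.
    src = np.minimum(prompt_lens[:, None] + cols, token_ids.shape[1] - 1)
    rows = np.arange(len(prompt_lens))[:, None]
    gathered = token_ids[rows, src]
    return np.where(valid & (gathered != PLACEHOLDER), gathered, pad_id)


# Toy batch: three requests with different prompt and output lengths.
outputs = [[1, 2, 3], [4], [5, PLACEHOLDER]]
prompt_lens = [2, 3, 1]
total_lens = [p + len(o) for p, o in zip(prompt_lens, outputs)]
buf = np.zeros((3, 10), dtype=np.int32)
for i, (p, o) in enumerate(zip(prompt_lens, outputs)):
    buf[i, p : p + len(o)] = o

slow = padded_from_lists(outputs, VOCAB_SIZE)
fast = padded_from_buffer(buf, prompt_lens, total_lens, VOCAB_SIZE)
print(np.array_equal(slow, fast))
```

Both paths still touch B * L elements, but the second does so inside numpy rather than in per-row Python code, which is the kind of saving the 80 ms/step figure points at.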
Fixes
I haven't yet submitted patches to vLLM (past PRs of mine are still on hold, and who knows how long they'll take).
The patch I propose keeps the same semantics, but moves the hot path off per-token Python list reconstruction.
- […] InputBatch once, then build the penalties tensor from InputBatch.token_ids_cpu.
- […] token_ids_cpu authoritative under async scheduling by writing repaired sampled ids back into the numpy buffer when vLLM replaces trailing -1 placeholders.
The -2 sentinel means "scanned, not found". The original code only triggers the expensive full rescan on == -1; all downstream comparisons treat -2 like "not found" because they check > -1 / >= 0.
Safety checks
- […] 0.22.0; other versions raise until the vendored paths are revalidated.
- The -1 placeholders are mapped to the pad bin, matching upstream masked_fill_(output_tokens_t == -1, vocab_size) behavior.
Validation
- ruff check and ruff format via commit/push hooks.
- uv run --no-sync python -m py_compile src/prime_rl/inference/vllm/sampler_perf.py tests/unit/inference/test_sampler_perf.py
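The -2 sentinel described under Fixes can be illustrated with a small incremental scanner. This is a sketch under stated assumptions, not the wrapper around _update_think_state from this PR: ThinkEndTracker, its fields, and the token ids are hypothetical, and the only behavior taken from the description is that -1 means "not scanned yet", -2 means "scanned, not found", and a non-negative value is the index of the end marker.

```python
NOT_SCANNED = -1        # upstream-style value that would trigger a full rescan
SCANNED_NOT_FOUND = -2  # everything up to the watermark was scanned, no marker


class ThinkEndTracker:
    """Hypothetical helper: find the end marker without rescanning old tokens."""

    def __init__(self, end_ids):
        self.end_ids = list(end_ids)  # token ids of the end marker
        self.watermark = 0            # number of output tokens already scanned
        self.end_index = NOT_SCANNED

    def update(self, output_ids):
        """Scan only tokens added since the previous call."""
        if self.end_index >= 0:
            return self.end_index     # marker already located
        n = len(self.end_ids)
        # Step back n - 1 tokens so a marker straddling the watermark is seen.
        start = max(0, self.watermark - (n - 1))
        for i in range(start, len(output_ids) - n + 1):
            if output_ids[i : i + n] == self.end_ids:
                self.end_index = i
                return i
        self.watermark = len(output_ids)
        self.end_index = SCANNED_NOT_FOUND
        return self.end_index


tracker = ThinkEndTracker([7, 9])          # two-token marker, ids are made up
first = tracker.update([1, 2, 7])          # marker incomplete so far
second = tracker.update([1, 2, 7, 9])      # marker completes across the watermark
print(first, second)
```

Because callers that test > -1 or >= 0 see -2 as "not found", the sentinel changes no downstream decision; it only avoids the == -1 branch that would rescan the whole output.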