perf: add llm decode metadata update fast path. #1294

Merged
RobbieLeung merged 1 commit into jd-opensource:main from RobbieLeung:feat/fast_graph
Apr 28, 2026

@RobbieLeung
Collaborator

  • add a decode-only fused metadata update kernel for ordinary LLM CUDA graph execution
  • reuse persistent kv seq len delta buffers and keep block_tables on the legacy copy path
  • add decode fast-path coverage and fallback equivalence tests
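
As a rough illustration of what the fallback equivalence check in the third bullet could look like, here is a host-side C++ sketch. Everything in it is an assumption made for illustration: the `DecodeMetadata` struct, its three fields, and the functions `make_persistent`, `update_legacy`, and `update_fused` are hypothetical and are not taken from the xllm sources. The legacy path is modeled as one copy per buffer, and the fused path as a single pass whose loop body stands in for the work one GPU thread might do per sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-step decode metadata (field names are illustrative only).
struct DecodeMetadata {
  std::vector<int32_t> tokens;
  std::vector<int32_t> positions;
  std::vector<int32_t> kv_seq_lens;
};

// Allocate zero-filled persistent buffers sized for the largest batch,
// standing in for the buffers a captured CUDA graph would read from.
DecodeMetadata make_persistent(std::size_t capacity) {
  return DecodeMetadata{std::vector<int32_t>(capacity, 0),
                        std::vector<int32_t>(capacity, 0),
                        std::vector<int32_t>(capacity, 0)};
}

// Legacy path (assumed shape): one separate copy per buffer.
// dst must be at least as large as src.
void update_legacy(const DecodeMetadata& src, DecodeMetadata& dst) {
  std::copy(src.tokens.begin(), src.tokens.end(), dst.tokens.begin());
  std::copy(src.positions.begin(), src.positions.end(), dst.positions.begin());
  std::copy(src.kv_seq_lens.begin(), src.kv_seq_lens.end(),
            dst.kv_seq_lens.begin());
}

// Fused path (assumed shape): a single pass writes every buffer.
// On a GPU, each loop iteration would map to one thread of one kernel.
void update_fused(const DecodeMetadata& src, DecodeMetadata& dst) {
  const std::size_t n = src.tokens.size();
  for (std::size_t i = 0; i < n; ++i) {
    dst.tokens[i] = src.tokens[i];
    dst.positions[i] = src.positions[i];
    dst.kv_seq_lens[i] = src.kv_seq_lens[i];
  }
}
```

In a real CUDA build the loop in `update_fused` would presumably become a kernel launch writing into device buffers captured by the graph. An equivalence test would then run both paths on the same input and compare the resulting buffers element by element.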

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This PR implements a CUDA-based fast path for updating LLM decode metadata, replacing standard copy operations to optimize persistent buffer management in the CUDA graph executor. Feedback primarily addresses style guide compliance, including renaming functions to snake_case, replacing magic numbers with named constants, annotating constant arguments, and removing an unused include.

Comment thread xllm/core/kernels/cuda/llm_decode_metadata_update.cu Outdated
Comment thread xllm/core/kernels/cuda/llm_decode_metadata_update.cu Outdated
Comment thread xllm/core/kernels/cuda/llm_decode_metadata_update.cu Outdated
Comment thread xllm/core/kernels/cuda/llm_decode_metadata_update.cu
Comment thread xllm/core/kernels/cuda/llm_decode_metadata_update.h Outdated
Comment thread xllm/core/runtime/cuda_graph_executor_impl.cpp Outdated
Comment thread xllm/core/runtime/cuda_graph_executor_test.cpp Outdated
Comment thread xllm/core/runtime/cuda_graph_executor_test.cpp Outdated
Comment thread xllm/core/runtime/cuda_graph_executor_test.cpp Outdated
@zhang-minchao
Collaborator

Could you add a timeline comparison from before and after enabling llm_decode_metadata_update.cu?

Comment thread xllm/core/runtime/cuda_graph_executor_impl.cpp Outdated
@RobbieLeung
Collaborator Author

RobbieLeung commented Apr 16, 2026

> Could you add a timeline comparison from before and after enabling llm_decode_metadata_update.cu?

I had an AI write a benchmark; it shows a speedup of roughly 2-2.6x.

@zhang-minchao
Collaborator

> Could you add a timeline comparison from before and after enabling llm_decode_metadata_update.cu?

> I had an AI write a benchmark; it shows a speedup of roughly 2-2.6x.

What I meant is that you could attach a screenshot of the GPU profiling timeline, which would show the effect more clearly.

@RobbieLeung RobbieLeung merged commit be14834 into jd-opensource:main Apr 28, 2026
31 of 41 checks passed
3 participants