docs: add SGLang HiCache L3 observability note #2603
Open
xzh25 wants to merge 1 commit into kvcache-ai:main from xzh25:codex/sglang-hicache-l3-observability
71 changes: 71 additions & 0 deletions
docs/source/performance/sglang-hicache-l3-aistudio-observability.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| # SGLang HiCache + Mooncake L3 Observability on AI Studio A800 | ||
|
|
||
| This note records a small, reproducible SGLang HiCache + Mooncake Store L3 | ||
| experiment on a single AI Studio A800 runtime. It is intended as an | ||
| observability and reproduction note, not as a performance-win benchmark. | ||
|
|
||
| ## Scope | ||
|
|
||
| - Platform: Baidu AI Studio A800 runtime. | ||
| - Model: Qwen3-0.6B. | ||
| - Runtime layout: single-node TCP-oriented setup. | ||
| - Workload: repeated-prefix requests with short output to emphasize TTFT. | ||
| - Goal: verify whether SGLang's Mooncake backend emits L3 write/read metrics and | ||
| whether Store reload beats a no-store baseline in this constrained setup. | ||
|
|
||
| ## Result Summary | ||
|
|
||
| | case | no-store p50 TTFT | store reload p50 TTFT | exists hit pages | get success pages | conclusion | | ||
| |---|---:|---:|---:|---:|---| | ||
| | p4096_c1_n32 | 44.815 ms | 45.955 ms | 3095 | 3095 | L3 read-back observed; reload was 2.544% slower | | ||
| | p8192_c1_n16 | 69.110 ms | 73.600 ms | 0 | 0 | no read-back hit; reload was 6.497% slower | | ||
|
|
||
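The percentages in the conclusion column can be recomputed from the two p50 TTFT columns. The following is a minimal sketch, assuming the slowdown is measured relative to the no-store baseline; the helper name `relative_slowdown_pct` is illustrative and not part of SGLang or Mooncake.

```python
def relative_slowdown_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percent by which the candidate is slower than the baseline.

    A negative value would mean the candidate is faster.
    """
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# (no-store p50 TTFT, store reload p50 TTFT) in ms, copied from the result table.
cases = {
    "p4096_c1_n32": (44.815, 45.955),
    "p8192_c1_n16": (69.110, 73.600),
}

slowdowns = {
    name: relative_slowdown_pct(no_store, store_reload)
    for name, (no_store, store_reload) in cases.items()
}

for name, pct in slowdowns.items():
    print(f"{name}: store reload is {pct:.3f}% slower than no-store")
```

Both values are positive, which matches the note's reading that Store reload did not beat the no-store baseline in either case.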
| A previous gapfill run also showed successful Store write-back metrics: | ||
|
|
||
| ```text | ||
| set_requested_pages = 13268 | ||
| set_success_pages = 13268 | ||
| ``` | ||
|
|
||
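Counters in this `key = value` form can be checked mechanically rather than by eye. The sketch below assumes the counters appear one per line exactly as in the block above; `parse_counters` is a hypothetical helper, not an SGLang or Mooncake API.

```python
import re

# Matches lines such as "set_requested_pages = 13268".
COUNTER_LINE = re.compile(r"^\s*(\w+)\s*=\s*(\d+)\s*$")


def parse_counters(text: str) -> dict[str, int]:
    """Collect integer counters from 'name = value' lines, ignoring other lines."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# Counter values copied from the gapfill run quoted above.
sample = """
set_requested_pages = 13268
set_success_pages = 13268
"""

counters = parse_counters(sample)
requested = counters["set_requested_pages"]
# Avoid dividing by zero when no write-back was requested at all.
set_success_ratio = counters["set_success_pages"] / requested if requested else None

print(counters, set_success_ratio)
```

A ratio of 1.0 here means every requested page was written back; it says nothing about whether those pages were later read.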
| ## Interpretation | ||
|
|
||
| The `p4096_c1_n32` case demonstrates that backend-level L3 read-back can be | ||
| observed through the `exists` and `get` counters (3095 pages each). It does not | ||
| demonstrate a latency improvement: p50 TTFT was 45.955 ms with Store reload | ||
| versus 44.815 ms without it. In this small-model, single-node setup, Store | ||
| reload added enough overhead that no-store remained faster. | ||
|
|
||
| This result is useful as a lower-bound diagnostic case: | ||
|
|
||
| - L3 write-back is observable. | ||
| - One L3 read-back case is observable. | ||
| - Positive TTFT or throughput gains require a stronger workload or topology, | ||
| such as larger models, longer shared prefixes, higher concurrency, multi-turn | ||
| histories, RAG-style repeated documents, or an independent storage process | ||
| that remains alive across SGLang restarts. | ||
|
|
||
| ## Follow-Up Work | ||
|
|
||
| 1. Rebuild or install a same-runtime `mooncake_client` for AI Studio Ubuntu | ||
| 20.04 to validate an independent persistent Store topology. | ||
| 2. Extend the matrix to larger models, longer prefixes, multi-turn workloads, | ||
| RAG workloads, and higher concurrency. | ||
| 3. Add first-class counters for L3 exists/get/set pages and success ratios in | ||
| SGLang/Mooncake integration logs so negative and positive results can be | ||
| interpreted without screenshots or ad hoc parsing. | ||
| 4. Validate prefix-aware eviction/orphan suffix cleanup under real Store | ||
| pressure before presenting it as an optimization. | ||
|
|
||
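As one possible shape for the counters proposed in item 3, the sketch below emits a single JSON line per case. Everything here is an assumption for illustration: the class `L3CaseCounters`, its field names, and the choice to define the read-back ratio as get-success pages over exists-hit pages are hypothetical and do not describe existing SGLang or Mooncake logging.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class L3CaseCounters:
    """Hypothetical per-case record of L3 read-back counters."""

    case: str
    exists_hit_pages: int
    get_success_pages: int

    def readback_ratio(self) -> Optional[float]:
        # Undefined when nothing was found in the Store, so report None.
        if self.exists_hit_pages == 0:
            return None
        return self.get_success_pages / self.exists_hit_pages

    def to_log_line(self) -> str:
        record = asdict(self)
        record["readback_ratio"] = self.readback_ratio()
        record["readback_observed"] = self.get_success_pages > 0
        return json.dumps(record, sort_keys=True)


# Counter values copied from the result summary table.
rows = [
    L3CaseCounters("p4096_c1_n32", exists_hit_pages=3095, get_success_pages=3095),
    L3CaseCounters("p8192_c1_n16", exists_hit_pages=0, get_success_pages=0),
]

log_lines = [row.to_log_line() for row in rows]
for line in log_lines:
    print(line)
```

Reporting `null` rather than `0.0` for the zero-hit case keeps "no pages were found" distinct from "pages were found but every get failed", which is the distinction a negative result needs.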
| ## Claim Boundary | ||
|
|
||
| Safe claim: | ||
|
|
||
| ```text | ||
| SGLang HiCache + Mooncake Store can produce observable L3 write-back and one | ||
| controlled read-back case on AI Studio A800. | ||
| ``` | ||
|
|
||
| Unsafe claim: | ||
|
|
||
| ```text | ||
| Mooncake Store improves TTFT or throughput in this setup. | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,11 +6,13 @@ Benchmarks evaluating Mooncake's integration with SGLang across PD disaggregatio | |
| |----------|----------|---------------| | ||
| | [PD Disaggregation Performance](../sglang-benchmark-results-v1) | SGLang PD disaggregation with Mooncake Transfer Engine | 1P1D PD disaggregation achieves approximately **30% lower ITL** while maintaining comparable throughput against two regular instances. | | ||
| | [HiCache with Mooncake Backend Benchmark](../sglang-hicache-benchmark-results-v1) | SGLang HiCache using Mooncake Store as L3 storage | Mooncake-backed HiCache improves prefill performance in multi-turn workloads by maintaining higher KV cache hit rates as conversation rounds grow. | | ||
| | [AI Studio A800 L3 Observability](../sglang-hicache-l3-aistudio-observability) | SGLang HiCache with Mooncake Store on a single AI Studio A800 runtime | L3 write-back and one read-back case were observed, but Store reload did not beat no-store in this constrained setup. | | ||
|
Collaborator
Should be |
||
|
|
||
| :::{toctree} | ||
| :maxdepth: 1 | ||
| :hidden: | ||
|
|
||
| ../sglang-benchmark-results-v1 | ||
| ../sglang-hicache-benchmark-results-v1 | ||
| ../sglang-hicache-l3-aistudio-observability | ||
| ::: | ||
Could you make the example more broadly applicable for wider use?