kep: add engine runtime API support #94
Signed-off-by: TzZtzt <trafalgarz@outlook.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ -0,0 +1,51 @@
title: KEP Template
Please update this file so that it stays in sync with the proposal.
cheyang left a comment
Please update kep.yaml so that it stays in sync with the proposal.
cheyang left a comment
Thanks for putting this together, @TrafalgarZZZ. The motivation is solid — topology awareness for PD disaggregation and dynamic LoRA lifecycle are real operational pain points. A few things need addressing before this KEP can move forward:
Blocking issues:
- KEP number mismatch: The document title says "KEP-92" but this is PR #94 and lives under keps/94-engine-runtime/. Please fix the title.
- kep.yaml is still the raw template: All metadata fields (authors, owning-sig, kep-number, creation-date, stage, milestone, etc.) are placeholders. This file should reflect the actual proposal state.
- Empty Test Plan: Unit Tests, Integration Tests, and E2E Tests sections are all blank. At minimum for a provisional KEP, outline what kinds of tests you expect (e.g., "controller unit tests for CRD reconciliation", "integration test for sidecar registration flow").
- Empty Alternatives section: What other designs were considered? For instance, why a sidecar + CRD approach rather than a DaemonSet-based agent, or annotation-driven injection, or extending the existing RBG controller directly?
Design gaps to address:
- Controller-to-sidecar communication protocol: The KEP shows the controller pointing at sidecars in the architecture diagram, but never specifies how they communicate. HTTP? gRPC? What port? How is it discovered?
- Failure and edge cases: What happens when:
  - The sidecar starts before the inference engine is ready (beyond the hardcoded 180s timeout)?
  - register() or unregister() fails (network partition, router pod is restarting)?
  - The sidecar crashes mid-operation: does the controller retry? Is there reconciliation?
- updateStrategy semantics: You mention updateStrategy: NoUpdate but don't define what strategies are available or what they mean. This needs at least a brief explanation.
- Sidecar lifecycle: How does the sidecar interact with Pod termination? The code shows signal handlers but the KEP doesn't describe the contract, e.g., is there a preStop hook? What ordering guarantees exist relative to the app container?
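To make the register()/unregister() failure question concrete, here is a minimal sketch of one possible answer: bounded retries with exponential backoff. It is an illustration only and is not taken from the PR. The names retry_with_backoff and flaky_register, and the attempt and delay values, are hypothetical. The KEP would still need to state what actually happens once retries are exhausted.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op() until it succeeds or the attempts are exhausted.

    The delay doubles after each failure and is capped at max_delay.
    If every attempt fails, the last exception is re-raised so the
    caller (for example a reconcile loop) can decide what to do next.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")


# Hypothetical stand-in for a sidecar's register() call: the router is
# unreachable for the first two attempts, then accepts the registration.
_calls = {"n": 0}


def flaky_register() -> str:
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("router not reachable")
    return "registered"


# Inject a recording "sleep" so the example runs instantly.
recorded_delays: List[float] = []
result = retry_with_backoff(flaky_register, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing sleep as a parameter keeps the backoff schedule testable without real waiting. Whether the sidecar retries locally like this or the controller reconciles from observed status is a design choice the KEP should spell out.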
Minor nits:
- Missing space: "ClusterEngineRuntimeProfilewill" and "InferenceEngineclass"
- Mixed punctuation styles (Chinese （） vs ASCII ()): pick one for consistency
- Files should end with a newline
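The last two nits are mechanical and could be caught automatically. The sketch below is a hypothetical helper, not part of the PR or of any existing tool in the repository. It assumes the "Chinese" punctuation in question is the full-width Unicode form (for example U+FF08 and U+FF09), and the set of characters it flags is an arbitrary choice.

```python
from typing import List

# Assumed set of full-width punctuation to flag; extend as needed.
FULLWIDTH_PUNCT = "（），：；"


def lint_text(text: str) -> List[str]:
    """Return human-readable problems found in a document's text."""
    problems: List[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Collect each distinct full-width character once per line.
        found = sorted({ch for ch in line if ch in FULLWIDTH_PUNCT})
        if found:
            problems.append(
                f"line {lineno}: full-width punctuation {''.join(found)}"
            )
    # An empty file is not reported; a non-empty one must end with "\n".
    if text and not text.endswith("\n"):
        problems.append("file does not end with a newline")
    return problems


sample = "Profile（cluster scoped）"
for problem in lint_text(sample):
    print(problem)
```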
This is a useful addition to the project but needs the above fleshed out before it's ready for implementable status. Happy to re-review once updated.
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
Ⅵ. Special notes for reviews
Checklist
make fmt.