Skip to content

kep: add engine runtime API support#94

Open
TrafalgarZZZ wants to merge 2 commits into
sgl-project:mainfrom
TrafalgarZZZ:doc/engine_runtime_kep
Open

kep: add engine runtime API support#94
TrafalgarZZZ wants to merge 2 commits into
sgl-project:mainfrom
TrafalgarZZZ:doc/engine_runtime_kep

Conversation

@TrafalgarZZZ

Copy link
Copy Markdown
Collaborator

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

Signed-off-by: TzZtzt <trafalgarz@outlook.com>
Signed-off-by: TzZtzt <trafalgarz@outlook.com>
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@@ -0,0 +1,51 @@
title: KEP Template

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this file to keep in track.

@cheyang cheyang self-requested a review November 20, 2025 02:59

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update kep.yaml to keep in track.

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together, @TrafalgarZZZ. The motivation is solid — topology awareness for PD disaggregation and dynamic LoRA lifecycle are real operational pain points. A few things need addressing before this KEP can move forward:

Blocking issues:

  1. KEP number mismatch: The document title says "KEP-92" but this is PR #94 and lives under keps/94-engine-runtime/. Please fix the title.

  2. kep.yaml is still the raw template: All metadata fields (authors, owning-sig, kep-number, creation-date, stage, milestone, etc.) are placeholders. This file should reflect the actual proposal state.

  3. Empty Test Plan: Unit Tests, Integration Tests, and E2E Tests sections are all blank. At minimum for a provisional KEP, outline what kinds of tests you expect (e.g., "controller unit tests for CRD reconciliation", "integration test for sidecar registration flow").

  4. Empty Alternatives section: What other designs were considered? For instance, why a sidecar + CRD approach rather than a DaemonSet-based agent, or annotation-driven injection, or extending the existing RBG controller directly?

Design gaps to address:

  1. Controller-to-sidecar communication protocol: The KEP shows the controller pointing at sidecars in the architecture diagram, but never specifies how they communicate. HTTP? gRPC? What port? How is it discovered?

  2. Failure and edge cases: What happens when:

    • The sidecar starts before the inference engine is ready (beyond the hardcoded 180s timeout)?
    • register() or unregister() fails (network partition, router pod is restarting)?
    • The sidecar crashes mid-operation — does the controller retry? Is there reconciliation?
  3. updateStrategy semantics: You mention updateStrategy: NoUpdate but don't define what strategies are available or what they mean. This needs at least a brief explanation.

  4. Sidecar lifecycle: How does the sidecar interact with Pod termination? The code shows signal handlers but the KEP doesn't describe the contract — e.g., is there a preStop hook? What ordering guarantees exist relative to the app container?

Minor nits:

  • Missing space: "ClusterEngineRuntimeProfilewill" and "InferenceEngineclass"
  • Mixed punctuation styles (Chinese () vs ASCII ()) — pick one for consistency
  • Files should end with a newline

This is a useful addition to the project but needs the above fleshed out before it's ready for implementable status. Happy to re-review once updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants