Skip to content

kep75 role deploy coordination#119

Open
bcfre wants to merge 1 commit into
sgl-project:mainfrom
bcfre:dev
Open

kep75 role deploy coordination#119
bcfre wants to merge 1 commit into
sgl-project:mainfrom
bcfre:dev

Conversation

@bcfre

@bcfre bcfre commented Dec 3, 2025

Copy link
Copy Markdown
Collaborator

Ⅰ. Motivation

Enables users to define cross-role deploy steps, ensuring that Pods across different roles maintain the desired ratio even when cluster resources are constrained. It also provides fine-grained scheduling control at the role level.

Ⅱ. Modifications

kep proposal

Ⅲ. Does this pull request fix one issue?

fixes #112

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Signed-off-by: 柏存 <guoxiongfeng.gxf@alibaba-inc.com>

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal. The overall direction of topology-aware coordinated scheduling for PD deployment makes sense, and the user stories are well motivated. However, the KEP needs several rounds of revision before it's ready for approval. Here's what I found:


Critical: Incomplete draft text

Motivation point 3 reads:

在特定场景下,PD扽里

This is clearly an unfinished Chinese sentence left in the document. Please complete or remove it.


API Design: Unexported fields won't serialize

Several struct fields use lowercase names, which means they won't be marshaled/unmarshaled in Go:

  • ClusterTopologySpec.topologyScope
  • ClusterTopologyStatus.validate
  • DeployTopologyConstraint.clusterTopologyName
  • DeployTopologyConstraint.constraintMode

These need to be exported (capitalized) with proper json: tags to work with the Kubernetes API machinery.


API Design: Missing JSON tags

In CoordinationRollingDeploy, both RoleDeployStep and DeployTopologyConstraint lack json:"..." struct tags. All serializable fields in CRD types need explicit JSON tags.


YAML example doesn't match Go struct

The YAML example has:

rollingDeploy:
  prefill: 4
  decode: 2
deployTopologyConstraint:
  ...

But the Go struct defines RoleDeployStep map[string]int as a field inside CoordinationRollingDeploy, and DeployTopologyConstraint is also nested inside it. The YAML should reflect the actual struct hierarchy. Also, podGroupPolicy: nil is not valid YAML for a Kubernetes manifest — just omit it.

Additionally, the pd-rollout-deploy coordination entry is missing the roles field that the Coordination struct requires.


Typo in CRD

CurrentToplogyLevel should be CurrentTopologyLevel (missing 'o').


Missing standard KEP sections

Compared to KEP-30 and the template, this KEP is missing:

  • Table of Contents
  • Release Signoff Checklist
  • Test Plan
  • Graduation Criteria
  • Implementation History
  • Drawbacks
  • Alternatives Considered

At minimum for a provisional KEP, the Test Plan and Graduation Criteria sections should be outlined.


kep.yaml: Graduation milestones

Alpha, beta, and stable are all set to v0.6.0. This doesn't follow the expected graduation pattern — typically a feature graduates across multiple releases.


Design gaps to address

  1. What happens if a batch fails to schedule? Is there a timeout? How does the controller detect deadlock?
  2. How does this interact with cluster autoscaler — if nodes are scaling up, does the controller wait?
  3. What's the concrete mechanism for determining "previous batch is successfully scheduled"? All pods Running? All pods Bound?
  4. The section numbering jumps to "6. Progressive Scheduling Deployment" without prior numbered sections — seems like content was reorganized without cleanup.

Summary: The concept is solid and addresses a real gap. The document needs cleanup (remove incomplete text, fix API definitions, align YAML examples with structs) and the addition of standard KEP sections before this can move forward.

@cheyang

cheyang commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

KEP-75 Design Review: Enhanced Topology-Based Multi-Role Coordinated Scheduling

Thanks for proposing this KEP. The motivation around topology-aware PD placement is clear and well-articulated. However, there are several issues that should be addressed before this design can move forward.

Critical Issues

  1. Incomplete content -- Line 14 has a truncated Chinese sentence in the Motivation section that needs to be completed or removed.
  2. Unexported struct fields -- Several Go struct fields (topologyScope, validate, clusterTopologyName, constraintMode) are unexported and will be silently dropped from the CRD schema. They need to be capitalized and given JSON tags.
  3. YAML/Go mismatch -- The example YAML does not reflect the actual Go struct nesting. Also missing the required roles field.

Design Concerns

  1. RollingDeploy vs CoordinationScaling -- The relationship between the new RollingDeploy strategy and the existing CoordinationScaling strategy needs clarification.
  2. ClusterTopology CRD -- Missing details on resource scope, RBAC, and relationship to existing Kubernetes topology mechanisms.
  3. Error handling -- No timeout or rollback mechanism specified for when a PodSchedulingGate batch gets stuck.

Minor

  • Typo: CurrentToplogyLevel should be CurrentTopologyLevel
  • Section numbering jumps to 6 without 1-5
  • Missing Test Plan and Alternatives sections
  • All graduation milestones target v0.6.0 (unusual)

Overall this is a well-motivated KEP with a solid core idea. Addressing the API correctness items and the design gaps would significantly strengthen the proposal.


2. **Network Topology Optimization**: In PD-separated deployment architectures, placing P and D instances that process the same request within close network topology domains(NVLink > RDMA > TCP) can reduce KV transmission latency and increase throughput. Placements across different network switches can reduce the available bandwidth for KV cache transfer by approximately 20% [[1](https://arxiv.org/pdf/2508.19559)].

3. 在特定场景下,PD扽里

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a truncated Chinese sentence: 3. 在特定场景下,PD扽里. This appears to be an incomplete thought. Please either complete this motivation item or remove it.

// If topologyScope is empty, index maps from smaller to larger network domains
Layers []TopologyLayer `json:"Layers"`
// Larger numbers indicate larger managed network domains with worse communication performance
topologyScope map[TopologyLayerName]int

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several fields use lowercase (unexported) identifiers which are invisible to encoding/json and the CRD code generator:

  • ClusterTopologySpec.topologyScope -> should be TopologyScope
  • ClusterTopologyStatus.validate -> should be Validate
  • DeployTopologyConstraint.clusterTopologyName -> should be ClusterTopologyName
  • DeployTopologyConstraint.constraintMode -> should be ConstraintMode

All of these also need json:"..." struct tags to be included in the CRD schema.

spec:
coordination:
- name: pd-rollout-deploy
strategy:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The YAML example places prefill: 4 and decode: 2 directly under rollingDeploy:, and deployTopologyConstraint as a sibling. But in the Go struct CoordinationRollingDeploy, these are RoleDeployStep map[string]int and DeployTopologyConstraint *DeployTopologyConstraint (nested fields). The YAML should nest values under roleDeployStep:. Also the roles field is missing from the pd-rollout-deploy coordination entry, but it is a required field in the Coordination struct.


// new field
// RollingDeploy defines the coordination strategies about rolling deploy.
RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"`

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codebase already has CoordinationScaling in CoordinationStrategy (with MaxSkew and Progression). This KEP adds CoordinationRollingDeploy as another field. Please clarify:

  • How does RollingDeploy differ from Scaling? Both handle progressive deployment.
  • Can they coexist on the same coordination entry?
  • The Risks section says rolling deploy and rolling update are mutually exclusive -- does the same apply to rolling deploy vs scaling?


### Cluster Topology Definition

- **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ClusterTopology CRD is introduced but needs more design context:

  • Is it cluster-scoped or namespaced?
  • What RBAC permissions does it require?
  • How does it relate to existing Kubernetes topology labels (topology.kubernetes.io/*)?
  • Should it be in the same API group as RBG or a separate one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Development Roadmap (v0.6.0)

3 participants