kep75 role deploy coordination by bcfre · Pull Request #119 · sgl-project/rbg

bcfre · 2025-12-03T02:28:48Z

Ⅰ. Motivation

Enables users to define cross-role deploy steps, ensuring that Pods across different roles maintain the desired ratio even when cluster resources are constrained. It also provides fine-grained scheduling control at the role level.

Ⅱ. Modifications

KEP proposal

Ⅲ. Does this pull request fix one issue?

fixes #112

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

Ⅵ. Special notes for reviews

Checklist

Format your code with make fmt.
Add unit tests or integration tests.
Update the documentation related to the change.

gemini-code-assist · 2025-12-03T02:28:51Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

codecov-commenter · 2025-12-03T02:35:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Signed-off-by: 柏存 <guoxiongfeng.gxf@alibaba-inc.com>

cheyang

Thanks for the proposal. The overall direction of topology-aware coordinated scheduling for PD deployment makes sense, and the user stories are well motivated. However, the KEP needs several rounds of revision before it's ready for approval. Here's what I found:

Critical: Incomplete draft text

Motivation point 3 reads:

在特定场景下，PD扽里

This is clearly an unfinished Chinese sentence left in the document (roughly "In certain scenarios, PD ...", after which it breaks off). Please complete or remove it.

API Design: Unexported fields won't serialize

Several struct fields use lowercase names, so they are unexported in Go and encoding/json will neither marshal nor unmarshal them:

ClusterTopologySpec.topologyScope
ClusterTopologyStatus.validate
DeployTopologyConstraint.clusterTopologyName
DeployTopologyConstraint.constraintMode

These need to be exported (capitalized) with proper json: tags to work with the Kubernetes API machinery.
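To make the failure mode concrete, here is a minimal sketch using only the standard library. The two struct types are simplified stand-ins for ClusterTopologySpec (the real field types in the KEP differ), and the JSON key names are assumptions for illustration, not the KEP's final API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unexportedSpec is a hypothetical, simplified stand-in for the KEP's
// ClusterTopologySpec as currently written: topologyScope is lowercase.
type unexportedSpec struct {
	Layers        []string `json:"layers"`
	topologyScope map[string]int
}

// exportedSpec shows the suggested fix: exported field plus a json tag.
// The tag name "topologyScope" is an assumption.
type exportedSpec struct {
	Layers        []string       `json:"layers"`
	TopologyScope map[string]int `json:"topologyScope,omitempty"`
}

// marshalUnexported serializes a value whose lowercase field is populated.
func marshalUnexported() string {
	b, err := json.Marshal(unexportedSpec{
		Layers:        []string{"rack"},
		topologyScope: map[string]int{"rack": 1},
	})
	if err != nil {
		panic(err)
	}
	return string(b)
}

// marshalExported serializes the same data through the exported field.
func marshalExported() string {
	b, err := json.Marshal(exportedSpec{
		Layers:        []string{"rack"},
		TopologyScope: map[string]int{"rack": 1},
	})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshalUnexported()) // {"layers":["rack"]}
	fmt.Println(marshalExported())   // {"layers":["rack"],"topologyScope":{"rack":1}}
}
```

The first call drops topologyScope without any error, which is why this kind of bug tends to surface only when a user's setting is silently ignored.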

API Design: Missing JSON tags

In CoordinationRollingDeploy, both RoleDeployStep and DeployTopologyConstraint lack json:"..." struct tags. All serializable fields in CRD types need explicit JSON tags.

YAML example doesn't match Go struct

The YAML example has:

rollingDeploy:
  prefill: 4
  decode: 2
deployTopologyConstraint:
  ...

But the Go struct defines RoleDeployStep map[string]int as a field inside CoordinationRollingDeploy, and DeployTopologyConstraint is also nested inside it. The YAML should reflect the actual struct hierarchy. Also, podGroupPolicy: nil does not do what it suggests in a Kubernetes manifest: YAML reads nil as the plain string "nil", not as null. Just omit the field.

Additionally, the pd-rollout-deploy coordination entry is missing the roles field that the Coordination struct requires.
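The mismatch can be checked mechanically. The sketch below is an illustration under assumptions: it uses encoding/json instead of a YAML library to stay within the standard library, the key roleDeployStep follows the suggestion in this review, and the other JSON key names and the string types of the constraint fields are guesses rather than the KEP's definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// DeployTopologyConstraint is a simplified stand-in; field types are assumed.
type DeployTopologyConstraint struct {
	ClusterTopologyName string `json:"clusterTopologyName,omitempty"`
	ConstraintMode      string `json:"constraintMode,omitempty"`
}

// CoordinationRollingDeploy nests both fields, as the Go struct in the KEP does.
type CoordinationRollingDeploy struct {
	RoleDeployStep           map[string]int            `json:"roleDeployStep,omitempty"`
	DeployTopologyConstraint *DeployTopologyConstraint `json:"deployTopologyConstraint,omitempty"`
}

// flatDoc mirrors the KEP's example: role counts directly under rollingDeploy.
const flatDoc = `{"prefill": 4, "decode": 2}`

// nestedDoc places the counts under roleDeployStep, matching the struct.
const nestedDoc = `{"roleDeployStep": {"prefill": 4, "decode": 2}}`

// decodeStrict rejects keys that have no matching struct field.
func decodeStrict(doc string) (CoordinationRollingDeploy, error) {
	var out CoordinationRollingDeploy
	dec := json.NewDecoder(strings.NewReader(doc))
	dec.DisallowUnknownFields()
	err := dec.Decode(&out)
	return out, err
}

func main() {
	_, err := decodeStrict(flatDoc)
	fmt.Println("flat shape rejected:", err != nil)

	rd, err := decodeStrict(nestedDoc)
	fmt.Println(rd.RoleDeployStep["prefill"], rd.RoleDeployStep["decode"], err)
}
```

Without DisallowUnknownFields the flat document decodes without error and leaves RoleDeployStep nil, so a lenient decoder would hide the problem rather than report it.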

Typo in CRD

CurrentToplogyLevel should be CurrentTopologyLevel (missing 'o').

Missing standard KEP sections

Compared to KEP-30 and the template, this KEP is missing:

Table of Contents
Release Signoff Checklist
Test Plan
Graduation Criteria
Implementation History
Drawbacks
Alternatives Considered

At minimum for a provisional KEP, the Test Plan and Graduation Criteria sections should be outlined.

kep.yaml: Graduation milestones

Alpha, beta, and stable are all set to v0.6.0. This doesn't follow the expected graduation pattern — typically a feature graduates across multiple releases.

Design gaps to address

What happens if a batch fails to schedule? Is there a timeout? How does the controller detect deadlock?
How does this interact with cluster autoscaler — if nodes are scaling up, does the controller wait?
What's the concrete mechanism for determining "previous batch is successfully scheduled"? All pods Running? All pods Bound?
The section numbering jumps to "6. Progressive Scheduling Deployment" without prior numbered sections — seems like content was reorganized without cleanup.

Summary: The concept is solid and addresses a real gap. The document needs cleanup (remove incomplete text, fix API definitions, align YAML examples with structs) and the addition of standard KEP sections before this can move forward.

cheyang · 2026-06-12T10:50:21Z

KEP-75 Design Review: Enhanced Topology-Based Multi-Role Coordinated Scheduling

Thanks for proposing this KEP. The motivation around topology-aware PD placement is clear and well-articulated. However, there are several issues that should be addressed before this design can move forward.

Critical Issues

Incomplete content -- Line 14 has a truncated Chinese sentence in the Motivation section that needs to be completed or removed.
Unexported struct fields -- Several Go struct fields (topologyScope, validate, clusterTopologyName, constraintMode) are unexported and will be silently dropped from the CRD schema. They need to be capitalized and given JSON tags.
YAML/Go mismatch -- The example YAML does not reflect the actual Go struct nesting. Also missing the required roles field.

Design Concerns

RollingDeploy vs CoordinationScaling -- The relationship between the new RollingDeploy strategy and the existing CoordinationScaling strategy needs clarification.
ClusterTopology CRD -- Missing details on resource scope, RBAC, and relationship to existing Kubernetes topology mechanisms.
Error handling -- No timeout or rollback mechanism specified for when a PodSchedulingGate batch gets stuck.

Minor

Typo: CurrentToplogyLevel should be CurrentTopologyLevel
Section numbering jumps to 6 without 1-5
Missing Test Plan and Alternatives sections
All graduation milestones target v0.6.0 (unusual)

Overall this is a well-motivated KEP with a solid core idea. Addressing the API correctness items and the design gaps would significantly strengthen the proposal.

cheyang · 2026-06-12T10:50:23Z

+
+2. **Network Topology Optimization**: In PD-separated deployment architectures, placing P and D instances that process the same request within close network topology domains(NVLink > RDMA > TCP) can reduce KV transmission latency and increase throughput. Placements across different network switches can reduce the available bandwidth for KV cache transfer by approximately 20% [[1](https://arxiv.org/pdf/2508.19559)].
+
+3. 在特定场景下，PD扽里


There is a truncated Chinese sentence: 3. 在特定场景下，PD扽里. This appears to be an incomplete thought. Please either complete this motivation item or remove it.

cheyang · 2026-06-12T10:50:25Z

+	// If topologyScope is empty, index maps from smaller to larger network domains
+	Layers []TopologyLayer `json:"Layers"`
+	// Larger numbers indicate larger managed network domains with worse communication performance
+	topologyScope map[TopologyLayerName]int


Several fields use lowercase (unexported) identifiers which are invisible to encoding/json and the CRD code generator:

ClusterTopologySpec.topologyScope -> should be TopologyScope
ClusterTopologyStatus.validate -> should be Validate
DeployTopologyConstraint.clusterTopologyName -> should be ClusterTopologyName
DeployTopologyConstraint.constraintMode -> should be ConstraintMode

All of these also need json:"..." struct tags to be included in the CRD schema.

cheyang · 2026-06-12T10:50:26Z

+spec:
+  coordination:
+    - name: pd-rollout-deploy
+      strategy:


The YAML example places prefill: 4 and decode: 2 directly under rollingDeploy:, and deployTopologyConstraint as a sibling. But in the Go struct CoordinationRollingDeploy, these are RoleDeployStep map[string]int and DeployTopologyConstraint *DeployTopologyConstraint (nested fields). The YAML should nest values under roleDeployStep:. Also the roles field is missing from the pd-rollout-deploy coordination entry, but it is a required field in the Coordination struct.

cheyang · 2026-06-12T10:50:29Z

+
+	// new field
+	// RollingDeploy defines the coordination strategies about rolling deploy.
+	RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"`


The codebase already has CoordinationScaling in CoordinationStrategy (with MaxSkew and Progression). This KEP adds CoordinationRollingDeploy as another field. Please clarify:

How does RollingDeploy differ from Scaling? Both handle progressive deployment.

Can they coexist on the same coordination entry?

The Risks section says rolling deploy and rolling update are mutually exclusive -- does the same apply to rolling deploy vs scaling?
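If the answer turns out to be that the two strategies must not be combined, the rule should be enforced rather than only documented. The following is a sketch under that assumption only. The Scaling field name and its JSON tag are guesses, the member structs are left empty on purpose, and validateStrategy is a hypothetical helper, not an existing function in the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// CoordinationScaling stands in for the existing type; its fields
// (MaxSkew, Progression) are omitted here.
type CoordinationScaling struct{}

// CoordinationRollingDeploy stands in for the type added by this KEP.
type CoordinationRollingDeploy struct{}

// CoordinationStrategy holds both strategies. The Scaling field name is assumed.
type CoordinationStrategy struct {
	Scaling       *CoordinationScaling       `json:"scaling,omitempty"`
	RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"`
}

// errConflictingStrategies is returned when both strategies are set.
var errConflictingStrategies = errors.New("scaling and rollingDeploy must not both be set on one coordination entry")

// validateStrategy enforces mutual exclusion, assuming that is the intended rule.
func validateStrategy(s CoordinationStrategy) error {
	if s.Scaling != nil && s.RollingDeploy != nil {
		return errConflictingStrategies
	}
	return nil
}

func main() {
	both := CoordinationStrategy{
		Scaling:       &CoordinationScaling{},
		RollingDeploy: &CoordinationRollingDeploy{},
	}
	fmt.Println(validateStrategy(both))

	onlyDeploy := CoordinationStrategy{RollingDeploy: &CoordinationRollingDeploy{}}
	fmt.Println(validateStrategy(onlyDeploy))
}
```

If the two are meant to coexist instead, the KEP should describe which one takes precedence when both apply to the same roles.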

cheyang · 2026-06-12T10:50:30Z

+
+### Cluster Topology Definition
+
+- **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy.


The new ClusterTopology CRD is introduced but needs more design context:

Is it cluster-scoped or namespaced?

What RBAC permissions does it require?

How does it relate to existing Kubernetes topology labels (topology.kubernetes.io/*)?

Should it be in the same API group as RBG or a separate one?

bcfre force-pushed the dev branch from 13db1f1 to 2f42cdd on December 10, 2025, 02:54

kep75 topology based role coordinated scheduler

2f3198a

Signed-off-by: 柏存 <guoxiongfeng.gxf@alibaba-inc.com>

bcfre force-pushed the dev branch from 2f42cdd to 2f3198a on December 30, 2025, 03:15

cheyang reviewed Jun 11, 2026

cheyang reviewed Jun 12, 2026

bcfre wants to merge 1 commit into sgl-project:main from bcfre:dev
