kep75 role deploy coordination#119
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Signed-off-by: 柏存 <guoxiongfeng.gxf@alibaba-inc.com>
cheyang
left a comment
There was a problem hiding this comment.
Thanks for the proposal. The overall direction of topology-aware coordinated scheduling for PD deployment makes sense, and the user stories are well motivated. However, the KEP needs several rounds of revision before it's ready for approval. Here's what I found:
Critical: Incomplete draft text
Motivation point 3 reads:
在特定场景下,PD扽里
This is clearly an unfinished Chinese sentence left in the document. Please complete or remove it.
API Design: Unexported fields won't serialize
Several struct fields use lowercase names, which means they won't be marshaled/unmarshaled in Go:
ClusterTopologySpec.topologyScopeClusterTopologyStatus.validateDeployTopologyConstraint.clusterTopologyNameDeployTopologyConstraint.constraintMode
These need to be exported (capitalized) with proper json: tags to work with the Kubernetes API machinery.
API Design: Missing JSON tags
In CoordinationRollingDeploy, both RoleDeployStep and DeployTopologyConstraint lack json:"..." struct tags. All serializable fields in CRD types need explicit JSON tags.
YAML example doesn't match Go struct
The YAML example has:
rollingDeploy:
prefill: 4
decode: 2
deployTopologyConstraint:
...But the Go struct defines RoleDeployStep map[string]int as a field inside CoordinationRollingDeploy, and DeployTopologyConstraint is also nested inside it. The YAML should reflect the actual struct hierarchy. Also, podGroupPolicy: nil is not valid YAML for a Kubernetes manifest — just omit it.
Additionally, the pd-rollout-deploy coordination entry is missing the roles field that the Coordination struct requires.
Typo in CRD
CurrentToplogyLevel should be CurrentTopologyLevel (missing 'o').
Missing standard KEP sections
Compared to KEP-30 and the template, this KEP is missing:
- Table of Contents
- Release Signoff Checklist
- Test Plan
- Graduation Criteria
- Implementation History
- Drawbacks
- Alternatives Considered
At minimum for a provisional KEP, the Test Plan and Graduation Criteria sections should be outlined.
kep.yaml: Graduation milestones
Alpha, beta, and stable are all set to v0.6.0. This doesn't follow the expected graduation pattern — typically a feature graduates across multiple releases.
Design gaps to address
- What happens if a batch fails to schedule? Is there a timeout? How does the controller detect deadlock?
- How does this interact with cluster autoscaler — if nodes are scaling up, does the controller wait?
- What's the concrete mechanism for determining "previous batch is successfully scheduled"? All pods Running? All pods Bound?
- The section numbering jumps to "6. Progressive Scheduling Deployment" without prior numbered sections — seems like content was reorganized without cleanup.
Summary: The concept is solid and addresses a real gap. The document needs cleanup (remove incomplete text, fix API definitions, align YAML examples with structs) and the addition of standard KEP sections before this can move forward.
KEP-75 Design Review: Enhanced Topology-Based Multi-Role Coordinated SchedulingThanks for proposing this KEP. The motivation around topology-aware PD placement is clear and well-articulated. However, there are several issues that should be addressed before this design can move forward. Critical Issues
Design Concerns
Minor
Overall this is a well-motivated KEP with a solid core idea. Addressing the API correctness items and the design gaps would significantly strengthen the proposal. |
|
|
||
| 2. **Network Topology Optimization**: In PD-separated deployment architectures, placing P and D instances that process the same request within close network topology domains(NVLink > RDMA > TCP) can reduce KV transmission latency and increase throughput. Placements across different network switches can reduce the available bandwidth for KV cache transfer by approximately 20% [[1](https://arxiv.org/pdf/2508.19559)]. | ||
|
|
||
| 3. 在特定场景下,PD扽里 |
There was a problem hiding this comment.
There is a truncated Chinese sentence: 3. 在特定场景下,PD扽里. This appears to be an incomplete thought. Please either complete this motivation item or remove it.
| // If topologyScope is empty, index maps from smaller to larger network domains | ||
| Layers []TopologyLayer `json:"Layers"` | ||
| // Larger numbers indicate larger managed network domains with worse communication performance | ||
| topologyScope map[TopologyLayerName]int |
There was a problem hiding this comment.
Several fields use lowercase (unexported) identifiers which are invisible to encoding/json and the CRD code generator:
ClusterTopologySpec.topologyScope-> should beTopologyScopeClusterTopologyStatus.validate-> should beValidateDeployTopologyConstraint.clusterTopologyName-> should beClusterTopologyNameDeployTopologyConstraint.constraintMode-> should beConstraintMode
All of these also need json:"..." struct tags to be included in the CRD schema.
| spec: | ||
| coordination: | ||
| - name: pd-rollout-deploy | ||
| strategy: |
There was a problem hiding this comment.
The YAML example places prefill: 4 and decode: 2 directly under rollingDeploy:, and deployTopologyConstraint as a sibling. But in the Go struct CoordinationRollingDeploy, these are RoleDeployStep map[string]int and DeployTopologyConstraint *DeployTopologyConstraint (nested fields). The YAML should nest values under roleDeployStep:. Also the roles field is missing from the pd-rollout-deploy coordination entry, but it is a required field in the Coordination struct.
|
|
||
| // new field | ||
| // RollingDeploy defines the coordination strategies about rolling deploy. | ||
| RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"` |
There was a problem hiding this comment.
The codebase already has CoordinationScaling in CoordinationStrategy (with MaxSkew and Progression). This KEP adds CoordinationRollingDeploy as another field. Please clarify:
- How does
RollingDeploydiffer fromScaling? Both handle progressive deployment. - Can they coexist on the same coordination entry?
- The Risks section says rolling deploy and rolling update are mutually exclusive -- does the same apply to rolling deploy vs scaling?
|
|
||
| ### Cluster Topology Definition | ||
|
|
||
| - **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy. |
There was a problem hiding this comment.
The new ClusterTopology CRD is introduced but needs more design context:
- Is it cluster-scoped or namespaced?
- What RBAC permissions does it require?
- How does it relate to existing Kubernetes topology labels (
topology.kubernetes.io/*)? - Should it be in the same API group as RBG or a separate one?
Ⅰ. Motivation
Enables users to define cross-role deploy steps, ensuring that Pods across different roles maintain the desired ratio even when cluster resources are constrained. It also provides fine-grained scheduling control at the role level.
Ⅱ. Modifications
kep proposal
Ⅲ. Does this pull request fix one issue?
fixes #112
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.