
feat: Merge Decouple PodGroup KEP (#5832) into Gang Scheduling (#4671) #5980

Open

helayoty wants to merge 1 commit into kubernetes:master from helayoty:helayoty/merge-podgroup-with-workload

Conversation

@helayoty
Member

  • One-line PR description:

Consolidates KEP-5832 (Decouple PodGroup API) into KEP-4671 (Gang Scheduling) so that the decoupled PodGroup design lives in a single, self-contained KEP rather than being split across two documents.

  • Issue link:
  • Other comments:
    KEP-5832 was created as a companion to KEP-4671 to detail the PodGroup decoupling design. Maintaining two separate KEPs for what is effectively one feature creates confusion for reviewers and implementers. Merging them produces a single authoritative document that is easier to review, approve, and track through the enhancement process.

/sig scheduling
/area workload-aware

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. labels Mar 26, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 26, 2026
@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko March 26, 2026 21:56
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 26, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 26, 2026
@helayoty
Member Author

/assign @wojtek-t

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Mar 26, 2026
@helayoty helayoty moved this from Backlog to Needs Review in Workload-aware & Topology-aware Workstream Mar 26, 2026
Member

@wojtek-t wojtek-t left a comment

A few small comments from me.

owning-sig: sig-scheduling
participating-sigs:
- sig-apps
- sig-api-machinery
Member

Why sig api-machinery?

Member Author

This was feedback during the PodGroup KEP review, since it's a new API.

Member

No - api-machinery does not own APIs.
api-machinery owns the machinery for exposing APIs, which this KEP doesn't influence or change.
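Following this resolution, the kep.yaml metadata would presumably keep only sig-apps as a participating SIG. A sketch of the resulting fragment:

```yaml
owning-sig: sig-scheduling
participating-sigs:
- sig-apps
```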


see-also:
- "/keps/sig-scheduling/583-coscheduling"
- "/keps/sig-scheduling/5832-decouple-podgroup-api"
Member

Let's not replace, but add a new entry.

Contributor

Sorry, but aren't we going to remove the 5832-decouple-podgroup-api ?

Member

Good question - I thought we should mark it as abandoned?
But maybe we should indeed remove it.

Member Author

@helayoty helayoty Apr 9, 2026

I'm all for removing (or archiving) it. I'm moving it to the replaces section, similar to coscheduling.
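One possible reading of this change, sketched as kep.yaml metadata (assuming the standard `replaces` field used by KEP front matter):

```yaml
see-also:
- "/keps/sig-scheduling/583-coscheduling"
replaces:
- "/keps/sig-scheduling/5832-decouple-podgroup-api"
```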

The longer version of this design describing the whole thought process of choosing the
above described approach can be found in the [extended proposal] document.

[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
Member

Let's not remove it.

Member Author

I just moved it down with other references.



- `Workload` represents long-lived configuration-intent, whereas `PodGroups` represent transient units of scheduling.
Tying runtime execution units to the persistent definition object violates separation of concerns.
- Lifecycle coupling prevents standalone `PodGroup` objects from owning other resources (e.g., ResourceClaims)
Contributor

This sentence is a bit confusing. Maybe:

Decoupling the lifecycles allows standalone PodGroup objects to own resources (e.g., ResourceClaims). This enables garbage collection to be scoped to specific scheduling units, rather than tying it to the entire Workload or individual Pods.

?

Member

The text was copied from the original (approved) KEP so I avoided commenting on things that were simply copied.


- `Workload` becomes a scheduling policy object that defines scheduling constraints and requirements.
- `PodGroupTemplate` provides the blueprint for runtime `PodGroup` creation.
- `PodGroup` is a controller-owned runtime object with its own lifecycle that represents a single scheduling unit.
Contributor

It doesn't have to be controller-owned. So maybe: "PodGroup is a standalone runtime object with its own lifecycle - typically managed by a controller - that represents a single scheduling unit"?

Member

+1

- Introduce a concept of a `PodGroup` positioned as runtime counterparts for the Workload
- Ensure that decoupled model of `Workload` and `PodGroup` provide clear responsibility split, improved scalability and simplified lifecycle management
- Enhance status ownership by making `PodGroup` status track podGroup-level runtime state
- Ensure proper ownership of `PodGroup` objects via controller `ownerReferences`
Contributor

Is this really a goal? It looks like a means to achieve something else. What about saying here:

Enable automatic lifecycle management and resource cleanup for PodGroup objects through integration with Kubernetes garbage collection.

?
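The garbage-collection integration suggested here would rely on standard Kubernetes ownerReferences. A minimal hypothetical PodGroup manifest (names, UID, and API group/version are illustrative assumptions, not from this KEP):

```yaml
apiVersion: scheduling.k8s.io/v1alpha2   # assumed group/version
kind: PodGroup
metadata:
  name: training-job-group-0             # hypothetical
  ownerReferences:
  - apiVersion: scheduling.k8s.io/v1alpha2
    kind: Workload
    name: training-job                   # hypothetical owning Workload
    uid: 00000000-0000-0000-0000-000000000000  # hypothetical
    controller: true
    blockOwnerDeletion: true
```

With such a reference in place, deleting the Workload would let the garbage collector clean up its PodGroups automatically.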

Member Author

These goals have been copied as-is from the decouple KEP, is it ok to update them? @mm4tt @wojtek-t

Member

We own that - let's update it.

- Ensure proper ownership of `PodGroup` objects via controller `ownerReferences`
- Ensuring that we can extend `Workload` API in backward compatible way toward north-star API
- Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs
- Simplify integration with `Workload` API and true workload[^6] controllers to make `Workload` API
Contributor

I'm not sure if "Simplify integration" is the right goal here. Since this KEP is introducing the Workload API, there isn't an existing integration to simplify yet. This feels like a bit of an "inception".

The previous version ("Ensuring that Workload API will be usable...") seemed more accurate as it describes a key property of the new API (its universality). If we want to emphasize ease of use, maybe something like: "Ensure the Workload & PodGroup API provides a consistent and accessible integration path for both built-in and third-party controllers."?

Member Author

Same. These goals have been copied as-is from the decouple KEP, is it ok to update them? @mm4tt @wojtek-t

Member

Yes - let's update it.

@helayoty helayoty force-pushed the helayoty/merge-podgroup-with-workload branch from 06d9e7a to 8e14592 Compare April 9, 2026 10:48
Signed-off-by: helayoty <heelayot@microsoft.com>
@helayoty helayoty force-pushed the helayoty/merge-podgroup-with-workload branch from 8e14592 to 2ad2005 Compare April 9, 2026 11:40
@helayoty helayoty requested review from mm4tt and wojtek-t April 9, 2026 11:44

// - Status=False: Scheduling failed (i.e., timeout, unschedulable, etc.).
//
// Known reasons for PodGroupScheduled condition:
// - "Scheduled": All required pods have been successfully scheduled.
Member

Given you touch it already, let's unify with what was actually implemented:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/scheduling/v1alpha2/types.go#L512-L522
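For illustration only, a PodGroup carrying the condition discussed here might report status along these lines (values are hypothetical; the linked types.go is authoritative):

```yaml
status:
  conditions:
  - type: PodGroupScheduled
    status: "True"
    reason: Scheduled
    message: All required pods have been successfully scheduled.
```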

// Known condition types:
// - "PodGroupScheduled": Indicates whether the scheduling requirement has been satisfied.
// - "DisruptionTarget": Indicates whether the PodGroup is about to be terminated
// due to disruption such as preemption.
Member

nit: fix indentation

@wojtek-t
Member

/assign @macsko


[^7]: DNS subdomain is a naming convention defined in [RFC 1123](https://tools.ietf.org/html/rfc1123) that Kubernetes uses for most resource names.

lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit?
Member

What is this? (lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit?)

Comment on lines +1891 to +1900
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/enhancements]: https://git.k8s.io/enhancements
Member

Deduplicate:

Suggested change
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements


#### Story 5: Controller Scalability

As a workload controller author, I want `PodGroup` status to be stored in a separate object, so that per-replica scheduling updates do not require read-modify-write operations on a large, shared `Workload` object, which would otherwise create scalability and contention issues at scale.
Member

nit, but helpful for the future:
Could you split the long lines of text into multiple smaller ones? It simplifies the review and further modifications


**- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created.

**- Update PodGroup Status**: The kube-scheduler sets `PodGroupScheduled=True` after the group passed the Permit phase. If `PodGroup` is unschedulable, the scheduler sets `PodGroupScheduled=False` whenever a gang is conditionally accepted (waiting for preemption) or rejected.
Member

Why is this under GangScheduling plugin hooks? It's implemented by the scheduling cycle

**- WaitOnPermit**: used as a barrier to wait for the pods to be assigned to the nodes before
initiating potential preemptions and their bindings. The extension waits for all pods in the `PodGroup` to reach permit stage by using each pod's `schedulingGroup.podGroupName` to identify the `PodGroup` that the pod belongs to.

**- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created.
Member

And that an unscheduled pod is added

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

Member

Undo this change

We will add basic API tests for the new `Workload` and `PodGroup` APIs, that will later be
promoted to conformance. These tests will cover `PodGroup` creation,
validation, status updates, and lifecycle management.
More tests will be added for beta release.
Member

Do we really need more e2e tests for beta?

@@ -1436,17 +1616,26 @@ Watching for workloads:
- estimated throughput: < XX/s
- originating component: kube-scheduler, kube-controller-manager (GC)
Member

kube-scheduler is no longer watching for Workloads:

Suggested change
- originating component: kube-scheduler, kube-controller-manager (GC)
- originating component: kube-controller-manager (GC)

Comment on lines 1706 to +1711
- 2025-09: Initial KEP-4671 proposal.
- 2026-01-23: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
Member

I think we can keep the months only:

Suggested change
- 2025-09: Initial KEP-4671 proposal.
- 2026-01-23: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
- 2025-09: Initial KEP-4671 proposal.
- 2026-01: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
