
feat: Merge Decouple PodGroup KEP (#5832) into Gang Scheduling (#4671) #5980

Open

helayoty wants to merge 1 commit into kubernetes:master from helayoty:helayoty/merge-podgroup-with-workload

Conversation

@helayoty
Member

  • One-line PR description:

Consolidates KEP-5832 (Decouple PodGroup API) into KEP-4671 (Gang Scheduling) so that the decoupled PodGroup design lives in a single, self-contained KEP rather than being split across two documents.

  • Issue link:
  • Other comments:
    KEP-5832 was created as a companion to KEP-4671 to detail the PodGroup decoupling design. Maintaining two separate KEPs for what is effectively one feature creates confusion for reviewers and implementers. Merging them produces a single authoritative document that is easier to review, approve, and track through the enhancement process.

/sig scheduling
/area workload-aware

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. labels Mar 26, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 26, 2026
@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko March 26, 2026 21:56
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 26, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: helayoty
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 26, 2026
@helayoty
Member Author

/assign @wojtek-t

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Mar 26, 2026
@helayoty helayoty moved this from Backlog to Needs Review in Workload-aware & Topology-aware Workstream Mar 26, 2026
Member

@wojtek-t wojtek-t left a comment

A few small comments from me.

owning-sig: sig-scheduling
participating-sigs:
- sig-apps
- sig-api-machinery
Member

Why sig api-machinery?

Member Author

This was feedback during the PodGroup KEP review, since it's a new API.

Member

No - api-machinery does not own APIs.
api-machinery owns the machinery for exposing APIs, which this KEP doesn't influence or change.
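Following this resolution, the kep.yaml metadata would presumably keep only sig-apps as a participating SIG. A sketch of the resulting fragment:

```yaml
owning-sig: sig-scheduling
participating-sigs:
- sig-apps
```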


see-also:
- "/keps/sig-scheduling/583-coscheduling"
- "/keps/sig-scheduling/5832-decouple-podgroup-api"
Member

Let's not replace, but add a new entry.

Contributor

Sorry, but aren't we going to remove the 5832-decouple-podgroup-api ?

Member

Good question - I thought we should mark it as abandoned?
But maybe we should indeed remove it.

Member Author

@helayoty helayoty Apr 9, 2026

I'm all for removing (or archiving) it. I'm moving it to the replaces section, similar to coscheduling.
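One possible reading of this change, sketched as kep.yaml metadata (assuming the standard `replaces` field used by KEP front matter):

```yaml
see-also:
- "/keps/sig-scheduling/583-coscheduling"
replaces:
- "/keps/sig-scheduling/5832-decouple-podgroup-api"
```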

The longer version of this design describing the whole thought process of choosing the
above described approach can be found in the [extended proposal] document.

[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
Member

Let's not remove it.

Member Author

I just moved it down with other references.



- `Workload` represents long-lived configuration-intent, whereas `PodGroups` represent transient units of scheduling.
Tying runtime execution units to the persistent definition object violates separation of concerns.
- Lifecycle coupling prevents standalone `PodGroup` objects from owning other resources (e.g., ResourceClaims)
Contributor

This sentence is a bit confusing. Maybe:

Decoupling the lifecycles allows standalone PodGroup objects to own resources (e.g., ResourceClaims). This enables garbage collection to be scoped to specific scheduling units, rather than tying it to the entire Workload or individual Pods.

?

Member

The text was copied from the original (approved) KEP so I avoided commenting on things that were simply copied.


- `Workload` becomes a scheduling policy object that defines scheduling constraints and requirements.
- `PodGroupTemplate` provides the blueprint for runtime `PodGroup` creation.
- `PodGroup` is a controller-owned runtime object with its own lifecycle that represents a single scheduling unit.
Contributor

It doesn't have to be controller-owned. So maybe: "PodGroup is a standalone runtime object with its own lifecycle - typically managed by a controller - that represents a single scheduling unit"?

Member

+1

- Introduce a concept of a `PodGroup` positioned as runtime counterparts for the Workload
- Ensure that decoupled model of `Workload` and `PodGroup` provide clear responsibility split, improved scalability and simplified lifecycle management
- Enhance status ownership by making `PodGroup` status track podGroup-level runtime state
- Ensure proper ownership of `PodGroup` objects via controller `ownerReferences`
Contributor

Is this really a goal? It looks like a means to achieve something else. What about saying here:

Enable automatic lifecycle management and resource cleanup for PodGroup objects through integration with Kubernetes garbage collection.

?
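The garbage-collection integration suggested here would rely on standard Kubernetes ownerReferences. A minimal hypothetical PodGroup manifest (names, UID, and API group/version are illustrative assumptions, not from this KEP):

```yaml
apiVersion: scheduling.k8s.io/v1alpha2   # assumed group/version
kind: PodGroup
metadata:
  name: training-job-group-0             # hypothetical
  ownerReferences:
  - apiVersion: scheduling.k8s.io/v1alpha2
    kind: Workload
    name: training-job                   # hypothetical owning Workload
    uid: 00000000-0000-0000-0000-000000000000  # hypothetical
    controller: true
    blockOwnerDeletion: true
```

With such a reference in place, deleting the Workload would let the garbage collector clean up its PodGroups automatically.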

Member Author

These goals have been copied as-is from the decouple KEP, is it ok to update them? @mm4tt @wojtek-t

Member

We own that - let's update it.

- Ensure proper ownership of `PodGroup` objects via controller `ownerReferences`
- Ensuring that we can extend `Workload` API in backward compatible way toward north-star API
- Ensuring that `Workload` API will be usable for both built-in and third-party workload controllers and APIs
- Simplify integration with `Workload` API and true workload[^6] controllers to make `Workload` API
Contributor

I'm not sure if "Simplify integration" is the right goal here. Since this KEP is introducing the Workload API, there isn't an existing integration to simplify yet. This feels like a bit of an "inception".

The previous version ("Ensuring that Workload API will be usable...") seemed more accurate as it describes a key property of the new API (its universality). If we want to emphasize ease of use, maybe something like: "Ensure the Workload & PodGroup API provides a consistent and accessible integration path for both built-in and third-party controllers."?

Member Author

Same. These goals have been copied as-is from the decouple KEP, is it ok to update them? @mm4tt @wojtek-t

Member

Yes - let's update it.

@helayoty helayoty force-pushed the helayoty/merge-podgroup-with-workload branch from 06d9e7a to 8e14592 Compare April 9, 2026 10:48
Signed-off-by: helayoty <heelayot@microsoft.com>
@helayoty helayoty force-pushed the helayoty/merge-podgroup-with-workload branch from 8e14592 to 2ad2005 Compare April 9, 2026 11:40
@helayoty helayoty requested review from mm4tt and wojtek-t April 9, 2026 11:44

// - Status=False: Scheduling failed (i.e., timeout, unschedulable, etc.).
//
// Known reasons for PodGroupScheduled condition:
// - "Scheduled": All required pods have been successfully scheduled.
Member

Given you touch it already, let's unify with what was actually implemented:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/scheduling/v1alpha2/types.go#L512-L522
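For illustration only, a PodGroup carrying the condition discussed here might report status along these lines (values are hypothetical; the linked types.go is authoritative):

```yaml
status:
  conditions:
  - type: PodGroupScheduled
    status: "True"
    reason: Scheduled
    message: All required pods have been successfully scheduled.
```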

// Known condition types:
// - "PodGroupScheduled": Indicates whether the scheduling requirement has been satisfied.
// - "DisruptionTarget": Indicates whether the PodGroup is about to be terminated
// due to disruption such as preemption.
Member

nit: fix indentation

@wojtek-t
Member

/assign @macsko


[^7]: DNS subdomain is a naming convention defined in [RFC 1123](https://tools.ietf.org/html/rfc1123) that Kubernetes uses for most resource names.

lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit?
Member

What is this? (lO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/Kdit?)

Comment on lines +1891 to +1900
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/enhancements]: https://git.k8s.io/enhancements
Member

Deduplicate:

Suggested change
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[#5501]: https://github.com/kubernetes/enhancements/pull/5501
[extended proposal]: https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?
[KEP-5547]: https://github.com/kubernetes/enhancements/pull/5871
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements


#### Story 5: Controller Scalability

As a workload controller author, I want `PodGroup` status to be stored in a separate object, so that per-replica scheduling updates do not require read-modify-write operations on a large, shared `Workload` object, which would otherwise create scalability and contention issues at scale.
Member

nit, but helpful for the future:
Could you split the long lines of text into multiple smaller ones? It simplifies the review and further modifications


**- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created.

**- Update PodGroup Status**: The kube-scheduler sets `PodGroupScheduled=True` after the group passed the Permit phase. If `PodGroup` is unschedulable, the scheduler sets `PodGroupScheduled=False` whenever a gang is conditionally accepted (waiting for preemption) or rejected.
Member

Why is this under GangScheduling plugin hooks? It's implemented by the scheduling cycle

**- WaitOnPermit**: used as a barrier to wait for the pods to be assigned to the nodes before
initiating potential preemptions and their bindings. The extension waits for all pods in the `PodGroup` to reach permit stage by using each pod's `schedulingGroup.podGroupName` to identify the `PodGroup` that the pod belongs to.

**- EventsToRegister (Enqueue)**: The extension will register a new event for when a `PodGroup` object is created.
Member

And that an unscheduled pod is added

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

Member

Undo this change

We will add basic API tests for the new `Workload` and `PodGroup` APIs, that will later be
promoted to conformance. These tests will cover `PodGroup` creation,
validation, status updates, and lifecycle management.
More tests will be added for beta release.
Member

Do we really need more e2e tests for beta?

@@ -1436,17 +1616,26 @@ Watching for workloads:
- estimated throughput: < XX/s
- originating component: kube-scheduler, kube-controller-manager (GC)
Member

kube-scheduler is no longer watching for Workloads:

Suggested change
- originating component: kube-scheduler, kube-controller-manager (GC)
- originating component: kube-controller-manager (GC)

Comment on lines 1706 to +1711
- 2025-09: Initial KEP-4671 proposal.
- 2026-01-23: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
Member

I think we can keep the months only:

Suggested change
- 2025-09: Initial KEP-4671 proposal.
- 2026-01-23: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02-06: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03-26: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
- 2025-09: Initial KEP-4671 proposal.
- 2026-01: KEP-5832 created for PodGroup API alpha release.
- 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha
to finalize the architecture.
- 2026-02: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
- 2026-03: KEP-5832 merged into KEP-4671 as a single consolidated KEP.
