
KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation#5555

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from esotsal:ippvs-alognside-static-cpu-policy-KEP
Feb 12, 2026

Conversation

@esotsal
Contributor

@esotsal esotsal commented Sep 21, 2025

  • One-line PR description: Create new KEP 5554: In place update pod resources alongside static cpu manager policy
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 21, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 21, 2025
@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from 4c5c393 to 1240d58 Compare September 21, 2025 18:24
@esotsal esotsal changed the title from "KEP 5554: In place update pod resources alongside static cpu manager policy" to "KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation" Sep 21, 2025
@esotsal esotsal changed the title from "KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation" to "[WIP] KEP 5554: In place update pod resources alongside static cpu manager policy KEP creation" Sep 21, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 21, 2025
@esotsal
Contributor Author

esotsal commented Sep 21, 2025

@k8s-ci-robot
Contributor

@esotsal: GitHub didn't allow me to request PR reviews from the following users: Chunxia202410.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @natasha41575 @tallclair @pravk03 @Chunxia202410 @ffromani

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch from 1240d58 to 8973b16 Compare September 26, 2025 10:59
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2025
@esotsal esotsal force-pushed the ippvs-alognside-static-cpu-policy-KEP branch 10 times, most recently from 24bfb5c to a6c437b Compare September 26, 2025 17:19
@esotsal
Contributor Author

esotsal commented Jan 29, 2026

Is there any chance of concurrent allocations which may mutate the node state while a resize is InProgress, therefore mutating the state which made the kubelet think the resize was feasible, in such a way as to make it no longer feasible?
I guess this is the key concern here (and I'm not deep enough into IPPR to have an obvious immediate answer, even though the locking I see in the allocation manager should prevent that. But once in a while an issue like kubernetes/kubernetes#136021 (comment) pops up and makes me pause)

Thanks for sharing this issue @ffromani .

I think the most appropriate answer is: I know that I don't know :-(

@esotsal
Contributor Author

esotsal commented Jan 29, 2026

I ran out of time today; I mostly looked at things at a high level, and the details about the promised CPUs make sense to me overall, but I want to do a deeper review of that part tomorrow

Thanks for your time. Please check the new KEP updates in preparation for v1.36; they include the ContainerCPUs checkpoint suggestion from Francesco.

@esotsal
Contributor Author

esotsal commented Jan 30, 2026

Status update for the PRR reviewer, ahead of the upcoming PRR freeze on Wednesday 4th February 2026 (AoE) / Thursday 5th February 2026, 12:00 UTC.

(last update: 3rd February 2026)

Answered the open comments below.

Remaining open comments I am working on; these are not blocking for alpha and can be fine-tuned in beta:

For the remaining unresolved comments, it is up to the reviewers to decide whether the provided answers are sufficient or blocking for this KEP to go alpha in v1.36.

I think the most important ones are below:

Please let me know if I've missed a comment.

Thanks in advance!

Contributor

@natasha41575 natasha41575 left a comment


I see my previous comments have been addressed - thanks!


When the topology manager cannot generate a topology hint which satisfies the topology manager policy, the pod resize is marked as [Deferred](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources#resize-status). This means that while the node theoretically has enough CPU for the resize, it's not currently possible but can be re-evaluated later, potentially becoming feasible as pod churn frees up capacity.

Reasons for failure
Contributor

@natasha41575 natasha41575 Feb 2, 2026


(replying to @ffromani's comment on this thread: #5555 (comment))

I want us to be very careful about what we set as Deferred, and err on the side of marking things Infeasible. We are planning Scheduler Preemption for IPPR in 1.37.

Scheduler preemption will be triggered on all Deferred resizes, meaning that the scheduler will try to find pods to preempt based on priority class and the size of the pod. Because the scheduler is not NUMA-aware, it is only safe to mark a resize as "Deferred" if the kubelet knows that scheduler-triggered preemption can help the resize succeed. Otherwise the scheduler will preempt pods unnecessarily.

In your example scenario I can see it could be possible to do this, but do you think we can reliably implement this kind of logic? It seems both complex and fragile: it would require the kubelet to make a lot of assumptions about the scheduler's behavior, and I'm not sure that's a direction we want to go.

My opinion is that marking it as "Infeasible" is a safe step forward to unblock us now, while still leaving room to relax it to "Deferred" later if necessary.
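
For illustration, the conservative rule being discussed here boils down to something like the following sketch (function and parameter names are hypothetical, not actual kubelet code):

```go
// Sketch only: how a kubelet-side check might classify a CPU resize,
// erring on the side of Infeasible so that scheduler preemption is not
// triggered for resizes that NUMA constraints would never allow.
func classifyCPUResize(fitsNodeAllocatable, topologyHintAvailable bool) string {
	switch {
	case !fitsNodeAllocatable:
		return "Infeasible" // the node simply lacks the CPU
	case !topologyHintAvailable:
		// Enough CPU in total, but no placement satisfies the topology
		// manager policy; mark Infeasible rather than Deferred.
		return "Infeasible"
	default:
		return "InProgress"
	}
}
```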

Contributor

@ffromani ffromani left a comment


I had a pass and the KEP content LGTM. I have no major comments about the plan outlined in the KEP and, of course, I fully agree with the goal. I see a future extension path to incorporate the memory manager (and maybe the device manager?).
I think there's room to polish the implementation details section a bit, but that's pretty minor.

@KevinTMtz
Member

KevinTMtz commented Feb 4, 2026

AFAIK there's no solution here. The state file changes are sadly incompatible with each other, there's no justification in either KEP to carry the other KEP's state fields, and there's little chance to shoehorn the fields needed by either KEP (if we ever want to do that, and we should not) into the fields added by the other.

The project has a policy/pattern against partial merges; for example, we cannot merge the state file change for either/both KEPs first and then move on. I'm afraid the only real option is to find a way to serialize both KEPs somehow within the cycle.

If we both target the next cycle, the new fields from our features will have to be added to the new CPU manager V3 state in the same release, thus creating a dependency on the serialization changes of whichever feature is merged first.

I wonder what the process would be if the first merged feature is rolled back but the second one is maintained. At least we could keep any non-feature-related changes that belong to the V3 state for the second feature to use. The path may be easier since we are not modifying the state in the same way: PodLevelResourceManagers adds a new property, while InPlacePodVerticalScalingExclusiveCPUs modifies one of the existing properties.

@ffromani @esotsal

@esotsal
Contributor Author

esotsal commented Feb 5, 2026

Thanks for sharing your thoughts @KevinTMtz, I think likewise.

If we both target the next cycle, the new fields from our features will have to be added to the new CPU manager V3 state in the same release, thus creating a dependency on the serialization changes of whichever feature is merged first.

I think releasing in v1.36 is the desired position for both; according to the SIG Node v1.36 KEPs planning, both are considered for release, KEP 5554 with High priority and KEP 5526 with Medium priority.

The path may be easier since we are not modifying the state in the same way: PodLevelResourceManagers adds a new property, while InPlacePodVerticalScalingExclusiveCPUs modifies one of the existing properties.

I agree. Based on the above, I think it is manageable and doable to add both in the v1.36 release. I don't have a preference on the merging order; either works for me, and it is up to the sig-node community, reviewers, and approvers to decide what is most suitable.

Contributor

@natasha41575 natasha41575 left a comment


ippr-specific bits LGTM

the rest of the content also LGTM, but admittedly I am not an expert in topology / cpu manager

/assign @ffromani @tallclair

@ffromani
Contributor

ffromani commented Feb 6, 2026

ippr-specific bits LGTM

the rest of the content also LGTM, but admittedly I am not an expert in topology / cpu manager

This is an interesting part because I kinda feel the same in reverse. The cpu/topology manager bits make sense, but I can't really comment on the IPPR integration.
Let's try to get a holistic vision and connect the dots.
The cpumanager part is basically about providing the minimally different hint which allows the request. On downsize this seems trivial; on upsize it may cause a hint to require more NUMA nodes than the original allocation.
Then we defer to the topology manager to accept or reject the resize considering the policy, much like admission.
This means either:

  1. we re-run an admission-like flow on resize, at least the TM part
  2. we have a new flow similar to admission in the resize path

Is this a correct 10k-foot summary of the flow?

@natasha41575
Contributor

natasha41575 commented Feb 6, 2026

ippr-specific bits LGTM
the rest of the content also LGTM, but admittedly I am not an expert in topology / cpu manager

This is an interesting part because I kinda feel the same in reverse. The cpu/topology manager bits make sense, but I can't really comment on the IPPR integration. Let's try to get a holistic vision and connect the dots. The cpumanager part is basically about providing the minimally different hint which allows the request. On downsize this seems trivial; on upsize it may cause a hint to require more NUMA nodes than the original allocation. Then we defer to the topology manager to accept or reject the resize considering the policy, much like admission. This means either:

  1. we re-run an admission-like flow on resize, at least the TM part
  2. we have a new flow similar to admission in the resize path

Is this a correct 10k-foot summary of the flow?

We actually already run admission checks on resize. So the ideal flow I think would be to integrate TM feasibility checks (i.e. can TM generate a hint?) into the "admission" path, and integrate TM allocation of CPUs into the "resize actuation" path which happens during a pod sync.
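
For concreteness, a rough sketch of the two touch points described above; the names and shapes here are assumptions for this discussion, not the kubelet's actual API:

```go
// Sketch only: podResizeRequest captures just what the two phases need.
type podResizeRequest struct {
	PodUID        string
	ContainerName string
	NewCPURequest int64 // whole exclusive CPUs requested after the resize
}

// Phase 1, run on the "admission" path: feasibility only. Can the
// topology manager generate a hint satisfying its policy for the new
// request? No state is mutated here.
func resizeIsFeasible(req podResizeRequest, canGenerateHint func(podResizeRequest) bool) bool {
	return canGenerateHint(req)
}

// Phase 2, run on the "resize actuation" path during pod sync: the CPU
// manager actually (re)allocates the exclusive CPU set.
func actuateResize(req podResizeRequest, allocate func(podResizeRequest) error) error {
	return allocate(req)
}
```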

As an aside, I actually think kubernetes/kubernetes#133427 could help simplify the implementation of this KEP, because it unifies the existing resize feasibility checks with admission, allowing for different checks depending on whether we are adding or resizing a pod. So I can try to get this one in. IIUC TM already has its own admission handler, so we'd just need to make sure it does the right checks on resize.

Does this make sense?

Let's try to get an holistic vision and connect the dots.

I think @tallclair has a holistic understanding of both sides, so his review will help tremendously.

@esotsal
Contributor Author

esotsal commented Feb 9, 2026

@dchen1107, @tallclair, @natasha41575, @ffromani, @deads2k, @KevinTMtz, @pravk03: are there any blocking items and/or open action points for this KEP which I might have missed for alpha PRR? If so, please let me know; enhancement freeze is tomorrow, so I would appreciate your feedback.

Adding @whtssub (this KEP's wrangler), who has kindly updated the KEP's status: #5554 (comment)

@Chunxia202410

Open question

Are there any objections to the extended section contributed by @Chunxia202410 for kubernetes/kubernetes#131309, and should it be a future extension or a separate KEP to be taken up in beta? The reasoning is that the proposal is a non-blocking extension of KEP 5554, so it should not block this KEP going to alpha.

Hi @esotsal, thank you for any suggestions from you and the community regarding this issue. Since this feature is quite independent, we plan to address this part as a separate KEP. Thank you.

@ffromani
Contributor

ippr-specific bits LGTM
the rest of the content also LGTM, but admittedly I am not an expert in topology / cpu manager

This is an interesting part because I kinda feel the same in reverse. The cpu/topology manager bits make sense, but I can't really comment on the IPPR integration. Let's try to get a holistic vision and connect the dots. The cpumanager part is basically about providing the minimally different hint which allows the request. On downsize this seems trivial; on upsize it may cause a hint to require more NUMA nodes than the original allocation. Then we defer to the topology manager to accept or reject the resize considering the policy, much like admission. This means either:

  1. we re-run an admission-like flow on resize, at least the TM part
  2. we have a new flow similar to admission in the resize path

Is this a correct 10k-foot summary of the flow?

We actually already run admission checks on resize. So the ideal flow I think would be to integrate TM feasibility checks (i.e. can TM generate a hint?) into the "admission" path, and integrate TM allocation of CPUs into the "resize actuation" path which happens during a pod sync.

As an aside, I actually think kubernetes/kubernetes#133427 could help simplify the implementation of this KEP, because it unifies the existing resize feasibility checks with admission, allowing for different checks depending on whether we are adding or resizing a pod. So I can try to get this one in. IIUC TM already has its own admission handler, so we'd just need to make sure it does the right checks on resize.

Does this make sense?

It does, thanks for clarifying!

Let's try to get an holistic vision and connect the dots.

I think @tallclair has a holistic understanding of both sides, so his review will help tremendously.

+1!!

@ffromani
Contributor

@dchen1107, @tallclair, @natasha41575, @ffromani, @deads2k, @KevinTMtz, @pravk03: are there any blocking items and/or open action points for this KEP which I might have missed for alpha PRR? If so, please let me know; enhancement freeze is tomorrow, so I would appreciate your feedback.

Adding @whtssub (this KEP's wrangler), who has kindly updated the KEP's status: #5554 (comment)

LGTM from my side!

@deads2k
Contributor

deads2k commented Feb 10, 2026

PRR looks good for alpha. I made a few comments about things we'll need to be sure we refine in beta.

/approve

@esotsal
Contributor Author

esotsal commented Feb 11, 2026

PRR looks good for alpha. I made a few comments about things we'll need to be sure we refine in beta.

Thanks, updated this PR to resolve those comments.

@natasha41575
Contributor

/lgtm

@esotsal
Contributor Author

esotsal commented Feb 11, 2026

/lgtm

Thanks, updated the KEP to fix the "alongside" typo (diff). Please take another look.

@natasha41575
Contributor

/lgtm

@tallclair
Member

/lgtm
/approve

The decision to defer to the TopologyManager for which CPUs to downsize LGTM. I'm not sure about the decision to forbid resizing below the initial count, but that is something we can easily revisit at a later date if there's a use case for it.
I only gave the implementation plan a superficial review, but we can work out the details in the actual implementation PR.


To effectively address the needs of both users and Kubernetes components for the realization of this KEP, the proposed implementation involves the following changes:

1. Update the `CPUManager` checkpoint file format as stated in the [ContainerCPUs checkpoint](#containercpus-checkpoint) section, which will serve as the single source of truth representing the original and resized exclusive CPUs of an in-place CPU resize of a Guaranteed Pod with the CPU static policy.
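
For illustration only, one possible shape for such a checkpoint entry; the field names below are assumptions made for this discussion, not the format specified in the KEP:

```go
// Illustrative sketch of a per-container checkpoint entry that records
// both the originally admitted exclusive CPUs and the resized set.
type containerCPUsCheckpointEntry struct {
	// Original is the exclusive CPU set assigned at pod admission.
	Original string `json:"original"`
	// Resize is the exclusive CPU set after an in-place resize; empty
	// when no resize has been applied or the feature gate is disabled.
	Resize string `json:"resize,omitempty"`
}
```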
Member


Will the new format be feature gated?

Contributor Author


If the feature gate is not set, "resize" will always be empty and only "original" will be used. I hadn't thought about making the new format feature gated. Do you have a use case in mind? Is it OK to continue the discussion in the implementation PR?
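
Reusing the illustrative entry type sketched earlier, the behavior described above would roughly reduce to the following read path (again a sketch under assumed field names, not the proposed implementation):

```go
// Sketch only: with the feature gate off, "resize" is never populated,
// so this degenerates to today's behavior of using only the original
// exclusive CPU assignment.
func effectiveCPUs(e containerCPUsCheckpointEntry) string {
	if e.Resize != "" {
		return e.Resize
	}
	return e.Original
}
```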

Contributor Author

@esotsal esotsal Apr 6, 2026


Will the new format be feature gated?

Such a simple question; I was not aware of how important it was, or of the complexity involved in solving it. Thanks @tallclair for the question. Considering the v1.36 cycle reviews in this KEP's PR, the short answer is that it was missed and, yes, it must be feature gated, as must ALL code modifications. Why? To ensure Kubernetes operational activities are not impacted (rollback, harmonized co-existence with other features touching the checkpoint, ensuring v1.PodReasonInfeasible is returned when needed to reduce the impact of unnecessary resizes on a node, etc.). #5965 was created to update the KEP with these modifications, hoping we reach a consensus and increase confidence that most if not ALL risks have been considered and the KEP can go alpha in v1.37.
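
For context, the gating pattern described here usually reduces to guarding every new code path behind the gate, roughly like the sketch below (the gate name is shown for illustration and may not match the final one):

```go
import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// Sketch only: all new behavior, including writes of any new checkpoint
// fields, stays behind the feature gate so that disabling the gate or
// rolling back leaves existing behavior and on-disk state untouched.
func exclusiveCPUResizeEnabled() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.InPlacePodVerticalScalingExclusiveCPUs)
}
```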

@dchen1107
Member

/lgtm
/approve based on @tallclair and @ffromani's review.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, deads2k, esotsal, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/kep: Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

