Skip to content

Log total nodes in the scale-up#9347

Closed
pmendelski wants to merge 1 commit intokubernetes:masterfrom
pmendelski:mendel-scaleup-log-summary
Closed

Log total nodes in the scale-up#9347
pmendelski wants to merge 1 commit intokubernetes:masterfrom
pmendelski:mendel-scaleup-log-summary

Conversation

@pmendelski
Copy link
Copy Markdown
Contributor

@pmendelski pmendelski commented Mar 11, 2026

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

One of the most important logs in this component is the one informing about the final scale-up decision. ATM this log is usually a multiline detailed log entry that informs about the scale-up size split into multiple node groups with their actual sizes and target sizes. That's very helpful when analyzing a single scale-up decision but too detailed when analyzing a sequence of scale-up decisions in a long time interval.

I propose adding to the log the final number of nodes added by the scale-up.

Current log (for a scale-up split into 3 node groups):

Final scale-up plan: [{<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)} {<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)} {<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)}]

Proposed log:

Final scale-up plan: <SCALE_UP_SIZE> [{<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)} {<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)} {<NODE_GROUP_ID> SIZE->TARGET (max: MAX_SIZE)}]

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-area needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 11, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @pmendelski. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 11, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pmendelski
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


// Execute scale up.
klog.V(1).Infof("Final scale-up plan: %v", scaleUpInfos)
klog.V(1).Infof("Final scale-up plan: %v %v", totalCapacity, scaleUpInfos)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR was extracted from #9315 . Let's continue the discussion here @aleksandra-malinowska @jbtk

Main considerations:

  • potentially breaking log queries - Fair argument, but I think we should find a middle ground. Otherwise it could lead to treating our logs like an API and versioning them. In this case no information is dropped, scaleUpInfo format is unchanged, only a summary is added. I assume that tooling based on log parsing will from time to time require some tweaking and should be avoided.
  • why do we need it? - Figuring out a scale-up size in a regional cluster requires doing N subtractions and adding them in memory. For a single scale-up that's ok, but when analyzing a long sequence of scale-up it's suboptimal.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add something like finalScaleUpSize to give more context in the log line?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a cleanup of any kind - please fix.

+1 to @damikag - it would be better to introduce some context for this number that is added here. If we are changing the log line we could also add information across how many groups this is spread (this is also sth that is visible in the log line, but requires parsing)

Have you considered another approach - adding a separate log line with this information? I guess that there is a tradeoff between changing this log line and perfomance/amount of logs if we add additional log line - have you considered this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair argument, but I think we should find a middle ground. Otherwise it could lead to treating our logs like an API and versioning them.

Like I mentioned previously, this specific log line is fairly unique with regard to this. It's essentially a source-of-truth for the final CA decision, whatever happens later.

I'm not particularly attached to the rest of our logging, and I suspect it could probably use substantial improvements.

only a summary is added

My main issue is that this new addition is in the middle of the log and has zero context, it's just an integer inserted in between "scale-up plan:" and describing the actual plan.

The former makes it potentially misleading to people who routinely query CA logs for debugging purposes, as discussed before.

The latter makes it unhelpful for people who aren't used to reading CA logs. Depending on the scale-up size, it's just a magic number that can be mistaken for something else (# of scaled-up node groups, # of considered options, # of accommodated pods). You'd have to either read the code or reverse-engineer just to find out what it means, neither of which is ideal when starting basic troubleshooting.

At least let's add a message describing what this number means (ideally something easily understood by the new users, like "adding a total of %d nodes across %d groups"). Personally I'd also prefer if it was added at the end or logged separately, but with a descriptive message it's at least less likely to quietly break a debugging flow.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 11, 2026
@jbtk
Copy link
Copy Markdown
Member

jbtk commented Mar 12, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 12, 2026
@pmendelski pmendelski closed this Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants