Log total nodes in the scale-up#9347
Conversation
|
Hi @pmendelski. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pmendelski The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
||
| // Execute scale up. | ||
| klog.V(1).Infof("Final scale-up plan: %v", scaleUpInfos) | ||
| klog.V(1).Infof("Final scale-up plan: %v %v", totalCapacity, scaleUpInfos) |
There was a problem hiding this comment.
This PR was extracted from #9315 . Let's continue the discussion here @aleksandra-malinowska @jbtk
Main considerations:
- potentially breaking log queries - Fair argument, but I think we should find a middle ground. Otherwise it could lead to treating our logs like an API and versioning them. In this case no information is dropped, scaleUpInfo format is unchanged, only a summary is added. I assume that tooling based on log parsing will from time to time require some tweaking and should be avoided.
- why do we need it? - Figuring out a scale-up size in a regional cluster requires doing N subtractions and adding them in memory. For a single scale-up that's ok, but when analyzing a long sequence of scale-up it's suboptimal.
There was a problem hiding this comment.
Shall we add something like finalScaleUpSize to give more context in the log line?
There was a problem hiding this comment.
This is not a cleanup of any kind - please fix.
+1 to @damikag - it would be better to introduce some context for this number that is added here. If we are changing the log line we could also add information across how many groups this is spread (this is also sth that is visible in the log line, but requires parsing)
Have you considered another approach - adding a separate log line with this information? I guess that there is a tradeoff between changing this log line and perfomance/amount of logs if we add additional log line - have you considered this?
There was a problem hiding this comment.
Fair argument, but I think we should find a middle ground. Otherwise it could lead to treating our logs like an API and versioning them.
Like I mentioned previously, this specific log line is fairly unique with regard to this. It's essentially a source-of-truth for the final CA decision, whatever happens later.
I'm not particularly attached to the rest of our logging, and I suspect it could probably use substantial improvements.
only a summary is added
My main issue is that this new addition is in the middle of the log and has zero context, it's just an integer inserted in between "scale-up plan:" and describing the actual plan.
The former makes it potentially misleading to people who routinely query CA logs for debugging purposes, as discussed before.
The latter makes it unhelpful for people who aren't used to reading CA logs. Depending on the scale-up size, it's just a magic number that can be mistaken for something else (# of scaled-up node groups, # of considered options, # of accommodated pods). You'd have to either read the code or reverse-engineer just to find out what it means, neither of which is ideal when starting basic troubleshooting.
At least let's add a message describing what this number means (ideally something easily understood by the new users, like "adding a total of %d nodes across %d groups"). Personally I'd also prefer if it was added at the end or logged separately, but with a descriptive message it's at least less likely to quietly break a debugging flow.
|
/ok-to-test |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
One of the most important logs in this component is the one informing about the final scale-up decision. ATM this log is usually a multiline detailed log entry that informs about the scale-up size split into multiple node groups with their actual sizes and target sizes. That's very helpful when analyzing a single scale-up decision but too detailed when analyzing a sequence of scale-up decisions in a long time interval.
I propose adding to the log the final number of nodes added by the scale-up.
Current log (for a scale-up split into 3 node groups):
Proposed log:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: