fix(metrics): negative increment to a counter should not cause panics (NK-1729) #2492
Conversation
Pull request overview
This PR prevents runtime custom counter updates with negative deltas from triggering panics during metrics collection/reporting.
Changes:
- Guard `LocalMetrics.CustomCounter` against negative deltas by ignoring them.
- Add a regression test to ensure a negative counter delta does not crash the reporter during background flushing.
- Document the behavior change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| server/metrics.go | Drops negative deltas for custom counters to avoid Prometheus/tally panics. |
| server/metrics_test.go | Adds regression coverage for negative counter deltas during reporter flush. |
| CHANGELOG.md | Notes the fix in the Unreleased “Fixed” section. |
Comments suppressed due to low confidence (1)
server/metrics_test.go:41
- This test relies on fixed `time.Sleep` intervals to “hopefully” span two tally flush cycles, but it never asserts that a flush actually happened after the negative delta. On slow/loaded CI runners the flush may be delayed and the test could pass without exercising the panic path. Consider making this deterministic (e.g., drive/trigger a flush explicitly, or loop with a bounded timeout until a scrape/flush has occurred) and avoid long fixed sleeps to reduce flakiness and runtime.
```go
	time.Sleep(1500 * time.Millisecond)
	module.MetricsCounterAdd("panic_counter", nil, -1)
	time.Sleep(1500 * time.Millisecond)
}
```
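One way to make that deterministic, as a sketch only (`scrapeContainsCounter` is a hypothetical helper standing in for however the test would observe a completed flush, e.g. by hitting the Prometheus scrape endpoint), could look like:

```go
// Sketch: poll with a bounded deadline for evidence that the first flush
// happened, then apply the negative delta, instead of sleeping fixed 1.5s
// intervals.
deadline := time.Now().Add(5 * time.Second)
for !scrapeContainsCounter("panic_counter") { // hypothetical helper, not part of the PR
	if time.Now().After(deadline) {
		t.Fatal("counter was never flushed before the deadline")
	}
	time.Sleep(50 * time.Millisecond)
}
module.MetricsCounterAdd("panic_counter", nil, -1)
```

The thread below explains why the author ultimately kept the sleep-based approach.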
```go
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
```
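The quoted hunk is cut off by the review view; a minimal sketch of the drop-on-negative guard described in the overview (the exact merged body, including the error metric discussed below, is not fully shown here) would be:

```go
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
		// Prometheus counters are monotonic; flushing a negative delta would
		// panic in the collector, so the update is dropped instead.
		return
	}
	// ... existing increment path, elided in the review snippet ...
}
```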
I was going to suggest something similar, do you think we should emit a warn log instead of silently dropping?
I initially had a log, but then took it out, because the counter increment can be a hot path and if it keeps happening the logs will be spammed.
If not a log, then maybe a `counter_errors` metric? That way there will be visibility into which counters failed to increment.
A dedicated metric seems like the best option, perhaps `custom_counter_negative_error`?
We can't use the `custom_` prefix for an internal metric, because that prefix is reserved for custom metrics. Maybe:

```
metrics_collection_errors_count{name=custom_counter_abc,err=negative_increment} 1
```
```go
)

func TestMetricsCounterAddNegativeDoesNotPanics(t *testing.T) {
	logger := zap.NewNop()
```
Given that this test is time-dependent and could potentially be flaky, should we `t.Skip()` it as part of the test suite?
It reliably fails without the fix:
- tally is configured with a 1s flush interval (sketched below)
- the waits are 1.5 seconds to guarantee that a flush has occurred
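For context on that cadence, here is a rough illustration of how a tally root scope reports on a fixed interval (Nakama's `LocalMetrics` wires this up internally from `ReportingFreqSec`; the exact setup and the `prometheusReporter` value are not shown in this thread):

```go
// Illustration only: tally flushes counters to its reporter once per
// interval, so a test can only observe the effect of an update after at
// least one full interval has elapsed — hence the 1.5s waits above.
scope, closer := tally.NewRootScope(tally.ScopeOptions{
	CachedReporter: prometheusReporter, // placeholder for the tally-prometheus reporter
}, time.Second)
defer closer.Close()
_ = scope
```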
I'm a bit wary of adding tests that depend on sleeps, as it increases test execution time and can pile up over time, although I'm aware it's sometimes unavoidable. If we want to keep it as part of the suite, please infer the required sleep duration from the config value instead of hardcoding it.
Pushed a change deriving the sleeps from `ReportingFreqSec`.

> it'll increase the test execution time and it can pile up over time

There is no `tally.Flush()` I could use, even if I created a private constructor for the metrics collector. The wait also has to be above 1 second, because the config option has second resolution.
The metric name is `metrics_collection_errors_count`.
Pull request overview
This PR addresses a Prometheus/tally panic scenario caused by applying a negative delta to a custom runtime counter, and adds coverage + release notes for the fix.
Changes:
- Guard `LocalMetrics.CustomCounter` against negative deltas by dropping the update and incrementing an internal error counter.
- Add a regression test ensuring a negative `MetricsCounterAdd` does not crash metrics collection.
- Document the behavior change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| server/metrics.go | Prevents negative custom counter deltas from reaching Prometheus counters; emits an internal error metric instead. |
| server/metrics_test.go | Adds a regression test that exercises the negative-delta path and ensures no panic occurs during reporting. |
| CHANGELOG.md | Notes the fix under Unreleased/Fixed. |
server/metrics.go, hunk `@@ -523,6 +523,11 @@` (after `StorageWriteRejectCount`):

```go
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
		m.PrometheusScope.Tagged(map[string]string{"name": name, "err": "negative_increment"}).Counter("metrics_collection_errors_count").Inc(1)
```
Unlikely, but valid. Pushed a change where `metrics_collection_errors_count` is created in a limited scope to avoid high cardinality in pathological cases.
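The limited-scope change itself isn't quoted in this thread; as a hedged illustration, one way to bound cardinality is to stop tagging the error counter with the caller-supplied counter name, so arbitrary names can't mint new label combinations:

```go
// Illustration only — not necessarily the merged code. The "name" tag is
// dropped (or restricted) so a pathological caller creating many distinct
// custom counter names cannot inflate the label cardinality of the
// internal error metric.
m.PrometheusScope.
	Tagged(map[string]string{"err": "negative_increment"}).
	Counter("metrics_collection_errors_count").
	Inc(1)
```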
```go
	cfg.Metrics.ReportingFreqSec = 1
	reportingInterval := time.Duration(cfg.Metrics.ReportingFreqSec) * time.Second
	flushWait := reportingInterval + 200*time.Millisecond

	metrics := NewLocalMetrics(logger, logger, nil, cfg)
	defer metrics.Stop(logger)

	module := &RuntimeGoNakamaModule{metrics: metrics}
	module.MetricsCounterAdd("panic_counter", nil, 1)

	time.Sleep(flushWait)
	module.MetricsCounterAdd("panic_counter", nil, -1)

	time.Sleep(flushWait)
```
Good idea. I tried to replace the last sleep with an explicit `metrics.Stop(logger)`, but when the fix is removed (and therefore the Prometheus collector panics) something deadlocks and the `Stop()` call never exits. So I'll leave it with the 2 sleeps.
when tracking custom metrics collection errors
No description provided.