fix(metrics): negative increment to a counter should not cause panics (NK-1729) #2492

Merged: redbaron merged 4 commits into master from metrics-negative-counter on Apr 14, 2026

Conversation

@redbaron (Contributor)

No description provided.

redbaron requested review from Copilot, sesposito and zyro on April 14, 2026 at 14:48
@linear (bot) commented Apr 14, 2026

redbaron force-pushed the metrics-negative-counter branch from a518ca4 to 495cc26 on April 14, 2026 at 14:49

Copilot AI left a comment

Pull request overview

This PR prevents runtime custom counter updates with negative deltas from triggering panics during metrics collection/reporting.

Changes:

  • Guard LocalMetrics.CustomCounter against negative deltas by ignoring them (a minimal sketch follows this list).
  • Add a regression test to ensure a negative counter delta does not crash the reporter during background flushing.
  • Document the behavior change in CHANGELOG.md.
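
A minimal sketch of the guard as described at this point in the review; only the if delta < 0 condition is visible in the diff context further down, so the early-return body and the comments here are assumptions:

	// CustomCounter adds the given delta to a counter with the specified name and tags.
	func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
		// Prometheus counters must never decrease; a negative delta would panic
		// inside the reporter when the metric is flushed, so drop it here.
		if delta < 0 {
			return
		}
		// ... existing increment path unchanged ...
	}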

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File: Description
server/metrics.go: Drops negative deltas for custom counters to avoid Prometheus/tally panics.
server/metrics_test.go: Adds regression coverage for negative counter deltas during reporter flush.
CHANGELOG.md: Notes the fix in the Unreleased “Fixed” section.

Comments suppressed due to low confidence (1)

server/metrics_test.go:41

  • This test relies on fixed time.Sleep intervals to “hopefully” span two tally flush cycles, but it never asserts that a flush actually happened after the negative delta. On slow/loaded CI runners the flush may be delayed and the test could pass without exercising the panic path. Consider making this deterministic (e.g., drive/trigger a flush explicitly, or loop with a bounded timeout until a scrape/flush has occurred) and avoid long fixed sleeps to reduce flakiness and runtime. A generic sketch of the bounded-wait pattern follows the snippet below.
	time.Sleep(1500 * time.Millisecond)
	module.MetricsCounterAdd("panic_counter", nil, -1)

	time.Sleep(1500 * time.Millisecond)
}
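
A generic Go sketch of the bounded-wait pattern suggested above; it is not part of the PR, and the polled condition is hypothetical, since (as discussed further down in this thread) the metrics collector exposes no explicit flush or scrape hook:

	package server

	import (
		"testing"
		"time"
	)

	// waitUntil polls cond every 50ms until it returns true or the timeout expires.
	// cond would need some observable signal that a flush/scrape has happened,
	// which is exactly what this test cannot currently get from tally.
	func waitUntil(t *testing.T, timeout time.Duration, cond func() bool) {
		t.Helper()
		deadline := time.Now().Add(timeout)
		for time.Now().Before(deadline) {
			if cond() {
				return
			}
			time.Sleep(50 * time.Millisecond)
		}
		t.Fatalf("condition not met within %v", timeout)
	}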


Comment thread server/metrics.go
Comment on lines 524 to +526
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
if delta < 0 {
Member
I was going to suggest something similar, do you think we should emit a warn log instead of silently dropping?

Contributor Author (redbaron)
I initially had a log, but took it out because counter increments can be a hot path, and if this keeps happening the logs will be spammed.

If not a log, then maybe a counter_errors metric? That way there will be visibility into which counters failed to increment.

sesposito (Member) commented Apr 14, 2026
Dedicated metric seems like the best option, perhaps custom_counter_negative_error?

redbaron (Contributor, Author) commented Apr 14, 2026
We can't use the custom_ prefix for an internal metric, because it is reserved for custom metrics. Maybe:

metrics_collection_errors_count{name=custom_counter_abc,err=negative_increment} 1
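
Roughly, with tally, something along these lines (an illustrative sketch of the proposal, not the final diff; the scope handling was changed later in this review):

	if delta < 0 {
		// Record the failed increment on an internal error counter instead of
		// logging on a potentially hot path; the offending counter name and the
		// error kind go into the tags.
		m.PrometheusScope.Tagged(map[string]string{
			"name": name,
			"err":  "negative_increment",
		}).Counter("metrics_collection_errors_count").Inc(1)
		return
	}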

Member
👍

Comment thread server/metrics_test.go
)

func TestMetricsCounterAddNegativeDoesNotPanics(t *testing.T) {
logger := zap.NewNop()
Member

Given that this test is time dependent and could potentially be flaky, should we t.Skip() it as part of the test suite?

Contributor Author (redbaron)

It reliably fails without the fix:

  • tally is configured with a 1s flush interval
  • the waits are 1.5 seconds, to guarantee that a flush has occurred

Member

I'm a bit wary of adding tests that depend on sleep, as it'll increase the test execution time and can pile up over time, although I'm aware it's sometimes unavoidable. If we want to keep it part of the suite, please infer the required sleep duration from the config value instead of hardcoding it.

Contributor Author (redbaron)

Pushed a change deriving the sleeps from ReportingFreqSec.

it'll increase the test execution time and it can pile up over time

There is no tally.Flush() I could use, even if I created a private constructor for the metrics collector. The wait also has to be above 1 second, because the config option has one-second resolution.

Copilot AI left a comment

Pull request overview

This PR addresses a Prometheus/tally panic scenario caused by applying a negative delta to a custom runtime counter, and adds coverage + release notes for the fix.

Changes:

  • Guard LocalMetrics.CustomCounter against negative deltas by dropping the update and incrementing an internal error counter.
  • Add a regression test ensuring a negative MetricsCounterAdd does not crash metrics collection.
  • Document the behavior change in CHANGELOG.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File: Description
server/metrics.go: Prevents negative custom counter deltas from reaching Prometheus counters; emits an internal error metric instead.
server/metrics_test.go: Adds a regression test that exercises the negative-delta path and ensures no panic occurs during reporting.
CHANGELOG.md: Notes the fix under Unreleased/Fixed.

Comment thread server/metrics.go
@@ -523,6 +523,11 @@ func (m *LocalMetrics) StorageWriteRejectCount(tags map[string]string, delta int

// CustomCounter adds the given delta to a counter with the specified name and tags.
Comment thread server/metrics.go Outdated
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
if delta < 0 {
m.PrometheusScope.Tagged(map[string]string{"name": name, "err": "negative_increment"}).Counter("metrics_collection_errors_count").Inc(1)
Contributor Author (redbaron)

Unlikely, but valid. Pushed a change where metrics_collection_errors_count is created in a limited scope to avoid high cardinality in pathological cases.
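
One possible shape of that change, sketched for illustration only; the actual diff is not shown in this thread, so the fixed tag set and the customCounterErrors field name are guesses:

	// Created once (e.g. when LocalMetrics is constructed) with a fixed tag set
	// and no per-counter "name" tag, so misbehaving callers cannot inflate
	// label cardinality.
	m.customCounterErrors = m.PrometheusScope.Tagged(map[string]string{
		"err": "negative_increment",
	}).Counter("metrics_collection_errors_count")

	// ...and in CustomCounter:
	if delta < 0 {
		m.customCounterErrors.Inc(1)
		return
	}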

Comment thread server/metrics_test.go
Comment on lines +27 to +40
cfg.Metrics.ReportingFreqSec = 1
reportingInterval := time.Duration(cfg.Metrics.ReportingFreqSec) * time.Second
flushWait := reportingInterval + 200*time.Millisecond

metrics := NewLocalMetrics(logger, logger, nil, cfg)
defer metrics.Stop(logger)

module := &RuntimeGoNakamaModule{metrics: metrics}
module.MetricsCounterAdd("panic_counter", nil, 1)

time.Sleep(flushWait)
module.MetricsCounterAdd("panic_counter", nil, -1)

time.Sleep(flushWait)
Contributor Author (redbaron)

Good idea. I tried replacing the last sleep with an explicit metrics.Stop(logger), but when the fix is removed (and the Prometheus collector therefore panics) something deadlocks and the Stop() call never exits. So I'll leave it with the two sleeps.

redbaron merged commit 26421e8 into master on Apr 14, 2026
7 checks passed
redbaron deleted the metrics-negative-counter branch on April 15, 2026 at 13:53