fix(metrics): negative increment to a counter should not cause panics (NK-1729) #2492
Conversation
Pull request overview
This PR prevents runtime custom counter updates with negative deltas from triggering panics during metrics collection/reporting.
Changes:
- Guard `LocalMetrics.CustomCounter` against negative deltas by ignoring them.
- Add a regression test to ensure a negative counter delta does not crash the reporter during background flushing.
- Document the behavior change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| server/metrics.go | Drops negative deltas for custom counters to avoid Prometheus/tally panics. |
| server/metrics_test.go | Adds regression coverage for negative counter deltas during reporter flush. |
| CHANGELOG.md | Notes the fix in the Unreleased “Fixed” section. |
Comments suppressed due to low confidence (1)
server/metrics_test.go:41
- This test relies on fixed `time.Sleep` intervals to “hopefully” span two tally flush cycles, but it never asserts that a flush actually happened after the negative delta. On slow/loaded CI runners the flush may be delayed and the test could pass without exercising the panic path. Consider making this deterministic (e.g., drive/trigger a flush explicitly, or loop with a bounded timeout until a scrape/flush has occurred) and avoid long fixed sleeps to reduce flakiness and runtime.
```go
	time.Sleep(1500 * time.Millisecond)
	module.MetricsCounterAdd("panic_counter", nil, -1)
	time.Sleep(1500 * time.Millisecond)
}
```
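One way to make that deterministic, as a sketch only (`scrapeContainsCounter` is a hypothetical helper standing in for however the test would observe a completed flush, e.g. by hitting the Prometheus scrape endpoint), could look like:

```go
// Sketch: poll with a bounded deadline for evidence that the first flush
// happened, then apply the negative delta, instead of sleeping fixed 1.5s
// intervals.
deadline := time.Now().Add(5 * time.Second)
for !scrapeContainsCounter("panic_counter") { // hypothetical helper, not part of the PR
	if time.Now().After(deadline) {
		t.Fatal("counter was never flushed before the deadline")
	}
	time.Sleep(50 * time.Millisecond)
}
module.MetricsCounterAdd("panic_counter", nil, -1)
```

The thread below explains why the author ultimately kept the sleep-based approach.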
```go
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
```
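The quoted hunk is cut off by the review view; a minimal sketch of the drop-on-negative guard described in the overview (the exact merged body, including the error metric discussed below, is not fully shown here) would be:

```go
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
		// Prometheus counters are monotonic; flushing a negative delta would
		// panic in the collector, so the update is dropped instead.
		return
	}
	// ... existing increment path, elided in the review snippet ...
}
```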
I was going to suggest something similar, do you think we should emit a warn log instead of silently dropping?
I initially had a log, but then took it out, because the counter increment can be a hot path and if it keeps happening the logs will be spammed.
If not a log, then maybe a `counter_errors` metric? That way there will be visibility into which counters failed to increment.
A dedicated metric seems like the best option, perhaps `custom_counter_negative_error`?
We can't use the `custom_` prefix for an internal metric, because that prefix is reserved for custom metrics. Maybe:

```
metrics_collection_errors_count{name=custom_counter_abc,err=negative_increment} 1
```
```go
)

func TestMetricsCounterAddNegativeDoesNotPanics(t *testing.T) {
	logger := zap.NewNop()
```
Given that this test is time-dependent and could potentially be flaky, should we `t.Skip()` it as part of the test suite?
It reliably fails without the fix:
- tally is configured with a 1s flush interval (sketched below)
- the waits are 1.5 seconds to guarantee that a flush has occurred
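For context on that cadence, here is a rough illustration of how a tally root scope reports on a fixed interval (Nakama's `LocalMetrics` wires this up internally from `ReportingFreqSec`; the exact setup and the `prometheusReporter` value are not shown in this thread):

```go
// Illustration only: tally flushes counters to its reporter once per
// interval, so a test can only observe the effect of an update after at
// least one full interval has elapsed — hence the 1.5s waits above.
scope, closer := tally.NewRootScope(tally.ScopeOptions{
	CachedReporter: prometheusReporter, // placeholder for the tally-prometheus reporter
}, time.Second)
defer closer.Close()
_ = scope
```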
I'm a bit wary of adding tests that depend on sleeps, as it increases test execution time and can pile up over time, although I'm aware it's sometimes unavoidable. If we want to keep it as part of the suite, please infer the required sleep duration from the config value instead of hardcoding it.
Pushed a change deriving the sleeps from `ReportingFreqSec`.

> it'll increase the test execution time and it can pile up over time

There is no `tally.Flush()` I could use, even if I created a private constructor for the metrics collector. The wait also has to be above 1 second, because the config option has second resolution.
The metric name is `metrics_collection_errors_count`.
Pull request overview
This PR addresses a Prometheus/tally panic scenario caused by applying a negative delta to a custom runtime counter, and adds coverage + release notes for the fix.
Changes:
- Guard `LocalMetrics.CustomCounter` against negative deltas by dropping the update and incrementing an internal error counter.
- Add a regression test ensuring a negative `MetricsCounterAdd` does not crash metrics collection.
- Document the behavior change in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| server/metrics.go | Prevents negative custom counter deltas from reaching Prometheus counters; emits an internal error metric instead. |
| server/metrics_test.go | Adds a regression test that exercises the negative-delta path and ensures no panic occurs during reporting. |
| CHANGELOG.md | Notes the fix under Unreleased/Fixed. |
server/metrics.go, hunk `@@ -523,6 +523,11 @@` (after `StorageWriteRejectCount`):

```go
// CustomCounter adds the given delta to a counter with the specified name and tags.
func (m *LocalMetrics) CustomCounter(name string, tags map[string]string, delta int64) {
	if delta < 0 {
		m.PrometheusScope.Tagged(map[string]string{"name": name, "err": "negative_increment"}).Counter("metrics_collection_errors_count").Inc(1)
```
Unlikely, but valid. Pushed a change where `metrics_collection_errors_count` is created in a limited scope to avoid high cardinality in pathological cases.
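The limited-scope change itself isn't quoted in this thread; as a hedged illustration, one way to bound cardinality is to stop tagging the error counter with the caller-supplied counter name, so arbitrary names can't mint new label combinations:

```go
// Illustration only — not necessarily the merged code. The "name" tag is
// dropped (or restricted) so a pathological caller creating many distinct
// custom counter names cannot inflate the label cardinality of the
// internal error metric.
m.PrometheusScope.
	Tagged(map[string]string{"err": "negative_increment"}).
	Counter("metrics_collection_errors_count").
	Inc(1)
```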
```go
	cfg.Metrics.ReportingFreqSec = 1
	reportingInterval := time.Duration(cfg.Metrics.ReportingFreqSec) * time.Second
	flushWait := reportingInterval + 200*time.Millisecond

	metrics := NewLocalMetrics(logger, logger, nil, cfg)
	defer metrics.Stop(logger)

	module := &RuntimeGoNakamaModule{metrics: metrics}
	module.MetricsCounterAdd("panic_counter", nil, 1)

	time.Sleep(flushWait)
	module.MetricsCounterAdd("panic_counter", nil, -1)

	time.Sleep(flushWait)
```
Good idea. I tried to replace the last sleep with an explicit `metrics.Stop(logger)`, but when the fix is removed (and therefore the Prometheus collector panics) something deadlocks and the `Stop()` call never exits. So I'll leave it with the 2 sleeps.
when tracking custom metrics collection errors
No description provided.