fix(cluster): per-entry locking for shared server health state (#90)
Conversation
Force-pushed 84f08d0 to 39e1d41
The mutex will make parallel connections stall, which will hurt performance.
Yes, I agree, but that's a much larger change. If contention on this mutex becomes a problem, it means the race condition is real. I can create a follow-up issue for the larger fix.
Indeed. Yes, please open the follow-up issue for the larger fix.
Pull request overview
This PR updates the cluster dialer in DialClusterContext to correctly persist per-server failure state (lastError) across dial attempts and to prevent concurrent cloned clients from racing on shared dialer state.
Changes:
- Switch cluster server iteration to index-based iteration (`for i := range servers`) so `lastError` updates persist on the slice elements (see the sketch below).
- Add a mutex guarding the `servers` slice captured by the dialer closure, to avoid concurrent access across cloned clients.
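To make the first bullet concrete, here is a minimal, self-contained sketch of the value-copy pitfall; the `connectionEntry` type and its fields are assumptions for illustration, not the library's actual definitions.

```go
package main

import (
	"fmt"
	"time"
)

// connectionEntry is a stand-in for the dialer's per-server record;
// the field names here are illustrative assumptions.
type connectionEntry struct {
	addr      string
	lastError time.Time
}

func main() {
	servers := []connectionEntry{{addr: "kmip-1:5696"}, {addr: "kmip-2:5696"}}

	// Buggy pattern: s is a copy of the slice element, so the write
	// is discarded as soon as the iteration moves on.
	for _, s := range servers {
		s.lastError = time.Now()
	}
	fmt.Println(servers[0].lastError.IsZero()) // true: the update did not persist

	// Fixed pattern: index-based iteration mutates the element in place.
	for i := range servers {
		servers[i].lastError = time.Now()
	}
	fmt.Println(servers[0].lastError.IsZero()) // false: the update persisted
}
```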
Force-pushed 1471a50 to 97eb4ef
@ldesauw I fixed the mutex contention in a different (and simpler) manner than discussed earlier.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
The cluster dialer closure is captured by the Client and shared across
cloned clients via CloneCtx, so the servers slice it references is
shared state. Previously `servers` was a slice of values, which had
two bugs:
1. `s.lastError = time.Now()` in the range loop wrote to a loop-
variable copy instead of the slice element, so the retry-timeout
skip logic was ineffective — failed servers were retried on
every dial attempt.
2. Without synchronization, concurrent dials from different clones
race on lastError reads and writes.
Switch `servers` to []*connectionEntry and give each entry its own
sync.Mutex scoped to lastError. Locks are released around DialContext
so concurrent reconnections across clones are not serialized.
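A sketch of how that per-entry locking can look; the types, field names, and dial flow below are reconstructed from this description as assumptions, not the actual kmipclient implementation.

```go
package cluster

import (
	"context"
	"errors"
	"net"
	"sync"
	"time"
)

// connectionEntry is an assumed shape of the per-server record.
type connectionEntry struct {
	addr      string
	mu        sync.Mutex // scoped to lastError only
	lastError time.Time
}

// dialCluster is a hypothetical helper showing the locking discipline:
// the mutex is held only while reading or writing lastError and is
// released around the (potentially slow) dial, so concurrent clones
// sharing the same entries are not serialized.
func dialCluster(ctx context.Context, servers []*connectionEntry, retryTimeout time.Duration,
	dial func(ctx context.Context, addr string) (net.Conn, error),
) (net.Conn, error) {
	var lastErr error
	for _, s := range servers {
		s.mu.Lock()
		skip := !s.lastError.IsZero() && time.Since(s.lastError) < retryTimeout
		s.mu.Unlock()
		if skip {
			continue // failed recently; do not retry until retryTimeout elapses
		}

		conn, err := dial(ctx, s.addr) // no lock held here
		if err == nil {
			s.mu.Lock()
			s.lastError = time.Time{} // clear the failure mark on success
			s.mu.Unlock()
			return conn, nil
		}
		lastErr = err

		s.mu.Lock()
		s.lastError = time.Now() // persists across attempts: entries are pointers
		s.mu.Unlock()
	}
	if lastErr == nil {
		lastErr = errors.New("no cluster server available")
	}
	return nil, lastErr
}
```

Keeping the lock scoped to `lastError`, rather than one mutex around the whole dial loop, is what addresses the earlier concern about stalling parallel connections.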
Also:
- Guard against empty addrs (previously panicked on servers[0]).
- Lowercase the pool-failure error string.
- Rename slog key "last error" -> "last_error".
- Add -race test exercising concurrent Clone() against a cluster
with one server down, to catch regressions on the shared state.
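The added test itself is not reproduced here; the following is only a rough sketch of the pattern such a -race test can exercise (many concurrent "clones" sharing the same per-entry-locked slice, with the first server failing). It reuses the hypothetical connectionEntry type from the sketch above, and the test name and internals are illustrative, not the PR's actual test.

```go
package cluster

import (
	"sync"
	"testing"
	"time"
)

// TestSharedEntries_ConcurrentDial is a hypothetical stand-in for the
// PR's concurrent-Clone test.
func TestSharedEntries_ConcurrentDial(t *testing.T) {
	servers := []*connectionEntry{
		{addr: "down:5696"}, // simulated unreachable server
		{addr: "up-1:5696"},
		{addr: "up-2:5696"},
	}

	var wg sync.WaitGroup
	for i := 0; i < 32; i++ { // 32 concurrent "clones" share the same entries
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, s := range servers {
				s.mu.Lock()
				recentlyFailed := !s.lastError.IsZero() && time.Since(s.lastError) < time.Minute
				s.mu.Unlock()
				if recentlyFailed {
					continue
				}
				if s.addr == "down:5696" {
					s.mu.Lock()
					s.lastError = time.Now() // record the failure; -race flags unsynchronized access
					s.mu.Unlock()
					continue
				}
				return // "connected" to a healthy server
			}
			t.Error("no server available")
		}()
	}
	wg.Wait()
}
```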
Signed-off-by: Pierre-Henri Symoneaux <pierre-henri.symoneaux@ovhcloud.com>
Force-pushed 97eb4ef to e30dc33
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Thanks, I checked and it looks good.
Summary
- Fix a bug in `DialClusterContext` where iterating `servers` by value made `lastError` writes ineffective: failed servers were retried on every dial regardless of the configured `retryTimeout`.
- The dialer closure is captured by the `Client` and shared across clones via `CloneCtx`, so concurrent dials raced on `lastError` reads and writes.
- Switch `servers` to `[]*connectionEntry` and give each entry its own `sync.Mutex` scoped to `lastError`. Locks are released around `DialContext` so concurrent reconnections are not serialized.
- Guard against empty `addrs` (previously panicked on `servers[0]`).
- Rename the slog key `"last error"` to `"last_error"`.

Test plan
- `go build ./...` passes
- `go test -race ./kmipclient/...` passes
- `TestClientConnectionPool_ConcurrentDial`: 32 concurrent `Clone()` calls against a 3-server cluster with the first server down; the race detector catches regressions on `lastError`.
- `TestDialCluster_EmptyAddrs`: verifies `nil` and `[]string{}` return an error instead of panicking.
- `retryTimeout` duration.
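For the `TestDialCluster_EmptyAddrs` case, the guard presumably looks something like the sketch below; the function name, signature, and error message are assumptions, not the library's actual API.

```go
package cluster

import (
	"context"
	"errors"
	"net"
)

// dialClusterGuarded illustrates the empty-addrs check: return an error
// up front instead of later indexing servers[0] and panicking.
// Hypothetical signature; the real DialClusterContext differs.
func dialClusterGuarded(ctx context.Context, addrs []string) (net.Conn, error) {
	if len(addrs) == 0 {
		return nil, errors.New("kmipclient: at least one server address is required")
	}
	// ... build one *connectionEntry per address and dial as sketched earlier.
	return nil, errors.New("sketch only")
}
```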