Skip to content

fix(cluster): per-entry locking for shared server health state#90

Merged
phsym merged 1 commit intomainfrom
fix/cluster-dialer-value-copy
Apr 22, 2026
Merged

fix(cluster): per-entry locking for shared server health state#90
phsym merged 1 commit intomainfrom
fix/cluster-dialer-value-copy

Conversation

@phsym
Copy link
Copy Markdown
Collaborator

@phsym phsym commented Apr 16, 2026

Summary

  • Fix a value-copy bug in DialClusterContext where iterating servers by value made lastError writes ineffective — failed servers were retried on every dial regardless of the configured retryTimeout.
  • Fix a data race: the dialer closure is captured by the Client and shared across clones via CloneCtx, so concurrent dials raced on lastError reads and writes.
  • Switch servers to []*connectionEntry and give each entry its own sync.Mutex scoped to lastError. Locks are released around DialContext so concurrent reconnections are not serialized.
  • Guard against empty addrs (previously panicked on servers[0]).
  • Minor polish: lowercase the pool-failure error string, rename slog key "last error""last_error".

Test plan

  • go build ./... passes
  • go test -race ./kmipclient/... passes
  • New TestClientConnectionPool_ConcurrentDial — 32 concurrent Clone() calls against a 3-server cluster with the first server down; race detector catches regressions on lastError.
  • New TestDialCluster_EmptyAddrs — verifies nil and []string{} return an error instead of panicking.
  • Manual verification: with multiple cluster nodes, a failed node should be skipped for the configured retryTimeout duration.

@phsym phsym requested a review from a team as a code owner April 16, 2026 16:33
@phsym phsym force-pushed the fix/cluster-dialer-value-copy branch from 84f08d0 to 39e1d41 Compare April 16, 2026 16:33
Comment thread kmipclient/dialer_cluster.go Outdated
@ldesauw
Copy link
Copy Markdown
Contributor

ldesauw commented Apr 17, 2026

The mutex will make the parallel connection stall, which will break performance.
As the server is only use in read, a clone would be better.

@phsym
Copy link
Copy Markdown
Collaborator Author

phsym commented Apr 17, 2026

The mutex will make the parallel connection stall, which will break performance. As the server is only use in read, a clone would be better.

Yes I agree, but it's a much larger change. If this becomes an issue with this mutex fix, it means the race condition is real. I can create a follow-up issue for the larger fix

@ldesauw
Copy link
Copy Markdown
Contributor

ldesauw commented Apr 17, 2026

Yes I agree, but it's a much larger change. If this becomes an issue with this mutex fix, it means the race condition is real. I can create a follow-up issue for the larger fix

Indeed. Yes, please for the follow-up issue and the larger fix.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the cluster dialer in DialClusterContext to correctly persist per-server failure state (lastError) across dial attempts and to prevent concurrent cloned clients from racing on shared dialer state.

Changes:

  • Switch cluster server iteration to index-based iteration (for i := range servers) so lastError updates persist on the slice elements.
  • Add a mutex guarding the servers slice captured by the dialer closure to avoid concurrent access across cloned clients.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kmipclient/dialer_cluster.go Outdated
Comment thread kmipclient/dialer_cluster.go Outdated
Comment thread kmipclient/dialer_cluster.go Outdated
@phsym phsym force-pushed the fix/cluster-dialer-value-copy branch 2 times, most recently from 1471a50 to 97eb4ef Compare April 22, 2026 09:37
@phsym phsym requested a review from Copilot April 22, 2026 09:38
@phsym phsym changed the title fix(cluster): persist lastError updates in dialer range loop fix(cluster): per-entry locking for shared server health state Apr 22, 2026
@phsym phsym requested a review from ldesauw April 22, 2026 09:40
@phsym
Copy link
Copy Markdown
Collaborator Author

phsym commented Apr 22, 2026

@ldesauw I fixed the mutex contention in a different (and simpler) maner than discussed earlier

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kmipclient/dialer_cluster.go Outdated
Comment thread kmipclient/dialer_cluster_test.go
Comment thread kmipclient/dialer_cluster.go Outdated
Comment thread kmipclient/dialer_cluster.go Outdated
The cluster dialer closure is captured by the Client and shared across
cloned clients via CloneCtx, so the servers slice it references is
shared state. Previously `servers` was a slice of values, which had
two bugs:

  1. `s.lastError = time.Now()` in the range loop wrote to a loop-
     variable copy instead of the slice element, so the retry-timeout
     skip logic was ineffective — failed servers were retried on
     every dial attempt.
  2. Without synchronization, concurrent dials from different clones
     race on lastError reads and writes.

Switch `servers` to []*connectionEntry and give each entry its own
sync.Mutex scoped to lastError. Locks are released around DialContext
so concurrent reconnections across clones are not serialized.

Also:
  - Guard against empty addrs (previously panicked on servers[0]).
  - Lowercase the pool-failure error string.
  - Rename slog key "last error" -> "last_error".
  - Add -race test exercising concurrent Clone() against a cluster
    with one server down, to catch regressions on the shared state.
Signed-off-by: Pierre-Henri Symoneaux <pierre-henri.symoneaux@ovhcloud.com>
@phsym phsym force-pushed the fix/cluster-dialer-value-copy branch from 97eb4ef to e30dc33 Compare April 22, 2026 09:45
@phsym phsym requested a review from Copilot April 22, 2026 09:46
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@phsym phsym merged commit 89504f8 into main Apr 22, 2026
10 checks passed
@phsym phsym deleted the fix/cluster-dialer-value-copy branch April 22, 2026 09:52
@ldesauw
Copy link
Copy Markdown
Contributor

ldesauw commented Apr 22, 2026

@ldesauw I fixed the mutex contention in a different (and simpler) maner than discussed earlier

Thanks, I checked and it looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants