fix(cluster): per-entry locking for shared server health state (#90)
Conversation
Force-pushed 84f08d0 to 39e1d41
The mutex will make parallel connections stall, which will hurt performance.
Yes, I agree, but that's a much larger change. If contention on this mutex becomes a problem, it means the race condition is real. I can create a follow-up issue for the larger fix.
Indeed. Yes, please open the follow-up issue for the larger fix.
Pull request overview
This PR updates the cluster dialer in DialClusterContext to correctly persist per-server failure state (lastError) across dial attempts and to prevent concurrent cloned clients from racing on shared dialer state.
Changes:
- Switch cluster server iteration to index-based iteration (`for i := range servers`) so `lastError` updates persist on the slice elements (see the sketch below).
- Add a mutex guarding the `servers` slice captured by the dialer closure, to avoid concurrent access across cloned clients.
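To make the first bullet concrete, here is a minimal, self-contained sketch of the value-copy pitfall; the `connectionEntry` type and its fields are assumptions for illustration, not the library's actual definitions.

```go
package main

import (
	"fmt"
	"time"
)

// connectionEntry is a stand-in for the dialer's per-server record;
// the field names here are illustrative assumptions.
type connectionEntry struct {
	addr      string
	lastError time.Time
}

func main() {
	servers := []connectionEntry{{addr: "kmip-1:5696"}, {addr: "kmip-2:5696"}}

	// Buggy pattern: s is a copy of the slice element, so the write
	// is discarded as soon as the iteration moves on.
	for _, s := range servers {
		s.lastError = time.Now()
	}
	fmt.Println(servers[0].lastError.IsZero()) // true: the update did not persist

	// Fixed pattern: index-based iteration mutates the element in place.
	for i := range servers {
		servers[i].lastError = time.Now()
	}
	fmt.Println(servers[0].lastError.IsZero()) // false: the update persisted
}
```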
Force-pushed 1471a50 to 97eb4ef
@ldesauw I fixed the mutex contention in a different (and simpler) manner than discussed earlier.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
The cluster dialer closure is captured by the Client and shared across
cloned clients via CloneCtx, so the servers slice it references is
shared state. Previously `servers` was a slice of values, which had
two bugs:
1. `s.lastError = time.Now()` in the range loop wrote to a loop-
variable copy instead of the slice element, so the retry-timeout
skip logic was ineffective — failed servers were retried on
every dial attempt.
2. Without synchronization, concurrent dials from different clones
race on lastError reads and writes.
Switch `servers` to []*connectionEntry and give each entry its own
sync.Mutex scoped to lastError. Locks are released around DialContext
so concurrent reconnections across clones are not serialized.
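A sketch of how that per-entry locking can look; the types, field names, and dial flow below are reconstructed from this description as assumptions, not the actual kmipclient implementation.

```go
package cluster

import (
	"context"
	"errors"
	"net"
	"sync"
	"time"
)

// connectionEntry is an assumed shape of the per-server record.
type connectionEntry struct {
	addr      string
	mu        sync.Mutex // scoped to lastError only
	lastError time.Time
}

// dialCluster is a hypothetical helper showing the locking discipline:
// the mutex is held only while reading or writing lastError and is
// released around the (potentially slow) dial, so concurrent clones
// sharing the same entries are not serialized.
func dialCluster(ctx context.Context, servers []*connectionEntry, retryTimeout time.Duration,
	dial func(ctx context.Context, addr string) (net.Conn, error),
) (net.Conn, error) {
	var lastErr error
	for _, s := range servers {
		s.mu.Lock()
		skip := !s.lastError.IsZero() && time.Since(s.lastError) < retryTimeout
		s.mu.Unlock()
		if skip {
			continue // failed recently; do not retry until retryTimeout elapses
		}

		conn, err := dial(ctx, s.addr) // no lock held here
		if err == nil {
			s.mu.Lock()
			s.lastError = time.Time{} // clear the failure mark on success
			s.mu.Unlock()
			return conn, nil
		}
		lastErr = err

		s.mu.Lock()
		s.lastError = time.Now() // persists across attempts: entries are pointers
		s.mu.Unlock()
	}
	if lastErr == nil {
		lastErr = errors.New("no cluster server available")
	}
	return nil, lastErr
}
```

Keeping the lock scoped to `lastError`, rather than one mutex around the whole dial loop, is what addresses the earlier concern about stalling parallel connections.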
Also:
- Guard against empty addrs (previously panicked on servers[0]).
- Lowercase the pool-failure error string.
- Rename slog key "last error" -> "last_error".
- Add -race test exercising concurrent Clone() against a cluster
with one server down, to catch regressions on the shared state.
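The added test itself is not reproduced here; the following is only a rough sketch of the pattern such a -race test can exercise (many concurrent "clones" sharing the same per-entry-locked slice, with the first server failing). It reuses the hypothetical connectionEntry type from the sketch above, and the test name and internals are illustrative, not the PR's actual test.

```go
package cluster

import (
	"sync"
	"testing"
	"time"
)

// TestSharedEntries_ConcurrentDial is a hypothetical stand-in for the
// PR's concurrent-Clone test.
func TestSharedEntries_ConcurrentDial(t *testing.T) {
	servers := []*connectionEntry{
		{addr: "down:5696"}, // simulated unreachable server
		{addr: "up-1:5696"},
		{addr: "up-2:5696"},
	}

	var wg sync.WaitGroup
	for i := 0; i < 32; i++ { // 32 concurrent "clones" share the same entries
		wg.Add(1)
		go func() {
			defer wg.Done()
			for _, s := range servers {
				s.mu.Lock()
				recentlyFailed := !s.lastError.IsZero() && time.Since(s.lastError) < time.Minute
				s.mu.Unlock()
				if recentlyFailed {
					continue
				}
				if s.addr == "down:5696" {
					s.mu.Lock()
					s.lastError = time.Now() // record the failure; -race flags unsynchronized access
					s.mu.Unlock()
					continue
				}
				return // "connected" to a healthy server
			}
			t.Error("no server available")
		}()
	}
	wg.Wait()
}
```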
Signed-off-by: Pierre-Henri Symoneaux <pierre-henri.symoneaux@ovhcloud.com>
Force-pushed 97eb4ef to e30dc33
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
Thanks, I checked and it looks good.
Summary
- Fix a bug in `DialClusterContext` where iterating `servers` by value made `lastError` writes ineffective: failed servers were retried on every dial regardless of the configured `retryTimeout`.
- The dialer closure is captured by the `Client` and shared across clones via `CloneCtx`, so concurrent dials raced on `lastError` reads and writes.
- Switch `servers` to `[]*connectionEntry` and give each entry its own `sync.Mutex` scoped to `lastError`. Locks are released around `DialContext` so concurrent reconnections are not serialized.
- Guard against empty `addrs` (previously panicked on `servers[0]`).
- Rename the slog key `"last error"` to `"last_error"`.

Test plan
- `go build ./...` passes
- `go test -race ./kmipclient/...` passes
- `TestClientConnectionPool_ConcurrentDial`: 32 concurrent `Clone()` calls against a 3-server cluster with the first server down; the race detector catches regressions on `lastError`.
- `TestDialCluster_EmptyAddrs`: verifies `nil` and `[]string{}` return an error instead of panicking.
- `retryTimeout` duration.
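For the `TestDialCluster_EmptyAddrs` case, the guard presumably looks something like the sketch below; the function name, signature, and error message are assumptions, not the library's actual API.

```go
package cluster

import (
	"context"
	"errors"
	"net"
)

// dialClusterGuarded illustrates the empty-addrs check: return an error
// up front instead of later indexing servers[0] and panicking.
// Hypothetical signature; the real DialClusterContext differs.
func dialClusterGuarded(ctx context.Context, addrs []string) (net.Conn, error) {
	if len(addrs) == 0 {
		return nil, errors.New("kmipclient: at least one server address is required")
	}
	// ... build one *connectionEntry per address and dial as sketched earlier.
	return nil, errors.New("sketch only")
}
```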