reverseproxy: fix active health check counter tracking, add initially_unhealthy option #7625
petersalas wants to merge 2 commits into caddyserver:master
Conversation
reverseproxy: fix active health check counter tracking, add initially_unhealthy option

Fix three bugs in active health checks:

- `countHealthPass`/`countHealthFail` now reset the opposite counter, ensuring only truly consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold.
- Context cancellation (e.g. during shutdown/reload) is no longer counted as a health check failure.
- Initial health state on provision uses the pass/fail counters (which survive reloads via the global hosts pool), so proven-healthy hosts stay healthy across config reloads.

Also add the `health_initially_unhealthy` option, which marks backends as unhealthy until they pass active health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
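For readers unfamiliar with the counters being described, here is a minimal Go sketch of what "resetting the opposite counter" means for consecutive-only thresholds. The field and method shapes (`activePasses`, `activeFails`, the signatures) are illustrative assumptions and may not match the actual PR diff.

```go
// Illustrative sketch only; names and signatures may differ from the PR.
package reverseproxy

import "sync/atomic"

// Host holds per-upstream counters shared across handlers (hypothetical subset).
type Host struct {
	activePasses int64
	activeFails  int64
}

// countHealthPass records a successful active check and, per this PR,
// resets the fail counter so only consecutive results reach the threshold.
func (h *Host) countHealthPass() int64 {
	atomic.StoreInt64(&h.activeFails, 0)
	return atomic.AddInt64(&h.activePasses, 1)
}

// countHealthFail records a failed active check and resets the pass counter.
func (h *Host) countHealthFail() int64 {
	atomic.StoreInt64(&h.activePasses, 0)
	return atomic.AddInt64(&h.activeFails, 1)
}
```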
Sounds like this relates to #7517?
From looking at that PR, maybe tangentially? Somewhat embarrassingly, the changes in this PR are about a year old and I’m only now finally taking a moment to see whether there’s interest in upstreaming them, so it’s possible I’m missing the forest for the trees if there’s a broader rework happening 😅 In the PR description, by a “dynamic” addition of a backend I meant some other process/module reconfiguring what is internally a static list of upstreams.
I think the existing behaviour could be desirable, for sampling via error rates. I think maybe this should be a configurable mode instead. The other changes seem pretty reasonable to me.
One thing that still looks problematic to me is the new initial-state logic for active checks.
So, independent of the consecutive-vs-cumulative discussion Francis raised, I think this part still needs adjustment before merge. The other pieces seem reasonable to me.
The intent is that the shared observations carry over across reloads. Without the evaluation during provision, I think the observations are still shared; they just wait for the first health check after provision to be evaluated. For example, without this change, if 3 fails are required and there’s a history of failures, the first failure after provision will still mark the host unhealthy. My sense is that there shouldn’t be anything special about the first health check request after a reload?
I think that is exactly the problem. A concrete example would be two reverse proxy handlers that point at the same upstream host but use different health check URIs.
I agree, there should not be. I think that just shows that the deeper issue is the cross-handler sharing itself; provision-time evaluation makes that coupling visible earlier, rather than preserving a genuinely per-handler model.
Ah, yes, with different URIs the shared state is (already?) wrong, and this change certainly relies on the shared state in a new way / more than it did previously. If the active counters were per-host-URI instead of per-host, would you still have concerns? It seems to me like the lack of URI-specificity might be the fundamental underlying issue, but I’m curious whether I’m missing something else before trying to consider possible solutions.
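To make the per-host-URI idea concrete, here is a hypothetical Go sketch of keying the shared active-check counters by upstream address plus health-check URI rather than by address alone. None of these names are taken from the Caddy source; the real shared pool in Caddy is structured differently, so this only illustrates the keying idea.

```go
package reverseproxy

import "sync"

// hostCounters is a stand-in for the shared per-upstream active-check state.
type hostCounters struct {
	activePasses, activeFails int64
}

// activeCounters keys the shared state by upstream address plus health-check
// URI, so two handlers probing different URIs on the same host keep separate
// pass/fail history.
var activeCounters sync.Map // map[string]*hostCounters

func countersFor(upstreamAddr, healthURI string) *hostCounters {
	v, _ := activeCounters.LoadOrStore(upstreamAddr+"|"+healthURI, &hostCounters{})
	return v.(*hostCounters)
}
```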
@francislavoie IMO the hard thing to reason about with the current behavior is that there's no time limit on the counters, so the propensity to change status can accumulate over an arbitrarily long period of time. A policy with a time limit on the counted observations would be easier to reason about. But if you think it's still useful without the time limit (or you just want to preserve the existing behavior by default), I'm happy to preserve it!
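As an illustration of the time-limited policy alluded to in the comment above, here is a rough Go sketch of a failure counter that only counts failures within a trailing window. This is purely hypothetical and not part of the PR or of Caddy.

```go
package reverseproxy

import (
	"sync"
	"time"
)

// windowedFails marks an upstream unhealthy only if at least `threshold`
// failures were observed within the trailing `window`. Illustrative only.
type windowedFails struct {
	mu        sync.Mutex
	window    time.Duration
	threshold int
	times     []time.Time
}

// recordFail notes a failed active check at time now.
func (w *windowedFails) recordFail(now time.Time) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.times = append(w.times, now)
	w.prune(now)
}

// unhealthy reports whether the failure threshold is met within the window.
func (w *windowedFails) unhealthy(now time.Time) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.prune(now)
	return len(w.times) >= w.threshold
}

// prune drops failures older than the window; caller must hold mu.
func (w *windowedFails) prune(now time.Time) {
	cutoff := now.Add(-w.window)
	kept := w.times[:0]
	for _, t := range w.times {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	w.times = kept
}
```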
This changes a few behaviors in active health checks:
- `countHealthPass`/`countHealthFail` now reset the opposite counter, ensuring only consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold.
- Context cancellation (e.g. during shutdown/reload) is no longer counted as a health check failure.
- Upstreams are no longer reinitialized as healthy on provision; instead, the initial health state is derived from the pass/fail counters on the shared `Host`, which survive config reloads.

Also add the `health_initially_unhealthy` option, which marks backends as unhealthy until they pass active health checks -- see #6410, option 4. Despite the discussion in that issue, the current behavior is that backends are assumed to be healthy. This can be demonstrated with a server that hangs for 10 seconds before failing: until the first health check fails, Caddy will forward requests along.

All together, these changes allow possibly-unhealthy backends to be dynamically and safely added to a reverse proxy pool without sending traffic to new or previously-unhealthy backends before their next health check.
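As a rough illustration of how the new option might surface in the active health check configuration, the sketch below adds a boolean field to the config struct; the actual field name, JSON key, and provisioning logic in the PR may differ.

```go
package reverseproxy

// ActiveHealthChecks is shown here with only the hypothetical new field;
// all existing fields are elided, and the real JSON key in the PR may differ.
type ActiveHealthChecks struct {
	// InitiallyUnhealthy, when true, marks upstreams as unhealthy on provision
	// until they pass active health checks, instead of assuming them healthy.
	InitiallyUnhealthy bool `json:"initially_unhealthy,omitempty"`
}
```

In Caddyfile terms, this field would correspond to the `health_initially_unhealthy` subdirective named in the PR title.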