
reverseproxy: fix active health check counter tracking, add initially_unhealthy option #7625

Open
petersalas wants to merge 2 commits into caddyserver:master from fixie-ai:psalas/health-initial-behavior

Conversation


petersalas commented Apr 3, 2026

This changes a few behaviors in active health checks:

  • countHealthPass/countHealthFail now reset the opposite counter, ensuring only consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold (see the sketch after this list).
  • Initial health state on provision uses the pass/fail counters (which survive reloads via the global hosts pool), so the health statuses of a backend pool are preserved across reloads. Currently, after a reload, all Upstreams are reinitialized as healthy.
  • Context cancellation (e.g. during shutdown/reload) is no longer counted as a health check failure.
  • Health stats are now tracked as consecutive passes/fails per health-check path, rather than mixing paths on the same Host.
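
For illustration, a minimal sketch of the consecutive-counting change (field and method names mirror the existing Host counters, but the actual code in this PR may differ):

```go
package reverseproxy

import "sync/atomic"

// Simplified stand-in for the shared Host in the global hosts pool.
type Host struct {
	activePasses int64 // consecutive passing active health checks
	activeFails  int64 // consecutive failing active health checks
}

// countHealthPass records a pass and breaks any fail streak, so only
// an unbroken run of results can accumulate toward a threshold.
func (h *Host) countHealthPass() {
	atomic.AddInt64(&h.activePasses, 1)
	atomic.StoreInt64(&h.activeFails, 0)
}

// countHealthFail records a fail and breaks any pass streak.
func (h *Host) countHealthFail() {
	atomic.AddInt64(&h.activeFails, 1)
	atomic.StoreInt64(&h.activePasses, 0)
}
```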

Also add the health_initially_unhealthy option, which marks backends as unhealthy until they pass active health checks -- see #6410, option 4. Despite the discussion in that issue, the current behavior is that backends are assumed to be healthy. This can be demonstrated with a server that hangs for 10 seconds before failing: until the first health check fails, Caddy will keep forwarding requests to it.
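
A rough sketch of the provision-time seeding, using hypothetical names (not the actual Caddy identifiers):

```go
package reverseproxy

import "sync/atomic"

// upstreamHealth is a hypothetical stand-in for per-host state; the
// counters live on the shared Host and survive config reloads.
type upstreamHealth struct {
	activePasses int64       // consecutive passing active checks
	activeFails  int64       // consecutive failing active checks
	healthy      atomic.Bool // whether this backend receives traffic
}

// seedInitialHealth decides a backend's starting state on provision
// from the observations that survived the reload.
func seedInitialHealth(h *upstreamHealth, passes, fails int64, initiallyUnhealthy bool) {
	switch {
	case atomic.LoadInt64(&h.activeFails) >= fails:
		h.healthy.Store(false) // proven unhealthy: stay out of rotation
	case atomic.LoadInt64(&h.activePasses) >= passes:
		h.healthy.Store(true) // proven healthy: keep serving traffic
	case initiallyUnhealthy:
		h.healthy.Store(false) // no history yet: wait for a passing check
	default:
		h.healthy.Store(true) // current behavior: assume healthy
	}
}
```

With the option off, the default branch preserves today's assume-healthy behavior; with it on, a backend with no check history stays out of rotation until its first passing check.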

Taken together, these changes allow possibly-unhealthy backends to be added to a reverse proxy pool dynamically and safely, without sending traffic to new or previously-unhealthy backends before their next health check.

reverseproxy: fix active health check counter tracking, add initially_unhealthy option

Fix three bugs in active health checks:
- countHealthPass/countHealthFail now reset the opposite counter, ensuring
  only truly consecutive results accumulate toward thresholds. Previously,
  interleaved pass/fail results could incorrectly trip the threshold.
- Context cancellation (e.g. during shutdown/reload) is no longer counted
  as a health check failure.
- Initial health state on provision uses pass/fail counters (which survive
  reloads via the global hosts pool), so proven-healthy hosts stay healthy
  across config reloads.

Also add the health_initially_unhealthy option, which marks backends as
unhealthy until they pass active health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAassistant commented Apr 3, 2026

CLA assistant check
All committers have signed the CLA.

francislavoie (Member) commented

Sounds like this relates to #7517?

petersalas (Author) commented

> Sounds like this relates to #7517?

From looking at that PR, maybe tangentially?

Somewhat embarrassingly, the changes in this PR are about a year old and I’m only now taking a moment to see if there’s interest in upstreaming them, so it’s possible I’m missing the forest for the trees if there’s a broader rework happening 😅

In the PR description, by a “dynamic” addition of a backend I meant some other process/module reconfiguring what is internally a static list of upstreams.

francislavoie (Member) commented

> countHealthPass/countHealthFail now reset the opposite counter, ensuring only consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold.

I think the existing behaviour could be desirable, for sampling via error rates. I think maybe this should be a configurable mode instead.

The other changes seem pretty reasonable to me.

steadytao (Member) commented

One thing that still looks problematic to me is the new initial-state logic for active checks.

Active health state is documented as per-proxy-handler, but health_initially_unhealthy now seeds that state from activePasses / activeFails stored on the shared Host object in the global host pool. That means two reverse_proxy handlers targeting the same upstream can end up inheriting each other’s active-check history across provision/reload even if they have different health_passes / health_fails settings.

So independent of the consecutive-vs-cumulative discussion Francis raised, I think this part still needs adjustment before merge. The other pieces seem reasonable to me.

petersalas (Author) commented

> Active health state is documented as per-proxy-handler, but health_initially_unhealthy now seeds that state from activePasses / activeFails stored on the shared Host object in the global host pool. That means two reverse_proxy handlers targeting the same upstream can end up inheriting each other’s active-check history across provision/reload even if they have different health_passes / health_fails settings.

The intent is that the shared Host only stores observations about the host (consecutive passes/fails), which are independent of possibly different policies on handlers. On provision, the handler’s policy is evaluated against the existing observations, but my thought was that this was reasonable since nothing about the policy leaks back to the (possibly shared) Host.

Without the evaluation during provision, I think the observations are still shared; they just wait for the first health check after provision to be evaluated. For example, without this change, if 3 fails are required and there’s a history of failures, the first failure after provision will still mark the host unhealthy. My sense is that there shouldn’t be anything special about the first health check request after a reload?

steadytao (Member) commented

> The intent is that the shared Host only stores observations about the host (consecutive passes/fails), which are independent of possibly different policies on handlers.

I think that is exactly the problem. A concrete example would be two reverse_proxy handlers pointing at the same upstream but using different active-check criteria, for example handler A using health_uri /readyz with health_passes 1 and handler B using health_uri /livez with health_passes 3, both with health_initially_unhealthy. If A accumulates passes on the shared Host then after provision/reload B can inherit that history and start healthy immediately even though B’s own active-check criteria have never passed. So even if the shared Host is only storing “observations”, those observations are still being produced under one handler’s active-check regime and consumed by another handler’s policy which seems hard to square with the docs saying active health state is per-proxy-handler.
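
A self-contained toy model of that leak (all names hypothetical):

```go
package main

import "fmt"

// A pass streak accumulated under handler A's /readyz checks satisfies
// handler B's /livez policy at provision time.
type host struct{ activePasses int }

type policy struct {
	healthURI          string
	healthPasses       int
	initiallyUnhealthy bool
}

// startsHealthy mirrors provision-time seeding against shared counters.
func startsHealthy(h *host, p policy) bool {
	if h.activePasses >= p.healthPasses {
		return true // inherited streak, regardless of which URI produced it
	}
	return !p.initiallyUnhealthy
}

func main() {
	shared := &host{activePasses: 5} // streak built by A probing /readyz

	b := policy{healthURI: "/livez", healthPasses: 3, initiallyUnhealthy: true}
	fmt.Println(startsHealthy(shared, b)) // true, though /livez was never checked
}
```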

> My sense is that there shouldn’t be anything special about the first health check request after a reload?

I agree, there shouldn’t be. I think that just shows the deeper issue is the cross-handler sharing itself; provision-time evaluation makes that coupling visible earlier rather than preserving a genuinely per-handler model.

petersalas (Author) commented

> using different active-check criteria, for example handler A using health_uri /readyz with health_passes 1 and handler B using health_uri /livez with health_passes 3

Ah, yes: with different URIs the shared state is (already?) wrong, and this change certainly relies on the shared state more than it did previously.

If the active counters were per-host-URI instead of per-host would you still have concerns? It seems to me like the lack of URI-specificity might be the fundamental underlying issue, but I’m curious if I’m missing something else before trying to consider possible solutions.
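
For concreteness, one hypothetical shape for per-host-URI counters (all names invented here):

```go
package reverseproxy

import "sync"

// checkKey scopes observations to a (dial address, health URI) pair so
// handlers probing different endpoints never share a pass/fail streak.
type checkKey struct {
	dialAddr  string // e.g. "10.0.0.5:8080"
	healthURI string // e.g. "/readyz" vs "/livez"
}

type streak struct {
	passes int // consecutive passes for this (host, URI) pair
	fails  int // consecutive fails for this (host, URI) pair
}

type healthObservations struct {
	mu      sync.Mutex
	streaks map[checkKey]*streak
}

// record stores one check result under its own (host, URI) streak.
func (o *healthObservations) record(key checkKey, passed bool) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.streaks == nil {
		o.streaks = make(map[checkKey]*streak)
	}
	s := o.streaks[key]
	if s == nil {
		s = &streak{}
		o.streaks[key] = s
	}
	if passed {
		s.passes++
		s.fails = 0
	} else {
		s.fails++
		s.passes = 0
	}
}
```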

petersalas (Author) commented

> I think the existing behaviour could be desirable, for sampling via error rates. I think maybe this should be a configurable mode instead.

@francislavoie IMO the hard thing to reason about with the current behavior is that there's no time limit on the counters, so the propensity to change status can accumulate over an arbitrarily long period of time. A policy like 3 failures over 1 minute => unhealthy, 5 passes over 1 minute => healthy certainly seems like it could be useful, but without the time bound the behavior is hard to predict.

But if you think it’s still useful without the time limit (or you just want to preserve the existing behavior by default), I’m happy to preserve it!
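
For what it’s worth, a time-bounded counter along those lines is small to sketch (hypothetical, not part of this PR):

```go
package reverseproxy

import (
	"sync"
	"time"
)

// windowedCount supports a policy like "3 failures over 1 minute":
// results older than the window stop counting, so stale history can't
// accumulate indefinitely.
type windowedCount struct {
	mu     sync.Mutex
	window time.Duration
	events []time.Time
}

// add records one event and returns how many remain inside the window.
func (c *windowedCount) add(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, now)
	cutoff := now.Add(-c.window)
	i := 0
	for i < len(c.events) && c.events[i].Before(cutoff) {
		i++ // this event aged out of the window
	}
	c.events = c.events[i:]
	return len(c.events)
}
```

A “3 failures over 1 minute => unhealthy” policy would then be fails.add(time.Now()) >= 3 with window set to time.Minute.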
