
reverseproxy: fix active health check counter tracking, add initially_unhealthy option #7625

Open
petersalas wants to merge 2 commits into caddyserver:master from fixie-ai:psalas/health-initial-behavior

Conversation


petersalas commented Apr 3, 2026

This changes a few behaviors in active health checks:

  • countHealthPass/countHealthFail now reset the opposite counter, ensuring only consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold (see the sketch after this list).
  • Initial health state on provision uses the pass/fail counters (which survive reloads via the global hosts pool), so the health statuses of a backend pool are preserved across reloads. Currently, after a reload, all Upstreams are reinitialized as healthy.
  • Context cancellation (e.g. during shutdown/reload) is no longer counted as a health check failure.
  • Health stats are now tracked as consecutive passes/fails per health-check path, rather than mixing paths on the same Host.
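
For illustration, a minimal sketch of the consecutive-counting change (field and method names mirror the existing Host counters, but the actual code in this PR may differ):

```go
package reverseproxy

import "sync/atomic"

// Simplified stand-in for the shared Host in the global hosts pool.
type Host struct {
	activePasses int64 // consecutive passing active health checks
	activeFails  int64 // consecutive failing active health checks
}

// countHealthPass records a pass and breaks any fail streak, so only
// an unbroken run of results can accumulate toward a threshold.
func (h *Host) countHealthPass() {
	atomic.AddInt64(&h.activePasses, 1)
	atomic.StoreInt64(&h.activeFails, 0)
}

// countHealthFail records a fail and breaks any pass streak.
func (h *Host) countHealthFail() {
	atomic.AddInt64(&h.activeFails, 1)
	atomic.StoreInt64(&h.activePasses, 0)
}
```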

Also add the health_initially_unhealthy option, which marks backends as unhealthy until they pass active health checks -- see #6410, option 4. Despite the discussion in that issue, the current behavior is that backends are assumed to be healthy. This can be demonstrated with a server that hangs for 10 seconds before failing: until the first health check fails, Caddy will keep forwarding requests to it.
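
A rough sketch of the provision-time seeding, using hypothetical names (not the actual Caddy identifiers):

```go
package reverseproxy

import "sync/atomic"

// upstreamHealth is a hypothetical stand-in for per-host state; the
// counters live on the shared Host and survive config reloads.
type upstreamHealth struct {
	activePasses int64       // consecutive passing active checks
	activeFails  int64       // consecutive failing active checks
	healthy      atomic.Bool // whether this backend receives traffic
}

// seedInitialHealth decides a backend's starting state on provision
// from the observations that survived the reload.
func seedInitialHealth(h *upstreamHealth, passes, fails int64, initiallyUnhealthy bool) {
	switch {
	case atomic.LoadInt64(&h.activeFails) >= fails:
		h.healthy.Store(false) // proven unhealthy: stay out of rotation
	case atomic.LoadInt64(&h.activePasses) >= passes:
		h.healthy.Store(true) // proven healthy: keep serving traffic
	case initiallyUnhealthy:
		h.healthy.Store(false) // no history yet: wait for a passing check
	default:
		h.healthy.Store(true) // current behavior: assume healthy
	}
}
```

With the option off, the default branch preserves today's assume-healthy behavior; with it on, a backend with no check history stays out of rotation until its first passing check.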

Taken together, these changes allow possibly-unhealthy backends to be added to a reverse proxy pool dynamically and safely, without sending traffic to new or previously-unhealthy backends before their next health check.

reverseproxy: fix active health check counter tracking, add initially_unhealthy option

Fix three bugs in active health checks:
- countHealthPass/countHealthFail now reset the opposite counter, ensuring
  only truly consecutive results accumulate toward thresholds. Previously,
  interleaved pass/fail results could incorrectly trip the threshold.
- Context cancellation (e.g. during shutdown/reload) is no longer counted
  as a health check failure.
- Initial health state on provision uses pass/fail counters (which survive
  reloads via the global hosts pool), so proven-healthy hosts stay healthy
  across config reloads.

Also add the health_initially_unhealthy option, which marks backends as
unhealthy until they pass active health checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAassistant commented Apr 3, 2026

CLA assistant check
All committers have signed the CLA.

francislavoie (Member) commented

Sounds like this relates to #7517?

petersalas (Author) commented

> Sounds like this relates to #7517?

From looking at that PR, maybe tangentially?

Somewhat embarrassingly, the changes in this PR are about a year old and I’m only now taking a moment to see if there’s interest in upstreaming them, so it’s possible I’m missing the forest for the trees if there’s a broader rework happening 😅

In the PR description, by a “dynamic” addition of a backend I meant some other process/module reconfiguring what is internally a static list of upstreams.

francislavoie (Member) commented

> countHealthPass/countHealthFail now reset the opposite counter, ensuring only consecutive results accumulate toward thresholds. Previously, interleaved pass/fail results could incorrectly trip the threshold.

I think the existing behaviour could be desirable, for sampling via error rates. I think maybe this should be a configurable mode instead.

The other changes seem pretty reasonable to me.

steadytao (Member) commented

One thing that still looks problematic to me is the new initial-state logic for active checks.

Active health state is documented as per-proxy-handler, but health_initially_unhealthy now seeds that state from activePasses / activeFails stored on the shared Host object in the global host pool. That means two reverse_proxy handlers targeting the same upstream can end up inheriting each other’s active-check history across provision/reload even if they have different health_passes / health_fails settings.

So independent of the consecutive-vs-cumulative discussion Francis raised, I think this part still needs adjustment before merge. The other pieces seem reasonable to me.

petersalas (Author) commented

> Active health state is documented as per-proxy-handler, but health_initially_unhealthy now seeds that state from activePasses / activeFails stored on the shared Host object in the global host pool. That means two reverse_proxy handlers targeting the same upstream can end up inheriting each other’s active-check history across provision/reload even if they have different health_passes / health_fails settings.

The intent is that the shared Host only stores observations about the host (consecutive passes/fails), which are independent of possibly different policies on handlers. On provision, the handler’s policy is evaluated against the existing observations, but my thought was that this was reasonable since nothing about the policy leaks back to the (possibly shared) Host.

Without the evaluation during provision, I think the observations are still shared; they just wait for the first health check after provision to be evaluated. For example, without this change, if 3 fails are required and there’s a history of failures, the first failure after provision will still mark the host unhealthy. My sense is that there shouldn’t be anything special about the first health check request after a reload?

steadytao (Member) commented

> The intent is that the shared Host only stores observations about the host (consecutive passes/fails), which are independent of possibly different policies on handlers.

I think that is exactly the problem. A concrete example would be two reverse_proxy handlers pointing at the same upstream but using different active-check criteria, for example handler A using health_uri /readyz with health_passes 1 and handler B using health_uri /livez with health_passes 3, both with health_initially_unhealthy. If A accumulates passes on the shared Host then after provision/reload B can inherit that history and start healthy immediately even though B’s own active-check criteria have never passed. So even if the shared Host is only storing “observations”, those observations are still being produced under one handler’s active-check regime and consumed by another handler’s policy which seems hard to square with the docs saying active health state is per-proxy-handler.
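
A self-contained toy model of that leak (all names hypothetical):

```go
package main

import "fmt"

// A pass streak accumulated under handler A's /readyz checks satisfies
// handler B's /livez policy at provision time.
type host struct{ activePasses int }

type policy struct {
	healthURI          string
	healthPasses       int
	initiallyUnhealthy bool
}

// startsHealthy mirrors provision-time seeding against shared counters.
func startsHealthy(h *host, p policy) bool {
	if h.activePasses >= p.healthPasses {
		return true // inherited streak, regardless of which URI produced it
	}
	return !p.initiallyUnhealthy
}

func main() {
	shared := &host{activePasses: 5} // streak built by A probing /readyz

	b := policy{healthURI: "/livez", healthPasses: 3, initiallyUnhealthy: true}
	fmt.Println(startsHealthy(shared, b)) // true, though /livez was never checked
}
```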

> My sense is that there shouldn’t be anything special about the first health check request after a reload?

I agree, there shouldn’t be. I think that just shows the deeper issue is the cross-handler sharing itself; provision-time evaluation makes that coupling visible earlier rather than preserving a genuinely per-handler model.

petersalas (Author) commented

> using different active-check criteria, for example handler A using health_uri /readyz with health_passes 1 and handler B using health_uri /livez with health_passes 3

Ah, yes: with different URIs the shared state is (already?) wrong, and this change certainly relies on the shared state more than it did previously.

If the active counters were per-host-URI instead of per-host would you still have concerns? It seems to me like the lack of URI-specificity might be the fundamental underlying issue, but I’m curious if I’m missing something else before trying to consider possible solutions.
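
For concreteness, one hypothetical shape for per-host-URI counters (all names invented here):

```go
package reverseproxy

import "sync"

// checkKey scopes observations to a (dial address, health URI) pair so
// handlers probing different endpoints never share a pass/fail streak.
type checkKey struct {
	dialAddr  string // e.g. "10.0.0.5:8080"
	healthURI string // e.g. "/readyz" vs "/livez"
}

type streak struct {
	passes int // consecutive passes for this (host, URI) pair
	fails  int // consecutive fails for this (host, URI) pair
}

type healthObservations struct {
	mu      sync.Mutex
	streaks map[checkKey]*streak
}

// record stores one check result under its own (host, URI) streak.
func (o *healthObservations) record(key checkKey, passed bool) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.streaks == nil {
		o.streaks = make(map[checkKey]*streak)
	}
	s := o.streaks[key]
	if s == nil {
		s = &streak{}
		o.streaks[key] = s
	}
	if passed {
		s.passes++
		s.fails = 0
	} else {
		s.fails++
		s.passes = 0
	}
}
```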

petersalas (Author) commented

> I think the existing behaviour could be desirable, for sampling via error rates. I think maybe this should be a configurable mode instead.

@francislavoie IMO the hard thing to reason about with the current behavior is that there's no time limit on the counters, so the propensity to change status can accumulate over an arbitrarily long period of time. A policy like 3 failures over 1 minute => unhealthy, 5 passes over 1 minute => healthy certainly seems like it could be useful, but without the time bound the behavior is hard to predict.

But if you think it’s still useful without the time limit (or you just want to preserve the existing behavior by default), I’m happy to preserve it!
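
For what it’s worth, a time-bounded counter along those lines is small to sketch (hypothetical, not part of this PR):

```go
package reverseproxy

import (
	"sync"
	"time"
)

// windowedCount supports a policy like "3 failures over 1 minute":
// results older than the window stop counting, so stale history can't
// accumulate indefinitely.
type windowedCount struct {
	mu     sync.Mutex
	window time.Duration
	events []time.Time
}

// add records one event and returns how many remain inside the window.
func (c *windowedCount) add(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.events = append(c.events, now)
	cutoff := now.Add(-c.window)
	i := 0
	for i < len(c.events) && c.events[i].Before(cutoff) {
		i++ // this event aged out of the window
	}
	c.events = c.events[i:]
	return len(c.events)
}
```

A “3 failures over 1 minute => unhealthy” policy would then be fails.add(time.Now()) >= 3 with window set to time.Minute.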
