feat: add health check with restart-on-failure for self-healing #434

Open
LaurenceJJones wants to merge 3 commits into crowdsecurity:main from LaurenceJJones:feature/healthcheck-self-heal
Contributor

@LaurenceJJones LaurenceJJones commented Jan 19, 2026

WIP

Add a periodic health check that detects missing firewall infrastructure (chains, ipsets) and triggers a process restart when any of it is missing.

Implementation:

  • CheckHealth() on the Backend interface verifies that the chains/ipsets exist (existence only, not their contents)
  • A health checker goroutine runs periodically (default 30s)
  • When a check fails, it returns ErrUnrecoverable so that the process exits

Why restart instead of in-memory self-heal:
The StreamBouncer from go-cs-bouncer has an internal 'startup' flag that is set to true only on first run, causing LAPI to send all decisions. After startup, it only sends deltas. This flag is not exposed or resettable.

Storing decisions in memory to replay after reinit was considered, but restarting the process is simpler and leverages the existing StreamBouncer behavior - on restart, startup=true triggers a full decision sync from LAPI. Systemd handles restart limiting.

We don't offer a container deployment, but users who run their own are covered too, as long as they set a restart policy on the container.
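
To make the reasoning above concrete, here is a toy model of the startup behaviour. It is not go-cs-bouncer code: `toyClient`, its `cursor` field, and the slice standing in for LAPI are all hypothetical, and the real library tracks this state differently. It only illustrates why a wiped ipset cannot be rebuilt from deltas, while a fresh process gets everything back.

```go
package main

import "fmt"

// toyClient is a hypothetical model of a streaming client with a
// one-shot startup flag. It is NOT the go-cs-bouncer implementation.
type toyClient struct {
	startup bool // true only until the first poll
	cursor  int  // how many server decisions have already been delivered
}

func newToyClient() *toyClient {
	return &toyClient{startup: true}
}

// Poll returns every decision on the first call and only the decisions
// added since the previous call afterwards.
func (c *toyClient) Poll(server []string) []string {
	from := c.cursor
	if c.startup {
		c.startup = false
		from = 0
	}
	c.cursor = len(server)
	return append([]string(nil), server[from:]...)
}

func main() {
	server := []string{"192.0.2.1", "192.0.2.2"} // decisions held by the toy LAPI
	client := newToyClient()

	ipset := client.Poll(server) // first poll: full sync
	fmt.Println(len(ipset))      // prints "2"

	// Something outside the process flushes the firewall state.
	ipset = nil
	server = append(server, "192.0.2.3")

	// The running client only receives the delta, so two bans stay lost.
	ipset = append(ipset, client.Poll(server)...)
	fmt.Println(len(ipset)) // prints "1"

	// A restarted process starts with startup=true and gets the full set.
	client = newToyClient()
	ipset = client.Poll(server)
	fmt.Println(len(ipset)) // prints "3"
}
```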

LaurenceJJones and others added 3 commits January 19, 2026 08:59
Add periodic health checking to detect missing firewall infrastructure
(chains, ipsets) and trigger a process restart when detected.

Implementation:
- CheckHealth() on Backend interface verifies chains/ipsets exist
- Health checker goroutine runs periodically (default 30s)
- On failure detection, returns ErrUnrecoverable to exit the process
- Prometheus metrics track health status and failure counts
- HTTP /health endpoint exposes health status as JSON

Why restart instead of in-memory self-heal:
The StreamBouncer from go-cs-bouncer has an internal 'startup' flag
that is set to true only on first run, causing LAPI to send all
decisions. After startup, it only sends deltas. This flag is not
exposed or resettable.

Storing decisions in memory to replay after reinit was considered,
but restarting the process is simpler and leverages the existing
StreamBouncer behavior - on restart, startup=true triggers a full
decision sync from LAPI. Systemd handles restart limiting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Since the process restarts on health check failure, metrics and the
/health endpoint provide no value - they reset/disappear on restart.

Keep only the core health check logic that detects missing firewall
infrastructure and triggers the restart.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused receivers in CheckHealth stubs (nftables, pf)
- Bump cyclomatic complexity limit 29->30 for health check addition
- Bump function-length limit 153->160 for health check addition
- Simplify deprecated daemon option check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>