Fix incident auto-resolution using stale resolved alerts from previous recoveries #6117

Draft
Copilot wants to merge 1 commit into develop from copilot/fix-incident-auto-resolution-bug

Copilot AI commented Apr 15, 2026

When a server goes down and recovers, check_resolved() picks up resolved alerts from that recovery indefinitely. If the server goes down again, the new incident immediately auto-resolves against the old resolved alerts.
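
The failure mode can be reproduced without Frappe. The following is a minimal in-memory sketch, not the project's code: `AlertLog`, `Incident`, and `last_resolved_unbounded` are hypothetical stand-ins for the Alertmanager Webhook Log records and the lookup inside check_resolved(), and the alert name and timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertLog:
    # Hypothetical stand-in for an "Alertmanager Webhook Log" record.
    status: str
    alert: str
    creation: datetime


@dataclass
class Incident:
    # Hypothetical stand-in for an incident document.
    alert: str
    creation: datetime


def last_resolved_unbounded(logs, incident):
    """Newest resolved log for the incident's alert, with no lower bound on creation."""
    matches = [
        log for log in logs
        if log.status == "Resolved" and log.alert == incident.alert
    ]
    return max(matches, key=lambda log: log.creation, default=None)


t0 = datetime(2026, 1, 1, 12, 0)
logs = [
    AlertLog("Firing", "Sites Down", t0),                            # first outage
    AlertLog("Resolved", "Sites Down", t0 + timedelta(minutes=10)),  # first recovery
    AlertLog("Firing", "Sites Down", t0 + timedelta(hours=2)),       # second outage
]

# Incident opened for the second outage.
second_incident = Incident("Sites Down", t0 + timedelta(hours=2))

stale = last_resolved_unbounded(logs, second_incident)
# The matched resolved alert predates the incident it would resolve.
print(stale.creation < second_incident.creation)
```

Because the lookup has no lower bound on creation, the resolved alert from the first recovery stays eligible for every later incident on the same server.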

Changes

  • Incident.check_resolved(): Filter resolved alerts to only those created after the current incident (creation > self.creation), and pass the incident creation time into is_enough_firing() so past_alert_instances() also excludes stale alerts
  • AlertmanagerWebhookLog.past_alert_instances(): Accept an optional since parameter that adds a creation > since lower bound to the query
  • AlertmanagerWebhookLog.is_enough_firing: Changed from @property to method to thread the since parameter through
```python
# Before: finds any resolved alert for this server, including from previous recoveries
last_resolved = frappe.get_last_doc("Alertmanager Webhook Log", {
    "status": "Resolved",
    "group_key": ("like", f"%{self.incident_scope}%"),
    "alert": self.alert,
})
if not last_resolved.is_enough_firing:
    ...  # body omitted in this excerpt

# After: only resolved alerts from after this incident was created
last_resolved = frappe.get_last_doc("Alertmanager Webhook Log", {
    "status": "Resolved",
    "group_key": ("like", f"%{self.incident_scope}%"),
    "alert": self.alert,
    "creation": (">", self.creation),
})
if not last_resolved.is_enough_firing(since=self.creation):
    ...  # body omitted in this excerpt
```
  • Added regression test test_subsequent_incident_not_resolved_by_previous_resolved_alerts covering the full cycle: incident → resolve → new incident → verify it stays open → new resolved alert → verify it resolves
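
The cycle that the regression test covers can be walked through with a simplified model. This is only a sketch under stated assumptions: `AlertLog` and `Incident` here are hypothetical in-memory classes, not the Frappe doctypes, and the is_enough_firing() check is left out so that only the creation-time filter is exercised.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertLog:
    # Hypothetical stand-in for a webhook log record.
    status: str  # "Firing" or "Resolved"
    creation: datetime


@dataclass
class Incident:
    # Hypothetical stand-in for an incident document.
    creation: datetime
    status: str = "Open"

    def check_resolved(self, logs):
        """Resolve only on a resolved alert created after this incident."""
        fresh = [
            log for log in logs
            if log.status == "Resolved" and log.creation > self.creation
        ]
        if fresh:
            self.status = "Resolved"


t0 = datetime(2026, 1, 1, 12, 0)
logs = []

# 1. First outage: the incident opens, then a resolved alert arrives.
first = Incident(creation=t0)
logs.append(AlertLog("Resolved", t0 + timedelta(minutes=10)))
first.check_resolved(logs)

# 2. Second outage: the only resolved alert on record is the stale one.
second = Incident(creation=t0 + timedelta(hours=2))
second.check_resolved(logs)
status_before_recovery = second.status

# 3. A new resolved alert arrives after the second incident was created.
logs.append(AlertLog("Resolved", t0 + timedelta(hours=2, minutes=15)))
second.check_resolved(logs)

print(first.status, status_before_recovery, second.status)
```

The second incident stays open while only the stale alert exists and resolves once a resolved alert newer than the incident is logged.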

…time

When a server goes down and recovers, subsequent incidents for that
server would incorrectly auto-resolve because check_resolved() picked up
old resolved alerts from the previous recovery. The fix:

1. Filter resolved alerts in check_resolved() to only consider alerts
   created after the current incident.
2. Add 'since' parameter to past_alert_instances() and is_enough_firing()
   to exclude old alerts when checking if enough sites are still firing.
3. Add regression test for the specific scenario.
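
Step 2 can be sketched as follows. This is an illustration under assumptions, not the actual implementation: `WebhookLog` is a hypothetical class, its `instances` list replaces the database query, and the `threshold` rule for "enough firing" is invented here because the real rule is not shown in this pull request.

```python
from datetime import datetime, timedelta


class WebhookLog:
    """Hypothetical stand-in; `instances` replaces the database query."""

    def __init__(self, instances, threshold=1):
        self.instances = instances  # list of (creation, status) tuples
        self.threshold = threshold  # invented cutoff, for illustration only

    def past_alert_instances(self, status, since=None):
        # With since=None the result matches an unbounded query.
        return [
            (created, s) for created, s in self.instances
            if s == status and (since is None or created > since)
        ]

    def is_enough_firing(self, since=None):
        # A method rather than a @property, so the caller can pass `since`.
        firing = self.past_alert_instances("Firing", since=since)
        return len(firing) >= self.threshold


t0 = datetime(2026, 1, 1, 12, 0)
log = WebhookLog([
    (t0, "Firing"),                                    # from the previous outage
    (t0 + timedelta(hours=2, minutes=15), "Resolved"),
])
incident_created = t0 + timedelta(hours=2)

unbounded = log.is_enough_firing()
bounded = log.is_enough_firing(since=incident_created)
print(unbounded, bounded)
```

One thing to watch when converting a property to a method: any call site that still reads `is_enough_firing` without parentheses gets a bound method, which is always truthy, so `if not log.is_enough_firing:` would silently never run its body.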

Agent-Logs-Url: https://github.com/frappe/press/sessions/7ffc7add-4767-489b-8824-9affdfb4e903

Co-authored-by: balamurali27 <25403045+balamurali27@users.noreply.github.com>