Fix incident auto-resolution using stale resolved alerts from previous recoveries by Copilot · Pull Request #6117 · frappe/press

Copilot · 2026-04-15T08:07:41Z

When a server goes down and recovers, check_resolved() picks up resolved alerts from that recovery indefinitely. If the server goes down again, the new incident immediately auto-resolves against the old resolved alerts.

Changes

Incident.check_resolved(): Filter resolved alerts to only those created after the current incident (creation > self.creation), and pass the incident creation time into is_enough_firing() so past_alert_instances() also excludes stale alerts
AlertmanagerWebhookLog.past_alert_instances(): Accepts an optional since parameter that adds a creation > since lower bound to the query
AlertmanagerWebhookLog.is_enough_firing: Changed from @property to method to thread the since parameter through

# Before: finds any resolved alert for this server, including from previous recoveries
last_resolved = frappe.get_last_doc("Alertmanager Webhook Log", {
    "status": "Resolved",
    "group_key": ("like", f"%{self.incident_scope}%"),
    "alert": self.alert,
})
if not last_resolved.is_enough_firing:
    ...  # rest of check_resolved() omitted in this excerpt

# After: only resolved alerts from after this incident was created
last_resolved = frappe.get_last_doc("Alertmanager Webhook Log", {
    "status": "Resolved",
    "group_key": ("like", f"%{self.incident_scope}%"),
    "alert": self.alert,
    "creation": (">", self.creation),
})
if not last_resolved.is_enough_firing(since=self.creation):
    ...  # rest of check_resolved() omitted in this excerpt
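
To illustrate how an optional since lower bound might be threaded from is_enough_firing() into past_alert_instances(), here is a minimal in-memory sketch. It does not use Frappe and is not the code from this PR: the AlertLog class, the free-function signatures, the total_instances argument, and the 50% threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class AlertLog:
    """Hypothetical stand-in for one Alertmanager Webhook Log row."""
    status: str         # "Firing" or "Resolved"
    instance: str       # site or instance the alert refers to
    creation: datetime  # when the log row was created


def past_alert_instances(
    logs: Iterable[AlertLog], status: str, since: Optional[datetime] = None
) -> Set[str]:
    """Instances that have a log with `status`, optionally only logs created after `since`."""
    return {
        log.instance
        for log in logs
        if log.status == status and (since is None or log.creation > since)
    }


def is_enough_firing(
    logs: Iterable[AlertLog],
    total_instances: int,
    since: Optional[datetime] = None,
    threshold: float = 0.5,  # assumed ratio, not taken from the PR
) -> bool:
    """True when more than `threshold` of all instances are firing."""
    firing = past_alert_instances(logs, "Firing", since=since)
    return len(firing) > threshold * total_instances


# Two alerts from before the incident was created, one from after.
incident_creation = datetime(2026, 1, 1, 12, 0)
logs = [
    AlertLog("Firing", "site-a", incident_creation - timedelta(hours=2)),
    AlertLog("Firing", "site-b", incident_creation - timedelta(hours=2)),
    AlertLog("Firing", "site-a", incident_creation + timedelta(minutes=1)),
]

unbounded = is_enough_firing(logs, total_instances=2)
bounded = is_enough_firing(logs, total_instances=2, since=incident_creation)
```

Because since has to be supplied by the caller, the check can no longer be a zero-argument property, which matches the switch from @property to a method described above.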

Added regression test test_subsequent_incident_not_resolved_by_previous_resolved_alerts covering the full cycle: incident → resolve → new incident → verify it stays open → new resolved alert → verify it resolves
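
The cycle that the regression test covers can be sketched without Frappe as a plain timestamp lookup. This is an illustrative model only, not the test added in this PR: last_resolved_alert() and the timestamps below are hypothetical, and the real lookup goes through frappe.get_last_doc with the filters shown earlier.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def last_resolved_alert(
    resolved_times: List[datetime], incident_creation: Optional[datetime] = None
) -> Optional[datetime]:
    """Newest resolved-alert timestamp, optionally only those after the incident was created."""
    candidates = [
        t for t in resolved_times
        if incident_creation is None or t > incident_creation
    ]
    return max(candidates, default=None)


t0 = datetime(2026, 1, 1, 12, 0)
first_resolved = t0 + timedelta(minutes=10)   # first incident recovers
second_incident = t0 + timedelta(hours=1)     # server goes down again
resolved_times = [first_resolved]

# Old behaviour: the lookup is unbounded, so the stale alert matches.
stale_match = last_resolved_alert(resolved_times)

# New behaviour: nothing was resolved after the second incident, so it stays open.
fresh_match = last_resolved_alert(resolved_times, second_incident)

# A new resolved alert arrives; now the second incident has something to resolve against.
second_resolved = second_incident + timedelta(minutes=5)
resolved_times.append(second_resolved)
later_match = last_resolved_alert(resolved_times, second_incident)
```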

…time

When a server goes down and recovers, subsequent incidents for that server would incorrectly auto-resolve because check_resolved() picked up old resolved alerts from the previous recovery.

The fix:
1. Filter resolved alerts in check_resolved() to only consider alerts created after the current incident.
2. Add 'since' parameter to past_alert_instances() and is_enough_firing() to exclude old alerts when checking if enough sites are still firing.
3. Add regression test for the specific scenario.

Agent-Logs-Url: https://github.com/frappe/press/sessions/7ffc7add-4767-489b-8824-9affdfb4e903
Co-authored-by: balamurali27 <25403045+balamurali27@users.noreply.github.com>

Copilot AI assigned Copilot and balamurali27 Apr 15, 2026

Copilot created this pull request from a session on behalf of balamurali27 April 15, 2026 08:07

Copilot wants to merge 1 commit into develop from copilot/fix-incident-auto-resolution-bug