Fix incident auto-resolution using stale resolved alerts from previous recoveries #6117
Draft
…time

When a server goes down and recovers, subsequent incidents for that server would incorrectly auto-resolve because check_resolved() picked up old resolved alerts from the previous recovery.

The fix:
1. Filter resolved alerts in check_resolved() to only consider alerts created after the current incident.
2. Add 'since' parameter to past_alert_instances() and is_enough_firing() to exclude old alerts when checking if enough sites are still firing.
3. Add regression test for the specific scenario.

Agent-Logs-Url: https://github.com/frappe/press/sessions/7ffc7add-4767-489b-8824-9affdfb4e903
Co-authored-by: balamurali27 <25403045+balamurali27@users.noreply.github.com>
Copilot created this pull request from a session on behalf of balamurali27 (April 15, 2026 08:07)
When a server goes down and recovers, `check_resolved()` picks up resolved alerts from that recovery indefinitely. If the server goes down again, the new incident immediately auto-resolves against the old resolved alerts.

Changes

- `Incident.check_resolved()`: Filter resolved alerts to only those created after the current incident (`creation > self.creation`), and pass the incident creation time into `is_enough_firing()` so `past_alert_instances()` also excludes stale alerts
- `AlertmanagerWebhookLog.past_alert_instances()`: Accept optional `since` param to add a `creation >` lower bound on the query
- `AlertmanagerWebhookLog.is_enough_firing`: Changed from `@property` to method to thread the `since` parameter through
- `test_subsequent_incident_not_resolved_by_previous_resolved_alerts`: Regression test covering the full cycle: incident → resolve → new incident → verify it stays open → new resolved alert → verify it resolves
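The effect of the `since` lower bound can be illustrated with a minimal, framework-free sketch. This is not the code in this PR: `AlertLog` and the module-level functions below are hypothetical stand-ins that use in-memory lists instead of database queries, and the "enough sites firing" check is left out entirely. Only the `creation > since` comparison mirrors the change described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertLog:
    """Hypothetical stand-in for one Alertmanager webhook log row."""

    status: str  # "firing" or "resolved"
    creation: datetime


def past_alert_instances(
    logs: List[AlertLog], status: str, since: Optional[datetime] = None
) -> List[AlertLog]:
    """Return logs with the given status; if `since` is set, keep only
    those created strictly after it (the `creation >` lower bound)."""
    return [
        log
        for log in logs
        if log.status == status and (since is None or log.creation > since)
    ]


def check_resolved(logs: List[AlertLog], incident_creation: datetime) -> bool:
    """Assumed simplification: the incident counts as resolved only if a
    resolved alert was logged after the incident itself was created."""
    return bool(past_alert_instances(logs, "resolved", since=incident_creation))


# First outage and recovery.
t0 = datetime(2026, 4, 15, 8, 0)
logs = [
    AlertLog("firing", t0),
    AlertLog("resolved", t0 + timedelta(minutes=10)),
]

# The server goes down again an hour later and a second incident is created.
second_incident = t0 + timedelta(hours=1)
logs.append(AlertLog("firing", second_incident))

# Without a lower bound, the old resolved alert is still visible.
stale_match = bool(past_alert_instances(logs, "resolved"))

# With the lower bound, the second incident stays open...
resolved_before_recovery = check_resolved(logs, second_incident)

# ...until a new resolved alert arrives after the incident was created.
logs.append(AlertLog("resolved", second_incident + timedelta(minutes=5)))
resolved_after_recovery = check_resolved(logs, second_incident)
```

In this sketch the unfiltered lookup still finds the resolved alert from the first recovery, which is the behaviour that closed new incidents prematurely. The filtered lookup ignores it and only reports the incident as resolved once a newer resolved alert exists.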