For 4.3.1: Revert CQ shared store: Delete from index on remove or roll over (#13959) (backport #16142) #16195
Merged
michaelklishin merged 2 commits into v4.3.x on Apr 27, 2026
Restore three independent improvements from the reverted commit that are unrelated to the broken `current_file_removes` mechanism:

- Relax `index_update_fields` assertion: `true=` -> `_=` so a missing key does not crash the process
- Add `prioritise_cast/3` to `rabbit_msg_store_gc` so delete requests are processed before compaction requests, avoiding unnecessary compaction of files that are already pending deletion
- `compact_file/2` early-exit guard was already present after the revert

(cherry picked from commit 69fd9ff)
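The relaxed assertion in this commit message can be illustrated with a minimal sketch. It assumes the index update boils down to a call to `ets:update_element/3`, which returns `false` when the key is absent. The module name, the table layout and the function names below are hypothetical and are not taken from `rabbit_msg_store`.

```erlang
-module(index_update_sketch).
-export([strict_update/3, relaxed_update/3, demo/0]).

%% Strict form: ets:update_element/3 returns false for a missing key,
%% so the match on `true` raises a badmatch error in the calling process.
strict_update(Tab, Key, Updates) ->
    true = ets:update_element(Tab, Key, Updates),
    ok.

%% Relaxed form: the result is ignored, so a missing key is a no-op.
relaxed_update(Tab, Key, Updates) ->
    _ = ets:update_element(Tab, Key, Updates),
    ok.

demo() ->
    %% Hypothetical table of {MsgId, RefCount} tuples, not the real index record.
    Tab = ets:new(index_sketch, [set]),
    true = ets:insert(Tab, {msg_a, 1}),
    %% Both forms behave the same while the key exists.
    ok = strict_update(Tab, msg_a, [{2, 2}]),
    %% Only the relaxed form tolerates a key that has already been removed.
    ok = relaxed_update(Tab, msg_gone, [{2, 0}]),
    Crashed = try strict_update(Tab, msg_gone, [{2, 0}]) of
                  ok -> false
              catch
                  error:{badmatch, false} -> true
              end,
    [{msg_a, 2}] = ets:lookup(Tab, msg_a),
    true = ets:delete(Tab),
    Crashed.
```

Calling `index_update_sketch:demo()` exercises both forms against a present and an absent key and reports whether the strict form crashed on the absent one.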
4.3.1: Revert CQ shared store: Delete from index on remove or roll over (#13959) (backport #16142)
Summary
Reverts 0278980 ("CQ shared store: Delete from index on remove or roll over", PR #13959), which introduced a regression in the classic queue message store GC. Three independent improvements from that commit are retained.
Fixes #16141.
Problem
PR #13959 replaced the `scan_and_vacuum_message_file` call in `delete_file` with an eager index cleanup mechanism (`current_file_removes`). As a side effect, messages removed from non-current files now produce `not_found` index lookups during `scan_and_vacuum_message_file` instead of `previously_valid` ones. This was noted in the PR review by @gomoripeti but not addressed before merge.

Under high throughput with many queues, the byte-by-byte `scan_next_byte` scanning mode triggered by `not_found` entries causes GC compaction to fall far enough behind the publish rate that disk usage can grow without bound. The stall also causes consumer latency spikes and broker unresponsiveness on established TCP connections.

Reproduction: 100 classic queues at 500 msg/s (120 KB messages) with a slow-ack consumer queue in the same vhost (acks held 1-30 min, up to 1000 in flight). On an m7g.large with a 196 GB EBS volume, free disk space fell from 185.4 GB to ~169 GB in ~100 minutes. Consumer latency reached a median of 1.5s and a max of 568s.
Reproduction scripts: https://github.com/lukebakken/rmq-gc-lag
Changes
Two commits:
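1. Revert 0278980 ("CQ shared store: Delete from index on remove or roll over", PR #13959).
2. Restore three independent improvements that are unrelated to the broken `current_file_removes` mechanism:
   - Relax the `index_update_fields` assertion (`true=` to `_=`) so a missing key does not crash the process
   - Add `prioritise_cast/3` to `rabbit_msg_store_gc` so delete requests are processed before compaction requests, avoiding unnecessary compaction of files already pending deletion
   - The `compact_file/2` early-exit guard (file already deleted) was already present after the revert

Testing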
Ran the reproduction workload against this branch for 60 minutes (three consecutive 20-minute monitoring windows) at 500 msg/s with ~1000 unacked messages. Disk held stable in a 0.5 GB oscillation band (184.96-185.47 GB). Ready messages held at 0 throughout. No latency spikes.
For comparison, the same workload against unpatched `main` lost ~16 GB of disk in ~100 minutes with ready messages growing to 3500-4200.

This is an automatic backport of pull request #16142 done by [Mergify](https://mergify.com).
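The request ordering that `prioritise_cast/3` is meant to produce (listed under Changes) can be pictured with a small sketch. It assumes the GC process receives casts shaped like `{delete, File}` and `{compact, File}`, and that `gen_server2` handles higher priority values first with 0 as the default. The priority value 5, the module name and the `processing_order/1` helper are hypothetical. The helper only stands in for the `gen_server2` mailbox so the example runs without RabbitMQ.

```erlang
-module(gc_priority_sketch).
-export([prioritise_cast/3, processing_order/1]).

%% Same arity as the gen_server2 callback: message, queue length, state.
%% Hypothetical priorities: delete requests outrank everything else.
prioritise_cast({delete, _File}, _Len, _State) -> 5;
prioritise_cast(_Msg, _Len, _State) -> 0.

%% Stand-in for the mailbox: order pending casts by descending priority,
%% keeping arrival order within one priority level.
processing_order(Casts) ->
    Len = length(Casts),
    Indexed = lists:zip(lists:seq(1, Len), Casts),
    %% Negate the priority so that an ascending sort puts high priority first.
    Keyed = [{-prioritise_cast(Msg, Len, undefined), Pos, Msg}
             || {Pos, Msg} <- Indexed],
    [Msg || {_Prio, _Pos, Msg} <- lists:sort(Keyed)].
```

With pending casts `[{compact, 7}, {delete, 7}, {compact, 8}]`, the helper moves `{delete, 7}` to the front. The later `{compact, 7}` then targets a file that is already gone, which is the case the `compact_file/2` early-exit guard covers.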