Revert CQ shared store: Delete from index on remove or roll over (#13959)#16142
lhoguin merged 2 commits into rabbitmq:main
Conversation
@lukebakken Loïc will be back next week. We will have a day or two to decide before the final
…bbitmq#13959)" This reverts commit 0278980.
Restore three independent improvements from the reverted commit that are unrelated to the broken `current_file_removes` mechanism:

- Relax the `index_update_fields` assertion (`true =` -> `_ =`) so a missing key does not crash the process
- Add `prioritise_cast/3` to `rabbit_msg_store_gc` so delete requests are processed before compaction requests, avoiding unnecessary compaction of files that are already pending deletion
- The `compact_file/2` early-exit guard was already present after the revert
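As a rough illustration of the `prioritise_cast/3` change, here is a sketch of the callback shape. In RabbitMQ, `rabbit_msg_store_gc` is a `gen_server2`, and `gen_server2` invokes `prioritise_cast/3` to assign each queued cast a priority, dequeuing higher priorities first. The exact message terms below (`{delete, ...}`, `{compact, ...}`) are illustrative assumptions, not the literal terms used by the module:

```erlang
%% Sketch of a gen_server2 priority callback: returns a priority number for
%% each pending cast. Higher numbers are handled first, so a queued delete
%% request jumps ahead of any queued compaction request.
%% Message shapes here are illustrative, not the module's actual terms.
prioritise_cast({delete, _File}, _Len, _State) ->
    5;  %% process file deletions before anything else
prioritise_cast({compact, _File}, _Len, _State) ->
    0;  %% compaction waits until no deletes are queued
prioritise_cast(_Msg, _Len, _State) ->
    0.
```

The effect is that a file already scheduled for deletion is never pointlessly compacted first, which is what the commit message above describes.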
Well, this is something that shouldn't be rushed!
lhoguin
left a comment
Perhaps I went a bit too far on this. Deletion was too expensive so the PR fixed that, at the cost of making some compactions more expensive. A plain revert would mean we'd just go back to that state, but since we keep some changes I'm OK with merging this now and seeing how it fares overall in the wild.
In particular, perhaps the prioritisation of deletes is good enough to avoid the bigger issues of deletes.
At some point we should rework the file format to avoid these issues entirely. We need the file itself to contain information about messages existing, or holes existing, and what size they are. The in-memory index should only ever refer to existing messages and act more like a cache over what is in the files, to be used by readers. |
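To make the idea above concrete, here is one hypothetical shape for such a self-describing file format. Every slot in the file starts with a small header recording whether it holds a message or a hole, and the slot's size, so a scanner can skip regions without consulting the in-memory index. The field names, tag values, and widths are invented for illustration; they are not a design from the RabbitMQ team:

```erlang
%% Hypothetical self-describing entry layout: a 1-byte tag (message or hole)
%% followed by a 56-bit size, so each entry says what it is and how big it is.
-define(ENTRY_MSG,  0).
-define(ENTRY_HOLE, 1).

%% Encode a hole as a tagged header plus Size zero bytes of padding,
%% and a message as a tagged header plus its body.
encode_entry(hole, Size) ->
    <<?ENTRY_HOLE:8, Size:56/unsigned, 0:(Size*8)>>;
encode_entry({msg, Body}, _Size) ->
    Size = byte_size(Body),
    <<?ENTRY_MSG:8, Size:56/unsigned, Body/binary>>.

%% Decode the next entry; a reader can skip a hole in O(1) using its size.
decode_entry(<<?ENTRY_HOLE:8, Size:56/unsigned, _Pad:Size/binary, Rest/binary>>) ->
    {{hole, Size}, Rest};
decode_entry(<<?ENTRY_MSG:8, Size:56/unsigned, Body:Size/binary, Rest/binary>>) ->
    {{msg, Body}, Rest}.
```

With headers like these, the index can shrink to a read-side cache over live messages, as the comment suggests, because compaction no longer needs index lookups to tell data from garbage.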
@lukebakken so that you don't have to guess: the earliest this can ship now is
@Mergifyio backport v4.3.x (the PR will be marked as delayed for
✅ Backports have been created
Thanks @lhoguin
Summary
Reverts 0278980 ("CQ shared store: Delete from index on remove or roll over", PR #13959), which introduced a regression in the classic queue message store GC. Three independent improvements from that commit are retained.
Fixes #16141.
Problem
PR #13959 replaced the `scan_and_vacuum_message_file` call in `delete_file` with an eager index cleanup mechanism (`current_file_removes`). As a side effect, messages removed from non-current files now produce `not_found` index lookups during `scan_and_vacuum_message_file` instead of `previously_valid` ones. This was noted in the PR review by @gomoripeti but not addressed before merge.

Under high throughput with many queues, the byte-by-byte `scan_next_byte` scanning mode triggered by `not_found` entries causes GC compaction to fall far enough behind the publish rate that disk usage can grow without bound. The stall also causes consumer latency spikes and broker unresponsiveness on established TCP connections.

Reproduction: 100 classic queues at 500 msg/s (120 KB messages) with a slow-ack consumer queue in the same vhost (acks held 1-30 min, up to 1000 in flight). On an m7g.large with a 196 GB EBS volume, free disk fell from 185.4 GB to ~169 GB in ~100 minutes. Consumer latency reached a median of 1.5 s and a maximum of 568 s.
Reproduction scripts: https://github.com/lukebakken/rmq-gc-lag
Changes
Two commits:

1. Revert commit 0278980 ("CQ shared store: Delete from index on remove or roll over", #13959).
2. Restore three independent improvements from the reverted commit that are unrelated to the broken `current_file_removes` mechanism:
   - Relax the `index_update_fields` assertion (`true =` to `_ =`) so a missing key does not crash the process
   - Add `prioritise_cast/3` to `rabbit_msg_store_gc` so delete requests are processed before compaction requests, avoiding unnecessary compaction of files already pending deletion
   - The `compact_file/2` early-exit guard (file already deleted) was already present after the revert

Testing
Ran the reproduction workload against this branch for 60 minutes (three consecutive 20-minute monitoring windows) at 500 msg/s with ~1000 unacked messages. Disk held stable in a 0.5 GB oscillation band (184.96-185.47 GB). Ready messages held at 0 throughout. No latency spikes.
For comparison, the same workload against unpatched `main` lost ~16 GB of disk in ~100 minutes, with ready messages growing to 3500-4200.