ct/l0/gc: collect stranded L0 objects after all cloud topics deleted #30200
030cb4e to ed04e68
Pull request overview
Fixes an L0 GC edge case where deleting all cloud topics could leave previously-created L0 objects stranded in object storage by ensuring the GC watermark still advances when the partition snapshot is empty.
Changes:
- Update max_gc_eligible_epoch to fall through to snap_revision when there are no cloud topic partitions.
- Adjust unit tests to validate the new empty-snapshot watermark behavior.
- Add a ducktape regression test that deletes the last cloud topic and verifies L0 objects drain after GC resumes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/rptest/tests/cloud_topics/l0_gc_test.py | Adds an integration regression test for GC progress after all cloud topics are deleted. |
| src/v/cloud_topics/level_zero/gc/tests/level_zero_gc_tests.cc | Updates unit tests to expect snap_revision as the watermark for empty snapshots. |
| src/v/cloud_topics/level_zero/gc/level_zero_gc.cc | Changes empty-snapshot behavior to return snap_revision instead of nullopt, enabling stranded-object collection. |
Code Review
Overview
Fixes a garbage-collection hole in cloud-topics L0 GC: when every cloud topic is deleted, […]
Correctness
Tests
Code quality / style
Performance / security
Risk assessment
Low. Change is localized to one branch of one function; the watermark value used ( […]
Overall: LGTM modulo the test-tightening suggestions.
ed04e68 to d133c87
Pull request overview
Fixes a Cloud Topics L0 GC edge case where deleting all cloud topics could leave previously-created L0 objects stranded in object storage indefinitely by ensuring an empty partition snapshot still yields a GC watermark.
Changes:
- Update epoch_source::max_gc_eligible_epoch to fall through to snap_revision when the partition snapshot is empty (instead of returning std::nullopt).
- Adjust Level Zero GC unit tests to reflect the new empty-snapshot behavior.
- Add a ducktape regression test that pauses GC, accumulates L0 objects, deletes the only cloud topic, resumes GC, and asserts the bucket drains.
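The regression scenario in the last bullet can be modeled in a few lines of plain Python. This is a minimal sketch, not the ducktape test from the PR: `run_scenario`, the topic name, and the integer revisions are all hypothetical, and only the empty-snapshot branch of GC is modeled.

```python
def run_scenario(empty_snapshot_yields_watermark: bool) -> int:
    """Return how many L0 objects remain in the bucket after GC resumes.

    Hypothetical model: revisions and epochs are plain integers, and the
    bucket is a list of object epochs.
    """
    revision = 1                       # the only cloud topic is created here
    topics = {"ct-topic": revision}    # topic name -> initial revision
    bucket = []                        # epochs of L0 objects in object storage

    # Steps 1-2: GC is paused while produce accumulates L0 objects.
    for _ in range(5):
        bucket.append(topics["ct-topic"])  # epoch >= topic's initial revision

    # Step 3: delete the only cloud topic; the controller revision advances.
    del topics["ct-topic"]
    revision += 1

    # Step 4: resume GC against an empty partition snapshot.
    # Old behavior: no watermark. New behavior: snap_revision is the watermark.
    watermark = revision if empty_snapshot_yields_watermark else None
    if watermark is not None:
        bucket = [epoch for epoch in bucket if epoch > watermark]

    # Step 5: the test asserts the bucket drains.
    return len(bucket)
```

With `empty_snapshot_yields_watermark=False` the objects stay in the bucket, which corresponds to the timeout the PR description reports without the fix.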
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/v/cloud_topics/level_zero/gc/level_zero_gc.cc | Uses snap_revision as the watermark when there are no cloud topic partitions, enabling collection of stranded L0 objects. |
| src/v/cloud_topics/level_zero/gc/tests/level_zero_gc_tests.cc | Updates unit test expectations for empty-snapshot cases to assert the watermark equals snap_revision. |
| tests/rptest/tests/cloud_topics/l0_gc_test.py | Adds an integration regression test covering the “all topics deleted but L0 objects remain” scenario. |
        "epoch {}",
        result);
    if (probe_) {
        probe_->set_min_partition_gc_epoch(result);
I don't think this is inherently contradictory, but it's an interesting point.
Review
Nice, targeted fix. The bug analysis is clear and the fix is minimal: when […]
Correctness
The safety argument holds: […]
Tests
Minor suggestions (non-blocking)
Backport
Marked for v26.1.x. Agreed: this is a user-visible correctness bug where deleted topics leave bytes stranded in cloud storage indefinitely. Worth backporting.
Overall: LGTM.
CI test results
- test results on build#83276
- test results on build#83599
max_gc_eligible_epoch returned std::nullopt when the partition snapshot was empty, causing try_to_collect to report no_collectible_epoch after listing. L0 objects left over from previously deleted topics stayed in the bucket forever.
Fall through with snap_revision (controller last applied offset) as the watermark: this is the zero-iteration case of the existing join, which already starts at snap_revision and walks it down per partition.
Safety rests on invariants from the existing algorithm:
- get_partitions ensures that the topic table snapshot is in sync with the controller stm (i.e. snap revision)
- L0 objects never have an epoch less than any constituent topic's initial revision ID.
So with an empty topic table at snapshot revision N, any _new_ data must have an epoch > N. Therefore any L0 object w/ epoch <= N is safe to GC.
Includes a ducktape regression test:
- pause GC
- accumulate some L0 objects
- delete the only cloud topic
- resume GC
- assert that orphaned objects drain.
Without the fix the test times out w/ stuck objects.
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
d133c87 to 0eee6a8
    # L0 objects exist to be collected.
    self.produce_some(topics=[topic.name])

    wait_until(
This should be a direct assert, right? If produce succeeded, then the objects must already be there.
yeah fair. get_objects_from_si should make a direct call to the object store. test isn't trying to make any claims about the produce path though, so I'm inclined to use wait_until rather than assume this synchronous chain between produce -> upload -> s3 <- LIST, even if it's technically expected.
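The trade-off in this thread (poll versus assert once) can be illustrated with a small stand-in. The sketch below is not ducktape's `wait_until` and does not call `get_objects_from_si`; the helper and the `EventuallyListedBucket` class are hypothetical, and the bucket simply pretends that LIST lags behind upload by a couple of polls.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Stand-in poller: retry `condition` until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class EventuallyListedBucket:
    """Hypothetical object store whose LIST result lags behind uploads."""

    def __init__(self, polls_until_visible):
        self._remaining = polls_until_visible

    def list_objects(self):
        if self._remaining > 0:
            self._remaining -= 1
            return []          # upload not yet visible to LIST
        return ["l0-object"]


bucket = EventuallyListedBucket(polls_until_visible=2)

# A direct assert on the first LIST would fail in this model;
# polling tolerates the lag without making claims about the produce path.
wait_until(
    lambda: len(bucket.list_objects()) > 0,
    timeout_sec=2,
    backoff_sec=0.01,
    err_msg="L0 objects never appeared",
)
```

If the produce-to-LIST chain really is synchronous, the poll returns on its first iteration, so the cost of the more defensive form is small.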
/backport v26.1.x
max_gc_eligible_epoch returned std::nullopt when the partition
snapshot was empty, causing try_to_collect to report
no_collectible_epoch after listing. L0 objects left over from
previously deleted topics stayed in the bucket forever.
Fall through with snap_revision (controller last applied offset)
as the watermark: this is the zero-iteration case of the existing
join, which already starts at snap_revision and walks it down per
partition.
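The "zero-iteration case" can be sketched as a fold that starts at snap_revision. This is a Python illustration under stated assumptions, not the C++ in level_zero_gc.cc: the function names are hypothetical, and each partition is assumed to contribute a single integer epoch that can only lower the watermark.

```python
from typing import Iterable, Optional


def max_gc_eligible_epoch_old(
    snap_revision: int, partition_epochs: Iterable[int]
) -> Optional[int]:
    """Pre-fix behavior as described: an empty snapshot yields no watermark."""
    epochs = list(partition_epochs)
    if not epochs:
        return None  # analogous to std::nullopt
    return min([snap_revision, *epochs])


def max_gc_eligible_epoch_new(
    snap_revision: int, partition_epochs: Iterable[int]
) -> int:
    """Post-fix behavior: start at snap_revision and walk it down per partition."""
    watermark = snap_revision
    for epoch in partition_epochs:
        watermark = min(watermark, epoch)
    # With zero partitions the loop never runs and snap_revision is returned.
    return watermark
```

For non-empty snapshots the two versions agree; they differ only when there are no partitions to iterate over.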
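Safety rests on invariants from the existing algorithm:
- get_partitions ensures that the topic table snapshot is in sync
  with the controller stm (i.e. snap revision)
- L0 objects never have an epoch less than any constituent topic's
  initial revision ID.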
So with an empty topic table at snapshot revision N, any new data
must have an epoch > N. Therefore any L0 object w/ epoch <= N is
safe to GC.
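The safety argument can be checked on a toy example. The sketch assumes integer epochs and a hypothetical `gc_eligible` helper; the concrete numbers are made up for illustration.

```python
def gc_eligible(object_epochs, watermark):
    """Return the epochs of objects at or below the GC watermark."""
    return [epoch for epoch in object_epochs if epoch <= watermark]


# Empty topic table observed at snapshot revision N.
N = 100

# Objects stranded by topics deleted before the snapshot (epoch <= N).
stranded = [12, 57, 100]

# A topic created after the snapshot has an initial revision > N,
# so every object it produces has an epoch > N.
new_topic_objects = [101, 101, 130]

collected = gc_eligible(stranded + new_topic_objects, watermark=N)
survivors = [e for e in stranded + new_topic_objects if e not in collected]
```

In this model everything stranded is collected and nothing belonging to the newer topic is touched, which is the property the two invariants are meant to guarantee.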
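Includes a ducktape regression test:
- pause GC
- accumulate some L0 objects
- delete the only cloud topic
- resume GC
- assert that orphaned objects drain.
Without the fix the test times out w/ stuck objects.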
Backports Required
Release Notes
Bug Fixes