Skip to content

db_stress: fix crashes during concurrent column family drop+recreate …#14601

Open
midom wants to merge 1 commit intofacebook:mainfrom
midom:main
Open

db_stress: fix crashes during concurrent column family drop+recreate …#14601
midom wants to merge 1 commit intofacebook:mainfrom
midom:main

Conversation

@midom
Copy link
Copy Markdown

@midom midom commented Apr 11, 2026

…(#14601)

Fix two bugs in db_stress that cause crashes when column families are dropped and recreated (via clear_column_family_one_in):

  1. Out-of-bounds access in GetOptions (introduced in 217e075, [Remote Compaction] Simulate e2e flow in Stress Test #13800): GetOptions(cf_id) used the RocksDB internal CF ID as a direct index into column_families_. After a CF drop+recreate, the new CF gets a monotonically increasing ID (e.g., 10 for an initially 10-CF setup) that exceeds column_families_.size(), causing undefined behavior. Fix: search column_families_ for the handle matching the CF ID, falling back to base options if the CF was dropped.

  2. Use-after-free on CF handles (present since 457c78e, 2014): MaybeClearOneColumnFamily had two issues:

    • It called delete on the old CF handle immediately while other threads could still hold pointers to it (use-after-free/segfault).
    • CreateColumnFamilyImpl zeroes its output handle (*handle = nullptr) before writing the new one. Since it was called with &column_families_[cf], this created a window where concurrent readers see nullptr (assertion failure). Fix: use a local variable for CreateColumnFamily output to avoid the nullptr window, and defer deletion of old handles to CleanUpColumnFamilies when all threads have stopped.

…acebook#14601)

Fix two bugs in db_stress that cause crashes when column families are
dropped and recreated (via clear_column_family_one_in):

1. Out-of-bounds access in GetOptions (introduced in 217e075, facebook#13800):
   GetOptions(cf_id) used the RocksDB internal CF ID as a direct index
   into column_families_. After a CF drop+recreate, the new CF gets a
   monotonically increasing ID (e.g., 10 for an initially 10-CF setup)
   that exceeds column_families_.size(), causing undefined behavior.
   Fix: search column_families_ for the handle matching the CF ID,
   falling back to base options if the CF was dropped.

2. Use-after-free on CF handles (present since 457c78e, 2014):
   MaybeClearOneColumnFamily had two issues:
   - It called delete on the old CF handle immediately while other
     threads could still hold pointers to it (use-after-free/segfault).
   - CreateColumnFamilyImpl zeroes its output handle (*handle = nullptr)
     before writing the new one. Since it was called with
     &column_families_[cf], this created a window where concurrent
     readers see nullptr (assertion failure).
   Fix: use a local variable for CreateColumnFamily output to avoid the
   nullptr window, and defer deletion of old handles to
   CleanUpColumnFamilies when all threads have stopped.
@meta-cla meta-cla Bot added the CLA Signed label Apr 11, 2026
@github-actions
Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 83.5s.

Copy link
Copy Markdown
Contributor

@xingbowang xingbowang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug found here is real. The stress test code quality is typically lower than engine code. Although this drop/recreate test was disabled years ago. My guess is low ROI.

@@ -796,9 +800,17 @@ Status StressTest::SetOptions(ThreadState* thread) {
}

Options StressTest::GetOptions(int cf_id) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are here, we should probably fix the type, uint32_t is the right type for cf_id

Comment on lines -799 to +810
auto cfh = column_families_[cf_id];
assert(cfh);
return db_->GetOptions(cfh);
// cf_id is the RocksDB internal column family ID, not the index into
// column_families_. After a CF drop+recreate, the new CF gets a higher ID
// that may exceed column_families_.size(). Search for the matching handle.
for (auto* cfh : column_families_) {
if (cfh && cfh->GetID() == static_cast<uint32_t>(cf_id)) {
return db_->GetOptions(cfh);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On cf id not found, we used to cause segfault, accessing nullptr. Now we just return default options_. Although with delete/recreate, this is a valid case we need to handle properly for remote compaction. I think it is better to surface CF not found error back to caller in remote compaction and have remote compaction just abort the compaction gracefully.

Comment on lines -799 to -801
auto cfh = column_families_[cf_id];
assert(cfh);
return db_->GetOptions(cfh);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code treat CF id as the index of column_families_. This is right, if CF is never deleted and recreated. Given we do have code that recreate CF (although this is never called, that's why the bug was never triggered), we should not hold this assumption anymore. I think the fix is correct. Meantime, once we re-enable the drop/recreate test, accessing it would be better guarded by a mutex for thread safety. Although since it is just a pointer of array that is never resized, most hardware would guarantee thread safety on raw pointer read/write, but tsan might warn about this.
Meantime, on re-enable drop/recreate, I don't know whether the stress test handles this gracefully. Essentially whether it would create concurrent write/read failure on a dropped CF properly. Also the re-constructed expected state might need to be updated as well. In short, to re-enable this functionality test, it will likely require a lot more work than this fix. Meantime, dropping a CF concurrently while another thread is read/write it is likely rare in production. Application typically do synchronization by themself on concurrent drop. So I don't know how much value would this provide. If we do have this kind of use case, then I agree we should spend effort on fixing this.

@xingbowang
Copy link
Copy Markdown
Contributor

If you want to spend more time on this, I would just address the type and return type and ship fix this. For the bigger issue, they can be incrementally addressed.

laurynas-biveinis added a commit to laurynas-biveinis/rocksdb that referenced this pull request Apr 14, 2026
…ook#14601)

Summary:
`RangeTreeLockManager::CompareDbtEndpoints()` called `Comparator::Compare()`
on range lock endpoint keys that never contain user-defined timestamps.
With a timestamp-aware comparator (`timestamp_size() > 0`), this caused
assertion failures in debug builds and silent endpoint misordering in
release builds.

Replace `Compare()` with the 4-argument `CompareWithoutTimestamp()` using
`a_has_ts=false, b_has_ts=false`, which is the correct contract for
serialized range lock endpoints (format: `[1-byte suffix][key bytes]`,
no timestamp).

Test Plan:
- New test `RangeLockWithTimestampComparator` reopens with
  `BytewiseComparatorWithU64Ts` and exercises range lock acquisition,
  conflict detection, and non-overlapping success with short keys.
- Verified RED (assertion failure) before fix, GREEN after.
- Full `range_locking_test` suite passes (16/16).
- Stress tested with `COERCE_CONTEXT_SWITCH=1 --gtest_repeat=5`.
meta-codesync Bot pushed a commit that referenced this pull request Apr 15, 2026
… (#14611)

Summary:
`RangeTreeLockManager::CompareDbtEndpoints()` called `Comparator::Compare()` on range lock endpoint keys that never contain user-defined timestamps. With a timestamp-aware comparator (`timestamp_size() > 0`), this caused assertion failures in debug builds and silent endpoint misordering in release builds.

Replace `Compare()` with the 4-argument `CompareWithoutTimestamp()` using `a_has_ts=false, b_has_ts=false`, which is the correct contract for serialized range lock endpoints (format: `[1-byte suffix][key bytes]`, no timestamp).

Pull Request resolved: #14611

Test Plan:
- New test `RangeLockWithTimestampComparator` reopens with `BytewiseComparatorWithU64Ts` and exercises range lock acquisition, conflict detection, and non-overlapping success with short keys.
- Verified RED (assertion failure) before fix, GREEN after.
- Full `range_locking_test` suite passes (16/16).
- Stress tested with `COERCE_CONTEXT_SWITCH=1 --gtest_repeat=5`.

Reviewed By: xingbowang

Differential Revision: D100775201

Pulled By: laurynas-biveinis

fbshipit-source-id: 58e3846e62e9b3cc5bdc69557458b245f90b3967
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants