db_stress: fix crashes during concurrent column family drop+recreate …#14601
db_stress: fix crashes during concurrent column family drop+recreate …#14601midom wants to merge 1 commit intofacebook:mainfrom
Conversation
…acebook#14601) Fix two bugs in db_stress that cause crashes when column families are dropped and recreated (via clear_column_family_one_in): 1. Out-of-bounds access in GetOptions (introduced in 217e075, facebook#13800): GetOptions(cf_id) used the RocksDB internal CF ID as a direct index into column_families_. After a CF drop+recreate, the new CF gets a monotonically increasing ID (e.g., 10 for an initially 10-CF setup) that exceeds column_families_.size(), causing undefined behavior. Fix: search column_families_ for the handle matching the CF ID, falling back to base options if the CF was dropped. 2. Use-after-free on CF handles (present since 457c78e, 2014): MaybeClearOneColumnFamily had two issues: - It called delete on the old CF handle immediately while other threads could still hold pointers to it (use-after-free/segfault). - CreateColumnFamilyImpl zeroes its output handle (*handle = nullptr) before writing the new one. Since it was called with &column_families_[cf], this created a window where concurrent readers see nullptr (assertion failure). Fix: use a local variable for CreateColumnFamily output to avoid the nullptr window, and defer deletion of old handles to CleanUpColumnFamilies when all threads have stopped.
✅ clang-tidy: No findings on changed linesCompleted in 83.5s. |
xingbowang
left a comment
There was a problem hiding this comment.
The bug found here is real. The stress test code quality is typically lower than engine code. Although this drop/recreate test was disabled years ago. My guess is low ROI.
| @@ -796,9 +800,17 @@ Status StressTest::SetOptions(ThreadState* thread) { | |||
| } | |||
|
|
|||
| Options StressTest::GetOptions(int cf_id) { | |||
There was a problem hiding this comment.
since you are here, we should probably fix the type, uint32_t is the right type for cf_id
| auto cfh = column_families_[cf_id]; | ||
| assert(cfh); | ||
| return db_->GetOptions(cfh); | ||
| // cf_id is the RocksDB internal column family ID, not the index into | ||
| // column_families_. After a CF drop+recreate, the new CF gets a higher ID | ||
| // that may exceed column_families_.size(). Search for the matching handle. | ||
| for (auto* cfh : column_families_) { | ||
| if (cfh && cfh->GetID() == static_cast<uint32_t>(cf_id)) { | ||
| return db_->GetOptions(cfh); | ||
| } | ||
| } |
There was a problem hiding this comment.
On cf id not found, we used to cause segfault, accessing nullptr. Now we just return default options_. Although with delete/recreate, this is a valid case we need to handle properly for remote compaction. I think it is better to surface CF not found error back to caller in remote compaction and have remote compaction just abort the compaction gracefully.
| auto cfh = column_families_[cf_id]; | ||
| assert(cfh); | ||
| return db_->GetOptions(cfh); |
There was a problem hiding this comment.
The old code treat CF id as the index of column_families_. This is right, if CF is never deleted and recreated. Given we do have code that recreate CF (although this is never called, that's why the bug was never triggered), we should not hold this assumption anymore. I think the fix is correct. Meantime, once we re-enable the drop/recreate test, accessing it would be better guarded by a mutex for thread safety. Although since it is just a pointer of array that is never resized, most hardware would guarantee thread safety on raw pointer read/write, but tsan might warn about this.
Meantime, on re-enable drop/recreate, I don't know whether the stress test handles this gracefully. Essentially whether it would create concurrent write/read failure on a dropped CF properly. Also the re-constructed expected state might need to be updated as well. In short, to re-enable this functionality test, it will likely require a lot more work than this fix. Meantime, dropping a CF concurrently while another thread is read/write it is likely rare in production. Application typically do synchronization by themself on concurrent drop. So I don't know how much value would this provide. If we do have this kind of use case, then I agree we should spend effort on fixing this.
|
If you want to spend more time on this, I would just address the type and return type and ship fix this. For the bigger issue, they can be incrementally addressed. |
…ook#14601) Summary: `RangeTreeLockManager::CompareDbtEndpoints()` called `Comparator::Compare()` on range lock endpoint keys that never contain user-defined timestamps. With a timestamp-aware comparator (`timestamp_size() > 0`), this caused assertion failures in debug builds and silent endpoint misordering in release builds. Replace `Compare()` with the 4-argument `CompareWithoutTimestamp()` using `a_has_ts=false, b_has_ts=false`, which is the correct contract for serialized range lock endpoints (format: `[1-byte suffix][key bytes]`, no timestamp). Test Plan: - New test `RangeLockWithTimestampComparator` reopens with `BytewiseComparatorWithU64Ts` and exercises range lock acquisition, conflict detection, and non-overlapping success with short keys. - Verified RED (assertion failure) before fix, GREEN after. - Full `range_locking_test` suite passes (16/16). - Stress tested with `COERCE_CONTEXT_SWITCH=1 --gtest_repeat=5`.
… (#14611) Summary: `RangeTreeLockManager::CompareDbtEndpoints()` called `Comparator::Compare()` on range lock endpoint keys that never contain user-defined timestamps. With a timestamp-aware comparator (`timestamp_size() > 0`), this caused assertion failures in debug builds and silent endpoint misordering in release builds. Replace `Compare()` with the 4-argument `CompareWithoutTimestamp()` using `a_has_ts=false, b_has_ts=false`, which is the correct contract for serialized range lock endpoints (format: `[1-byte suffix][key bytes]`, no timestamp). Pull Request resolved: #14611 Test Plan: - New test `RangeLockWithTimestampComparator` reopens with `BytewiseComparatorWithU64Ts` and exercises range lock acquisition, conflict detection, and non-overlapping success with short keys. - Verified RED (assertion failure) before fix, GREEN after. - Full `range_locking_test` suite passes (16/16). - Stress tested with `COERCE_CONTEXT_SWITCH=1 --gtest_repeat=5`. Reviewed By: xingbowang Differential Revision: D100775201 Pulled By: laurynas-biveinis fbshipit-source-id: 58e3846e62e9b3cc5bdc69557458b245f90b3967
…(#14601)
Fix two bugs in db_stress that cause crashes when column families are dropped and recreated (via clear_column_family_one_in):
Out-of-bounds access in GetOptions (introduced in 217e075, [Remote Compaction] Simulate e2e flow in Stress Test #13800): GetOptions(cf_id) used the RocksDB internal CF ID as a direct index into column_families_. After a CF drop+recreate, the new CF gets a monotonically increasing ID (e.g., 10 for an initially 10-CF setup) that exceeds column_families_.size(), causing undefined behavior. Fix: search column_families_ for the handle matching the CF ID, falling back to base options if the CF was dropped.
Use-after-free on CF handles (present since 457c78e, 2014): MaybeClearOneColumnFamily had two issues: