Skip to content

Add multi-DB stress testing support#14550

Closed
hx235 wants to merge 2 commits intofacebook:mainfrom
hx235:export-D96575551
Closed

Add multi-DB stress testing support#14550
hx235 wants to merge 2 commits intofacebook:mainfrom
hx235:export-D96575551

Conversation

@hx235
Copy link
Copy Markdown
Contributor

@hx235 hx235 commented Apr 2, 2026

Summary:
Add --num_dbs flag for running N independent StressTest instances sharing
one Env. Each DB gets its own paths, fault injection, and stress threads.

Motivation: validate shared-resource behavior (block cache,
WriteBufferManager, RateLimiter, compressed secondary cache) across
concurrent DB instances, and provide the multi-DB harness for future
shared components like the auto-tuner.

Note: multi-DB mode (--num_dbs > 1) is not yet enabled in CI. The
db_crashtest.py randomization is commented out with a TODO at line 481.
Enabling it in CI will be done in a follow-up change.

Per-DB path layout:
--db=/tmp/mydb → /tmp/mydb/db_0, /tmp/mydb/db_1, ...
--expected_values_dir=/tmp/ev → /tmp/ev/db_0, /tmp/ev/db_1, ...
--secondaries_base=/tmp/sec → /tmp/sec/db_0, /tmp/sec/db_1, ...

Per-DB isolation:

  • Per-DB FaultInjectionTestFS (independent fault injection per DB)
  • Per-DB crash callback (async-signal-safe C array of raw pointers)
  • Per-DB RunStressTestImpl() with own SharedState — all features
    (backup, checkpoint, secondary DB, continuous verification, remote
    compaction) work correctly in multi-DB mode via per-DB paths

Shared resources (across all DBs):

  • Env and thread pools (compaction, flush, bottom-pri)
  • Block cache
  • WriteBufferManager
  • RateLimiter

Guards (features disabled/rejected in multi-DB mode):

  • clear_column_family_one_in: pre-existing CF handle race condition
  • compressed_secondary_cache: per-DB CompressedCacheSetCapacityThread
    instances race on the shared global cache (SetCapacity(0)/SetCapacity(size))
  • test_multi_ops_txns: uses a single global key_spaces_path file that
    persists key range descriptors — multiple DBs would overwrite each
    other's ranges, causing layout corruption on reopen

Python driver (db_crashtest.py) changes:

  • cleanup_after_success() forwards num_dbs, expected_values_dir,
    secondaries_base, test_secondary, and continuous_verification_interval
    to --destroy_db_and_exit so all directories are properly cleaned
  • Lambda params (expected_values_dir, num_dbs, test_secondary)
    materialized early in blackbox/whitebox because cleanup_after_success
    reads them from cmd_params which stores raw lambdas from default_params

Bonus cleanup in --destroy_db_and_exit:

  • Cleans expected_values_dir via Env::Default() (always local, even
    when the DB itself lives on a remote env_uri/fs_uri)
  • Cleans secondaries_base including auto-generated default path when
    test_secondary or continuous_verification_interval is enabled
    These were pre-existing leaks in single-DB mode too, but fixed here
    because the Python forwarding above requires the C++ side to handle
    these flags.

Test Plan:
make db_stress
Result: PASSED

Single-DB backward compatibility (verification successful):
./db_stress --max_key=10000 --ops_per_thread=5000 --destroy_db_initially --db=/tmp/rocksdb_v_single
Result: PASSED

Multi-DB basic, 3 DBs (all verified):
./db_stress --num_dbs=3 --max_key=10000 --ops_per_thread=5000 --destroy_db_initially --db=/tmp/rocksdb_v_multi
Result: PASSED

Multi-DB + explicit expected_values_dir (all verified):
./db_stress --num_dbs=3 --max_key=10000 --ops_per_thread=5000 --destroy_db_initially --expected_values_dir=/tmp/test_ev --db=/tmp/rocksdb_v_evdir
Result: PASSED

[ongoing] CI stress test large-scale rehearsal

@meta-cla meta-cla Bot added the CLA Signed label Apr 2, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Apr 2, 2026

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96575551.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 2, 2026

⚠️ clang-tidy: 10 warning(s) on changed lines

Completed in 150.6s.

Summary by check

Check Count
concurrency-mt-unsafe 6
cppcoreguidelines-avoid-non-const-global-variables 4
Total 10

Details

db_stress_tool/db_stress_common.cc (2 warning(s))
db_stress_tool/db_stress_common.cc:25:56: warning: variable 'shared_wbm' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_common.cc:26:49: warning: variable 'shared_rate_limiter' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc (8 warning(s))
db_stress_tool/db_stress_tool.cc:47:5: warning: variable 'g_fault_fs_for_crash' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc:48:12: warning: variable 'g_num_dbs_for_crash' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc:101:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:465:5: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:470:5: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:495:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:548:29: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:567:11: warning: function is not thread safe [concurrency-mt-unsafe]

Summary:

AnonExpectedState::Open() never initialized the base class
persisted_seqno_ pointer, leaving it as nullptr. Any call to
SetPersistedSeqno() or GetPersistedSeqno() on an AnonExpectedState
(used when expected_values_dir is empty) would dereference a null
pointer. Fix by allocating and assigning it in Open(), matching
FileExpectedState's pattern.

Differential Revision: D98556119
@hx235 hx235 force-pushed the export-D96575551 branch from 8b524d4 to 1d1cb04 Compare April 2, 2026 02:37
@hx235 hx235 marked this pull request as draft April 2, 2026 02:56
@hx235 hx235 force-pushed the export-D96575551 branch from 1d1cb04 to 5f2f2e2 Compare April 2, 2026 03:35
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Apr 2, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D96575551.

1 similar comment
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Apr 2, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D96575551.

Summary:
Add --num_dbs flag for running N independent StressTest instances sharing
one Env. Each DB gets its own paths, fault injection, and stress threads.

Motivation: validate shared-resource behavior (block cache,
WriteBufferManager, RateLimiter, compressed secondary cache) across
concurrent DB instances, and provide the multi-DB harness for future
shared components like the auto-tuner.

Note: multi-DB mode (--num_dbs > 1) is not yet enabled in CI. The
db_crashtest.py randomization is commented out with a TODO at line 481.
Enabling it in CI will be done in a follow-up change.

Per-DB path layout:
  --db=/tmp/mydb                → /tmp/mydb/db_0, /tmp/mydb/db_1, ...
  --expected_values_dir=/tmp/ev → /tmp/ev/db_0, /tmp/ev/db_1, ...
  --secondaries_base=/tmp/sec   → /tmp/sec/db_0, /tmp/sec/db_1, ...
ComputeDbPaths() appends /db_<i> for num_dbs > 1; single-DB paths
are unchanged.

Per-DB isolation:
- Per-DB FaultInjectionTestFS (independent fault injection per DB)
- Per-DB crash callback (async-signal-safe C array of raw pointers)
- Per-DB RunStressTestImpl() with own SharedState — all features
  (backup, checkpoint, secondary DB, continuous verification, remote
  compaction) work correctly in multi-DB mode via per-DB paths

Shared resources (across all DBs):
- Env and thread pools (compaction, flush, bottom-pri)
- Block cache
- WriteBufferManager
- RateLimiter

Guards (features disabled/rejected in multi-DB mode):
- clear_column_family_one_in: pre-existing CF handle race condition
- compressed_secondary_cache: per-DB CompressedCacheSetCapacityThread
  instances race on the shared global cache (SetCapacity(0)/SetCapacity(size))
- test_multi_ops_txns: uses a single global key_spaces_path file that
  persists key range descriptors — multiple DBs would overwrite each
  other's ranges, causing layout corruption on reopen

Python driver (db_crashtest.py) changes:
- cleanup_after_success() forwards num_dbs, expected_values_dir,
  secondaries_base, test_secondary, and continuous_verification_interval
  to --destroy_db_and_exit so all directories are properly cleaned
- Lambda params (expected_values_dir, num_dbs, test_secondary)
  materialized early in blackbox/whitebox because cleanup_after_success
  reads them from cmd_params which stores raw lambdas from default_params

Bonus cleanup in --destroy_db_and_exit:
- Cleans expected_values_dir via Env::Default() (always local, even
  when the DB itself lives on a remote env_uri/fs_uri)
- Cleans secondaries_base including auto-generated default path when
  test_secondary or continuous_verification_interval is enabled
These were pre-existing leaks in single-DB mode too, but fixed here
because the Python forwarding above requires the C++ side to handle
these flags.

Differential Revision: D96575551
@hx235 hx235 force-pushed the export-D96575551 branch from 5f2f2e2 to 370bc2f Compare April 2, 2026 04:13
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Apr 2, 2026

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D96575551.

@hx235 hx235 closed this Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant