Improve db_stress debuggability and fault-injection logging #14620

Closed
xingbowang wants to merge 5 commits into facebook:main from xingbowang:2026_04_15_trace_debug_improvements

@xingbowang
Contributor

Summary

  • add db_stress operation tracing plus parser support to make stress failures easier to debug
  • extend db_crashtest tooling/tests around the tracing flow
  • switch fault injection FS logging to a raw format and add a parser to reduce CPU overhead during fault-heavy runs
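
The idea behind the last bullet, writing fixed-size raw records on the hot path and deferring text formatting to an offline parser, can be sketched roughly as follows. This is only an illustration: the record layout and the names `RECORD`, `OP_NAMES`, `encode`, and `decode_all` are hypothetical and are not taken from this PR.

```python
import struct

# Hypothetical fixed-size record (little-endian, no padding):
# u64 timestamp_us, u32 thread_id, u16 op code, u16 reserved, u64 offset, u32 length
RECORD = struct.Struct("<QIHHQI")

# Hypothetical op codes; a real log would define its own table.
OP_NAMES = {1: "Read", 2: "Append", 3: "Sync"}


def encode(ts_us, thread_id, op, offset, length):
    """Hot path: pack one record without any string formatting."""
    return RECORD.pack(ts_us, thread_id, op, 0, offset, length)


def decode_all(buf):
    """Offline path: turn raw records back into readable lines."""
    # Ignore a trailing partial record, e.g. after a crash mid-write.
    usable = len(buf) - len(buf) % RECORD.size
    for ts_us, tid, op, _reserved, off, length in RECORD.iter_unpack(buf[:usable]):
        name = OP_NAMES.get(op, "op%d" % op)
        yield f"{ts_us} thread={tid} {name} offset={off} len={length}"
```

The formatting cost moves from every injected fault to the moment someone actually reads the log, which is the trade-off the fault-heavy benchmark below measures.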

Benchmarks

  • iterator-heavy db_stress (DEBUG_LEVEL=1): --trace_public_iterator_api=0 was +0.9% vs upstream/main, while --trace_public_iterator_api=1 was +22.7%
  • fault-heavy db_stress (DEBUG_LEVEL=1, read faults): raw log path improved median runtime from 3.45s to 2.99s (-13.3%)

Testing

  • not run locally as part of this PR creation

@meta-cla meta-cla Bot added the CLA Signed label Apr 15, 2026
@github-actions

github-actions Bot commented Apr 15, 2026

✅ clang-tidy: No findings on changed lines

Completed in 200.2s.

xingbowang and others added 3 commits April 15, 2026 10:12
Benchmark (DEBUG_LEVEL=1, iterator-heavy db_stress, median of 5 runs):
- upstream/main: 10.04s
- --trace_public_iterator_api=0: 10.13s (+0.9%)
- --trace_public_iterator_api=1: 12.32s (+22.7%)
Benchmark (DEBUG_LEVEL=1, db_stress, 1 thread, readpercent=100,
read_fault_one_in=1, ops_per_thread=1000000, median of 5 runs):
- previous text log path: 3.45s
- raw log path: 2.99s (-13.3%)
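
The percentages quoted above follow from the reported medians; a quick check, using a hypothetical helper `relative_change`:

```python
def relative_change(baseline, measured):
    """Percent change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline * 100.0


# Medians (in seconds) copied from the benchmark notes above.
print(f"{relative_change(10.04, 10.13):+.1f}%")  # trace_public_iterator_api=0 -> +0.9%
print(f"{relative_change(10.04, 12.32):+.1f}%")  # trace_public_iterator_api=1 -> +22.7%
print(f"{relative_change(3.45, 2.99):+.1f}%")    # raw log path -> -13.3%
```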
@xingbowang xingbowang force-pushed the 2026_04_15_trace_debug_improvements branch from 39109af to cc314a3 Compare April 15, 2026 17:13
@meta-codesync

meta-codesync Bot commented Apr 15, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D101066219.

Contributor

@joshkang97 joshkang97 left a comment

It seems we're not wrapping all iterators, e.g. txn iterators and attribute-group iterators.

Should we trace the Get variants as well?

)
)

for _ in range(used_slots):
Contributor

We are printing the trace in slot order, but wouldn't it be better to print in timestamp or seqno order?
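
A minimal sketch of what seqno-ordered output could look like on the parser side. It assumes each thread's ring buffer has a single writer, so entries are already seqno-ordered within a thread once the ring is unrolled from its oldest live slot. The dict-based entry shape and the names `entries_in_ring_order` and `merge_by_seq` are hypothetical, not the PR's code.

```python
import heapq
from operator import itemgetter


def entries_in_ring_order(slots, head):
    """Return one thread's live entries oldest-first.

    `slots` is the ring buffer contents and `head` is the total number of
    writes so far, so the newest entry sits at (head - 1) % len(slots).
    """
    capacity = len(slots)
    used = min(head, capacity)
    return [slots[pos % capacity] for pos in range(head - used, head)]


def merge_by_seq(per_thread_entries):
    """K-way merge of per-thread lists that are each sorted by 'seq'."""
    return list(heapq.merge(*per_thread_entries, key=itemgetter("seq")))
```

Merging already-sorted per-thread runs avoids a full sort, although simply sorting all entries by seqno would give the same result.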

const uint64_t seq =
next_sequence_.fetch_add(1, std::memory_order_relaxed) + 1;
const uint64_t pos = log.head.fetch_add(1, std::memory_order_relaxed);
TraceEntry& entry = log.entries[pos % kEntriesPerThread];
Contributor

This mutates the ring buffer in place. If there is a crash mid-mutation, will we be left with a corrupt entry?
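
One common way to make torn entries detectable is to bracket each entry with a begin marker and a matching commit marker written last, and have the reader drop entries whose markers disagree. The simulation below illustrates that idea only. The layout and the names `ENTRY`, `write_entry`, and `read_valid` are hypothetical, and a real C++ writer would also need appropriate memory ordering on the marker stores.

```python
import struct

# Hypothetical entry layout: u64 begin_seq, u64 payload, u64 end_seq.
ENTRY = struct.Struct("<QQQ")


def write_entry(buf, slot, seq, payload, crash_midway=False):
    """Write an entry in place; the end marker is stored last."""
    off = slot * ENTRY.size
    struct.pack_into("<Q", buf, off, seq)          # begin marker
    struct.pack_into("<Q", buf, off + 8, payload)  # entry body
    if crash_midway:
        return  # simulate a crash before the commit marker is written
    struct.pack_into("<Q", buf, off + 16, seq)     # commit marker


def read_valid(buf):
    """Return (seq, payload) for entries whose markers match; skip the rest."""
    valid = []
    for begin, payload, end in ENTRY.iter_unpack(buf):
        if begin != 0 and begin == end:  # 0 means the slot was never used
            valid.append((begin, payload))
    return valid
```

With this scheme a crash mid-overwrite costs one entry (the old one is already partially overwritten), but the parser never reports a half-written entry as valid.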

@xingbowang xingbowang closed this Apr 22, 2026
@xingbowang
Contributor Author

Discussed offline: we have a separate plan to preserve the WAL files from db_stress tests, so the extra tracing is no longer needed.
