fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing'#4740
fix Bookkeeper can't startup cause by 'IOException: Recovery log xxx is missing'#4740void-ptr974 wants to merge 5 commits into
Conversation
|
good jobs, |
There was a problem hiding this comment.
Pull request overview
This PR addresses BookKeeper startup failures where a referenced recovery journal file has been garbage-collected due to a race between concurrent checkpointComplete() calls (e.g., SyncThread vs SingleDirectoryDbLedgerStorage.flush()), causing lastMark to move backwards and point at a deleted journal.
Changes:
- Adds a monotonic-advance guard to
Journal.checkpointComplete()intended to preventlastMarkregression under concurrent checkpoints. - Adds a regression test attempting to reproduce the “recovery log is missing” scenario and validate the fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java |
Introduces a lock + persisted-mark tracking intended to keep lastMark monotonic across concurrent checkpoints. |
bookkeeper-server/src/test/java/org/apache/bookkeeper/bookie/storage/ldb/DbLedgerStorageTest.java |
Adds a test scenario for the journal-missing failure mode and expected monotonic behavior. |
Comments suppressed due to low confidence (1)
bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Journal.java:804
- The monotonic guard is not sufficient to prevent lastMark regression: if an older checkpoint enters this method first (passes the check and updates lastPersistedMark) but reaches rollLog() after a newer checkpoint has already persisted a later mark, this call will still execute rollLog() and can overwrite the lastMark file backwards. Consider serializing the whole rollLog + journal GC block under the same lock, or re-checking the mark under the lock immediately before persisting/deleting so stale checkpoints cannot write.
// Monotonic check: only advance lastMark forward, never backwards.
synchronized (checkpointLock) {
if (mark.getCurMark().compare(lastPersistedMark) <= 0) {
return;
}
lastPersistedMark.setLogMark(
mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
}
mark.rollLog(mark);
if (compact) {
// list the journals that have been marked
List<Long> logs = listJournalIds(journalDirectory, new JournalRollingFilter(mark));
// keep MAX_BACKUP_JOURNALS journal files before marked journal
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@StevenLuMT Thanks for the feedback. I've updated the PR description with a clearer explanation of the triggering scenario. |
|
good jobs, I have rerun CI |
There was a problem hiding this comment.
@void-ptr974 Thanks for your contribution.
Great diagnosis — the race is real, the description is precise, and the monotonic guard is the right shape of fix. A few concerns and suggestions before merging.
Concern 1: rollLog is outside the synchronized block — a narrower variant of the same race still exists
The fix correctly prevents an older mark from overwriting newer on the disk file when calls are interleaved as described. But two newer marks racing each other are not protected, because rollLog runs
outside checkpointLock:
synchronized (checkpointLock) {
if (mark.getCurMark().compare(lastPersistedMark) < 0) {
return;
}
lastPersistedMark.setLogMark(...); // updated under lock
}
mark.rollLog(mark); // executed outside the lock
Scenario:
- Thread C: checkpointComplete(mark=7) passes guard, lastPersistedMark=7, starts rollLog (slow disk write + force(true)).
- Thread D: checkpointComplete(mark=9) passes guard (9 > 7), lastPersistedMark=9, starts rollLog.
- Thread D's rollLog finishes first → file contains 9.
- Thread C's rollLog finishes second → file contains 7. lastMark regresses.
rollLog opens new FileOutputStream(file) which truncates, so it really is last-writer-wins. This window is much smaller than the one the PR targets, but it's structurally the same bug and the guard
doesn't close it.
Suggested fix: either move rollLog inside the synchronized block, or have rollLog write lastPersistedMark (the source of truth) rather than the per-call mark argument. The former is simpler:
synchronized (checkpointLock) {
if (mark.getCurMark().compare(lastPersistedMark) < 0) {
return;
}
lastPersistedMark.setLogMark(
mark.getCurMark().getLogFileId(), mark.getCurMark().getLogFileOffset());
mark.rollLog(mark);
}
if (compact) {
...
}
rollLog does a 16-byte write + fsync. Holding the lock across it is fine — checkpointComplete is not on the hot path, and serializing it is exactly the invariant we want.
Concern 2: the test doesn't actually exercise the race
testConcurrentCheckpointCompleteLastMarkRegression calls checkpointComplete(newer) then checkpointComplete(stale) sequentially on the same thread. It verifies the guard logic but doesn't reproduce the
actual concurrent path that motivated the PR.
A concurrent test would:
- Spin up two threads, one with the stale checkpoint and one with the newer.
- Use latches/barriers to force the documented interleaving (Path B's checkpointComplete runs before Path A's, both with rollLog in flight).
- Assert lastMark file contents after both finish.
This also surfaces Concern 1 — if you run two threads with newer marks (7 and 9) concurrently in a loop, you can demonstrate the residual race.
Concern 3: lastPersistedMark is initialized to (0, 0)
At startup, LastLogMark.readLog() populates lastLogMark.curMark from the on-disk file (could be at, say, mark 100). But lastPersistedMark stays at (0, 0). In practice this is benign because
newCheckpoint() only ever produces marks ≥ the current journal position, so the first post-startup checkpointComplete will be ≥ 100 and the guard advances correctly. But it's conceptually inconsistent
— lastPersistedMark claims to be "what's on disk" and is actually zero.
Consider seeding it from lastLogMark when Journal.readLog() runs, or at least document the invariant.
Smaller comments
- Style: mark.rollLog(mark) (a method on LastLogMark taking a LastLogMark and using the argument's curMark) is awkward. Pre-existing, not for this PR — but a follow-up to make rollLog parameter-less
and use this.curMark would clean this up. - Deeper question (not blocking): zooming out, SingleDirectoryDbLedgerStorage.flush() calls checkpointSource.checkpointComplete(cp, true) internally (line 925), and SyncThread.flush() also calls
and compact=true, the outer one with the stale snapshot and compact=false. The monotonic guard handles this safely now, but it might be worth questioning whether SDLS should be calling
checkpointComplete at all when invoked from SyncThread's path. That's out of scope here, but the duplicated responsibility for advancing lastMark is a smell. - Comment: the inline // Skip if this mark is not newer than what was already persisted. comment is good. Adding a one-liner reference to the issue/PR (// See #4105) would help future readers trace the
rationale.
What looks good
- Guard semantics (compare < 0 not ≤ 0) correctly tolerates an idempotent retry with the same mark.
- The skip path doesn't suppress compact=true work that matters: if a stale call's compaction is skipped, the newer call's compaction has already deleted everything older.
- The synchronized block is tiny (compare + set) so contention is a non-issue.
Once Concerns 1 and 2 are addressed I think this is ready.
When journal and dbledgerstorage on the same disk, concurrent checkpointComplete calls from SyncThread and SingleDirectoryDbLedgerStorage.flush() can overwrite lastMark backwards,causing journal files to be deleted while still referenced by the older mark.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.
Root cause
SyncThread.flush() has a nested call pattern that causes lastMark regression:
SyncThread.flush():
1. outerCheckpoint = newCheckpoint() → mark = (file 5, offset 100)
2. ledgerStorage.flush()
→ SingleDirectoryDbLedgerStorage.flush():
a. innerCheckpoint = newCheckpoint() → mark = (file 7, offset 200), journal has advanced
b. flushes data to disk
c. checkpointComplete(mark=7, compact=true)
→ persists lastMark=7, deletes journal files with id < 7 (including file 5)
3. checkpointComplete(mark=5, compact=false)
→ persists lastMark=5, overwriting the 7 written in step 2c
4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash
Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
position whose journal file was already deleted.
Conditions
- singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
- Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
- maxBackupJournals small enough for old files to actually be deleted
Fix
Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.
Root cause
SyncThread.flush() has a nested call pattern that causes lastMark regression:
SyncThread.flush():
1. outerCheckpoint = newCheckpoint() → mark = (file 5, offset 100)
2. ledgerStorage.flush()
→ SingleDirectoryDbLedgerStorage.flush():
a. innerCheckpoint = newCheckpoint() → mark = (file 7, offset 200), journal has advanced
b. flushes data to disk
c. checkpointComplete(mark=7, compact=true)
→ persists lastMark=7, deletes journal files with id < 7 (including file 5)
3. checkpointComplete(mark=5, compact=false)
→ persists lastMark=5, overwriting the 7 written in step 2c
4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash
Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
position whose journal file was already deleted.
Conditions
- singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
- Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
- maxBackupJournals small enough for old files to actually be deleted
Fix
Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
When singleLedgerDirs=true, bookie can crash on restart with "Recovery log is missing" due to lastMark file being overwritten backwards.
Root cause
SyncThread.flush() has a nested call pattern that causes lastMark regression:
SyncThread.flush():
1. outerCheckpoint = newCheckpoint() → mark = (file 5, offset 100)
2. ledgerStorage.flush()
→ SingleDirectoryDbLedgerStorage.flush():
a. innerCheckpoint = newCheckpoint() → mark = (file 7, offset 200), journal has advanced
b. flushes data to disk
c. checkpointComplete(mark=7, compact=true)
→ persists lastMark=7, deletes journal files with id < 7 (including file 5)
3. checkpointComplete(mark=5, compact=false)
→ persists lastMark=5, overwriting the 7 written in step 2c
4. Bookie restarts → reads lastMark=5 → file 5 no longer exists → crash
Step 1 captures the checkpoint before step 2 runs, so it is always older. Step 2c advances lastMark and garbage-collects old journals. Step 3 then overwrites lastMark backwards to a
position whose journal file was already deleted.
Conditions
- singleLedgerDirs=true (journal and ledger on the same disk, so SingleDirectoryDbLedgerStorage.flush() calls checkpointComplete internally)
- Journal file rotates between the two newCheckpoint() calls (requires sufficient write throughput)
- maxBackupJournals small enough for old files to actually be deleted
Fix
Add a monotonic guard in checkpointComplete(): track the highest mark persisted so far, skip any call with an older mark. This prevents rollLog from overwriting lastMark backwards.
2d35897 to
5498af4
Compare
|
Thanks @hangc0276, I addressed your comments. The patch now initializes |
Fix #4105
Motivation
A bookie may fail to restart with:
IOException: Recovery log xxx is missingThis can happen when the persisted journal
lastMarkmoves backwards and points to a journal file that has already been garbage-collected.Root Cause
With
DbLedgerStorageand a single ledger directory, one flush cycle may complete checkpoints out of order.A possible sequence is:
SyncThread.flush()captures an older checkpoint, for example journal5.ledgerStorage.flush().SingleDirectoryDbLedgerStorage.flush()captures and completes a newer checkpoint, for example journal7.7persistslastMark=7.5.SyncThread.flush()resumes and completes the older checkpoint5.lastMarkback to5.If the bookie crashes after this, restart reads
lastMark=5, but journal5may no longer exist, so recovery fails.Example
Assume the bookie currently has these journal files:
5.txnlastMark=56.txn7.txnA possible execution order:
lastMarkSyncThread.flush()55ledgerStorage.flush()75checkpointComplete(7)7775.txnmay be deletedcheckpointComplete(5)5before this patchlastMarkmoves backwardslastMark=555.txnis missingCheckpoint
5is valid when it is created. It becomes stale only after checkpoint7completes first and allows journal GC to remove older logs.Race
The same issue can happen when two checkpoint completions are interleaved.
lastMarkcheckpointComplete(7)555checkpointComplete(9)555lastMark=99lastMark=77before this patchWithout holding the stale check, persist, and in-memory update under the same lock, an older checkpoint can pass the stale check before a newer checkpoint is persisted, then write later and move
lastMarkbackwards.Changes
This patch makes journal checkpoint completion monotonic.
Journal.checkpointComplete()now:lastMark;lastMarkloaded from disk during journal startup.In other words,
lastMarkmay move from5to7to9, but it must never move back from9to7or from7to5.Performance Impact
The impact should be negligible.
Checkpoint completion is not on the normal journal write hot path. The added comparison is in-memory and constant-time. The lock only protects checkpoint mark persistence, and stale checkpoints may now avoid unnecessary
lastMarkwrites.Tests
Added coverage for:
lastMark;Verified with:
mvn -pl bookkeeper-server -am -P '!mac' -Dtest=DbLedgerStorageTest -Dsurefire.failIfNoSpecifiedTests=false testmvn -pl bookkeeper-server -P '!mac' -DskipTests checkstyle:check