Fix/store offload bugs by pingzhuu · Pull Request #2594 · kvcache-ai/Mooncake

pingzhuu · 2026-06-24T06:48:38Z

Description

Module

Type of Change

How Has This Been Tested?

Test commands:

# Example: bash scripts/run_ci_test.sh

Test results:

Unit tests pass
Integration tests pass (if applicable)
Manual testing done (describe below)

Checklist

I have performed a self-review of my own code
I have formatted my code using ./scripts/code_format.sh
I have run pre-commit run --all-files and all hooks pass
I have updated the documentation (if applicable)
I have added tests to prove my changes are effective
For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

No AI tools were used
AI tools were used (specify below)

…2529 commit 1/3)

When a master failover occurs, the new leader's ssd_total_capacity_bytes is 0 because the field is in-memory and not replicated via oplog. The store-server heartbeat only calls ReportSsdCapacity once at init, so the new leader's SSD Storage denominator stays at 0 and SSD-level eviction never triggers. Fix: in FileStorage::Heartbeat(), when OffloadObjectHeartbeat returns SEGMENT_NOT_FOUND (master lost our LOCAL_DISK segment on restart/failover), re-report SSD capacity via ReportSsdCapacity after re-registering the segment, mirroring the init-time call.

vector_write/vector_read called pwritev/preadv with the full iovec array in one shot. When iovcnt exceeds UIO_MAXIOV (1024 on Linux), the syscall returns EINVAL, surfacing as FILE_WRITE_FAIL and corrupting the offload bucket. Fix: iterate the iovec array in chunks of UIO_MAXIOV, advancing the file offset by the bytes written in each chunk, and verify the total matches the expected byte count. Also log errno and context (filename, fd, byte counts, chunk range) on read/write/pwritev/preadv failures to aid diagnosis.

gemini-code-assist

Code Review

This pull request introduces background worker threads to execute L2-to-L1 promotion-on-hit tasks asynchronously in FileStorage, along with corresponding configuration options, unit tests, and benchmarks. It also updates PosixFile to perform chunked vector reads and writes with enhanced error logging. The review feedback highlights critical issues: potential data corruption in PosixFile::vector_write and PosixFile::vector_read due to incorrect offset handling on partial writes/reads, a concurrency race condition during shutdown in FileStorage that could leak enqueued tasks, and a potential crash during shutdown caused by a strict CHECK on task enqueuing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-24T06:51:06Z

+    for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
+        int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
+        ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset);
+        if (ret < 0) {
+            int saved_errno = errno;
+            LOG(ERROR) << "pwritev failed for file: " << filename_
+                       << ", errno=" << saved_errno
+                       << " (" << strerror(saved_errno) << ")"
+                       << ", fd=" << fd_
+                       << ", iovcnt=" << iovcnt
+                       << ", total_bytes=" << total_bytes
+                       << ", offset=" << offset
+                       << ", chunk_start=" << idx
+                       << ", chunk_cnt=" << chunk_cnt;
+            return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
+        }
+        written_total += ret;
+        cur_offset += ret;
+    }


Issue: Potential Data Corruption on Partial Write

In PosixFile::vector_write, if ::pwritev performs a partial write (i.e., returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This means the remaining unwritten bytes of the current chunk are skipped, and subsequent chunks are written at incorrect offsets, leading to silent data corruption.

Since PosixFile expects exact writes, we should fail immediately if ret is less than the expected chunk size.

for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) { int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV); size_t chunk_bytes = 0; for (int i = 0; i < chunk_cnt; ++i) { chunk_bytes += iov[idx + i].iov_len; } ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset); if (ret < 0) { int saved_errno = errno; LOG(ERROR) << "pwritev failed for file: " << filename_ << ", errno=" << saved_errno << " (" << strerror(saved_errno) << ")" << ", fd=" << fd_ << ", iovcnt=" << iovcnt << ", total_bytes=" << total_bytes << ", offset=" << offset << ", chunk_start=" << idx << ", chunk_cnt=" << chunk_cnt; return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL); } if (static_cast<size_t>(ret) < chunk_bytes) { LOG(ERROR) << "pwritev partial write for file: " << filename_ << ", expected=" << chunk_bytes << ", written=" << ret; return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL); } written_total += ret; cur_offset += ret; }

gemini-code-assist · 2026-06-24T06:51:06Z

+    for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
+        int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
+        ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset);
+        if (ret < 0) {
+            int saved_errno = errno;
+            LOG(ERROR) << "preadv failed for file: " << filename_
+                       << ", errno=" << saved_errno
+                       << " (" << strerror(saved_errno) << ")"
+                       << ", fd=" << fd_
+                       << ", iovcnt=" << iovcnt
+                       << ", total_bytes=" << total_bytes
+                       << ", offset=" << offset
+                       << ", chunk_start=" << idx
+                       << ", chunk_cnt=" << chunk_cnt;
+            return make_error<size_t>(ErrorCode::FILE_READ_FAIL);
+        }
+        read_total += ret;
+        cur_offset += ret;
+        if (ret == 0) break;  // EOF
    }


Issue: Potential Data Corruption on Partial Read

In PosixFile::vector_read, if ::preadv performs a partial read (returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This shifts the offsets for subsequent chunks, reading data into the wrong buffers and corrupting the read result.

We should break early if ret is less than the expected chunk size (which indicates EOF or a short read), stopping further reads.

for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) { int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV); size_t chunk_bytes = 0; for (int i = 0; i < chunk_cnt; ++i) { chunk_bytes += iov[idx + i].iov_len; } ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset); if (ret < 0) { int saved_errno = errno; LOG(ERROR) << "preadv failed for file: " << filename_ << ", errno=" << saved_errno << " (" << strerror(saved_errno) << ")" << ", fd=" << fd_ << ", iovcnt=" << iovcnt << ", total_bytes=" << total_bytes << ", offset=" << offset << ", chunk_start=" << idx << ", chunk_cnt=" << chunk_cnt; return make_error<size_t>(ErrorCode::FILE_READ_FAIL); } read_total += ret; cur_offset += ret; if (static_cast<size_t>(ret) < chunk_bytes) { break; // EOF or partial read, stop reading further } }

gemini-code-assist · 2026-06-24T06:51:06Z

+    std::vector<PromotionTaskItem> queued_tasks_to_release;
+    {
+        std::lock_guard<std::mutex> lock(promotion_queue_mutex_);
+        while (!promotion_task_queue_.empty()) {
+            queued_tasks_to_release.push_back(
+                std::move(promotion_task_queue_.front()));
+            promotion_task_queue_.pop_front();
+        }
+    }
+    promotion_workers_running_.store(false);


Issue: Concurrency Race Condition during Shutdown

In FileStorage::~FileStorage(), the lock on promotion_queue_mutex_ is released at line 251, and then promotion_workers_running_ is set to false at line 252.

During this tiny window, another thread (such as the heartbeat thread or a concurrent test runner) could call EnqueuePromotionTask(), which acquires the lock, sees promotion_workers_running_ as true, and successfully pushes a task to promotion_task_queue_. Since the queue was already cleared, this newly enqueued task will be leaked, leaving its master slot permanently pinned.

To prevent this race condition, promotion_workers_running_ should be set to false inside the lock block before clearing the queue.

Suggested change

std::vector<PromotionTaskItem> queued_tasks_to_release;

{

std::lock_guard<std::mutex> lock(promotion_queue_mutex_);

while (!promotion_task_queue_.empty()) {

queued_tasks_to_release.push_back(

std::move(promotion_task_queue_.front()));

promotion_task_queue_.pop_front();

}

}

promotion_workers_running_.store(false);

std::vector<PromotionTaskItem> queued_tasks_to_release;

{

std::lock_guard<std::mutex> lock(promotion_queue_mutex_);

promotion_workers_running_.store(false);

while (!promotion_task_queue_.empty()) {

queued_tasks_to_release.push_back(

std::move(promotion_task_queue_.front()));

promotion_task_queue_.pop_front();

}

}

gemini-code-assist · 2026-06-24T06:51:06Z

+    auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) {
+        for (const auto& task : tasks) {
+            const bool enqueued = EnqueuePromotionTask(
+                task, /*allow_over_capacity_for_pulled_task=*/
+                config_.promotion_queue_capacity > 0);
+            CHECK(enqueued)
+                << "Promotion queue capacity changed after pull planning";
+            ++enqueued_tasks;
        }
+    };


Issue: Potential Crash on Shutdown / Enqueue Failure

Using CHECK(enqueued) will abruptly terminate the program if EnqueuePromotionTask returns false. This can easily happen during shutdown when promotion_workers_running_ is set to false while the heartbeat thread is still processing or pulling tasks.

Instead of crashing the program, we should handle the failure gracefully by logging a warning and calling ReleasePromotionTask to ensure the master slot is released immediately.

auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) { for (const auto& task : tasks) { const bool enqueued = EnqueuePromotionTask( task, /*allow_over_capacity_for_pulled_task=*/ config_.promotion_queue_capacity > 0); if (!enqueued) { LOG(WARNING) << "Failed to enqueue promotion task for key=" << task.key << "; releasing master slot"; ReleasePromotionTask(task.key, task.tenant_id); } else { ++enqueued_tasks; } } };

…File Previously BucketStorageBackend::OpenFile only created UringFile for FileMode::Read, falling back to PosixFile for writes. This meant write paths (WriteBucket, WriteSlicesToFile via this backend) never took advantage of io_uring async submission and queue depth, even with MOONCAKE_OFFLOAD_USE_URING=1. Align with the base StorageBackend::create_file behavior: always create UringFile when use_uring is set, but keep O_DIRECT only for reads (write use_direct_io=false). O_DIRECT writes would require 4 KiB alignment padding which corrupts meta file parsing and wastes disk space, so writes stay buffered while still submitting through the io_uring ring.

Replace the pre-allocated 32MiB aligned bounce-buffer path with vector_write scatter-gather. Previously, buckets larger than 32MiB triggered a temporary posix_memalign allocation (logged as a WARNING per write) plus an extra memcpy aggregation step. Writes were already buffered (OpenFile uses O_DIRECT only for reads), so write_aligned gave no benefit; vector_write issues one SQE per iov directly. datasync is still called on UringFile after write for the same write-ordering durability guarantee.

pingzhuu and others added 5 commits June 23, 2026 19:16

store: drain promotion backlog off the heartbeat path (PR kvcache-ai#…

b671871

…2529 commit 1/3)

store: tighten promotion worker follow-up

50317a3

store: tighten promotion drain benchmark follow-up

49ac9f6

pingzhuu requested review from ShangmingCai, XucSh, YiXR, stmatengss and ykwd as code owners June 24, 2026 06:48

pingzhuu marked this pull request as draft June 24, 2026 06:49

github-actions Bot added documentation Improvements or additions to documentation run-ci Store Installation Tests labels Jun 24, 2026

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

pingzhuu added 2 commits June 24, 2026 14:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix/store offload bugs#2594

Fix/store offload bugs#2594
pingzhuu wants to merge 7 commits into
kvcache-ai:mainfrom
pingzhuu:fix/store-offload-bugs

pingzhuu commented Jun 24, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

pingzhuu commented Jun 24, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Module

Type of Change

How Has This Been Tested?

Checklist

AI Assistance Disclosure

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Issue: Potential Data Corruption on Partial Write

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Issue: Potential Data Corruption on Partial Read

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Issue: Concurrency Race Condition during Shutdown

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Issue: Potential Crash on Shutdown / Enqueue Failure

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pingzhuu commented Jun 24, 2026 •

edited by github-actions Bot

Loading