Skip to content

Fix/store offload bugs#2594

Draft
pingzhuu wants to merge 7 commits into
kvcache-ai:mainfrom
pingzhuu:fix/store-offload-bugs
Draft

Fix/store offload bugs#2594
pingzhuu wants to merge 7 commits into
kvcache-ai:mainfrom
pingzhuu:fix/store-offload-bugs

Conversation

@pingzhuu

@pingzhuu pingzhuu commented Jun 24, 2026

Copy link
Copy Markdown

Description

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Mooncake PG (mooncake-pg)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • Common (mooncake-common)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement
  • Other

How Has This Been Tested?

Test commands:

# Example: bash scripts/run_ci_test.sh

Test results:

  • Unit tests pass
  • Integration tests pass (if applicable)
  • Manual testing done (describe below)

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used (specify below)

pingzhuu and others added 5 commits June 23, 2026 19:16
When a master failover occurs, the new leader's ssd_total_capacity_bytes
is 0 because the field is in-memory and not replicated via oplog. The
store-server heartbeat only calls ReportSsdCapacity once at init, so the
new leader's SSD Storage denominator stays at 0 and SSD-level eviction
never triggers.

Fix: in FileStorage::Heartbeat(), when OffloadObjectHeartbeat returns
SEGMENT_NOT_FOUND (master lost our LOCAL_DISK segment on restart/failover),
re-report SSD capacity via ReportSsdCapacity after re-registering the
segment, mirroring the init-time call.
vector_write/vector_read called pwritev/preadv with the full iovec
array in one shot. When iovcnt exceeds UIO_MAXIOV (1024 on Linux),
the syscall returns EINVAL, surfacing as FILE_WRITE_FAIL and corrupting
the offload bucket.

Fix: iterate the iovec array in chunks of UIO_MAXIOV, advancing the
file offset by the bytes written in each chunk, and verify the total
matches the expected byte count.

Also log errno and context (filename, fd, byte counts, chunk range)
on read/write/pwritev/preadv failures to aid diagnosis.
@pingzhuu pingzhuu marked this pull request as draft June 24, 2026 06:49
@github-actions github-actions Bot added documentation Improvements or additions to documentation run-ci Store Installation Tests labels Jun 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces background worker threads to execute L2-to-L1 promotion-on-hit tasks asynchronously in FileStorage, along with corresponding configuration options, unit tests, and benchmarks. It also updates PosixFile to perform chunked vector reads and writes with enhanced error logging. The review feedback highlights critical issues: potential data corruption in PosixFile::vector_write and PosixFile::vector_read due to incorrect offset handling on partial writes/reads, a concurrency race condition during shutdown in FileStorage that could leak enqueued tasks, and a potential crash during shutdown caused by a strict CHECK on task enqueuing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +131 to +149
for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset);
if (ret < 0) {
int saved_errno = errno;
LOG(ERROR) << "pwritev failed for file: " << filename_
<< ", errno=" << saved_errno
<< " (" << strerror(saved_errno) << ")"
<< ", fd=" << fd_
<< ", iovcnt=" << iovcnt
<< ", total_bytes=" << total_bytes
<< ", offset=" << offset
<< ", chunk_start=" << idx
<< ", chunk_cnt=" << chunk_cnt;
return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
}
written_total += ret;
cur_offset += ret;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue: Potential Data Corruption on Partial Write

In PosixFile::vector_write, if ::pwritev performs a partial write (i.e., returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This means the remaining unwritten bytes of the current chunk are skipped, and subsequent chunks are written at incorrect offsets, leading to silent data corruption.

Since PosixFile expects exact writes, we should fail immediately if ret is less than the expected chunk size.

    for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
        int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
        size_t chunk_bytes = 0;
        for (int i = 0; i < chunk_cnt; ++i) {
            chunk_bytes += iov[idx + i].iov_len;
        }
        ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset);
        if (ret < 0) {
            int saved_errno = errno;
            LOG(ERROR) << "pwritev failed for file: " << filename_
                       << ", errno=" << saved_errno
                       << " (" << strerror(saved_errno) << ")"
                       << ", fd=" << fd_
                       << ", iovcnt=" << iovcnt
                       << ", total_bytes=" << total_bytes
                       << ", offset=" << offset
                       << ", chunk_start=" << idx
                       << ", chunk_cnt=" << chunk_cnt;
            return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
        }
        if (static_cast<size_t>(ret) < chunk_bytes) {
            LOG(ERROR) << "pwritev partial write for file: " << filename_
                       << ", expected=" << chunk_bytes
                       << ", written=" << ret;
            return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
        }
        written_total += ret;
        cur_offset += ret;
    }

Comment on lines +174 to 193
for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset);
if (ret < 0) {
int saved_errno = errno;
LOG(ERROR) << "preadv failed for file: " << filename_
<< ", errno=" << saved_errno
<< " (" << strerror(saved_errno) << ")"
<< ", fd=" << fd_
<< ", iovcnt=" << iovcnt
<< ", total_bytes=" << total_bytes
<< ", offset=" << offset
<< ", chunk_start=" << idx
<< ", chunk_cnt=" << chunk_cnt;
return make_error<size_t>(ErrorCode::FILE_READ_FAIL);
}
read_total += ret;
cur_offset += ret;
if (ret == 0) break; // EOF
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue: Potential Data Corruption on Partial Read

In PosixFile::vector_read, if ::preadv performs a partial read (returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This shifts the offsets for subsequent chunks, reading data into the wrong buffers and corrupting the read result.

We should break early if ret is less than the expected chunk size (which indicates EOF or a short read), stopping further reads.

    for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
        int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
        size_t chunk_bytes = 0;
        for (int i = 0; i < chunk_cnt; ++i) {
            chunk_bytes += iov[idx + i].iov_len;
        }
        ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset);
        if (ret < 0) {
            int saved_errno = errno;
            LOG(ERROR) << "preadv failed for file: " << filename_
                       << ", errno=" << saved_errno
                       << " (" << strerror(saved_errno) << ")"
                       << ", fd=" << fd_
                       << ", iovcnt=" << iovcnt
                       << ", total_bytes=" << total_bytes
                       << ", offset=" << offset
                       << ", chunk_start=" << idx
                       << ", chunk_cnt=" << chunk_cnt;
            return make_error<size_t>(ErrorCode::FILE_READ_FAIL);
        }
        read_total += ret;
        cur_offset += ret;
        if (static_cast<size_t>(ret) < chunk_bytes) {
            break;  // EOF or partial read, stop reading further
        }
    }

Comment on lines +243 to +252
std::vector<PromotionTaskItem> queued_tasks_to_release;
{
std::lock_guard<std::mutex> lock(promotion_queue_mutex_);
while (!promotion_task_queue_.empty()) {
queued_tasks_to_release.push_back(
std::move(promotion_task_queue_.front()));
promotion_task_queue_.pop_front();
}
}
promotion_workers_running_.store(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue: Concurrency Race Condition during Shutdown

In FileStorage::~FileStorage(), the lock on promotion_queue_mutex_ is released at line 251, and then promotion_workers_running_ is set to false at line 252.

During this tiny window, another thread (such as the heartbeat thread or a concurrent test runner) could call EnqueuePromotionTask(), which acquires the lock, sees promotion_workers_running_ as true, and successfully pushes a task to promotion_task_queue_. Since the queue was already cleared, this newly enqueued task will be leaked, leaving its master slot permanently pinned.

To prevent this race condition, promotion_workers_running_ should be set to false inside the lock block before clearing the queue.

Suggested change
std::vector<PromotionTaskItem> queued_tasks_to_release;
{
std::lock_guard<std::mutex> lock(promotion_queue_mutex_);
while (!promotion_task_queue_.empty()) {
queued_tasks_to_release.push_back(
std::move(promotion_task_queue_.front()));
promotion_task_queue_.pop_front();
}
}
promotion_workers_running_.store(false);
std::vector<PromotionTaskItem> queued_tasks_to_release;
{
std::lock_guard<std::mutex> lock(promotion_queue_mutex_);
promotion_workers_running_.store(false);
while (!promotion_task_queue_.empty()) {
queued_tasks_to_release.push_back(
std::move(promotion_task_queue_.front()));
promotion_task_queue_.pop_front();
}
}

Comment on lines +809 to +818
auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) {
for (const auto& task : tasks) {
const bool enqueued = EnqueuePromotionTask(
task, /*allow_over_capacity_for_pulled_task=*/
config_.promotion_queue_capacity > 0);
CHECK(enqueued)
<< "Promotion queue capacity changed after pull planning";
++enqueued_tasks;
}
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Issue: Potential Crash on Shutdown / Enqueue Failure

Using CHECK(enqueued) will abruptly terminate the program if EnqueuePromotionTask returns false. This can easily happen during shutdown when promotion_workers_running_ is set to false while the heartbeat thread is still processing or pulling tasks.

Instead of crashing the program, we should handle the failure gracefully by logging a warning and calling ReleasePromotionTask to ensure the master slot is released immediately.

    auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) {
        for (const auto& task : tasks) {
            const bool enqueued = EnqueuePromotionTask(
                task, /*allow_over_capacity_for_pulled_task=*/
                config_.promotion_queue_capacity > 0);
            if (!enqueued) {
                LOG(WARNING) << "Failed to enqueue promotion task for key=" << task.key
                             << "; releasing master slot";
                ReleasePromotionTask(task.key, task.tenant_id);
            } else {
                ++enqueued_tasks;
            }
        }
    };

pingzhuu added 2 commits June 24, 2026 14:53
…File

Previously BucketStorageBackend::OpenFile only created UringFile for
FileMode::Read, falling back to PosixFile for writes. This meant write
paths (WriteBucket, WriteSlicesToFile via this backend) never took
advantage of io_uring async submission and queue depth, even with
MOONCAKE_OFFLOAD_USE_URING=1.

Align with the base StorageBackend::create_file behavior: always create
UringFile when use_uring is set, but keep O_DIRECT only for reads (write
use_direct_io=false). O_DIRECT writes would require 4 KiB alignment
padding which corrupts meta file parsing and wastes disk space, so
writes stay buffered while still submitting through the io_uring ring.
Replace the pre-allocated 32MiB aligned bounce-buffer path with
vector_write scatter-gather. Previously, buckets larger than 32MiB
triggered a temporary posix_memalign allocation (logged as a WARNING
per write) plus an extra memcpy aggregation step. Writes were already
buffered (OpenFile uses O_DIRECT only for reads), so write_aligned
gave no benefit; vector_write issues one SQE per iov directly.

datasync is still called on UringFile after write for the same
write-ordering durability guarantee.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation Installation run-ci Store Tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants