Fix/store offload bugs#2594
Conversation
When a master failover occurs, the new leader's ssd_total_capacity_bytes is 0 because the field is in-memory and not replicated via oplog. The store-server heartbeat only calls ReportSsdCapacity once at init, so the new leader's SSD Storage denominator stays at 0 and SSD-level eviction never triggers. Fix: in FileStorage::Heartbeat(), when OffloadObjectHeartbeat returns SEGMENT_NOT_FOUND (master lost our LOCAL_DISK segment on restart/failover), re-report SSD capacity via ReportSsdCapacity after re-registering the segment, mirroring the init-time call.
vector_write/vector_read called pwritev/preadv with the full iovec array in one shot. When iovcnt exceeds UIO_MAXIOV (1024 on Linux), the syscall returns EINVAL, surfacing as FILE_WRITE_FAIL and corrupting the offload bucket. Fix: iterate the iovec array in chunks of UIO_MAXIOV, advancing the file offset by the bytes written in each chunk, and verify the total matches the expected byte count. Also log errno and context (filename, fd, byte counts, chunk range) on read/write/pwritev/preadv failures to aid diagnosis.
There was a problem hiding this comment.
Code Review
This pull request introduces background worker threads to execute L2-to-L1 promotion-on-hit tasks asynchronously in FileStorage, along with corresponding configuration options, unit tests, and benchmarks. It also updates PosixFile to perform chunked vector reads and writes with enhanced error logging. The review feedback highlights critical issues: potential data corruption in PosixFile::vector_write and PosixFile::vector_read due to incorrect offset handling on partial writes/reads, a concurrency race condition during shutdown in FileStorage that could leak enqueued tasks, and a potential crash during shutdown caused by a strict CHECK on task enqueuing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) { | ||
| int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV); | ||
| ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset); | ||
| if (ret < 0) { | ||
| int saved_errno = errno; | ||
| LOG(ERROR) << "pwritev failed for file: " << filename_ | ||
| << ", errno=" << saved_errno | ||
| << " (" << strerror(saved_errno) << ")" | ||
| << ", fd=" << fd_ | ||
| << ", iovcnt=" << iovcnt | ||
| << ", total_bytes=" << total_bytes | ||
| << ", offset=" << offset | ||
| << ", chunk_start=" << idx | ||
| << ", chunk_cnt=" << chunk_cnt; | ||
| return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL); | ||
| } | ||
| written_total += ret; | ||
| cur_offset += ret; | ||
| } |
There was a problem hiding this comment.
Issue: Potential Data Corruption on Partial Write
In PosixFile::vector_write, if ::pwritev performs a partial write (i.e., returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This means the remaining unwritten bytes of the current chunk are skipped, and subsequent chunks are written at incorrect offsets, leading to silent data corruption.
Since PosixFile expects exact writes, we should fail immediately if ret is less than the expected chunk size.
for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
size_t chunk_bytes = 0;
for (int i = 0; i < chunk_cnt; ++i) {
chunk_bytes += iov[idx + i].iov_len;
}
ssize_t ret = ::pwritev(fd_, iov + idx, chunk_cnt, cur_offset);
if (ret < 0) {
int saved_errno = errno;
LOG(ERROR) << "pwritev failed for file: " << filename_
<< ", errno=" << saved_errno
<< " (" << strerror(saved_errno) << ")"
<< ", fd=" << fd_
<< ", iovcnt=" << iovcnt
<< ", total_bytes=" << total_bytes
<< ", offset=" << offset
<< ", chunk_start=" << idx
<< ", chunk_cnt=" << chunk_cnt;
return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
}
if (static_cast<size_t>(ret) < chunk_bytes) {
LOG(ERROR) << "pwritev partial write for file: " << filename_
<< ", expected=" << chunk_bytes
<< ", written=" << ret;
return make_error<size_t>(ErrorCode::FILE_WRITE_FAIL);
}
written_total += ret;
cur_offset += ret;
}| for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) { | ||
| int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV); | ||
| ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset); | ||
| if (ret < 0) { | ||
| int saved_errno = errno; | ||
| LOG(ERROR) << "preadv failed for file: " << filename_ | ||
| << ", errno=" << saved_errno | ||
| << " (" << strerror(saved_errno) << ")" | ||
| << ", fd=" << fd_ | ||
| << ", iovcnt=" << iovcnt | ||
| << ", total_bytes=" << total_bytes | ||
| << ", offset=" << offset | ||
| << ", chunk_start=" << idx | ||
| << ", chunk_cnt=" << chunk_cnt; | ||
| return make_error<size_t>(ErrorCode::FILE_READ_FAIL); | ||
| } | ||
| read_total += ret; | ||
| cur_offset += ret; | ||
| if (ret == 0) break; // EOF | ||
| } |
There was a problem hiding this comment.
Issue: Potential Data Corruption on Partial Read
In PosixFile::vector_read, if ::preadv performs a partial read (returns fewer bytes than the chunk's total size), the loop still advances idx by UIO_MAXIOV in the next iteration. This shifts the offsets for subsequent chunks, reading data into the wrong buffers and corrupting the read result.
We should break early if ret is less than the expected chunk size (which indicates EOF or a short read), stopping further reads.
for (int idx = 0; idx < iovcnt; idx += UIO_MAXIOV) {
int chunk_cnt = std::min(iovcnt - idx, UIO_MAXIOV);
size_t chunk_bytes = 0;
for (int i = 0; i < chunk_cnt; ++i) {
chunk_bytes += iov[idx + i].iov_len;
}
ssize_t ret = ::preadv(fd_, iov + idx, chunk_cnt, cur_offset);
if (ret < 0) {
int saved_errno = errno;
LOG(ERROR) << "preadv failed for file: " << filename_
<< ", errno=" << saved_errno
<< " (" << strerror(saved_errno) << ")"
<< ", fd=" << fd_
<< ", iovcnt=" << iovcnt
<< ", total_bytes=" << total_bytes
<< ", offset=" << offset
<< ", chunk_start=" << idx
<< ", chunk_cnt=" << chunk_cnt;
return make_error<size_t>(ErrorCode::FILE_READ_FAIL);
}
read_total += ret;
cur_offset += ret;
if (static_cast<size_t>(ret) < chunk_bytes) {
break; // EOF or partial read, stop reading further
}
}| std::vector<PromotionTaskItem> queued_tasks_to_release; | ||
| { | ||
| std::lock_guard<std::mutex> lock(promotion_queue_mutex_); | ||
| while (!promotion_task_queue_.empty()) { | ||
| queued_tasks_to_release.push_back( | ||
| std::move(promotion_task_queue_.front())); | ||
| promotion_task_queue_.pop_front(); | ||
| } | ||
| } | ||
| promotion_workers_running_.store(false); |
There was a problem hiding this comment.
Issue: Concurrency Race Condition during Shutdown
In FileStorage::~FileStorage(), the lock on promotion_queue_mutex_ is released at line 251, and then promotion_workers_running_ is set to false at line 252.
During this tiny window, another thread (such as the heartbeat thread or a concurrent test runner) could call EnqueuePromotionTask(), which acquires the lock, sees promotion_workers_running_ as true, and successfully pushes a task to promotion_task_queue_. Since the queue was already cleared, this newly enqueued task will be leaked, leaving its master slot permanently pinned.
To prevent this race condition, promotion_workers_running_ should be set to false inside the lock block before clearing the queue.
| std::vector<PromotionTaskItem> queued_tasks_to_release; | |
| { | |
| std::lock_guard<std::mutex> lock(promotion_queue_mutex_); | |
| while (!promotion_task_queue_.empty()) { | |
| queued_tasks_to_release.push_back( | |
| std::move(promotion_task_queue_.front())); | |
| promotion_task_queue_.pop_front(); | |
| } | |
| } | |
| promotion_workers_running_.store(false); | |
| std::vector<PromotionTaskItem> queued_tasks_to_release; | |
| { | |
| std::lock_guard<std::mutex> lock(promotion_queue_mutex_); | |
| promotion_workers_running_.store(false); | |
| while (!promotion_task_queue_.empty()) { | |
| queued_tasks_to_release.push_back( | |
| std::move(promotion_task_queue_.front())); | |
| promotion_task_queue_.pop_front(); | |
| } | |
| } |
| auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) { | ||
| for (const auto& task : tasks) { | ||
| const bool enqueued = EnqueuePromotionTask( | ||
| task, /*allow_over_capacity_for_pulled_task=*/ | ||
| config_.promotion_queue_capacity > 0); | ||
| CHECK(enqueued) | ||
| << "Promotion queue capacity changed after pull planning"; | ||
| ++enqueued_tasks; | ||
| } | ||
| }; |
There was a problem hiding this comment.
Issue: Potential Crash on Shutdown / Enqueue Failure
Using CHECK(enqueued) will abruptly terminate the program if EnqueuePromotionTask returns false. This can easily happen during shutdown when promotion_workers_running_ is set to false while the heartbeat thread is still processing or pulling tasks.
Instead of crashing the program, we should handle the failure gracefully by logging a warning and calling ReleasePromotionTask to ensure the master slot is released immediately.
auto enqueue_batch = [this, &enqueued_tasks](const auto& tasks) {
for (const auto& task : tasks) {
const bool enqueued = EnqueuePromotionTask(
task, /*allow_over_capacity_for_pulled_task=*/
config_.promotion_queue_capacity > 0);
if (!enqueued) {
LOG(WARNING) << "Failed to enqueue promotion task for key=" << task.key
<< "; releasing master slot";
ReleasePromotionTask(task.key, task.tenant_id);
} else {
++enqueued_tasks;
}
}
};…File Previously BucketStorageBackend::OpenFile only created UringFile for FileMode::Read, falling back to PosixFile for writes. This meant write paths (WriteBucket, WriteSlicesToFile via this backend) never took advantage of io_uring async submission and queue depth, even with MOONCAKE_OFFLOAD_USE_URING=1. Align with the base StorageBackend::create_file behavior: always create UringFile when use_uring is set, but keep O_DIRECT only for reads (write use_direct_io=false). O_DIRECT writes would require 4 KiB alignment padding which corrupts meta file parsing and wastes disk space, so writes stay buffered while still submitting through the io_uring ring.
Replace the pre-allocated 32MiB aligned bounce-buffer path with vector_write scatter-gather. Previously, buckets larger than 32MiB triggered a temporary posix_memalign allocation (logged as a WARNING per write) plus an extra memcpy aggregation step. Writes were already buffered (OpenFile uses O_DIRECT only for reads), so write_aligned gave no benefit; vector_write issues one SQE per iov directly. datasync is still called on UringFile after write for the same write-ordering durability guarantee.
Description
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-pg)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-common)mooncake-rl)Type of Change
How Has This Been Tested?
Test commands:
# Example: bash scripts/run_ci_test.shTest results:
Checklist
./scripts/code_format.shpre-commit run --all-filesand all hooks passAI Assistance Disclosure