Skip to content

Flagcx lat impr#2622

Open
mikethegoblin wants to merge 2 commits into
kvcache-ai:feat/flagcx-supportfrom
mikethegoblin:flagcx-lat-impr
Open

Flagcx lat impr#2622
mikethegoblin wants to merge 2 commits into
kvcache-ai:feat/flagcx-supportfrom
mikethegoblin:flagcx-lat-impr

Conversation

@mikethegoblin

@mikethegoblin mikethegoblin commented Jun 25, 2026

Copy link
Copy Markdown

Description

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Mooncake PG (mooncake-pg)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • Common (mooncake-common)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement
  • Other

How Has This Been Tested?

Ran tebench as well as vLLM benchmark in 1P1D scenario
tebench result:
image

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used (specify below)

used codex

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the FlagCxTransport to remove the background I/O worker thread, transitioning to direct non-blocking submission of slices and a polling-based completion mechanism via getTransferStatus. The review feedback highlights two key areas for improvement: first, avoiding holding the flagcx_mu_ mutex while calling connForSegment to prevent potential performance bottlenecks from blocking metadata queries; second, optimizing submitSlices to bypass dynamic memory allocation of std::unordered_map when all slices in a batch are homogeneous.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +107 to +111
FlagcxP2pConn *conn = nullptr;
{
std::lock_guard<std::mutex> lk(flagcx_mu_);
conn = connForSegment(group.front()->target_id);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Holding flagcx_mu_ while calling connForSegment can lead to performance bottlenecks. connForSegment calls metadata_->getSegmentDescByID, which may perform blocking network I/O (e.g., querying etcd or performing handshakes) if the metadata cache is disabled or needs updating. Holding the mutex during this time blocks all other threads trying to submit transfers or poll status.

Consider refactoring connForSegment to only lock flagcx_mu_ around the flagcxP2pEngineGetConn call, and call connForSegment without holding the lock here.

Comment on lines +178 to +188
void FlagCxTransport::submitSlices(const std::vector<Slice *> &slices) {
std::lock_guard<std::mutex> lk(submit_mu_);
std::unordered_map<uint64_t, std::vector<Slice *>> groups;
for (Slice *s : slices) {
const uint64_t key =
(static_cast<uint64_t>(s->target_id) << 1) |
(s->opcode == TransferRequest::WRITE ? 0u : 1u);
groups[key].push_back(s);
}
for (auto &kv : groups) runSliceGroup(kv.second);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In high-performance networking, avoiding dynamic memory allocations on the data path is crucial. Currently, submitSlices allocates a std::unordered_map on every call. For the very common case where all slices in the batch are homogeneous (same target_id and opcode) or when there is only a single slice, we can completely bypass the map allocation and call runSliceGroup directly.

void FlagCxTransport::submitSlices(const std::vector<Slice *> &slices) {
    if (slices.empty()) return;
    std::lock_guard<std::mutex> lk(submit_mu_);
    bool all_same = true;
    for (size_t i = 1; i < slices.size(); ++i) {
        if (slices[i]->target_id != slices[0]->target_id ||
            slices[i]->opcode != slices[0]->opcode) {
            all_same = false;
            break;
        }
    }
    if (all_same) {
        runSliceGroup(slices);
        return;
    }

    std::unordered_map<uint64_t, std::vector<Slice *>> groups;
    for (Slice *s : slices) {
        const uint64_t key =
            (static_cast<uint64_t>(s->target_id) << 1) |
            (s->opcode == TransferRequest::WRITE ? 0u : 1u);
        groups[key].push_back(s);
    }
    for (auto &kv : groups) runSliceGroup(kv.second);
}

@alogfans

Copy link
Copy Markdown
Collaborator

I think it's not a good idea that pollPendingTransfers() called during getTransferStatus(). This forces users to call getTransferStatus() to reach progress, that is, user must activity polling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants