Skip to content

[TENT] SelectionPolicy: per-policy SL/TC/qp_pool schema (RFC #2568 step 1)#2640

Open
catyans wants to merge 1 commit into
kvcache-ai:mainfrom
catyans:feat/selection-policy-sl-tc-schema-v2
Open

[TENT] SelectionPolicy: per-policy SL/TC/qp_pool schema (RFC #2568 step 1)#2640
catyans wants to merge 1 commit into
kvcache-ai:mainfrom
catyans:feat/selection-policy-sl-tc-schema-v2

Conversation

@catyans

@catyans catyans commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Description

Step 1 of the two-step plan agreed with @staryxchen on #2568.

Instead of adding a TrafficClass enum to SelectionContext (the original
RFC shape, which @staryxchen correctly flagged as layering inversion + a
closed-enum release bottleneck), this uses the existing opaque
policy_name hook
and extends SelectionPolicy with link-layer QoS
attributes. The engine never interprets application semantics; callers bind
to a named policy and the JSON config carries the fabric QoS.

This lets SL/TC be configured per policy instead of only process-global
via MC_IB_SL / MC_IB_TC (#2525 / #2526).

Scope: schema + plumbing only

This PR does not apply SL/TC at QP setup. As @staryxchen noted, real
per-class differentiation requires per-class QP pools (SL/TC are baked into
QP attributes at handshake), which is the explicit step-2 follow-up.
qp_pool is parsed and reserved here so the JSON schema is
forward-compatible and step 2 needs no schema change or breakage of existing
configs (per @staryxchen's reminder).

Changes

  • SelectionPolicy: add optional service_level (0-15), traffic_class
    (0-255), qp_pool (string). nullopt = fall back to the global
    RdmaParams default = today's behavior.
  • loadPolicies(): parse the three from the same JSON as the existing
    fields; out-of-range SL/TC are ignored with a warning so a bad config
    never changes selection.
  • SelectionResult: carry the matched policy's SL/TC/qp_pool to the caller
    for the step-2 QP-setup follow-up.
  • Tests: parse + carry-through, and out-of-range-ignored.

Fully backward compatible: unset fields behave exactly as before; the
default policies (which don't set them) are unaffected.

Module

  • Transfer Engine (mooncake-transfer-engine) — TENT runtime

Type of Change

  • New feature (config schema)

How Has This Been Tested?

Added unit tests in transport_selector_test.cpp:

  • a policy with service_level/traffic_class/qp_pool is parsed and the
    values are carried through SelectionResult;

  • out-of-range SL/TC are ignored (left nullopt).

  • Unit tests pass (added)

  • Integration tests pass

  • Manual testing done

Note: not built locally (no TENT toolchain on my dev machine); relying on
CI. clang-format (v20.1.8) applied.

Checklist

AI Assistance Disclosure

  • AI tools were used (Claude Code) for code navigation and drafting.

…che-ai#2568 step 1)

Step 1 of the two-step plan agreed with @staryxchen on kvcache-ai#2568: extend the
existing `SelectionPolicy` (the opaque `policy_name` hook, not a new enum)
with link-layer QoS attributes, so fabric QoS can be configured per policy
instead of only process-globally via MC_IB_SL / MC_IB_TC.

This is schema + plumbing only. It does NOT yet apply SL/TC at QP setup —
real per-class differentiation needs per-class QP pools, which is the
explicit step-2 follow-up. `qp_pool` is parsed and reserved here so the
JSON schema is forward-compatible and step 2 needs no schema change or
breakage of existing configs (per @staryxchen's reminder).

- SelectionPolicy: add optional `service_level` (0-15), `traffic_class`
  (0-255), `qp_pool` (string). nullopt = fall back to the global RdmaParams
  default = today's behavior.
- loadPolicies(): parse the three from the same JSON as existing fields;
  out-of-range SL/TC are ignored with a warning so a bad config never
  changes selection.
- SelectionResult: carry the matched policy's SL/TC/qp_pool out to the
  caller for the step-2 QP-setup follow-up.
- Tests: parse + carry-through, and out-of-range-ignored.

Fully backward compatible: unset fields behave exactly as before, and the
default policies (which don't set them) are unaffected.

Note: not built locally (no TENT toolchain on my dev machine); relying on
CI. clang-format (v20.1.8) applied.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces link-layer QoS attributes (service_level, traffic_class, and qp_pool) to the transport selector's policy schema and selection results, allowing per-policy fabric QoS configurations. It also adds unit tests to verify the parsing and validation of these new fields. The feedback highlights a potential issue where parsing JSON values without explicit type checking could throw exceptions and crash the application if unexpected types are provided. It is recommended to validate the JSON types before retrieval to ensure robust error handling.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +212 to +235
if (policy_json.contains("service_level")) {
int sl = policy_json.value("service_level", -1);
if (sl >= 0 && sl <= 15) {
policy.service_level = sl;
} else {
LOG(WARNING) << "Ignore service_level in policy " << policy.name
<< ", value " << sl << " out of range (0-15)";
}
}
if (policy_json.contains("traffic_class")) {
int tc = policy_json.value("traffic_class", -1);
if (tc >= 0 && tc <= 255) {
policy.traffic_class = tc;
} else {
LOG(WARNING) << "Ignore traffic_class in policy " << policy.name
<< ", value " << tc << " out of range (0-255)";
}
}
// Reserved for step 2 (per-class QP pools); parsed for forward schema
// compatibility, no effect yet.
if (policy_json.contains("qp_pool")) {
auto& qp = policy_json["qp_pool"];
if (qp.is_string()) policy.qp_pool = qp.get<std::string>();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using nlohmann::json::value() or direct retrieval on keys without type checking can throw a nlohmann::json::type_error exception if the value in the JSON configuration has an unexpected type (e.g., a string, boolean, or null). To ensure that a bad configuration never breaks selection or crashes the application, we should explicitly check the type using is_number_integer() before retrieving the value. Additionally, we should log a warning if qp_pool is present but is not a string, for consistency.

        if (policy_json.contains("service_level")) {
            auto& sl_val = policy_json["service_level"];
            if (sl_val.is_number_integer()) {
                int sl = sl_val.get<int>();
                if (sl >= 0 && sl <= 15) {
                    policy.service_level = sl;
                } else {
                    LOG(WARNING) << "Ignore service_level in policy " << policy.name
                                 << ", value " << sl << " out of range (0-15)";
                }
            } else {
                LOG(WARNING) << "Ignore service_level in policy " << policy.name
                             << ", value is not an integer";
            }
        }
        if (policy_json.contains("traffic_class")) {
            auto& tc_val = policy_json["traffic_class"];
            if (tc_val.is_number_integer()) {
                int tc = tc_val.get<int>();
                if (tc >= 0 && tc <= 255) {
                    policy.traffic_class = tc;
                } else {
                    LOG(WARNING) << "Ignore traffic_class in policy " << policy.name
                                 << ", value " << tc << " out of range (0-255)";
                }
            } else {
                LOG(WARNING) << "Ignore traffic_class in policy " << policy.name
                             << ", value is not an integer";
            }
        }
        // Reserved for step 2 (per-class QP pools); parsed for forward schema
        // compatibility, no effect yet.
        if (policy_json.contains("qp_pool")) {
            auto& qp = policy_json["qp_pool"];
            if (qp.is_string()) {
                policy.qp_pool = qp.get<std::string>();
            } else {
                LOG(WARNING) << "Ignore qp_pool in policy " << policy.name
                             << ", value is not a string";
            }
        }

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Comment on lines +232 to +235
if (policy_json.contains("qp_pool")) {
auto& qp = policy_json["qp_pool"];
if (qp.is_string()) policy.qp_pool = qp.get<std::string>();
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty-string qp_pool is treated as set, which is semantically different from nullopt (= use default pool).

Comment on lines +232 to +235
if (policy_json.contains("qp_pool")) {
auto& qp = policy_json["qp_pool"];
if (qp.is_string()) policy.qp_pool = qp.get<std::string>();
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SL/TC out-of-range gets a LOG(WARNING), but "qp_pool": 42 (non-string) is silently no-op'd. For schema consistency, log a warning here too, you don't want a typo'd config to look like it took effect when it didn't.

// Transport preference list (evaluated in order)
std::vector<TransportType> transports;

// --- Link-layer QoS attributes (RFC #2519 / #2568, step-1 schema only) ---

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think add the comments about the RFC reference and splited step is not necessary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants