[TENT] SelectionPolicy: per-policy SL/TC/qp_pool schema (RFC #2568 step 1)#2640
[TENT] SelectionPolicy: per-policy SL/TC/qp_pool schema (RFC #2568 step 1)#2640catyans wants to merge 1 commit into
Conversation
…che-ai#2568 step 1) Step 1 of the two-step plan agreed with @staryxchen on kvcache-ai#2568: extend the existing `SelectionPolicy` (the opaque `policy_name` hook, not a new enum) with link-layer QoS attributes, so fabric QoS can be configured per policy instead of only process-globally via MC_IB_SL / MC_IB_TC. This is schema + plumbing only. It does NOT yet apply SL/TC at QP setup — real per-class differentiation needs per-class QP pools, which is the explicit step-2 follow-up. `qp_pool` is parsed and reserved here so the JSON schema is forward-compatible and step 2 needs no schema change or breakage of existing configs (per @staryxchen's reminder). - SelectionPolicy: add optional `service_level` (0-15), `traffic_class` (0-255), `qp_pool` (string). nullopt = fall back to the global RdmaParams default = today's behavior. - loadPolicies(): parse the three from the same JSON as existing fields; out-of-range SL/TC are ignored with a warning so a bad config never changes selection. - SelectionResult: carry the matched policy's SL/TC/qp_pool out to the caller for the step-2 QP-setup follow-up. - Tests: parse + carry-through, and out-of-range-ignored. Fully backward compatible: unset fields behave exactly as before, and the default policies (which don't set them) are unaffected. Note: not built locally (no TENT toolchain on my dev machine); relying on CI. clang-format (v20.1.8) applied.
There was a problem hiding this comment.
Code Review
This pull request introduces link-layer QoS attributes (service_level, traffic_class, and qp_pool) to the transport selector's policy schema and selection results, allowing per-policy fabric QoS configurations. It also adds unit tests to verify the parsing and validation of these new fields. The feedback highlights a potential issue where parsing JSON values without explicit type checking could throw exceptions and crash the application if unexpected types are provided. It is recommended to validate the JSON types before retrieval to ensure robust error handling.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (policy_json.contains("service_level")) { | ||
| int sl = policy_json.value("service_level", -1); | ||
| if (sl >= 0 && sl <= 15) { | ||
| policy.service_level = sl; | ||
| } else { | ||
| LOG(WARNING) << "Ignore service_level in policy " << policy.name | ||
| << ", value " << sl << " out of range (0-15)"; | ||
| } | ||
| } | ||
| if (policy_json.contains("traffic_class")) { | ||
| int tc = policy_json.value("traffic_class", -1); | ||
| if (tc >= 0 && tc <= 255) { | ||
| policy.traffic_class = tc; | ||
| } else { | ||
| LOG(WARNING) << "Ignore traffic_class in policy " << policy.name | ||
| << ", value " << tc << " out of range (0-255)"; | ||
| } | ||
| } | ||
| // Reserved for step 2 (per-class QP pools); parsed for forward schema | ||
| // compatibility, no effect yet. | ||
| if (policy_json.contains("qp_pool")) { | ||
| auto& qp = policy_json["qp_pool"]; | ||
| if (qp.is_string()) policy.qp_pool = qp.get<std::string>(); | ||
| } |
There was a problem hiding this comment.
Using nlohmann::json::value() or direct retrieval on keys without type checking can throw a nlohmann::json::type_error exception if the value in the JSON configuration has an unexpected type (e.g., a string, boolean, or null). To ensure that a bad configuration never breaks selection or crashes the application, we should explicitly check the type using is_number_integer() before retrieving the value. Additionally, we should log a warning if qp_pool is present but is not a string, for consistency.
if (policy_json.contains("service_level")) {
auto& sl_val = policy_json["service_level"];
if (sl_val.is_number_integer()) {
int sl = sl_val.get<int>();
if (sl >= 0 && sl <= 15) {
policy.service_level = sl;
} else {
LOG(WARNING) << "Ignore service_level in policy " << policy.name
<< ", value " << sl << " out of range (0-15)";
}
} else {
LOG(WARNING) << "Ignore service_level in policy " << policy.name
<< ", value is not an integer";
}
}
if (policy_json.contains("traffic_class")) {
auto& tc_val = policy_json["traffic_class"];
if (tc_val.is_number_integer()) {
int tc = tc_val.get<int>();
if (tc >= 0 && tc <= 255) {
policy.traffic_class = tc;
} else {
LOG(WARNING) << "Ignore traffic_class in policy " << policy.name
<< ", value " << tc << " out of range (0-255)";
}
} else {
LOG(WARNING) << "Ignore traffic_class in policy " << policy.name
<< ", value is not an integer";
}
}
// Reserved for step 2 (per-class QP pools); parsed for forward schema
// compatibility, no effect yet.
if (policy_json.contains("qp_pool")) {
auto& qp = policy_json["qp_pool"];
if (qp.is_string()) {
policy.qp_pool = qp.get<std::string>();
} else {
LOG(WARNING) << "Ignore qp_pool in policy " << policy.name
<< ", value is not a string";
}
}|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| if (policy_json.contains("qp_pool")) { | ||
| auto& qp = policy_json["qp_pool"]; | ||
| if (qp.is_string()) policy.qp_pool = qp.get<std::string>(); | ||
| } |
There was a problem hiding this comment.
Empty-string qp_pool is treated as set, which is semantically different from nullopt (= use default pool).
| if (policy_json.contains("qp_pool")) { | ||
| auto& qp = policy_json["qp_pool"]; | ||
| if (qp.is_string()) policy.qp_pool = qp.get<std::string>(); | ||
| } |
There was a problem hiding this comment.
SL/TC out-of-range gets a LOG(WARNING), but "qp_pool": 42 (non-string) is silently no-op'd. For schema consistency, log a warning here too, you don't want a typo'd config to look like it took effect when it didn't.
| // Transport preference list (evaluated in order) | ||
| std::vector<TransportType> transports; | ||
|
|
||
| // --- Link-layer QoS attributes (RFC #2519 / #2568, step-1 schema only) --- |
There was a problem hiding this comment.
I think add the comments about the RFC reference and splited step is not necessary
Description
Step 1 of the two-step plan agreed with @staryxchen on #2568.
Instead of adding a
TrafficClassenum toSelectionContext(the originalRFC shape, which @staryxchen correctly flagged as layering inversion + a
closed-enum release bottleneck), this uses the existing opaque
policy_namehook and extendsSelectionPolicywith link-layer QoSattributes. The engine never interprets application semantics; callers bind
to a named policy and the JSON config carries the fabric QoS.
This lets SL/TC be configured per policy instead of only process-global
via
MC_IB_SL/MC_IB_TC(#2525 / #2526).Scope: schema + plumbing only
This PR does not apply SL/TC at QP setup. As @staryxchen noted, real
per-class differentiation requires per-class QP pools (SL/TC are baked into
QP attributes at handshake), which is the explicit step-2 follow-up.
qp_poolis parsed and reserved here so the JSON schema isforward-compatible and step 2 needs no schema change or breakage of existing
configs (per @staryxchen's reminder).
Changes
SelectionPolicy: add optionalservice_level(0-15),traffic_class(0-255),
qp_pool(string).nullopt= fall back to the globalRdmaParamsdefault = today's behavior.loadPolicies(): parse the three from the same JSON as the existingfields; out-of-range SL/TC are ignored with a warning so a bad config
never changes selection.
SelectionResult: carry the matched policy's SL/TC/qp_pool to the callerfor the step-2 QP-setup follow-up.
Fully backward compatible: unset fields behave exactly as before; the
default policies (which don't set them) are unaffected.
Module
mooncake-transfer-engine) — TENT runtimeType of Change
How Has This Been Tested?
Added unit tests in
transport_selector_test.cpp:a policy with
service_level/traffic_class/qp_poolis parsed and thevalues are carried through
SelectionResult;out-of-range SL/TC are ignored (left
nullopt).Unit tests pass (added)
Integration tests pass
Manual testing done
Checklist
pre-commit run --all-filesAI Assistance Disclosure