From 42fb9d9d8346e2d9a6ff1e43e6a9205a89eda738 Mon Sep 17 00:00:00 2001 From: Ashvin Deodhar Date: Mon, 30 Mar 2026 23:04:48 -0700 Subject: [PATCH 1/6] Add KEP-5981: DRA Sharing Affinity for Conditional Fungibility --- .../5981-dra-sharing-affinity/README.md | 1244 +++++++++++++++++ .../5981-dra-sharing-affinity/kep.yaml | 46 + 2 files changed, 1290 insertions(+) create mode 100644 keps/sig-scheduling/5981-dra-sharing-affinity/README.md create mode 100644 keps/sig-scheduling/5981-dra-sharing-affinity/kep.yaml diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md new file mode 100644 index 000000000000..a116ae3204bc --- /dev/null +++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md @@ -0,0 +1,1244 @@ +# KEP-5981: DRA Sharing Affinity for Conditional Fungibility + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1: RDMA Partition Key Alignment](#story-1-rdma-partition-key-alignment) + - [Story 2: FPGA Bitstream Sharing](#story-2-fpga-bitstream-sharing) + - [Story 3: Single-subnet NIC Sharing](#story-3-single-subnet-nic-sharing) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API Enhancement](#api-enhancement) + - [ResourceSlice Device Spec](#resourceslice-device-spec) + - [Scheduler Enhancement](#scheduler-enhancement) + - [Examples](#examples) + - [ResourceSlice with Sharing Affinity](#resourceslice-with-sharing-affinity) + - [ResourceClaim with Affinity Value](#resourceclaim-with-affinity-value) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Claim-side SharingAffinity (on DeviceRequest)](#claim-side-sharingaffinity-on-devicerequest) + - [Placeholder Pattern Workaround](#placeholder-pattern-workaround) + - [CEL-based Affinity Matching](#cel-based-affinity-matching) +- [Future Enhancements](#future-enhancements) +- [Infrastructure Needed](#infrastructure-needed) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This KEP proposes an extension to Dynamic Resource Allocation (DRA) that allows +the `kube-scheduler` to handle resources that are **conditionally fungible**. + +[KEP-5075 (Consumable Capacity)](https://github.com/kubernetes/enhancements/issues/5075) +introduced the ability to track numerical capacity (e.g., 16 slots of a NIC) +and share devices across multiple claims via `allowMultipleAllocations`. +However, it assumes all claims are fungible—any claim can share the device with +any other claim. + +Real-world hardware is often **modal**: once partially allocated, the device +requires all subsequent consumers to share a specific configuration. For +example: + +- **Multi-pod NIC sharing**: A network DRA driver shares a NIC across 16 pods, + but all pods must belong to the same subnet. Once the first pod configures the + NIC for **Subnet A**, the remaining 15 slots are restricted to **Subnet A**. +- **FPGA bitstream sharing**: An FPGA can serve multiple inference pods, but all + must use the same bitstream. Once **bitstream-ml-v2** is loaded, other pods + needing **bitstream-crypto-v1** must use a different FPGA. + +This KEP introduces a `SharingAffinity` field in the ResourceSlice `Device` +spec that allows drivers to declare which device attribute keys constrain +sharing compatibility. The scheduler's `AllocatedState` is enhanced to track +both consumed capacity and the affinity values that lock a device to a +particular sharing group, enabling it to **gate** remaining capacity and +**pack** compatible workloads onto already-locked devices. + +## Motivation + +As AI and HPC workloads move toward higher density, hardware partitioning +(SR-IOV, GPU slicing, FPGA multi-tenancy) is becoming standard. 
These +physical devices often have a "modal" constraint: once partially allocated, +the device requires all subsequent consumers to share a specific configuration +(see [Summary](#summary) for concrete examples). + +Currently, the scheduler is unaware of this "lock." It may schedule a Pod +requiring a different configuration to the same device because it sees +"available capacity." +In short: **In these scenarios, Quantitative Sharing (how many slots?) fails +without Qualitative Gating (what mode are those slots in?).** This leads to: + +1. **Allocation failures at the node level**: The driver rejects incompatible + binds at prepare time, after the scheduler has already committed +2. **High scheduling latency**: The scheduler retries the same failing + combination, thrashing between candidates +3. **Resource starvation**: Without affinity awareness, same-subnet pods + spread across multiple devices instead of consolidating—wasting capacity +4. **Complex driver workarounds**: Drivers resort to placeholder patterns + with race conditions and ResourceSlice churn + +The scheduler's `AllocatedState` currently tracks consumed capacity but not the +affinity values that determine sharing compatibility. This KEP closes that gap. + +### Goals + +- Enable the scheduler to **gate** remaining capacity on a device based on a + required sharing attribute +- Provide a mechanism for drivers to signal compatibility requirements for + shared hardware via `SharingAffinity` in ResourceSlice +- Minimize **fragmentation** of cluster resources by enabling the scheduler to + **pack** workloads with identical sharing requirements onto already-locked devices +- Track affinity values in `AllocatedState` so subsequent scheduling decisions + respect the first claim's lock-in +- Maintain backward compatibility with devices that have no sharing affinity + constraints + +### Non-Goals + +- Defining hardware-specific attribute names (these remain driver-defined) +- Managing the physical lifecycle of the device configuration (this remains + the driver's responsibility) +- Changing how capacity is tracked (that's KEP-5075) +- Supporting affinity across different device types or pools + +## Proposal + +Add a `sharingAffinity` field to `Device` in ResourceSlice that specifies which device attribute keys constrain sharing: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + devices: + - name: eth1 + allowMultipleAllocations: true + sharingAffinity: + attributeKeys: ["networking.example.com/subnet"] + capacity: + networking.example.com/slots: + value: "16" +``` + +When the scheduler allocates a multi-allocatable device with `sharingAffinity`: + +1. **First claim**: The scheduler reads the claim's affinity values for the specified attribute key(s) and records them in `AllocatedState` alongside consumed capacity +2. **Subsequent claims**: The scheduler checks if the new claim's affinity values match those recorded in `AllocatedState` +3. **Mismatch**: If values don't match, the device is skipped (try another device) +4. **Match**: If values match and capacity is available, allocation proceeds + +<<[UNRESOLVED @pohly @johnbelamaric @sunya-ch @ritazh]>> +**Open Design Questions** + +**1. Placement of SharingAffinity: ResourceSlice (driver-side) vs DeviceRequest (claim-side)** + +This KEP places `SharingAffinity` on the ResourceSlice `Device` (driver- +defined). An alternative design places it on the `DeviceRequest` in the +ResourceClaim (user-defined). 
See [Alternatives: Claim-side SharingAffinity](#claim-side-sharingaffinity-on-devicerequest) for the trade-off analysis.

We chose driver-side because the hardware modal constraint is a property of the
device, not the workload. The driver knows that "once a NIC is configured for
subnet A, it can only serve subnet A"—this is a device-level constraint that
should be declared once on the device.

**2. How claims communicate affinity values to the scheduler**

The driver declares `sharingAffinity.attributeKeys` on the device, telling the
scheduler which attribute keys constrain sharing. But the scheduler also needs
to know what affinity values a given claim *requests*.

The claim's opaque config (`DeviceConfiguration.Opaque.Parameters`) is a
`runtime.RawExtension`—raw bytes the scheduler cannot parse by design. The
scheduler intentionally does not understand opaque driver parameters. Therefore,
affinity values cannot be extracted from the opaque config.

Options under consideration:

**Option A: New structured field in `DeviceClaimConfiguration`**

Add an `affinityValues` map alongside the opaque config:

```go
type DeviceClaimConfiguration struct {
    Requests []string
    DeviceConfiguration `json:",inline"`

    // AffinityValues provides structured affinity values for sharing
    // constraints. Keys must correspond to attributeKeys declared in
    // the device's SharingAffinity.
    // The maximum number of entries is 8.
    // +optional
    // +featureGate=DRASharingAffinity
    AffinityValues map[QualifiedName]DeviceAttribute
}
```

Pros: Clean separation; the scheduler can read values directly; follows the
existing pattern of placing claim-level configuration in the config section.
Cons: Duplicates information already in opaque config; users must specify
affinity-relevant values in two places (opaque for the driver, structured for
the scheduler).

> **UX mitigation — Staged Automation**:
>
> The duplication is addressed through a staged approach that establishes the
> API contract early and hardens it for production over time:
>
> **Alpha — API field + external webhook polyfill**: The `SharingAffinityMapping`
> field is defined in the `DeviceClass` spec during Alpha. This establishes the
> declarative mapping contract between opaque config keys and `affinityValues`
> attribute keys. For example:
> ```yaml
> apiVersion: resource.k8s.io/v1
> kind: DeviceClass
> metadata:
>   name: shared-nic
> spec:
>   sharingAffinityMapping:
>     - opaqueKey: "subnet"
>       attributeKey: "networking.example.com/subnet"
> ```
> During Alpha, driver vendors ship a **mutating admission webhook** that reads
> the `SharingAffinityMapping` from the referenced DeviceClass, extracts the
> corresponding values from the claim's opaque config, and auto-populates
> `affinityValues`. This keeps the initial PR footprint small and avoids complex
> changes to kube-apiserver before the core scheduling logic is proven.
>
> **Beta — Built-in admission controller**: The mapping logic moves into a
> **built-in admission controller** in the API server. This eliminates the
> dependency on an external webhook, which can fail or become a bottleneck during
> high-volume pod creation. The controller reads the DeviceClass mapping,
> extracts values from the opaque config's raw JSON (the driver opted into this
> by defining the mapping), and auto-populates `affinityValues` at the API
> boundary.
By Beta, the mapping is transparent to the user — they provide +> opaque config and the API server ensures `affinityValues` is consistent. + +**Option B: New structured field on `ExactDeviceRequest`** + +Add `affinityValues` per-request instead of per-config: + +```go +type ExactDeviceRequest struct { + // ... existing fields ... + + // AffinityValues specifies the affinity attribute values this request + // requires for sharing-constrained devices. + // The maximum number of entries is 8. + // +optional + // +featureGate=DRASharingAffinity + AffinityValues map[QualifiedName]DeviceAttribute +} +``` + +Pros: Values are per-request (more precise); no duplication with config section. +Cons: Mixes resource selection with configuration; config is traditionally +separate from request in DRA's layered model. + +**Option C: Driver pre-populates device attributes dynamically** + +The driver publishes devices with attribute values already set based on current +state (e.g., `networking.example.com/subnet: "subnet-X"` after first +allocation). Claims use CEL selectors to match. + +Pros: No new API fields needed; uses existing CEL selector mechanism. +Cons: Race condition between driver updating ResourceSlice and scheduler reading +it—this is essentially the placeholder pattern this KEP aims to eliminate. + +We are leaning toward **Option A** because it cleanly separates structured +affinity values from opaque driver config while keeping the scheduler's read +path simple. The examples in this KEP use Option A. Feedback from reviewers is +requested. + +**3. SharingStrategy: Should claims control lock-setting behavior?** + +When a claim provides `affinityValues`, should it also declare whether it is +allowed to *set* a lock on a clean device, or whether it can only *join* an +existing lock? Two initial strategies are proposed: + +- `CanSetLock` (default): The claim can land on a clean device and establish + the lock. Standard behavior for primary workloads. +- `NeverSetLock`: The claim can only be allocated to a device that already has + a matching lock established by another claim. Useful for background or batch + jobs that should never "poison" a clean device. + +If included, this field would live alongside `affinityValues` on the claim side +(Option A: in `DeviceClaimConfiguration`, Option B: in `ExactDeviceRequest`). + +**Filter ordering**: The scheduler checks `NeverSetLock` **before** evaluating +capacity or key matching. If the device is unlocked and the strategy is +`NeverSetLock`, the device is rejected immediately — there is no need to +compute capacity for a device the claim is fundamentally ineligible for. This +also produces a clear rejection reason (`SharingAffinityNeverSetLockOnCleanDevice`) +without noise from capacity evaluation. The full Filter order becomes: + +1. If `Strategy == NeverSetLock` AND device is unlocked → reject +2. Sufficient consumable capacity (KEP-5075) +3. All required `attributeKeys` present in claim's `affinityValues` +4. Values match existing lock + +This could be deferred to beta if the alpha scope is too large. +<<[/UNRESOLVED]>> + +### User Stories + +#### Story 1: RDMA Partition Key Alignment + +A user runs a distributed training job where every Pod must share the same +RDMA Partition Key (PKey) to communicate. The NIC supports 16 VFs. The driver +sets `sharingAffinity.attributeKeys: ["networking.example.com/pkey"]`. The scheduler finds a node where +a NIC has enough capacity and is either "unlocked" or already locked to that +specific PKey. 
+ +- Pod A (pkey-0x8001) gets allocated to mlx5_0 → mlx5_0 is now locked to pkey-0x8001 +- Pod B (pkey-0x8001) arrives → matches affinity, shares mlx5_0 +- Pod C (pkey-0x8002) arrives → affinity mismatch, gets mlx5_1 instead + +#### Story 2: FPGA Bitstream Sharing + +An inference service uses FPGAs to accelerate a specific model. Loading a +bitstream takes several seconds. The driver sets +`sharingAffinity.attributeKeys: ["fpga.example.com/bitstream"]`. The scheduler ensures new Pods +for this model are scheduled onto FPGAs that already have the bitstream loaded, +even if other "fresh" FPGAs are available. + +- Pod A (bitstream-ml-v2) gets the FPGA → locks to bitstream-ml-v2 +- Pod B (bitstream-ml-v2) shares the same FPGA +- Pod C (bitstream-crypto-v1) must wait or use a different FPGA + +#### Story 3: Single-subnet NIC Sharing + +A network DRA driver advertises NICs that can be shared across up to 16 pods, +but only if pods belong to the same subnet. The driver sets +`sharingAffinity.attributeKeys: ["networking.example.com/subnet"]`. + +- Pod A (subnet-X) gets allocated to eth1 → eth1 is now locked to subnet-X +- Pod B (subnet-X) arrives → matches affinity, shares eth1 +- Pod C (subnet-Y) arrives → affinity mismatch, gets eth2 instead + +### Notes/Constraints/Caveats + +- **Affinity is set by the first claim**: Once a device is allocated with an affinity value, that value is locked until all claims release the device +- **Attribute keys must be declared**: The device's `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; claims must provide values for all of these keys in `affinityValues` or the device is filtered out +- **Multiple keys**: If multiple attribute keys are specified, ALL must match (both presence and value) +- **Extra keys in claim**: If a claim's `affinityValues` contains keys beyond + what the device declares in `attributeKeys`, the extra keys are **ignored** + for that device. Only the device's declared keys are evaluated. This allows + "generic" claims to work across devices with different sharing requirements + (e.g., a claim with both `subnet` and `vlan` can match a device that only + constrains on `subnet`) +- **Missing keys in claim**: If the claim does not provide a value for a key + the device declares in `attributeKeys`, the device is **filtered out** (see + Filter phase) +- **Multi-request claims (per-request scoping)**: If a claim requests multiple + devices (e.g., `mgmt-nic` and `data-nic`), each `DeviceClaimConfiguration` + block targets specific requests via its `requests` slice. Different config + blocks can specify different `affinityValues` for different requests. This + means `mgmt-nic` can be locked to Subnet-A while `data-nic` is locked to + Subnet-B within the same claim — there is no cross-talk between requests. +- **Empty affinity**: Devices without `sharingAffinity` behave as before (any claim can share) +- **Grandfathered claims**: Pre-existing claims without `affinityValues` (created + before the feature was enabled) do not participate in affinity matching but do + not block new claims from establishing a lock. See lock precedence table below. 
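
Taken together, the matching rules above reduce to a small predicate. The
following Go sketch illustrates it under stated assumptions: the names
(`canShare`, `lock`) are hypothetical, not the actual scheduler plugin code,
and grandfathered claims (no `affinityValues`) never enter this check because
they are excluded when locks are reconstructed:

```go
// Illustrative sketch of the sharing-affinity matching rules described above.
package main

import "fmt"

// canShare reports whether a claim may share a device, given the device's
// declared sharingAffinity.attributeKeys, the claim's affinityValues, and
// the lock currently recorded in AllocatedState (empty map = unlocked).
func canShare(attributeKeys []string, claimValues, lock map[string]string) (bool, string) {
	for _, key := range attributeKeys {
		want, ok := claimValues[key]
		if !ok {
			// Missing required key: the device is filtered out.
			return false, fmt.Sprintf("claim missing required affinity key %q", key)
		}
		if locked, isLocked := lock[key]; isLocked && locked != want {
			// Device is already locked to an incompatible value.
			return false, fmt.Sprintf("device locked to %s=%q", key, locked)
		}
	}
	// Keys in claimValues beyond attributeKeys are intentionally ignored.
	return true, ""
}

func main() {
	keys := []string{"networking.example.com/subnet"}
	lock := map[string]string{"networking.example.com/subnet": "subnet-X"}

	// Matching value plus an extra key: allowed, extra key ignored.
	ok, _ := canShare(keys, map[string]string{
		"networking.example.com/subnet": "subnet-X",
		"networking.example.com/vlan":   "100",
	}, lock)
	fmt.Println(ok) // true

	// Conflicting value: device is skipped.
	ok, reason := canShare(keys, map[string]string{
		"networking.example.com/subnet": "subnet-Y",
	}, lock)
	fmt.Println(ok, reason) // false device locked to ...subnet="subnet-X"
}
```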
#### Lock Precedence with Grandfathered Claims

| Device State | New Claim | Result |
|---|---|---|
| 5 grandfathered claims, no lock set | Claim with `subnet: A` | Lock set to `subnet: A`; device now locked |
| 5 grandfathered + locked to `subnet: A` | Claim with `subnet: A` | Allowed (values match) |
| 5 grandfathered + locked to `subnet: A` | Claim with `subnet: B` | **Rejected** (mismatch with lock) |
| 5 grandfathered + locked to `subnet: A` | Claim without `affinityValues` | **Rejected** (missing required key) |
| Only grandfathered claims remain, new claims released | — | Lock cleared; device returns to unlocked |
| All claims released (grandfathered + new) | — | Device fully clean |

Grandfathered claims are "transparent" to the lock — they neither set it nor
conflict with it. The lock is defined entirely by claims that provide
`AffinityValues`.

### Risks and Mitigations

#### Fragmentation (Poisoning)

**Risk**: One Pod with a unique affinity value could "lock" a high-capacity
device, preventing other, more common workloads from using the remaining 90%
of its capacity.

**Mitigation**: A scoring plugin will prioritize packing compatible workloads
onto already-locked devices before consuming "clean" (unlocked) devices. This
minimizes the number of devices locked to a single affinity group.

#### Priority Inversion (Preemption Blindness)

**Risk**: Standard Kubernetes preemption is blind to affinity locks. It triggers
on *resource shortage*, not affinity mismatch. If a NIC has 15/16 slots
available but is locked to the wrong subnet, the scheduler sees plenty of
capacity and never enters the preemption path. A single low-priority Pod can
permanently hold a high-capacity device hostage by setting a lock that no
high-priority Pod can break.

Even if preemption were triggered by an unrelated shortage, victim selection
asks "which Pods free up slots?" — not "which Pods clear the lock?" The
scheduler might preempt an unrelated Pod, freeing a slot on a device still
locked to the wrong value.

**Mitigation (Alpha)**: The scoring plugin reduces the probability of this
inversion by packing compatible workloads and preserving clean devices.
However, this is a soft mitigation — it does not guarantee that a clean device
will always be available.

**Mitigation (Beta)**: Lock-aware preemption (see [Beta graduation criteria](#beta))
will teach the scheduler's PostFilter phase to detect affinity mismatch as a
preemption-solvable problem and identify lock-holder Pods as preemption victims.

#### Stale Affinity View

**Risk**: When a claim is released externally (pod completes, user deletes pod,
kubelet eviction), the scheduler learns about it asynchronously via its
informer watch. There is a brief propagation delay (typically milliseconds, but
potentially seconds under load) between the API server state changing and the
informer callback updating the scheduler's cache. During this window, the
scheduler may still see a device as "locked" when it is actually clean, causing
it to unnecessarily skip the device for one scheduling cycle.

**Mitigation**: For scheduler-driven releases (Unreserve on binding failure or
preemption), the cache is updated immediately with no staleness. For external
releases, the informer eventually reconciles the state. This is the same
propagation delay that affects all informer-based caches in Kubernetes and is
not unique to this feature. The worst case is a briefly suboptimal scheduling
decision, not a correctness bug.
+ +#### Unexpected Affinity Values + +**Risk**: A claim specifies an unexpected or unique affinity value (e.g., an +arbitrary subnet GUID or name), further fragmenting devices by locking them to +rare values. + +**Mitigation**: In many cases, affinity values are externally defined (subnet +names, partition keys) and cannot be validated by the driver. The primary +mitigation is the **scoring plugin**: by packing compatible workloads onto +already-locked devices before consuming clean ones, the scheduler naturally +limits fragmentation even when affinity values are unpredictable. Additionally, +cluster administrators can use `DeviceClass` CEL selectors to restrict which +attribute values are accepted where domain-specific validation is feasible. + +#### Memory Overhead + +**Risk**: Affinity values accumulate in `AllocatedState`, increasing memory usage. + +**Mitigation**: Affinity values are small strings (max 64 characters per +`DeviceAttribute` value), capped at 8 attribute keys per device. Per-device +overhead is bounded at 8 key-value pairs in `AllocatedState.AffinityValues`, +and entries are cleared when all claims release the device. The total overhead +is proportional to active shared allocations, not total devices. + +## Design Details + +### API Enhancement + +#### ResourceSlice Device Spec + +```go +type Device struct { + // ... existing fields (Name, Attributes, Capacity, + // AllowMultipleAllocations, Taints, etc.) ... + + // SharingAffinity specifies constraints for sharing this device across + // multiple allocations. If set, only claims with matching affinity values + // for the specified attribute keys can share this device. + // + // This field is only meaningful when AllowMultipleAllocations is true. + // + // +optional + // +featureGate=DRASharingAffinity + SharingAffinity *DeviceSharingAffinity +} + +// DeviceSharingAffinity defines which device attribute keys constrain +// sharing across multiple claims. +type DeviceSharingAffinity struct { + // AttributeKeys lists the fully-qualified device attribute names that + // must have matching values across all claims sharing this device. + // + // When the first claim is allocated to this device, the affinity values + // for these keys are recorded in AllocatedState. Subsequent claims can + // only share the device if their affinity values match exactly. + // + // The maximum number of attribute keys is 8. + // + // +required + // +listType=atomic + // +k8s:maxItems=8 + AttributeKeys []FullyQualifiedName +} + +const SharingAffinityAttributeKeysMaxSize = 8 +``` + +#### Scheduler Enhancement + +##### Source of Truth for Affinity Locks + +The scheduler derives affinity locks **solely from active claims' +`affinityValues`** — not from device attributes on the ResourceSlice. The driver +is NOT required to write locked affinity values back to the ResourceSlice. + +- The ResourceSlice declares *which* keys constrain sharing (`attributeKeys`) +- The claims declare *what* values they need (`affinityValues`) +- The scheduler combines these to maintain the lock in `AllocatedState` + +This avoids two sources of truth that could diverge, eliminates ResourceSlice +churn (no update every time a lock is set/cleared), and keeps driver +implementation simple. Drivers MAY optionally publish current locked values as +regular device attributes for observability (e.g., visible via `kubectl`), but +the scheduler does not depend on them. + +When the last claim on a device is released, the scheduler clears the lock. 
The +driver is responsible for device lifecycle — tearing down the old configuration +(via `NodeUnprepareResources`) and reconfiguring for new claims (via +`NodePrepareResources`). The scheduler does not track hardware reconfiguration +state. + +##### Cache Extension: Effective Device State + +To prevent race conditions during high-volume scheduling, the scheduler +maintains affinity locks in its internal cache rather than relying on API server +round-trips. This is consistent with how DRA already handles capacity tracking +via `inFlightAllocations`. + +A device's effective state is a derived value: + +``` +Effective State = ResourceSlice (device definition + attributeKeys) + + Active Claims (affinityValues from bound claims) + + AssumedClaims (tentative locks from current scheduling cycle) +``` + +The scheduler's `AllocatedState` is extended to track affinity values alongside +consumed capacity: + +```go +type AllocatedState struct { + AllocatedDevices sets.Set[DeviceID] + AllocatedSharedDeviceIDs sets.Set[SharedDeviceID] + AggregatedCapacity ConsumedCapacityCollection + + // AffinityValues tracks the locked affinity values for shared devices. + // Key is DeviceID, value is a map of attribute key to locked value. + // Set tentatively during Reserve, hardened on successful bind, + // cleared on Unreserve or when all claims release. + // +featureGate=DRASharingAffinity + AffinityValues map[DeviceID]map[string]string +} +``` + +##### Filter and Score Phases + +**Filter phase**: For a given node, the scheduler evaluates each device. A +device with `sharingAffinity` is a candidate ONLY if: + +1. It has sufficient consumable capacity (KEP-5075) +2. The claim provides values for ALL keys in `sharingAffinity.attributeKeys` + (missing key → device is **not** a candidate) +3. The device's `AffinityValues` is either empty (unlocked) OR matches the + claim's affinity values for ALL keys + +If a claim does not provide an `affinityValues` entry for a required attribute +key, the device is filtered out. This is the safe default: the driver declared +that sharing requires a specific parameter, and a claim that omits it cannot be +properly configured. Claims that do not need sharing-constrained devices should +target devices without `sharingAffinity`. + +**Score phase**: Nodes where the `AffinityValues` already match the request +are scored **higher** than nodes with "clean" (unlocked) devices. This +preserves unlocked devices for future workloads with different affinity +values, minimizing fragmentation. + +##### Reserve Phase: Tentative Locking + +Once a node/device is selected, the Reserve plugin establishes a "tentative +lock" in the scheduler cache before the Binding phase: + +1. Scheduler evaluates a multi-allocatable device with `sharingAffinity` +2. If device has no existing allocations (unlocked): + - Extract affinity values for `sharingAffinity.attributeKeys` from the claim + - Record values in `AllocatedState.AffinityValues[deviceID]` + - Proceed with allocation (device is now tentatively locked) +3. If device has existing allocations (locked): + - Compare claim's affinity values against `AllocatedState.AffinityValues[deviceID]` + - If all keys match: proceed with allocation (pack onto locked device) + - If any key mismatches: skip this device, try next candidate + +This tentative lock is immediately visible to subsequent scheduling cycles. 
If +Pod-B is evaluated milliseconds after Pod-A's Reserve (before Pod-A's bind +reaches the API server), Pod-B's Filter phase will see Pod-A's tentative lock +and either join it or skip the device. This follows the same pattern used by +`SignalClaimPendingAllocation()` for capacity tracking. + +##### State Transitions + +| Event | Cache Action | Result | +|-------|-------------|--------| +| Pod scheduled (Reserve) | Add tentative lock to `AffinityValues` | Device becomes tentatively locked | +| Binding success (PreBind) | Transition tentative lock to hardened | Lock is confirmed | +| Binding failure / Preemption | Trigger Unreserve; remove tentative lock | Lock is released (if no other claims share it) | +| Driver update (ResourceSlice) | Reconcile cache with API state | Cache refreshed; redundant tentative locks purged | +| All claims released | Clear `AffinityValues[deviceID]` | Device becomes unlocked | + +##### Handling the "First Pod" Problem + +The first Pod to land on an unlocked device defines the affinity lock for all +subsequent consumers. This introduces a risk: a low-priority Pod with a rare +affinity value could "poison" a high-capacity device. + +**Lock origin**: If `AffinityValues[deviceID]` is empty, the scheduler takes +the affinity values from the current Pod's ResourceClaim and writes them to the +cache. All subsequent Pods must match these values to share the device. + +**Poisoning mitigation**: The Score phase assigns a higher score to nodes that +have a device already locked to a compatible affinity value, and a lower score +to nodes where the device is still unlocked. This steers the scheduler toward +packing onto already-locked devices before consuming clean ones, reducing +unnecessary lock fragmentation. + +##### Implementation Note: Snapshot Consistency + +Since the scheduler works on a snapshot of the cache for each Pod, the Reserve +phase must update the primary cache so that subsequent snapshots in the same +scheduling cycle reflect the new lock. This aligns with how VolumeBinding and +PodAffinity currently handle "assumed" states. + +**Parallel scheduling**: In clusters with parallel scheduling enabled, multiple +pods may reach the Filter phase concurrently. Without protection, two pods with +*different* affinities could both pass Filter for the same clean device in the +same millisecond. To prevent this, all reads and writes to +`AllocatedState.AffinityValues` must be protected by the `AllocatedState` mutex. +The Filter phase acquires a read lock to check the current affinity state; the +Reserve phase acquires a write lock to set the tentative lock atomically. This +ensures that once one pod's Reserve completes, the next pod's Filter sees the +updated lock. + +##### Scheduler Restart: State Reconstruction + +On scheduler restart, the in-memory `AffinityValues` map is empty. The scheduler +must reconstruct affinity locks from persisted state before the first scheduling +cycle begins. + +**Reconstruction algorithm**: + +1. On startup, the scheduler iterates all `Bound` ResourceClaims (same path as + existing `GatherAllocatedState()` for capacity reconstruction) +2. For each bound claim, check if the allocated device has `SharingAffinity` + defined in the corresponding ResourceSlice +3. If yes and the claim has `affinityValues`, read the values and populate + `AffinityValues[deviceID]` with the key-value pairs +4. 
If yes but the claim has **no** `affinityValues` (grandfathered claim from + before the feature was enabled), skip it — do not populate affinity for this + claim. The lock will be established by the next new claim that provides values. +5. If multiple claims share the same device, verify their values are consistent + (they must be, by construction—but log a warning if not) + +This follows the same pattern used to reconstruct `AggregatedCapacity` from +bound claims on startup. No new API calls are needed; the data is already +available from the ResourceClaim spec and ResourceSlice spec cached by the +scheduler's informers. + +### Examples + +#### ResourceSlice with Sharing Affinity + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node1-nics +spec: + driver: networking.example.com + nodeName: node1 + devices: + - name: eth1 + allowMultipleAllocations: true + sharingAffinity: + attributeKeys: ["networking.example.com/subnet"] + attributes: + networking.example.com/type: + string: "sriov-vf" + capacity: + networking.example.com/slots: + value: "16" + - name: eth2 + allowMultipleAllocations: true + sharingAffinity: + attributeKeys: ["networking.example.com/subnet"] + attributes: + networking.example.com/type: + string: "sriov-vf" + capacity: + networking.example.com/slots: + value: "16" +``` + +#### ResourceClaim with Affinity Value + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-nic +spec: + devices: + requests: + - name: nic + exactly: + deviceClassName: shared-nic + config: + - requests: ["nic"] + # Structured affinity values the scheduler can read (see UNRESOLVED #2) + affinityValues: + networking.example.com/subnet: + string: "subnet-X" + # Opaque driver config (scheduler cannot read this) + opaque: + driver: networking.example.com + parameters: + apiVersion: networking.example.com/v1 + kind: NICConfig + subnet: "subnet-X" + vlanId: 100 +``` + +> **Note**: The `affinityValues` field is a proposed new structured field +> (see [UNRESOLVED #2](#open-design-questions)). It provides the scheduler with +> readable affinity values while the full driver configuration remains in the +> opaque `parameters` blob. + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +Existing DRA scheduling tests should pass before adding sharing affinity tests. 
+ +##### Unit tests + +- `pkg/scheduler/framework/plugins/dynamicresources`: Coverage for affinity matching + logic, including: + - Filter: device with matching lock passes + - Filter: device with conflicting lock is excluded + - Filter: unlocked device with sufficient capacity passes + - Filter: claim missing a required `attributeKey` → device filtered out + - Filter: claim with extra keys beyond device's declared `attributeKeys` → extra + keys ignored, device passes if declared keys match + - Score: locked-compatible device scores higher than clean device + - Reserve: first claim sets lock; second claim with same values succeeds + - Reserve: second claim with conflicting values fails + - Unreserve: tentative lock is rolled back + - Grandfathered claims: pre-existing claims without `affinityValues` neither set + nor conflict with locks + - Lock precedence: all 6 scenarios from the Lock Precedence table +- `staging/src/k8s.io/api/resource/v1`: Coverage for new API types, including: + - Validation: `attributeKeys` exceeding max 8 limit is rejected + - Validation: `affinityValues` exceeding max 8 limit is rejected + - Round-trip serialization of `SharingAffinity` and `affinityValues` + +##### Integration tests + +- Affinity matching with multiple claims to same device +- Affinity mismatch causing allocation to different device +- Affinity lock clearing when all claims release a device +- Interaction with consumable capacity constraints (KEP-5075) +- Scheduler restart: `AffinityValues` correctly reconstructed from existing + bound ResourceClaims (including skipping grandfathered claims) +- Parallel scheduling: two Pods with conflicting affinity values targeting the + same device — one wins Reserve, the other is requeued +- Feature gate disabled: `sharingAffinity` fields are ignored; devices are + treated as unconditionally shareable +- Feature gate toggled: enabling after claims exist does not disrupt already-bound + workloads +- **Ghost Lock**: Pod is Assumed (tentative lock set) but Bind fails — verify + the lock is cleared immediately and the next Pod in the queue can claim the + device with a different affinity value +- **Grandfather Migration**: 5 Pods running on a NIC with no lock; driver + updates ResourceSlice to add `sharingAffinity`; 6th Pod scheduled with + `affinityValues` — verify the 6th Pod succeeds, sets the lock, and the + original 5 Pods are not disrupted +- **Partial Key**: Device requires `subnet` and `pkey` in `attributeKeys`; + claim provides only `subnet` — verify the device is filtered out + +##### e2e tests + +- End-to-end test with mock DRA driver using sharing affinity +- Multi-pod scheduling: Pods with matching affinity values share the same device +- Multi-pod scheduling: Pods with conflicting affinity values are placed on + different devices +- Lock lifecycle: last Pod deleted → lock cleared → new Pod with different + affinity value can claim the device + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind `DRASharingAffinity` feature gate +- API fields added to ResourceSlice (`SharingAffinity` on `Device`) +- API field added to DeviceClass (`SharingAffinityMapping`) to establish the + mapping contract between opaque config keys and `affinityValues` attribute keys +- Scheduler Filter plugin enforces affinity matching +- Scheduler Score plugin prefers locked-compatible devices over clean devices +- Scheduler tracks affinity in AllocatedState +- External mutating admission webhook polyfill: driver vendors can ship a webhook + that reads the 
DeviceClass mapping and auto-populates `affinityValues` +- Unit and integration tests +- Documentation for driver authors + +#### Beta + +- Gather feedback from DRA driver developers +- Address any issues found in alpha +- **Built-in admission controller**: The `SharingAffinityMapping` logic moves + from an external webhook into a built-in admission controller in the API + server, ensuring operational reliability and zero-touch UX for production +- **Lock-aware preemption**: PostFilter detects affinity mismatch as a + preemption-solvable problem; identifies lock-holder Pods as victims when a + higher-priority Pod needs a device locked to an incompatible value +- E2e tests stable +- Performance validation with high pod churn + +#### GA + +- At least 2 production drivers using sharing affinity +- No significant issues reported +- Conformance tests if applicable + +### Upgrade / Downgrade Strategy + +**Upgrade**: Existing ResourceSlices without `sharingAffinity` continue to work. New field is additive. + +**Adding `sharingAffinity` to an in-use device**: A driver may add or update +`sharingAffinity` on a device that already has active (bound) ResourceClaims. +This can happen during driver upgrades or when enabling the feature on existing +hardware. The scheduler handles this as follows: + +- **Pre-existing claims without `affinityValues`** are grandfathered: they do not + participate in affinity matching. The scheduler skips them when reconstructing + the `AffinityValues` map. +- **The lock is established by the first *new* claim** that provides + `affinityValues` for the required attribute keys after the `sharingAffinity` + field is added. +- **Pre-existing claims continue to run** and are not evicted. The driver is + responsible for ensuring that already-configured VFs/resources remain + functional regardless of the new affinity constraint. +- **On release of all claims** (both old and new), the device returns to a clean + unlocked state and subsequent allocations enforce affinity normally. + +> **Note**: The API server does not cross-validate ResourceSlice updates against +> active ResourceClaims. Enforcing "no `sharingAffinity` changes while claims +> are active" would require a new admission controller with cross-object +> validation, which is fragile and out of scope for this KEP. Drivers should +> avoid adding `sharingAffinity` mid-flight when possible, but the scheduler +> must handle it gracefully when it occurs. + +**Downgrade**: If a ResourceSlice with `sharingAffinity` exists and the feature gate is disabled: +- API server rejects updates to the field +- Scheduler ignores the field (all claims can share) +- Driver should handle this gracefully at prepare time + +### Version Skew Strategy + +- **kube-apiserver**: Must be upgraded first to accept new API field +- **kube-scheduler**: If scheduler is older, it ignores `sharingAffinity` (permissive) +- **kubelet**: No changes required; kubelet doesn't interpret sharing affinity +- **DRA driver**: Driver defines the field but doesn't enforce it; scheduler does + +During skew, the worst case is permissive sharing (old scheduler ignores affinity). Drivers should handle conflicting configs at prepare time as a fallback. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? 
+ +- [x] Feature gate + - Feature gate name: `DRASharingAffinity` + - Components depending on the feature gate: kube-apiserver, kube-scheduler + +###### Does enabling the feature change any default behavior? + +No. Devices without `sharingAffinity` behave exactly as before. The feature only affects devices that explicitly opt-in via the new field. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. Disabling the feature gate causes: +- API server to reject new/updated ResourceSlices with `sharingAffinity` +- Scheduler to ignore existing `sharingAffinity` fields (permissive sharing) + +Existing allocations continue to work. New allocations may allow incompatible sharing, which drivers should handle at prepare time. + +###### What happens if we reenable the feature if it was previously rolled back? + +The scheduler resumes enforcing `sharingAffinity`. Existing allocations are not affected. New allocations will respect affinity constraints. + +###### Are there any tests for feature enablement/disablement? + +Yes, unit tests will cover the feature gate behavior for API validation and scheduler logic. + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +Rollout failure: If API server is updated but scheduler is not, the scheduler ignores affinity (permissive). This may cause incompatible sharing, but drivers should handle it. + +Rollback failure: If scheduler is rolled back but API server keeps the field, same permissive behavior. + +Running workloads are not impacted; only new scheduling decisions are affected. + +###### What specific metrics should inform a rollback? + +- `dra_scheduling_attempts_affinity_mismatch_total` increasing unexpectedly +- Pod scheduling failures with affinity-related events +- Driver prepare failures due to incompatible configs + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +Will be tested before beta. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +- Check ResourceSlices for `sharingAffinity` field +- Metric: `dra_scheduling_attempts_affinity_mismatch_total` > 0 indicates affinity is being enforced + +###### How can someone using this feature know that it is working for their instance? + +- [ ] Events + - Event Reason: `SharingAffinityMismatch` when a device is skipped due to affinity +- [ ] API .status + - Condition name: N/A (affinity is transparent; allocation succeeds or device is skipped) + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +Scheduling latency should not increase significantly. Affinity checking is O(number of attribute keys), typically 1-3 keys. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [x] Metrics + - Metric name: `dra_scheduling_attempts_affinity_mismatch_total` + - Components exposing the metric: kube-scheduler + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +A metric for "devices skipped due to affinity" per scheduling cycle could help diagnose fragmentation. + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? 
+ +- DRA must be enabled (GA in 1.34) +- KEP-5075 (Consumable Capacity) for multi-allocatable devices + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +No. Affinity is evaluated using existing ResourceSlice and ResourceClaim data. + +###### Will enabling / using this feature result in introducing new API types? + +No. Only new fields on existing types. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +- ResourceSlice: Small increase (~50-100 bytes) for devices with `sharingAffinity` +- AllocatedState (in-memory): Small increase for tracking affinity values + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +Negligible. Affinity check is a simple map lookup, O(1) per attribute key. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +No. Memory increase for `AllocatedState.AffinityValues` is proportional to active shared allocations, not total devices. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +Same as existing DRA behavior. Scheduler cannot proceed without API server. + +###### What are other known failure modes? + +- **Affinity fragmentation**: Many unique affinity values cause devices to be underutilized + - Detection: Monitor device utilization vs capacity + - Mitigation: Review affinity key design; consider coarser grouping + +###### What steps should be taken if SLOs are not being met to determine the problem? + +1. Check `dra_scheduling_attempts_affinity_mismatch_total` for unexpected spikes +2. Review ResourceSlice `sharingAffinity` configuration +3. Examine claim affinity values for unexpected entries +4. Consider disabling feature gate as temporary mitigation + +## Implementation History + +- 2026-03-27: Initial KEP issue created +- 2026-03-30: KEP document drafted + +## Drawbacks + +- Adds complexity to the scheduler's allocation logic +- Affinity is static per device; cannot change after first allocation +- Fragmentation risk if affinity values are too fine-grained + +## Alternatives + +### Claim-side SharingAffinity (on DeviceRequest) + +An alternative design places `SharingAffinity` on the `DeviceRequest` within +ResourceClaim, allowing the user to define it per workload: + +```go +type DeviceRequest struct { + // ... existing fields ... 
+ SharingAffinity *SharingAffinity +} + +type SharingAffinity struct { + AttributeName string // e.g., "networking.k8s.io/pkey" + Value string // e.g., "0x8001" + Strategy SharingStrategy // e.g., "LockOnFirstUse" +} +``` + +**Rejected because**: +- The modal constraint is a property of the hardware, not the workload— + the driver knows that a NIC locked to subnet A can't serve subnet B +- Requires every claim to repeat the *constraint definition* (attribute name, + strategy) in addition to the value—the driver-side design declares the + constraint once on the device and claims only provide values +- Users must understand the sharing constraint mechanism and explicitly opt + into it, rather than simply providing config values they'd specify anyway + +### Placeholder Pattern Workaround + +Without this KEP, drivers must use a "placeholder pattern": + +1. Publish devices with `capacity: 1` initially +2. Wait for first claim to determine affinity value +3. Update ResourceSlice with actual capacity and affinity as attribute +4. Use CEL selector to match affinity attribute + +**Problems**: +- Race condition: Second pod may go to different device before expansion +- ResourceSlice churn: Constant updates as pods come and go +- Driver complexity: State machine for expand/contract lifecycle + +### CEL-based Affinity Matching + +An alternative approach uses CEL expressions to evaluate affinity compatibility, +rather than introducing new structured API fields. Two variants were considered: + +**Variant A: Claim-to-claim CEL matching** + +Allow CEL expressions in a ResourceClaim to reference other claims' allocations +on the same device. For example: + +```yaml +constraints: + - cel: + expression: > + device.allocations.all(a, + !has(a.config.subnet) || a.config.subnet == "subnet-X") +``` + +**Rejected because**: +- Creates a circular dependency: Claim A's eligibility depends on Claim B's + allocation, and vice versa. The scheduler cannot evaluate both simultaneously. +- CEL evaluation order becomes undefined—the result depends on which claim is + evaluated first, making scheduling non-deterministic. +- The CEL environment would need to expose `device.allocations`, a runtime + collection of other claims' configs. This is a fundamentally different + evaluation model from today's single-device CEL selectors. + +**Variant B: Driver-published CEL lock expressions on ResourceSlice** + +The driver publishes a CEL expression on the ResourceSlice that evaluates +whether a claim is compatible with the device's current lock state: + +```yaml +devices: + - name: eth1 + sharingAffinity: + lockExpression: > + device.affinityLock['subnet'] == '' || + device.affinityLock['subnet'] == claim.AffinityValues['subnet'] +``` + +**Rejected because**: +- `device.affinityLock` is runtime scheduler state, not a static device + attribute. Exposing it in CEL requires extending the evaluation context to + include the scheduler's in-memory `AllocatedState`, which breaks the current + model where CEL only evaluates against the ResourceSlice snapshot. +- `claim.AffinityValues` is not currently part of the CEL evaluation context + either. Adding it requires changes to the CEL environment definition, the + scheduler's expression compiler, and the cost estimator. +- CEL expressions are powerful but opaque to the scheduler—it cannot extract + *which* keys constrain sharing or *what* values to record in `AllocatedState`. + The scheduler would need to both evaluate the expression AND separately track + lock state, duplicating logic. 
+- While Kubernetes is adopting CEL broadly (ValidatingAdmissionPolicy, DRA + selectors), those use cases evaluate static data. Affinity matching requires + reasoning about mutable runtime state, which is a qualitatively different + problem better served by a purpose-built mechanism. + +A CEL-based approach may become viable in the future if the DRA CEL environment +is extended to support runtime allocation state (see +[Future Enhancements: CEL-based Lock Expressions](#cel-based-lock-expressions)). + +## Future Enhancements + +The following ideas are out of scope for alpha but are worth exploring in +beta/GA based on real-world feedback: + +### Priority-based Lock Preemption + +Standard Kubernetes preemption is **blind to affinity locks**. It triggers on +*resource shortage* (insufficient CPU, memory, or device slots), not on +qualitative state mismatch. This creates a critical gap: + +1. **Invisible shortage**: A NIC has 15/16 slots available but is locked to + Subnet-X. A high-priority Pod needs Subnet-Y. The scheduler sees plenty of + capacity → preemption is never triggered. The Pod is simply unschedulable. + +2. **Wrong victim selection**: Even if preemption were triggered by an unrelated + shortage, victim selection asks "which Pods free up slots?" not "which Pods + clear the lock?" The scheduler might preempt an unrelated Pod, freeing a + slot on a device still locked to the wrong subnet. + +3. **Permanent poisoning**: Without lock-aware preemption, a single low-priority + Pod can hold a high-capacity device hostage indefinitely. + +**Lock-aware preemption** (targeted for Beta) extends the scheduler's PostFilter +phase: + +1. **Detection**: When a Pod fails Filter specifically due to + `SharingAffinityMismatch`, the PostFilter identifies the device and its + current lock-holder claims. +2. **Evaluation**: It calculates the collective priority of all Pods holding + claims that share the lock. If the incoming Pod's priority exceeds the + group's maximum priority, preemption is viable. +3. **Action**: The scheduler preempts all lock-holder Pods on the device, + releasing their claims and clearing the affinity lock. The device returns + to a clean state for the high-priority Pod. + +This is scoped for Beta because the core Filter/Reserve/Score mechanism must +be proven in Alpha first, and lock-aware preemption requires careful +integration with the existing DRA preemption path. + +### Follower-Only Strategy + +See [UNRESOLVED #3: SharingStrategy](#open-design-questions) for the +`CanSetLock`/`NeverSetLock` proposal. If not included in alpha, this becomes +the first beta enhancement. + +### Soft / Preferred Affinity Keys + +The Alpha design enforces **hard all-or-nothing** matching: all declared +`attributeKeys` must match or the device is filtered out. Real-world hardware +may have hierarchical constraints where some keys are strict sharing +requirements (e.g., Subnet) and others are scheduling preferences (e.g., +Traffic-Class or bandwidth profile). + +A future enhancement could add a `required` vs `preferred` flag on individual +entries in `attributeKeys`: + +- **`required`** (default): Mismatch → device filtered out (current behavior) +- **`preferred`**: Mismatch → device passes Filter but receives a lower score + +This would allow the Score phase to optimize for Traffic-Class alignment while +only enforcing hard locks on Subnet. The lock itself would only be set for +`required` keys — `preferred` keys would remain advisory and never block +scheduling. 
This avoids complicating the atomic lock model while still
enabling soft optimization.

### Lock Decay / Sticky Scoring

When a device is recently unlocked (all claims released), the hardware may or
may not retain its previous configuration depending on driver behavior — some
drivers keep the state (e.g., a loaded FPGA bitstream), others tear it down
immediately. For drivers that preserve state, a time-decaying score bonus for
recently-unlocked devices matching the previous affinity value would improve
scheduling by avoiding expensive reconfiguration. This would require the
scheduler to track historical lock values with a TTL, and would only benefit
drivers that signal "warm" state — likely via a device attribute.

### CEL-based Lock Expressions

As Kubernetes moves toward CEL for policy evaluation, a future enhancement
could allow drivers to publish CEL expressions on the ResourceSlice that
evaluate affinity compatibility (e.g.,
`device.affinityLock['pkey'] == '' || device.affinityLock['pkey'] == claim.AffinityValues['pkey']`).
This would require extending the CEL evaluation context to include runtime
allocation state, which is a substantial change warranting its own KEP.

## Infrastructure Needed

None

diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/kep.yaml b/keps/sig-scheduling/5981-dra-sharing-affinity/kep.yaml
new file mode 100644
index 000000000000..938b7caee368
--- /dev/null
+++ b/keps/sig-scheduling/5981-dra-sharing-affinity/kep.yaml
@@ -0,0 +1,46 @@
title: "DRA Sharing Affinity for Conditional Fungibility"
kep-number: 5981
authors:
  - "@ashvindeodhar"
owning-sig: "sig-scheduling"
participating-sigs:
  - "sig-node"
status: "provisional"
creation-date: "2026-03-30"
reviewers:
  - "@pohly"
  - "@johnbelamaric"
  - "@sunya-ch"
  - "@ritazh"
  - "@LionelJouin"
approvers:
  - TBD

see-also:
  - "/keps/sig-scheduling/5075-dra-consumable-capacity"
  - "/keps/sig-node/4381-dra-structured-parameters"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been done.
latest-milestone: "v1.37"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.37"
  beta: "v1.38"
  stable: "v1.40"

# The following PRR answers are required at alpha release
feature-gates:
  - name: DRASharingAffinity
    components:
      - kube-apiserver
      - kube-scheduler
disable-supported: true

# Metrics required for beta release; can be placeholders for now.
metrics:
  - dra_scheduling_attempts_affinity_mismatch_total
\ No newline at end of file

From 2f0197631fdecd45552772aa0a75c7aaa4641f63 Mon Sep 17 00:00:00 2001
From: Ashvin Deodhar
Date: Thu, 2 Apr 2026 21:19:05 -0700
Subject: [PATCH 2/6] Incorporate early feedback on using JSON schema in
 OpaqueDeviceConfiguration

---
 .../5981-dra-sharing-affinity/README.md | 253 ++++++++----------
 1 file changed, 106 insertions(+), 147 deletions(-)

diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md
index a116ae3204bc..0d61a6d15782 100644
--- a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md
+++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md
@@ -168,7 +168,10 @@ spec:

When the scheduler allocates a multi-allocatable device with `sharingAffinity`:

1. 
**First claim**: The scheduler reads the claim's affinity values for the specified attribute key(s) and records them in `AllocatedState` alongside consumed capacity +1. **First claim**: The scheduler decodes the claim's well-known structured + parameters from opaque config, reads the affinity values for the specified + attribute key(s), and records them in `AllocatedState` alongside consumed + capacity 2. **Subsequent claims**: The scheduler checks if the new claim's affinity values match those recorded in `AllocatedState` 3. **Mismatch**: If values don't match, the device is skipped (try another device) 4. **Match**: If values match and capacity is available, allocation proceeds @@ -194,110 +197,58 @@ scheduler which attribute keys constrain sharing. But the scheduler also needs to know what affinity values a given claim *requests*. The claim's opaque config (`DeviceConfiguration.Opaque.Parameters`) is a -`runtime.RawExtension`—raw bytes the scheduler cannot parse by design. The -scheduler intentionally does not understand opaque driver parameters. Therefore, -affinity values cannot be extracted from opaque config. - -Options under consideration: - -**Option A: New structured field in `DeviceClaimConfiguration`** - -Add a `affinityValues` map alongside opaque config: - -```go -type DeviceClaimConfiguration struct { - Requests []string - DeviceConfiguration `json:",inline"` - - // AffinityValues provides structured affinity values for sharing - // constraints. Keys must correspond to attributeKeys declared in - // the device's SharingAffinity. - // The maximum number of entries is 8. - // +optional - // +featureGate=DRASharingAffinity - AffinityValues map[QualifiedName]DeviceAttribute +`runtime.RawExtension`—raw bytes the scheduler cannot parse by design. +However, based on feedback from @pohly, the approach is to define a +**well-known JSON schema** that lives *inside* the opaque config. This avoids +any API changes to `DeviceConfiguration` while giving the scheduler a +decodable format for affinity-relevant parameters. + +**Approach: Well-known JSON schema inside `OpaqueDeviceConfiguration`** + +Drivers that want sharing affinity encode their config using a community-governed +JSON schema, similar to the pattern in +[`k8s.io/dynamic-resource-allocation/api/metadata`](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation/api/metadata). +The scheduler recognizes this schema and can decode the affinity-relevant +parameters from the opaque blob. + +The schema uses the same `DeviceAttribute` types already used in ResourceSlice, +providing a flat key-value map of qualified attribute names: + +```json +{ + "apiVersion": "resource.k8s.io/v1alpha1", + "kind": "StructuredParameters", + "attributes": { + "networking.example.com/subnet": {"string": "subnet-X"}, + "networking.example.com/pkey": {"string": "0x8001"} + } } ``` -Pros: Clean separation; scheduler can read values directly; follows existing -config pattern where claim-level configuration lives. -Cons: Duplicates information already in opaque config; users must specify -affinity-relevant values in two places (opaque for the driver, structured for -the scheduler). - -> **UX mitigation — Staged Automation**: -> -> The duplication is addressed through a staged approach that establishes the -> API contract early and hardens it for production over time: -> -> **Alpha — API field + external webhook polyfill**: The `SharingAffinityMapping` -> field is defined in the `DeviceClass` spec during Alpha. 
This establishes the -> declarative mapping contract between opaque config keys and `affinityValues` -> attribute keys. For example: -> ```yaml -> apiVersion: resource.k8s.io/v1 -> kind: DeviceClass -> metadata: -> name: shared-nic -> spec: -> sharingAffinityMapping: -> - opaqueKey: "subnet" -> attributeKey: "networking.example.com/subnet" -> ``` -> During Alpha, driver vendors ship a **mutating admission webhook** that reads -> the `SharingAffinityMapping` from the referenced DeviceClass, extracts the -> corresponding values from the claim's opaque config, and auto-populates -> `affinityValues`. This keeps the initial PR footprint small and avoids complex -> changes to kube-apiserver before the core scheduling logic is proven. -> -> **Beta — Built-in admission controller**: The mapping logic moves into a -> **built-in admission controller** in the API server. This eliminates the -> dependency on an external webhook, which can fail or become a bottleneck during -> high-volume pod creation. The controller reads the DeviceClass mapping, -> extracts values from the opaque config's raw JSON (the driver opted into this -> by defining the mapping), and auto-populates `affinityValues` at the API -> boundary. By Beta, the mapping is transparent to the user — they provide -> opaque config and the API server ensures `affinityValues` is consistent. - -**Option B: New structured field on `ExactDeviceRequest`** - -Add `affinityValues` per-request instead of per-config: - -```go -type ExactDeviceRequest struct { - // ... existing fields ... - - // AffinityValues specifies the affinity attribute values this request - // requires for sharing-constrained devices. - // The maximum number of entries is 8. - // +optional - // +featureGate=DRASharingAffinity - AffinityValues map[QualifiedName]DeviceAttribute -} -``` - -Pros: Values are per-request (more precise); no duplication with config section. -Cons: Mixes resource selection with configuration; config is traditionally -separate from request in DRA's layered model. - -**Option C: Driver pre-populates device attributes dynamically** - -The driver publishes devices with attribute values already set based on current -state (e.g., `networking.example.com/subnet: "subnet-X"` after first -allocation). Claims use CEL selectors to match. - -Pros: No new API fields needed; uses existing CEL selector mechanism. -Cons: Race condition between driver updating ResourceSlice and scheduler reading -it—this is essentially the placeholder pattern this KEP aims to eliminate. - -We are leaning toward **Option A** because it cleanly separates structured -affinity values from opaque driver config while keeping the scheduler's read -path simple. The examples in this KEP use Option A. Feedback from reviewers is -requested. +The scheduler decodes this JSON from the opaque blob and extracts values for the +keys listed in the device's `sharingAffinity.attributeKeys`. The decoding +overhead is small compared to the overall scheduling effort. + +If drivers also need additional, differently structured configuration parameters +(e.g., MTU, QoS settings), users provide **two** config entries in the claim: +one using the standard schema (scheduler reads) and one using the vendor format +(driver reads). The scheduler only considers configurations matching the +well-known schema. 
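+
+To make the read path concrete, the following is a minimal sketch of a
+scheduler-side helper that decodes affinity values from the well-known
+payload. The package, type, and function names are illustrative rather than
+part of this proposal, and the sketch assumes string-valued attributes as in
+the example above:
+
+```go
+// Illustrative sketch only; not a proposed API.
+package sharingaffinity
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// structuredParameters mirrors the well-known JSON schema shown above,
+// assuming every attribute is provided as a string value.
+type structuredParameters struct {
+    APIVersion string                       `json:"apiVersion"`
+    Kind       string                       `json:"kind"`
+    Attributes map[string]map[string]string `json:"attributes"`
+}
+
+// decodeAffinityValues extracts the values for the device's declared
+// attribute keys, returning an error when the payload does not match the
+// well-known schema or omits a required key.
+func decodeAffinityValues(raw []byte, attributeKeys []string) (map[string]string, error) {
+    var sp structuredParameters
+    if err := json.Unmarshal(raw, &sp); err != nil {
+        return nil, fmt.Errorf("decoding structured parameters: %w", err)
+    }
+    if sp.APIVersion != "resource.k8s.io/v1alpha1" || sp.Kind != "StructuredParameters" {
+        return nil, fmt.Errorf("unrecognized payload %s/%s", sp.APIVersion, sp.Kind)
+    }
+    values := make(map[string]string, len(attributeKeys))
+    for _, key := range attributeKeys {
+        attr, ok := sp.Attributes[key]
+        if !ok {
+            return nil, fmt.Errorf("missing affinity value for %q", key)
+        }
+        value, ok := attr["string"]
+        if !ok {
+            return nil, fmt.Errorf("affinity value for %q is not a string", key)
+        }
+        values[key] = value
+    }
+    return values, nil
+}
+```
+
+A claim whose payload fails these checks is simply not eligible for devices
+that declare `sharingAffinity`.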
+ +**Key advantages:** +- **No API changes** to `DeviceConfiguration` — the feature uses existing opaque + config with a well-known schema +- **No duplication** for simple cases — the driver can read the same structured + parameters it programs (e.g., subnet, PKey) +- **No `SharingAffinityMapping`** — no webhook polyfill or admission controller + needed +- **Extensible** — the well-known schema can support future scheduler-readable + hints beyond sharing affinity **3. SharingStrategy: Should claims control lock-setting behavior?** -When a claim provides `affinityValues`, should it also declare whether it is +When a claim provides affinity values (via the well-known structured parameters +schema), should it also declare whether it is allowed to *set* a lock on a clean device, or whether it can only *join* an existing lock? Two initial strategies are proposed: @@ -307,8 +258,8 @@ existing lock? Two initial strategies are proposed: a matching lock established by another claim. Useful for background or batch jobs that should never "poison" a clean device. -If included, this field would live alongside `affinityValues` on the claim side -(Option A: in `DeviceClaimConfiguration`, Option B: in `ExactDeviceRequest`). +If included, this field would live on the claim side — either in the well-known +structured parameters schema or as an additional field in the opaque config. **Filter ordering**: The scheduler checks `NeverSetLock` **before** evaluating capacity or key matching. If the device is unlocked and the strategy is @@ -319,7 +270,7 @@ without noise from capacity evaluation. The full Filter order becomes: 1. If `Strategy == NeverSetLock` AND device is unlocked → reject 2. Sufficient consumable capacity (KEP-5075) -3. All required `attributeKeys` present in claim's `affinityValues` +3. All required `attributeKeys` present in claim's structured parameters 4. Values match existing lock This could be deferred to beta if the alpha scope is too large. @@ -364,9 +315,9 @@ but only if pods belong to the same subnet. The driver sets ### Notes/Constraints/Caveats - **Affinity is set by the first claim**: Once a device is allocated with an affinity value, that value is locked until all claims release the device -- **Attribute keys must be declared**: The device's `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; claims must provide values for all of these keys in `affinityValues` or the device is filtered out +- **Attribute keys must be declared**: The device's `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; claims must provide values for all of these keys in the well-known structured parameters or the device is filtered out - **Multiple keys**: If multiple attribute keys are specified, ALL must match (both presence and value) -- **Extra keys in claim**: If a claim's `affinityValues` contains keys beyond +- **Extra keys in claim**: If a claim's structured parameters contain keys beyond what the device declares in `attributeKeys`, the extra keys are **ignored** for that device. Only the device's declared keys are evaluated. This allows "generic" claims to work across devices with different sharing requirements @@ -378,11 +329,11 @@ but only if pods belong to the same subnet. The driver sets - **Multi-request claims (per-request scoping)**: If a claim requests multiple devices (e.g., `mgmt-nic` and `data-nic`), each `DeviceClaimConfiguration` block targets specific requests via its `requests` slice. 
Different config - blocks can specify different `affinityValues` for different requests. This + blocks can specify different structured parameters for different requests. This means `mgmt-nic` can be locked to Subnet-A while `data-nic` is locked to Subnet-B within the same claim — there is no cross-talk between requests. - **Empty affinity**: Devices without `sharingAffinity` behave as before (any claim can share) -- **Grandfathered claims**: Pre-existing claims without `affinityValues` (created +- **Grandfathered claims**: Pre-existing claims without structured parameters (created before the feature was enabled) do not participate in affinity matching but do not block new claims from establishing a lock. See lock precedence table below. @@ -393,13 +344,13 @@ but only if pods belong to the same subnet. The driver sets | 5 grandfathered claims, no lock set | Claim with `subnet: A` | Lock set to `subnet: A`; device now locked | | 5 grandfathered + locked to `subnet: A` | Claim with `subnet: A` | Allowed (values match) | | 5 grandfathered + locked to `subnet: A` | Claim with `subnet: B` | **Rejected** (mismatch with lock) | -| 5 grandfathered + locked to `subnet: A` | Claim without `affinityValues` | **Rejected** (missing required key) | +| 5 grandfathered + locked to `subnet: A` | Claim without structured parameters | **Rejected** (missing required key) | | Only grandfathered claims remain, new claims released | — | Lock cleared; device returns to unlocked | | All claims released (grandfathered + new) | — | Device fully clean | Grandfathered claims are "transparent" to the lock — they neither set it nor -conflict with it. The lock is defined entirely by claims that provide -`AffinityValues`. +conflict with it. The lock is defined entirely by claims that provide structured +parameters matching the well-known schema. ### Risks and Mitigations @@ -523,12 +474,13 @@ const SharingAffinityAttributeKeysMaxSize = 8 ##### Source of Truth for Affinity Locks -The scheduler derives affinity locks **solely from active claims' -`affinityValues`** — not from device attributes on the ResourceSlice. The driver -is NOT required to write locked affinity values back to the ResourceSlice. +The scheduler derives affinity locks **solely from active claims' structured +parameters** (decoded from the well-known JSON schema in opaque config) — not +from device attributes on the ResourceSlice. The driver is NOT required to write +locked affinity values back to the ResourceSlice. - The ResourceSlice declares *which* keys constrain sharing (`attributeKeys`) -- The claims declare *what* values they need (`affinityValues`) +- The claims declare *what* values they need (via well-known structured parameters) - The scheduler combines these to maintain the lock in `AllocatedState` This avoids two sources of truth that could diverge, eliminates ResourceSlice @@ -554,7 +506,7 @@ A device's effective state is a derived value: ``` Effective State = ResourceSlice (device definition + attributeKeys) - + Active Claims (affinityValues from bound claims) + + Active Claims (structured parameters decoded from opaque config) + AssumedClaims (tentative locks from current scheduling cycle) ``` @@ -583,11 +535,12 @@ device with `sharingAffinity` is a candidate ONLY if: 1. It has sufficient consumable capacity (KEP-5075) 2. The claim provides values for ALL keys in `sharingAffinity.attributeKeys` - (missing key → device is **not** a candidate) + (missing key → device is **not** a candidate). 
The scheduler extracts these + values by decoding the well-known JSON schema from the claim's opaque config. 3. The device's `AffinityValues` is either empty (unlocked) OR matches the claim's affinity values for ALL keys -If a claim does not provide an `affinityValues` entry for a required attribute +If a claim does not provide a structured parameter entry for a required attribute key, the device is filtered out. This is the safe default: the driver declared that sharing requires a specific parameter, and a claim that omits it cannot be properly configured. Claims that do not need sharing-constrained devices should @@ -605,7 +558,8 @@ lock" in the scheduler cache before the Binding phase: 1. Scheduler evaluates a multi-allocatable device with `sharingAffinity` 2. If device has no existing allocations (unlocked): - - Extract affinity values for `sharingAffinity.attributeKeys` from the claim + - Extract affinity values for `sharingAffinity.attributeKeys` from the claim's + structured parameters (decoded from opaque config) - Record values in `AllocatedState.AffinityValues[deviceID]` - Proceed with allocation (device is now tentatively locked) 3. If device has existing allocations (locked): @@ -674,11 +628,14 @@ cycle begins. existing `GatherAllocatedState()` for capacity reconstruction) 2. For each bound claim, check if the allocated device has `SharingAffinity` defined in the corresponding ResourceSlice -3. If yes and the claim has `affinityValues`, read the values and populate - `AffinityValues[deviceID]` with the key-value pairs -4. If yes but the claim has **no** `affinityValues` (grandfathered claim from - before the feature was enabled), skip it — do not populate affinity for this - claim. The lock will be established by the next new claim that provides values. +3. If yes, decode the claim's opaque config using the well-known JSON schema + and extract the structured parameters; populate + `AffinityValues[deviceID]` with the key-value pairs for the declared + attribute keys +4. If yes but the claim has **no** well-known structured parameters + (grandfathered claim from before the feature was enabled), skip it — do not + populate affinity for this claim. The lock will be established by the next + new claim that provides structured parameters. 5. If multiple claims share the same device, verify their values are consistent (they must be, by construction—but log a warning if not) @@ -736,25 +693,31 @@ spec: exactly: deviceClassName: shared-nic config: + # Well-known structured parameters (scheduler decodes for affinity matching) + - requests: ["nic"] + opaque: + driver: resource.k8s.io + parameters: + apiVersion: resource.k8s.io/v1alpha1 + kind: StructuredParameters + attributes: + networking.example.com/subnet: + string: "subnet-X" + # Driver-specific opaque config (scheduler ignores this) - requests: ["nic"] - # Structured affinity values the scheduler can read (see UNRESOLVED #2) - affinityValues: - networking.example.com/subnet: - string: "subnet-X" - # Opaque driver config (scheduler cannot read this) opaque: driver: networking.example.com parameters: apiVersion: networking.example.com/v1 kind: NICConfig - subnet: "subnet-X" vlanId: 100 ``` -> **Note**: The `affinityValues` field is a proposed new structured field -> (see [UNRESOLVED #2](#open-design-questions)). It provides the scheduler with -> readable affinity values while the full driver configuration remains in the -> opaque `parameters` blob. 
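+
+For illustration, decoding the first (well-known) config entry above with a
+helper like the `decodeAffinityValues` sketch earlier would yield the
+following values; the vendor-specific entry plays no part in scheduling:
+
+```go
+// Hypothetical decoded result for the claim above (sketch only); these are
+// the values the scheduler would record as the device's lock if this claim
+// is the first one allocated to a clean device.
+var decodedAffinityValues = map[string]string{
+    "networking.example.com/subnet": "subnet-X",
+}
+```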
+> **Note**: The first config block uses the well-known `StructuredParameters` +> schema with `driver: resource.k8s.io`, which the scheduler recognizes and +> decodes for affinity matching. The second config block is standard opaque +> driver config that only the driver reads. For simple cases where the driver +> can read both, a single well-known config block may be sufficient. ### Test Plan @@ -780,13 +743,13 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Reserve: first claim sets lock; second claim with same values succeeds - Reserve: second claim with conflicting values fails - Unreserve: tentative lock is rolled back - - Grandfathered claims: pre-existing claims without `affinityValues` neither set + - Grandfathered claims: pre-existing claims without structured parameters neither set nor conflict with locks - Lock precedence: all 6 scenarios from the Lock Precedence table - `staging/src/k8s.io/api/resource/v1`: Coverage for new API types, including: - Validation: `attributeKeys` exceeding max 8 limit is rejected - - Validation: `affinityValues` exceeding max 8 limit is rejected - - Round-trip serialization of `SharingAffinity` and `affinityValues` + - Validation: structured parameters exceeding max 8 attributes is rejected + - Round-trip serialization of `SharingAffinity` and well-known schema ##### Integration tests @@ -795,7 +758,8 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Affinity lock clearing when all claims release a device - Interaction with consumable capacity constraints (KEP-5075) - Scheduler restart: `AffinityValues` correctly reconstructed from existing - bound ResourceClaims (including skipping grandfathered claims) + bound ResourceClaims (including skipping grandfathered claims without + structured parameters) - Parallel scheduling: two Pods with conflicting affinity values targeting the same device — one wins Reserve, the other is requeued - Feature gate disabled: `sharingAffinity` fields are ignored; devices are @@ -807,7 +771,7 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. device with a different affinity value - **Grandfather Migration**: 5 Pods running on a NIC with no lock; driver updates ResourceSlice to add `sharingAffinity`; 6th Pod scheduled with - `affinityValues` — verify the 6th Pod succeeds, sets the lock, and the + structured parameters — verify the 6th Pod succeeds, sets the lock, and the original 5 Pods are not disrupted - **Partial Key**: Device requires `subnet` and `pkey` in `attributeKeys`; claim provides only `subnet` — verify the device is filtered out @@ -827,13 +791,11 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. 
- Feature implemented behind `DRASharingAffinity` feature gate - API fields added to ResourceSlice (`SharingAffinity` on `Device`) -- API field added to DeviceClass (`SharingAffinityMapping`) to establish the - mapping contract between opaque config keys and `affinityValues` attribute keys +- Well-known `StructuredParameters` JSON schema defined for opaque config +- Scheduler decodes well-known schema from opaque config for affinity matching - Scheduler Filter plugin enforces affinity matching - Scheduler Score plugin prefers locked-compatible devices over clean devices - Scheduler tracks affinity in AllocatedState -- External mutating admission webhook polyfill: driver vendors can ship a webhook - that reads the DeviceClass mapping and auto-populates `affinityValues` - Unit and integration tests - Documentation for driver authors @@ -841,9 +803,6 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Gather feedback from DRA driver developers - Address any issues found in alpha -- **Built-in admission controller**: The `SharingAffinityMapping` logic moves - from an external webhook into a built-in admission controller in the API - server, ensuring operational reliability and zero-touch UX for production - **Lock-aware preemption**: PostFilter detects affinity mismatch as a preemption-solvable problem; identifies lock-holder Pods as victims when a higher-priority Pod needs a device locked to an incompatible value @@ -865,11 +824,11 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. This can happen during driver upgrades or when enabling the feature on existing hardware. The scheduler handles this as follows: -- **Pre-existing claims without `affinityValues`** are grandfathered: they do not +- **Pre-existing claims without structured parameters** are grandfathered: they do not participate in affinity matching. The scheduler skips them when reconstructing the `AffinityValues` map. -- **The lock is established by the first *new* claim** that provides - `affinityValues` for the required attribute keys after the `sharingAffinity` +- **The lock is established by the first *new* claim** that provides structured + parameters for the required attribute keys after the `sharingAffinity` field is added. - **Pre-existing claims continue to run** and are not evicted. 
The driver is responsible for ensuring that already-configured VFs/resources remain From 3f5640fd749cc9f02ddac0013a352924f38224fd Mon Sep 17 00:00:00 2001 From: Ashvin Deodhar Date: Mon, 6 Apr 2026 17:02:10 -0700 Subject: [PATCH 3/6] Address UNRESOLVED design questions and harden alpha spec - Resolve placement decision: SharingAffinity stays on ResourceSlice (driver-side) with rationale for why hardware modal constraints belong on the device, not the workload - Resolve claim-value delivery: adopt well-known JSON schema inside OpaqueDeviceConfiguration per @pohly's guidance; define normative StructuredParameters contract (recognition, uniqueness, coexistence, conflict handling, string-only alpha, malformed payloads, missing entries, validation intent) - Defer CanSetLock/NeverSetLock to Future Enhancements; alpha allows any compatible claim to establish the initial lock - Replace grandfathered-claim model with conservative unknown-affinity handling: devices with non-reconstructable active claims are filtered out until fully clean (no optimistic lock-setting over legacy claims) - Add Safety Model and Responsibility Split section clarifying scheduler guarantee vs driver guarantee vs conservative fallback - Introduce AffinityState struct with Unknown flag; replace flat AffinityValues map with AffinityStates map[DeviceID]AffinityState - Expand Filter phase to 7-step evaluation including UnknownAffinity check, exactly-one StructuredParameters entry, schema decode, string-type enforcement - Add normative Score ordering (locked-compatible > clean > filtered) - Add explicit alpha limitations for lock-aware fairness and preemption blindness throughout Summary, Non-Goals, Proposal, and Risks sections - Add string-only matching constraint with rationale to Notes, Filter, StructuredParameters contract, and new Future Enhancement (Typed Affinity Values Beyond Strings) - Add Multi-key SharingAffinity example with subnet+pkey walkthrough - Expand reconstruction algorithm to handle malformed, non-string, and duplicate structured-parameters entries - Harden Risks section: rename Stale Affinity View to Cache Staleness, add alpha limitation callout to Priority Inversion - Remove stale SharingAffinityMapping reference - Add Priority-based Lock Preemption and SharingStrategy to Future Enhancements with detailed rationale - Update Graduation Criteria, Upgrade/Downgrade, Version Skew, PRR sections for conservative unknown-affinity handling --- .../5981-dra-sharing-affinity/README.md | 785 +++++++++++++----- 1 file changed, 575 insertions(+), 210 deletions(-) diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md index 0d61a6d15782..bc824d1c311a 100644 --- a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md +++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md @@ -20,6 +20,7 @@ - [Examples](#examples) - [ResourceSlice with Sharing Affinity](#resourceslice-with-sharing-affinity) - [ResourceClaim with Affinity Value](#resourceclaim-with-affinity-value) + - [Multi-key SharingAffinity Example](#multi-key-sharingaffinity-example) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -101,6 +102,17 @@ both consumed capacity and the affinity values that lock a device to a particular sharing group, enabling it to **gate** remaining capacity and **pack** compatible workloads onto already-locked devices. 
+Alpha intentionally does **not** provide lock-aware fairness or lock-breaking +preemption. In addition, if a device already has active allocations whose +affinity cannot be reconstructed (for example, legacy claims created before the +feature was enabled), the scheduler treats that device conservatively and does +not place new `sharingAffinity` allocations on it until the device becomes +clean. + +`sharingAffinity` in this KEP refers specifically to compatibility for +co-allocation on a shared device; it is distinct from pod affinity, +anti-affinity, or topology-aware placement. + ## Motivation As AI and HPC workloads move toward higher density, hardware partitioning @@ -147,6 +159,12 @@ affinity values that determine sharing compatibility. This KEP closes that gap. the driver's responsibility) - Changing how capacity is tracked (that's KEP-5075) - Supporting affinity across different device types or pools +- Retrofitting affinity-aware sharing onto already-in-use devices when active + claims do not expose reconstructable affinity values. In alpha, such devices + are treated conservatively until they drain clean. +- Guaranteeing **lock-aware fairness** or **lock-breaking/preemption** in alpha. + Alpha enforces compatibility and improves packing, but does not yet guarantee + that a higher-priority Pod can displace an incompatible lock-holder. ## Proposal @@ -176,43 +194,43 @@ When the scheduler allocates a multi-allocatable device with `sharingAffinity`: 3. **Mismatch**: If values don't match, the device is skipped (try another device) 4. **Match**: If values match and capacity is available, allocation proceeds -<<[UNRESOLVED @pohly @johnbelamaric @sunya-ch @ritazh]>> -**Open Design Questions** +**Alpha Design Decisions** -**1. Placement of SharingAffinity: ResourceSlice (driver-side) vs DeviceRequest (claim-side)** +**1. Placement of SharingAffinity: ResourceSlice (driver-side)** This KEP places `SharingAffinity` on the ResourceSlice `Device` (driver- -defined). An alternative design places it on the `DeviceRequest` in the -ResourceClaim (user-defined). See [Alternatives: Claim-side SharingAffinity](#claim-side-sharingaffinity-on-devicerequest) for the trade-off analysis. +defined). We chose driver-side placement because the hardware modal constraint +is a property of the device, not the workload. The driver knows that "once a +NIC is configured for subnet A, it can only serve subnet A"—this is a +device-level constraint that should be declared once on the device. -We chose driver-side because the hardware modal constraint is a property of the -device, not the workload. The driver knows that "once a NIC is configured for -subnet A, it can only serve subnet A"—this is a device-level constraint that -should be declared once on the device. +An alternative design places `SharingAffinity` on the `DeviceRequest` in the +`ResourceClaim` (user-defined). See [Alternatives: Claim-side +SharingAffinity](#claim-side-sharingaffinity-on-devicerequest) for the +trade-off analysis. **2. How claims communicate affinity values to the scheduler** The driver declares `sharingAffinity.attributeKeys` on the device, telling the -scheduler which attribute keys constrain sharing. But the scheduler also needs -to know what affinity values a given claim *requests*. +scheduler which attribute keys constrain sharing. The scheduler learns the +requested values for those keys by decoding a **well-known JSON schema** stored +inside `OpaqueDeviceConfiguration`. 
The claim's opaque config (`DeviceConfiguration.Opaque.Parameters`) is a -`runtime.RawExtension`—raw bytes the scheduler cannot parse by design. -However, based on feedback from @pohly, the approach is to define a -**well-known JSON schema** that lives *inside* the opaque config. This avoids -any API changes to `DeviceConfiguration` while giving the scheduler a -decodable format for affinity-relevant parameters. - -**Approach: Well-known JSON schema inside `OpaqueDeviceConfiguration`** +`runtime.RawExtension`—raw bytes the scheduler cannot parse generically. For +this feature, drivers that want sharing affinity encode scheduler-readable +parameters using a community-governed JSON schema, similar to the pattern in +[k8s.io/dynamic-resource-allocation/api/metadata](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation/api/metadata). +This avoids any API changes to `DeviceConfiguration` while giving the scheduler +a decodable format for affinity-relevant parameters. -Drivers that want sharing affinity encode their config using a community-governed -JSON schema, similar to the pattern in -[`k8s.io/dynamic-resource-allocation/api/metadata`](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/dynamic-resource-allocation/api/metadata). -The scheduler recognizes this schema and can decode the affinity-relevant -parameters from the opaque blob. +**Approach: Well-known JSON schema inside OpaqueDeviceConfiguration** -The schema uses the same `DeviceAttribute` types already used in ResourceSlice, -providing a flat key-value map of qualified attribute names: +The schema reuses the same qualified key naming convention as `ResourceSlice` +attributes and follows a `DeviceAttribute`-like envelope. In **alpha**, +`sharingAffinity` matching is limited to **string-valued** attributes for the +keys referenced by `sharingAffinity.attributeKeys`, which keeps equality +semantics simple and aligns with the scheduler's in-memory lock representation. ```json { @@ -225,56 +243,100 @@ providing a flat key-value map of qualified attribute names: } ``` -The scheduler decodes this JSON from the opaque blob and extracts values for the -keys listed in the device's `sharingAffinity.attributeKeys`. The decoding -overhead is small compared to the overall scheduling effort. +The scheduler decodes this JSON from the opaque blob and extracts **string** +values for the keys listed in the device's `sharingAffinity.attributeKeys`. The +decoding overhead is small compared to the overall scheduling effort. -If drivers also need additional, differently structured configuration parameters -(e.g., MTU, QoS settings), users provide **two** config entries in the claim: -one using the standard schema (scheduler reads) and one using the vendor format -(driver reads). The scheduler only considers configurations matching the -well-known schema. +If drivers also need additional, differently structured configuration +parameters (e.g., MTU, QoS settings), users provide **two** config entries in +the claim: one using the standard schema (scheduler reads) and one using the +vendor format (driver reads). The scheduler only considers configurations +matching the well-known schema. 
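+
+As a concrete sketch of that selection rule, a helper along the following
+lines could pick out the single recognized entry for a request. The helper
+names are illustrative, and the exact integration point in the DRA scheduler
+plugin is out of scope here; only the `k8s.io/api/resource/v1` types are real:
+
+```go
+// Illustrative sketch only; not a proposed API.
+package sharingaffinity
+
+import (
+    "encoding/json"
+    "fmt"
+
+    resourceapi "k8s.io/api/resource/v1"
+)
+
+type payloadMeta struct {
+    APIVersion string `json:"apiVersion"`
+    Kind       string `json:"kind"`
+}
+
+// recognizedEntry returns the raw payload of the one structured-parameters
+// config entry that targets the given request. None, or more than one,
+// makes the request ineligible for sharingAffinity devices.
+func recognizedEntry(configs []resourceapi.DeviceClaimConfiguration, request string) ([]byte, error) {
+    var found []byte
+    for _, c := range configs {
+        if c.Opaque == nil || c.Opaque.Driver != "resource.k8s.io" {
+            continue // vendor-specific opaque config: ignored by the scheduler
+        }
+        if !targetsRequest(c.Requests, request) {
+            continue
+        }
+        var meta payloadMeta
+        if err := json.Unmarshal(c.Opaque.Parameters.Raw, &meta); err != nil {
+            return nil, fmt.Errorf("malformed payload for request %q: %w", request, err)
+        }
+        if meta.APIVersion != "resource.k8s.io/v1alpha1" || meta.Kind != "StructuredParameters" {
+            continue // reserved driver name, but not the well-known schema
+        }
+        if found != nil {
+            return nil, fmt.Errorf("duplicate StructuredParameters entries for request %q", request)
+        }
+        found = c.Opaque.Parameters.Raw
+    }
+    if found == nil {
+        return nil, fmt.Errorf("no StructuredParameters entry for request %q", request)
+    }
+    return found, nil
+}
+
+// targetsRequest mirrors the existing config-scoping rule: an empty Requests
+// slice applies the entry to every request in the claim.
+func targetsRequest(requests []string, request string) bool {
+    if len(requests) == 0 {
+        return true
+    }
+    for _, r := range requests {
+        if r == request {
+            return true
+        }
+    }
+    return false
+}
+```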
**Key advantages:** -- **No API changes** to `DeviceConfiguration` — the feature uses existing opaque - config with a well-known schema +- **No API changes** to `DeviceConfiguration` — the feature uses existing + opaque config with a well-known schema - **No duplication** for simple cases — the driver can read the same structured parameters it programs (e.g., subnet, PKey) -- **No `SharingAffinityMapping`** — no webhook polyfill or admission controller - needed - **Extensible** — the well-known schema can support future scheduler-readable hints beyond sharing affinity -**3. SharingStrategy: Should claims control lock-setting behavior?** - -When a claim provides affinity values (via the well-known structured parameters -schema), should it also declare whether it is -allowed to *set* a lock on a clean device, or whether it can only *join* an -existing lock? Two initial strategies are proposed: - -- `CanSetLock` (default): The claim can land on a clean device and establish - the lock. Standard behavior for primary workloads. -- `NeverSetLock`: The claim can only be allocated to a device that already has - a matching lock established by another claim. Useful for background or batch - jobs that should never "poison" a clean device. - -If included, this field would live on the claim side — either in the well-known -structured parameters schema or as an additional field in the opaque config. - -**Filter ordering**: The scheduler checks `NeverSetLock` **before** evaluating -capacity or key matching. If the device is unlocked and the strategy is -`NeverSetLock`, the device is rejected immediately — there is no need to -compute capacity for a device the claim is fundamentally ineligible for. This -also produces a clear rejection reason (`SharingAffinityNeverSetLockOnCleanDevice`) -without noise from capacity evaluation. The full Filter order becomes: - -1. If `Strategy == NeverSetLock` AND device is unlocked → reject -2. Sufficient consumable capacity (KEP-5075) -3. All required `attributeKeys` present in claim's structured parameters -4. Values match existing lock - -This could be deferred to beta if the alpha scope is too large. -<<[/UNRESOLVED]>> +**Alpha StructuredParameters Contract** + +For alpha, the scheduler-readable structured-parameters format is a +scheduler-recognized contract with the following rules: + +1. **Recognition**: The scheduler recognizes a config entry as structured + parameters only when the `opaque.driver` is `resource.k8s.io` **and** the + embedded payload has `apiVersion: resource.k8s.io/v1alpha1` and + `kind: StructuredParameters`. +2. **Per-request uniqueness**: For a given request, there must be **at most + one** structured-parameters config entry targeted at that request. Multiple + matching entries for the same request are invalid. +3. **Coexistence with driver config**: The structured-parameters entry may + coexist with one or more driver-specific opaque config entries for the same + request. The scheduler reads only the recognized structured-parameters + entry. The driver-specific entries are ignored by the scheduler. +4. **Conflict handling**: If the same logical setting is encoded both in + `StructuredParameters` and in driver-specific config, the scheduler uses only + the `StructuredParameters` value for placement decisions and does not attempt to + compare or reconcile the driver-specific opaque payload. 
If a conflict exists, + the driver should reject the request during `NodePrepareResources` with a clear error, + rather than silently accepting divergent values. +5. **String-only affinity values in alpha**: For any key referenced by + `sharingAffinity.attributeKeys`, the recognized structured-parameters entry + must provide a `string` value in alpha. Other value types are not matched in + alpha and are treated as invalid for `sharingAffinity` scheduling. +6. **Malformed payloads**: If a recognized structured-parameters entry is + malformed, has the wrong schema, or cannot be decoded, it is treated as + invalid for scheduling purposes. +7. **Missing recognized entry**: If a claim targets a device with + `sharingAffinity` but does not provide a recognized structured-parameters + entry for that request, the device is filtered out. This does not make the + claim universally unschedulable. It only makes the claim ineligible for devices + that declare `sharingAffinity`. If all feasible devices for the request declare + `sharingAffinity`, then the request may remain unschedulable until a recognized + `StructuredParameters` entry is provided or non-sharing-affinity capacity is available. +8. **Validation intent**: API validation should reject malformed or duplicate + structured-parameters entries when feasible. The scheduler must still + handle invalid persisted objects defensively and deterministically. + +In alpha, this keeps the contract explicit without introducing a new API field: +the scheduler depends only on a single, community-governed, recognized payload +shape and ignores all other opaque config. + +For alpha, `StructuredParameters` is a **scheduler-recognized sub-protocol** +defined by this KEP. The scheduler interprets only payloads explicitly +recognized as `opaque.driver: resource.k8s.io` together with the embedded +`apiVersion`/`kind` for `StructuredParameters`; all vendor-defined opaque +payloads remain opaque. The sub-protocol is versioned via the embedded +`apiVersion` and future revisions must define compatibility and upgrade +behavior explicitly. + +**Alpha scope** + +Alpha fully resolves the design around **driver-side placement** and the +**structured-parameters approach** described above. Claims do not control +lock-setting behavior in alpha: any compatible claim may establish the initial +lock on a clean device. Claim-side lock-setting policy (for example, +`CanSetLock`/`NeverSetLock`) is deferred to [Future +Enhancements](#future-enhancements). + +In other words, alpha standardizes **driver-declared compatibility keys**, a +**scheduler-recognized structured-parameters contract**, and **correct lock +enforcement / packing behavior** — but intentionally stops short of +lock-aware fairness, lock-breaking policy, or preemption semantics. + +**Alpha limitations** + +Alpha provides **correct lock enforcement** and **better packing**, but it does +not provide lock-aware fairness or lock-breaking semantics. A lower-priority +Pod may continue holding a device lock even when the device still has nominal +capacity and a higher-priority Pod needs the same device with a different +affinity value. In that case the higher-priority Pod may remain unschedulable +until a compatible alternative appears or the lock-holder exits. This is an +expected alpha limitation, not a correctness bug, and is addressed later under +[Future Enhancements: Priority-based Lock Preemption](#priority-based-lock-preemption). ### User Stories @@ -314,43 +376,65 @@ but only if pods belong to the same subnet. 
The driver sets ### Notes/Constraints/Caveats -- **Affinity is set by the first claim**: Once a device is allocated with an affinity value, that value is locked until all claims release the device -- **Attribute keys must be declared**: The device's `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; claims must provide values for all of these keys in the well-known structured parameters or the device is filtered out -- **Multiple keys**: If multiple attribute keys are specified, ALL must match (both presence and value) +- **Affinity is set by the first compatible claim on a clean device**: Once a + device is allocated with an affinity value, that value is locked until all + claims release the device. +- **Attribute keys must be declared**: The device's + `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; + claims must provide values for all of these keys in the well-known structured + parameters or the device is filtered out. +- **Multiple keys**: If multiple attribute keys are specified, ALL must match + (both presence and value). - **Extra keys in claim**: If a claim's structured parameters contain keys beyond what the device declares in `attributeKeys`, the extra keys are **ignored** for that device. Only the device's declared keys are evaluated. This allows "generic" claims to work across devices with different sharing requirements (e.g., a claim with both `subnet` and `vlan` can match a device that only - constrains on `subnet`) + constrains on `subnet`). +- **String-only matching in alpha**: For keys referenced by + `sharingAffinity.attributeKeys`, the scheduler only matches `string` values + in alpha. If a required key is present with a non-string value, the device is + filtered out for that claim. - **Missing keys in claim**: If the claim does not provide a value for a key the device declares in `attributeKeys`, the device is **filtered out** (see - Filter phase) + Filter phase). +- **Malformed structured parameters**: If the scheduler-recognized + `StructuredParameters` entry is malformed, undecodable, or uses the wrong + schema, it is treated as invalid and the claim cannot use devices that rely + on `sharingAffinity`. +- **Duplicate structured parameters for one request**: If more than one + recognized `StructuredParameters` config entry targets the same request, the + claim is treated as invalid for `sharingAffinity` scheduling until corrected. - **Multi-request claims (per-request scoping)**: If a claim requests multiple devices (e.g., `mgmt-nic` and `data-nic`), each `DeviceClaimConfiguration` block targets specific requests via its `requests` slice. Different config blocks can specify different structured parameters for different requests. This means `mgmt-nic` can be locked to Subnet-A while `data-nic` is locked to Subnet-B within the same claim — there is no cross-talk between requests. -- **Empty affinity**: Devices without `sharingAffinity` behave as before (any claim can share) -- **Grandfathered claims**: Pre-existing claims without structured parameters (created - before the feature was enabled) do not participate in affinity matching but do - not block new claims from establishing a lock. See lock precedence table below. +- **Empty affinity**: Devices without sharingAffinity do not participate in + affinity-based gating; those may get allocated by claims that do not specify affinity. 
+- **Legacy allocations with unknown affinity are conservative in alpha**: + If a device has active allocations for which the scheduler cannot reconstruct + the required affinity values (for example, claims created before the feature + was enabled or invalid persisted claims), that device is treated as having + **unknown affinity state** and is filtered out for new `sharingAffinity` + scheduling until it becomes fully clean. -#### Lock Precedence with Grandfathered Claims +#### Handling Legacy Claims with Unknown Affinity | Device State | New Claim | Result | |---|---|---| -| 5 grandfathered claims, no lock set | Claim with `subnet: A` | Lock set to `subnet: A`; device now locked | -| 5 grandfathered + locked to `subnet: A` | Claim with `subnet: A` | Allowed (values match) | -| 5 grandfathered + locked to `subnet: A` | Claim with `subnet: B` | **Rejected** (mismatch with lock) | -| 5 grandfathered + locked to `subnet: A` | Claim without structured parameters | **Rejected** (missing required key) | -| Only grandfathered claims remain, new claims released | — | Lock cleared; device returns to unlocked | -| All claims released (grandfathered + new) | — | Device fully clean | +| 5 legacy claims, affinity unknown | Claim with `subnet: A` | **Filtered out**. Existing allocations have unknown affinity, so no new `sharingAffinity` lock may be established yet. | +| 5 legacy claims, affinity unknown | Claim without structured parameters | **Filtered out**. Missing required scheduler-readable affinity information. | +| Legacy claims drained; device now clean | Claim with `subnet: A` | Lock set to `subnet: A`; device now locked. | +| Device locked to `subnet: A` | Claim with `subnet: A` | Allowed (values match). | +| Device locked to `subnet: A` | Claim with `subnet: B` | **Rejected** (mismatch with lock). | +| All claims released | — | Device fully clean and eligible to establish a new lock. | + +Legacy claims continue to run and are not evicted. However, until all unknown +allocations on a `sharingAffinity` device are released, the scheduler does not +assume it knows the device's effective modal state. -Grandfathered claims are "transparent" to the lock — they neither set it nor -conflict with it. The lock is defined entirely by claims that provide structured -parameters matching the well-known schema. ### Risks and Mitigations @@ -380,28 +464,32 @@ locked to the wrong value. **Mitigation (Alpha)**: The scoring plugin reduces the probability by packing compatible workloads and preserving clean devices. However, this is a soft -mitigation — it does not guarantee that a clean device will always be available. +mitigation — it does not guarantee that a clean device will always be available, +and alpha does **not** guarantee lock-aware fairness. + +**Alpha limitation**: In alpha, a lower-priority Pod may continue to hold a +lock that blocks a higher-priority incompatible Pod even when nominal capacity +remains on the device. This is an expected limitation of the alpha scope rather +than a correctness bug. **Mitigation (Beta)**: Lock-aware preemption (see [Beta graduation criteria](#beta)) will teach the scheduler's PostFilter phase to detect affinity mismatch as a preemption-solvable problem and identify lock-holder Pods as preemption victims. -#### Stale Affinity View +#### Cache Staleness and Delayed Release Visibility -**Risk**: When a claim is released externally (pod completes, user deletes pod, -kubelet eviction), the scheduler learns about it asynchronously via its -informer watch. 
There is a brief propagation delay (typically milliseconds, but -potentially seconds under load) between the API server state changing and the -informer callback updating the scheduler's cache. During this window, the -scheduler may still see a device as "locked" when it is actually clean, causing -it to unnecessarily skip the device for one scheduling cycle. +**Risk**: Like other informer-based scheduler state, sharing-affinity lock state may +briefly lag external claim release, pod deletion, eviction, or ResourceSlice updates. +Because the scheduler maintains derived lock state in its internal cache, there can be +a short propagation window in which a device is still observed as locked or in unknown-affinity +state after the underlying API state has changed. During that window, the scheduler may +conservatively skip the device for a scheduling cycle. This is not unique to sharing affinity; +it is the feature-specific manifestation of normal cache propagation delay in scheduler-managed +state. The result is a temporary loss of placement optimality rather than a correctness violation -**Mitigation**: For scheduler-driven releases (Unreserve on binding failure or -preemption), the cache is updated immediately with no staleness. For external -releases, the informer eventually reconciles the state. This is the same -propagation delay that affects all informer-based caches in Kubernetes and is -not unique to this feature. The worst case is a briefly suboptimal scheduling -decision, not a correctness bug. +**Mitigation**: For scheduler-driven transitions such as Reserve / Unreserve, the cache is updated +immediately. For externally driven transitions, informer reconciliation eventually converges the +state. This matches the existing consistency model used elsewhere in scheduler and DRA cache-based decisions. #### Unexpected Affinity Values @@ -421,11 +509,12 @@ attribute values are accepted where domain-specific validation is feasible. **Risk**: Affinity values accumulate in `AllocatedState`, increasing memory usage. -**Mitigation**: Affinity values are small strings (max 64 characters per -`DeviceAttribute` value), capped at 8 attribute keys per device. Per-device -overhead is bounded at 8 key-value pairs in `AllocatedState.AffinityValues`, -and entries are cleared when all claims release the device. The total overhead -is proportional to active shared allocations, not total devices. +**Mitigation**: In alpha, affinity values are stored as small strings (for +example subnet or PKey identifiers), capped at 8 attribute keys per device. +Per-device overhead is bounded at 8 key-value pairs in +`AllocatedState.AffinityStates`, and entries are cleared when all claims release +the device. The total overhead is proportional to active shared allocations, +not total devices. ## Design Details @@ -455,6 +544,10 @@ type DeviceSharingAffinity struct { // AttributeKeys lists the fully-qualified device attribute names that // must have matching values across all claims sharing this device. // + // In alpha, the corresponding values must be provided as strings in the + // recognized StructuredParameters entry. Support for additional value types + // is deferred. + // // When the first claim is allocated to this device, the affinity values // for these keys are recorded in AllocatedState. Subsequent claims can // only share the device if their affinity values match exactly. @@ -495,6 +588,23 @@ driver is responsible for device lifecycle — tearing down the old configuratio `NodePrepareResources`). 
The scheduler does not track hardware reconfiguration state. +##### Safety Model and Responsibility Split + +This feature intentionally keeps **placement knowledge** and **hardware +enforcement** separate: + +- **Scheduler guarantee**: when it has recognized structured parameters for all + active allocations on a `sharingAffinity` device, it will not intentionally + co-place claims with incompatible affinity values on that device. +- **Conservative fallback**: if the scheduler cannot reconstruct the effective + affinity state of a device (for example, due to legacy or invalid persisted + claims), it treats that device as **unknown** and filters it out for new + `sharingAffinity` placements until the device becomes clean. +- **Driver guarantee**: the driver remains the final authority for programming + and validating the actual hardware mode during `NodePrepareResources`. +- **Failure handling**: stale scheduler state or races may still cause prepare- + time rejection, and that rejection remains the final safety backstop. + ##### Cache Extension: Effective Device State To prevent race conditions during high-volume scheduling, the scheduler @@ -507,6 +617,7 @@ A device's effective state is a derived value: ``` Effective State = ResourceSlice (device definition + attributeKeys) + Active Claims (structured parameters decoded from opaque config) + + Unknown affinity claims (AffinityStates[deviceID].Unknown marks non-reconstructable state) + AssumedClaims (tentative locks from current scheduling cycle) ``` @@ -514,42 +625,72 @@ The scheduler's `AllocatedState` is extended to track affinity values alongside consumed capacity: ```go +type AffinityState struct { + // Unknown indicates that one or more active claims on the device do not + // expose reconstructable affinity values. When true, the device is filtered + // for new sharing-affinity placements until fully clean. + Unknown bool + + // LockedAffinity stores the known lock for a device when Unknown is false. + // Empty means the device is clean/unlocked. + LockedAffinity map[string]string +} + type AllocatedState struct { AllocatedDevices sets.Set[DeviceID] AllocatedSharedDeviceIDs sets.Set[SharedDeviceID] AggregatedCapacity ConsumedCapacityCollection - // AffinityValues tracks the locked affinity values for shared devices. - // Key is DeviceID, value is a map of attribute key to locked value. - // Set tentatively during Reserve, hardened on successful bind, - // cleared on Unreserve or when all claims release. // +featureGate=DRASharingAffinity - AffinityValues map[DeviceID]map[string]string + AffinityStates map[DeviceID]AffinityState } ``` + ##### Filter and Score Phases **Filter phase**: For a given node, the scheduler evaluates each device. A device with `sharingAffinity` is a candidate ONLY if: 1. It has sufficient consumable capacity (KEP-5075) -2. The claim provides values for ALL keys in `sharingAffinity.attributeKeys` - (missing key → device is **not** a candidate). The scheduler extracts these - values by decoding the well-known JSON schema from the claim's opaque config. -3. The device's `AffinityValues` is either empty (unlocked) OR matches the +2. The device's `AffinityStates[deviceID].Unknown` is **not** true +3. The claim has **exactly one** scheduler-recognized `StructuredParameters` + config entry targeting the relevant request +4. That entry can be decoded successfully using the well-known schema +5. The claim provides values for ALL keys in `sharingAffinity.attributeKeys` + (missing key → device is **not** a candidate) +6. 
For each required affinity key, the recognized entry provides a **string** + value (non-string values are invalid in alpha) +7. The device's `AffinityStates[deviceID].LockedAffinity` is either empty (unlocked) OR matches the claim's affinity values for ALL keys -If a claim does not provide a structured parameter entry for a required attribute -key, the device is filtered out. This is the safe default: the driver declared -that sharing requires a specific parameter, and a claim that omits it cannot be -properly configured. Claims that do not need sharing-constrained devices should -target devices without `sharingAffinity`. +The scheduler identifies the structured-parameters entry by `opaque.driver: +resource.k8s.io` plus `apiVersion: resource.k8s.io/v1alpha1` and +`kind: StructuredParameters` in the embedded payload. Driver-specific config +entries are ignored by the scheduler. + +If a device has `AffinityStates[deviceID].Unknown == true`, or if a required request has no +recognized structured-parameters entry, more than one recognized entry, an +entry that fails schema/decoding checks, or a required affinity key with a +non-string value, the device is filtered out for `sharingAffinity` +scheduling. This is the safe default: the driver declared that sharing +requires specific scheduler-readable parameters, and a scheduler that cannot +reconstruct the current or requested affinity state cannot evaluate placement +safely. Claims that do not need sharing-constrained devices should target +devices without `sharingAffinity`. + +**Score phase**: The normative ordering in alpha is: + +1. A device already locked to a compatible affinity value scores highest. +2. An otherwise equivalent clean (unlocked) device scores lower. +3. An incompatible locked device, or a device with `AffinityStates[deviceID].Unknown == true`, is + not scored because it was already filtered out. + +This preserves unlocked devices for future workloads with different affinity +values, minimizing fragmentation. The exact score weight is not in scope of the alpha; +the required behavior is that a compatible locked device is preferred over an otherwise +equivalent clean device. -**Score phase**: Nodes where the `AffinityValues` already match the request -are scored **higher** than nodes with "clean" (unlocked) devices. This -preserves unlocked devices for future workloads with different affinity -values, minimizing fragmentation. ##### Reserve Phase: Tentative Locking @@ -560,10 +701,10 @@ lock" in the scheduler cache before the Binding phase: 2. If device has no existing allocations (unlocked): - Extract affinity values for `sharingAffinity.attributeKeys` from the claim's structured parameters (decoded from opaque config) - - Record values in `AllocatedState.AffinityValues[deviceID]` + - Record values in `AllocatedState.AffinityStates[deviceID].LockedAffinity` - Proceed with allocation (device is now tentatively locked) 3. If device has existing allocations (locked): - - Compare claim's affinity values against `AllocatedState.AffinityValues[deviceID]` + - Compare claim's affinity values against `AllocatedState.AffinityStates[deviceID].LockedAffinity` - If all keys match: proceed with allocation (pack onto locked device) - If any key mismatches: skip this device, try next candidate @@ -577,11 +718,11 @@ and either join it or skip the device. 
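+
+To make the tentative check-and-set concrete, the following sketch uses the
+`AffinityStates` map defined above. The function name is illustrative, and the
+caller is assumed to hold the `AllocatedState` mutex discussed below:
+
+```go
+// Illustrative sketch of the Reserve-phase check-and-set; assumes the
+// caller already holds the AllocatedState mutex.
+func tryTentativeLock(state *AllocatedState, deviceID DeviceID, claimValues map[string]string) bool {
+    entry := state.AffinityStates[deviceID]
+    if entry.Unknown {
+        return false // unknown affinity state: filtered out in alpha
+    }
+    if len(entry.LockedAffinity) == 0 {
+        // Clean device: this claim establishes the tentative lock.
+        state.AffinityStates[deviceID] = AffinityState{LockedAffinity: claimValues}
+        return true
+    }
+    for key, locked := range entry.LockedAffinity {
+        if claimValues[key] != locked {
+            return false // mismatch with existing lock: skip this device
+        }
+    }
+    return true // compatible: pack onto the already-locked device
+}
+```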
This follows the same pattern used by | Event | Cache Action | Result | |-------|-------------|--------| -| Pod scheduled (Reserve) | Add tentative lock to `AffinityValues` | Device becomes tentatively locked | +| Pod scheduled (Reserve) | Set `AffinityStates[deviceID].LockedAffinity` | Device becomes tentatively locked | | Binding success (PreBind) | Transition tentative lock to hardened | Lock is confirmed | | Binding failure / Preemption | Trigger Unreserve; remove tentative lock | Lock is released (if no other claims share it) | | Driver update (ResourceSlice) | Reconcile cache with API state | Cache refreshed; redundant tentative locks purged | -| All claims released | Clear `AffinityValues[deviceID]` | Device becomes unlocked | +| All claims released | Clear `AffinityStates[deviceID]` | Device becomes unlocked | ##### Handling the "First Pod" Problem @@ -589,7 +730,7 @@ The first Pod to land on an unlocked device defines the affinity lock for all subsequent consumers. This introduces a risk: a low-priority Pod with a rare affinity value could "poison" a high-capacity device. -**Lock origin**: If `AffinityValues[deviceID]` is empty, the scheduler takes +**Lock origin**: If `AffinityStates[deviceID].LockedAffinity` is empty, the scheduler takes the affinity values from the current Pod's ResourceClaim and writes them to the cache. All subsequent Pods must match these values to share the device. @@ -610,7 +751,7 @@ PodAffinity currently handle "assumed" states. pods may reach the Filter phase concurrently. Without protection, two pods with *different* affinities could both pass Filter for the same clean device in the same millisecond. To prevent this, all reads and writes to -`AllocatedState.AffinityValues` must be protected by the `AllocatedState` mutex. +`AllocatedState.AffinityStates` must be protected by the `AllocatedState` mutex. The Filter phase acquires a read lock to check the current affinity state; the Reserve phase acquires a write lock to set the tentative lock atomically. This ensures that once one pod's Reserve completes, the next pod's Filter sees the @@ -618,32 +759,37 @@ updated lock. ##### Scheduler Restart: State Reconstruction -On scheduler restart, the in-memory `AffinityValues` map is empty. The scheduler +On scheduler restart, the in-memory `AffinityStates` map is empty. The scheduler must reconstruct affinity locks from persisted state before the first scheduling cycle begins. **Reconstruction algorithm**: 1. On startup, the scheduler iterates all `Bound` ResourceClaims (same path as - existing `GatherAllocatedState()` for capacity reconstruction) + existing `GatherAllocatedState()` for capacity reconstruction). 2. For each bound claim, check if the allocated device has `SharingAffinity` - defined in the corresponding ResourceSlice -3. If yes, decode the claim's opaque config using the well-known JSON schema - and extract the structured parameters; populate - `AffinityValues[deviceID]` with the key-value pairs for the declared - attribute keys -4. If yes but the claim has **no** well-known structured parameters - (grandfathered claim from before the feature was enabled), skip it — do not - populate affinity for this claim. The lock will be established by the next - new claim that provides structured parameters. -5. If multiple claims share the same device, verify their values are consistent - (they must be, by construction—but log a warning if not) + defined in the corresponding ResourceSlice. +3. 
If yes, attempt to decode the claim's opaque config using the well-known JSON + schema and extract the required structured parameters. +4. If decoding succeeds and all required affinity keys are present as strings, + populate `AffinityStates[deviceID].LockedAffinity` with those values. +5. If the claim has **no** recognized `StructuredParameters` entry, malformed + structured parameters, non-string values for a required affinity key, or + multiple recognized structured-parameters entries for the same request, + set `AffinityStates[deviceID].Unknown = true` and log a warning. The scheduler + must not infer lock state from ambiguous or invalid data. +6. If multiple claims share the same device and any one of them causes the + device to become unknown, the device remains with `AffinityStates[deviceID].Unknown == true` + until all claims on that device are released. +7. If multiple reconstructable claims share the same device, verify their values + are consistent (they must be, by construction—but log a warning if not). This follows the same pattern used to reconstruct `AggregatedCapacity` from bound claims on startup. No new API calls are needed; the data is already available from the ResourceClaim spec and ResourceSlice spec cached by the scheduler's informers. + ### Examples #### ResourceSlice with Sharing Affinity @@ -719,6 +865,59 @@ spec: > driver config that only the driver reads. For simple cases where the driver > can read both, a single well-known config block may be sufficient. +#### Multi-key SharingAffinity Example + +This example illustrates the alpha semantics when a device constrains sharing on +**multiple keys**. + +A driver advertises a shared RDMA-capable NIC where both **subnet** and **PKey** +must match for pods to share the same device: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + devices: + - name: mlx5_0 + allowMultipleAllocations: true + sharingAffinity: + attributeKeys: + - networking.example.com/subnet + - networking.example.com/pkey + capacity: + networking.example.com/slots: + value: "16" +``` + +A matching claim provides both values in the scheduler-recognized structured +parameters: + +```json +{ + "apiVersion": "resource.k8s.io/v1alpha1", + "kind": "StructuredParameters", + "attributes": { + "networking.example.com/subnet": {"string": "subnet-a"}, + "networking.example.com/pkey": {"string": "0x8001"}, + "networking.example.com/vlan": {"string": "100"} + } +} +``` + +Alpha matching behavior: + +- If the device is clean, the first compatible claim sets the lock to: + - `subnet = subnet-a` + - `pkey = 0x8001` +- A later claim with the **same** `subnet` and **same** `pkey` may share the + device. +- A claim with `subnet = subnet-a` but `pkey = 0x8002` is rejected for that + device because **all declared keys must match**. +- A claim that provides only `subnet` but omits `pkey` is rejected for that + device because **missing declared keys are invalid**. +- The extra `vlan` key is ignored for this device because the driver did not + declare `networking.example.com/vlan` in `attributeKeys`. + ### Test Plan [x] I/we understand the owners of the involved components may require updates to @@ -739,16 +938,30 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. 
- Filter: claim missing a required `attributeKey` → device filtered out - Filter: claim with extra keys beyond device's declared `attributeKeys` → extra keys ignored, device passes if declared keys match + - Filter: no recognized `StructuredParameters` entry for a sharing-constrained + request → device filtered out + - Filter: malformed recognized `StructuredParameters` payload → device filtered out + - Filter: duplicate recognized `StructuredParameters` entries for one request → + device filtered out + - Filter: non-string value for a required affinity key → device filtered out + - Filter: device with `AffinityStates[deviceID].Unknown == true` is excluded for new + `sharingAffinity` scheduling - Score: locked-compatible device scores higher than clean device - Reserve: first claim sets lock; second claim with same values succeeds - Reserve: second claim with conflicting values fails - Unreserve: tentative lock is rolled back - - Grandfathered claims: pre-existing claims without structured parameters neither set - nor conflict with locks - - Lock precedence: all 6 scenarios from the Lock Precedence table -- `staging/src/k8s.io/api/resource/v1`: Coverage for new API types, including: + - Legacy claims with non-reconstructable affinity cause the device to be marked + unknown rather than establishing or joining a lock + - Legacy-claim handling: all scenarios from the `Handling Legacy Claims with + Unknown Affinity` table +- `staging/src/k8s.io/api/resource/v1`: Coverage for new API types and the + recognized structured-parameters contract, including: - Validation: `attributeKeys` exceeding max 8 limit is rejected - Validation: structured parameters exceeding max 8 attributes is rejected + - Validation: duplicate recognized structured-parameters entries for the same + request are rejected when validation can detect them + - Validation: non-string values for keys referenced by `sharingAffinity` + are rejected when validation can detect them - Round-trip serialization of `SharingAffinity` and well-known schema ##### Integration tests @@ -757,22 +970,27 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. 
- Affinity mismatch causing allocation to different device - Affinity lock clearing when all claims release a device - Interaction with consumable capacity constraints (KEP-5075) -- Scheduler restart: `AffinityValues` correctly reconstructed from existing - bound ResourceClaims (including skipping grandfathered claims without - structured parameters) +- Scheduler restart: `AffinityStates` correctly reconstructed from existing + bound ResourceClaims, and devices with non-reconstructable active claims have + `AffinityStates[deviceID].Unknown == true` - Parallel scheduling: two Pods with conflicting affinity values targeting the same device — one wins Reserve, the other is requeued - Feature gate disabled: `sharingAffinity` fields are ignored; devices are treated as unconditionally shareable - Feature gate toggled: enabling after claims exist does not disrupt already-bound - workloads + workloads, and legacy in-use devices are conservatively filtered until clean +- Invalid structured parameters: malformed payload or duplicate recognized + entries for one request do not crash scheduling and deterministically exclude + sharing-constrained devices +- Invalid value type: a required affinity key encoded as a non-string value is + rejected for `sharingAffinity` scheduling and does not populate lock state - **Ghost Lock**: Pod is Assumed (tentative lock set) but Bind fails — verify the lock is cleared immediately and the next Pod in the queue can claim the device with a different affinity value -- **Grandfather Migration**: 5 Pods running on a NIC with no lock; driver - updates ResourceSlice to add `sharingAffinity`; 6th Pod scheduled with - structured parameters — verify the 6th Pod succeeds, sets the lock, and the - original 5 Pods are not disrupted +- **Legacy Device Migration**: 5 Pods are already running on a NIC; the driver + updates `ResourceSlice` to add `sharingAffinity`; a 6th Pod arrives with + structured parameters — verify the device has `AffinityStates[deviceID].Unknown == true` + and the 6th Pod is filtered from that device until all legacy claims drain - **Partial Key**: Device requires `subnet` and `pkey` in `attributeKeys`; claim provides only `subnet` — verify the device is filtered out @@ -798,6 +1016,10 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Scheduler tracks affinity in AllocatedState - Unit and integration tests - Documentation for driver authors +- Alpha documentation explicitly calls out the lack of lock-aware fairness and + the absence of lock-breaking/preemption semantics for incompatible locks +- Alpha documentation explicitly calls out string-only affinity matching and + the rejection of non-string values for `sharingAffinity` keys #### Beta @@ -824,24 +1046,24 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. This can happen during driver upgrades or when enabling the feature on existing hardware. The scheduler handles this as follows: -- **Pre-existing claims without structured parameters** are grandfathered: they do not - participate in affinity matching. The scheduler skips them when reconstructing - the `AffinityValues` map. -- **The lock is established by the first *new* claim** that provides structured - parameters for the required attribute keys after the `sharingAffinity` - field is added. -- **Pre-existing claims continue to run** and are not evicted. 
The driver is - responsible for ensuring that already-configured VFs/resources remain - functional regardless of the new affinity constraint. -- **On release of all claims** (both old and new), the device returns to a clean - unlocked state and subsequent allocations enforce affinity normally. +- **Pre-existing claims continue to run** and are not evicted. +- If any active claim on that device does **not** provide reconstructable + affinity values for the required keys, the scheduler marks the device as + `AffinityStates[deviceID].Unknown = true`. +- A device with `AffinityStates[deviceID].Unknown == true` is **not eligible** for new + `sharingAffinity` placements, even if it still has nominal shared capacity. +- **Once all active claims on that device are released**, the device becomes + clean and subsequent allocations can establish and enforce affinity normally. +- Drivers enabling this feature on existing hardware should prefer doing so on + clean devices, because alpha intentionally chooses conservative correctness + over mid-flight reuse of devices whose effective modal state is unknown. > **Note**: The API server does not cross-validate ResourceSlice updates against > active ResourceClaims. Enforcing "no `sharingAffinity` changes while claims > are active" would require a new admission controller with cross-object > validation, which is fragile and out of scope for this KEP. Drivers should > avoid adding `sharingAffinity` mid-flight when possible, but the scheduler -> must handle it gracefully when it occurs. +> must handle it safely when it occurs. **Downgrade**: If a ResourceSlice with `sharingAffinity` exists and the feature gate is disabled: - API server rejects updates to the field @@ -850,12 +1072,28 @@ hardware. The scheduler handles this as follows: ### Version Skew Strategy -- **kube-apiserver**: Must be upgraded first to accept new API field -- **kube-scheduler**: If scheduler is older, it ignores `sharingAffinity` (permissive) -- **kubelet**: No changes required; kubelet doesn't interpret sharing affinity -- **DRA driver**: Driver defines the field but doesn't enforce it; scheduler does - -During skew, the worst case is permissive sharing (old scheduler ignores affinity). Drivers should handle conflicting configs at prepare time as a fallback. +- **kube-apiserver**: Must be upgraded first to accept the new + `sharingAffinity` API field on `ResourceSlice`. +- **kube-scheduler**: + - A scheduler that understands this feature enforces `sharingAffinity`, tracks + `AffinityStates`, and may conservatively mark devices with + `AffinityStates[deviceID].Unknown == true` when effective affinity cannot be reconstructed. + - An older scheduler ignores `sharingAffinity`. In that skew case, placement + may be overly permissive and the DRA driver remains the final safety + backstop during `NodePrepareResources`. +- **kubelet**: No changes required; kubelet does not interpret + `sharingAffinity`. +- **DRA driver**: + - Drivers may publish `sharingAffinity` and may optionally surface current + lock-related state for observability. + - Drivers must continue validating actual hardware compatibility at prepare + time, especially during skew where an older scheduler may not enforce the + affinity constraints. + +During version skew, the main outcomes are **permissive scheduling** by an older +scheduler or **conservative filtering** by a newer scheduler when affinity state +cannot be reconstructed. 
Both are operationally safe as long as the driver +continues rejecting incompatible prepare-time configurations. ## Production Readiness Review Questionnaire @@ -874,14 +1112,20 @@ No. Devices without `sharingAffinity` behave exactly as before. The feature only ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Yes. Disabling the feature gate causes: -- API server to reject new/updated ResourceSlices with `sharingAffinity` -- Scheduler to ignore existing `sharingAffinity` fields (permissive sharing) +- API server to reject new/updated `ResourceSlice`s with `sharingAffinity` +- Scheduler to ignore existing `sharingAffinity` fields for future placement + decisions -Existing allocations continue to work. New allocations may allow incompatible sharing, which drivers should handle at prepare time. +Existing allocations continue to work. New allocations may become more +permissive, so the driver must continue validating compatibility at prepare +time. ###### What happens if we reenable the feature if it was previously rolled back? -The scheduler resumes enforcing `sharingAffinity`. Existing allocations are not affected. New allocations will respect affinity constraints. +The scheduler resumes enforcing `sharingAffinity` for future placement +decisions. Existing allocations are not evicted. If there are active +allocations whose affinity cannot be reconstructed, the corresponding devices +may be treated conservatively until they become clean. ###### Are there any tests for feature enablement/disablement? @@ -891,17 +1135,30 @@ Yes, unit tests will cover the feature gate behavior for API validation and sche ###### How can a rollout or rollback fail? Can it impact already running workloads? -Rollout failure: If API server is updated but scheduler is not, the scheduler ignores affinity (permissive). This may cause incompatible sharing, but drivers should handle it. +Rollout failure modes include: -Rollback failure: If scheduler is rolled back but API server keeps the field, same permissive behavior. +- **Older scheduler after API enablement**: the scheduler ignores `sharingAffinity` + and placement may be overly permissive. The driver remains the safety backstop + at prepare time. +- **Newer scheduler enabling conservative handling on legacy in-use devices**: + devices with non-reconstructable active claims may be filtered until they are + clean, which can temporarily reduce effective schedulable capacity. -Running workloads are not impacted; only new scheduling decisions are affected. +Rollback failure mode: if the scheduler is rolled back while the API server +still serves the field, placement returns to permissive behavior for new +scheduling decisions. + +Running workloads are not evicted by this feature; the impact is on future +placement decisions, not on already-running pods. ###### What specific metrics should inform a rollback? -- `dra_scheduling_attempts_affinity_mismatch_total` increasing unexpectedly -- Pod scheduling failures with affinity-related events -- Driver prepare failures due to incompatible configs +Illustrative rollback signals include: + +- `sharing_affinity_filter_mismatch_total` increasing unexpectedly, +- `sharing_affinity_unknown_device_total` remaining elevated after rollout, +- spikes in affinity-related scheduler events or unschedulable pods, +- driver prepare failures due to incompatible or stale configurations. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 

@@ -915,29 +1172,65 @@ No.
 
 ###### How can an operator determine if the feature is in use by workloads?
 
-- Check ResourceSlices for `sharingAffinity` field
-- Metric: `dra_scheduling_attempts_affinity_mismatch_total` > 0 indicates affinity is being enforced
+Operators can determine usage by inspecting `ResourceSlice` objects that set
+`sharingAffinity` and by observing `ResourceClaim`s that include recognized
+`StructuredParameters` for requests targeting those devices.
 
 ###### How can someone using this feature know that it is working for their instance?
 
-- [ ] Events
-  - Event Reason: `SharingAffinityMismatch` when a device is skipped due to affinity
-- [ ] API .status
-  - Condition name: N/A (affinity is transparent; allocation succeeds or device is skipped)
+A user should be able to observe that:
+
+- compatible claims preferentially pack onto already-locked devices,
+- incompatible claims are filtered before bind/prepare when the scheduler has
+  reconstructable affinity state,
+- devices with unknown legacy affinity state are conservatively excluded until
+  they become clean.
+
+In practice, this should be visible through scheduler logs, scheduler events,
+and (where implemented) scheduler metrics.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-Scheduling latency should not increase significantly. Affinity checking is O(number of attribute keys), typically 1-3 keys.
+This enhancement should not materially regress baseline DRA scheduling latency
+for clusters that do not use `sharingAffinity`.
+
+For clusters that do use the feature, the primary objective is **correctness of
+compatibility-aware placement** with bounded incremental scheduling overhead.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
-- [x] Metrics
-  - Metric name: `dra_scheduling_attempts_affinity_mismatch_total`
-  - Components exposing the metric: kube-scheduler
+Useful SLIs include:
+
+- rate of scheduling attempts filtered due to `sharingAffinity` mismatch,
+- rate of devices with `AffinityStates[deviceID].Unknown == true`,
+- rate of malformed or duplicate recognized `StructuredParameters` payloads,
+- share of successful placements that pack onto already-locked compatible
+  devices,
+- prepare-time rejections by the DRA driver caused by incompatible hardware
+  configuration.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-A metric for "devices skipped due to affinity" per scheduling cycle could help diagnose fragmentation.
+This feature would benefit from scheduler-observable counters and/or events for:
+
+- `sharing_affinity_filter_mismatch_total`
+- `sharing_affinity_filter_missing_parameters_total`
+- `sharing_affinity_filter_invalid_parameters_total`
+- `sharing_affinity_unknown_device_total`
+- `sharing_affinity_packed_allocation_total`
+
+Exact metric names are illustrative and implementation-specific, but
+equivalent observability is strongly recommended.
+
+In addition, user-facing diagnostics should make the reason for filtering clear,
+for example:
+
+- missing required structured parameters for request `<request-name>`,
+- duplicate recognized `StructuredParameters` entries for request `<request-name>`,
+- required key `<key>` has a non-string value in alpha,
+- device `<device>` is locked to incompatible affinity values,
+- device `<device>` has unknown affinity state due to legacy or invalid active
+  claims.
 
 ### Dependencies
 
@@ -971,7 +1264,7 @@ Negligible.
Affinity check is a simple map lookup, O(1) per attribute key. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? -No. Memory increase for `AllocatedState.AffinityValues` is proportional to active shared allocations, not total devices. +No. Memory increase for `AllocatedState.AffinityStates` is proportional to active shared allocations, not total devices. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? @@ -981,20 +1274,51 @@ No. ###### How does this feature react if the API server and/or etcd is unavailable? -Same as existing DRA behavior. Scheduler cannot proceed without API server. +Like existing scheduler-driven DRA logic, this feature depends on informer state +and cached API data. Temporary API server or etcd unavailability does not by +itself invalidate already-computed in-memory lock state, but sustained control- +plane unavailability may delay reconciliation of claim release, slice updates, +or restart reconstruction. + +The driver remains the final enforcement authority at prepare time. ###### What are other known failure modes? -- **Affinity fragmentation**: Many unique affinity values cause devices to be underutilized - - Detection: Monitor device utilization vs capacity - - Mitigation: Review affinity key design; consider coarser grouping +Known failure modes include: + +- **Malformed structured parameters**: the scheduler cannot decode the + recognized payload and filters the device. +- **Duplicate recognized entries for one request**: the scheduler treats the + request as invalid for `sharingAffinity` scheduling. +- **Missing required keys**: the claim cannot be matched against a device that + declares those keys. +- **Non-string values for required keys in alpha**: the device is filtered for + that claim. +- **Unknown affinity state**: the device has active allocations whose affinity + cannot be reconstructed, so it is conservatively filtered until clean. +- **Prepare-time driver rejection**: despite scheduler filtering, the driver may + still reject an incompatible or stale placement and that rejection is the + final safety backstop. ###### What steps should be taken if SLOs are not being met to determine the problem? -1. Check `dra_scheduling_attempts_affinity_mismatch_total` for unexpected spikes -2. Review ResourceSlice `sharingAffinity` configuration -3. Examine claim affinity values for unexpected entries -4. Consider disabling feature gate as temporary mitigation +Recommended debugging flow: + +1. Inspect the relevant `ResourceSlice` and confirm the device declares the + expected `sharingAffinity.attributeKeys`. +2. Inspect the `ResourceClaim` and confirm there is exactly one recognized + `StructuredParameters` entry for the relevant request. +3. Verify that every required affinity key is present and string-valued. +4. Check whether the target device is already locked to incompatible values. +5. Check whether the device is being treated as having **unknown affinity + state** because of legacy or invalid active claims. +6. Review scheduler logs/events for explicit filter reasons. +7. If the scheduler allowed placement but the driver rejected prepare, inspect + driver logs to determine whether the issue was stale scheduler state, + unsupported config, or an actual device-level incompatibility. + +User-facing diagnostics should prefer concrete messages over generic +unschedulable errors whenever possible. 
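
As a concrete starting point for steps 1, 2, and 6 above, the inspection can be
done with standard `kubectl` commands. This is a minimal sketch: the object
names are placeholders, and the exact output shape depends on the driver and
the API version in use.

```bash
# Step 1: confirm which attribute keys the device declares for sharing.
kubectl get resourceslice <slice-name> -o yaml | grep -A 5 sharingAffinity

# Steps 2-3: inspect the claim's opaque config for the single recognized
# StructuredParameters entry and its string values.
kubectl get resourceclaim <claim-name> -n <namespace> -o yaml

# Step 6: surface scheduler events that carry the explicit filter reason.
kubectl get events -n <namespace> --field-selector reason=FailedScheduling
```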
## Implementation History @@ -1003,9 +1327,12 @@ Same as existing DRA behavior. Scheduler cannot proceed without API server. ## Drawbacks -- Adds complexity to the scheduler's allocation logic -- Affinity is static per device; cannot change after first allocation -- Fragmentation risk if affinity values are too fine-grained +- Adds complexity to the scheduler's allocation logic and cache reconstruction +- Once a device is locked, its effective affinity cannot change until all + claims on that device are released +- Fragmentation risk remains if affinity values are too fine-grained +- Conservative handling of legacy in-use devices can temporarily strand + schedulable capacity during rollout or migration ## Alternatives @@ -1119,6 +1446,10 @@ beta/GA based on real-world feedback: ### Priority-based Lock Preemption +This section addresses a deliberate **alpha limitation**: alpha enforces lock +compatibility, but does not provide lock-aware fairness or any mechanism for a +higher-priority Pod to break an incompatible lock. + Standard Kubernetes preemption is **blind to affinity locks**. It triggers on *resource shortage* (insufficient CPU, memory, or device slots), not on qualitative state mismatch. This creates a critical gap: @@ -1152,11 +1483,30 @@ This is scoped for Beta because the core Filter/Reserve/Score mechanism must be proven in Alpha first, and lock-aware preemption requires careful integration with the existing DRA preemption path. -### Follower-Only Strategy +### SharingStrategy (`CanSetLock` / `NeverSetLock`) + +Alpha intentionally does **not** let claims control whether they may establish +a new lock on a clean device. Any compatible claim can set the initial lock, +and the scheduler then packs subsequent compatible claims onto that device. + +A future enhancement could add an explicit **SharingStrategy** on the claim +side to control lock-setting behavior. Two candidate strategies are: + +- **`CanSetLock`** (default): The claim may land on a clean device and + establish the lock. This matches the alpha behavior. +- **`NeverSetLock`**: The claim may only be allocated to a device that already + has a matching lock established by another claim. This is useful for + background or batch jobs that should never consume a clean device and + potentially fragment capacity. + +If introduced in beta or later, the scheduler would evaluate this policy before +capacity and key matching for unlocked devices. A claim with `NeverSetLock` +would reject an unlocked device immediately, then continue searching for an +already-locked compatible device. -See [UNRESOLVED #3: SharingStrategy](#open-design-questions) for the -`CanSetLock`/`NeverSetLock` proposal. If not included in alpha, this becomes -the first beta enhancement. +This is deferred from alpha to keep the initial scope focused on the core +problem: driver-declared sharing constraints plus scheduler-enforced lock +tracking via structured parameters. ### Soft / Preferred Affinity Keys @@ -1198,6 +1548,21 @@ evaluate affinity compatibility (e.g., This would require extending the CEL evaluation context to include runtime allocation state, which is a substantial change warranting its own KEP. +### Typed Affinity Values Beyond Strings + +Alpha intentionally limits `sharingAffinity` matching to `string` values, which +keeps equality semantics simple and matches the scheduler's current in-memory +representation (`map[string]string`). 
A future enhancement could extend this to +additional `DeviceAttribute` value types once Kubernetes defines stable +normalization and equality rules for those types in scheduler-owned lock state. + +That future work would need to answer questions such as: + +- How non-string values are normalized before comparison +- Whether different encodings of the same logical value are considered equal +- How typed values are stored in `AllocatedState` without introducing ambiguous + comparisons or upgrade hazards + ## Infrastructure Needed None From 44f990de4a395288fa80e22c9cf45ef3dd7c4e8b Mon Sep 17 00:00:00 2001 From: Ashvin Deodhar Date: Tue, 7 Apr 2026 22:57:23 -0700 Subject: [PATCH 4/6] Add compatibility matrix, objectRef alternative MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add compatibility matrix (5 scenarios × Scheduler/Driver Outcome columns) showing dual-enforcement model for SA+SP combinations - Add Enablement and Rollout Dynamics section with unknown-affinity safety valve, missing/malformed parameter handling, version skew/rollback, and recommended rollout sequence for drivers - Add Object Reference-based Affinity Matching as a rejected alternative with rationale (API surface, multi-dimensional affinity, @pohly direction) - Add drawback: devices with sharingAffinity but no SP claims become unschedulable under Strict Gating --- .../5981-dra-sharing-affinity/README.md | 128 ++++++++++++++++-- 1 file changed, 114 insertions(+), 14 deletions(-) diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md index bc824d1c311a..936485eee9d7 100644 --- a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md +++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md @@ -43,6 +43,7 @@ - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - [Claim-side SharingAffinity (on DeviceRequest)](#claim-side-sharingaffinity-on-devicerequest) + - [Object Reference-based Affinity Matching](#object-reference-based-affinity-matching) - [Placeholder Pattern Workaround](#placeholder-pattern-workaround) - [CEL-based Affinity Matching](#cel-based-affinity-matching) - [Future Enhancements](#future-enhancements) @@ -411,8 +412,8 @@ but only if pods belong to the same subnet. The driver sets blocks can specify different structured parameters for different requests. This means `mgmt-nic` can be locked to Subnet-A while `data-nic` is locked to Subnet-B within the same claim — there is no cross-talk between requests. -- **Empty affinity**: Devices without sharingAffinity do not participate in - affinity-based gating; those may get allocated by claims that do not specify affinity. +- **Empty affinity**: Devices without `sharingAffinity` behave as before — any + claim can share them regardless of whether it provides structured parameters. - **Legacy allocations with unknown affinity are conservative in alpha**: If a device has active allocations for which the scheduler cannot reconstruct the required affinity values (for example, claims created before the feature @@ -435,6 +436,26 @@ Legacy claims continue to run and are not evicted. However, until all unknown allocations on a `sharingAffinity` device are released, the scheduler does not assume it knows the device's effective modal state. 
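
The conservative rule above can be made concrete with a small sketch. This is
illustrative only: `foldClaim` is a hypothetical helper, `AffinityState`
repeats the cache struct from Design Details, and the consistency check
between multiple reconstructable claims is omitted for brevity.

```go
// AffinityState mirrors the scheduler cache type from Design Details.
type AffinityState struct {
	Unknown        bool
	LockedAffinity map[string]string
}

// foldClaim merges one active claim into a device's reconstructed affinity
// state. A nil values map models a claim whose affinity could not be
// reconstructed (legacy, malformed, or non-string values).
func foldClaim(state *AffinityState, values map[string]string) {
	if values == nil {
		// One non-reconstructable claim marks the whole device unknown; it
		// stays excluded from new sharingAffinity placements until every
		// active claim on it is released.
		state.Unknown = true
		state.LockedAffinity = nil
		return
	}
	if state.Unknown {
		return // the device remains unknown until it drains clean
	}
	if state.LockedAffinity == nil {
		state.LockedAffinity = values // first reconstructable claim defines the lock
	}
}
```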
+#### Compatibility Matrix + +To clarify the interaction between claims and devices, the following matrix +outlines how the scheduler and driver evaluate candidates based on whether +`SharingAffinity` (SA) is declared on the device and whether +`StructuredParameters` (SP) are provided in the claim: + +| Scenario | Device SA | Claim SP | Scheduler Outcome | Driver Outcome | +|---|---|---|---|---| +| **Standard Feature Use** | Yes | Yes | **Match enforced.** Values match lock + capacity available → scheduled. | **Validates** hardware mode matches claim config at `NodePrepareResources`. Rejects if stale or inconsistent. | +| **Strict Gating** | Yes | No | **Filtered out.** Device excluded — requires affinity signal the claim does not provide. | **N/A** — claim never reaches the driver for this device. | +| **Legacy Device Transition** | Yes (newly added) | Yes | **Filtered out** while legacy claims are active (`Unknown: true`). Allowed once device drains clean. | **Validates** as normal once claim reaches the driver. During transition, driver continues serving legacy claims. | +| **Permissive Sharing** | No | Yes | **Allowed.** Device has no `sharingAffinity`; structured parameters are not evaluated for affinity. Standard capacity matching applies. | **Must enforce** hardware compatibility independently. Scheduler provides no affinity gating for this device. | +| **Legacy/Basic** | No | No | **Allowed.** Standard DRA capacity and attribute matching. | **Must enforce** hardware compatibility independently. This is the pre-KEP-5981 behavior. | + +The top rows show the scheduler as the **primary** enforcer with the driver as +a **backstop**. The bottom rows show the driver as the **sole** enforcer with +the scheduler being permissive. The transition row shows the scheduler being +**conservative** (filtering) while the driver continues **serving existing +workloads**. ### Risks and Mitigations @@ -669,7 +690,7 @@ resource.k8s.io` plus `apiVersion: resource.k8s.io/v1alpha1` and `kind: StructuredParameters` in the embedded payload. Driver-specific config entries are ignored by the scheduler. -If a device has `AffinityStates[deviceID].Unknown == true`, or if a required request has no +If a device has `AffinityStates[deviceID].Unknown` set, or if a required request has no recognized structured-parameters entry, more than one recognized entry, an entry that fails schema/decoding checks, or a required affinity key with a non-string value, the device is filtered out for `sharingAffinity` @@ -683,13 +704,13 @@ devices without `sharingAffinity`. 1. A device already locked to a compatible affinity value scores highest. 2. An otherwise equivalent clean (unlocked) device scores lower. -3. An incompatible locked device, or a device with `AffinityStates[deviceID].Unknown == true`, is +3. An incompatible locked device, or a device with `AffinityStates[deviceID].Unknown` set, is not scored because it was already filtered out. This preserves unlocked devices for future workloads with different affinity -values, minimizing fragmentation. The exact score weight is not in scope of the alpha; -the required behavior is that a compatible locked device is preferred over an otherwise -equivalent clean device. +values, minimizing fragmentation. The exact score weight is implementation-defined +in alpha; the required behavior is that a compatible locked device is preferred +over an otherwise equivalent clean device. ##### Reserve Phase: Tentative Locking @@ -779,7 +800,7 @@ cycle begins. 
set `AffinityStates[deviceID].Unknown = true` and log a warning. The scheduler must not infer lock state from ambiguous or invalid data. 6. If multiple claims share the same device and any one of them causes the - device to become unknown, the device remains with `AffinityStates[deviceID].Unknown == true` + device to become unknown, the device remains with `AffinityStates[deviceID].Unknown` set until all claims on that device are released. 7. If multiple reconstructable claims share the same device, verify their values are consistent (they must be, by construction—but log a warning if not). @@ -944,7 +965,7 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Filter: duplicate recognized `StructuredParameters` entries for one request → device filtered out - Filter: non-string value for a required affinity key → device filtered out - - Filter: device with `AffinityStates[deviceID].Unknown == true` is excluded for new + - Filter: device with `AffinityStates[deviceID].Unknown` set is excluded for new `sharingAffinity` scheduling - Score: locked-compatible device scores higher than clean device - Reserve: first claim sets lock; second claim with same values succeeds @@ -972,7 +993,7 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Interaction with consumable capacity constraints (KEP-5075) - Scheduler restart: `AffinityStates` correctly reconstructed from existing bound ResourceClaims, and devices with non-reconstructable active claims have - `AffinityStates[deviceID].Unknown == true` + `AffinityStates[deviceID].Unknown` set - Parallel scheduling: two Pods with conflicting affinity values targeting the same device — one wins Reserve, the other is requeued - Feature gate disabled: `sharingAffinity` fields are ignored; devices are @@ -989,7 +1010,7 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. device with a different affinity value - **Legacy Device Migration**: 5 Pods are already running on a NIC; the driver updates `ResourceSlice` to add `sharingAffinity`; a 6th Pod arrives with - structured parameters — verify the device has `AffinityStates[deviceID].Unknown == true` + structured parameters — verify the device has `AffinityStates[deviceID].Unknown` set and the 6th Pod is filtered from that device until all legacy claims drain - **Partial Key**: Device requires `subnet` and `pkey` in `attributeKeys`; claim provides only `subnet` — verify the device is filtered out @@ -1050,7 +1071,7 @@ hardware. The scheduler handles this as follows: - If any active claim on that device does **not** provide reconstructable affinity values for the required keys, the scheduler marks the device as `AffinityStates[deviceID].Unknown = true`. -- A device with `AffinityStates[deviceID].Unknown == true` is **not eligible** for new +- A device with `AffinityStates[deviceID].Unknown` set is **not eligible** for new `sharingAffinity` placements, even if it still has nominal shared capacity. - **Once all active claims on that device are released**, the device becomes clean and subsequent allocations can establish and enforce affinity normally. @@ -1065,6 +1086,44 @@ hardware. The scheduler handles this as follows: > avoid adding `sharingAffinity` mid-flight when possible, but the scheduler > must handle it safely when it occurs. +**Enablement and Rollout Dynamics** + +The combination of device and claim states during rollout is critical for safe +enablement. 
The design prioritizes **conservative correctness** to ensure that +modal hardware is never accidentally over-provisioned or misconfigured during a +feature upgrade. + +1. **The "Unknown Affinity" safety valve**: If `sharingAffinity` is added to a + ResourceSlice that already has active legacy claims (claims created before the + feature or without `StructuredParameters`), the scheduler cannot reconstruct + the current mode of the device. The device is marked with + `AffinityStates[deviceID].Unknown = true`. Even if the device has 15/16 slots + free, the scheduler will **filter it out** for any new claim that requires + `sharingAffinity`. The device only becomes eligible for new affinity-locked + scheduling once all legacy claims are released and the device becomes clean. + +2. **Handling missing or malformed parameters**: The scheduler treats a device + with `sharingAffinity` as a protected resource. If a device requires both + `subnet` and `pkey` but a claim only provides `subnet`, the device is filtered + out — all declared keys are mandatory. If a claim's `StructuredParameters` + entry is malformed or contains non-string values for required keys in alpha, + the device is excluded. + +3. **Version skew and rollback**: During a rollout, an older scheduler will + ignore the `sharingAffinity` field. This may lead to permissive scheduling + where incompatible pods land on the same device. The DRA driver remains the + final safety backstop to reject these at `NodePrepareResources`. If the + feature gate is disabled on rollback, `sharingAffinity` fields are ignored, + devices return to being unconditionally shareable, and the driver again + becomes the sole authority for enforcing hardware modes. + +**Recommended rollout sequence**: To minimize capacity stranding during rollout, +drivers should ideally: + +1. Wait for a device to be idle (clean). +2. Update the ResourceSlice to include the `sharingAffinity` field. +3. Allow the scheduler to establish the first known lock with a new claim. + **Downgrade**: If a ResourceSlice with `sharingAffinity` exists and the feature gate is disabled: - API server rejects updates to the field - Scheduler ignores the field (all claims can share) @@ -1077,7 +1136,7 @@ hardware. The scheduler handles this as follows: - **kube-scheduler**: - A scheduler that understands this feature enforces `sharingAffinity`, tracks `AffinityStates`, and may conservatively mark devices with - `AffinityStates[deviceID].Unknown == true` when effective affinity cannot be reconstructed. + `AffinityStates[deviceID].Unknown` set when effective affinity cannot be reconstructed. - An older scheduler ignores `sharingAffinity`. In that skew case, placement may be overly permissive and the DRA driver remains the final safety backstop during `NodePrepareResources`. @@ -1202,7 +1261,7 @@ compatibility-aware placement** with bounded incremental scheduling overhead. Useful SLIs include: - rate of scheduling attempts filtered due to `sharingAffinity` mismatch, -- rate of devices with `AffinityStates[deviceID].Unknown == true`, +- rate of devices with `AffinityStates[deviceID].Unknown` set, - rate of malformed or duplicate recognized `StructuredParameters` payloads, - share of successful placements that pack onto already-locked compatible devices, @@ -1333,6 +1392,9 @@ unschedulable errors whenever possible. 
- Fragmentation risk remains if affinity values are too fine-grained - Conservative handling of legacy in-use devices can temporarily strand schedulable capacity during rollout or migration +- If a driver declares `sharingAffinity` on a device but no claims ever provide + `StructuredParameters`, that device becomes effectively unschedulable for + sharing workloads — all claims are filtered out by the "Strict Gating" rule ## Alternatives @@ -1363,6 +1425,43 @@ type SharingAffinity struct { - Users must understand the sharing constraint mechanism and explicitly opt into it, rather than simply providing config values they'd specify anyway +### Object Reference-based Affinity Matching + +An alternative approach replaces inline affinity values with external object +references. Instead of embedding values in opaque config, the claim would +reference a CRD (e.g., `NetworkConfiguration`) by name, and the device would +declare which object kinds constrain sharing: + +```yaml +# External CRD +kind: NetworkConfiguration +metadata: + name: subnet-a +spec: + subnet: 10.0.1.0/24 + +# ResourceClaim +config: + objectRefs: # new field + - kind: NetworkConfiguration + name: subnet-a + +# Device +commonConfigKind: # new field +- NetworkConfiguration +``` + +**Rejected because**: +- Requires new fields on **both** ResourceClaim (`objectRefs`) and Device + (`commonConfigKind`), whereas the chosen approach adds a field only to + Device and uses existing opaque config for claim-side values +- Requires external CRD definitions, adding operational burden for cluster + administrators +- Multi-dimensional affinity: A device may need affinity on multiple independent axes + (e.g., subnet + VLAN). With object references, each axis would need its own CRD. +- Conflicts with the direction from @pohly to avoid new API fields on claims + and use well-known schemas inside existing opaque config + ### Placeholder Pattern Workaround Without this KEP, drivers must use a "placeholder pattern": @@ -1372,6 +1471,7 @@ Without this KEP, drivers must use a "placeholder pattern": 3. Update ResourceSlice with actual capacity and affinity as attribute 4. 
Use CEL selector to match affinity attribute + **Problems**: - Race condition: Second pod may go to different device before expansion - ResourceSlice churn: Constant updates as pods come and go From d44bec177ffbecd6e713c611908be8e6da4207bc Mon Sep 17 00:00:00 2001 From: Ashvin Deodhar Date: Mon, 13 Apr 2026 12:16:28 -0700 Subject: [PATCH 5/6] Update the draft with few minor changes - MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Naming: - Rename sharingAffinity.attributeKeys to parameterKeys throughout to clarify that the keys reference claim StructuredParameters, not device attributes (addresses reviewer feedback from @sunya-ch) Design refinements: - Add 4-tier scoring: locked-compatible > clean-with-SA > no-SA > filtered, steering affinity-aware claims toward upgraded devices during mixed rollouts Upgrade/Downgrade restructure: - Reorder for narrative flow: upgrade → rollout sequence → in-use edge case → malformed params → downgrade - Clarify re-enable: drivers must republish ResourceSlices after rollback - Simplify driver skew description Test plan: - Add Compatibility Matrix, Strict Gating, Score Packing, Permissive Sharing, Driver Backstop, NodePrepareResources failure, and Rollout scenario tests --- .../5981-dra-sharing-affinity/README.md | 207 +++++++++--------- 1 file changed, 99 insertions(+), 108 deletions(-) diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md index 936485eee9d7..8d51b0b385bf 100644 --- a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md +++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md @@ -179,7 +179,7 @@ spec: - name: eth1 allowMultipleAllocations: true sharingAffinity: - attributeKeys: ["networking.example.com/subnet"] + parameterKeys: ["networking.example.com/subnet"] capacity: networking.example.com/slots: value: "16" @@ -212,7 +212,7 @@ trade-off analysis. **2. How claims communicate affinity values to the scheduler** -The driver declares `sharingAffinity.attributeKeys` on the device, telling the +The driver declares `sharingAffinity.parameterKeys` on the device, telling the scheduler which attribute keys constrain sharing. The scheduler learns the requested values for those keys by decoding a **well-known JSON schema** stored inside `OpaqueDeviceConfiguration`. @@ -230,7 +230,7 @@ a decodable format for affinity-relevant parameters. The schema reuses the same qualified key naming convention as `ResourceSlice` attributes and follows a `DeviceAttribute`-like envelope. In **alpha**, `sharingAffinity` matching is limited to **string-valued** attributes for the -keys referenced by `sharingAffinity.attributeKeys`, which keeps equality +keys referenced by `sharingAffinity.parameterKeys`, which keeps equality semantics simple and aligns with the scheduler's in-memory lock representation. ```json @@ -245,7 +245,7 @@ semantics simple and aligns with the scheduler's in-memory lock representation. ``` The scheduler decodes this JSON from the opaque blob and extracts **string** -values for the keys listed in the device's `sharingAffinity.attributeKeys`. The +values for the keys listed in the device's `sharingAffinity.parameterKeys`. The decoding overhead is small compared to the overall scheduling effort. 
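
To make the decoding step concrete, the following Go sketch shows how a
recognized payload could be reduced to the lock values used by the Filter
phase. It is illustrative only: `structuredParameters`, `attributeValue`, and
`extractAffinityValues` are hypothetical names rather than proposed API, and
real scheduler code would use generated API types instead of ad-hoc structs.

```go
import (
	"encoding/json"
	"fmt"
)

// structuredParameters mirrors the well-known payload shown above; only the
// "string" variant of the attribute envelope is matched in alpha.
type structuredParameters struct {
	APIVersion string                    `json:"apiVersion"`
	Kind       string                    `json:"kind"`
	Attributes map[string]attributeValue `json:"attributes"`
}

type attributeValue struct {
	String *string `json:"string,omitempty"`
}

// extractAffinityValues decodes one recognized config entry and returns the
// string values for the device's declared keys, enforcing the rules above:
// a malformed payload, a missing key, or a non-string value disqualifies the
// device for this claim.
func extractAffinityValues(raw []byte, parameterKeys []string) (map[string]string, error) {
	var sp structuredParameters
	if err := json.Unmarshal(raw, &sp); err != nil {
		return nil, fmt.Errorf("malformed structured parameters: %w", err)
	}
	if sp.APIVersion != "resource.k8s.io/v1alpha1" || sp.Kind != "StructuredParameters" {
		return nil, fmt.Errorf("unrecognized payload %s/%s", sp.APIVersion, sp.Kind)
	}
	out := make(map[string]string, len(parameterKeys))
	for _, key := range parameterKeys {
		attr, ok := sp.Attributes[key]
		if !ok {
			return nil, fmt.Errorf("missing required affinity key %q", key)
		}
		if attr.String == nil {
			return nil, fmt.Errorf("non-string value for affinity key %q", key)
		}
		out[key] = *attr.String
	}
	return out, nil
}
```

Applied to the schema example above, this returns the string values for the
declared keys and leaves any extra keys untouched, matching the "extra keys
are ignored" rule.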
If drivers also need additional, differently structured configuration @@ -285,7 +285,7 @@ scheduler-recognized contract with the following rules: the driver should reject the request during `NodePrepareResources` with a clear error, rather than silently accepting divergent values. 5. **String-only affinity values in alpha**: For any key referenced by - `sharingAffinity.attributeKeys`, the recognized structured-parameters entry + `sharingAffinity.parameterKeys`, the recognized structured-parameters entry must provide a `string` value in alpha. Other value types are not matched in alpha and are treated as invalid for `sharingAffinity` scheduling. 6. **Malformed payloads**: If a recognized structured-parameters entry is @@ -345,7 +345,7 @@ expected alpha limitation, not a correctness bug, and is addressed later under A user runs a distributed training job where every Pod must share the same RDMA Partition Key (PKey) to communicate. The NIC supports 16 VFs. The driver -sets `sharingAffinity.attributeKeys: ["networking.example.com/pkey"]`. The scheduler finds a node where +sets `sharingAffinity.parameterKeys: ["networking.example.com/pkey"]`. The scheduler finds a node where a NIC has enough capacity and is either "unlocked" or already locked to that specific PKey. @@ -357,7 +357,7 @@ specific PKey. An inference service uses FPGAs to accelerate a specific model. Loading a bitstream takes several seconds. The driver sets -`sharingAffinity.attributeKeys: ["fpga.example.com/bitstream"]`. The scheduler ensures new Pods +`sharingAffinity.parameterKeys: ["fpga.example.com/bitstream"]`. The scheduler ensures new Pods for this model are scheduled onto FPGAs that already have the bitstream loaded, even if other "fresh" FPGAs are available. @@ -369,7 +369,7 @@ even if other "fresh" FPGAs are available. A network DRA driver advertises NICs that can be shared across up to 16 pods, but only if pods belong to the same subnet. The driver sets -`sharingAffinity.attributeKeys: ["networking.example.com/subnet"]`. +`sharingAffinity.parameterKeys: ["networking.example.com/subnet"]`. - Pod A (subnet-X) gets allocated to eth1 → eth1 is now locked to subnet-X - Pod B (subnet-X) arrives → matches affinity, shares eth1 @@ -381,23 +381,23 @@ but only if pods belong to the same subnet. The driver sets device is allocated with an affinity value, that value is locked until all claims release the device. - **Attribute keys must be declared**: The device's - `sharingAffinity.attributeKeys` lists which attribute keys constrain sharing; + `sharingAffinity.parameterKeys` lists which attribute keys constrain sharing; claims must provide values for all of these keys in the well-known structured parameters or the device is filtered out. - **Multiple keys**: If multiple attribute keys are specified, ALL must match (both presence and value). - **Extra keys in claim**: If a claim's structured parameters contain keys beyond - what the device declares in `attributeKeys`, the extra keys are **ignored** + what the device declares in `parameterKeys`, the extra keys are **ignored** for that device. Only the device's declared keys are evaluated. This allows "generic" claims to work across devices with different sharing requirements (e.g., a claim with both `subnet` and `vlan` can match a device that only constrains on `subnet`). 
- **String-only matching in alpha**: For keys referenced by
-  `sharingAffinity.attributeKeys`, the scheduler only matches `string` values
+  `sharingAffinity.parameterKeys`, the scheduler only matches `string` values
   in alpha. If a required key is present with a non-string value, the device
   is filtered out for that claim.
 - **Missing keys in claim**: If the claim does not provide a value for a key
-  the device declares in `attributeKeys`, the device is **filtered out** (see
+  the device declares in `parameterKeys`, the device is **filtered out** (see
   Filter phase).
 - **Malformed structured parameters**: If the scheduler-recognized
   `StructuredParameters` entry is malformed, undecodable, or uses the wrong
@@ -562,7 +562,7 @@ type Device struct {
 // DeviceSharingAffinity defines which device attribute keys constrain
 // sharing across multiple claims.
 type DeviceSharingAffinity struct {
-	// AttributeKeys lists the fully-qualified device attribute names that
+	// ParameterKeys lists the fully-qualified structured-parameter keys that
 	// must have matching values across all claims sharing this device.
 	//
 	// In alpha, the corresponding values must be provided as strings in the
@@ -578,10 +578,10 @@ type DeviceSharingAffinity struct {
 	// +required
 	// +listType=atomic
 	// +k8s:maxItems=8
-	AttributeKeys []FullyQualifiedName
+	ParameterKeys []FullyQualifiedName
 }
 
-const SharingAffinityAttributeKeysMaxSize = 8
+const SharingAffinityParameterKeysMaxSize = 8
 ```
 
 #### Scheduler Enhancement
 
@@ -593,7 +593,7 @@ parameters** (decoded from the well-known JSON schema in opaque config) — not
 from device attributes on the ResourceSlice. The driver is NOT required to
 write locked affinity values back to the ResourceSlice.
 
-- The ResourceSlice declares *which* keys constrain sharing (`attributeKeys`)
+- The ResourceSlice declares *which* keys constrain sharing (`parameterKeys`)
 - The claims declare *what* values they need (via well-known structured
   parameters)
 - The scheduler combines these to maintain the lock in `AllocatedState`
@@ -633,15 +633,6 @@ maintains affinity locks in its internal cache rather than relying on API
 server round-trips. This is consistent with how DRA already handles capacity
 tracking via `inFlightAllocations`.
 
-A device's effective state is a derived value:
-
-```
-Effective State = ResourceSlice (device definition + attributeKeys)
-                + Active Claims (structured parameters decoded from opaque config)
-                + Unknown affinity claims (AffinityStates[deviceID].Unknown marks non-reconstructable state)
-                + AssumedClaims (tentative locks from current scheduling cycle)
-```
-
 The scheduler's `AllocatedState` is extended to track affinity values
 alongside consumed capacity:
 
@@ -678,7 +669,7 @@ device with `sharingAffinity` is a candidate ONLY if:
 3. The claim has **exactly one** scheduler-recognized `StructuredParameters`
    config entry targeting the relevant request
 4. That entry can be decoded successfully using the well-known schema
-5. The claim provides values for ALL keys in `sharingAffinity.attributeKeys`
+5. The claim provides values for ALL keys in `sharingAffinity.parameterKeys`
   (missing key → device is **not** a candidate)
 6. For each required affinity key, the recognized entry provides a **string**
   value (non-string values are invalid in alpha)
@@ -703,12 +694,18 @@ devices without `sharingAffinity`.
 **Score phase**: The normative ordering in alpha is:
 
 1. A device already locked to a compatible affinity value scores highest.
-2. An otherwise equivalent clean (unlocked) device scores lower.
-3. 
An incompatible locked device, or a device with `AffinityStates[deviceID].Unknown` set, is +2. A clean (unlocked) device with `sharingAffinity` scores next — it can + establish a new lock and enable packing for future claims. +3. A device without `sharingAffinity` scores lowest among otherwise equivalent + candidates — the scheduler has no affinity enforcement for this device, so + packing benefits are lost. +4. An incompatible locked device, or a device with `AffinityStates[deviceID].Unknown` set, is not scored because it was already filtered out. This preserves unlocked devices for future workloads with different affinity -values, minimizing fragmentation. The exact score weight is implementation-defined +values, minimizing fragmentation. During mixed rollouts (some devices with +`sharingAffinity`, some without), this naturally steers affinity-aware claims +toward upgraded devices. The exact score weights are implementation-defined in alpha; the required behavior is that a compatible locked device is preferred over an otherwise equivalent clean device. @@ -720,7 +717,7 @@ lock" in the scheduler cache before the Binding phase: 1. Scheduler evaluates a multi-allocatable device with `sharingAffinity` 2. If device has no existing allocations (unlocked): - - Extract affinity values for `sharingAffinity.attributeKeys` from the claim's + - Extract affinity values for `sharingAffinity.parameterKeys` from the claim's structured parameters (decoded from opaque config) - Record values in `AllocatedState.AffinityStates[deviceID].LockedAffinity` - Proceed with allocation (device is now tentatively locked) @@ -739,27 +736,10 @@ and either join it or skip the device. This follows the same pattern used by | Event | Cache Action | Result | |-------|-------------|--------| -| Pod scheduled (Reserve) | Set `AffinityStates[deviceID].LockedAffinity` | Device becomes tentatively locked | -| Binding success (PreBind) | Transition tentative lock to hardened | Lock is confirmed | -| Binding failure / Preemption | Trigger Unreserve; remove tentative lock | Lock is released (if no other claims share it) | -| Driver update (ResourceSlice) | Reconcile cache with API state | Cache refreshed; redundant tentative locks purged | +| Pod scheduled (Reserve) | Set `AffinityStates[deviceID].LockedAffinity` | Device locked; subsequent claims must match | +| Scheduling failure (Unreserve) | Remove tentative lock if no other claims share it | Device may become unlocked | | All claims released | Clear `AffinityStates[deviceID]` | Device becomes unlocked | - -##### Handling the "First Pod" Problem - -The first Pod to land on an unlocked device defines the affinity lock for all -subsequent consumers. This introduces a risk: a low-priority Pod with a rare -affinity value could "poison" a high-capacity device. - -**Lock origin**: If `AffinityStates[deviceID].LockedAffinity` is empty, the scheduler takes -the affinity values from the current Pod's ResourceClaim and writes them to the -cache. All subsequent Pods must match these values to share the device. - -**Poisoning mitigation**: The Score phase assigns a higher score to nodes that -have a device already locked to a compatible affinity value, and a lower score -to nodes where the device is still unlocked. This steers the scheduler toward -packing onto already-locked devices before consuming clean ones, reducing -unnecessary lock fragmentation. 
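
The Reserve-phase decision in the lifecycle table above can be summarized in a
short sketch. This is an illustrative outline rather than the proposed plugin
code: `tryReserve` is a hypothetical helper, `AffinityState` repeats the cache
struct from Design Details, and the caller is assumed to hold the
`AllocatedState` write lock as required by the snapshot-consistency note below.

```go
// AffinityState mirrors the scheduler cache type from Design Details.
type AffinityState struct {
	Unknown        bool
	LockedAffinity map[string]string
}

// tryReserve decides whether one candidate device can take this claim, where
// "want" holds the claim's values for exactly the device's declared keys.
func tryReserve(states map[string]*AffinityState, deviceID string, want map[string]string) bool {
	st, ok := states[deviceID]
	if !ok {
		st = &AffinityState{}
		states[deviceID] = st
	}
	if st.Unknown {
		return false // non-reconstructable state: skip the device entirely
	}
	if len(st.LockedAffinity) == 0 {
		// Clean device: this claim establishes the tentative lock; Unreserve
		// removes it again if binding fails and no other claim shares it.
		st.LockedAffinity = want
		return true
	}
	for key, val := range want {
		if st.LockedAffinity[key] != val {
			return false // any mismatch on a declared key disqualifies the device
		}
	}
	return true // compatible: pack onto the already-locked device
}
```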
+| Driver adds `sharingAffinity` to in-use device | Mark `AffinityStates[deviceID].Unknown` if active claims are non-reconstructable | Device blocked for new sharing workloads until legacy claims drain | ##### Implementation Note: Snapshot Consistency @@ -827,7 +807,7 @@ spec: - name: eth1 allowMultipleAllocations: true sharingAffinity: - attributeKeys: ["networking.example.com/subnet"] + parameterKeys: ["networking.example.com/subnet"] attributes: networking.example.com/type: string: "sriov-vf" @@ -837,7 +817,7 @@ spec: - name: eth2 allowMultipleAllocations: true sharingAffinity: - attributeKeys: ["networking.example.com/subnet"] + parameterKeys: ["networking.example.com/subnet"] attributes: networking.example.com/type: string: "sriov-vf" @@ -902,7 +882,7 @@ spec: - name: mlx5_0 allowMultipleAllocations: true sharingAffinity: - attributeKeys: + parameterKeys: - networking.example.com/subnet - networking.example.com/pkey capacity: @@ -937,7 +917,7 @@ Alpha matching behavior: - A claim that provides only `subnet` but omits `pkey` is rejected for that device because **missing declared keys are invalid**. - The extra `vlan` key is ignored for this device because the driver did not - declare `networking.example.com/vlan` in `attributeKeys`. + declare `networking.example.com/vlan` in `parameterKeys`. ### Test Plan @@ -956,8 +936,8 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Filter: device with matching lock passes - Filter: device with conflicting lock is excluded - Filter: unlocked device with sufficient capacity passes - - Filter: claim missing a required `attributeKey` → device filtered out - - Filter: claim with extra keys beyond device's declared `attributeKeys` → extra + - Filter: claim missing a required `parameterKey` → device filtered out + - Filter: claim with extra keys beyond device's declared `parameterKeys` → extra keys ignored, device passes if declared keys match - Filter: no recognized `StructuredParameters` entry for a sharing-constrained request → device filtered out @@ -975,9 +955,14 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. unknown rather than establishing or joining a lock - Legacy-claim handling: all scenarios from the `Handling Legacy Claims with Unknown Affinity` table + - Compatibility matrix: device without `sharingAffinity` is unaffected — + claims with or without `StructuredParameters` both pass (Legacy/Basic and + Permissive Sharing rows) + - Strict Gating: device has `sharingAffinity` but claim provides zero + `StructuredParameters` config entries → device filtered out - `staging/src/k8s.io/api/resource/v1`: Coverage for new API types and the recognized structured-parameters contract, including: - - Validation: `attributeKeys` exceeding max 8 limit is rejected + - Validation: `parameterKeys` exceeding max 8 limit is rejected - Validation: structured parameters exceeding max 8 attributes is rejected - Validation: duplicate recognized structured-parameters entries for the same request are rejected when validation can detect them @@ -1012,8 +997,20 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. 
updates `ResourceSlice` to add `sharingAffinity`; a 6th Pod arrives with structured parameters — verify the device has `AffinityStates[deviceID].Unknown` set and the 6th Pod is filtered from that device until all legacy claims drain -- **Partial Key**: Device requires `subnet` and `pkey` in `attributeKeys`; +- **Partial Key**: Device requires `subnet` and `pkey` in `parameterKeys`; claim provides only `subnet` — verify the device is filtered out +- **Score Packing**: Two devices available, one already locked to subnet-X; + new claim for subnet-X → verify the claim is placed on the locked device, + not the clean one (full Filter→Score→Reserve pipeline) +- **Permissive Sharing (no SA)**: Device without `sharingAffinity`, claim with + `StructuredParameters` — verify scheduler allows the allocation and SP are + not evaluated for affinity +- **Driver Backstop**: Device without `sharingAffinity`, two claims with + incompatible config land on the same device — verify scheduler allows both + (permissive), and `NodePrepareResources` rejects the incompatible claim +- **NodePrepareResources failure does not clear lock**: Claim is bound and + lock is persisted, but `NodePrepareResources` fails on the node — verify + the affinity lock remains in the scheduler cache ##### e2e tests @@ -1023,6 +1020,9 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. different devices - Lock lifecycle: last Pod deleted → lock cleared → new Pod with different affinity value can claim the device +- Rollout scenario: existing Pods running without `sharingAffinity`; driver + adds `sharingAffinity` to ResourceSlice; verify existing Pods continue + running and new Pods respect the new constraint after legacy claims drain ### Graduation Criteria @@ -1060,7 +1060,22 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. ### Upgrade / Downgrade Strategy -**Upgrade**: Existing ResourceSlices without `sharingAffinity` continue to work. New field is additive. +**Upgrade**: Existing ResourceSlices without `sharingAffinity` continue to work. +New field is additive. See the [Compatibility Matrix](#compatibility-matrix) for +how the scheduler and driver behave across all combinations of device +`sharingAffinity` and claim `StructuredParameters` presence. + +**Recommended rollout sequence**: To minimize capacity stranding, drivers should +ideally: + +1. Wait for a device to be idle (clean). +2. Update the ResourceSlice to include the `sharingAffinity` field. +3. Allow the scheduler to establish the first known lock with a new claim. + +During mixed rollouts (some devices with `sharingAffinity`, some without), the +scoring preference for `sharingAffinity` devices (see +[Score](#filter-and-score-phases)) naturally steers affinity-aware claims toward +upgraded devices. **Adding `sharingAffinity` to an in-use device**: A driver may add or update `sharingAffinity` on a device that already has active (bound) ResourceClaims. @@ -1086,48 +1101,20 @@ hardware. The scheduler handles this as follows: > avoid adding `sharingAffinity` mid-flight when possible, but the scheduler > must handle it safely when it occurs. -**Enablement and Rollout Dynamics** - -The combination of device and claim states during rollout is critical for safe -enablement. The design prioritizes **conservative correctness** to ensure that -modal hardware is never accidentally over-provisioned or misconfigured during a -feature upgrade. - -1. 
**The "Unknown Affinity" safety valve**: If `sharingAffinity` is added to a - ResourceSlice that already has active legacy claims (claims created before the - feature or without `StructuredParameters`), the scheduler cannot reconstruct - the current mode of the device. The device is marked with - `AffinityStates[deviceID].Unknown = true`. Even if the device has 15/16 slots - free, the scheduler will **filter it out** for any new claim that requires - `sharingAffinity`. The device only becomes eligible for new affinity-locked - scheduling once all legacy claims are released and the device becomes clean. - -2. **Handling missing or malformed parameters**: The scheduler treats a device - with `sharingAffinity` as a protected resource. If a device requires both - `subnet` and `pkey` but a claim only provides `subnet`, the device is filtered - out — all declared keys are mandatory. If a claim's `StructuredParameters` - entry is malformed or contains non-string values for required keys in alpha, - the device is excluded. - -3. **Version skew and rollback**: During a rollout, an older scheduler will - ignore the `sharingAffinity` field. This may lead to permissive scheduling - where incompatible pods land on the same device. The DRA driver remains the - final safety backstop to reject these at `NodePrepareResources`. If the - feature gate is disabled on rollback, `sharingAffinity` fields are ignored, - devices return to being unconditionally shareable, and the driver again - becomes the sole authority for enforcing hardware modes. - -**Recommended rollout sequence**: To minimize capacity stranding during rollout, -drivers should ideally: +**Handling missing or malformed parameters**: The scheduler treats a device +with `sharingAffinity` as a protected resource. If a device requires both +`subnet` and `pkey` but a claim only provides `subnet`, the device is filtered +out — all declared keys are mandatory. If a claim's `StructuredParameters` +entry is malformed or contains non-string values for required keys in alpha, +the device is excluded. -1. Wait for a device to be idle (clean). -2. Update the ResourceSlice to include the `sharingAffinity` field. -3. Allow the scheduler to establish the first known lock with a new claim. +**Downgrade**: If the feature gate is disabled: -**Downgrade**: If a ResourceSlice with `sharingAffinity` exists and the feature gate is disabled: -- API server rejects updates to the field -- Scheduler ignores the field (all claims can share) -- Driver should handle this gracefully at prepare time +- The `sharingAffinity` field is not persisted on new writes. +- The scheduler ignores the field — all devices return to unconditional sharing + (pre-KEP-5981 behavior). +- The DRA driver becomes the sole authority for enforcing hardware compatibility + at `NodePrepareResources`. ### Version Skew Strategy @@ -1143,8 +1130,7 @@ drivers should ideally: - **kubelet**: No changes required; kubelet does not interpret `sharingAffinity`. - **DRA driver**: - - Drivers may publish `sharingAffinity` and may optionally surface current - lock-related state for observability. + - Drivers publish ResourceSlices with the `sharingAffinity` field on devices. - Drivers must continue validating actual hardware compatibility at prepare time, especially during skew where an older scheduler may not enforce the affinity constraints. @@ -1171,7 +1157,8 @@ No. Devices without `sharingAffinity` behave exactly as before. The feature only ###### Can the feature be disabled once it has been enabled (i.e. 
can we roll back the enablement)? Yes. Disabling the feature gate causes: -- API server to reject new/updated `ResourceSlice`s with `sharingAffinity` +- API server to strip the `sharingAffinity` field from new or updated + ResourceSlices before persisting (writes succeed, field is not stored) - Scheduler to ignore existing `sharingAffinity` fields for future placement decisions @@ -1182,9 +1169,13 @@ time. ###### What happens if we reenable the feature if it was previously rolled back? The scheduler resumes enforcing `sharingAffinity` for future placement -decisions. Existing allocations are not evicted. If there are active -allocations whose affinity cannot be reconstructed, the corresponding devices -may be treated conservatively until they become clean. +decisions. Existing allocations are not evicted. However, ResourceSlices that +were created or updated while the gate was disabled will not have the +`sharingAffinity` field (it was stripped by the API server). Drivers must +republish their ResourceSlices with `sharingAffinity` for the feature to take +effect. If there are active allocations whose affinity cannot be reconstructed +at that point, the corresponding devices are treated conservatively until they +become clean. ###### Are there any tests for feature enablement/disablement? @@ -1364,7 +1355,7 @@ Known failure modes include: Recommended debugging flow: 1. Inspect the relevant `ResourceSlice` and confirm the device declares the - expected `sharingAffinity.attributeKeys`. + expected `sharingAffinity.parameterKeys`. 2. Inspect the `ResourceClaim` and confirm there is exactly one recognized `StructuredParameters` entry for the relevant request. 3. Verify that every required affinity key is present and string-valued. @@ -1611,13 +1602,13 @@ tracking via structured parameters. ### Soft / Preferred Affinity Keys The Alpha design enforces **hard all-or-nothing** matching: all declared -`attributeKeys` must match or the device is filtered out. Real-world hardware +`parameterKeys` must match or the device is filtered out. Real-world hardware may have hierarchical constraints where some keys are strict sharing requirements (e.g., Subnet) and others are scheduling preferences (e.g., Traffic-Class or bandwidth profile). A future enhancement could add a `required` vs `preferred` flag on individual -entries in `attributeKeys`: +entries in `parameterKeys`: - **`required`** (default): Mismatch → device filtered out (current behavior) - **`preferred`**: Mismatch → device passes Filter but receives a lower score From 3995490962528305ac6e9611d03917516c3f8afa Mon Sep 17 00:00:00 2001 From: Ashvin Deodhar Date: Mon, 13 Apr 2026 16:36:55 -0700 Subject: [PATCH 6/6] Minor edits to better organize the KEP draft --- .../5981-dra-sharing-affinity/README.md | 272 +++++++++--------- 1 file changed, 144 insertions(+), 128 deletions(-) diff --git a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md index 8d51b0b385bf..485f42fa386e 100644 --- a/keps/sig-scheduling/5981-dra-sharing-affinity/README.md +++ b/keps/sig-scheduling/5981-dra-sharing-affinity/README.md @@ -91,19 +91,19 @@ example: - **Multi-pod NIC sharing**: A network DRA driver shares a NIC across 16 pods, but all pods must belong to the same subnet. Once the first pod configures the - NIC for **Subnet A**, the remaining 15 slots are restricted to **Subnet A**. + NIC for Subnet A, the remaining 15 slots are restricted to Subnet A. 
- **FPGA bitstream sharing**: An FPGA can serve multiple inference pods, but all - must use the same bitstream. Once **bitstream-ml-v2** is loaded, other pods - needing **bitstream-crypto-v1** must use a different FPGA. + must use the same bitstream. Once bitstream-ml-v2 is loaded, other pods + needing bitstream-crypto-v1 must use a different FPGA. This KEP introduces a `SharingAffinity` field in the ResourceSlice `Device` -spec that allows drivers to declare which device attribute keys constrain +spec that allows drivers to declare which parameter keys constrain sharing compatibility. The scheduler's `AllocatedState` is enhanced to track both consumed capacity and the affinity values that lock a device to a -particular sharing group, enabling it to **gate** remaining capacity and -**pack** compatible workloads onto already-locked devices. +particular sharing group, enabling it to gate remaining capacity and +pack compatible workloads onto already-locked devices. -Alpha intentionally does **not** provide lock-aware fairness or lock-breaking +Alpha intentionally does not provide lock-breaking preemption. In addition, if a device already has active allocations whose affinity cannot be reconstructed (for example, legacy claims created before the feature was enabled), the scheduler treats that device conservatively and does @@ -142,12 +142,12 @@ affinity values that determine sharing compatibility. This KEP closes that gap. ### Goals -- Enable the scheduler to **gate** remaining capacity on a device based on a +- Enable the scheduler to gate remaining capacity on a device based on a required sharing attribute - Provide a mechanism for drivers to signal compatibility requirements for shared hardware via `SharingAffinity` in ResourceSlice -- Minimize **fragmentation** of cluster resources by enabling the scheduler to - **pack** workloads with identical sharing requirements onto already-locked devices +- Minimize fragmentation of cluster resources by enabling the scheduler to + pack workloads with identical sharing requirements onto already-locked devices - Track affinity values in `AllocatedState` so subsequent scheduling decisions respect the first claim's lock-in - Maintain backward compatibility with devices that have no sharing affinity @@ -163,7 +163,7 @@ affinity values that determine sharing compatibility. This KEP closes that gap. - Retrofitting affinity-aware sharing onto already-in-use devices when active claims do not expose reconstructable affinity values. In alpha, such devices are treated conservatively until they drain clean. -- Guaranteeing **lock-aware fairness** or **lock-breaking/preemption** in alpha. +- Guaranteeing **lock-breaking preemption** in alpha. Alpha enforces compatibility and improves packing, but does not yet guarantee that a higher-priority Pod can displace an incompatible lock-holder. @@ -214,7 +214,7 @@ trade-off analysis. The driver declares `sharingAffinity.parameterKeys` on the device, telling the scheduler which attribute keys constrain sharing. The scheduler learns the -requested values for those keys by decoding a **well-known JSON schema** stored +requested values for those keys by decoding a well-known JSON schema stored inside `OpaqueDeviceConfiguration`. The claim's opaque config (`DeviceConfiguration.Opaque.Parameters`) is a @@ -228,8 +228,8 @@ a decodable format for affinity-relevant parameters. 
**Approach: Well-known JSON schema inside OpaqueDeviceConfiguration** The schema reuses the same qualified key naming convention as `ResourceSlice` -attributes and follows a `DeviceAttribute`-like envelope. In **alpha**, -`sharingAffinity` matching is limited to **string-valued** attributes for the +attributes and follows a `DeviceAttribute`-like envelope. In alpha, +`sharingAffinity` matching is limited to string-valued attributes for the keys referenced by `sharingAffinity.parameterKeys`, which keeps equality semantics simple and aligns with the scheduler's in-memory lock representation. @@ -244,12 +244,12 @@ semantics simple and aligns with the scheduler's in-memory lock representation. } ``` -The scheduler decodes this JSON from the opaque blob and extracts **string** +The scheduler decodes this JSON from the opaque blob and extracts string values for the keys listed in the device's `sharingAffinity.parameterKeys`. The decoding overhead is small compared to the overall scheduling effort. If drivers also need additional, differently structured configuration -parameters (e.g., MTU, QoS settings), users provide **two** config entries in +parameters (e.g., MTU, QoS settings), users provide two config entries in the claim: one using the standard schema (scheduler reads) and one using the vendor format (driver reads). The scheduler only considers configurations matching the well-known schema. @@ -268,11 +268,11 @@ For alpha, the scheduler-readable structured-parameters format is a scheduler-recognized contract with the following rules: 1. **Recognition**: The scheduler recognizes a config entry as structured - parameters only when the `opaque.driver` is `resource.k8s.io` **and** the + parameters only when the `opaque.driver` is `resource.k8s.io` and the embedded payload has `apiVersion: resource.k8s.io/v1alpha1` and `kind: StructuredParameters`. -2. **Per-request uniqueness**: For a given request, there must be **at most - one** structured-parameters config entry targeted at that request. Multiple +2. **Per-request uniqueness**: For a given request, there must be at most + one structured-parameters config entry targeted at that request. Multiple matching entries for the same request are invalid. 3. **Coexistence with driver config**: The structured-parameters entry may coexist with one or more driver-specific opaque config entries for the same @@ -306,7 +306,7 @@ In alpha, this keeps the contract explicit without introducing a new API field: the scheduler depends only on a single, community-governed, recognized payload shape and ignores all other opaque config. -For alpha, `StructuredParameters` is a **scheduler-recognized sub-protocol** +For alpha, `StructuredParameters` is a scheduler-recognized sub-protocol defined by this KEP. The scheduler interprets only payloads explicitly recognized as `opaque.driver: resource.k8s.io` together with the embedded `apiVersion`/`kind` for `StructuredParameters`; all vendor-defined opaque @@ -316,22 +316,22 @@ behavior explicitly. **Alpha scope** -Alpha fully resolves the design around **driver-side placement** and the -**structured-parameters approach** described above. Claims do not control +Alpha fully resolves the design around driver-side placement and the +structured-parameters approach described above. Claims do not control lock-setting behavior in alpha: any compatible claim may establish the initial lock on a clean device. 
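To make the recognition contract above concrete, the following Go sketch shows
how a scheduler-side decoder might apply the recognition rule (rule 1) and the
alpha string-only restriction. The type and function names here
(`structuredParameters`, `decodeStructuredParameters`) are illustrative
assumptions, not proposed API; the payload shape assumes the
`DeviceAttribute`-like envelope shown above.

```go
// Illustrative sketch only; the real scheduler types may differ.
package sharingaffinity

import "encoding/json"

// structuredParameters mirrors the well-known payload embedded in the
// claim's opaque config (assumed shape, DeviceAttribute-like envelope).
type structuredParameters struct {
	APIVersion string                                `json:"apiVersion"`
	Kind       string                                `json:"kind"`
	Attributes map[string]map[string]json.RawMessage `json:"attributes"`
}

// decodeStructuredParameters returns the string-valued attributes of a
// recognized StructuredParameters entry, or ok=false when the entry is not
// recognized or cannot be decoded (such claims never establish a lock).
func decodeStructuredParameters(opaqueDriver string, payload []byte) (map[string]string, bool) {
	// Recognition: only the community-governed driver name is interpreted.
	if opaqueDriver != "resource.k8s.io" {
		return nil, false
	}
	var sp structuredParameters
	if err := json.Unmarshal(payload, &sp); err != nil {
		return nil, false // malformed payload is treated as unrecognized
	}
	if sp.APIVersion != "resource.k8s.io/v1alpha1" || sp.Kind != "StructuredParameters" {
		return nil, false // embedded apiVersion/kind must match exactly
	}
	values := make(map[string]string, len(sp.Attributes))
	for key, envelope := range sp.Attributes {
		// Alpha: only string-valued attributes participate in affinity
		// matching; a non-string value for a required key later causes
		// the device to be filtered out.
		if raw, ok := envelope["string"]; ok {
			var s string
			if err := json.Unmarshal(raw, &s); err == nil {
				values[key] = s
			}
		}
	}
	return values, true
}
```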
Claim-side lock-setting policy (for example, `CanSetLock`/`NeverSetLock`) is deferred to [Future Enhancements](#future-enhancements). -In other words, alpha standardizes **driver-declared compatibility keys**, a -**scheduler-recognized structured-parameters contract**, and **correct lock -enforcement / packing behavior** — but intentionally stops short of -lock-aware fairness, lock-breaking policy, or preemption semantics. +In other words, alpha standardizes driver-declared compatibility keys, a +scheduler-recognized structured-parameters contract, and correct lock +enforcement / packing behavior — but intentionally stops short of +lock-breaking preemption semantics. **Alpha limitations** -Alpha provides **correct lock enforcement** and **better packing**, but it does -not provide lock-aware fairness or lock-breaking semantics. A lower-priority +Alpha provides correct lock enforcement and better packing, but it does +not provide lock-breaking preemption. A lower-priority Pod may continue holding a device lock even when the device still has nominal capacity and a higher-priority Pod needs the same device with a different affinity value. In that case the higher-priority Pod may remain unschedulable @@ -387,7 +387,7 @@ but only if pods belong to the same subnet. The driver sets - **Multiple keys**: If multiple attribute keys are specified, ALL must match (both presence and value). - **Extra keys in claim**: If a claim's structured parameters contain keys beyond - what the device declares in `parameterKeys`, the extra keys are **ignored** + what the device declares in `parameterKeys`, the extra keys are ignored for that device. Only the device's declared keys are evaluated. This allows "generic" claims to work across devices with different sharing requirements (e.g., a claim with both `subnet` and `vlan` can match a device that only @@ -397,7 +397,7 @@ but only if pods belong to the same subnet. The driver sets in alpha. If a required key is present with a non-string value, the device is filtered out for that claim. - **Missing keys in claim**: If the claim does not provide a value for a key - the device declares in `parameterKeys`, the device is **filtered out** (see + the device declares in `parameterKeys`, the device is filtered out (see Filter phase). - **Malformed structured parameters**: If the scheduler-recognized `StructuredParameters` entry is malformed, undecodable, or uses the wrong @@ -418,7 +418,7 @@ but only if pods belong to the same subnet. The driver sets If a device has active allocations for which the scheduler cannot reconstruct the required affinity values (for example, claims created before the feature was enabled or invalid persisted claims), that device is treated as having - **unknown affinity state** and is filtered out for new `sharingAffinity` + unknown affinity state and is filtered out for new `sharingAffinity` scheduling until it becomes fully clean. #### Handling Legacy Claims with Unknown Affinity @@ -451,11 +451,11 @@ outlines how the scheduler and driver evaluate candidates based on whether | **Permissive Sharing** | No | Yes | **Allowed.** Device has no `sharingAffinity`; structured parameters are not evaluated for affinity. Standard capacity matching applies. | **Must enforce** hardware compatibility independently. Scheduler provides no affinity gating for this device. | | **Legacy/Basic** | No | No | **Allowed.** Standard DRA capacity and attribute matching. | **Must enforce** hardware compatibility independently. This is the pre-KEP-5981 behavior. 
| -The top rows show the scheduler as the **primary** enforcer with the driver as -a **backstop**. The bottom rows show the driver as the **sole** enforcer with +The top rows show the scheduler as the primary enforcer with the driver as +a backstop. The bottom rows show the driver as the sole enforcer with the scheduler being permissive. The transition row shows the scheduler being -**conservative** (filtering) while the driver continues **serving existing -workloads**. +conservative (filtering) while the driver continues serving existing +workloads. ### Risks and Mitigations @@ -486,7 +486,7 @@ locked to the wrong value. **Mitigation (Alpha)**: The scoring plugin reduces the probability by packing compatible workloads and preserving clean devices. However, this is a soft mitigation — it does not guarantee that a clean device will always be available, -and alpha does **not** guarantee lock-aware fairness. +and alpha does not provide lock-breaking preemption. **Alpha limitation**: In alpha, a lower-priority Pod may continue to hold a lock that blocks a higher-priority incompatible Pod even when nominal capacity @@ -520,7 +520,7 @@ rare values. **Mitigation**: In many cases, affinity values are externally defined (subnet names, partition keys) and cannot be validated by the driver. The primary -mitigation is the **scoring plugin**: by packing compatible workloads onto +mitigation is the scoring plugin: by packing compatible workloads onto already-locked devices before consuming clean ones, the scheduler naturally limits fragmentation even when affinity values are unpredictable. Additionally, cluster administrators can use `DeviceClass` CEL selectors to restrict which @@ -611,15 +611,15 @@ state. ##### Safety Model and Responsibility Split -This feature intentionally keeps **placement knowledge** and **hardware -enforcement** separate: +This feature intentionally keeps placement knowledge and hardware +enforcement separate: - **Scheduler guarantee**: when it has recognized structured parameters for all active allocations on a `sharingAffinity` device, it will not intentionally co-place claims with incompatible affinity values on that device. - **Conservative fallback**: if the scheduler cannot reconstruct the effective affinity state of a device (for example, due to legacy or invalid persisted - claims), it treats that device as **unknown** and filters it out for new + claims), it treats that device as unknown and filters it out for new `sharingAffinity` placements until the device becomes clean. - **Driver guarantee**: the driver remains the final authority for programming and validating the actual hardware mode during `NodePrepareResources`. @@ -665,13 +665,13 @@ type AllocatedState struct { device with `sharingAffinity` is a candidate ONLY if: 1. It has sufficient consumable capacity (KEP-5075) -2. The device's `AffinityStates[deviceID].Unknown` is **not** true -3. The claim has **exactly one** scheduler-recognized `StructuredParameters` +2. The device's `AffinityStates[deviceID].Unknown` is not true +3. The claim has exactly one scheduler-recognized `StructuredParameters` config entry targeting the relevant request 4. That entry can be decoded successfully using the well-known schema 5. The claim provides values for ALL keys in `sharingAffinity.parameterKeys` - (missing key → device is **not** a candidate) -6. For each required affinity key, the recognized entry provides a **string** + (missing key → device is not a candidate) +6. 
For each required affinity key, the recognized entry provides a string value (non-string values are invalid in alpha) 7. The device's `AffinityStates[deviceID].LockedAffinity` is either empty (unlocked) OR matches the claim's affinity values for ALL keys @@ -774,7 +774,7 @@ cycle begins. schema and extract the required structured parameters. 4. If decoding succeeds and all required affinity keys are present as strings, populate `AffinityStates[deviceID].LockedAffinity` with those values. -5. If the claim has **no** recognized `StructuredParameters` entry, malformed +5. If the claim has no recognized `StructuredParameters` entry, malformed structured parameters, non-string values for a required affinity key, or multiple recognized structured-parameters entries for the same request, set `AffinityStates[deviceID].Unknown = true` and log a warning. The scheduler @@ -869,9 +869,9 @@ spec: #### Multi-key SharingAffinity Example This example illustrates the alpha semantics when a device constrains sharing on -**multiple keys**. +multiple keys. -A driver advertises a shared RDMA-capable NIC where both **subnet** and **PKey** +A driver advertises a shared RDMA-capable NIC where both subnet and PKey must match for pods to share the same device: ```yaml @@ -910,12 +910,12 @@ Alpha matching behavior: - If the device is clean, the first compatible claim sets the lock to: - `subnet = subnet-a` - `pkey = 0x8001` -- A later claim with the **same** `subnet` and **same** `pkey` may share the +- A later claim with the same `subnet` and same `pkey` may share the device. - A claim with `subnet = subnet-a` but `pkey = 0x8002` is rejected for that - device because **all declared keys must match**. + device because all declared keys must match. - A claim that provides only `subnet` but omits `pkey` is rejected for that - device because **missing declared keys are invalid**. + device because missing declared keys are invalid. - The extra `vlan` key is ignored for this device because the driver did not declare `networking.example.com/vlan` in `parameterKeys`. @@ -990,25 +990,25 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. 
sharing-constrained devices - Invalid value type: a required affinity key encoded as a non-string value is rejected for `sharingAffinity` scheduling and does not populate lock state -- **Ghost Lock**: Pod is Assumed (tentative lock set) but Bind fails — verify +- Ghost Lock: Pod is Assumed (tentative lock set) but Bind fails — verify the lock is cleared immediately and the next Pod in the queue can claim the device with a different affinity value -- **Legacy Device Migration**: 5 Pods are already running on a NIC; the driver +- Legacy Device Migration: 5 Pods are already running on a NIC; the driver updates `ResourceSlice` to add `sharingAffinity`; a 6th Pod arrives with structured parameters — verify the device has `AffinityStates[deviceID].Unknown` set and the 6th Pod is filtered from that device until all legacy claims drain -- **Partial Key**: Device requires `subnet` and `pkey` in `parameterKeys`; +- Partial Key: Device requires `subnet` and `pkey` in `parameterKeys`; claim provides only `subnet` — verify the device is filtered out -- **Score Packing**: Two devices available, one already locked to subnet-X; +- Score Packing: Two devices available, one already locked to subnet-X; new claim for subnet-X → verify the claim is placed on the locked device, not the clean one (full Filter→Score→Reserve pipeline) -- **Permissive Sharing (no SA)**: Device without `sharingAffinity`, claim with +- Permissive Sharing (no SA): Device without `sharingAffinity`, claim with `StructuredParameters` — verify scheduler allows the allocation and SP are not evaluated for affinity -- **Driver Backstop**: Device without `sharingAffinity`, two claims with +- Driver Backstop: Device without `sharingAffinity`, two claims with incompatible config land on the same device — verify scheduler allows both (permissive), and `NodePrepareResources` rejects the incompatible claim -- **NodePrepareResources failure does not clear lock**: Claim is bound and +- NodePrepareResources failure does not clear lock: Claim is bound and lock is persisted, but `NodePrepareResources` fails on the node — verify the affinity lock remains in the scheduler cache @@ -1037,8 +1037,8 @@ Existing DRA scheduling tests should pass before adding sharing affinity tests. - Scheduler tracks affinity in AllocatedState - Unit and integration tests - Documentation for driver authors -- Alpha documentation explicitly calls out the lack of lock-aware fairness and - the absence of lock-breaking/preemption semantics for incompatible locks +- Alpha documentation explicitly calls out the lack of lock-breaking preemption + semantics for incompatible locks - Alpha documentation explicitly calls out string-only affinity matching and the rejection of non-string values for `sharingAffinity` keys @@ -1083,10 +1083,10 @@ This can happen during driver upgrades or when enabling the feature on existing hardware. The scheduler handles this as follows: - **Pre-existing claims continue to run** and are not evicted. -- If any active claim on that device does **not** provide reconstructable +- If any active claim on that device does not provide reconstructable affinity values for the required keys, the scheduler marks the device as `AffinityStates[deviceID].Unknown = true`. -- A device with `AffinityStates[deviceID].Unknown` set is **not eligible** for new +- A device with `AffinityStates[deviceID].Unknown` set is not eligible for new `sharingAffinity` placements, even if it still has nominal shared capacity. 
- **Once all active claims on that device are released**, the device becomes clean and subsequent allocations can establish and enforce affinity normally. @@ -1135,8 +1135,8 @@ the device is excluded. time, especially during skew where an older scheduler may not enforce the affinity constraints. -During version skew, the main outcomes are **permissive scheduling** by an older -scheduler or **conservative filtering** by a newer scheduler when affinity state +During version skew, the main outcomes are permissive scheduling by an older +scheduler or conservative filtering by a newer scheduler when affinity state cannot be reconstructed. Both are operationally safe as long as the driver continues rejecting incompatible prepare-time configurations. @@ -1293,7 +1293,8 @@ for example: ###### Will enabling / using this feature result in any new API calls? -No. Affinity is evaluated using existing ResourceSlice and ResourceClaim data. +No new API calls. Affinity data is extracted from ResourceSlice and ResourceClaim +objects already fetched by existing informers. ###### Will enabling / using this feature result in introducing new API types? @@ -1305,16 +1306,30 @@ No. ###### Will enabling / using this feature result in increasing size or count of the existing API objects? -- ResourceSlice: Small increase (~50-100 bytes) for devices with `sharingAffinity` -- AllocatedState (in-memory): Small increase for tracking affinity values +- ResourceSlice: Small increase per device with `sharingAffinity` — the + `parameterKeys` field adds up to 8 fully-qualified key names (capped by + `SharingAffinityParameterKeysMaxSize = 8`) +- ResourceClaim: Small increase when claims include a `StructuredParameters` + opaque config entry with affinity values (up to 8 string-valued attributes, + matching the device-side cap) ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? -Negligible. Affinity check is a simple map lookup, O(1) per attribute key. +Negligible. The Filter phase decodes the `StructuredParameters` opaque config +payload once per candidate device (bounded by payload size and 8-key cap). +The affinity comparison itself is O(k) where k ≤ 8 — a map lookup per key. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? -No. Memory increase for `AllocatedState.AffinityStates` is proportional to active shared allocations, not total devices. +No. The per-component impact is bounded: + +- **Scheduler RAM**: `AffinityStates` adds one `map[string]string` (up to 8 + entries) per device with active affinity locks — proportional to active shared + allocations, not total devices. +- **Scheduler CPU**: JSON decoding of the `StructuredParameters` opaque config + entry during Filter adds a small per-candidate cost, bounded by the 8-key cap. +- **etcd disk**: Slightly larger ResourceSlice and ResourceClaim objects (see + API size answer above), bounded by the same caps. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? @@ -1326,9 +1341,9 @@ No. Like existing scheduler-driven DRA logic, this feature depends on informer state and cached API data. Temporary API server or etcd unavailability does not by -itself invalidate already-computed in-memory lock state, but sustained control- -plane unavailability may delay reconciliation of claim release, slice updates, -or restart reconstruction. 
+itself invalidate already-computed in-memory lock state, but new pods will not +be scheduled during unavailability. Sustained control-plane unavailability may +delay reconciliation of claim release, slice updates, or restart reconstruction. The driver remains the final enforcement authority at prepare time. @@ -1349,6 +1364,10 @@ Known failure modes include: - **Prepare-time driver rejection**: despite scheduler filtering, the driver may still reject an incompatible or stale placement and that rejection is the final safety backstop. +- **Partial feature gate enablement**: if the feature gate is enabled on the API + server but not the scheduler (or vice versa), the `sharingAffinity` field may + be persisted but not enforced, or enforced but not accepted on writes. Ensure + the gate is enabled on both `kube-apiserver` and `kube-scheduler`. ###### What steps should be taken if SLOs are not being met to determine the problem? @@ -1359,7 +1378,9 @@ Recommended debugging flow: 2. Inspect the `ResourceClaim` and confirm there is exactly one recognized `StructuredParameters` entry for the relevant request. 3. Verify that every required affinity key is present and string-valued. -4. Check whether the target device is already locked to incompatible values. +4. Check whether the target device is already locked to incompatible values + (lock state is in the scheduler's in-memory cache — check scheduler logs + for filter reasons mentioning affinity mismatch). 5. Check whether the device is being treated as having **unknown affinity state** because of legacy or invalid active claims. 6. Review scheduler logs/events for explicit filter reasons. @@ -1377,7 +1398,8 @@ unschedulable errors whenever possible. ## Drawbacks -- Adds complexity to the scheduler's allocation logic and cache reconstruction +- Adds a new cache dimension (`AffinityStates`) to the scheduler's allocation + tracking, increasing the surface area for reconstruction bugs on restart - Once a device is locked, its effective affinity cannot change until all claims on that device are released - Fragmentation risk remains if affinity values are too fine-grained @@ -1385,14 +1407,23 @@ unschedulable errors whenever possible. schedulable capacity during rollout or migration - If a driver declares `sharingAffinity` on a device but no claims ever provide `StructuredParameters`, that device becomes effectively unschedulable for - sharing workloads — all claims are filtered out by the "Strict Gating" rule + sharing workloads — all claims are filtered out by the "Strict Gating" rule. + Drivers should coordinate with workload teams to ensure claims include + `StructuredParameters` before enabling `sharingAffinity` on devices. +- The scheduler now depends on decoding a well-known JSON schema from opaque + config — a new coupling that didn't exist before. If the schema evolves, + backward compatibility must be maintained across scheduler versions. +- Affinity locks are purely in-memory with no API or status field to inspect + which devices are locked to which values. Debugging lock state requires + scheduler logs. 
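Since the lock is visible only inside the scheduler process, the entire
compatibility decision reduces to a small in-memory check. The following Go
sketch restates the Filter-phase logic from [Design Details](#design-details);
the type and function names are illustrative assumptions, not the final
implementation.

```go
// Illustrative sketch only; real scheduler types may differ.
package sharingaffinity

// DeviceAffinityState is the assumed per-device entry in
// AllocatedState.AffinityStates.
type DeviceAffinityState struct {
	// LockedAffinity holds the values the device is locked to, keyed by
	// fully-qualified parameter key. Empty means the device is unlocked.
	LockedAffinity map[string]string
	// Unknown marks devices whose effective affinity cannot be
	// reconstructed from active claims (legacy or invalid claims).
	Unknown bool
}

// compatible reports whether a claim with the given decoded affinity values
// may share a device that declares parameterKeys.
func compatible(state DeviceAffinityState, parameterKeys []string, claimValues map[string]string) bool {
	if state.Unknown {
		// Conservative fallback: non-reconstructable state filters the
		// device until it drains clean.
		return false
	}
	for _, key := range parameterKeys {
		want, ok := claimValues[key]
		if !ok {
			return false // every declared key is mandatory on the claim
		}
		if len(state.LockedAffinity) == 0 {
			continue // unlocked: this claim may establish the lock
		}
		if got, locked := state.LockedAffinity[key]; !locked || got != want {
			return false // locked: every declared key must match
		}
	}
	return true
}
```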
## Alternatives ### Claim-side SharingAffinity (on DeviceRequest) -An alternative design places `SharingAffinity` on the `DeviceRequest` within -ResourceClaim, allowing the user to define it per workload: +Instead of using StructuredParameters in opaque config to supply constraint +values, an alternative design adds a dedicated `SharingAffinity` field on +`DeviceRequest` within ResourceClaim: ```go type DeviceRequest struct { @@ -1408,13 +1439,12 @@ type SharingAffinity struct { ``` **Rejected because**: -- The modal constraint is a property of the hardware, not the workload— - the driver knows that a NIC locked to subnet A can't serve subnet B -- Requires every claim to repeat the *constraint definition* (attribute name, - strategy) in addition to the value—the driver-side design declares the - constraint once on the device and claims only provide values -- Users must understand the sharing constraint mechanism and explicitly opt - into it, rather than simply providing config values they'd specify anyway +- **Requires an API change to ResourceClaim**: Adding a typed `SharingAffinity` + field to `DeviceRequest` introduces a new API field, whereas supplying the + same constraint values via StructuredParameters in existing opaque config + requires no API change at all. The opaque config path is the preferred + approach per @pohly's guidance—use well-known schemas inside existing opaque + config rather than adding new structured fields to ResourceClaim. ### Object Reference-based Affinity Matching @@ -1443,7 +1473,7 @@ commonConfigKind: # new field ``` **Rejected because**: -- Requires new fields on **both** ResourceClaim (`objectRefs`) and Device +- Requires new fields on both ResourceClaim (`objectRefs`) and Device (`commonConfigKind`), whereas the chosen approach adds a field only to Device and uses existing opaque config for claim-side values - Requires external CRD definitions, adding operational burden for cluster @@ -1455,7 +1485,7 @@ commonConfigKind: # new field ### Placeholder Pattern Workaround -Without this KEP, drivers must use a "placeholder pattern": +Without this KEP, drivers must use a "placeholder pattern" today: 1. Publish devices with `capacity: 1` initially 2. Wait for first claim to determine affinity value @@ -1526,10 +1556,6 @@ devices: reasoning about mutable runtime state, which is a qualitatively different problem better served by a purpose-built mechanism. -A CEL-based approach may become viable in the future if the DRA CEL environment -is extended to support runtime allocation state (see -[Future Enhancements: CEL-based Lock Expressions](#cel-based-lock-expressions)). - ## Future Enhancements The following ideas are out of scope for alpha but are worth exploring in @@ -1537,11 +1563,11 @@ beta/GA based on real-world feedback: ### Priority-based Lock Preemption -This section addresses a deliberate **alpha limitation**: alpha enforces lock -compatibility, but does not provide lock-aware fairness or any mechanism for a +This section addresses a deliberate alpha limitation: alpha enforces lock +compatibility, but does not provide any mechanism for a higher-priority Pod to break an incompatible lock. -Standard Kubernetes preemption is **blind to affinity locks**. It triggers on +Standard Kubernetes preemption is blind to affinity locks. It triggers on *resource shortage* (insufficient CPU, memory, or device slots), not on qualitative state mismatch. 
This creates a critical gap: @@ -1572,11 +1598,25 @@ phase: This is scoped for Beta because the core Filter/Reserve/Score mechanism must be proven in Alpha first, and lock-aware preemption requires careful -integration with the existing DRA preemption path. +integration with the existing DRA preemption path. Key design considerations +include: + +- **Victim minimization**: When multiple devices could satisfy the incoming Pod, + the preemption logic should prefer the device with the fewest lock-holding + Pods to minimize disruption. +- **Atomicity**: Preemption in Kubernetes is asynchronous—victim Pods are + deleted but do not disappear instantly. During the eviction window the old + lock is still active, so a newly-arriving compatible Pod could land on the + device and re-establish the lock, creating a preemption cascade. Standard + preemption solves the analogous problem with NominatedNode; lock-breaking + would need a similar mechanism (e.g., marking the device's lock as + "transitioning to the new value for the preempting Pod") so that future + scheduling cycles treat the device as locked to the new value, filtering + out Pods compatible only with the old lock. ### SharingStrategy (`CanSetLock` / `NeverSetLock`) -Alpha intentionally does **not** let claims control whether they may establish +Alpha intentionally does not let claims control whether they may establish a new lock on a clean device. Any compatible claim can set the initial lock, and the scheduler then packs subsequent compatible claims onto that device. @@ -1588,7 +1628,12 @@ side to control lock-setting behavior. Two candidate strategies are: - **`NeverSetLock`**: The claim may only be allocated to a device that already has a matching lock established by another claim. This is useful for background or batch jobs that should never consume a clean device and - potentially fragment capacity. + potentially fragment capacity. **Caveat**: `NeverSetLock` is a follower-only + strategy — it requires at least one `CanSetLock` claim to establish the lock + first. If no device is locked to the requested value, a `NeverSetLock` pod + will remain unschedulable indefinitely. Implementations should document this + dependency clearly and consider surfacing a scheduling event when a pod is + blocked waiting for a lock that no leader has established. If introduced in beta or later, the scheduler would evaluate this policy before capacity and key matching for unlocked devices. A claim with `NeverSetLock` @@ -1601,7 +1646,7 @@ tracking via structured parameters. ### Soft / Preferred Affinity Keys -The Alpha design enforces **hard all-or-nothing** matching: all declared +The Alpha design enforces hard all-or-nothing matching: all declared `parameterKeys` must match or the device is filtered out. Real-world hardware may have hierarchical constraints where some keys are strict sharing requirements (e.g., Subnet) and others are scheduling preferences (e.g., @@ -1619,40 +1664,11 @@ only enforcing hard locks on Subnet. The lock itself would only be set for scheduling. This avoids complicating the atomic lock model while still enabling soft optimization. -### Lock Decay / Sticky Scoring - -When a device is recently unlocked (all claims released), the hardware may or -may not retain its previous configuration depending on driver behavior — some -drivers keep the state (e.g., a loaded FPGA bitstream), others tear it down -immediately. 
For drivers that preserve state, a time-decaying score bonus for -recently-unlocked devices matching the previous affinity value would improve -scheduling by avoiding expensive reconfiguration. This would require the -scheduler to track historical lock values with a TTL, and would only benefit -drivers that signal "warm" state — likely via a device attribute. - -### CEL-based Lock Expressions - -As Kubernetes moves toward CEL for policy evaluation, a future enhancement -could allow drivers to publish CEL expressions on the ResourceSlice that -evaluate affinity compatibility (e.g., -`device.affinityLock['pkey'] == '' || device.affinityLock['pkey'] == claim.AffinityValues['pkey']`). -This would require extending the CEL evaluation context to include runtime -allocation state, which is a substantial change warranting its own KEP. - ### Typed Affinity Values Beyond Strings -Alpha intentionally limits `sharingAffinity` matching to `string` values, which -keeps equality semantics simple and matches the scheduler's current in-memory -representation (`map[string]string`). A future enhancement could extend this to -additional `DeviceAttribute` value types once Kubernetes defines stable -normalization and equality rules for those types in scheduler-owned lock state. - -That future work would need to answer questions such as: - -- How non-string values are normalized before comparison -- Whether different encodings of the same logical value are considered equal -- How typed values are stored in `AllocatedState` without introducing ambiguous - comparisons or upgrade hazards +Alpha limits affinity matching to string equality (`map[string]string`), which +covers all known use cases (subnets, bitstreams, partition keys). Non-string +types could be added in the future if concrete use cases arise. ## Infrastructure Needed