KEP-5981: Provisional KEP for DRA Sharing Affinity for Conditional Fungibility #5987

Draft
ashvindeodhar wants to merge 4 commits into kubernetes:master from ashvindeodhar:kep-5981

@ashvindeodhar

  • One-line PR description: Proposing KEP-5981 to extend Dynamic Resource Allocation (DRA) with a "Sharing Affinity" mechanism that allows the scheduler to handle resources that are conditionally fungible.
  • Other comments: This PR introduces KEP-5981, which addresses the challenge of Conditional Fungibility in shared hardware resources (e.g., RDMA NICs, FPGAs, GPUs). Currently, DRA tracks quantitative capacity (KEP-5075 Consumable Capacity) but lacks awareness of "modal locks"—scenarios where an initial allocation restricts remaining capacity to a specific configuration.
    Key technical contributions in this KEP include:
    • ResourceSlice Extension: Adds sharingAffinity to the Device spec to declare attribute keys that constrain sharing.
    • Scheduler Cache Enhancement: Updates AllocatedState to track AffinityValues, enabling an Atomic Cache Lock during the Filter and Reserve phases.
    • Staged Automation: Proposes a DeviceClass mapping strategy to solve the UX "double-entry" problem at the API boundary.
    • Priority-Aware Roadmap: Includes a graduation path for Lock-aware preemption to mitigate priority inversion risks in high-density clusters.

This design transitions DRA from Quantitative Sharing (how many slots?) to Qualitative Gating (what mode are those slots in?), which is essential for multi-tenant AI and HPC workloads.
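
The difference between quantitative sharing and qualitative gating can be illustrated with a minimal Go sketch. It is not the KEP's implementation: the `Device` type, its fields, and the `Fits`/`Reserve` helpers are hypothetical names, and rejecting claims that omit a gating key is an assumption made here for simplicity.

```go
package main

import "fmt"

// Device is a hypothetical, simplified view of one shared device in the
// scheduler cache. None of these field names come from the KEP.
type Device struct {
	FreeSlots    int               // quantitative capacity still available
	AffinityKeys []string          // keys the driver declared as constraining sharing
	Lock         map[string]string // values pinned by the first allocation; nil while unlocked
}

// Fits reports whether a claim carrying the given parameter values may
// share the device: there must be a free slot AND every gating key must
// match the existing lock, if one has been established.
func (d *Device) Fits(claim map[string]string) bool {
	if d.FreeSlots <= 0 {
		return false
	}
	for _, k := range d.AffinityKeys {
		v, ok := claim[k]
		if !ok {
			return false // assumption: a claim must state a value for every gating key
		}
		if d.Lock != nil && d.Lock[k] != v {
			return false // device is locked to a different mode
		}
	}
	return true
}

// Reserve consumes one slot and, on the first allocation, records the
// claim's values as the device's lock.
func (d *Device) Reserve(claim map[string]string) bool {
	if !d.Fits(claim) {
		return false
	}
	if d.Lock == nil {
		d.Lock = make(map[string]string, len(d.AffinityKeys))
		for _, k := range d.AffinityKeys {
			d.Lock[k] = claim[k]
		}
	}
	d.FreeSlots--
	return true
}

func main() {
	nic := &Device{FreeSlots: 4, AffinityKeys: []string{"networking.example.com/subnet"}}
	a := map[string]string{"networking.example.com/subnet": "subnet-a"}
	b := map[string]string{"networking.example.com/subnet": "subnet-b"}

	fmt.Println(nic.Reserve(a)) // first claim establishes the lock
	fmt.Println(nic.Reserve(b)) // other subnet: rejected although slots remain
	fmt.Println(nic.Reserve(a)) // same subnet: shares the device
}
```

In this sketch the second claim is rejected even though three slots are still free, which is the "modal lock" behavior that capacity tracking alone cannot express.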

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 31, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @ashvindeodhar!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko March 31, 2026 06:24
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 31, 2026
@k8s-ci-robot
Contributor

Hi @ashvindeodhar. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ashvindeodhar
Once this PR has been reviewed and has the lgtm label, please assign macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 31, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 31, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 31, 2026
@ashvindeodhar ashvindeodhar marked this pull request as draft March 31, 2026 06:25
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 31, 2026
@ashvindeodhar
Author

Moving to draft; please ignore the previous review request added automatically

/uncc @dom4ha
/uncc @macsko

@ashvindeodhar
Author

/cc @sunya-ch @LionelJouin

@sunya-ch
Contributor

sunya-ch commented Apr 3, 2026

@ashvindeodhar sharingAffinity.attributeKeys initially misled me. “Attribute” sounds like a device attribute, but it actually refers to fields inside the opaque StructuredParameters of the claim.

I think I raised a similar issue, from a slightly different angle, when the consumable capacity feature was introduced. I believe we discussed the case where a device is dynamically bound to its first allocation. More broadly, the binding is not limited to how the device will be consumed; in this case the device seems to bind to a concrete configuration, and only claims with the same configuration are allowed to share it (and consume its capacity).

How about an alternative approach? Instead of relying on attributes inside the opaque config, where every resource claim has to set the same values, we could add a field that lists references to runtime configuration objects. The device-side field could be something like commonConfigKind, listing the Kinds of runtime configuration objects that a claim must specify.

For example:

apiVersion: networking.example.com/v1alpha1
kind: NetworkConfiguration
metadata:
  name: subnet-a
spec:
  subnet: 10.0.1.0/24
---
kind: ResourceClaim
metadata:
  name: claim-net-a
spec:
  devices:
    config:
      objectRefs:
      - kind: NetworkConfiguration
        name: subnet-a
---
kind: ResourceSlice
metadata:
  name: net-devices
spec:
  devices:
    - name: eth1
      capacity:
        bandwidth: "10Gb"
      commonConfigKind:
      - NetworkConfiguration
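
One way to read the matching rule this alternative implies is sketched below in Go. It is only an interpretation of the example above, not a proposed API: `ObjectRef`, `refFor`, and `canShare` are hypothetical names, and the rule that a claim must reference the same object for every Kind listed in commonConfigKind is an assumption.

```go
package main

import "fmt"

// ObjectRef is a hypothetical stand-in for one entry of the claim's
// objectRefs list in the example above.
type ObjectRef struct {
	Kind string
	Name string
}

// refFor returns the object name a list of refs carries for a given Kind.
func refFor(refs []ObjectRef, kind string) (string, bool) {
	for _, r := range refs {
		if r.Kind == kind {
			return r.Name, true
		}
	}
	return "", false
}

// canShare reports whether a claim may join a device. commonKinds is the
// device's commonConfigKind list; bound holds the refs of the first
// allocation (nil while the device is unused).
func canShare(commonKinds []string, bound, claim []ObjectRef) bool {
	for _, kind := range commonKinds {
		want, ok := refFor(claim, kind)
		if !ok {
			return false // claim does not reference a required Kind
		}
		if have, isBound := refFor(bound, kind); isBound && have != want {
			return false // device already bound to a different object
		}
	}
	return true
}

func main() {
	kinds := []string{"NetworkConfiguration"}
	claimA := []ObjectRef{{Kind: "NetworkConfiguration", Name: "subnet-a"}}
	claimB := []ObjectRef{{Kind: "NetworkConfiguration", Name: "subnet-b"}}

	fmt.Println(canShare(kinds, nil, claimA))    // unused device accepts the first claim
	fmt.Println(canShare(kinds, claimA, claimB)) // bound to subnet-a, so subnet-b is refused
	fmt.Println(canShare(kinds, claimA, claimA)) // same object: sharing allowed
}
```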

@ashvindeodhar
Author

Thanks @sunya-ch! You're right that the "attributeKeys" naming is confusing. I'll look at renaming it to something that makes the distinction from device attributes clearer (e.g., configKeys or parameterKeys).

I considered the object-reference approach, but there are a few reasons the well-known schema inside opaque config may work better for this KEP:

  • Minimum API surface: The current design adds one new field to ResourceSlice (sharingAffinity.attributeKeys on Device) and zero new fields on ResourceClaim - claim-side values go into the existing OpaqueDeviceConfiguration, using a well-known JSON schema the scheduler can decode. The object-reference approach would require new fields on both sides (ResourceSlice commonConfigKind + ResourceClaim objectRefs) plus external CRDs.
    OpaqueDeviceConfiguration is the direction Patrick suggested in this comment

  • Multi-dimensional affinity: A device may need affinity on multiple independent axes (e.g., subnet + VLAN). With object references, each axis would need its own CRD (NetworkConfiguration, VlanConfiguration), whereas with the well-known schema it reduces to a single payload:

    {
      "kind": "StructuredParameters",
      "attributes": {
        "networking.example.com/subnet": {"string": "subnet-X"},
        "networking.example.com/vlan": {"int": 100}
      }
    }
  • Consistency with related proposals: I am also exploring a complementary proposal for Contextual Capacity Management (CCM) — where the scheduler needs to read capacity hints inline from the claim to handle context-dependent capacity (e.g., a NIC whose total capacity changes based on the subnet of the first allocation). That proposal also uses a well-known schema inside opaque config (CapacityHint). Keeping both on the same pattern gives a unified architecture: the opaque config mechanism supports a family of well-known schemas that the scheduler decodes synchronously — for affinity, for capacity and for any future scenario.
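
As a rough illustration of how such a payload could be decoded synchronously, here is a minimal Go sketch using only `encoding/json`. The Go type names, the `decodeAffinity` helper, and the canonical `type:value` string form are assumptions for this example; only the JSON shape is taken from the snippet above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// typedValue mirrors one attribute entry from the payload above,
// e.g. {"string": "subnet-X"} or {"int": 100}.
type typedValue struct {
	String *string `json:"string"`
	Int    *int64  `json:"int"`
}

// structuredParameters mirrors the top-level JSON object.
type structuredParameters struct {
	Kind       string                `json:"kind"`
	Attributes map[string]typedValue `json:"attributes"`
}

// decodeAffinity parses the opaque payload and returns one comparable
// string per key, so several axes can be matched with plain equality.
func decodeAffinity(raw []byte) (map[string]string, error) {
	var sp structuredParameters
	if err := json.Unmarshal(raw, &sp); err != nil {
		return nil, fmt.Errorf("malformed payload: %w", err)
	}
	if sp.Kind != "StructuredParameters" {
		return nil, fmt.Errorf("unexpected kind %q", sp.Kind)
	}
	out := make(map[string]string, len(sp.Attributes))
	for key, v := range sp.Attributes {
		switch {
		case v.String != nil && v.Int == nil:
			out[key] = "string:" + *v.String
		case v.Int != nil && v.String == nil:
			out[key] = "int:" + strconv.FormatInt(*v.Int, 10)
		default:
			return nil, fmt.Errorf("attribute %q must set exactly one of string or int", key)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{
	  "kind": "StructuredParameters",
	  "attributes": {
	    "networking.example.com/subnet": {"string": "subnet-X"},
	    "networking.example.com/vlan": {"int": 100}
	  }
	}`)
	vals, err := decodeAffinity(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(vals["networking.example.com/subnet"]) // string:subnet-X
	fmt.Println(vals["networking.example.com/vlan"])   // int:100
}
```

Both axes travel in one payload and are compared key by key, so adding a further axis does not require a new object type.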

Let me know what you think. That said, the naming concern is valid and I'll address it.

   - Resolve placement decision: SharingAffinity stays on ResourceSlice
     (driver-side) with rationale for why hardware modal constraints
     belong on the device, not the workload

   - Resolve claim-value delivery: adopt well-known JSON schema inside
     OpaqueDeviceConfiguration per @pohly's guidance; define normative
     StructuredParameters contract (recognition, uniqueness, coexistence,
     conflict handling, string-only alpha, malformed payloads, missing
     entries, validation intent)

   - Defer CanSetLock/NeverSetLock to Future Enhancements; alpha allows
     any compatible claim to establish the initial lock

   - Replace grandfathered-claim model with conservative unknown-affinity
     handling: devices with non-reconstructable active claims are filtered
     out until fully clean (no optimistic lock-setting over legacy claims)

   - Add Safety Model and Responsibility Split section clarifying
     scheduler guarantee vs driver guarantee vs conservative fallback

   - Introduce AffinityState struct with Unknown flag; replace flat
     AffinityValues map with AffinityStates map[DeviceID]AffinityState

   - Expand Filter phase to 7-step evaluation including UnknownAffinity
     check, exactly-one StructuredParameters entry, schema decode,
     string-type enforcement

   - Add normative Score ordering (locked-compatible > clean > filtered)

   - Add explicit alpha limitations for lock-aware fairness and
     preemption blindness throughout Summary, Non-Goals, Proposal,
     and Risks sections

   - Add string-only matching constraint with rationale to Notes,
     Filter, StructuredParameters contract, and new Future Enhancement
     (Typed Affinity Values Beyond Strings)

   - Add Multi-key SharingAffinity example with subnet+pkey walkthrough

   - Expand reconstruction algorithm to handle malformed, non-string,
     and duplicate structured-parameters entries

   - Harden Risks section: rename Stale Affinity View to Cache Staleness,
     add alpha limitation callout to Priority Inversion

   - Remove stale SharingAffinityMapping reference

   - Add Priority-based Lock Preemption and SharingStrategy to Future
     Enhancements with detailed rationale

   - Update Graduation Criteria, Upgrade/Downgrade, Version Skew,
     PRR sections for conservative unknown-affinity handling
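
The AffinityState and Score-ordering items in this commit message could look roughly like the Go sketch below. Only the names `AffinityState`, `Unknown`, `AffinityStates`, and `DeviceID` come from the commit message; the `Values` field, the `evaluate` and `pickBest` helpers, and the use of a single key list for all devices are assumptions, and the full 7-step Filter is not reproduced.

```go
package main

import "fmt"

type DeviceID string

// AffinityState is named in the commit message; the Values field is an
// assumed representation of the lock (nil while the device is clean).
type AffinityState struct {
	Unknown bool              // active claims whose affinity cannot be reconstructed
	Values  map[string]string // lock established by the first allocation
}

// verdict encodes the ordering locked-compatible > clean > filtered.
type verdict int

const (
	filtered verdict = iota
	clean
	lockedCompatible
)

// evaluate classifies one device for one claim.
func evaluate(st AffinityState, keys []string, claim map[string]string) verdict {
	if st.Unknown {
		return filtered // conservative: never lock over non-reconstructable claims
	}
	for _, k := range keys {
		v, ok := claim[k]
		if !ok || (st.Values != nil && st.Values[k] != v) {
			return filtered
		}
	}
	if st.Values == nil {
		return clean
	}
	return lockedCompatible
}

// pickBest prefers devices already locked to a compatible value, so clean
// devices stay available for claims with other values. Ties are broken by
// device ID to keep the result deterministic.
func pickBest(states map[DeviceID]AffinityState, keys []string, claim map[string]string) (DeviceID, bool) {
	var best DeviceID
	bestV := filtered
	for id, st := range states {
		v := evaluate(st, keys, claim)
		if v > bestV || (v == bestV && v != filtered && id < best) {
			best, bestV = id, v
		}
	}
	return best, bestV != filtered
}

func main() {
	keys := []string{"subnet"}
	states := map[DeviceID]AffinityState{
		"nic-0": {Unknown: true},
		"nic-1": {},
		"nic-2": {Values: map[string]string{"subnet": "a"}},
		"nic-3": {Values: map[string]string{"subnet": "b"}},
	}
	fmt.Println(pickBest(states, keys, map[string]string{"subnet": "a"})) // nic-2 true
	fmt.Println(pickBest(states, keys, map[string]string{"subnet": "c"})) // nic-1 true
}
```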
@sunya-ch
Contributor

sunya-ch commented Apr 8, 2026

@ashvindeodhar Thank you for the explanation; I understand. Still, could you please add the option above to the Alternatives section for documentation purposes?

On naming, +1 for parameterKeys. And instead of sharingAffinity, could it be configAffinity, for DeviceConfigAffinity?

   - Add compatibility matrix (5 scenarios × Scheduler/Driver Outcome columns)
     showing dual-enforcement model for SA+SP combinations
   - Add Enablement and Rollout Dynamics section with unknown-affinity safety
     valve, missing/malformed parameter handling, version skew/rollback, and
     recommended rollout sequence for drivers
   - Add Object Reference-based Affinity Matching as a rejected alternative
     with rationale (API surface, multi-dimensional affinity, @pohly direction)
   - Add drawback: devices with sharingAffinity but no SP claims become
     unschedulable under Strict Gating