diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md new file mode 100644 index 000000000000..b1fac46caed3 --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -0,0 +1,1104 @@ +# KEP-5963: DRA Device Compatibility Groups + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [CompatibilityGroups Assignment](#compatibilitygroups-assignment) + - [Examples](#examples) + - [Example 1: What the existing API enables](#example-1-what-the-existing-api-enables) + - [Example 2: How the existing API does not solve the problem](#example-2-how-the-existing-api-does-not-solve-the-problem) + - [Example 3: How the proposed API solves the problem](#example-3-how-the-proposed-api-solves-the-problem) + - [Example 4: Multiple compatible groups with an incompatible group](#example-4-multiple-compatible-groups-with-an-incompatible-group) + - [Scheduler Changes](#scheduler-changes) + - [Interaction with Multi-Request Claims and Device Constraints](#interaction-with-multi-request-claims-and-device-constraints) + - [Driver Responsibilities](#driver-responsibilities) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - 
[Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Current Workaround: Driver-level Preparation Failure](#current-workaround-driver-level-preparation-failure) + - [Inverted naming: `mutualExclusionGroups`](#inverted-naming-mutualexclusiongroups) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR) +- (R) KEP approvers have approved the KEP status as `implementable` +- (R) Design details are appropriately documented +- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - e2e Tests for all Beta API Operations (endpoints) + - (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - (R) Minimum Two Week Window for GA e2e tests to prove flake free +- (R) Graduation criteria is in place + - (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- (R) Production readiness review completed +- (R) Production readiness review approved +- "Implementation History" section is up-to-date for milestone +- User-facing 
documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
+- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+## Summary
+
+This KEP proposes an extension to the Dynamic Resource Allocation (DRA) API to
+support mutually exclusive device allocation constraints. Hardware devices often
+support multiple partitioning or virtualization schemes (for example, GPU MIG
+slicing vs. MPS sharing) that provide different trade-offs in terms of isolation,
+performance, and resource sharing. These schemes are frequently mutually exclusive
+at the hardware level: once a physical device is partitioned or configured using
+one scheme, it cannot be reconfigured to use a different scheme until all existing
+allocations are released.
+
+The current DRA Partitionable Devices API has no mechanism for drivers to express
+these mutual exclusivity constraints. A shared counter with a capacity of one
+can enforce mutual exclusion in general, but not here: such a counter would
+have to be decremented once when allocating the *first* device from a set of
+compatible devices, not once for *each* device, which the current counter
+model cannot express.
+
+Without a mechanism for this, incompatible allocations are only
+detected during resource preparation, after the scheduler has already made its
+decisions, leading to pod startup failures and resource thrashing. This KEP
+introduces API and scheduler changes so that compatibility constraints can be
+declared in ResourceSlice objects and enforced at scheduling time.
+
+## Motivation
+
+Hardware devices often support multiple partitioning or virtualization schemes
+that are mutually exclusive at the hardware level.
For example, an NVIDIA GPU +can be configured for MIG (Multi-Instance GPU) slicing or MPS (Multi-Process +Service) sharing, but not both simultaneously on the same physical device. + +Without a mechanism to express these constraints in DRA, the following problems +arise: + +1. **Late Failure Detection**: Incompatible allocations are only detected during + resource preparation, after scheduling decisions have already been made. +2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices, + leading to pod startup failures. +3. **Poor User Experience**: Users receive cryptic preparation failures instead + of clear scheduling feedback. +4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible + allocations before giving up. + +The current workaround—having DRA drivers fail resource preparation when +incompatible allocations are attempted—is insufficient because it provides no +mechanism to inform the scheduler, and does not prevent repeated failed attempts. + +### Goals + +- Allow DRA drivers to specify compatibility between virtual devices within a +single physical device. +- Allow the scheduler to make informed allocation decisions that respect +compatibility rules declared in ResourceSlice objects. +- Provide a generic mechanism applicable to any hardware with partitioning +constraints, not just GPUs. +- Maintain backward compatibility with existing ResourceSlice specifications. + +### Non-Goals + +- Allowing DRA drivers to specify compatibility between devices that do not + share a counter set. The scope of compatibility constraints is limited to + virtual devices consuming from the same counter set (which, by convention, + represents a single underlying physical device). +- Providing a centralized or cluster-wide registry of compatibility group + names. Group names are opaque strings scoped to a single ResourceSlice pool + and are meaningful only to the driver that publishes them. 
+- Enabling the scheduler to *reconfigure* a physical device between + partitioning schemes (e.g., MIG ↔ MPS) as part of scheduling. This KEP only + addresses rejecting incompatible allocations; transitions between schemes + remain a driver concern and typically require draining existing allocations. +- Expressing compatibility constraints on `ResourceClaim` objects. The field + is driver-authored and lives only on `ResourceSlice`. +- Replacing existing counter-capacity checks. `compatibilityGroups` is an + additional predicate; capacity math on `sharedCounters` continues to apply + unchanged. + +## Proposal + +**CompatibilityGroups Assignment** + +Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which +named groups they belong to. For two devices consuming counters from the same +counter set to be co-allocated, they must share at least one compatibility group. + +Devices that omit this field are compatible only with other devices in the same +counter set that also omit it. Existing ResourceSlices (where no device sets the +field) continue to behave as today; drivers adopting this feature should annotate +every device sharing a counter set. + +### User Stories + +#### Story 1 + +As a GPU operator using NVIDIA GPUs, I want to express in my ResourceSlice +that MIG-partitioned virtual devices and MPS-sharing virtual devices on the +same physical GPU are mutually exclusive. When a pod requesting a MIG partition +is already running on a GPU, I want the scheduler to automatically exclude all +MPS devices on that same GPU from consideration for new allocations, rather than +allowing an allocation that will fail at device preparation time. 
+
+#### Story 2
+
+As a hardware vendor publishing DRA drivers for an accelerator that supports
+multiple exclusive operating modes (for example, exclusive mode, software
+partitioning, and hardware partitioning), I want to declare the compatibility
+constraints directly in my ResourceSlice, so that the Kubernetes scheduler
+can enforce those constraints without requiring my driver to fail pod startup
+with cryptic error messages.
+
+### Notes/Constraints/Caveats
+
+The compatibility relation is **symmetric**: if A can be co-allocated with B,
+then B can be co-allocated with A. It is **not transitive**: A and B sharing a
+group, and B and C sharing a group, does not imply A and C share one.
+Concretely, the scheduler evaluates the pairwise predicate
+`groups(A) ∩ groups(B) ≠ ∅` against every already-allocated device on the same
+counter set; it does not compute transitive closures. Drivers that want three
+device types to be mutually co-allocatable must ensure every pair shares at
+least one group (see Example 4).
+
+### Risks and Mitigations
+
+**Scheduler performance impact**: Evaluating compatibility constraints during
+device selection adds work to each scheduling cycle that involves DRA devices.
+Mitigation: the added work is a pairwise group-intersection check scoped to
+devices that share a counter set, with small group counts in practice (see
+Scheduler Changes → Complexity), and it is a no-op when no device in a slice
+declares the field.
+
+**Older schedulers ignoring new field**: A kube-scheduler that does not
+understand `compatibilityGroups` will ignore the field and may allocate
+incompatible devices. This degrades to the current behavior (driver fails at
+preparation time). Mitigation: document the version skew behavior clearly;
+drivers must still validate at preparation time even when the scheduler
+enforces constraints.
+
+**Incorrect driver declarations**: If a driver declares incorrect compatibility
+constraints, the scheduler may either reject valid allocations or permit invalid
+ones. Mitigation: the API is driver-authored and opt-in; drivers are responsible
+for correctness and documentation of their compatibility matrix.
+
+## Design Details
+
+### API
+
+#### CompatibilityGroups Assignment
+
+A new field `compatibilityGroups` is added inside each entry of
+`device.consumesCounters[]`. It contains a list of string group names.
+For two devices consuming counters from the same counter set to be allocated
+together, either both must omit the field (or set it to an empty list), or both
+must declare the field and share at least one group name. A nil
+`compatibilityGroups` and an empty `compatibilityGroups: []` are treated
+identically. This means a device that declares a non-empty list is never
+co-allocatable with a sibling that omits the field (or sets it empty), since
+the two can share no group name.
+
+The field is placed on each `consumesCounters[]` entry rather than on the
+device itself because compatibility is a physical-hardware property scoped to
+the shared resource represented by the counter set. A single virtual device
+that consumes from multiple counter sets may therefore declare different
+groups per counter set, reflecting different exclusivity constraints on
+different pieces of underlying hardware. Two devices that do not share any
+counter set are never compared via this field, even if they live on the same
+node or in the same `ResourceSlice`.
+
+**Naming convention used in examples.** A device of type `T` lists `T` in its
+groups. When types `T1…Tn` are mutually co-allocatable, every device of those
+types additionally lists a shared composite group (e.g., `t1t2`). A type that
+is compatible with no other type lists only `[T]`. The scheduler does not
+parse group names; this convention is purely for readability, and any opaque
+strings that satisfy the symmetry requirement work.
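The co-allocation rule can be sketched as a small pairwise predicate. The sketch below is illustrative only; the function name `coAllocatable` and the bare `[]string` representation are assumptions for readability, not the scheduler's actual types.

```go
package main

import "fmt"

// coAllocatable sketches the pairwise rule: nil and empty group lists are
// treated identically, two devices with no declared groups are compatible,
// a device that declares groups is never co-allocatable with one that does
// not, and otherwise the two lists must share at least one group name.
func coAllocatable(a, b []string) bool {
	if len(a) == 0 && len(b) == 0 {
		return true // both omit the field (or set it empty)
	}
	if len(a) == 0 || len(b) == 0 {
		return false // declared vs. omitted: no shared group is possible
	}
	set := make(map[string]bool, len(a))
	for _, g := range a {
		set[g] = true
	}
	for _, g := range b {
		if set[g] {
			return true // at least one shared group
		}
	}
	return false
}

func main() {
	mig := []string{"mig"}
	foo := []string{"foo", "foobar"}
	bar := []string{"bar", "foobar"}

	fmt.Println(coAllocatable(mig, foo)) // false: no shared group
	fmt.Println(coAllocatable(foo, bar)) // true: both list "foobar"
	fmt.Println(coAllocatable(nil, nil)) // true: both omit the field
	fmt.Println(coAllocatable(mig, nil)) // false: declared vs. omitted
}
```

Note that the predicate is symmetric but not transitive: `[foo, foobar]` matches `[bar, foobar]`, and `[bar, foobar]` matches `[bar]`, yet `[foo, foobar]` does not match `[bar]`.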
+ +Example showing MIG and FOO partitions on the same physical GPU: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-1-cs + counters: + multiprocessors: + value: "152" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-1-mig1 + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "2" + - name: gpu-1-foo-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - foobar + counters: + multiprocessors: + value: "17" + - name: gpu-1-bar-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - bar + - foobar + counters: + multiprocessors: + value: "17" +``` + +- `gpu-1-mig1` (groups: `mig`) and `gpu-1-foo-part` (groups: `foo`, `foobar`) +share no compatibility group, so they cannot be co-allocated on the same +counter set. +- `gpu-1-foo-part` (groups: `foo`, `foobar`) and `gpu-1-bar-part` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated on the +same counter set. + +### Examples + +The following examples demonstrate the problem and the proposed solution using +a GPU that supports two mutually exclusive partitioning schemes: MIG (hardware +partitioning into isolated instances) and MPS (software-level time-sharing). + +#### Example 1: What the existing API enables + +The DRA Partitionable Devices API uses shared counter sets to track the +capacity of a physical device across its virtual partitions. When all virtual +devices on a GPU use the same partitioning scheme, the counter capacity check +is sufficient to ensure correct allocation. 
+ +ResourceSlices — a single GPU advertising three MIG 1g partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-2 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" +``` + +ResourceClaims — two pods each requesting a MIG 1g partition: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +``` + +The scheduler allocates `gpu-0-mig-1g-0` to pod-a and `gpu-0-mig-1g-1` to +pod-b. Both consume from `gpu-0-counters` (20 + 20 = 40 <= 100). Both pods +start successfully because both devices use the same MIG partitioning mode. 
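The admission logic in this example reduces to plain counter arithmetic. The following is a minimal sketch under assumed names (`counterSet`, `fits`); the real implementation operates on `resource.Quantity` values rather than integers.

```go
package main

import "fmt"

// counterSet sketches the shared-counter capacity check from Example 1,
// using plain integers instead of the API's resource.Quantity type.
type counterSet struct {
	capacity  map[string]int64 // e.g. {"multiprocessors": 100}
	allocated map[string]int64 // running totals for admitted devices
}

// fits reports whether a device's counter consumption still fits within
// the remaining capacity, and records the consumption if it does.
func (cs *counterSet) fits(consumes map[string]int64) bool {
	for name, v := range consumes {
		if cs.allocated[name]+v > cs.capacity[name] {
			return false
		}
	}
	for name, v := range consumes {
		cs.allocated[name] += v
	}
	return true
}

func main() {
	gpu0 := &counterSet{
		capacity:  map[string]int64{"multiprocessors": 100},
		allocated: map[string]int64{},
	}
	mig1g := map[string]int64{"multiprocessors": 20}

	fmt.Println(gpu0.fits(mig1g))                  // true: 20 <= 100
	fmt.Println(gpu0.fits(mig1g))                  // true: 40 <= 100
	fmt.Println(gpu0.allocated["multiprocessors"]) // 40
}
```

With both MIG 1g devices admitted (20 + 20 = 40 <= 100), a third would also fit; the capacity check alone suffices here precisely because every device uses the same partitioning scheme.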
+ +#### Example 2: How the existing API does not solve the problem + +When a driver advertises devices from multiple mutually exclusive partitioning +schemes on the same GPU, all sharing the same counter set, the current API has +no way to express that these schemes cannot coexist. + +ResourceSlices — the same GPU now advertising both MIG and MPS devices: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — pod-a requests a MIG partition, pod-b requests an MPS share: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: 
pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + +The scheduler sees `gpu-0-mig-1g-0` (20 SMs) and `gpu-0-mps-0` (50 SMs). +Total: 70 <= 100 — the counter capacity check passes. The scheduler allocates +both. But at preparation time, the driver fails because MIG and MPS cannot be +active simultaneously on the same physical GPU. Pod-b gets a cryptic +preparation error. The scheduler may retry the same incompatible combination +repeatedly, causing resource thrashing. + +#### Example 3: How the proposed API solves the problem + +With `compatibilityGroups`, the driver declares that MIG devices belong to the +`"mig"` group and MPS devices belong to the `"mps"` group. The scheduler +enforces that devices sharing a counter set must share at least one +compatibility group. + +ResourceSlices — same devices, now with compatibility groups: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: 
"mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — identical to Example 2: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + +The scheduler allocates `gpu-0-mig-1g-0` (group: `mig`) to pod-a. When +evaluating `gpu-0-mps-0` (group: `mps`) for pod-b, it checks +compatibility: both devices consume from `gpu-0-counters`, but they share no +compatibility group (`mig` vs `mps`). The scheduler rejects the allocation and +pod-b becomes Unschedulable with event: "claim violates device compatibility +constraints". No cryptic preparation failure, no resource thrashing. + +Two MIG devices (both group: `mig`) or two MPS devices (both group: `mps`) can +still be co-allocated, since they share a group. Each device lists only its +own type because MIG and MPS are not compatible with each other; if they +were, they would also share a composite group like `migmps`. + +#### Example 4: Multiple compatible groups with an incompatible group + +A device may support more than two partitioning schemes, some of which can +coexist. In this example, a device advertises three partition types: `foo`, +`bar`, and `baz`. 
`foo` and `bar` can coexist on the same device, but `baz` +is incompatible with both. + +By convention, each device's `compatibilityGroups` is a composite of the +types it can be co-allocated with: each device lists its own type, plus a +shared composite group for every set of types it is compatible with. So +`foo` devices list `[foo, foobar]`, `bar` devices list `[bar, foobar]`, and +`baz` — compatible with no other type — lists `[baz]`. + +This example is written generically — the counter name `units` stands in for +any hardware-specific resource (SMs, bandwidth, slots). + +ResourceSlices — a device advertising foo, bar, and baz partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-counters +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: device-0-counters + counters: + units: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-devices +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # foo partitions + - name: device-0-foo-0 + attributes: + type: + string: "foo" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - foo + - foobar + counters: + units: + value: "25" + # bar partitions + - name: device-0-bar-0 + attributes: + type: + string: "bar" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - bar + - foobar + counters: + units: + value: "25" + # baz partitions + - name: device-0-baz-0 + attributes: + type: + string: "baz" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - baz + counters: + units: + value: "50" +``` + +`device-0-foo-0` (groups: `foo`, `foobar`) and `device-0-bar-0` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated. 
+`device-0-baz-0` (groups: `baz`) shares no group with either, so it cannot be
+co-allocated with them.
+
+For instance, if pod-a is allocated `device-0-foo-0`, a subsequent pod
+requesting `device-0-bar-0` succeeds (both share `foobar`), but a pod
+requesting `device-0-baz-0` is rejected (`foo`/`foobar` vs `baz`: no shared
+group).
+
+### Scheduler Changes
+
+The DRA scheduler plugin is enhanced to:
+
+1. Maintain a cache of allocated devices per node, including their compatibility
+   fields (`compatibilityGroups` values).
+2. For each candidate device during allocation, evaluate the pairwise
+   compatibility predicate against every device currently allocated on the
+   same counter set. Because the relation is symmetric, a single
+   group-intersection check per pair suffices.
+3. Remove candidate devices from consideration if they violate compatibility
+   constraints.
+4. Emit clear scheduling events when a device is rejected due to compatibility.
+
+**Complexity.** Let *M* be the number of devices already allocated on a
+counter set, *N* the number of candidates under consideration for that
+counter set, and *G* the maximum number of groups declared per counter-set
+consumption entry. The additional filter cost per scheduling cycle is
+O(*N* · *M* · *G*) for pairwise group-intersection checks, with typical *G*
+≤ 4 (hardware partition modes per device are small in practice). The
+existing DRA allocation loop already iterates over candidates per counter
+set, so the new work is a constant-factor-per-candidate addition rather than
+a new outer loop.
+
+### Interaction with Multi-Request Claims and Device Constraints
+
+**Multiple requests within one claim.** The compatibility predicate is
+evaluated pairwise between every device already allocated *on the same counter
+set* and each candidate, regardless of whether the allocated device belongs to
+the same claim, a different claim on the same pod, or a different pod entirely.
+Two devices within a single `ResourceClaim` that land on the same counter set +are therefore subject to the same pairwise check: the second request sees the +first as already-allocated state. + +**Allocation order.** The scheduler does not reorder requests within a claim +to improve feasibility. If requests are ordered such that an early compatible +pick later blocks a mandatory pick, the claim becomes Unschedulable and +standard retry behavior applies. This matches how existing DRA constraints +behave. + +**Composition with `DeviceConstraints`.** `compatibilityGroups` is a +driver-authored, ResourceSlice-side constraint. `DeviceConstraints` (e.g., +`matchAttribute`) is a user-authored, ResourceClaim-side constraint. The two +are evaluated independently and both must pass for a candidate to be +allocated. A claim can never *relax* a driver-declared compatibility group, +and a driver can never *force* a claim-side `matchAttribute`. They compose by +conjunction. + +### Driver Responsibilities + +Resource drivers are responsible for: + +1. Populating `compatibilityGroups` for all devices with compatibility requirements. +2. Ensuring compatibility rules are symmetric and consistent across all devices + in a ResourceSlice. +3. Documenting their compatibility matrix. +4. Continuing to validate at resource preparation time for version-skew safety. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None. The DRA scheduler plugin and `ResourceSlice` validation already have +unit and integration coverage; new tests are additive. + +##### Unit tests + +- `k8s.io/dynamic-resource-allocation/structured`: pairwise group-intersection + predicate (empty, nil, single, multiple groups; nil-vs-nil, nil-vs-set, + set-vs-set; `[]` treated as nil). 
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: filter + behavior with mixed compatible and incompatible candidates on the same + counter set; no-op behavior when the feature gate is disabled; no-op + behavior when no device in the slice declares the field. +- `k8s.io/kubernetes/pkg/apis/resource/validation`: field validation — + accepted shapes, max group-name length, max groups per counter consumption. + +##### Integration tests + +- Feature gate enablement/disablement round-trip: field is persisted when + enabled, dropped on write when disabled. +- Scheduler rejects a claim when the only remaining candidate on a node + belongs to an incompatible group; admits it when a compatible candidate + exists on another node. +- Upgrade → downgrade → upgrade: allocations made during the "upgrade" phase + remain valid after downgrade; re-enabling enforcement does not re-evaluate + existing allocations. + +##### e2e tests + +- Fake DRA driver advertising two mutually exclusive groups (`mig`, `mps`) on + a single counter set. Scheduling a `mig` pod followed by an `mps` pod on + the same node leaves the second pod Unschedulable with the documented + event; reversing the order reproduces the behavior symmetrically. +- Same driver with compatible groups (`foo`, `bar`) — both pods schedule. +- Feature-gate-off baseline: the second pod reaches preparation and the + driver rejects it (pre-KEP behavior preserved). + +### Graduation Criteria +#### Alpha +- API defined and implemented +- All relevant code is merged and placed behind a feature flag +- Unit and integration tests +- Driver-author documentation published under `kubernetes/website` (DRA + drivers section), including the strict nil-matching rule and a worked + MIG/MPS example. 
+
+#### Beta
+- E2E tests passing in CI
+- Validated with at least one production DRA driver (out-of-tree testing)
+
+#### GA
+- At least 2 releases as beta
+
+### Upgrade / Downgrade Strategy
+#### Upgrade
+Upon upgrading, no `ResourceSlice` leverages the new optional field yet, so the current behavior remains unchanged.
+
+#### Downgrade
+When downgrading to a version that does not implement this enhancement, older schedulers and API servers do not know about the added optional field and revert to their pre-enhancement behavior.
+
+Allocated devices that leveraged this new field will remain allocated, and future allocations will not take `compatibilityGroups` into consideration.
+
+
+### Version Skew Strategy
+
+The feature introduces a new optional field on `ResourceSlice` and new
+enforcement logic in the scheduler. Skew behaviors to consider:
+
+**New kube-apiserver + old kube-scheduler.** The apiserver accepts and persists
+`compatibilityGroups`. An old scheduler ignores the field and may allocate
+incompatible devices. This degrades to the pre-KEP behavior: the DRA driver
+rejects the allocation at resource preparation time. Drivers MUST continue to
+validate at preparation time for this reason (see Driver Responsibilities).
+
+**Old kube-apiserver + new kube-scheduler.** The old apiserver drops the unknown
+field on writes. ResourceSlices in etcd therefore do not carry
+`compatibilityGroups`, and the new scheduler sees only nil values, producing the
+pre-KEP behavior. No incorrect allocations result.
+
+**Mixed-version HA kube-scheduler.** If one replica enforces the field and
+another does not, the enforcing replica may reject allocations the
+non-enforcing replica would accept. Both outcomes are safe (either the
+scheduler correctly rejects, or the driver rejects at preparation time).
+Resolution is to complete the scheduler rollout.
+ +**Downgrade with in-flight allocations.** Devices already allocated under the +new rules remain allocated across a downgrade; the post-downgrade scheduler +will not consider `compatibilityGroups` for future allocations, reverting to +pre-KEP behavior. No existing allocations are invalidated. + +**Feature gate off on one component.** If `DRADeviceCompatibilityGroups` is +enabled on kube-apiserver but disabled on kube-scheduler (or vice versa), +behavior matches the corresponding skew row above — apiserver stores the field +but scheduler ignores it, or scheduler enforces on field values that the +apiserver may drop on writes. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- Feature gate + - Feature gate name: DRADeviceCompatibilityGroups + - Components depending on the feature gate: kube-scheduler, kube-apiserver +- Gate behavior per component: + - **kube-apiserver**: when disabled, strips `compatibilityGroups` on writes + and hides it on reads of `ResourceSlice`. Prevents drivers from persisting + values that cannot be enforced. + - **kube-scheduler**: when disabled, ignores the field on read and does not + perform the pairwise intersection check during filtering. +- No control-plane downtime is required to toggle the gate. +- No node downtime or reprovisioning is required. + +###### Does enabling the feature change any default behavior? +No, this KEP proposes an additional optional field to the `ResourceSlice` API + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes, rolling back the enablement will revert the cluster to its pre-enablement behavior + +###### What happens if we reenable the feature if it was previously rolled back? +Existing `compatibilityGroup` configurations in `ResourceSlice`s will become effective again + +###### Are there any tests for feature enablement/disablement? 
+Yes, integration tests will verify feature enablement and disablement (see
+Test Plan → Integration tests).
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+Rollout risk is limited to the two components touched by the feature gate
+(kube-apiserver field handling and kube-scheduler filter logic).
+Already-running workloads are not affected: compatibility filtering only runs
+during scheduling of *new* allocations, so disabling the gate or rolling back
+binaries does not disturb existing pod/device bindings.
+
+###### What specific metrics should inform a rollback?
+A new scheduler metric
+`scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}`
+counts claim filter rejections caused by compatibility constraints. A rollback
+is warranted if this metric spikes unexpectedly after a driver update (likely
+an incorrect compatibility matrix — see Risks → Incorrect driver declarations).
+Operators should also watch `scheduler_unschedulable_pods` correlated with
+events matching `Insufficient compatible DRA devices`.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+Upgrade → downgrade → upgrade will be covered by the integration test
+described in Test Plan → Integration tests. At alpha, manual verification on a
+kind cluster with the feature gate flipped is acceptable; CI coverage is a
+Beta graduation criterion.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+No
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+This feature is not intended for direct use by workloads; it is intended for
+DRA drivers. An operator can determine whether it is in use by checking for
+`ResourceSlice` objects that set `compatibilityGroups`.
+
+###### How can someone using this feature know that it is working for their instance?
+ +- Events + - Scheduling events: + - When a candidate device is filtered out because its compatibility groups + do not intersect those of an already-allocated device on the same + counter set, the scheduler logs a per-node filter reason of the form: + ``` + device gpu-0-mps-0 (groups [mps]) incompatible with allocated device gpu-0-mig-1g-0 (groups [mig]) on counterSet gpu-0-counters + ``` + - If no node has any allocatable candidate, the standard scheduler + "0/N nodes are available" event aggregates this reason across nodes + (e.g., `4 Insufficient compatible DRA devices`). +- Pod.status + - Condition name: Unschedulable + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? +N/A + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +Operators can use `scheduler_dra_compatibility_rejections_total` together with +`scheduler_unschedulable_pods` and the standard DRA scheduler plugin latency +metrics to determine whether compatibility filtering is contributing to +scheduling failures or latency regressions. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? +No — `scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}` +(introduced by this KEP; see Rollout → What specific metrics should inform a +rollback) covers the primary observability need. Additional breakdowns can be +added post-alpha if field feedback justifies them. + +### Dependencies +DRA Partitionable Devices enabled + +###### Does this feature depend on any specific services running in the cluster? +No + +### Scalability + +###### Will enabling / using this feature result in any new API calls? +No + +###### Will enabling / using this feature result in introducing new API types? +No, only a new API field + +###### Will enabling / using this feature result in any new calls to the cloud provider? 
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+Yes, this adds one optional field to the `ResourceSlice` API; the per-object
+size increase is bounded by the validation limits on group count and
+group-name length.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+Scheduling cycles that involve DRA devices incur an additional
+O(*N* · *M* · *G*) group-intersection check per counter set (see Design
+Details → Scheduler Changes), where *M* is devices already allocated on that
+counter set, *N* is candidates considered, and *G* is groups per device. For
+realistic values (*M* ≤ 16, *N* ≤ 64, *G* ≤ 4) the added work is in the low
+thousands of string comparisons per counter set per cycle and is not expected
+to be measurable against existing DRA scheduling cost. Benchmarks will be run
+during alpha and reported in the Implementation History.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+No
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+No
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+No new side effects
+
+###### What are other known failure modes?
+N/A
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+TBD
+
+## Implementation History
+
+- 2026-03-17: KEP opened, status `provisional`.
+- 2026-04-18: KEP under review; API shape and default semantics settled.
+
+## Drawbacks
+
+Adding compatibility constraint support to the scheduler increases the
+complexity of the DRA scheduling logic. The new field must be evaluated for
+every device candidate during every scheduling cycle that involves DRA
+resources, which adds latency and memory overhead.
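+
+To make that per-cycle evaluation concrete, the per-counter-set filter loop
+whose cost is estimated under Scalability above can be sketched as follows
+(illustrative Go, not the actual `dynamicresources` plugin code; the function
+names are invented for this sketch):
+
+```go
+package main
+
+import "fmt"
+
+// intersects reports whether two group lists share at least one name
+// (up to G×G string comparisons for lists of length G).
+func intersects(a, b []string) bool {
+	for _, x := range a {
+		for _, y := range b {
+			if x == y {
+				return true
+			}
+		}
+	}
+	return false
+}
+
+// filterCompatible keeps only candidates whose groups intersect those of
+// every device already allocated on the counter set: N candidates × M
+// allocated devices × per-pair group comparisons, the cost estimated above.
+func filterCompatible(candidates, allocated [][]string) [][]string {
+	var kept [][]string
+	for _, cand := range candidates {
+		compatible := true
+		for _, alloc := range allocated {
+			if !intersects(cand, alloc) {
+				compatible = false
+				break
+			}
+		}
+		if compatible {
+			kept = append(kept, cand)
+		}
+	}
+	return kept
+}
+
+func main() {
+	allocated := [][]string{{"mig"}}
+	candidates := [][]string{{"mig"}, {"mps"}}
+	fmt.Println(filterCompatible(candidates, allocated)) // [[mig]]
+}
+```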
+ +## Alternatives + +### Current Workaround: Driver-level Preparation Failure + +The existing workaround is for DRA drivers to fail resource preparation when +incompatible allocations are attempted. This approach is insufficient because: + +- It detects incompatibilities only after scheduling has committed to the +allocation, leading to pod startup failures. +- It provides no mechanism to inform the scheduler so it can try other nodes +or device combinations. +- It results in resource thrashing as the scheduler retries the same failing +combination. + +### Inverted naming: `mutualExclusionGroups` + +An alternative API would invert the semantics: instead of declaring which +groups a device *belongs to* (co-allocation predicate), declare which groups +a device is *incompatible with* (exclusion predicate). Two devices would then +be co-allocatable if and only if the intersection of their exclusion sets and +their own group memberships is empty. + +The inverted model is arguably more intuitive for the motivating case — a MIG +device "excludes MPS," full stop — and does not require drivers to list each +peer group in their own entry (as Example 4 does, where `foo` devices must +include `bar` in their group list). It was rejected because: + +- The co-allocation framing composes naturally with the existing DRA model, + where counter-set membership already expresses "can share resources." A + group is a finer-grained membership within the same model. +- Exclusion semantics require two fields to express the same information (the + groups you *are* in, and the groups you *exclude*), or a global registry of + group names. Membership-only is simpler. +- Symmetry is easier to validate: a driver that forgets to include `foo` in a + `bar` device's groups produces a diagnosable allocation failure, rather + than silent incorrect behavior under exclusion semantics. 
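+
+To make the comparison concrete, the two predicates can be sketched side by
+side (illustrative Go; the `device` struct and function names are invented
+for this sketch, and the exclusion form needs two lists per device, as noted
+above):
+
+```go
+package main
+
+import "fmt"
+
+// intersects reports whether two group-name lists share a member.
+func intersects(a, b []string) bool {
+	for _, x := range a {
+		for _, y := range b {
+			if x == y {
+				return true
+			}
+		}
+	}
+	return false
+}
+
+// Membership model (this KEP): one list per device; devices co-allocate
+// iff their group lists intersect.
+func membershipOK(a, b []string) bool {
+	return intersects(a, b)
+}
+
+// Exclusion model (rejected): each device carries memberships and
+// exclusions; co-allocation requires that neither side excludes the other.
+type device struct{ groups, excludes []string }
+
+func exclusionOK(a, b device) bool {
+	return !intersects(a.excludes, b.groups) && !intersects(b.excludes, a.groups)
+}
+
+func main() {
+	// The same MIG/MPS incompatibility expressed in both models.
+	fmt.Println(membershipOK([]string{"mig"}, []string{"mps"})) // false
+	mig := device{groups: []string{"mig"}, excludes: []string{"mps"}}
+	mps := device{groups: []string{"mps"}, excludes: []string{"mig"}}
+	fmt.Println(exclusionOK(mig, mps)) // false
+}
+```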
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-node/5963-device-compatibility-groups/kep.yaml b/keps/sig-node/5963-device-compatibility-groups/kep.yaml
new file mode 100644
index 000000000000..4ccbeae4913b
--- /dev/null
+++ b/keps/sig-node/5963-device-compatibility-groups/kep.yaml
@@ -0,0 +1,40 @@
+title: DRA Device Compatibility Groups
+kep-number: 5963
+authors:
+  - "@omeryahud"
+owning-sig: sig-node
+participating-sigs:
+  - sig-scheduling
+status: provisional
+creation-date: 2026-03-17
+reviewers:
+  - TBD
+approvers:
+  - TBD
+
+# The target maturity stage in the current dev cycle for this KEP.
+# If the purpose of this KEP is to deprecate a user-visible feature
+# and a Deprecated feature gate is added, it should be deprecated|disabled|removed.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: v1.37
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: v1.37
+  beta: v1.38
+  stable: v1.39
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: DRADeviceCompatibilityGroups
+    components:
+      - kube-scheduler
+      - kube-apiserver
+disable-supported: true
+
+metrics: []