Commit a31625a: device-compatibility-groups KEP

Signed-off-by: Omer Yahud <oyahud@nvidia.com>

# KEP-5963: DRA Device Compatibility Groups

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - e2e Tests for all Beta API Operations (endpoints)
  - (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
  - (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
## Summary

This KEP proposes an extension to the Dynamic Resource Allocation (DRA) API to
support mutually exclusive device allocation constraints. Hardware devices often
support multiple partitioning or virtualization schemes (for example, GPU MIG
slicing vs. MPS sharing) that provide different trade-offs in terms of isolation,
performance, and resource sharing. These schemes are frequently mutually exclusive
at the hardware level: once a physical device is partitioned or configured using
one scheme, it cannot be reconfigured to use a different scheme until all existing
allocations are released.

The current DRA Partitionable Devices API has no mechanism for drivers to express
these mutual exclusivity constraints. Without it, incompatible allocations are only
detected during resource preparation, after the scheduler has already made its
decisions, leading to pod startup failures and resource thrashing. This KEP
introduces API and scheduler changes so that compatibility constraints can be
declared in ResourceSlice objects and enforced at scheduling time.
## Motivation

Hardware devices often support multiple partitioning or virtualization schemes
that are mutually exclusive at the hardware level. For example, an NVIDIA GPU
can be configured for MIG (Multi-Instance GPU) slicing or MPS (Multi-Process
Service) sharing, but not both simultaneously on the same physical device.

Without a mechanism to express these constraints in DRA, the following problems
arise:

1. **Late Failure Detection**: Incompatible allocations are only detected during
   resource preparation, after scheduling decisions have already been made.
2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices,
   leading to pod startup failures.
3. **Poor User Experience**: Users receive cryptic preparation failures instead
   of clear scheduling feedback.
4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible
   allocations before giving up.

The current workaround—having DRA drivers fail resource preparation when
incompatible allocations are attempted—is insufficient because it provides no
mechanism to inform the scheduler, and does not prevent repeated failed attempts.
### Goals

- Allow DRA drivers to specify compatibility between virtual devices within a
  single physical device.
- Allow the scheduler to make informed allocation decisions that respect
  compatibility rules declared in ResourceSlice objects.
- Provide a generic mechanism applicable to any hardware with partitioning
  constraints, not just GPUs.
- Maintain backward compatibility with existing ResourceSlice specifications.

### Non-Goals

- Allow DRA drivers to specify compatibility between physical or virtual devices
  across different physical devices or different device classes. The scope of
  compatibility constraints is limited to virtual devices sharing the same
  underlying physical device.
## Proposal

**CompatibilityGroups Assignment**

Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare
which named groups they belong to. For two devices consuming counters from the
same counter set to be co-allocated, they must share at least one compatibility
group. Devices without this field are considered compatible with all groups.
This approach is simple and keeps the API surface minimal.
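The co-allocation rule reduces to a set-intersection check. A minimal sketch in Go (illustrative only; `sharesGroup` and its signature are not part of any proposed API):

```go
package main

import "fmt"

// sharesGroup reports whether two devices drawing from the same counter set
// may be co-allocated under the proposed rule: they must share at least one
// compatibility group, and a device that declares no groups is treated as
// compatible with every group.
func sharesGroup(a, b []string) bool {
	if len(a) == 0 || len(b) == 0 {
		return true // omitted field: compatible with all groups
	}
	set := make(map[string]struct{}, len(a))
	for _, g := range a {
		set[g] = struct{}{}
	}
	for _, g := range b {
		if _, ok := set[g]; ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(sharesGroup([]string{"mig"}, []string{"foo", "bar"})) // false
	fmt.Println(sharesGroup([]string{"foo", "bar"}, []string{"bar"})) // true
	fmt.Println(sharesGroup(nil, []string{"mig"}))                    // true
}
```

Because set intersection is commutative, the same call answers the question in both directions, which is what makes the constraint bidirectional without extra bookkeeping.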
### User Stories

#### Story 1

As a GPU operator using NVIDIA GPUs, I want to express in my ResourceSlice
that MIG-partitioned virtual devices and MPS-sharing virtual devices on the
same physical GPU are mutually exclusive. When a pod requesting a MIG partition
is already running on a GPU, I want the scheduler to automatically exclude all
MPS devices on that same GPU from consideration for new allocations, rather than
allowing an allocation that will fail at device preparation time.

#### Story 2

As a hardware vendor publishing DRA drivers for an accelerator that supports
multiple exclusive operating modes (for example, exclusive mode, software
partitioning, and hardware partitioning), I want to declare the compatibility
constraints directly in my ResourceSlice, so that the Kubernetes scheduler
can enforce those constraints without requiring my driver to fail pod startup
with cryptic error messages.
### Notes/Constraints/Caveats

The compatibility constraint is bidirectional: if device A's groups exclude
device B, then allocating A must prevent B from being allocated, and vice
versa. Because the constraint is a set-intersection check on group names, it is
symmetric by construction, and the scheduler enforces it in both directions.
### Risks and Mitigations

**Scheduler performance impact**: Evaluating compatibility constraints during
device selection adds work to each scheduling cycle that involves DRA devices.
Mitigation: the check is a group-name intersection evaluated against a per-node
cache of allocated devices, and devices that declare no compatibility groups
require no check.

**Older schedulers ignoring new field**: A kube-scheduler that does not
understand `compatibilityGroups` will ignore the field and may allocate
incompatible devices. This degrades to the current behavior (the driver fails
at preparation time). Mitigation: document the version skew behavior clearly;
drivers must still validate at preparation time even when the scheduler
enforces constraints.

**Incorrect driver declarations**: If a driver declares incorrect compatibility
constraints, the scheduler may either reject valid allocations or permit invalid
ones. Mitigation: the API is driver-authored and opt-in; drivers are responsible
for the correctness and documentation of their compatibility matrix.
## Design Details

### API

#### CompatibilityGroups Assignment

A new field `compatibilityGroups` is added inside each entry of
`device.consumesCounters[]`. It contains a list of string group names.
For two devices consuming counters from the same counter set to be allocated
together, they must share at least one group name. Devices that omit this
field are considered compatible with all groups.

Example showing MIG and FOO partitions on the same physical GPU:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
spec:
  sharedCounters:
  - name: gpu-1-cs
    counters:
      multiprocessors:
        value: "152"
  devices:
  - name: gpu-1-mig1
    consumesCounters:
    - counterSet: gpu-1-cs
      compatibilityGroups:
      - mig
      counters:
        multiprocessors:
          value: "2"
  - name: gpu-1-foo-part
    consumesCounters:
    - counterSet: gpu-1-cs
      compatibilityGroups:
      - foo
      - bar
      counters:
        multiprocessors:
          value: "17"
  - name: gpu-1-bar-part
    consumesCounters:
    - counterSet: gpu-1-cs
      compatibilityGroups:
      - foo
      - bar
      counters:
        multiprocessors:
          value: "17"
```

- `gpu-1-mig1` and `gpu-1-foo-part` share no compatibility group (`mig` vs
  `foo`/`bar`), so they cannot be co-allocated on the same counter set.
- `gpu-1-foo-part` and `gpu-1-bar-part` share compatibility groups (`foo`, `bar`),
  so they can be co-allocated on the same counter set.
### Scheduler Changes

The DRA scheduler plugin is enhanced to:

1. Maintain a cache of allocated devices per node, including their
   `compatibilityGroups` values.
2. For each candidate device during allocation, evaluate whether it is compatible
   with all currently allocated devices on the node, and whether all allocated
   devices are compatible with it (bidirectional check).
3. Remove candidate devices from consideration if they violate compatibility
   constraints.
4. Emit clear scheduling events when a device is rejected due to compatibility.
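Steps 2-3 amount to filtering each candidate against the node's allocated-device cache. A minimal sketch (the `Device` type and function names are illustrative, not the real scheduler plugin API or resource.k8s.io types):

```go
package main

import "fmt"

// Device is an illustrative stand-in for an allocated or candidate device.
type Device struct {
	Name       string
	CounterSet string
	Groups     []string // empty = compatible with all groups
}

// intersects is the symmetric group check; because set intersection is
// commutative, checking candidate-vs-allocated once covers both directions.
func intersects(a, b []string) bool {
	if len(a) == 0 || len(b) == 0 {
		return true
	}
	seen := map[string]bool{}
	for _, g := range a {
		seen[g] = true
	}
	for _, g := range b {
		if seen[g] {
			return true
		}
	}
	return false
}

// filterCandidates drops candidates that violate compatibility with any
// device already allocated from the same counter set (step 3 above).
func filterCandidates(allocated, candidates []Device) []Device {
	var fit []Device
	for _, c := range candidates {
		ok := true
		for _, a := range allocated {
			if a.CounterSet == c.CounterSet && !intersects(a.Groups, c.Groups) {
				ok = false // would be surfaced via a scheduling event (step 4)
				break
			}
		}
		if ok {
			fit = append(fit, c)
		}
	}
	return fit
}

func main() {
	allocated := []Device{{Name: "gpu-1-mig1", CounterSet: "gpu-1-cs", Groups: []string{"mig"}}}
	candidates := []Device{
		{Name: "gpu-1-mig2", CounterSet: "gpu-1-cs", Groups: []string{"mig"}},
		{Name: "gpu-1-foo-part", CounterSet: "gpu-1-cs", Groups: []string{"foo", "bar"}},
	}
	for _, d := range filterCandidates(allocated, candidates) {
		fmt.Println(d.Name) // prints "gpu-1-mig2"
	}
}
```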
### Driver Responsibilities

Resource drivers are responsible for:

1. Populating `compatibilityGroups` for all devices with compatibility requirements.
2. Ensuring compatibility rules are symmetric and consistent across all devices
   in a ResourceSlice.
3. Documenting their compatibility matrix.
4. Continuing to validate at resource preparation time for version-skew safety.
### Test Plan

[X] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes
necessary to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- TBD

##### Integration tests

- TBD

##### e2e tests

- TBD
### Graduation Criteria

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback
###### How can this feature be enabled / disabled in a live cluster?

- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: DRADeviceCompatibilityGroups
  - Components depending on the feature gate: kube-scheduler
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning
    of a node?
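Assuming the standard feature-gate mechanism, enabling the gate amounts to passing the usual `--feature-gates` flag to kube-scheduler (flag shown in isolation; the remaining scheduler configuration is unchanged):

```
kube-scheduler --feature-gates=DRADeviceCompatibilityGroups=true
```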
###### Does enabling the feature change any default behavior?

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

###### What happens if we reenable the feature if it was previously rolled back?

###### Are there any tests for feature enablement/disablement?
### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

###### What specific metrics should inform a rollback?

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

###### How can someone using this feature know that it is working for their instance?

- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?
### Dependencies

###### Does this feature depend on any specific services running in the cluster?

### Scalability

###### Will enabling / using this feature result in any new API calls?

###### Will enabling / using this feature result in introducing new API types?

###### Will enabling / using this feature result in any new calls to the cloud provider?

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

###### What are other known failure modes?

###### What steps should be taken if SLOs are not being met to determine the problem?
## Implementation History

## Drawbacks

Adding compatibility constraint support to the scheduler increases the
complexity of the DRA scheduling logic. The new field must be evaluated for
every device candidate during every scheduling cycle that involves DRA
resources, which adds latency and memory overhead.

## Alternatives

### Current Workaround: Driver-level Preparation Failure

The existing workaround is for DRA drivers to fail resource preparation when
incompatible allocations are attempted. This approach is insufficient because:

- It detects incompatibilities only after scheduling has committed to the
  allocation, leading to pod startup failures.
- It provides no mechanism to inform the scheduler so it can try other nodes
  or device combinations.
- It results in resource thrashing as the scheduler retries the same failing
  combination.

## Infrastructure Needed (Optional)
