
DRA pod reconciler and node labeling#1289

Merged
kubernetes-prow[bot] merged 1 commit into
kubernetes-sigs:mainfrom
TomerNewman:MGMT-24473-dra-pod-reconciler
Jun 22, 2026

@TomerNewman TomerNewman commented Jun 17, 2026

Collaborator

Add DRA node labeling via unified PodNodeLabel reconciler

Summary

Merge DevicePluginPodReconciler and the new DRA pod labeling logic into a single PodNodeLabelReconciler. The reconciler watches DaemonSet pods with a ModuleNameLabel, determines the role from the DaemonSetRole label, and manages the corresponding node label (device-plugin-ready or dra-ready) based on pod readiness. This provides a node-level readiness signal indicating which nodes have a functioning device-plugin or DRA driver for a Module.

Changes

Unified controller:

  • PodNodeLabelReconciler in internal/controllers/pod_node_label_reconciler.go — replaces DevicePluginPodReconciler
  • Single registration in cmd/manager/main.go
  • Determines label name based on DaemonSetRole label: dra → dra-ready, otherwise → device-plugin-ready
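
The role-to-label mapping in the last bullet can be sketched roughly as below. This is a minimal illustration, not the PR's code: the label prefix and the namespace.module.suffix layout are assumptions, and nodeLabelForRole is a hypothetical helper standing in for the GetDevicePluginNodeLabel / GetDRANodeLabel pair.

```go
package main

import "fmt"

// All values below are assumptions made for illustration; the PR names the
// constants but this page does not show their values.
const (
	labelPrefix             = "kmm.node.kubernetes.io" // assumed prefix
	draRoleLabelValue       = "dra"                    // stands in for constants.DRARoleLabelValue
	draReadySuffix          = "dra-ready"              // stands in for constants.DRAReadySuffix
	devicePluginReadySuffix = "device-plugin-ready"
)

// nodeLabelForRole is a hypothetical helper: it returns the dra-ready label
// for the "dra" role and falls back to device-plugin-ready for anything else,
// including pods that carry no role label at all.
func nodeLabelForRole(role, namespace, moduleName string) string {
	suffix := devicePluginReadySuffix
	if role == draRoleLabelValue {
		suffix = draReadySuffix
	}
	return fmt.Sprintf("%s/%s.%s.%s", labelPrefix, namespace, moduleName, suffix)
}

func main() {
	fmt.Println(nodeLabelForRole("dra", "default", "my-module"))
	fmt.Println(nodeLabelForRole("", "default", "my-module"))
}
```

Defaulting to device-plugin-ready keeps pods without a role label on the existing path.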

Label utilities:

  • GetDRANodeLabel in internal/utils/kmmlabels.go — generates the dra-ready node label
  • GetDRANodeLabel exposed in public API via pkg/labels/labels.go

Constants:

  • DRARoleLabelValue and DRAReadySuffix in internal/constants/constants.go

Testing

  • Unit tests: Tests organized in two Context blocks (device-plugin pods, DRA pods) covering label add/remove, finalizer, replacement check, error paths, empty-nodeName, and DRA exclusion
  • Live cluster validation: Deployed to minikube — verified labeling on ready, unlabeling on delete, switching between roles, and independent operation of both roles

Acceptance Criteria

  • AC-1: PodNodeLabelReconciler watches pods and sets dra-ready or device-plugin-ready label on node when ready
  • AC-2: Label removed when pod not ready or deleting
  • AC-3: Pod event filter requires ModuleNameLabel + DaemonSet owner
  • AC-4: Reconciler uses NodeLabelerFinalizer
  • AC-5: Device-plugin replacement check excludes DRA pods
  • AC-6: Single PodNodeLabelReconciler registered in cmd/manager/main.go
  • AC-7: Unit tests cover both roles — label add/remove, predicate filtering, replacement check, error paths
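
AC-1 through AC-3 can be read as a filter followed by a small decision, sketched below with plain structs instead of the Kubernetes types. The pod struct, the label key, and the passesFilter and decide helpers are all hypothetical; the real reconciler works on corev1.Pod and, per AC-5, runs a replacement check before it actually removes a label.

```go
package main

import "fmt"

// moduleNameLabel stands in for constants.ModuleNameLabel; the key is assumed.
const moduleNameLabel = "kmm.node.kubernetes.io/module.name"

// pod is a simplified, hypothetical stand-in for corev1.Pod.
type pod struct {
	Labels     map[string]string
	OwnerKinds []string // kinds of the pod's owner references
	NodeName   string   // empty until the pod is scheduled
	Ready      bool
	Deleting   bool // true once a deletion timestamp is set
}

// passesFilter mirrors AC-3: the pod must carry the module-name label
// and be owned by a DaemonSet.
func passesFilter(p pod) bool {
	if _, ok := p.Labels[moduleNameLabel]; !ok {
		return false
	}
	for _, kind := range p.OwnerKinds {
		if kind == "DaemonSet" {
			return true
		}
	}
	return false
}

type action string

const (
	actionNone   action = "none"
	actionAdd    action = "add-label"
	actionRemove action = "remove-label"
)

// decide mirrors AC-1 and AC-2. An unscheduled pod has no node to label,
// so nothing happens. Removal is only a candidate here; the replacement
// check is not modeled in this sketch.
func decide(p pod) action {
	if p.NodeName == "" {
		return actionNone
	}
	if p.Ready && !p.Deleting {
		return actionAdd
	}
	return actionRemove
}

func main() {
	p := pod{
		Labels:     map[string]string{moduleNameLabel: "my-module"},
		OwnerKinds: []string{"DaemonSet"},
		NodeName:   "node-1",
		Ready:      true,
	}
	fmt.Println(passesFilter(p), decide(p))
	p.Deleting = true
	fmt.Println(passesFilter(p), decide(p))
}
```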

@k8s-ci-robot

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@netlify

netlify Bot commented Jun 17, 2026


Deploy Preview for kubernetes-sigs-kmm ready!

Name Link
🔨 Latest commit d250135
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kmm/deploys/6a38fea95b6b400008365f72
😎 Deploy Preview https://deploy-preview-1289--kubernetes-sigs-kmm.netlify.app

@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: TomerNewman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 17, 2026
@k8s-ci-robot k8s-ci-robot requested review from ybettan and removed request for mresvanis June 17, 2026 12:13
@codecov-commenter

codecov-commenter commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 73.46939% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.52%. Comparing base (fa23a9b) to head (d250135).
⚠️ Report is 385 commits behind head on main.

Files with missing lines | Patch % | Lines
internal/controllers/pod_node_label_reconciler.go | 74.41% | 6 Missing and 5 partials ⚠️
cmd/manager/main.go | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1289      +/-   ##
==========================================
- Coverage   79.09%   73.52%   -5.58%     
==========================================
  Files          51       67      +16     
  Lines        5109     5004     -105     
==========================================
- Hits         4041     3679     -362     
- Misses        882     1156     +274     
+ Partials      186      169      -17     

@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from 999f13a to d6eaa21 Compare June 17, 2026 14:07
@TomerNewman TomerNewman marked this pull request as ready for review June 17, 2026 14:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from d6eaa21 to 6417d60 Compare June 18, 2026 09:20
Comment thread cmd/manager/main.go Outdated
@ybettan

ybettan commented Jun 21, 2026

Contributor

providing downstream consumers a signal (dra-ready node label) indicating which nodes have a functioning DRA driver for a Module.

Why is d/s related here?

return ctrl.Result{}, fmt.Errorf("pod %s/%s has no %q label", pod.Namespace, pod.Name, constants.ModuleNameLabel)
}

labelName := utils.GetDRANodeLabel(pod.Namespace, moduleName)

Contributor

GetDRANodeLabel is a bit confusing here, since we are constructing the label pattern, not getting it from any object.

Can we rename this function while still staying consistent with how we construct version-ready/ready labels?

Contributor

Also, the labelName variable has too generic a name, IMO.

@TomerNewman TomerNewman Jun 21, 2026

Collaborator Author

This follows the same naming convention as the existing GetDevicePluginNodeLabel, GetKernelModuleReadyNodeLabel, GetDevicePluginTargetNodeLabel, etc. in kmmlabels.go. Renaming just this one would be inconsistent.

Collaborator Author

labelName is the same variable name used in DevicePluginPodReconciler for the same concept. Changing it only here would be inconsistent.

Contributor

labelName is the same variable name used in DevicePluginPodReconciler for the same concept. Changing it only here would be inconsistent.

Seems like this controller doesn't exist anymore. So can we change it only in one place in the converged controller?

)

if !podutils.IsPodReady(pod) || !pod.DeletionTimestamp.IsZero() {
if nodeName != "" {

Contributor

I would add a function here, or something else, to make it clearer that we are targeting pods that haven't been scheduled yet.

Collaborator Author

Done — added the same comment as in DevicePluginPodReconciler.

Comment thread internal/controllers/dra_pod_reconciler.go Outdated
@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from 6417d60 to 740df9d Compare June 21, 2026 08:54
@kubernetes-prow


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: TomerNewman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@TomerNewman

Collaborator Author

providing downstream consumers a signal (dra-ready node label) indicating which nodes have a functioning DRA driver for a Module.

Why is d/s related here?

"Downstream consumers" here doesn't refer to hub-spoke — it means anything that reads node labels to make scheduling or readiness decisions (other controllers, custom schedulers, or user workloads with node selectors). Poor wording on my part — I've updated the PR description to say "provides a node-level readiness signal" instead.

Comment thread internal/constants/constants.go Outdated

WorkerPodVersionLabelPrefix = "beta.kmm.node.kubernetes.io/version-worker-pod"
DevicePluginVersionLabelPrefix = "beta.kmm.node.kubernetes.io/version-device-plugin"
DRAVersionLabelPrefix = "beta.kmm.node.kubernetes.io/version-dra"

Contributor

I don't think this should be in this PR. It is related to ordered upgrade.

Collaborator Author

Done — removed.

@yevgeny-shnaidman

Contributor

General question: dra_pod_reconciler.go is a copy of device_plugin_pod_reconciler.go, except for the label it uses. Since device plugin and DRA are mutually exclusive, can't we just use the device-plugin pod reconciler (and rename it) for both device plugin and DRA?

@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from 740df9d to cb93dc3 Compare June 21, 2026 11:55
@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from cb93dc3 to fba8829 Compare June 21, 2026 12:07
labelName := utils.GetDevicePluginNodeLabel(pod.Namespace, moduleName)
isDRA := pod.Labels[constants.DaemonSetRole] == constants.DRARoleLabelValue

var labelName string

Contributor

Suggested change
- var labelName string
+ labelName := utils.GetDevicePluginNodeLabel(pod.Namespace, moduleName)
+ if isDRA {
+ 	labelName = utils.GetDRANodeLabel(pod.Namespace, moduleName)
+ }

Collaborator Author

Done.

// Make sure we don't already have a new running pod before unlabeling the node
// Making sure there is no other pod of the same role already running
labelSelector := client.MatchingLabels{constants.ModuleNameLabel: moduleName}
if isDRA {

Contributor

Haven't we also added a role label for devicePlugin?

Collaborator Author

Existing device-plugin pods from older KMM versions may not have the role label. Adding it to the selector would miss those pods during the replacement check and incorrectly unlabel the node. That's why we only narrow the selector for DRA (which is new — all DRA pods will always have the label) and use a post-filter skip for device-plugin instead.
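
To illustrate the backwards-compatibility argument, here is a minimal sketch of the selector logic using plain maps. The label keys, the "device-plugin" role value, and the replacementSelector and matches helpers are assumptions for illustration; the real code builds a client.MatchingLabels and lets the API server do the matching.

```go
package main

import "fmt"

// Label keys and role values are assumed; they stand in for
// constants.ModuleNameLabel, constants.DaemonSetRole and the role constants.
const (
	moduleNameLabel  = "kmm.node.kubernetes.io/module.name"
	roleLabel        = "kmm.node.kubernetes.io/role"
	draRole          = "dra"
	devicePluginRole = "device-plugin"
)

// replacementSelector narrows by role only for DRA, because every DRA pod
// is guaranteed to carry the role label.
func replacementSelector(moduleName string, isDRA bool) map[string]string {
	sel := map[string]string{moduleNameLabel: moduleName}
	if isDRA {
		sel[roleLabel] = draRole
	}
	return sel
}

// matches mimics equality-based label selection: every selector entry
// must be present on the pod with the same value.
func matches(sel, podLabels map[string]string) bool {
	for k, v := range sel {
		if got, ok := podLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	// A device-plugin pod created before the role label existed.
	legacy := map[string]string{moduleNameLabel: "my-module"}

	// The selector used for the device-plugin path still finds it.
	fmt.Println(matches(replacementSelector("my-module", false), legacy))

	// A hypothetical stricter selector that also required the
	// device-plugin role would miss it.
	strict := replacementSelector("my-module", false)
	strict[roleLabel] = devicePluginRole
	fmt.Println(matches(strict, legacy))
}
```

The second call returning false is the case described above: the legacy pod would be invisible to the replacement check and the node could be unlabeled while a working pod is still running on it.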

var foundRunningPod bool
for _, p := range modulePodsList.Items {
for _, p := range podsList.Items {
if !isDRA && p.Labels[constants.DaemonSetRole] == constants.DRARoleLabelValue {

Contributor

Don't we need to make the same check for DRA pods?

Collaborator Author

No — when isDRA is true, lines 69-71 already add DaemonSetRole=dra to the label selector, so the List only returns DRA pods. Device-plugin pods won't appear in the results. This skip (line 81) is only needed for the device-plugin path, where the query doesn't filter by role (for backwards compatibility), so DRA pods could appear in the list and need to be excluded.
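
The asymmetry described in this reply can be sketched as a scan over the pods the List call returned. The pod struct, the role label key, and the "ready pod on the same node" criterion are assumptions for illustration; the page only shows the loop header and the DRA skip, not the rest of the loop body.

```go
package main

import "fmt"

const (
	roleLabel = "kmm.node.kubernetes.io/role" // assumed key for constants.DaemonSetRole
	draRole   = "dra"                         // stands in for constants.DRARoleLabelValue
)

// pod is a simplified, hypothetical stand-in for corev1.Pod.
type pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
	Ready    bool
}

// hasReplacement reports whether another pod of the same role is already
// serving the node. For DRA the list is assumed to be pre-filtered by role,
// so the skip only applies on the device-plugin path.
func hasReplacement(listed []pod, nodeName string, isDRA bool) bool {
	for _, p := range listed {
		// The device-plugin query does not filter by role, so DRA pods
		// can show up in the list and must be ignored.
		if !isDRA && p.Labels[roleLabel] == draRole {
			continue
		}
		// Assumed criterion: a ready pod on the same node counts.
		if p.NodeName == nodeName && p.Ready {
			return true
		}
	}
	return false
}

func main() {
	listed := []pod{
		{Name: "dra-driver", NodeName: "node-1", Labels: map[string]string{roleLabel: draRole}, Ready: true},
		{Name: "legacy-device-plugin", NodeName: "node-2", Ready: true},
	}
	fmt.Println(hasReplacement(listed, "node-1", false)) // DRA pod is skipped
	fmt.Println(hasReplacement(listed, "node-2", false)) // legacy pod counts
}
```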

)
}

func (dppr *DevicePluginPodReconciler) addLabel(ctx context.Context, nodeName string, labelName string) error {

Contributor

Why not keep this function here? This is the only code that uses it.

Collaborator Author

Done — moved back as private methods on the reconciler.

@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from fba8829 to d2ff271 Compare June 21, 2026 15:58
@TomerNewman

Collaborator Author

/retest

@TomerNewman

Collaborator Author

/retest

@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch 2 times, most recently from 44e8e24 to f4c9bd9 Compare June 22, 2026 05:26
@ybettan

ybettan commented Jun 22, 2026

Contributor

I am happy with this PR. @yevgeny-shnaidman can lgtm when he is happy too.

@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from f4c9bd9 to 4e001cb Compare June 22, 2026 08:59
// Make sure we don't already have a new running pod before unlabeling the node
// Making sure there is no other pod of the same role already running
labelSelector := client.MatchingLabels{constants.ModuleNameLabel: moduleName}
if isDRA {

Contributor

We already have code at lines 42-43 that checks whether this is DRA or not. Let's define a variable there that holds either constants.DRARoleLabelValue or constants.DevicePluginRoleLabelValue, and here we just set it, without needing to check again whether it is DRA.

Merge DevicePluginPodReconciler and the new DRA pod labeling logic into
a single PodNodeLabelReconciler that handles both roles. The reconciler
watches DaemonSet pods with a ModuleNameLabel, determines the role from
the DaemonSetRole label, and manages the corresponding node label
(device-plugin-ready or dra-ready) based on pod readiness.
@TomerNewman TomerNewman force-pushed the MGMT-24473-dra-pod-reconciler branch from 4e001cb to d250135 Compare June 22, 2026 09:21
@yevgeny-shnaidman

Contributor

/lgtm

@kubernetes-prow kubernetes-prow Bot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 22, 2026
@TomerNewman

Collaborator Author

/retest

@kubernetes-prow kubernetes-prow Bot merged commit 9547ce6 into kubernetes-sigs:main Jun 22, 2026
23 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

5 participants