chore(cli): hand GPUBinding management over to app-service on upgrade by dkeven · Pull Request #3119 · beclab/Olares

dkeven · 2026-05-21T06:15:52Z

Summary

Companion to the app-service compute subsystem work (beclab/Olares#3112) and the HAMi scheduler refactor (beclab/HAMi#19). Hands GPUBinding CRD management over from HAMi to app-service on upgrade, so the binding store stops being HAMi's private state and becomes the authoritative compute-allocation surface app-service drives.

Bump bundled HAMi image / chart to v2.6.18 (the version that drops the /gpus* HTTP routes and the CleanupGPUBindingsLoop).
Apply the new GPUBinding CRD schema explicitly during upgrade, because Helm 3 does not refresh chart crds/ directories on helm upgrade.
Walk every legacy GPUBinding produced by old HAMi, resolve it back to its owning ApplicationManager via owner / namespace / pod-selector heuristics, drop the ones whose app is no longer holding an allocation, and re-stamp the survivors with the metadata app-service needs (managed-by labels, app/owner labels, mode label, spec.owner, spec.namespace). Each migrated binding is also recorded in the new os-framework/app-gpu-allocations configmap so app-service inherits the existing placement without rescheduling any running workload.
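
The re-stamping step in the last item can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the `LegacyBinding` and `ResolvedApp` types, the `restamp` helper, and all label keys are hypothetical placeholders, and the real migration operates on Kubernetes objects rather than plain structs.

```go
package main

import "fmt"

// LegacyBinding is a hypothetical, trimmed-down view of a GPUBinding object.
type LegacyBinding struct {
	Name      string
	Labels    map[string]string
	Owner     string // stands in for spec.owner
	Namespace string // stands in for spec.namespace
}

// ResolvedApp is a hypothetical result of resolving a binding back to its
// owning ApplicationManager.
type ResolvedApp struct {
	AppName   string
	Owner     string
	Namespace string
	Mode      string
}

// Placeholder label keys; the keys app-service actually uses are not shown
// in the PR description.
const (
	labelManagedBy = "example.invalid/managed-by"
	labelApp       = "example.invalid/app"
	labelOwner     = "example.invalid/owner"
	labelMode      = "example.invalid/mode"
)

// restamp writes the desired metadata onto the binding and reports whether
// anything changed, so a second pass over the same binding is a no-op.
func restamp(b *LegacyBinding, app ResolvedApp) bool {
	changed := false
	if b.Labels == nil {
		b.Labels = map[string]string{}
	}
	want := map[string]string{
		labelManagedBy: "app-service",
		labelApp:       app.AppName,
		labelOwner:     app.Owner,
		labelMode:      app.Mode,
	}
	for k, v := range want {
		if b.Labels[k] != v {
			b.Labels[k] = v
			changed = true
		}
	}
	if b.Owner != app.Owner {
		b.Owner = app.Owner
		changed = true
	}
	if b.Namespace != app.Namespace {
		b.Namespace = app.Namespace
		changed = true
	}
	return changed
}

func main() {
	b := &LegacyBinding{Name: "binding-a"}
	app := ResolvedApp{AppName: "demo", Owner: "alice", Namespace: "demo-alice", Mode: "exclusive"}
	fmt.Println(restamp(b, app)) // first pass stamps the metadata
	fmt.Println(restamp(b, app)) // second pass finds nothing to change
}
```

Returning a "changed" flag is one way to let a caller skip the update call when a binding is already in the desired shape.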

The upgrade flow is registered both in upgrader_1_12_6 (one-time path for clusters jumping straight to 1.12.6) and in a new daily upgrader upgrader_1_12_7_20260520 (catches clusters already on 1.12.6+).

Test plan

cd cli && make build && go vet ./...
go test ./cli/pkg/upgrade/...
Manual: fresh single-node GPU install, verify the new GPUBinding CRD ships with spec.owner / spec.namespace and HAMi pod runs v2.6.18
Manual: cluster on pre-1.12.6 with HAMi v2.6.17 and existing GPUBindings — run upgrade, verify (a) CRD has new fields, (b) HAMi rolls to v2.6.18, (c) running GPU workloads keep running, (d) their bindings are re-stamped with owner/namespace/labels, (e) os-framework/app-gpu-allocations configmap has matching rows
Manual: same cluster but with one GPU app in stopped / uninstalled state — verify its legacy binding is removed and no allocation row is written for it
Manual: cluster already on 1.12.6+ — verify the daily upgrader picks it up and the migration is idempotent on a re-run
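
The keep-or-drop behaviour and the idempotency expectation from the last two manual checks could be expressed in a unit-test-friendly form along these lines. Everything here is an assumption made for illustration: the `binding` type, the `owner/app` row key, the state names, and the rule that any known state other than stopped or uninstalled holds an allocation.

```go
package main

import "fmt"

// binding is a hypothetical, trimmed-down stand-in for a legacy GPUBinding.
type binding struct {
	Name   string
	Owner  string
	App    string
	Device string
}

// holdsAllocation is an assumption: the PR only states that stopped and
// uninstalled apps lose their binding; every other known state is treated
// here as still holding one.
func holdsAllocation(state string) bool {
	return state != "stopped" && state != "uninstalled"
}

// migrate splits bindings into kept and dropped, and upserts one allocation
// row per kept binding. rows stands in for the configmap data.
func migrate(bindings []binding, appState, rows map[string]string) (kept, dropped []binding) {
	for _, b := range bindings {
		key := b.Owner + "/" + b.App
		state, known := appState[key]
		if !known || !holdsAllocation(state) {
			// Orphaned or stale: drop the binding and write no row.
			dropped = append(dropped, b)
			continue
		}
		rows[key] = b.Device // upsert, so a re-run rewrites the same value
		kept = append(kept, b)
	}
	return kept, dropped
}

func main() {
	bindings := []binding{
		{Name: "b1", Owner: "alice", App: "demo", Device: "GPU-0"},
		{Name: "b2", Owner: "bob", App: "demo", Device: "GPU-1"},
		{Name: "b3", Owner: "carol", App: "gone", Device: "GPU-0"},
	}
	appState := map[string]string{"alice/demo": "running", "bob/demo": "stopped"}
	rows := map[string]string{}

	kept, dropped := migrate(bindings, appState, rows)
	fmt.Println(len(kept), len(dropped), len(rows)) // 1 2 1

	// A re-run over the survivors changes nothing.
	kept, dropped = migrate(kept, appState, rows)
	fmt.Println(len(kept), len(dropped), len(rows)) // 1 0 1
}
```

Keying rows by owner and app together reflects the point that two users can install the same app, which the old binding format could not distinguish.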

Made with Cursor

Note

Medium Risk
Touches Kubernetes CRDs and performs in-cluster migration/deletion of GPUBinding resources and writes a new allocation ConfigMap; mistakes could prune fields or drop active allocations, but the flow is bounded and largely idempotent/skips when resources are absent.

Overview
Upgrade flow now explicitly manages the GPUBinding CRD and migrates legacy bindings for app-service. The 1.12.6 upgrader (and a new daily 1.12.7-20260520 upgrader) runs a new ApplyGPUBindingCRD step before the HAMi Helm upgrade to ensure the cluster CRD schema includes spec.namespace/spec.owner (Helm won’t upgrade crds/).

After system components upgrade, the CLI now runs MigrateLegacyGPUBindings to walk existing legacy GPUBinding objects, resolve the owning ApplicationManager, delete orphan/stale bindings, re-stamp surviving bindings with app-service labels/owner fields, and persist placement into os-framework/app-gpu-allocations.

Separately, the bundled HAMi version is bumped to v2.6.18, and the GPUBinding CRD manifest is updated to include the new namespace and owner fields.
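
One way to check that a cluster's CRD really carries the new fields is to inspect the OpenAPI schema of each version in the CRD object, for example from the JSON output of `kubectl get crd`. The sketch below assumes the standard `apiextensions.k8s.io/v1` layout (`spec.versions[].schema.openAPIV3Schema`); the helper name `crdHasSpecFields` and the sample documents are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crdHasSpecFields reports whether every version in the CRD declares all of
// the given property names under .spec in its OpenAPI v3 schema.
func crdHasSpecFields(crdJSON []byte, fields ...string) (bool, error) {
	var crd struct {
		Spec struct {
			Versions []struct {
				Name   string `json:"name"`
				Schema struct {
					OpenAPIV3Schema struct {
						Properties map[string]struct {
							Properties map[string]json.RawMessage `json:"properties"`
						} `json:"properties"`
					} `json:"openAPIV3Schema"`
				} `json:"schema"`
			} `json:"versions"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(crdJSON, &crd); err != nil {
		return false, err
	}
	if len(crd.Spec.Versions) == 0 {
		return false, nil
	}
	for _, v := range crd.Spec.Versions {
		// A missing "spec" entry yields a zero value with a nil map,
		// which is safe to read from.
		specProps := v.Schema.OpenAPIV3Schema.Properties["spec"].Properties
		for _, f := range fields {
			if _, ok := specProps[f]; !ok {
				return false, nil
			}
		}
	}
	return true, nil
}

func main() {
	// Hypothetical, heavily trimmed CRD document; the version name and the
	// "uuid" property are illustrative only.
	crd := []byte(`{"spec":{"versions":[{"name":"v1","schema":{"openAPIV3Schema":
		{"properties":{"spec":{"properties":{
			"uuid":{"type":"string"},
			"owner":{"type":"string"},
			"namespace":{"type":"string"}}}}}}}]}}`)
	ok, err := crdHasSpecFields(crd, "owner", "namespace")
	fmt.Println(ok, err)
}
```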

Reviewed by Cursor Bugbot for commit b261070. Bugbot is set up for automated code reviews on this repo.

Until now the gpu.bytetrade.io/GPUBinding CRD has been owned end-to-end by HAMi: its scheduler created bindings via the /gpus* HTTP surface, ran CleanupGPUBindingsLoop to garbage-collect orphans, and treated the binding store as its private state. The new app-service compute subsystem takes over that role: it picks the device, writes the binding, and persists the canonical allocation in the configmap os-framework/app-gpu-allocations. HAMi v2.6.18, in turn, drops the app-level routes and the cleanup loop and keeps only the scheduler filter that consumes the bindings.

For fresh clusters this is just a chart upgrade, but every existing cluster has a pile of GPUBindings produced by the old HAMi flow that app-service does not recognize: no allocation-configmap row, no managed-by labels, and (because the old CRD didn't carry them) no way to tell apart bindings from two users who installed the same app. Without a migration the upgrade would leave those workloads running but invisible to app-service, and the next lifecycle op would either no-op them or rewrite them from scratch.

Add a one-shot upgrade flow that, in order:

1. Applies the bundled GPUBinding CRD before the HAMi helm upgrade so the apiserver knows about the new spec fields up-front. Helm 3 does not refresh chart `crds/` on upgrade, so the schema bump would otherwise be silently dropped on existing clusters.
2. Lets upgraderBase run the regular HAMi helm upgrade to v2.6.18, which is the version that retires the legacy /gpus* routes and the CleanupGPUBindingsLoop in favor of app-service ownership.
3. Walks every legacy GPUBinding, resolves it back to its owning ApplicationManager via owner / namespace / pod-selector heuristics, drops the ones whose app is no longer in a state that holds an allocation, and re-stamps the survivors with the metadata app-service requires (managed-by labels, app/owner labels, mode label, and the new spec.owner / spec.namespace).

Each migrated binding is also recorded as an allocation row in os-framework/app-gpu-allocations so app-service inherits the existing placement without rescheduling.

The flow is registered in both upgrader_1_12_6 (one-time path for clusters jumping straight to 1.12.6) and a new daily upgrader upgrader_1_12_7_20260520 (catches clusters already on 1.12.6+).

Bump the bundled HAMi image / chart version to v2.6.18 so fresh installs ship the app-service-owned model from day one.

Co-authored-by: Cursor <cursoragent@cursor.com>

vercel · 2026-05-21T06:15:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
olares	Ready	Preview, Comment	May 21, 2026 6:16am

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
olares-docs	Ignored		May 21, 2026 6:16am

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.



vercel Bot deployed to Preview – olares May 21, 2026 06:16

cursor Bot reviewed May 21, 2026


Comment thread cli/pkg/upgrade/migrate_gpu_bindings.go

eball approved these changes May 21, 2026


eball merged commit 4ea6691 into main May 21, 2026
25 of 26 checks passed

eball added a commit that referenced this pull request Jun 12, 2026

cherry-pick: chore(cli): hand GPUBinding management over to app-servi…

682a109

…ce on upgrade (#3119)

eball merged 1 commit into main from cli/chore/app_gpu_migration
