chore(cli): hand GPUBinding management over to app-service on upgrade #3119
Merged
Until now the gpu.bytetrade.io/GPUBinding CRD has been owned end-to-end
by HAMi: its scheduler created bindings via the /gpus* HTTP surface,
ran CleanupGPUBindingsLoop to garbage-collect orphans, and treated the
binding store as its private state. The new app-service compute
subsystem takes over that role — it picks the device, writes the
binding, and persists the canonical allocation in the configmap
os-framework/app-gpu-allocations — while HAMi v2.6.18 drops the
app-level routes and the cleanup loop and keeps only the scheduler
filter that consumes the bindings.
For fresh clusters this is just a chart upgrade, but every existing
cluster has a pile of GPUBindings produced by the old HAMi flow that
app-service does not recognize: no allocation-configmap row, no
managed-by labels, and (because the old CRD didn't carry them) no way
to tell apart bindings from two users who installed the same app.
Without a migration the upgrade would leave those workloads running
but invisible to app-service, and the next lifecycle op would either
no-op them or rewrite them from scratch.
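To make the gap concrete, here is a minimal sketch of how a migration could tell a legacy binding apart from one app-service already manages. The `Binding` struct, the label key `app.kubernetes.io/managed-by`, and the value `app-service` are assumptions for illustration; the real CRD types and label names are not shown in this description.

```go
package main

import "fmt"

// Binding is a hypothetical, trimmed-down view of a GPUBinding object.
type Binding struct {
	Name      string
	Labels    map[string]string
	Owner     string // stands in for spec.owner
	Namespace string // stands in for spec.namespace
}

// Assumed label key and value; the labels app-service really uses may differ.
const (
	managedByKey   = "app.kubernetes.io/managed-by"
	managedByValue = "app-service"
)

// isLegacy reports whether a binding lacks any of the metadata that
// app-service needs before it can treat the binding as its own.
func isLegacy(b Binding) bool {
	// Reading from a nil map is safe in Go and yields the empty string.
	if b.Labels[managedByKey] != managedByValue {
		return true
	}
	return b.Owner == "" || b.Namespace == ""
}

func main() {
	old := Binding{Name: "demo-old"}
	migrated := Binding{
		Name:      "demo-new",
		Labels:    map[string]string{managedByKey: managedByValue},
		Owner:     "alice",
		Namespace: "demo-alice",
	}
	fmt.Println(isLegacy(old), isLegacy(migrated))
}
```

A binding produced by the old HAMi flow fails every check, which is why it stays invisible until it is re-stamped.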
Add a one-shot upgrade flow that, in order:
1. Applies the bundled GPUBinding CRD before the HAMi helm upgrade
   so the apiserver knows about the new spec fields up-front. Helm
   3 does not refresh chart `crds/` on upgrade, so the schema bump
   would otherwise be silently dropped on existing clusters.
2. Lets upgraderBase run the regular HAMi helm upgrade to v2.6.18,
   which is the version that retires the legacy /gpus* routes and
   the CleanupGPUBindingsLoop in favor of app-service ownership.
3. Walks every legacy GPUBinding, resolves it back to its owning
   ApplicationManager via owner / namespace / pod-selector
   heuristics, drops the ones whose app is no longer in a state
   that holds an allocation, and re-stamps the survivors with the
   metadata app-service requires (managed-by labels, app/owner
   labels, mode label, and the new spec.owner / spec.namespace).
   Each migrated binding is also recorded as an allocation row in
   os-framework/app-gpu-allocations so app-service inherits the
   existing placement without rescheduling.
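Step 3 can be sketched roughly as follows. This is not the code in this PR: the struct shapes, the label keys (`managed-by`, `app`, `owner`), the row format, and the rule that only `stopped` / `uninstalled` apps give up their allocation are all assumptions, and the owner / namespace / pod-selector heuristics are reduced to a plain namespace lookup. The mode label is left out for brevity.

```go
package main

import "fmt"

// AppManager is a hypothetical stand-in for an ApplicationManager record.
type AppManager struct {
	App, Owner, Namespace, State string
}

// Binding is a hypothetical, trimmed-down GPUBinding.
type Binding struct {
	Name, Namespace, Device  string
	Labels                   map[string]string
	SpecOwner, SpecNamespace string
}

// Result collects what a migration pass would keep, delete, and record.
type Result struct {
	Kept    []Binding
	Dropped []string
	Rows    map[string]string // binding name -> device; assumed row shape
}

// holdsAllocation is an assumed rule: stopped or uninstalled apps hold no GPU.
func holdsAllocation(state string) bool {
	return state != "stopped" && state != "uninstalled"
}

// migrate resolves each binding to an app by namespace, drops bindings with no
// live owner, and re-stamps the survivors with owner metadata.
func migrate(bindings []Binding, apps map[string]AppManager) Result {
	res := Result{Rows: map[string]string{}}
	for _, b := range bindings {
		am, ok := apps[b.Namespace]
		if !ok || !holdsAllocation(am.State) {
			res.Dropped = append(res.Dropped, b.Name)
			continue
		}
		// Copy the labels so the caller's map is never mutated.
		labels := map[string]string{}
		for k, v := range b.Labels {
			labels[k] = v
		}
		labels["managed-by"] = "app-service"
		labels["app"] = am.App
		labels["owner"] = am.Owner
		b.Labels = labels
		b.SpecOwner = am.Owner
		b.SpecNamespace = am.Namespace
		res.Kept = append(res.Kept, b)
		res.Rows[b.Name] = b.Device
	}
	return res
}

func main() {
	apps := map[string]AppManager{
		"demo-alice": {App: "demo", Owner: "alice", Namespace: "demo-alice", State: "running"},
		"demo-bob":   {App: "demo", Owner: "bob", Namespace: "demo-bob", State: "stopped"},
	}
	bindings := []Binding{
		{Name: "b1", Namespace: "demo-alice", Device: "GPU-0"},
		{Name: "b2", Namespace: "demo-bob", Device: "GPU-1"},
		{Name: "b3", Namespace: "gone", Device: "GPU-1"},
	}
	res := migrate(bindings, apps)
	fmt.Println(len(res.Kept), res.Dropped, res.Rows)
}
```

Recording the existing device in the allocation row, rather than picking a new one, is what lets running workloads stay where they are.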
The flow is registered in both upgrader_1_12_6 (one-time path for
clusters jumping straight to 1.12.6) and a new daily upgrader
upgrader_1_12_7_20260520 (catches clusters already on 1.12.6+).
Bump the bundled HAMi image / chart version to v2.6.18 so fresh
installs ship the app-service-owned model from day one.
Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit b261070.
eball approved these changes on May 21, 2026
eball added a commit that referenced this pull request on Jun 12, 2026

Summary
Companion to the app-service compute subsystem work (beclab/Olares#3112) and the HAMi scheduler refactor (beclab/HAMi#19). Hands GPUBinding CRD management over from HAMi to app-service on upgrade, so the binding store stops being HAMi's private state and becomes the authoritative compute-allocation surface app-service drives.
- Bumps the bundled HAMi to `v2.6.18` (the version that drops the `/gpus*` HTTP routes and the `CleanupGPUBindingsLoop`).
- Applies the bundled GPUBinding CRD before the HAMi helm upgrade, because Helm 3 does not refresh chart `crds/` directories on `helm upgrade`.
- Migrates every legacy GPUBinding: resolves it to its owning ApplicationManager, drops bindings whose app no longer holds an allocation, and re-stamps the survivors with the metadata app-service requires (managed-by, app/owner, and mode labels, plus `spec.owner`, `spec.namespace`). Each migrated binding is also recorded in the new `os-framework/app-gpu-allocations` configmap so app-service inherits the existing placement without rescheduling any running workload.
The upgrade flow is registered both in `upgrader_1_12_6` (one-time path for clusters jumping straight to 1.12.6) and in a new daily upgrader `upgrader_1_12_7_20260520` (catches clusters already on 1.12.6+).
Test plan
- `cd cli && make build && go vet ./...`
- `go test ./cli/pkg/upgrade/...`
- Fresh install: the GPUBinding CRD has `spec.owner` / `spec.namespace` and the HAMi pod runs `v2.6.18`
- Cluster on `v2.6.17` with existing GPUBindings: run upgrade, verify (a) CRD has new fields, (b) HAMi rolls to `v2.6.18`, (c) running GPU workloads keep running, (d) their bindings are re-stamped with owner/namespace/labels, (e) `os-framework/app-gpu-allocations` configmap has matching rows
- App in `stopped` / `uninstalled` state: verify its legacy binding is removed and no allocation row is written for it
Note
Medium Risk
Touches Kubernetes CRDs and performs in-cluster migration/deletion of `GPUBinding` resources and writes a new allocation ConfigMap; mistakes could prune fields or drop active allocations, but the flow is bounded and largely idempotent/skips when resources are absent.
Overview
Upgrade flow now explicitly manages the `GPUBinding` CRD and migrates legacy bindings for app-service. The `1.12.6` upgrader (and a new daily `1.12.7-20260520` upgrader) runs a new `ApplyGPUBindingCRD` step before the HAMi Helm upgrade to ensure the cluster CRD schema includes `spec.namespace` / `spec.owner` (Helm won't upgrade `crds/`).
After system components upgrade, the CLI now runs `MigrateLegacyGPUBindings` to walk existing legacy `GPUBinding` objects, resolve the owning `ApplicationManager`, delete orphan/stale bindings, re-stamp surviving bindings with app-service labels/owner fields, and persist placement into `os-framework/app-gpu-allocations`.
Separately, the bundled HAMi version is bumped to `v2.6.18`, and the GPUBinding CRD manifest is updated to include the new `namespace` and `owner` fields.