chore(cli): hand GPUBinding management over to app-service on upgrade #3119

Merged: eball merged 1 commit into main from cli/chore/app_gpu_migration on May 21, 2026

@dkeven (Member) commented May 21, 2026

Summary

Companion to the app-service compute subsystem work (beclab/Olares#3112) and the HAMi scheduler refactor (beclab/HAMi#19). Hands GPUBinding CRD management over from HAMi to app-service on upgrade, so the binding store stops being HAMi's private state and becomes the authoritative compute-allocation surface app-service drives.

  • Bump bundled HAMi image / chart to v2.6.18 (the version that drops the /gpus* HTTP routes and the CleanupGPUBindingsLoop).
  • Apply the new GPUBinding CRD schema explicitly during upgrade, because Helm 3 does not refresh chart crds/ directories on helm upgrade.
  • Walk every legacy GPUBinding produced by old HAMi, resolve it back to its owning ApplicationManager via owner / namespace / pod-selector heuristics, drop the ones whose app is no longer holding an allocation, and re-stamp the survivors with the metadata app-service needs (managed-by labels, app/owner labels, mode label, spec.owner, spec.namespace). Each migrated binding is also recorded in the new os-framework/app-gpu-allocations configmap so app-service inherits the existing placement without rescheduling any running workload.
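
The per-binding decision in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Binding` and `App` types, the `migrateOne` helper, and the `example.io/...` label keys are all hypothetical stand-ins, and the mode label and the owner-resolution heuristics are left out.

```go
package main

import "fmt"

// Binding is a simplified stand-in for a GPUBinding object (hypothetical shape).
type Binding struct {
	Name      string
	Labels    map[string]string
	Owner     string // stands in for spec.owner
	Namespace string // stands in for spec.namespace
}

// App is a hypothetical summary of the ApplicationManager a binding resolved to.
type App struct {
	Name, Owner, Namespace string
	HoldsAllocation        bool // false for e.g. stopped or uninstalled apps
}

// Hypothetical label keys; the real keys are not given in the PR text.
const (
	labelManagedBy = "example.io/managed-by"
	labelApp       = "example.io/app"
	labelOwner     = "example.io/owner"
)

// migrateOne returns the re-stamped binding and true when the binding should
// be kept, or the untouched binding and false when it should be deleted.
// app is nil when no owning ApplicationManager could be resolved.
func migrateOne(b Binding, app *App) (Binding, bool) {
	if app == nil || !app.HoldsAllocation {
		return b, false
	}
	labels := make(map[string]string, len(b.Labels)+3)
	for k, v := range b.Labels {
		labels[k] = v // preserve whatever the legacy binding already carried
	}
	labels[labelManagedBy] = "app-service"
	labels[labelApp] = app.Name
	labels[labelOwner] = app.Owner
	b.Labels = labels
	b.Owner = app.Owner
	b.Namespace = app.Namespace
	return b, true
}

func main() {
	legacy := Binding{Name: "binding-1", Labels: map[string]string{"legacy": "true"}}
	running := &App{Name: "demo", Owner: "alice", Namespace: "demo-alice", HoldsAllocation: true}
	stopped := &App{Name: "demo", Owner: "bob", Namespace: "demo-bob"}

	if out, keep := migrateOne(legacy, running); keep {
		fmt.Println("keep:", out.Owner, out.Namespace, out.Labels[labelManagedBy])
	}
	if _, keep := migrateOne(legacy, stopped); !keep {
		fmt.Println("delete: app holds no allocation")
	}
	if _, keep := migrateOne(legacy, nil); !keep {
		fmt.Println("delete: no owner resolved")
	}
}
```

The sketch copies the label map instead of mutating it, so the legacy object stays intact until the caller decides to write the update.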

The upgrade flow is registered both in upgrader_1_12_6 (one-time path for clusters jumping straight to 1.12.6) and in a new daily upgrader upgrader_1_12_7_20260520 (catches clusters already on 1.12.6+).

Test plan

  • cd cli && make build && go vet ./...
  • go test ./cli/pkg/upgrade/...
  • Manual: fresh single-node GPU install, verify the new GPUBinding CRD ships with spec.owner / spec.namespace and HAMi pod runs v2.6.18
  • Manual: cluster on pre-1.12.6 with HAMi v2.6.17 and existing GPUBindings — run upgrade, verify (a) CRD has new fields, (b) HAMi rolls to v2.6.18, (c) running GPU workloads keep running, (d) their bindings are re-stamped with owner/namespace/labels, (e) os-framework/app-gpu-allocations configmap has matching rows
  • Manual: same cluster but with one GPU app in stopped / uninstalled state — verify its legacy binding is removed and no allocation row is written for it
  • Manual: cluster already on 1.12.6+ — verify the daily upgrader picks it up and the migration is idempotent on a re-run
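
For the idempotency check in the last item, one common way to make re-stamping safe to repeat is to write only the labels that are missing or different and to report whether anything changed. The sketch below assumes a hypothetical `ensureLabels` helper and a hypothetical label key; it does not reflect how the PR implements the check.

```go
package main

import "fmt"

// ensureLabels copies every key in want into labels and reports whether
// anything actually changed. labels must be a non-nil map.
func ensureLabels(labels, want map[string]string) bool {
	changed := false
	for k, v := range want {
		if cur, ok := labels[k]; !ok || cur != v {
			labels[k] = v
			changed = true
		}
	}
	return changed
}

func main() {
	labels := map[string]string{"legacy": "true"}
	// Hypothetical key; the real managed-by label is not named in the PR text.
	want := map[string]string{"example.io/managed-by": "app-service"}

	fmt.Println(ensureLabels(labels, want)) // first run: true
	fmt.Println(ensureLabels(labels, want)) // re-run: false
}
```

A caller can skip the API update when the helper returns false, which keeps a second migration run from touching objects that are already migrated.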

Made with Cursor


Note

Medium Risk
Touches Kubernetes CRDs, migrates and deletes GPUBinding resources in-cluster, and writes a new allocation ConfigMap. Mistakes could prune fields or drop active allocations, but the flow is bounded, largely idempotent, and skips steps when the relevant resources are absent.

Overview
Upgrade flow now explicitly manages the GPUBinding CRD and migrates legacy bindings for app-service. The 1.12.6 upgrader (and a new daily 1.12.7-20260520 upgrader) runs a new ApplyGPUBindingCRD step before the HAMi Helm upgrade to ensure the cluster CRD schema includes spec.namespace/spec.owner (Helm won’t upgrade crds/).

After system components upgrade, the CLI now runs MigrateLegacyGPUBindings to walk existing legacy GPUBinding objects, resolve the owning ApplicationManager, delete orphan/stale bindings, re-stamp surviving bindings with app-service labels/owner fields, and persist placement into os-framework/app-gpu-allocations.

Separately, the bundled HAMi version is bumped to v2.6.18, and the GPUBinding CRD manifest is updated to include the new namespace and owner fields.

Reviewed by Cursor Bugbot for commit b261070. Bugbot is set up for automated code reviews on this repo.

Until now the gpu.bytetrade.io/GPUBinding CRD has been owned end-to-end
by HAMi: its scheduler created bindings via the /gpus* HTTP surface,
ran CleanupGPUBindingsLoop to garbage-collect orphans, and treated the
binding store as its private state. The new app-service compute
subsystem takes over that role — it picks the device, writes the
binding, and persists the canonical allocation in the configmap
os-framework/app-gpu-allocations — while HAMi v2.6.18 drops the
app-level routes and the cleanup loop and keeps only the scheduler
filter that consumes the bindings.

For fresh clusters this is just a chart upgrade, but every existing
cluster has a pile of GPUBindings produced by the old HAMi flow that
app-service does not recognize: no allocation-configmap row, no
managed-by labels, and (because the old CRD didn't carry them) no way
to tell apart bindings from two users who installed the same app.
Without a migration the upgrade would leave those workloads running
but invisible to app-service, and the next lifecycle op would either
no-op them or rewrite them from scratch.
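
The ambiguity mentioned above can be illustrated with a small grouping exercise: keyed by app name alone, two users' bindings collapse into one bucket, while a key that also carries owner and namespace keeps them apart. The `Identity` type and both helpers are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// Identity is a hypothetical key for a binding. App is all the old schema
// effectively offered; Owner and Namespace mirror the new spec fields.
type Identity struct{ App, Owner, Namespace string }

// groupByApp keys bindings by app name alone, as a store without owner and
// namespace information would have to.
func groupByApp(ids []Identity) map[string][]Identity {
	out := make(map[string][]Identity)
	for _, id := range ids {
		out[id.App] = append(out[id.App], id)
	}
	return out
}

// groupByIdentity keys bindings by the full (app, owner, namespace) triple.
func groupByIdentity(ids []Identity) map[Identity][]Identity {
	out := make(map[Identity][]Identity)
	for _, id := range ids {
		out[id] = append(out[id], id)
	}
	return out
}

func main() {
	ids := []Identity{
		{App: "demo", Owner: "alice", Namespace: "demo-alice"},
		{App: "demo", Owner: "bob", Namespace: "demo-bob"},
	}
	fmt.Println(len(groupByApp(ids)["demo"])) // both users share one bucket
	fmt.Println(len(groupByIdentity(ids)))    // one bucket per user
}
```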

Add a one-shot upgrade flow that, in order:

  1. Applies the bundled GPUBinding CRD before the HAMi helm upgrade
     so the apiserver knows about the new spec fields up-front. Helm
     3 does not refresh chart `crds/` on upgrade, so the schema bump
     would otherwise be silently dropped on existing clusters.
  2. Lets upgraderBase run the regular HAMi helm upgrade to v2.6.18,
     which is the version that retires the legacy /gpus* routes and
     the CleanupGPUBindingsLoop in favor of app-service ownership.
  3. Walks every legacy GPUBinding, resolves it back to its owning
     ApplicationManager via owner / namespace / pod-selector
     heuristics, drops the ones whose app is no longer in a state
     that holds an allocation, and re-stamps the survivors with the
     metadata app-service requires (managed-by labels, app/owner
     labels, mode label, and the new spec.owner / spec.namespace).
     Each migrated binding is also recorded as an allocation row in
     os-framework/app-gpu-allocations so app-service inherits the
     existing placement without rescheduling.

The flow is registered in both upgrader_1_12_6 (one-time path for
clusters jumping straight to 1.12.6) and a new daily upgrader
upgrader_1_12_7_20260520 (catches clusters already on 1.12.6+).

Bump the bundled HAMi image / chart version to v2.6.18 so fresh
installs ship the app-service-owned model from day one.

Co-authored-by: Cursor <cursoragent@cursor.com>

@vercel (bot) commented May 21, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| olares | Ready | Preview, Comment | May 21, 2026 6:16am |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| olares-docs | Ignored | | May 21, 2026 6:16am |

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b261070.

Comment thread cli/pkg/upgrade/migrate_gpu_bindings.go
@eball merged commit 4ea6691 into main May 21, 2026
25 of 26 checks passed
@eball added a commit that referenced this pull request Jun 12, 2026