Problem Statement
Disposable per-user sandboxes backed by durable PVCs (as proposed in open PR
#2034) give great continuity, but on Kubernetes OpenShell's lifecycle management
exposes only create/delete, with no suspend/resume. There's no way to free a
sandbox's compute when idle while retaining its identity and PVC. For deployments with many
provisioned-but-often-idle users, keeping every pod running 24/7 is a large,
mostly-wasted compute cost; deleting on idle instead loses the sandbox identity
(and, without #2034, the data). We want: idle → free compute; next login → resume
with state intact.
This is the Kubernetes-driver realization of the general capability proposed in
#1823 (checkpoint/pause/resume), scoped concretely to what agent-sandbox already
implements.
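The target behavior (idle → free compute; next login → resume) can be sketched as a small decision function. This is only an illustration: IdlePolicy, Action, decide, and the idle_timeout knob are hypothetical names that do not exist in OpenShell or agent-sandbox, and the real trigger could just as well live in the controlling app.

```rust
use std::time::Duration;

/// What the caller should do with a sandbox right now (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    /// Leave the sandbox in its current state.
    Keep,
    /// Release the sandbox's compute, keeping identity and PVC.
    Suspend,
    /// Bring the sandbox's compute back.
    Resume,
}

/// Hypothetical idle policy; `idle_timeout` is an illustrative knob,
/// not an existing OpenShell setting.
pub struct IdlePolicy {
    pub idle_timeout: Duration,
}

impl IdlePolicy {
    /// Decide the next lifecycle action for one sandbox.
    pub fn decide(&self, suspended: bool, idle_for: Duration, session_starting: bool) -> Action {
        if session_starting {
            // Next login: resume only if compute was previously released.
            return if suspended { Action::Resume } else { Action::Keep };
        }
        if !suspended && idle_for >= self.idle_timeout {
            // Idle long enough: free compute.
            Action::Suspend
        } else {
            Action::Keep
        }
    }
}
```

A session start always takes priority over the idle check, so a user logging in at the moment the timeout expires is never suspended.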
Proposed Design
Surface agent-sandbox's existing suspend/resume capability through the OpenShell
Kubernetes driver and gateway/CLI — the lifecycle analog of how #2034 surfaced
pod-template/volume config through driver_config.kubernetes.
agent-sandbox already implements this in its Sandbox CRD and controller:
- v1beta1: spec.operatingMode: Running | Suspended (default Running).
- v1alpha1: the equivalent is spec.replicas (0 = suspended; the API
conversion maps Suspended ↔ replicas=0).
- On Suspended, the controller deletes the backing Pod (frees CPU/memory)
while leaving the Sandbox object and its PVCs in place — PVCs are reconciled
independently and removed only when the Sandbox itself is deleted. Status surfaces
a Suspended condition (PodTerminated / PodNotTerminated).
- On Running, the controller recreates the Pod and reattaches the same PVC(s).
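The relationship between the two API versions described above can be modeled as a small conversion. This is a sketch, not agent-sandbox's actual conversion code: the OperatingMode type and its methods are illustrative names, and two details are assumptions the text above does not state, namely that Running corresponds to exactly 1 replica and that any non-zero replica count reads back as Running.

```rust
/// Desired lifecycle state, mirroring the v1beta1 `spec.operatingMode` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OperatingMode {
    /// Default when the field is unset.
    #[default]
    Running,
    Suspended,
}

impl OperatingMode {
    /// v1alpha1 form: `spec.replicas` of 0 means suspended.
    /// Assumption: a running sandbox is represented by 1 replica.
    pub fn to_replicas(self) -> u32 {
        match self {
            OperatingMode::Running => 1,
            OperatingMode::Suspended => 0,
        }
    }

    /// Inverse mapping from v1alpha1 `spec.replicas`.
    /// Assumption: any non-zero count is treated as Running.
    pub fn from_replicas(replicas: u32) -> Self {
        if replicas == 0 {
            OperatingMode::Suspended
        } else {
            OperatingMode::Running
        }
    }

    /// String form as it appears in `spec.operatingMode`.
    pub fn as_str(self) -> &'static str {
        match self {
            OperatingMode::Running => "Running",
            OperatingMode::Suspended => "Suspended",
        }
    }
}
```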
What OpenShell would add:
- Driver: set operatingMode: Suspended (v1beta1) / replicas=0 (v1alpha1) on the
managed Sandbox CR to suspend, and flip it back to resume.
- Gateway: keep the sandbox registered across a suspend (don't treat the absent
Pod as a dead sandbox) and re-route on resume when the Pod returns.
- Interface: a lifecycle op (e.g. openshell sandbox suspend|resume) and/or an
idle policy; resume triggered by the controlling app on session start.
- Existing seam: OpenShell already defines a StopSandbox RPC in the
compute-driver contract (proto/compute_driver.proto), but it is currently
unimplemented for the Kubernetes driver
(crates/openshell-driver-kubernetes/src/grpc.rs), which makes it a natural hook
for wiring suspend, with resume as its counterpart.
- Pairs with open PR #2034: that PR proposes the durable, caller-owned per-user PVC;
this gives the lifecycle to free its compute while keeping the data.
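A rough sketch of the driver and gateway pieces listed above, using only the standard library. Everything here is hypothetical and does not reflect existing OpenShell code: ApiVersion, Lifecycle, lifecycle_patch, and Registry are illustrative names. The sketch assumes the driver would apply the change as a JSON merge patch, that resuming a v1alpha1 Sandbox means replicas=1, and that the gateway currently drops a sandbox once its Pod is gone.

```rust
use std::collections::HashMap;

/// Sandbox CRD version the driver is talking to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApiVersion {
    V1Alpha1,
    V1Beta1,
}

/// Desired lifecycle state of a managed sandbox.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Running,
    Suspended,
}

/// Driver side: merge-patch body to apply to the managed Sandbox CR.
pub fn lifecycle_patch(version: ApiVersion, target: Lifecycle) -> &'static str {
    match (version, target) {
        (ApiVersion::V1Beta1, Lifecycle::Running) => r#"{"spec":{"operatingMode":"Running"}}"#,
        (ApiVersion::V1Beta1, Lifecycle::Suspended) => r#"{"spec":{"operatingMode":"Suspended"}}"#,
        // Assumption: resuming in v1alpha1 means a single replica.
        (ApiVersion::V1Alpha1, Lifecycle::Running) => r#"{"spec":{"replicas":1}}"#,
        (ApiVersion::V1Alpha1, Lifecycle::Suspended) => r#"{"spec":{"replicas":0}}"#,
    }
}

/// Gateway side: registrations that survive a suspend.
pub struct Registry {
    entries: HashMap<String, Lifecycle>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { entries: HashMap::new() }
    }

    /// Register a newly created (running) sandbox.
    pub fn register(&mut self, id: &str) {
        self.entries.insert(id.to_string(), Lifecycle::Running);
    }

    /// Record a state change; returns false if the sandbox is unknown.
    fn set(&mut self, id: &str, state: Lifecycle) -> bool {
        match self.entries.get_mut(id) {
            Some(entry) => {
                *entry = state;
                true
            }
            None => false,
        }
    }

    pub fn suspend(&mut self, id: &str) -> bool {
        self.set(id, Lifecycle::Suspended)
    }

    pub fn resume(&mut self, id: &str) -> bool {
        self.set(id, Lifecycle::Running)
    }

    /// Called when the sandbox's Pod is observed to be absent.
    /// Returns true if the registration was evicted as dead.
    pub fn on_pod_absent(&mut self, id: &str) -> bool {
        match self.entries.get(id) {
            // Expected while suspended: keep the registration.
            Some(Lifecycle::Suspended) => false,
            // Assumed existing behavior: a missing Pod means a dead sandbox.
            Some(Lifecycle::Running) => {
                self.entries.remove(id);
                true
            }
            None => false,
        }
    }

    pub fn is_registered(&self, id: &str) -> bool {
        self.entries.contains_key(id)
    }
}
```

Recording the intended state before the Pod disappears is what lets the gateway tell a deliberate suspend apart from a crashed sandbox.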
Alternatives Considered
- Always-on pods (status quo): simplest, but pays compute for every provisioned
user, not just active ones — expensive at scale.
- Delete + recreate on idle/login: frees compute, but churns the sandbox identity
and pays a full cold-create each login; with #2034 the data survives, but
registration/orphan handling is messier than a first-class suspend.
- In-place pod restart only: doesn't free compute.
- → operatingMode suspend/resume is preferable: it's a first-class primitive
already modeled and implemented in agent-sandbox; OpenShell only needs to expose
and drive it.
Related: #1823 (general checkpoint/pause/resume design), #1551 (VM-driver
suspend/resume). Builds on open PR #2034 by @mjamiv.