Spike: should the built-in compute drivers move to the external (out-of-process) model? #2061

Description

@elezar

Spike/investigation spun out of the #1952 discussion. This poses the problem and the
questions to investigate — it deliberately proposes no concrete design. The output of
the spike is a recommendation on whether and in what order to pursue the migration,
which then feeds a design RFC if warranted.

Problem

Today the built-in compute drivers run two different ways:

  • In-process via the ComputeDriver trait: Docker, Podman, Kubernetes.
  • Out-of-process over compute_driver.proto: VM (gateway-spawned) and all third-party --compute-driver-socket drivers.
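To make the two paths concrete, here is a minimal sketch, not the gateway's actual code. The trait method `create_sandbox`, the types `InProcessDriver` and `ExternalDriver`, and the helpers `serve` and `demo` are all hypothetical names chosen for illustration. A newline-delimited text message stands in for the real compute_driver.proto encoding, and a `UnixStream::pair()` stands in for the driver socket. It only illustrates the current split: the same gateway-side call either runs driver code directly or crosses a Unix socket to a separate driver.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Hypothetical, heavily simplified stand-in for the gateway's driver trait.
trait ComputeDriver {
    fn create_sandbox(&mut self, name: &str) -> Result<String, String>;
}

/// In-process path: the gateway calls straight into driver code.
struct InProcessDriver;

impl ComputeDriver for InProcessDriver {
    fn create_sandbox(&mut self, name: &str) -> Result<String, String> {
        Ok(format!("in-process:{name}"))
    }
}

/// Out-of-process path: the gateway only holds a socket to the driver.
struct ExternalDriver {
    writer: UnixStream,
    reader: BufReader<UnixStream>,
}

impl ExternalDriver {
    fn new(stream: UnixStream) -> std::io::Result<Self> {
        let reader = BufReader::new(stream.try_clone()?);
        Ok(Self { writer: stream, reader })
    }
}

impl ComputeDriver for ExternalDriver {
    fn create_sandbox(&mut self, name: &str) -> Result<String, String> {
        // Assumption: a text line stands in for the real proto request.
        writeln!(self.writer, "create {name}").map_err(|e| e.to_string())?;
        let mut reply = String::new();
        self.reader.read_line(&mut reply).map_err(|e| e.to_string())?;
        Ok(reply.trim_end().to_string())
    }
}

/// Toy driver, run on a thread here: answers one request per line until EOF.
fn serve(stream: UnixStream) {
    let mut writer = stream.try_clone().expect("clone socket");
    for line in BufReader::new(stream).lines() {
        let Ok(line) = line else { break };
        let name = line.strip_prefix("create ").unwrap_or("?");
        if writeln!(writer, "external:{name}").is_err() {
            break;
        }
    }
}

/// Runs the same gateway-side call against both kinds of driver.
fn demo() -> (String, String) {
    let (gateway_end, driver_end) = UnixStream::pair().expect("socket pair");
    let server = thread::spawn(move || serve(driver_end));

    let mut drivers: Vec<Box<dyn ComputeDriver>> = vec![
        Box::new(InProcessDriver),
        Box::new(ExternalDriver::new(gateway_end).expect("wrap socket")),
    ];
    let results: Vec<String> = drivers
        .iter_mut()
        .map(|d| d.create_sandbox("sb1").expect("create"))
        .collect();

    drop(drivers); // closes the gateway end so the driver loop sees EOF
    server.join().expect("driver thread");
    (results[0].clone(), results[1].clone())
}
```

Calling `demo()` from a `main` function yields one result per path. The point of the sketch is that anything the in-process implementation can reach beyond the trait's arguments (for example gateway-local state) has no counterpart on the socket side unless the message format carries it.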

Maintaining both paths has a recurring cost. Because Docker is in-process, it keeps growing hooks that reach into gateway-local state with no proto equivalent, and those hooks then have to be unwound for alignment. Two such unwindings are in flight right now: #1951 (sandbox readiness) and #1952 (gateway callback listeners).

This spike investigates whether migrating the built-in drivers onto the external model (the one VM and third-party drivers already use) is worthwhile, and what it would require.

Why it might be worth doing

  • Reduce the core team's development/maintenance burden: collapse the two driver integration paths into one, so cross-cutting gateway work is done once instead of twice.
  • Align with the uniform-driver requirement of #1051 (OpenShell Drivers): third-party drivers must be out-of-process, so out-of-process is already the third-party model; moving the built-ins onto it gives one model for first and third parties.

Explicitly not a reason: deployment footprint (binary size / supply-chain). That is #1943's lane (conditional compilation), achievable at compile time without any of this.

What we already know (context, not conclusions)

  • VM is the precedent: in-tree, out-of-process, gateway-launched, and needs no gateway listeners — so "out-of-process" does not imply "untrusted" or "needs special networking."
  • #1951 (refactor: make sandbox readiness gateway-owned across compute drivers) and #1952 (refactor: align driver ownership of gateway callback listeners) are prerequisites, not just siblings: a driver cannot be cleanly externalized while it depends on in-process access to gateway state. Each removes exactly such a coupling (readiness and listeners, respectively).
  • Constraints already surfaced: the driver transport is UDS-only today (no networked transport for a different-host driver); Docker provisioning uses host bind mounts (same-host); Kubernetes is the existing cross-host story.

Questions to investigate

Out of scope for this spike

Concrete solution design (how callback reachability is implemented, and any specific mechanism) is deliberately not proposed here. That belongs to a design RFC after this spike concludes the migration is worth pursuing. (An earlier, more detailed exploration — including candidate mechanisms — is preserved separately in docker-external-driver-design-exploration.md for reference; it is not part of this spike.)

Related

Labels: area:gateway, spike