
byoc: per-job balance keying — design discussion #3914

Draft
rickstaa wants to merge 1 commit into byoc-offchain-no-ethfrom from byoc-per-job-balance

@rickstaa

Member

Context

While reviewing #3869, the BYOC payment accounting model surfaced a structural concern: balances are keyed by (sender_address, capability_id) rather than per-job/per-stream. Effects:

  • Multiple concurrent jobs from the same sender for the same capability share one balance pool.
  • The orchestrator can't manage admission for individual jobs based on payment state — only aggregate sender balance for that capability.
  • Remote signers (clearinghouse signing on behalf of multiple downstream customers) collapse all customer balances into one bucket, so per-customer billing has to be reconstructed out-of-band.

LiveAI explicitly works around this — server/ai_process.go calls clearSessionBalance(sess, RandomManifestID()) per stream to force per-session isolation in the existing AddressBalances map. BYOC inherited the per-capability pattern without that band-aid.
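To make the pooling effect concrete, here is a minimal sketch of the two keying schemes. The `balanceKey` and `ledger` types are illustrative stand-ins, not the actual AddressBalances implementation, and amounts are plain `int64` for brevity.

```go
package main

import "fmt"

// balanceKey mirrors the (sender, ManifestID) pair that indexes a balance.
// Hypothetical type: the real code keeps balances in AddressBalances.
type balanceKey struct {
	Sender     string
	ManifestID string
}

// ledger is a toy balance map used only for this illustration.
type ledger map[balanceKey]int64

// credit adds amount to the bucket identified by (sender, manifestID).
func (l ledger) credit(sender, manifestID string, amount int64) {
	l[balanceKey{sender, manifestID}] += amount
}

// balance returns the bucket's balance, or 0 if the bucket does not exist.
func (l ledger) balance(sender, manifestID string) int64 {
	return l[balanceKey{sender, manifestID}]
}

func main() {
	// Capability keying: payments for two concurrent jobs share one bucket.
	byCap := ledger{}
	byCap.credit("0xsender", "text-to-image", 100) // paid for job A
	byCap.credit("0xsender", "text-to-image", 50)  // paid for job B
	fmt.Println(byCap.balance("0xsender", "text-to-image")) // 150

	// Job keying: each job has its own bucket, so admission can be per job.
	byJob := ledger{}
	byJob.credit("0xsender", "job-a", 100)
	byJob.credit("0xsender", "job-b", 50)
	fmt.Println(byJob.balance("0xsender", "job-a"), byJob.balance("0xsender", "job-b")) // 100 50
}
```

With capability keying the orchestrator only ever sees 150 and cannot tell which job is underfunded; with job keying it can.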

What this PR does

Switches the ManifestID used for balance keying from capability to JobRequest.ID across the BYOC orch + gateway:

Orchestrator

  • processJob, monitorOrchStream, ProcessStreamPayment, confirmPayment, processPayment, chargeForCompute, getPaymentBalance all key Balance/DebitFees/ProcessPayment on jobReq.ID.
  • confirmPayment/processPayment take both jobID (balance) and capability (capacity-slot management) since the two concerns are now distinct.
  • GetJobToken accepts an optional Livepeer-Job-Id header. Initial token requests (no job context) return balance=0; refresh calls for an existing job/stream pass the ID and get the per-job balance.

Gateway

  • createPayment, updateGatewayBalance, getToken thread jobID.
  • sendPaymentForStream sets req.ID = streamID so the orch matches per-stream payments.
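On the gateway side, the two bullets above could look roughly like the sketch below. The `JobRequest` struct is a trimmed stand-in, `newTokenRequest` and `forStream` are hypothetical helpers, and the URL and the use of GET are assumptions made only for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

const jobIDHeader = "Livepeer-Job-Id"

// JobRequest is a trimmed, hypothetical stand-in for the real request type.
type JobRequest struct {
	ID         string
	Capability string
}

// newTokenRequest builds a token request and forwards jobID only when set.
// The HTTP method and URL shape are assumptions, not the actual endpoint.
func newTokenRequest(tokenURL, jobID string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, tokenURL, nil)
	if err != nil {
		return nil, err
	}
	if jobID != "" {
		req.Header.Set(jobIDHeader, jobID)
	}
	return req, nil
}

// forStream returns a copy of req whose ID is the stream ID, so the
// orchestrator credits the payment to the per-stream bucket.
func forStream(req JobRequest, streamID string) JobRequest {
	req.ID = streamID
	return req
}

func main() {
	refresh, err := newTokenRequest("http://orch.example/token", "stream-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(refresh.Header.Get(jobIDHeader)) // stream-42

	pay := forStream(JobRequest{ID: "req-1", Capability: "text-to-image"}, "stream-42")
	fmt.Println(pay.ID, pay.Capability) // stream-42 text-to-image
}
```

Capability stays on the request untouched, which matches the split between balance keying and capacity management.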

Capacity-management call sites (FreeExternalCapabilityCapacity, RemoveExternalCapability, worker routing) still use Capability — those are scheduling, not money.

Stacked on #3906 to keep the diff scoped to the refactor.

The genuine question for the author

@eliteprox — before going further, I'd like to understand the reasoning behind sender+capability keying. Reasons it might have been intentional:

  1. PM session alignment. One pm.Sender.StartSession() produces tickets for many jobs. Capability-keyed balance lets one PM session feed one balance bucket cleanly. Per-job keying creates a many-to-one routing question on ticket payout that this PR doesn't yet solve.
  2. Prepaid-credit mental model. "User buys $X of compute for capability Y, spends across many requests" maps cleanly to capability balance.
  3. First-token discovery info. The capability balance returned in the initial token response was a useful signal for the gateway. Per-job balance is always 0 there.
  4. Bounded key cardinality. (senders × capabilities) is bounded. (senders × job_ids) grows unbounded — every completed job leaves a dust balance that needs GC.
  5. Residual carry-over. User does job A, gateway crashes mid-flow, user submits job B with same capability — capability-keyed model preserves leftover credit. Per-job keying strands it.

If any of these were primary drivers, this refactor needs to address them rather than ignore them. If the original choice was inherited from the pre-existing AI-jobs pattern without explicit consideration of streams or remote signers, the trade-offs above are mostly recoverable.
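As one illustration of how points 4 and 5 might be recovered (an assumption for discussion, not something this PR implements): on job completion the orchestrator could delete the per-job key and sweep any residual into a per-(sender, capability) carry-over bucket. All types and names below are hypothetical.

```go
package main

import "fmt"

// jobKey and capKey are hypothetical key types for this sketch.
type jobKey struct{ Sender, JobID string }
type capKey struct{ Sender, Capability string }

// ledger holds live per-job balances plus leftover credit per capability.
type ledger struct {
	jobs      map[jobKey]int64
	carryOver map[capKey]int64
}

func newLedger() *ledger {
	return &ledger{jobs: map[jobKey]int64{}, carryOver: map[capKey]int64{}}
}

// closeJob removes the job's bucket (bounding key cardinality) and moves
// any positive residual into the sender's carry-over for that capability.
func (l *ledger) closeJob(sender, jobID, capability string) int64 {
	k := jobKey{sender, jobID}
	residual := l.jobs[k]
	delete(l.jobs, k)
	if residual > 0 {
		l.carryOver[capKey{sender, capability}] += residual
	}
	return residual
}

func main() {
	l := newLedger()
	l.jobs[jobKey{"0xsender", "job-a"}] = 7 // dust left after job A

	fmt.Println(l.closeJob("0xsender", "job-a", "text-to-image")) // 7
	fmt.Println(len(l.jobs))                                      // 0
	fmt.Println(l.carryOver[capKey{"0xsender", "text-to-image"}]) // 7
}
```

Whether carry-over credit should be spendable by the next job, and how that interacts with remote signers, is exactly the kind of question this discussion is meant to settle.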

Out of scope (deferred)

  • Per-job PM session reset. LiveAI calls clearSessionBalance per stream to avoid ticket nonce reuse. BYOC still calls Sender.StartSession() once per createPayment. If we go per-job balance, per-job PM sessions likely need to follow.
  • Test suite update. GetJobToken now correctly returns balance=0 when no jobID is provided, so TestGetJobToken_Success fails: it still asserts the legacy value of 1000. Other tests in payment_test.go / job_orchestrator_test.go / stream_test.go still assume capability-keyed balance. Left as-is so the diff is reviewable as a design question first.
  • Backward compat. No wire-format change, but old gateways/orchs talking to new ones will create capability-keyed balances the new code never reads. Still to decide: require a coordinated deploy, or fall through to the capability-keyed balance during a deprecation window.
  • Dust-balance GC. Per-job keying leaves orphan balances after job completion.
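For the backward-compat item, a fall-through read during a deprecation window might look like the sketch below. This is only one possible shape, with a hypothetical `effectiveBalance` helper and a simplified store keyed by (sender, ManifestID).

```go
package main

import "fmt"

// key is a hypothetical (sender, ManifestID) balance key.
type key struct{ Sender, ManifestID string }

// effectiveBalance prefers the per-job bucket and falls back to the legacy
// capability-keyed bucket only when no per-job bucket exists at all.
func effectiveBalance(balances map[key]int64, sender, jobID, capability string) int64 {
	if b, ok := balances[key{sender, jobID}]; ok {
		return b
	}
	// Legacy peer: the payment was credited under the capability name.
	return balances[key{sender, capability}]
}

func main() {
	balances := map[key]int64{
		{"0xold", "text-to-image"}: 500, // credited by an old gateway
		{"0xnew", "job-a"}:         120, // credited by a new gateway
	}
	fmt.Println(effectiveBalance(balances, "0xold", "job-x", "text-to-image")) // 500
	fmt.Println(effectiveBalance(balances, "0xnew", "job-a", "text-to-image")) // 120
}
```

Note that a per-job bucket that exists with a zero balance does not fall through here, which keeps an exhausted job from drawing on legacy credit.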

Status

Draft. Builds clean (go build ./...). One test fails intentionally to document the behavior change. Not ready to merge — raising for design discussion before investing in the test pass and PM session refactor.

BYOC payment accounting was previously keyed by (sender_address,
capability_id), aggregating all jobs from a sender into a single
balance pool per capability. This makes per-job admission control
impossible and forces remote signers to do their own per-customer
accounting on top.

Switch the ManifestID used for balance keying from capability to
JobRequest.ID across the BYOC orchestrator and gateway. Each job (and
each stream) now gets its own isolated balance bucket, mirroring how
live-video-to-video uses RandomManifestID per stream
(server/ai_process.go clearSessionBalance).

Orchestrator:
- processJob, monitorOrchStream, ProcessStreamPayment, confirmPayment,
  processPayment, chargeForCompute, getPaymentBalance now key on
  jobReq.ID for Balance/DebitFees/ProcessPayment calls.
- confirmPayment/processPayment take both jobID (balance) and capability
  (capacity-slot management), which are now distinct concerns.
- GetJobToken accepts an optional Livepeer-Job-Id header. When present
  it returns the per-job balance for that ID; otherwise 0 (initial
  token requests have no job context yet).

Gateway:
- createPayment and updateGatewayBalance key local balance on jobReq.ID.
- getToken takes an optional jobID and forwards it as Livepeer-Job-Id
  so refresh/payment-renewal calls fetch the right per-job balance.
- sendPaymentForStream sets req.ID = streamID so the orch matches the
  payment to the correct per-stream balance.
- Capacity-management call sites still use Capability.

Behavior change: TestGetJobToken_Success now fails because it asserts
the legacy capability-balance return value. The failure reflects the
intended new contract (no balance returned without job context). The
test needs to be updated to pass Livepeer-Job-Id and assert the per-job
balance; that is left as a follow-up for the test pass.

Out of scope (separate changes):
- Per-job PM session reset (LiveAI calls clearSessionBalance per
  stream; BYOC still uses one Sender.StartSession per payment).
- Backward-compat strategy for old gateways/orchestrators.
- Test suite update across payment_test, job_orchestrator_test,
  stream_test, job_gateway_test.
- Dust-balance GC for completed jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

go Pull requests that update Go code

Projects

No open projects
Status: Triage
