byoc: per-job balance keying — design discussion #3914
Draft
rickstaa wants to merge 1 commit into
BYOC payment accounting was previously keyed by (sender_address, capability_id), aggregating all jobs from a sender into a single balance pool per capability. This makes per-job admission control impossible and forces remote signers to do their own per-customer accounting on top.

Switch the ManifestID used for balance keying from capability to JobRequest.ID across the BYOC orchestrator and gateway. Each job (and each stream) now gets its own isolated balance bucket, mirroring how live-video-to-video uses RandomManifestID per stream (server/ai_process.go clearSessionBalance).

Orchestrator:
- processJob, monitorOrchStream, ProcessStreamPayment, confirmPayment, processPayment, chargeForCompute, getPaymentBalance now key on jobReq.ID for Balance/DebitFees/ProcessPayment calls.
- confirmPayment/processPayment take both jobID (balance) and capability (capacity-slot management), which are now distinct concerns.
- GetJobToken accepts an optional Livepeer-Job-Id header. When present it returns the per-job balance for that ID; otherwise 0 (initial token requests have no job context yet).

Gateway:
- createPayment and updateGatewayBalance key local balance on jobReq.ID.
- getToken takes an optional jobID and forwards it as Livepeer-Job-Id so refresh/payment-renewal calls fetch the right per-job balance.
- sendPaymentForStream sets req.ID = streamID so the orch matches the payment to the correct per-stream balance.
- Capacity-management call sites still use Capability.

Behavior change: TestGetJobToken_Success now fails because it asserts the legacy capability-balance return value. The failure reflects the intended new contract (no balance returned without job context). The test needs to be updated to pass Livepeer-Job-Id and assert the per-job balance; that is left as a follow-up for the test pass.

Out of scope (separate changes):
- Per-job PM session reset (LiveAI calls clearSessionBalance per stream; BYOC still uses one Sender.StartSession per payment).
- Backward-compat strategy for old gateways/orchestrators.
- Test suite update across payment_test, job_orchestrator_test, stream_test, job_gateway_test.
- Dust-balance GC for completed jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context
While reviewing #3869, the BYOC payment accounting model surfaced a structural concern: balances are keyed by `(sender_address, capability_id)` rather than per-job/per-stream. Effects, per the commit message:
- Per-job admission control is impossible.
- Remote signers have to do their own per-customer accounting on top.

LiveAI explicitly works around this: `server/ai_process.go` calls `clearSessionBalance(sess, RandomManifestID())` per stream to force per-session isolation in the existing `AddressBalances` map. BYOC inherited the per-capability pattern without that band-aid.

What this PR does
Switches the `ManifestID` used for balance keying from `capability` to `JobRequest.ID` across the BYOC orch + gateway:

Orchestrator
- `processJob`, `monitorOrchStream`, `ProcessStreamPayment`, `confirmPayment`, `processPayment`, `chargeForCompute`, `getPaymentBalance` all key Balance/DebitFees/ProcessPayment on `jobReq.ID`.
- `confirmPayment`/`processPayment` take both `jobID` (balance) and `capability` (capacity-slot management) since the two concerns are now distinct.
- `GetJobToken` accepts an optional `Livepeer-Job-Id` header. Initial token requests (no job context) return balance=0; refresh calls for an existing job/stream pass the ID and get the per-job balance.

Gateway
- `createPayment`, `updateGatewayBalance`, `getToken` thread `jobID`.
- `sendPaymentForStream` sets `req.ID = streamID` so the orch matches per-stream payments.

Capacity-management call sites (`FreeExternalCapabilityCapacity`, `RemoveExternalCapability`, worker routing) still use `Capability`: those are scheduling, not money.

Stacked on #3906 to keep the diff scoped to the refactor.
The genuine question for the author
@eliteprox, before going further, I'd like to understand the reasoning behind sender+capability keying. Reasons it might have been intentional:
- `pm.Sender.StartSession()` produces tickets for many jobs. Capability-keyed balance lets one PM session feed one balance bucket cleanly. Per-job keying creates a many-to-one routing question on ticket payout that this PR doesn't yet solve.
- `(senders × capabilities)` is bounded. `(senders × job_ids)` grows unbounded: every completed job leaves a dust balance that needs GC.

If any of these were primary drivers, this refactor needs to address them rather than ignore them. If the original choice was inherited from the pre-existing AI-jobs pattern without explicit consideration of streams or remote signers, the trade-offs above are mostly recoverable.
Out of scope (deferred)
- LiveAI calls `clearSessionBalance` per stream to avoid ticket nonce reuse. BYOC still calls `Sender.StartSession()` once per `createPayment`. If we go per-job balance, per-job PM sessions likely need to follow.
- `GetJobToken` in `TestGetJobToken_Success` now correctly returns 0 (no jobID provided) while the test still asserts the legacy 1000. Other tests in `payment_test.go` / `job_orchestrator_test.go` / `stream_test.go` still assume capability-keyed balance. Left so the diff is reviewable as a design question first.

Status
Draft. Builds clean (`go build ./...`). One test fails intentionally to document the behavior change. Not ready to merge: raising for design discussion before investing in the test pass and PM session refactor.