
Implement a fallback storage configuration #335

Open

vtsao-openai wants to merge 1 commit into buildbarn:main from vtsao-openai:dev/vtsao/storage-fallback-backend

Implement a fallback storage configuration#335
vtsao-openai wants to merge 1 commit into
buildbarn:mainfrom
vtsao-openai:dev/vtsao/storage-fallback-backend

Conversation

@vtsao-openai

This configuration lets you specify a primary and a secondary storage backend; if the primary goes down, reads and writes go to the secondary.

Writes to the primary are asynchronously, best-effort replicated to the secondary. Writes to the secondary are not replicated back to the primary. This backend provides more availability at the cost of consistency, although most of the time we don't expect the secondary to actually be used.
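To make the semantics concrete, here is a minimal sketch of the read and write paths in Go. It uses a simplified, hypothetical `BlobStore` interface and a `FallbackBlobAccess` type that are illustrative only, not Buildbarn's actual blobstore interfaces or the code in this PR: reads try the primary and fall back to the secondary, writes prefer the primary, and successful primary writes are replicated to the secondary asynchronously on a best-effort basis.

```go
package fallback

import (
	"context"
	"log"
)

// BlobStore is a simplified, hypothetical stand-in for a storage backend;
// the real Buildbarn blobstore interfaces differ.
type BlobStore interface {
	Get(ctx context.Context, digest string) ([]byte, error)
	Put(ctx context.Context, digest string, data []byte) error
}

// FallbackBlobAccess prefers the primary backend and falls back to the
// secondary when the primary is unavailable.
type FallbackBlobAccess struct {
	primary   BlobStore
	secondary BlobStore
}

// Get reads from the primary; on error it retries against the secondary.
func (f *FallbackBlobAccess) Get(ctx context.Context, digest string) ([]byte, error) {
	if data, err := f.primary.Get(ctx, digest); err == nil {
		return data, nil
	}
	return f.secondary.Get(ctx, digest)
}

// Put writes to the primary and, on success, replicates the blob to the
// secondary asynchronously, best effort. If the primary is down, the write
// goes to the secondary only and is not replicated back to the primary.
func (f *FallbackBlobAccess) Put(ctx context.Context, digest string, data []byte) error {
	if err := f.primary.Put(ctx, digest, data); err != nil {
		return f.secondary.Put(ctx, digest, data)
	}
	go func() {
		// Best effort: replication failures are logged and otherwise ignored.
		if err := f.secondary.Put(context.Background(), digest, data); err != nil {
			log.Printf("best-effort replication to secondary failed: %v", err)
		}
	}()
	return nil
}
```

A consequence of the best-effort replication is the consistency caveat discussed below: a blob written while the secondary is down will be missing from the secondary if the primary later becomes unavailable.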

The main use case of this backend, at least for us, is being able to update our backend storage deployment without downtime. We currently run sharded backend storage on k8s via a StatefulSet, so any deployment currently causes downtime as k8s has to take down a pod (storage shard) to update it. This also helps in general if k8s decides to restart a pod for any reason.

@vtsao-openai force-pushed the dev/vtsao/storage-fallback-backend branch from bd91eb3 to 2618d5f on May 3, 2026 at 23:47
Regenerate proto bindings.

Co-authored-by: Codex <noreply@openai.com>
@vtsao-openai force-pushed the dev/vtsao/storage-fallback-backend branch from 2618d5f to 8849476 on May 3, 2026 at 23:57
@EdSchouten
Member

> The main use case of this backend, at least for us, is being able to update our backend storage deployment without downtime. We currently run sharded backend storage on k8s via a StatefulSet, so any deployment currently causes downtime as k8s has to take down a pod (storage shard) to update it. This also helps in general if k8s decides to restart a pod for any reason.

Out of curiosity, why are you doing this? Why not run bb-storage as a simple deployment? That way Kubernetes is capable of spinning up replacements before shutting down the old pod.

@vtsao-openai
Author

> The main use case of this backend, at least for us, is being able to update our backend storage deployment without downtime. We currently run sharded backend storage on k8s via a StatefulSet, so any deployment currently causes downtime as k8s has to take down a pod (storage shard) to update it. This also helps in general if k8s decides to restart a pod for any reason.
>
> Out of curiosity, why are you doing this? Why not run bb-storage as a simple deployment? That way Kubernetes is capable of spinning up replacements before shutting down the old pod.

Hey @EdSchouten, it's because we're using PVs for our disks, since ephemeral storage isn't enough for us. So I think we'd run into the same issue whether we're using Deployments or StatefulSets: even with a Deployment, the PVC can only be mounted to a single pod, so at some point we have to switch the PVC mount over to the new pod, which results in downtime. And the disk types we require (for latency reasons) are all ReadWriteOnce, so only one pod can mount them at a time.

So I'm happy to be wrong, but I'm not sure this can be solved purely with k8s. I think we need this kind of fallback mechanism natively in Buildbarn.

@artyrian

artyrian commented May 6, 2026

We have a similar Buildbarn setup with multiple shards (and also tried additional replicas per shard), and it doesn't provide true HA in k8s terms.
I also reviewed the ADR (https://github.com/buildbarn/bb-adrs/blob/main/0002-storage.md#adding-fault-tolerance), but didn't find a straightforward way to achieve fast shard failover with PVCs without downtime.

@vtsao-openai
Author

Yeah, the mirrored backend does not actually provide HA; I think the proto comments even document that it does not.

@EdSchouten another benefit of this fallback approach is that it isn't just for deployments: if a storage shard actually goes down for whatever reason, builds will not fail. Yes, it's at the cost of consistency, which in our case is probably fine; it should be no different than if the digests just didn't exist in the cache in the first place.

Also, this allows people not using k8s to achieve more availability if they want to use this backend.

@moroten
Contributor

moroten commented May 6, 2026

> @EdSchouten another benefit of this fallback approach is that it isn't just for deployments: if a storage shard actually goes down for whatever reason, builds will not fail. Yes, it's at the cost of consistency, which in our case is probably fine; it should be no different than if the digests just didn't exist in the cache in the first place.

Consider the case where Bazel asks the remote cluster to execute an action. The output is stored in mirror A because B is down. Five minutes later, Bazel wants to use the output, but A is down and B doesn't have it. The difference from a blob that was missing from the start is that in that case Bazel also knows it is missing; here, Bazel should be able to assume that the blob still exists, since it did five minutes ago.

@artyrian

artyrian commented May 7, 2026

> The output is stored in mirror A because B is down. Five minutes later, Bazel wants to use the output, but A is down and B doesn't have it. The difference from a blob that was missing from the start is that in that case Bazel also knows it is missing.

Isn't this inconsistency equivalent to a normal cache eviction? If a blob gets evicted between the time Bazel stores it and the time it tries to reuse it, the action cache would similarly point to a missing CAS entry. Bazel handles this by re-executing, so why is the fallback case worse? Or is the concern about Bazel clients that don't gracefully handle an AC hit with a CAS miss and fail hard?
