dist: skip stale cert when rebuilding scheduler HTTP client #2707

Open
timn-nexthop wants to merge 1 commit into mozilla:main from nexthop-ai:upstream/cert-fix

@timn-nexthop

When a build server registers with a new certificate, the scheduler rebuilds its outbound reqwest client with the new cert plus every other cert it knows about. The loop over `certs.values()` ran before `certs.insert(...)` overwrote the map entry, so it still saw the stale cert for the same `server_id`. As a result, both the old and new self-signed certs for that server were installed as trust anchors in the rebuilt client.

Each build server's cert is self-signed, so the old and new certs share a Subject DN. TLS validators index trust anchors by Subject; with two anchors having the same name, path building can pick the stale one, fail signature verification against its public key, and reject the handshake. The result is that cert rotation on a build server deterministically breaks the scheduler's ability to talk to it until the scheduler restarts.

Skip the entry matching `server_id` when iterating existing certs so only the up-to-date cert for that server ends up as a trust anchor.
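To illustrate the shape of the change, here is a minimal standalone sketch. It is not the actual patch: `ServerId`, `CertPem`, and `trust_anchors` are hypothetical stand-ins, and the real code feeds the certs into a reqwest client builder rather than returning a `Vec`.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the scheduler's real types and names may differ.
type ServerId = String;
type CertPem = Vec<u8>;

/// Collect the certs to install as trust anchors when `server_id`
/// registers with `new_cert`. This runs before the map entry for
/// `server_id` is overwritten, so the map may still hold a stale cert.
fn trust_anchors(
    certs: &HashMap<ServerId, CertPem>,
    server_id: &ServerId,
    new_cert: &CertPem,
) -> Vec<CertPem> {
    // The freshly registered cert is always trusted.
    let mut anchors = vec![new_cert.clone()];
    for (id, cert) in certs {
        // Skip the stale entry for the server that is re-registering,
        // so only one anchor exists for its Subject DN.
        if id == server_id {
            continue;
        }
        anchors.push(cert.clone());
    }
    anchors
}
```

In this sketch the caller would compute the anchors first and only then call `certs.insert(server_id, new_cert)`, matching the ordering described above. Filtering inside the loop keeps that ordering intact while still excluding the stale cert.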

We've been running with this change since mid-December and it's fixed this problem for us.
