From 493f2bfdcacbcb934858b519556073c8b8ef9951 Mon Sep 17 00:00:00 2001 From: Greg Shear Date: Wed, 15 Apr 2026 17:47:28 -0400 Subject: [PATCH] RFCs --- plans/api-deprecation.md | 71 +++++ plans/orthogonal-authz.md | 59 ++++ plans/service-accounts.md | 556 ++++++++++++++++++++++++++++++++++++++ plans/support-role.md | 293 ++++++++++++++++++++ plans/user-management.md | 201 ++++++++++++++ 5 files changed, 1180 insertions(+) create mode 100644 plans/api-deprecation.md create mode 100644 plans/orthogonal-authz.md create mode 100644 plans/service-accounts.md create mode 100644 plans/support-role.md create mode 100644 plans/user-management.md diff --git a/plans/api-deprecation.md b/plans/api-deprecation.md new file mode 100644 index 00000000000..05bf92e7d1a --- /dev/null +++ b/plans/api-deprecation.md @@ -0,0 +1,71 @@ +# API Deprecation Lifecycle + +## Executive Summary + +Estuary maintains an evolving product API, but today we have no mechanism to retire an endpoint once it's in use. The immediate motivator is the user-management migration from PostgREST to GraphQL — flowctl users are still hitting the old PostgREST endpoints, and we have no systematic way to detect that or steer them to the replacement. + +Supabase logs show who's calling what, but only with seven days of retention — not long enough to track adoption of a replacement endpoint. Communicating deprecation to customers is either a mass email or relies on institutional knowledge of which customers happen to be using which APIs. + +This plan establishes a general-purpose deprecation lifecycle for the control-plane API. The challenge: while we control the dashboard UI and can migrate it to new endpoints on our own schedule, flowctl is installed on customer machines and older versions will continue to call deprecated endpoints indefinitely unless we give ourselves a way to see them and reach their operators. 
+ +- Engineering gets visibility into which tenants (and which flowctl versions) are calling a given endpoint - once request volume drops below some acceptable threshold, we can remove the endpoint. +- Deprecated endpoints announce themselves via standard `Deprecation`/`Sunset` response headers. +- flowctl surfaces those headers noisily, printing a stderr warning on every response from a deprecated endpoint so the signal reaches the operator running the command or reading CI logs. +- Affected customers get targeted outreach — automated, periodic email alerts with increasing frequency as the sunset date approaches — specific tenants still calling a deprecated endpoint hear about it directly. + +At current scale, flowctl adoption is small enough that watching call volume in Loki and reaching out to affected customers is the primary enforcement mechanism. We'll hold off on Sunset headers and actual endpoint removal until the warning (P1) and alerting (P3) machinery is live and broadly adopted. Until then, deprecation headers plus human support follow-up is sufficient. + +## Technical Notes + +### Signaling deprecation to API consumers + +Both PostgREST and GraphQL endpoints return standard `Deprecation` and `Sunset` headers. GraphQL additionally marks deprecated operations and fields in the schema itself, so schema-aware clients get the signal through introspection as well. Successor information (e.g. "use the `listConnectors` GraphQL operation instead") is stored in the deprecation table and surfaced in flowctl warnings and alert emails rather than via a `Link` header — GraphQL operations don't have their own URLs, so a link isn't meaningful. + +### PostgREST deprecation headers set via pre-request function + +PostgREST supports a `db-pre-request` configuration — a Postgres function that runs before every request and can set response headers via `set_config('response.headers', ...)`. We use this to inject deprecation headers. 
+ +The deprecation metadata lives in a `deprecated_endpoints` table — endpoint path, deprecation date, optional sunset date, and a human-readable successor description (e.g. "use the `listConnectors` GraphQL operation"). The pre-request function looks up `current_setting('request.path')` against this table and sets `Deprecation` and (if present) `Sunset` headers. The same table serves as the source of truth for alert emails to communicate successor info to users. + +## Open Questions + +- **Pre-request table lookup performance.** The `deprecated_endpoints` table is the single source of truth for deprecation metadata — used by the pre-request header injection, alert emails, and potentially a GraphQL query for flowctl to enrich deprecation warnings. But the pre-request function runs on every PostgREST request, so we need to verify the per-request cost of the table lookup is negligible (the table will be tiny and should stay in the buffer cache, but we should confirm this). + +## Phases + +### P1: flowctl deprecation warnings + +flowctl learns to inspect responses from the control-plane API for `Deprecation` and `Sunset` headers and prints a human-readable warning on stderr, once per invocation, including the sunset date and successor information when present. We aren't setting either of these headers yet. + +The warning message distinguishes between two contexts. When the deprecated call originates from a built-in flowctl subcommand, the warning tells the user to update flowctl — the newer version already uses the successor endpoint. When it originates from a user-defined raw API call, the warning names the deprecated endpoint and its sunset date if known. Successor information (which endpoint or operation to use instead) becomes available once the deprecation table exists in P2 — flowctl can query it to enrich the warning. 
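The warning logic described for P1 can be sketched roughly as follows. This is a Python illustration only (flowctl itself is written in Rust), and the function name, the `WarnOnce` helper, and the message wording are all hypothetical. It assumes `Sunset` carries an HTTP-date and treats the mere presence of `Deprecation` as the signal.

```python
import sys
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def deprecation_warning(
    headers: Mapping[str, str],
    endpoint: str,
    from_builtin_subcommand: bool,
) -> Optional[str]:
    """Return a warning for a deprecated endpoint, or None if not deprecated."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "deprecation" not in lowered:
        return None

    sunset = ""
    if "sunset" in lowered:
        # Assumption: Sunset is an HTTP-date such as "Wed, 01 Jul 2026 00:00:00 GMT".
        try:
            when = parsedate_to_datetime(lowered["sunset"])
            sunset = f" It is scheduled for removal on {when.date().isoformat()}."
        except (TypeError, ValueError):
            sunset = ""  # An unparseable date is dropped rather than guessed.

    if from_builtin_subcommand:
        # A newer flowctl already calls the successor endpoint.
        return f"warning: this command uses a deprecated API.{sunset} Please update flowctl."
    # Raw API call: name the endpoint so the user can change their own request.
    return f"warning: {endpoint} is deprecated.{sunset}"


class WarnOnce:
    """Prints at most one deprecation warning per invocation."""

    def __init__(self) -> None:
        self._warned = False

    def emit(self, message: Optional[str], stream=sys.stderr) -> bool:
        if message is None or self._warned:
            return False
        print(message, file=stream)
        self._warned = True
        return True
```

Keeping the message construction separate from the once-per-invocation guard means successor text from the P2 table could later be appended without touching the deduplication.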
+ +This phase also fixes a bug: flowctl already constructs a `flowctl-` User-Agent and applies it to its agent-API HTTP client, but the PostgREST client never receives the header. As a result, every PostgREST call from flowctl currently arrives at the server with an empty UA. + +### P2: PostgREST deprecation signaling + +Build the `deprecated_endpoints` table and the PostgREST pre-request function that injects `Deprecation` (and eventually `Sunset`) headers based on it. Then use it to deprecate our first endpoints — likely `user_grants` and `role_grants` once the GraphQL operations that replace them ship as part of the user-management migration. An endpoint must not be marked deprecated until flowctl's own subcommands have migrated to the successor — otherwise the "update flowctl" advice in the deprecation warning would be wrong. After this phase we can actually begin deprecating PostgREST endpoints: customers running an updated flowctl see warnings (from P1), engineering uses Loki to see who's still calling a given endpoint, and we do manual customer outreach based on that visibility. + +This LogQL query shows who's calling specific endpoints, filtering out dashboard and Supabase JS traffic to isolate programmatic callers. Once the P1 UA fix has propagated, we can filter on `user_agent` directly instead of excluding known non-flowctl callers by referer and client info. 
+ +```logql +{service="edge_logs"} + | metadata_request_path =~ "/rest/v1/(user_grants|role_grants).*" + | metadata_request_method != "OPTIONS" + | metadata_request_headers_x_client_info !~ "supabase-js-web/.*" + | metadata_request_headers_referer !~ "https://dashboard\\.estuary\\.dev.*" + | line_format "{{.metadata_request_method}} {{.metadata_request_path}} {{.metadata_response_status_code}} sub={{.metadata_request_sb_jwt_authorization_payload_subject}} ua={{.metadata_request_headers_user_agent}}" +``` + +### P3: Automated customer email alerts + +_Speculative — details will firm up once P2 is in use ... and we have enough customers using flowctl to justify._ A new alert type on the existing alerting infrastructure sends periodic email alerts to tenants still calling deprecated endpoints. Alerts only fire once a sunset date is set — no sunset, no emails. As the sunset date approaches, alert frequency increases: roughly weekly at first, then every few days, then daily as the deadline nears. + +## Phase Dependencies + +```mermaid +graph TD + P1[flowctl deprecation warnings, fix missing UA header] + P2[Send PostgREST deprecation headers] + P3[Automated customer email alerts] + P1 --> P2 --> P3 +``` diff --git a/plans/orthogonal-authz.md b/plans/orthogonal-authz.md new file mode 100644 index 00000000000..c06f13affad --- /dev/null +++ b/plans/orthogonal-authz.md @@ -0,0 +1,59 @@ +# Orthogonal Authz + +## Executive Summary + +Estuary's access control today is a tiered role model with only two tiers in practice: `read` for looking at data, and `admin` for everything else. That makes `admin` badly overloaded: platform engineers receive billing email alerts meant for finance, and the finance team has access to take down a production system. 
+ +This plan refactors the role hierarchy into fine-grained, independent capabilities — most immediately to support dedicated **billing** and **user management** capabilities, so customers can delegate those responsibilities without handing out platform admin. + +## Technical Notes + +- **Capabilities are a flat set, not a hierarchy.** The five capabilities — `read`, `write`, `admin`, `billing`, `user_management` — don't imply each other. An admin grant does not grant `billing`, and (once the migration completes) does not grant `write` either; each capability is listed explicitly. This is the whole point of the refactor, and has downstream consequences — most notably for publish-target checks, see Phases below. + + Once capabilities are orthogonal, the names `write` and `admin` start to feel vague — they were meaningful as tiers but don't describe a specific power on their own. A later migration phase renames and/or splits them (e.g. `write → publish`, or separating task control from catalog edits) once the shape and PostgREST retirement allow it. + +- **Capabilities inherit down the prefix tree.** A grant at `acmeCo/` applies to every descendant prefix — `acmeCo/sales/`, `acmeCo/sales/leads/`, and so on. A user's effective capabilities on a given prefix are the union of every grant at that prefix or any ancestor. This is how scoping already works for `read`/`write`/`admin`, and the new capabilities inherit the same way. + + > The `billing` capability only makes sense at the root prefix; granting it on a subprefix is inert. The UI can handle this as a special case. + +- **Role grants narrow capabilities, never widen them.** When a user reaches a prefix through a role grant, their effective capabilities are the intersection of what the user has and what the role grant allows. 
Neither side can escalate past the other: + + | Alice's user grant on `acmeCo/` | `acmeCo/` role grant on `partner/shared/` | Alice's effective capabilities on `partner/shared/` | + | ------------------------------- | ----------------------------------------- | --------------------------------------------------- | + | `{read, write, billing}` | `{read, write}` | `{read, write}` — `billing` is filtered out | + | `{read}` | `{read, write}` | `{read}` — the role grant can't add `write` | + +## Open Questions + +1. **Do we need a `traverse` capability to gate role-grant traversal?** + + Today, only users with the `admin` role can traverse role grants at all. A read-only user on `acmeCo/` cannot follow a role grant from `acmeCo/` → `partner/shared/`. + + The role grant rule as stated in Technical Notes would change this. Once capabilities are orthogonal and we drop the `admin`-required gate, any user whose capabilities intersect with a role grant's capabilities can traverse it. That means every existing read-only user would suddenly gain read access to every prefix reachable through existing role grants — a potentially large, silent expansion of access. + + Should we add an explicit `traverse` capability to prevent this? With `traverse`, a user can only follow a role-grant edge if `traverse` appears on their user grant. `traverse` is a gate — it controls whether the user can enter the role grant at all, but it doesn't carry through to the effective capability set: + + | User grant on `acmeCo/` | Role grant `acmeCo/` → `partner/shared/` | Effective capabilities on `partner/shared/` | + |---|---|---| + | `{read, write}` | `{read, write}` | none — no `traverse` on user grant | + | `{read, traverse}` | `{read, write}` | `{read}` — `traverse` lets her in, but `write` is filtered out because it wasn't on the user grant | + + We could backfill and add `traverse` wherever there is already an `admin` grant so as not to change anyone's existing level of access. 
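The intersection rule from Technical Notes and the proposed `traverse` gate can be sketched together. This is a Python illustration only; the function name and the `require_traverse` flag are hypothetical, and the real check would live in the Rust authz layer.

```python
from typing import FrozenSet, Iterable


def effective_via_role_grant(
    user_grant: Iterable[str],
    role_grant: Iterable[str],
    require_traverse: bool = False,
) -> FrozenSet[str]:
    """Capabilities a user holds on a prefix reached through a role grant."""
    user = frozenset(user_grant)
    role = frozenset(role_grant)

    # The gate: without `traverse` on the user grant, the edge cannot be followed.
    if require_traverse and "traverse" not in user:
        return frozenset()

    # Intersection means neither side can escalate past the other.
    # `traverse` only opens the gate and never carries through to the result.
    return (user & role) - {"traverse"}
```

Applied to the tables above: `{read, write, billing}` against `{read, write}` yields `{read, write}`, and with the gate enabled `{read, write}` yields nothing while `{read, traverse}` yields `{read}`.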
+ +## Phases (still in progress) + +We will interleave these phases with other changes (service accounts, better user management, billing features) as needed. + +**Phase 1 — add the array, orthogonal capabilities only.** Introduce `capabilities capability[] NOT NULL DEFAULT '{}'` on `user_grants`. The existing `capability` enum stays authoritative for `read`/`write`/`admin`; the array only carries the new orthogonal capabilities (`billing`, `user_management`). Only the GraphQL/Rust path reads the array. This lets us gate `billing` and `user_management` features immediately without touching existing authz code paths. + +**Phase 2 — dual-write the tiered capabilities into the array.** The array becomes authoritative for the Rust/GraphQL authz layer for all five capabilities; the enum stays authoritative for RLS. A sync trigger keeps them coherent during the Postgrest sunset: + +- _New-path writes_ (GraphQL/Rust) set the array directly and project to the enum: `admin` if the array contains it, else `write`, else `read`. Orthogonal-only grants (e.g. `{billing}`) project to enum `read`, accepting a Postgrest read-leak within the prefix as Postgrest is sunsetting. +- _Legacy-path writes_ (Postgrest/direct SQL) trigger a DB function that expands the enum to its tier capabilities (`admin → {read, write, admin}`, `write → {read, write}`, `read → {read}`) and merges them with any existing orthogonal capabilities on the row. A Postgrest write re-expresses only the tier portion; capabilities like `billing` are left untouched. Postgrest can't remove orthogonal capabilities, which is fine — they're only managed through the new path. +- Add a `capabilities capability[]` column to `role_grants` (same as `user_grants`), backfill from the existing enum, and update role-grant traversal logic to compute intersections against the new array. +- A one-shot backfill populates tier capabilities into the array for all existing rows using the same expansion. 
+- If we decide to add the `traverse` capability, this backfill should also add `traverse` to every existing admin user and role grant, preserving today's behavior where admins can follow role-grant edges. Going forward, `traverse` is auto-bundled whenever an `admin` grant is created — the grant-expansion rule becomes `admin → {read, write, admin, traverse}`. A later phase of the user-management RFC will unbundle `traverse` from `admin` when the UI supports assigning capabilities individually. + +**Phase 3 — cutover.** Once Postgrest retires, drop the enum column on both tables, remove the sync trigger, and remove the projection logic. `CapabilitySet` becomes the only representation. The publish-target check becomes a plain flag-containment test for `write`; admin grants continue to satisfy it because the grant-expansion rule always stores `{read, write, admin}` on admin grants. + +**Phase 4 — rename and split the legacy tier names.** With Postgrest gone and `CapabilitySet` as the sole representation, the `write` and `admin` names can be replaced with capabilities that describe specific powers (e.g. `publish`, `manage`, or finer splits between task control and catalog edits). This is a pure rename/split inside the new model — a migration on `grant_capability` values, updates to the Rust `CapabilitySet` variants, and a sweep of the call sites. Sequenced last because it's disruptive to read without a forcing function, and only makes sense once nothing outside the new model speaks the old names. diff --git a/plans/service-accounts.md b/plans/service-accounts.md new file mode 100644 index 00000000000..46fdfba2c81 --- /dev/null +++ b/plans/service-accounts.md @@ -0,0 +1,556 @@ +# Service Accounts + +## Executive Summary + +CI/CD pipelines, AI agents, and other programmatic integrations need stable credentials +that aren't tied to a human's `user_grants`. 
+ +A **service account** is a non-human identity that is authorized the same way as a human +user — same `user_grants`, same resource access, same `user_roles()` resolution. + +A service account can authenticate in one of two ways: + +- **API key** — a long-lived credential we mint with a user-configurable lifetime + that counts down from creation and does NOT reset on each use. Like a refresh + token, it is exchanged via `generate_access_token` for a short-lived JWT; the + JWT is the bearer token used against PostgREST and the rest of the stack. +- **OIDC** — the service account is configured to trust an external identity + provider (GitHub Actions, for example). The IdP signs a short-lived token + with its private key; we validate it against the issuer's JWKS and exchange + it for a short-lived Estuary access token. No long-lived secrets to manage. + +A service account can have multiple active API keys and OIDC configurations +simultaneously, enabling zero-downtime rotation — mint a new key, update +consumers, then revoke the old one. + +Any admin can create a service account, but can only grant it access to their own +prefix or something more specific — an `acmeCo/` admin can create a service account +scoped to `acmeCo/staging/`, but an `acmeCo/staging/` admin cannot grant access to +`acmeCo/` or `acmeCo/prod/`. + +Both service accounts and their credentials are fully manageable from the dashboard UI +and from `flowctl`, so admins can script provisioning (e.g., bootstrap a CI environment) +or drive everything interactively from the browser. + +## Technical notes + +- **`auth.users` rows.** Service accounts are real Supabase users. All existing RLS policies, + PostgREST authorization, `user_roles()` resolution, and `role_grants` traversal work + unchanged. 
_This avoids putting the PostgREST-to-GraphQL migration on the critical + path to releasing the service account feature._ + +- **`internal.service_accounts` table.** A new table keyed by `auth.users.id` that + holds service-account-specific metadata (owning prefix, display name, created-by, + disabled state). Its presence is also what distinguishes a non-human `auth.users` + row from a real person — code paths that assume "auth.users means a human" + (member lists, onboarding) filter by joining against this table. + +- **API keys get their own `internal.api_keys` table.** + API keys behave much like refresh tokens: long-lived credentials exchanged via + `generate_access_token` for a short-lived access token. An API key is reusable + until it's revoked or expires at a fixed date (no sliding window like a refresh token has). + +- **One `user_grants` row per service account.** Each SA has exactly one grant. + This makes ownership unambiguous: whoever administers that prefix manages + the SA. Disabling the SA deletes the `user_grants` row and revokes all API keys and + OIDC trust policies, so every avenue of access is cut at once. `prefix` and + `capability` columns on `internal.service_accounts` mirror the original grant so this + information survives a deletion (and could be used to re-enable the SA) — the UI + can still show the disabled SA with its intended scope. Access to multiple prefixes, + when needed, is modeled through `role_grants` rather than adding `user_grants`. 
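The scoping rule from the Executive Summary (an admin can only grant at or below their own prefix), which also decides who manages a given service account, can be sketched as a prefix test. The function name is hypothetical, and the sketch assumes every catalog prefix ends with `/`, as in the examples above.

```python
def admin_can_scope(admin_prefix: str, target_prefix: str) -> bool:
    """True if an admin of `admin_prefix` may create or manage a service
    account whose owning prefix is `target_prefix`."""
    # Assumption: prefixes always carry a trailing slash, so a plain string
    # prefix test cannot confuse `acmeCo/` with `acmeCorp/`.
    if not admin_prefix.endswith("/") or not target_prefix.endswith("/"):
        raise ValueError("catalog prefixes are expected to end with '/'")
    return target_prefix.startswith(admin_prefix)
```

Under this sketch an `acmeCo/` admin passes for `acmeCo/staging/`, while an `acmeCo/staging/` admin fails for both `acmeCo/` and `acmeCo/prod/`.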
+ +## Phase dependency graph + +```mermaid +flowchart TD + P1[Backend CRUD & Tokens] + P2[Management UI] + P5[flowctl Auth via API Key] + P3[OIDC Token Exchange] + P4[OIDC Configuration UI] + P6[flowctl Service Account & API Key Commands] + P7[flowctl OIDC Trust Policy Commands] + + P1 --> P2 + P1 --> P3 + P1 --> P5 + P1 --> P6 + P3 --> P4 + P3 --> P7 + P6 --> P7 + + click P1 "https://github.com/estuary/flow/issues/2857" + click P2 "https://github.com/estuary/flow/issues/2861" + click P3 "https://github.com/estuary/flow/issues/2859" + click P4 "https://github.com/estuary/flow/issues/2863" + click P5 "https://github.com/estuary/flow/issues/2858" + click P6 "https://github.com/estuary/flow/issues/2860" + click P7 "https://github.com/estuary/flow/issues/2862" +``` + +Two independent branches off the root: the API key branch (P2, P5, P6) and the OIDC +branch (P3, P4, P7). P7 is the only phase with two upstream dependencies (P3 for +the backend it wraps, P6 for the subcommand tree it extends). + +## Phase 1 — Service Account CRUD & Tokens (Backend) + +Delivers the complete service account lifecycle via GraphQL. The mutations can +be driven by any authenticated admin (GraphQL playground, curl, or flowctl using +their existing human credentials). Once minted, an API key becomes a usable +credential by POSTing it to `generate_access_token` to receive a short-lived JWT +— that JWT is what actually goes in the `Authorization: Bearer` header. The raw +`flow_sa_...` string is not itself a valid Bearer token; flowctl only handles the +exchange transparently once Phase 5 lands. 
+ +### Data Model + +**`internal.service_accounts`**: + +| Column | Type | Notes | +| -------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `user_id` | UUID PK, FK → `auth.users` | The service account's identity | +| `prefix` | `catalog_prefix` NOT NULL | Owning prefix — set at creation, immutable. Used for authorization on all management operations (disable, key rotation, etc.) | +| `capability` | `grant_capability` NOT NULL | | +| `display_name` | TEXT NOT NULL | Human-readable label (e.g., "CI deploy bot") | +| `created_by` | UUID FK → `auth.users` | Audit trail — which human created this | +| `last_used_at` | TIMESTAMPTZ | Updated on each `generate_access_token()` call via any of this account's api keys | +| `disabled_at` | TIMESTAMPTZ | Set by `disableServiceAccount` — account cannot authenticate and is shown as disabled in the UI | +| `created_at` | TIMESTAMPTZ | | +| `updated_at` | TIMESTAMPTZ | | + +The referenced row in `auth.users` has no password and no OAuth identity. Application-layer +queries that need to distinguish service accounts from humans (member lists, onboarding) +join against `internal.service_accounts`. 
+ +**`internal.api_keys`** — credentials for service accounts (separate from human `refresh_tokens`): + +| Column | Type | Notes | +| -------------------- | ---------------------- | ------------------------------------------------------------- | +| `id` | UUID PK | | +| `service_account_id` | UUID FK → `auth.users` | Which service account this key authenticates as | +| `secret_hash` | TEXT NOT NULL | Hashed secret (plaintext returned once at creation) | +| `label` | TEXT NOT NULL | Human-readable label (e.g., "GitHub Actions") | +| `expires_at` | TIMESTAMPTZ NOT NULL | Hard cutoff — no sliding window | +| `created_by` | UUID FK → `auth.users` | Which human created this key | +| `last_used_at` | TIMESTAMPTZ | Updated on each `generate_access_token()` call using this key | +| `created_at` | TIMESTAMPTZ | | + +**Token format:** API keys are prefixed `flow_sa_` followed by base64-encoded `{id}:{secret}`. +The prefix routes server-side lookups to `internal.api_keys`, and helps users distinguish +service account keys from personal refresh tokens. + +Update `generate_access_token()` to accept an API key input alongside the existing +`{refresh_token_id, secret}` input. The two shapes are mutually exclusive and +routed by which field is present: + +- **New input** — `{api_key: "flow_sa_..."}`: decode the base64 `{id}:{secret}` + payload, look up in `internal.api_keys`, verify `secret_hash`, check `expires_at`, + update `last_used_at`, mint a JWT with `service_account_id` as `sub`. +- **Existing input** — `{refresh_token_id, secret}`: unchanged, routes through + `public.refresh_tokens` as today. + +Response shape is unchanged (`{access_token, refresh_token?}`). The optional +`refresh_token` is never set for the api-key branch — API keys don't rotate; +the same `flow_sa_...` string is reused until its `expires_at`. + +The additive input shape keeps the RPC fully backward compatible with existing +flowctl clients. 
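The token format and input routing above can be sketched as follows. This is an illustration, not the implementation: the helper names are hypothetical, standard (non-URL-safe) base64 is assumed because its alphabet excludes `_`, and SHA-256 stands in for whatever hashing scheme `secret_hash` ends up using.

```python
import base64
import hashlib
import hmac
from datetime import datetime
from typing import Mapping, Tuple

API_KEY_PREFIX = "flow_sa_"


def encode_api_key(key_id: str, secret: str) -> str:
    """Build the `flow_sa_` string returned once at creation."""
    payload = f"{key_id}:{secret}".encode("utf-8")
    return API_KEY_PREFIX + base64.b64encode(payload).decode("ascii")


def decode_api_key(token: str) -> Tuple[str, str]:
    """Split a `flow_sa_` string back into (id, secret)."""
    if not token.startswith(API_KEY_PREFIX):
        raise ValueError("not a service account API key")
    raw = base64.b64decode(token[len(API_KEY_PREFIX):], validate=True)
    key_id, sep, secret = raw.decode("utf-8").partition(":")
    if not sep or not key_id or not secret:
        raise ValueError("malformed API key payload")
    return key_id, secret


def hash_secret(secret: str) -> str:
    # Placeholder hash for the sketch; the real scheme is not specified here.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def verify_api_key(token: str, stored_hash: str,
                   expires_at: datetime, now: datetime) -> bool:
    """Check the secret and the hard `expires_at` cutoff (no sliding window)."""
    _, secret = decode_api_key(token)
    if now >= expires_at:
        return False
    return hmac.compare_digest(hash_secret(secret), stored_hash)


def route_input(body: Mapping[str, str]) -> str:
    """Pick the lookup path by which field is present; shapes are exclusive."""
    has_key = "api_key" in body
    has_refresh = "refresh_token_id" in body
    if has_key == has_refresh:
        raise ValueError("provide exactly one of api_key or refresh_token_id")
    return "api_key" if has_key else "refresh_token"
```

Because `expires_at` is compared against the current time on every exchange and never updated, a key stops working at its cutoff regardless of how recently it was used.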
+ +### GraphQL API + +**Mutations:** (requires admin capability on the prefix) + +- `createServiceAccount(prefix, capability, displayName)` → `{ userId, prefix, capability, displayName, createdAt }` + - Creates `auth.users` row + `internal.service_accounts` row + `user_grants` row + for `(user_id, prefix, capability)` as the only grant + +- `disableServiceAccount(userId)` → `Boolean` + - Deletes all `user_grants` (removes catalog access) and all `api_keys` (invalidates credentials) + - Sets `internal.service_accounts.disabled_at = now()` + - The `auth.users` row is intentionally preserved — several tables (`drafts`, `publications`, + `publication_specs`) reference it with default `NO ACTION` FK constraints, so + a hard delete would require cascading cleanup of audit history. Keeping the row is simpler and + preserves attribution. + - A disabled service account cannot authenticate (no api_keys, no grants) but remains + visible in the UI as disabled for auditability + +- `enableServiceAccount(userId)` → `Boolean` + - Clears `internal.service_accounts.disabled_at` and re-creates the `user_grants` row + for `(user_id, prefix, capability)` using the service account's original `prefix` and `capability` + - Does NOT restore previously revoked `api_keys` — the admin must mint new ones via + `createApiKey` + +- `createApiKey(userId, label, validFor)` → `{ keyId, secret }` + - Creates an `internal.api_keys` row for the target service account with `expires_at = now() + validFor` + - `validFor`: ISO 8601 duration (e.g., `P90D`, `P1Y`) + - Returns the `flow_sa_`-prefixed secret exactly once — never retrievable again + +- `revokeApiKey(keyId)` → `Boolean` + - Deletes the `api_keys` row + +**Queries:** + +- `serviceAccounts(after, first)` → paginated list + - Includes: `userId`, `displayName`, `prefix`, `capability`, `createdBy`, `createdAt`, + `apiKeys[]` (each with `keyId`, `label`, `createdAt`, `expiresAt` — secrets are + not retrievable since only `secret_hash` is stored) + - 
**Requires:** caller is admin (or read?) on `prefix` + +### Verification + +- [ ] Create service account with grant to `acmeCo/` → create API key → exchange for + access token via `generate_access_token()` → use JWT with PostgREST and GraphQL + → grants resolve normally through `user_roles()` +- [ ] API key with elapsed `expires_at` → `generate_access_token()` rejects +- [ ] Disable service account → all API keys revoked, grant removed, account marked disabled +- [ ] Disabled service account can't create new API keys +- [ ] Re-enabled service account has same access as before +- [ ] Multiple active API keys on same service account → all work (rotation scenario) +- [ ] `acmeCo/staging/` admin cannot manage a service account with prefix `acmeCo/prod/` +- [ ] Service account does NOT appear in tenant member lists + +--- + +## Phase 2 — Service Account Management UI (Frontend) + +Uses Phase 1 APIs with no backend changes. + +### Views + +**Service Accounts list** (under tenant admin area): + +- Table: display name, prefix, created by, created date, API key count +- Scoped to the admin's current tenant prefix +- "Create Service Account" action + +**Service Account detail:** + +- Display name (editable) +- Prefix (read-only, set at creation) +- API keys list: label, created date, expires date, status (active / expired) +- "Create API Key" action with copy-once secret display + configurable lifetime +- "Revoke API Key" action per key +- "Disable Service Account" action with confirmation + +### API Key Creation UX + +1. Admin clicks "Create API Key" +2. Enters label (e.g., "GitHub Actions") and lifetime (dropdown: 90 days, 180 days, 1 year, custom) +3. System returns the `flow_sa_`-prefixed secret — the admin configures it in their + CI system (or flowctl, per Phase 5), which exchanges it for a short-lived access + token on each run +4. Secret is displayed once in a copy-able field with a warning that it won't be shown again +5. 
Admin copies and configures their CI/CD system + +### Verification + +- Create service account, create API key, copy secret → use in API call → works +- Secret display disappears on navigation — not retrievable +- Expired API keys show visual indicator +- Disable service account → shown as disabled in list + +--- + +## Phase 3 — OIDC Token Exchange (Backend) + +Enables external identity providers (GitHub Actions, GitLab CI, cloud provider +workload identity) to exchange their OIDC tokens for short-lived Estuary access +tokens. Eliminates long-lived secrets in CI/CD. + +### Design + +Implements a token exchange endpoint inspired by RFC 8693. An external OIDC token +is exchanged for a short-lived JWT scoped to a specific service account's grants. + +**`internal.oidc_trust_policies`** — maps external identities to service accounts: + +| Column | Type | Notes | +| -------------------- | ---------------------- | --------------------------------------------------------------------- | +| `id` | flowid PK | | +| `service_account_id` | UUID FK → `auth.users` | Which service account to act as | +| `issuer` | TEXT NOT NULL | OIDC issuer URL (e.g., `https://token.actions.githubusercontent.com`) | +| `subject_pattern` | TEXT NOT NULL | Regex or exact match on `sub` claim | +| `claims_filter` | JSONB | Additional claim constraints (e.g., `{"repository": "estuary/flow"}`) | +| `max_token_lifetime` | INTERVAL | Upper bound on issued token lifetime | +| `created_by` | UUID FK → `auth.users` | | +| `created_at` | TIMESTAMPTZ | | + +**Endpoint:** `POST /auth/v1/token-exchange` (or GraphQL mutation) + +``` +{ + "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange", + "subject_token": "", + "subject_token_type": "urn:ietf:params:oauth:token-type:jwt", + "requested_token_type": "urn:ietf:params:oauth:token-type:access_token" +} +``` + +**Flow:** + +1. Decode external JWT, extract `iss` and `sub` +2. Fetch issuer's JWKS (cached) and verify signature +3. 
Match against `oidc_trust_policies` — `issuer` + `subject_pattern` + `claims_filter` +4. Mint a short-lived JWT with the matched service account's `user_id` as `sub` +5. Return access token (no refresh token — short-lived only) + +**Authorization for trust policy management:** caller must be admin on the +linked service account's `prefix`. + +### Verification + +- Configure GitHub Actions trust policy → run workflow → token exchange succeeds + → API calls work with service account's grants +- Mismatched subject or claims → rejected +- Expired external token → rejected +- Token lifetime respects `max_token_lifetime` + +--- + +## Phase 4 — OIDC Configuration UI (Frontend) + +Extends service account detail view with trust policy management, using Phase 3 APIs. + +### UX + +- "OIDC Trust Policies" section on service account detail +- "Add Trust Policy" with guided setup per provider: + - GitHub Actions: repo name → auto-fills issuer + subject pattern + - GitLab CI: project path → auto-fills + - Custom: manual issuer URL + subject pattern + optional claims +- Display active policies with issuer, subject pattern, and last-used timestamp + +### Verification + +- Create trust policy through UI → GitHub Actions workflow authenticates successfully +- Delete trust policy → subsequent workflow runs fail authentication +- Provider-specific templates pre-fill correct values + +--- + +## Phase 5 — flowctl Authenticates as a Service Account + +Lets CI jobs run flowctl using a service account API key instead of a +human's refresh token. Depends only on the `generate_access_token` RPC +change from Phase 1. Ships independently of Phase 6. + +Today flowctl picks up a long-lived credential from the `FLOW_AUTH_TOKEN` +env var ([config.rs:162-173](crates/flowctl/src/config.rs#L162-L173)), +expecting a base64-encoded JSON refresh token `{id, secret}`. 
CI jobs set +this to a refresh token tied to a human user — awkward because the +credential is personal, uses a sliding expiry (never forced to rotate), +and breaks if the human's account is deleted. + +After Phase 5, the same env var also accepts a `flow_sa_...` API key: + +```yaml +- run: flowctl catalog publish --specs ./catalog.yaml + env: + FLOW_AUTH_TOKEN: ${{ secrets.ESTUARY_API_KEY }} # flow_sa_... string +``` + +**Detection is by prefix.** In [config.rs](crates/flowctl/src/config.rs), +the `FLOW_AUTH_TOKEN` handler branches on whether the value starts with +`flow_sa_`. The existing base64-JSON path is untouched — base64's alphabet +doesn't include `_`, so the two formats can't collide, and existing CI +jobs that feed a refresh token through this env var keep working unchanged. + +### Config changes + +- Add `user_api_key: Option` to `Config` alongside `user_refresh_token`. + Mutually exclusive in practice — whichever is set drives credential refresh. +- Load path: if `FLOW_AUTH_TOKEN` starts with `flow_sa_`, populate `user_api_key`; + otherwise decode as today into `user_refresh_token`. +- No config-file migration needed. Existing `~/.config/flowctl/*.json` files + that contain `user_refresh_token` are unaffected. + +### `refresh_authorizations` changes + +[flow-client/src/client.rs:462](crates/flow-client/src/client.rs#L462): + +- New branch: if `user_api_key` is set, call `generate_access_token` with the + new `{api_key}` input (Phase 1), get back an access token, return the api + key unchanged. Skip the "auto-create a refresh token" branch — the API key + IS the long-lived credential. +- Existing refresh-token branches are unchanged. + +### New auth command + +`flowctl auth api-key --token flow_sa_...` stores the key into the config +file for interactive workstation use. The existing `flowctl auth token` +remains for short-lived access tokens and for the interactive login flow. 
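
A minimal sketch of the prefix detection and load path described under "Config changes" above, assuming the refresh token is encoded with the standard base64 alphabet. The names `AuthTokenKind`, `classify_auth_token`, and `is_standard_base64` are hypothetical and do not come from `config.rs`; the real handler would populate `user_api_key` or fall through to the existing refresh-token decoder.

```rust
/// Prefix that marks a service account API key, per the plan above.
const API_KEY_PREFIX: &str = "flow_sa_";

/// Which long-lived credential `FLOW_AUTH_TOKEN` carried.
/// Hypothetical type for illustration; the real `Config` fields may differ.
#[derive(Debug, PartialEq)]
enum AuthTokenKind {
    /// A `flow_sa_...` API key, kept verbatim (would populate `user_api_key`).
    ApiKey(String),
    /// Anything else, handed unchanged to the existing base64-JSON
    /// refresh-token decoder (would populate `user_refresh_token`).
    EncodedRefreshToken(String),
}

/// Branch on the prefix only; no decoding happens here, so the existing
/// refresh-token path stays untouched.
fn classify_auth_token(raw: &str) -> AuthTokenKind {
    if raw.starts_with(API_KEY_PREFIX) {
        AuthTokenKind::ApiKey(raw.to_string())
    } else {
        AuthTokenKind::EncodedRefreshToken(raw.to_string())
    }
}

/// True when every byte belongs to the standard base64 alphabet
/// (A-Z, a-z, 0-9, '+', '/') or is '=' padding. The underscore is not in
/// that alphabet, so a standard base64 string cannot start with `flow_sa_`.
/// Assumption: refresh tokens are not encoded with the URL-safe variant,
/// which does use '_'.
fn is_standard_base64(s: &str) -> bool {
    !s.is_empty()
        && s.bytes()
            .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'='))
}
```

For example, `classify_auth_token("flow_sa_abc123")` yields the `ApiKey` variant, while a base64 payload such as `eyJpZCI6IjEyMyJ9` falls through to `EncodedRefreshToken`. Checking only the prefix keeps the change additive: any value that worked before is routed exactly as it was.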
+ +### Verification + +- [ ] Existing CI job using a base64-JSON refresh token in `FLOW_AUTH_TOKEN` + continues to work with no changes +- [ ] `FLOW_AUTH_TOKEN=flow_sa_...` set in env → flowctl runs as the service + account, grants resolve through `user_roles()` to the service account's + prefix +- [ ] Expired API key in `FLOW_AUTH_TOKEN` → flowctl fails with a clear error + (no attempt to refresh, since API keys don't rotate) +- [ ] Access token nearing expiry is re-minted from the API key on the next + request; the API key itself is not rotated or re-written anywhere +- [ ] `flowctl auth api-key --token flow_sa_...` persists to config and + subsequent invocations authenticate correctly + +--- + +## Phase 6 — flowctl Service Account & API Key Commands + +Adds a `flowctl service-accounts` subcommand tree covering service account +lifecycle and API key management. Depends only on Phase 1. Ships +independently of Phase 5; until then, admins drive provisioning from the +dashboard (Phase 2) or exchange the secret returned by `api-keys create` via +`generate_access_token` (the same flow a refresh token uses today) to obtain +a short-lived JWT for ad-hoc `Authorization: Bearer` use. + +Lives at [crates/flowctl/src/service_accounts/](crates/flowctl/src/service_accounts/) +alongside existing top-level subcommands like `catalog`, `draft`, and `auth`. + +**Pattern to follow:** [crates/flowctl/src/alert_subscriptions/](crates/flowctl/src/alert_subscriptions/) +is the closest shape — full CRUD (`list-query.graphql`, `create-mutation.graphql`, +`update-mutation.graphql`, `delete-mutation.graphql`) wired through +`graphql_client::GraphQLQuery` derives and the shared `post_graphql()` helper +in [crates/flowctl/src/graphql.rs](crates/flowctl/src/graphql.rs). The module +header comment in `graphql.rs` documents the derive attributes (`extern_enums`, +`response_derives`, scalar mapping) that these queries will need. 
+ +### Command surface + +``` +flowctl service-accounts list [--prefix ] +flowctl service-accounts create --prefix --capability --name +flowctl service-accounts disable +flowctl service-accounts enable + +flowctl service-accounts api-keys list +flowctl service-accounts api-keys create --label