diff --git a/docs/engineering/docs.json b/docs/engineering/docs.json index f1dad3baf0..ef61c9c8a0 100644 --- a/docs/engineering/docs.json +++ b/docs/engineering/docs.json @@ -191,47 +191,15 @@ "groups": [ { "group": "Infra", - "pages": ["infra/index","infra/infra-work"] + "pages": ["infra/index"] }, { "group": "Clusters", "pages": [ - "infra/clusters/kubeconfig", - "infra/clusters/create-region", "infra/clusters/delete-cluster", - "infra/clusters/networks", "infra/clusters/scheduling" ] }, - { - "group": "Observability", - "pages": [ - "infra/observability/alerting", - "infra/observability/checkly", - "infra/observability/incident-io" - ] - }, - { - "group": "Metering", - "pages": [ - "infra/metering/gauge" - ] - }, - { - "group": "Deployments", - "pages": [ - "infra/deployments/argocd-debugging", - "infra/deployments/cronjobs" - ] - }, - { - "group": "Custom Domains", - "pages": ["infra/domain-connect"] - }, - { - "group": "Secrets", - "pages": ["infra/secrets/aws-secrets"] - }, { "group": "ClickHouse", "pages": [ @@ -257,16 +225,6 @@ ] } ] - }, - { - "group": "Legacy (2025)", - "pages": [ - "infra/legacy-2025/github-oidc", - "infra/legacy-2025/github-actions-deploy-role-setup", - "infra/legacy-2025/pulumi-iac-esc", - "infra/legacy-2025/pulumi-infrastructure-architecture", - "infra/legacy-2025/pulumi-workflow" - ] } ] } diff --git a/docs/engineering/infra/clusters/create-region.mdx b/docs/engineering/infra/clusters/create-region.mdx deleted file mode 100644 index 1c403ff7df..0000000000 --- a/docs/engineering/infra/clusters/create-region.mdx +++ /dev/null @@ -1,366 +0,0 @@ ---- -title: Creating a New EKS Cluster Region -description: Adding a new EKS region. ---- - -End-to-end guide for adding a new AWS region to the Unkey EKS infrastructure. - -Assumes familiarity with Kubernetes, AWS, and the existing repo layout. 
---

## Prerequisites

Before starting, ensure you have:

- **AWS credentials** configured (`AWS_PROFILE`) with permissions for EKS, IAM, Route53, Secrets Manager, and ELB
- **CLI tools** installed: `awscli`, `eksctl`, `kubectl`, `helm`, `argocd`
- **GitHub App credentials** for ArgoCD repository access
- **Route53 hosted zones** created for `.aws.unkey.com` and `aws.unkey.cloud`
- **CIDR allocation** — confirm the target region has an entry in [`networks`](/infra/clusters/networks). The generator script will refuse to run if the CIDR is missing.

---

## Step 1: Generate configuration

The `generate-region-config.sh` script creates all eksctl and helm environment files for a region.

### Dry run first

```bash
cd eks-cluster
./scripts/generate-region-config.sh <region> --dry-run

# With a non-default environment:
./scripts/generate-region-config.sh <region> staging --dry-run
```

This prints the file list and CIDR without writing anything.

### Generate files

```bash
# Base region (unkey-api + infrastructure only)
./scripts/generate-region-config.sh <region>

# Full deploy region (adds control-api, frontline, krane, vault, etc.)
./scripts/generate-region-config.sh <region> --with-deploy
```

### What gets created

| Category | Apps | When |
| --- | --- | --- |
| **Always generated** | eksctl config, argocd, core, networking, reloader, runtime, dragonfly, tailscale, external-dns, observability, thanos, vector-logs, **unkey-api** | Every run |
| **Deploy-only** (`--with-deploy`) | control-api, control-worker, restate, sentinel, frontline, krane, vault | Only with `--with-deploy` |

Files are written to `configs/<environment>/<region>.yaml` and `helm-chart/<app>/environments/<environment>/<region>.yaml`. The script refuses to overwrite existing files — delete them first if you need to regenerate.
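When a region's files already exist, the generator aborts rather than overwrite them. A small sketch to see what is in the way before regenerating; the `list_blocking_files` helper and the glob patterns are assumptions based on the layout described above, not a script that exists in the repo:

```shell
# Hypothetical helper: list existing files that would block regeneration
# for a region. Globs mirror the assumed configs/ and helm-chart/ layout.
list_blocking_files() {
  region=$1
  for f in configs/*/"$region".yaml helm-chart/*/environments/*/"$region".yaml; do
    # Unmatched globs stay literal, so guard with -e before printing.
    [ -e "$f" ] && echo "$f"
  done
  return 0
}

list_blocking_files eu-west-1
```

Run from the `eks-cluster` directory; delete whatever it prints, then rerun the generator.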
---

## Step 2: Review & commit

Check the generated files make sense:

```bash
git diff --stat
git diff
```

Things to verify:

- VPC CIDR matches the [networks](/infra/clusters/networks) assignment
- Hostnames and domain patterns look correct
- gossip WAN seeds have a `TODO` comment (expected — you'll fill them in at Step 6)

Commit the generated config and push:

```bash
git add configs/ helm-chart/
git commit -m "Add region config for <region>"
git push
```

### Promote all apps to the new commit

Each ArgoCD ApplicationSet reads a promotion file (`eks-cluster/promotions/<environment>/<app>.yaml`) that pins a specific git SHA as the `targetRevision`. After pushing the new region's config, you **must promote every app** to a revision that includes the new env files — otherwise ArgoCD will check out the older pinned commit where the files don't exist, and all apps will show `Unknown` sync status.

Use the `promote` script to update all apps to the pushed commit:

```bash
./scripts/promote <environment> $(git ls-remote origin main | awk '{print $1}')
git add eks-cluster/promotions/
git commit -m "Promote all apps for <region>"
git push
```

---

## Step 3: Verify secrets replication

All secrets in AWS Secrets Manager (`unkey/shared`, `unkey/control`, `unkey/krane`, etc.) are already replicated from `us-east-1` to the regions where unkey-api runs. Once the cluster is up, External Secrets will pull from the local region's Secrets Manager automatically.
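To eyeball whether your region is present in the space-separated region list that `aws secretsmanager describe-secret` prints, a pure-shell membership check works; the region values below are illustrative, not the real replica set:

```shell
# Illustrative replication list and target region (assumptions, not real data).
regions="us-east-1 eu-central-1 us-west-2"
target="eu-west-1"

# Pad with spaces so the match is exact (eu-west-1 must not match eu-west-10).
case " $regions " in
  *" $target "*) echo "$target: replicated" ;;
  *)             echo "$target: missing (add it before cluster setup)" ;;
esac
# prints: eu-west-1: missing (add it before cluster setup)
```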
Verify replication is in place for your region:

```bash
aws secretsmanager describe-secret \
  --secret-id unkey/shared \
  --region us-east-1 \
  --query 'ReplicationStatus[].Region' \
  --output text
```

If your region is **not** in the list, you need to add it to each secret's replication configuration:

```bash
# Add a new replica region to an existing secret
aws secretsmanager replicate-secret-to-regions \
  --secret-id unkey/shared \
  --add-replica-regions Region=<region> \
  --region us-east-1

# Repeat for each secret: unkey/control, unkey/krane,
# unkey/sentinel, unkey/vault, unkey/vector, unkey/frontline, unkey/argocd
```

The `replicate-secrets-to-new-region.sh` script automates this for all secrets at once:

```bash
./scripts/replicate-secrets-to-new-region.sh us-east-1 <region>
```

After initial replication, AWS keeps them in sync automatically — no cron or Lambda needed.

See [AWS Secrets](../secrets/aws-secrets) for the full secret inventory.

---

## Step 4: Create cluster

Set the required variables and run the bootstrap script:

```bash
ENVIRONMENT=production001 PRIMARY_REGION=<region> ./scripts/setup-cluster.sh
```

The script executes in order:

| Step | What happens |
| --- | --- |
| 1 | Create IAM policies (ExternalDNS, SecretsManager, ALB, ACK) |
| 2 | Create EKS cluster (without node groups) |
| 3 | Wait for cluster ACTIVE status |
| 4 | Update kubeconfig |
| 5 | Patch addon tolerations |
| 6 | Create node groups |
| 7 | Create observability S3 bucket |
| 8 | Install AWS Load Balancer Controller |
| 9 | Install CRDs (Prometheus, External Secrets) |
| 10 | Install and configure ArgoCD |

For production environments you'll be prompted to type the environment name to confirm.

**Expected duration:** 15–25 minutes (mostly waiting for EKS cluster and node group creation).
---

## Step 5: Verify deployment

```bash
# Nodes are ready
kubectl get nodes

# ArgoCD is running
kubectl get pods -n argocd

# Get ArgoCD admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d; echo

# Access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
```

Check that ArgoCD has picked up the new region's ApplicationSets and apps are syncing. Core infrastructure apps (external-dns, observability, etc.) should sync automatically.

---

## Step 6: Configure gossip WAN seeds

Both **unkey-api** and **frontline** (if `--with-deploy`) use memberlist-based WAN gossip for cross-region state sharing. This is a chicken-and-egg problem: each region needs to know the other region's NLB DNS name, but that NLB doesn't exist until the chart deploys.

### 6a. Deploy with empty seeds (already done)

The generated config has `UNKEY_GOSSIP_WAN_SEEDS: ""` for unkey-api and appropriate defaults for frontline. ArgoCD will deploy them, creating the NLB and registering DNS via ExternalDNS.

### 6b. Verify the gossip NLB DNS is registered

Wait for ExternalDNS to create the DNS records, then verify:

```bash
# unkey-api
dig unkey-api-gossip.<region>.aws.unkey.cloud

# frontline (deploy regions only)
dig frontline-gossip.<region>.aws.unkey.cloud
```

If ExternalDNS hasn't registered the friendly name yet, get the raw NLB hostname:

```bash
kubectl get svc -n api unkey-api-gossip-wan \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```

### 6c. Update the new region's WAN seeds

Point the new region to the existing region(s).

**unkey-api** — edit `helm-chart/unkey-api/environments/<environment>/<region>.yaml`:

```yaml
env:
  UNKEY_GOSSIP_WAN_SEEDS: "unkey-api-gossip.<existing-region>.aws.unkey.cloud"
```

**frontline** (deploy regions only) — edit `helm-chart/frontline/environments/<environment>/<region>.yaml`:

```yaml
gossip:
  wanSeeds: "frontline-gossip.<existing-region>.aws.unkey.cloud"
```

### 6d.
Update existing regions to include the new region

Each existing region must add the new region as a seed. Seeds are comma-separated if there are multiple peer regions.

**Example** — existing `us-east-1` unkey-api config gets:

```yaml
env:
  UNKEY_GOSSIP_WAN_SEEDS: "unkey-api-gossip.eu-central-1.aws.unkey.cloud,unkey-api-gossip.<new-region>.aws.unkey.cloud"
```

### 6e. Commit, push, and sync

```bash
git add helm-chart/
git commit -m "Wire gossip WAN seeds for <region>"
git push
```

ArgoCD will redeploy the affected services. Pods restart and join the WAN gossip ring.

### 6f. Verify gossip is healthy

```bash
# Check unkey-api gossip logs
kubectl logs -n api -l app.kubernetes.io/component=unkey-api --tail=50 | grep -i gossip

# Check the WAN NLB has a healthy target
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --region <region>
```

---

## Step 7: Enable Global Accelerator (deploy regions only)

For regions running frontline with `--with-deploy`, the generated config already sets `globalAccelerator.enabled: true` and includes the listener ARN. After the frontline NLB is created:

1. The GA resolver Helm hook job runs automatically
2. It discovers the NLB ARN and creates an `EndpointGroup` CRD
3.
The ACK Global Accelerator controller reconciles and attaches the NLB to the Global Accelerator - -Verify: - -```bash - -# EndpointGroup exists -kubectl get endpointgroups -n frontline -``` - -If the Global Accelerator doesn't exist yet (first-time setup), create it first: - -```bash -ENVIRONMENT=production001 ./scripts/setup-global-accelerator.sh -``` - ---- - -## Quick Reference - -| Script | What it does | -| ------------------------------------ | ---------------------------------------------------------------------------------------- | -| `generate-region-config.sh` | Generate all config files for a new region | -| `promote` | Update promotion files to deploy a revision via ArgoCD | -| `promotion-changelists` | Generate a changelog of PRs between the old and new promotion revisions | -| `replicate-secrets-to-new-region.sh` | Add a new region to secrets replication (only needed for regions not already replicated) | -| `setup-cluster.sh` | Full cluster bootstrap (IAM → EKS → nodes → ArgoCD) | -| `setup-global-accelerator.sh` | Create Global Accelerator (one-time) | -| `setup-acm-certificate.sh` | Create wildcard ACM cert for a region | -| `validate-aws-resources.sh` | Validate AWS resources exist | -| `apply-addon-tolerations.sh` | Patch EKS addon tolerations | - ---- - -## Troubleshooting - -### CIDR not found - -``` -Error: No CIDR found for 'production001-xx-xxxx-1' -``` - -The region isn't in the `CIDR_MAP` in `generate-region-config.sh`. Add it there and in [networks](/infra/clusters/networks). - -### Node groups not scheduling pods - -All node groups use taints. Pods need matching tolerations. 
Check:

```bash
kubectl describe node | grep Taints
kubectl get pods -A --field-selector=status.phase!=Running
kubectl describe pod <pod> -n <namespace>   # look for "Insufficient" or "didn't match"
```

Common taints:

| Node group | Taint |
| --- | --- |
| `unkey` | `node-class=unkey:NoSchedule` |
| `untrusted` | `node-class=untrusted:NoSchedule` |
| `sentinel` | `node-class=sentinel:NoSchedule` |
| `observability` | `node-class=observability:NoSchedule` |
| `api` | `node-class=api:NoSchedule` |

### Gossip not joining

1. **DNS not resolving** — ExternalDNS may not have registered yet. Check `kubectl logs -n networking -l app.kubernetes.io/name=external-dns`.
2. **NLB not ready** — `kubectl get svc -n api unkey-api-gossip-wan` should show an external hostname.
3. **Security groups** — WAN gossip uses port 7947 TCP+UDP. The NLB must allow inbound on this port.
4. **Secret mismatch** — All regions in a gossip ring must share the same `UNKEY_GOSSIP_SECRET_KEY` (pulled from AWS Secrets Manager).

### ExternalSecrets failing

```bash
kubectl get externalsecrets -A
kubectl describe externalsecret <name> -n <namespace>
```

Check that:

- Secrets are replicated to this region (see [Step 3](#step-3-verify-secrets-replication))
- Pod Identity association exists for the service account
- The SecretStore references the correct region

### ArgoCD apps not syncing

```bash
argocd app list
argocd app get <app-name>
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=100
```

Verify the ApplicationSet generator includes the new cluster/region.

diff --git a/docs/engineering/infra/clusters/kubeconfig.mdx b/docs/engineering/infra/clusters/kubeconfig.mdx
deleted file mode 100644
index 8e5e0f73c8..0000000000
--- a/docs/engineering/infra/clusters/kubeconfig.mdx
+++ /dev/null
@@ -1,33 +0,0 @@
---
title: Kubeconfig Setup
description: Configure kubeconfig for all live clusters.
---- - -Run the following commands to configure your kubeconfig with all live clusters. - -### Production clusters - -```bash -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region us-east-1 --name deploy-us-east-1 --alias deploy-us-east-1 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region us-west-2 --name deploy-us-west-2 --alias deploy-us-west-2 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region eu-central-1 --name deploy-eu-central-1 --alias deploy-eu-central-1 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region eu-central-1 --name beautiful-dance-crab --alias beautiful-dance-crab -``` - -### Automode clusters - -```bash -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region us-east-2 --name k8s-automode-us-east-2 --alias k8s-automode-us-east-2 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region us-west-2 --name k8s-automode-us-west-2 --alias k8s-automode-us-west-2 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region ap-south-1 --name k8s-automode-ap-south-1 --alias k8s-automode-ap-south-1 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region ap-northeast-1 --name k8s-am-ap-northeast-1 --alias k8s-am-ap-northeast-1 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region ap-southeast-2 --name k8s-am-ap-southeast-2 --alias k8s-am-ap-southeast-2 -AWS_PROFILE=unkey-production001-admin aws eks update-kubeconfig --region sa-east-1 --name k8s-am-sa-east-1 --alias k8s-am-sa-east-1 -``` - -### Staging clusters - -```bash -AWS_PROFILE=unkey-sandbox-admin aws eks update-kubeconfig --region eu-central-1 --name staging-eu-central-1 --alias staging-eu-central-1 -AWS_PROFILE=unkey-sandbox-admin aws eks update-kubeconfig --region us-west-2 --name staging-us-west-2 --alias staging-us-west-2 -``` diff --git a/docs/engineering/infra/clusters/networks.mdx 
b/docs/engineering/infra/clusters/networks.mdx deleted file mode 100644 index adf2c70ab2..0000000000 --- a/docs/engineering/infra/clusters/networks.mdx +++ /dev/null @@ -1,92 +0,0 @@ ---- -title: Network CIDR Assignments -description: VPC CIDR allocations for all regions. ---- - -Last updated: 2026-02-26 - -## Address Space Layout - -``` -10.0 – 10.10 Legacy (decommission) -10.12 – 10.16 EKS production production account /18 per VPC -10.20 – 10.24 Talos (future) production account /18 per VPC -10.28 – 10.29 Staging sandbox account /20 per VPC -``` - -EKS starts at `10.12`, Talos at `10.20` (+8 offset), same slot within each `/16`. - ---- - -## Quick Reference - -| Region | EKS Prod `/18` | Talos Prod `/18` | Staging `/20` | -|-----------------|-------------------|--------------------|--------------------| -| us-east-1 | `10.12.0.0/18` | `10.20.0.0/18` | `10.28.0.0/20` | -| us-east-2 | `10.12.64.0/18` | `10.20.64.0/18` | `10.28.16.0/20` | -| us-west-1 | `10.12.128.0/18` | `10.20.128.0/18` | `10.28.32.0/20` | -| us-west-2 | `10.12.192.0/18` | `10.20.192.0/18` | `10.28.48.0/20` | -| ca-central-1 | `10.16.64.0/18` | `10.24.64.0/18` | `10.28.64.0/20` | -| eu-central-1 | `10.13.0.0/18` | `10.21.0.0/18` | `10.28.80.0/20` | -| eu-north-1 | `10.13.64.0/18` | `10.21.64.0/18` | `10.28.96.0/20` | -| eu-west-1 | `10.13.128.0/18` | `10.21.128.0/18` | `10.28.112.0/20` | -| eu-west-2 | `10.13.192.0/18` | `10.21.192.0/18` | `10.28.128.0/20` | -| eu-west-3 | `10.16.0.0/18` | `10.24.0.0/18` | `10.28.144.0/20` | -| sa-east-1 | `10.16.128.0/18` | `10.24.128.0/18` | `10.28.160.0/20` | -| ap-south-1 | `10.14.0.0/18` | `10.22.0.0/18` | `10.28.176.0/20` | -| ap-south-2 | `10.14.64.0/18` | `10.22.64.0/18` | `10.28.192.0/20` | -| ap-southeast-1 | `10.14.128.0/18` | `10.22.128.0/18` | `10.28.208.0/20` | -| ap-southeast-2 | `10.14.192.0/18` | `10.22.192.0/18` | `10.28.224.0/20` | -| ap-northeast-1 | `10.15.0.0/18` | `10.23.0.0/18` | `10.28.240.0/20` | -| ap-northeast-2 | `10.15.64.0/18` | 
`10.23.64.0/18` | `10.29.0.0/20` | -| ap-northeast-3 | `10.15.128.0/18` | `10.23.128.0/18` | `10.29.16.0/20` | - -### EKS `/16` grouping - -| `/16` | Group | Regions | -|---------|--------------------|------------------------------------------------------| -| `10.12` | US | us-east-1, us-east-2, us-west-1, us-west-2 | -| `10.13` | EU | eu-central-1, eu-north-1, eu-west-1, eu-west-2 | -| `10.14` | AP South/Southeast | ap-south-1, ap-south-2, ap-southeast-1, ap-southeast-2 | -| `10.15` | AP Northeast | ap-northeast-1, ap-northeast-2, ap-northeast-3 | -| `10.16` | Overflow | eu-west-3, ca-central-1, sa-east-1 | - -Talos mirrors this at `10.20`–`10.24`. Staging packs all 18 regions into `10.28`–`10.29` using `/20`s. - ---- - -## Existing VPCs (to be replaced) - -### Production Account (`unkey-production001-admin`) - -| Region | VPC CIDR | VPC ID | Name | Replacement | -|-----------------|-------------------|---------------------------|----------------------------|---------------------| -| us-east-1 | `10.10.128.0/18` | `vpc-00432dcffcbd4d56c` | production001-us-east-1 | `10.12.0.0/18` | -| eu-central-1 | `10.10.192.0/18` | `vpc-013b486366778d41d` | production001-eu-central-1 | `10.13.0.0/18` | -| us-east-1 | `10.0.0.0/20` | `vpc-0bba852b9d4ab2c97` | api-vpc | delete | -| us-east-1 | `10.1.0.0/20` | `vpc-066b62ae05602bd12` | api-automode-vpc | delete | -| us-east-1 | `10.2.0.0/20` | `vpc-0586cfeec90f9affb` | agent-vpc | delete | -| us-east-2 | `10.1.48.0/20` | `vpc-0be5283678cc479a8` | k8s-automode-us-east-2 | `10.12.64.0/18` | -| us-west-2 | `10.1.16.0/20` | `vpc-0a95c3ac9a9cb4847` | k8s-automode-us-west-2 | `10.12.192.0/18` | -| eu-central-1 | `10.1.32.0/20` | `vpc-09f2b8338c705b83a` | api-automode-vpc | delete | -| ap-south-1 | `10.0.64.0/20` | `vpc-07b45693f07c78a3a` | api-vpc | delete | -| ap-south-1 | `10.1.64.0/20` | `vpc-012a5e19a2f349e57` | k8s-automode-ap-south-1 | `10.14.0.0/18` | -| ap-northeast-1 | `10.1.112.0/20` | `vpc-0a0d7cc9ddc75af38` | 
k8s-am-ap-northeast-1 | `10.15.0.0/18` | -| ap-southeast-2 | `10.1.96.0/20` | `vpc-02aae9d6999f28763` | k8s-am-ap-southeast-2 | `10.14.192.0/18` | -| sa-east-1 | `10.1.128.0/20` | `vpc-033dc3007b9e4490f` | k8s-am-sa-east-1 | `10.16.128.0/18` | - -### Sandbox Account (`unkey-sandbox-admin`) - -| Region | VPC CIDR | VPC ID | Name | Replacement | -|---------------|-------------------|---------------------------|----------------------------|---------------------| -| eu-central-1 | `10.10.0.0/20` | `vpc-0ea9d0fa5b1774bbd` | staging-eu-central-1 | `10.28.80.0/20` | -| us-west-2 | `10.10.32.0/20` | `vpc-06cf32c205d4de0b2` | staging-us-west-2 | `10.28.48.0/20` | -| us-east-1 | `10.0.0.0/16` | `vpc-08e66ce4f301d7787` | eksctl-staging-us-east-1 | delete (orphaned) | -| us-west-1 | `10.1.0.0/16` | `vpc-04515147037b05454` | eksctl-staging-us-west-1 | delete (orphaned) | - -### AWS Profiles - -| Account | Profile | Purpose | -|------------|--------------------------------|------------------------| -| Production | `unkey-production001-admin` | EKS + Talos clusters | -| Sandbox | `unkey-sandbox-admin` | Staging clusters | diff --git a/docs/engineering/infra/deployments/argocd-debugging.mdx b/docs/engineering/infra/deployments/argocd-debugging.mdx deleted file mode 100644 index 1abb9a9ea6..0000000000 --- a/docs/engineering/infra/deployments/argocd-debugging.mdx +++ /dev/null @@ -1,318 +0,0 @@ ---- -title: ArgoCD Debugging Commands -description: Debugging ArgoCD apps, syncs, and clusters. 
---

## Applications

```bash
# List all applications and their sync/health status
kubectl get applications -n argocd

# Get detailed status for a specific application
kubectl get application <app-name> -n argocd -o yaml

# Check why an application is failing to sync
kubectl get application <app-name> -n argocd -o jsonpath='{.status.operationState.message}'

# Check sync status and revision
kubectl get application <app-name> -n argocd -o jsonpath='{.status.sync.status}'
kubectl get application <app-name> -n argocd -o jsonpath='{.status.sync.revision}'

# Check health status
kubectl get application <app-name> -n argocd -o jsonpath='{.status.health.status}'

# List all applications with their sync status in a table
kubectl get applications -n argocd -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'

# Find applications that are not synced
kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.sync.status != "Synced") | .metadata.name'

# Find applications that are unhealthy
kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.health.status != "Healthy") | "\(.metadata.name): \(.status.health.status)"'
```

## ApplicationSets

```bash
# List all applicationsets
kubectl get applicationsets -n argocd

# Check an applicationset's generators and template
kubectl get applicationset <name> -n argocd -o yaml

# Check what applications an applicationset has generated
kubectl get applications -n argocd -l 'app.kubernetes.io/instance=<appset-name>'

# Check applicationset controller logs for generation errors
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller --tail=50
```

## Syncing

```bash
# Force a refresh (re-read from git without syncing)
kubectl patch application <app-name> -n argocd --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"normal"}}}'

# Force a hard refresh (clear cache and re-read)
kubectl patch application <app-name> -n argocd --type merge -p \
'{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# Trigger a sync via kubectl
kubectl patch application <app-name> -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD"}}}'
```

## Logs

```bash
# ArgoCD server logs (UI, API, sync operations)
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-server --tail=100

# Application controller logs (sync, health checks)
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=100

# Repo server logs (git clone, helm template errors)
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server --tail=100

# ApplicationSet controller logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller --tail=100

# Filter logs for a specific application
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=200 | grep '<app-name>'
```

## Repository

```bash
# Check configured repositories
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=repository

# View repo connection details (redacted)
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=repository -o jsonpath='{.items[0].data.url}' | base64 -d

# Check repo server connectivity
kubectl exec -n argocd deploy/argocd-repo-server -- ls /tmp
```

## Cluster

```bash
# Check cluster labels (used by applicationset generators)
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster -o json | jq '.items[] | {name: (.data.name // "in-cluster" | @base64d), labels: .metadata.labels}'

# For in-cluster, labels are stored differently
kubectl get configmap argocd-cm -n argocd -o yaml
```

## Port-forward to ArgoCD UI

```bash
# Access the UI locally
kubectl port-forward svc/argocd-server -n argocd 8443:443

# Then open https://localhost:8443

# Get the admin password (if initial secret still exists)
kubectl -n argocd get secret \
argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
```

## Fresh Cluster Triage

After `setup-cluster.sh` completes and configs have been pushed to git, walk through these checks in order:

### 1. Are ApplicationSets installed and generating apps?

```bash
# ApplicationSets should exist (deployed by setup-argocd.sh)
kubectl get applicationsets -n argocd

# Applications should be generated from the ApplicationSets
kubectl get applications -n argocd -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
```

If **no ApplicationSets** exist, `setup-argocd.sh` may not have completed — check its output or re-run it.

If ApplicationSets exist but **no applications** are generated, continue to step 2.

### 2. Verify cluster labels

ApplicationSets use a `clusters: {}` generator that reads labels from the in-cluster secret. If labels are missing or wrong, no applications will be generated.

```bash
# Check what labels are set
argocd cluster get in-cluster -o json | jq '.labels'
```

Required labels (set by `setup-argocd.sh`):

| Label | Expected value (example) |
| --- | --- |
| `environment` | `production001` |
| `region` | `us-east-1` |
| `provider` | `aws` |
| `clusterSuffix` | (empty string unless coexisting with legacy) |

If labels are wrong, fix them:

```bash
argocd cluster set in-cluster \
  --label environment=production001 \
  --label region=us-east-1 \
  --label provider=aws \
  --label clusterSuffix=""
```

### 3. Check promotion files point to the right revision

Each ApplicationSet reads a promotion file to get the target revision: `eks-cluster/promotions/<environment>/<app>.yaml`

If the promotion file doesn't exist for an app, the git generator won't match and no application is created.
Verify:

```bash
ls eks-cluster/promotions/production001/
```

**Critical for new clusters:** A new cluster requires a promotion of every app to a commit that includes the new region's environment files. If the promotion files still pin an older commit (from before the region config was added), ArgoCD will check out that old commit and fail with "no such file or directory" for every values file. This is the most common cause of all apps showing `Unknown` on a fresh cluster.

```bash
# Check what revision an app is pinned to
cat eks-cluster/promotions/production001/core.yaml

# If it's older than the commit that added the region config, promote all apps
# Use origin HEAD (not local) since ArgoCD fetches from the remote
./scripts/promote production001 $(git ls-remote origin main | awk '{print $1}')
git add eks-cluster/promotions/ && git commit -m "Promote all apps for new region" && git push
```

### 4. Apps exist but show `Unknown` sync status

`Unknown` means the repo server failed to render manifests (helm template failed). This is the most common issue on a fresh cluster.

**Step A: Check repo server logs for the actual error**

```bash
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server --tail=200 \
  | grep 'level=error' | head -10
```

Look at the error message at the end of each line. You'll typically see one of:

- `no such file or directory` — a values file is missing from the git repo
- `parse error` / `YAML` — a values file has invalid syntax
- `authentication required` — git credentials aren't working

**Step B: If the error is "no such file or directory", check the promotion revision**

This is the most common cause on a fresh cluster. Each app's `targetRevision` comes from a promotion file (`eks-cluster/promotions/<environment>/<app>.yaml`), which pins a specific git SHA.
If the promotion file still points to a commit from *before* the new region's env files were added, the repo server will check out that old commit and the files genuinely won't exist.

Check what revision an app is targeting:

```bash
kubectl get application <app-name> -n argocd -o yaml | grep targetRevision
```

Compare that SHA with the commit that added your region's config files. If the promotion revision is older, that's the problem — **update the promotion files** to a revision that includes the new env files, then push.

**Step C: If the promotion revision is correct, check for a stale repo cache**

The repo server caches helm template results (including errors). If files were pushed after ArgoCD first tried to render at the correct revision, the cache may be stale.

```bash
# Force a hard refresh to clear the cache
for app in $(kubectl get applications -n argocd -o jsonpath='{.items[?(@.status.sync.status=="Unknown")].metadata.name}'); do
  kubectl patch application "$app" -n argocd --type merge \
    -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done

# If hard refresh doesn't help, restart the repo server to wipe the on-disk cache
kubectl rollout restart deployment argocd-repo-server -n argocd
```

After this, apps should transition from `Unknown` to `OutOfSync` or `Synced` within a minute. If they stay `Unknown`, re-check the repo server logs — the files may genuinely be missing or have syntax errors.

### 5. Check applicationset controller logs

```bash
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller --tail=100
```

Look for errors about git file discovery, cluster matching, or template rendering.

---

## Common Issues

### Application stuck in "Unknown" sync status

The repo server can't render the manifests.
Check repo server logs:

```bash
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server --tail=100
```

Common causes: missing Helm values file, invalid chart, git auth failure.

### Application stuck in "Unknown" after a fix has been pushed

The repo server caches manifest generation results, including failures. If a values file was missing and you've since pushed the fix, ArgoCD may keep serving the cached error. Force a hard refresh to clear it:

```bash
# Single application
kubectl patch application <app-name> -n argocd --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'

# All applications stuck in Unknown
for app in $(kubectl get applications -n argocd -o jsonpath='{.items[?(@.status.sync.status=="Unknown")].metadata.name}'); do
  kubectl patch application "$app" -n argocd --type merge -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
done
```

### Application sync stuck in "Running" forever

A sync operation can get permanently stuck (e.g. waiting on a health check that passed but wasn't detected). Hard refreshes and controller restarts won't fix this because the operation state is stored in the Application CR itself.

If the application is managed by an ApplicationSet, delete it and let the ApplicationSet recreate it with a clean state:

```bash
# Verify the applicationset exists first
kubectl get applicationset <name> -n argocd

# Delete the stuck application (the ApplicationSet will recreate it)
kubectl delete application <app-name> -n argocd

# Verify it was recreated and is syncing
kubectl get application <app-name> -n argocd -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
```

Do NOT do this for applications that aren't managed by an ApplicationSet — they won't be recreated automatically.
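One way to tell the two cases apart is the Application's owner references: an ApplicationSet-managed app carries an `ownerReferences` entry of kind `ApplicationSet`. A hedged offline sketch of that check; the sample manifest is hypothetical, and in practice you would inspect `kubectl get application <app-name> -n argocd -o json` instead:

```shell
# Hypothetical Application manifest fragment standing in for kubectl output.
manifest='{"metadata":{"ownerReferences":[{"kind":"ApplicationSet","name":"core"}]}}'

# Substring match is a rough but dependency-free stand-in for jq.
case "$manifest" in
  *'"kind":"ApplicationSet"'*) echo "managed by an ApplicationSet: safe to delete" ;;
  *)                           echo "standalone: do NOT delete" ;;
esac
# prints: managed by an ApplicationSet: safe to delete
```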

### ApplicationSet not generating applications
Check the controller logs and verify cluster labels match the generator selectors:
```bash
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller --tail=50
kubectl get secret -n argocd -l argocd.argoproj.io/secret-type=cluster -o json | jq '.items[].metadata.labels'
```

### Sync failed with "one or more objects failed to apply"
Get the full error message:
```bash
kubectl get application <app-name> -n argocd -o jsonpath='{.status.operationState.message}'
```

### Application synced but pods not running
ArgoCD sync succeeded but the workload is unhealthy. Check the target namespace:
```bash
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=50
```
diff --git a/docs/engineering/infra/deployments/cronjobs.mdx b/docs/engineering/infra/deployments/cronjobs.mdx
deleted file mode 100644
index bcc1ab9ee7..0000000000
--- a/docs/engineering/infra/deployments/cronjobs.mdx
+++ /dev/null
@@ -1,97 +0,0 @@
---
title: CronJobs (Restate Scheduled Tasks)
description: How to define and deploy scheduled tasks that trigger Restate service endpoints.
---

## Overview

We use Kubernetes CronJobs to trigger Restate service endpoints on a timer. Each cronjob is a tiny `curl` container that POSTs to the Restate Cloud ingress URL. Restate handles durable execution, retries, and idempotency — the cronjob just kicks it off.

## Adding a new cronjob

Edit the environment values file for the target environment(s):

- `eks-cluster/helm-chart/restate/environments/staging-eu-central-1.yaml`
- `eks-cluster/helm-chart/restate/environments/production001/us-east-1.yaml`

Add a new entry under `scheduledTasks.jobs`:

```yaml
scheduledTasks:
  jobs:
    my-new-job:
      schedule: "0 6 * * *"
      urlPath: 'hydra.v1.MyService/global/DoThing/send'
      idempotencyKey: 'my-job-$(date -u +%Y-%m-%d)'
      body: '{"someParam": "value"}'
```

Merge to main → ArgoCD deploys the new CronJob. Done.
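Under the hood, each generated CronJob boils down to a single `curl` POST against the Restate ingress. A rough sketch of the request shape for the example job above, with a placeholder ingress URL, and with the command echoed rather than executed so nothing is actually invoked:

```shell
INGRESS="https://ingress.example.restate.cloud"   # placeholder, not the real URL
URL_PATH="hydra.v1.MyService/global/DoThing/send"
IDEMPOTENCY_KEY="my-job-$(date -u +%Y-%m-%d)"
BODY='{"someParam": "value"}'

# Echo instead of executing, so the sketch runs anywhere
cmd="curl -sf -X POST ${INGRESS}/${URL_PATH} -H 'idempotency-key: ${IDEMPOTENCY_KEY}' -H 'content-type: application/json' -d '${BODY}'"
echo "$cmd"
```

In the real deployment the bearer token header and the actual ingress URL come from the `restate-cloud-credentials` secret.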
- -## Field reference - -| Field | Description | -|------------------|-----------------------------------------------------------------------------| -| `schedule` | Standard 5-field cron expression. **All times are UTC.** | -| `urlPath` | Restate service path: `hydra.v1.{Service}/{key}/{Handler}/send` | -| `idempotencyKey` | Sent as `idempotency-key` header. Use shell date expansion to make it unique per run. | -| `body` | JSON string payload for the service handler. | - -## URL path format - -``` -hydra.v1.{ServiceName}/{key}/{HandlerName}/send -``` - -- `/send` makes it an async (fire-and-forget) invocation -- `{key}` can be static (`global`) or dynamic (`$(date -u +%Y-%m-%d)`) - -## Idempotency keys - -Use shell date expansion to generate unique keys per execution window: - -| Pattern | Example output | Use for | -|------------------------------------|----------------------|----------------------| -| `$(date -u +%Y-%m-%d)` | `2026-03-18` | Daily jobs | -| `$(date -u +%Y-%m)` | `2026-03` | Monthly jobs | -| `$(date -u +%Y-%m-%dT%H:%M)` | `2026-03-18T14:30` | Sub-daily jobs | - -If Restate sees the same idempotency key twice, it deduplicates — pick a granularity that matches your schedule. 
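To see why granularity matters, compare a daily and a monthly key generated at the same moment (the `quota-check` prefix is borrowed from the jobs table below purely for illustration):

```shell
# Two runs inside the same UTC day produce the same daily key, so Restate
# deduplicates the second one; a monthly key would collapse the whole month
# into a single execution.
daily_a="quota-check-$(date -u +%Y-%m-%d)"
daily_b="quota-check-$(date -u +%Y-%m-%d)"
monthly="quota-check-$(date -u +%Y-%m)"

[ "$daily_a" = "$daily_b" ] && echo "duplicate within the day: deduplicated"
echo "daily key:   $daily_a"
echo "monthly key: $monthly"
```

Picking a key window wider than the schedule silently drops runs; narrower than the schedule defeats the dedup entirely.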
- -## Existing jobs - -| Job | Schedule | What it does | -|--------------------|-----------------|-------------------------------------------| -| `quota-check` | `0 15 * * *` | Daily quota verification | -| `cert-renewal` | `0 3 * * *` | Renews certificates expiring within 30 days | -| `key-refill` | `0 0 * * *` | Refills API key pools | -| `scale-down-idle` | `0 * * * *` / `*/15 * * * *` | Scales down idle preview deployments (hourly staging, every 15min prod) | -| `key-last-used-sync` | `* * * * *` | Syncs `lastUsedAt` timestamps from ClickHouse to MySQL every minute ([details](/architecture/services/control-plane/worker/workflows/key-last-used-sync)) | - -## Runtime details - -- **Image:** `curlimages/curl:8.12.1` -- **Timeout:** 300s per execution -- **Retries:** 3 attempts (`backoffLimit`) -- **Concurrency:** `Forbid` — won't start a new run if previous is still running -- **History:** 3 successful + 5 failed job records kept -- **Resources:** 10m CPU / 32Mi memory -- **Auth:** Bearer token from `restate-cloud-credentials` secret (AWS Secrets Manager) - -## Monitoring - -Prometheus alert `RestateCronJobFailing` fires after 15 min of failures: - -- **Staging:** `warning` -- **Production:** `critical` - -Config: `eks-cluster/helm-chart/observability/templates/prometheus-alerts-restate-cronjobs.yaml` - -## Files - -| File | Purpose | -|------|---------| -| `eks-cluster/helm-chart/restate/templates/cronjobs.yaml` | Helm template that generates CronJobs | -| `eks-cluster/helm-chart/restate/values.yaml` | Base values (jobs disabled by default) | -| `eks-cluster/helm-chart/restate/environments/*.yaml` | Per-environment job definitions | -| `eks-cluster/helm-chart/observability/templates/prometheus-alerts-restate-cronjobs.yaml` | Alerting rules | diff --git a/docs/engineering/infra/domain-connect.mdx b/docs/engineering/infra/domain-connect.mdx deleted file mode 100644 index 5429c179c4..0000000000 --- a/docs/engineering/infra/domain-connect.mdx +++ /dev/null @@ -1,130 
+0,0 @@ ---- -title: Domain Connect -description: How our one-click custom domain DNS setup works, and how it's wired together. ---- - -## What is Domain Connect? - -[Domain Connect](https://www.domainconnect.org/) is an open protocol that lets service providers (us) configure DNS records on a user's domain with one click, instead of asking them to copy-paste CNAME and TXT values manually. The user gets redirected to their DNS provider (e.g. Cloudflare), approves the changes, and the records are created automatically. - -## How it works in our stack - -``` -User adds custom domain in dashboard - ↓ -ctrl API: AddCustomDomain - 1. Generates CNAME target + verification token - 2. Looks up domain's nameservers - 3. Checks if the DNS provider supports Domain Connect - (via _domainconnect.{provider} TXT lookup) - 4. If yes: builds a signed redirect URL using our private key - 5. Stores domain + DC provider/URL in custom_domains table - ↓ -Dashboard shows "Automatic setup available" card - → User clicks "Connect" - → Redirected to DNS provider's consent page - → Provider creates CNAME + TXT records - → Redirects back to app.unkey.com/{workspace}/projects/{project}/settings - ↓ -Existing verification worker picks up the records (polls every 1 min) - → ACME certificate issued - → Frontline route created - → Domain is live -``` - -## Components - -### Template - -The Domain Connect template defines what DNS records we need. It lives in the public [Domain-Connect/templates](https://github.com/Domain-Connect/templates) repo as [`unkey.com.custom-domain.json`](https://github.com/Domain-Connect/templates/blob/master/unkey.com.custom-domain.json). 
- -| Setting | Value | Why | -|---------|-------|-----| -| `providerId` | `unkey.com` | Our provider identifier | -| `serviceId` | `custom-domain` | Service identifier | -| `hostRequired` | `true` | Subdomains only (apex uses manual setup) | -| `syncBlock` | `false` | Synchronous flow (required by Cloudflare) | -| `syncPubKeyDomain` | `domainconnect.unkey.com` | Where providers fetch our public key | -| `syncRedirectDomain` | `app.unkey.com` | Where providers redirect after approval | - -Records created: - -| Type | Host | Value | -|------|------|-------| -| CNAME | `@` | `%target%` (full CNAME target, e.g. `abc123.unkey-dns.com`) | -| TXT | `_unkey` | `unkey-domain-verify=%verificationToken%` | - -To update the template, open a PR against [Domain-Connect/templates](https://github.com/Domain-Connect/templates). Use the [dc-template-linter](https://github.com/Domain-Connect/dc-template-linter) to validate: `dc-template-linter -cloudflare unkey.com.custom-domain.json`. - -### Signing keypair - -Domain Connect requires all requests to be digitally signed (RS256). We have an RSA keypair: - -- **Public key**: published as DNS TXT records at `_dcpubkeyv1.domainconnect.unkey.com`, split into two parts (`p=1` and `p=2`) due to TXT record size limits -- **Private key**: stored in AWS Secrets Manager under `unkey/control` as `UNKEY_DOMAIN_CONNECT_PRIVATE_KEY` (PEM format) - -The DNS records are managed via Pulumi in [`infra/pulumi/projects/dns/unkey-com/main.go`](https://github.com/unkeyed/infra/blob/main/pulumi/projects/dns/unkey-com/main.go). - -### Discovery library - -We use [`railwayapp/domainconnect-go`](https://pkg.go.dev/github.com/railwayapp/domainconnect-go) which handles: -- DNS provider discovery (NS lookup → `_domainconnect.{provider}` TXT check) -- Sync URL construction with all required parameters -- RS256 signing with our private key - -The wrapper is in `pkg/dns/domainconnect/discover.go`. 
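The two-part public key mentioned under "Signing keypair" exists because a single TXT string tops out at 255 characters, while a base64-encoded DER public key is longer than that. A sketch of the split using a stand-in string instead of a real key:

```shell
# Stand-in for the base64-encoded DER public key (real 2048-bit keys are ~390+ chars)
pubkey=$(printf 'A%.0s' $(seq 1 400))

# Split at the 255-character TXT string limit: part p=1 and part p=2
p1=$(printf '%s' "$pubkey" | cut -c1-255)
p2=$(printf '%s' "$pubkey" | cut -c256-)

echo "part p=1: ${#p1} chars"
echo "part p=2: ${#p2} chars"

# Providers reassemble the parts in order, so they must round-trip exactly
[ "${p1}${p2}" = "$pubkey" ] && echo "round-trip ok"
```

When rotating keys, both parts must be updated together or providers will reassemble a corrupt key.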
- -### Code - -Discovery and signing live in `pkg/dns/domainconnect/`. The ctrl service calls `Discover()` during `AddCustomDomain` and persists the result. If no private key is configured, Domain Connect is silently disabled. - -## Supported DNS providers - -Any provider that publishes a `_domainconnect.{provider-domain}` TXT record is automatically supported. As of now: - -| Provider | Notes | -|----------|-------| -| Cloudflare | Template onboarded via email to domain-connect@cloudflare.com | -| Vercel DNS | Auto-discovered | -| DigitalOcean, Name.com, Hostinger, Dynadot, Namesilo | Use Cloudflare under the hood | -| IONOS | Own Domain Connect endpoint | - -To onboard with a new provider, they need to pull our template from the templates repo. Some providers (like Cloudflare and Vercel) require manual registration via email. - -## Key rotation - -If you need to rotate the signing keypair: - -1. Generate a new keypair: -```bash -openssl genrsa -out domain-connect-private.pem 2048 -openssl rsa -in domain-connect-private.pem -pubout -outform DER \ - | base64 | tr -d '\n' > domain-connect-pubkey.b64 -``` - -2. Update the DNS TXT records at `_dcpubkeyv1.domainconnect.unkey.com` with the new public key chunks - -3. Update `UNKEY_DOMAIN_CONNECT_PRIVATE_KEY` in AWS Secrets Manager (`unkey/control`) - -4. Restart ctrl service to pick up the new key - -Existing signed URLs in the database will become invalid. Users will need to re-add their domain to get a new signed URL. 
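Before updating DNS in step 2, it's worth sanity-checking that the `.b64` file you're about to publish actually corresponds to the new private key. A sketch using a throwaway keypair, assuming `openssl` is available and reusing the file names from the rotation commands above:

```shell
# Generate a throwaway keypair the same way as the rotation steps
openssl genrsa -out domain-connect-private.pem 2048 2>/dev/null
openssl rsa -in domain-connect-private.pem -pubout -outform DER 2>/dev/null \
  | base64 | tr -d '\n' > domain-connect-pubkey.b64

# Re-derive the public key from the private key and compare; a mismatch means
# the DNS records would be updated with the wrong key
derived=$(openssl rsa -in domain-connect-private.pem -pubout -outform DER 2>/dev/null \
  | base64 | tr -d '\n')

if [ "$derived" = "$(cat domain-connect-pubkey.b64)" ]; then
  echo "public key matches private key"
fi
```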

## Verifying the setup

### Check public key is published
```bash
dig TXT _dcpubkeyv1.domainconnect.unkey.com
```

### Verify a signature
Go to [exampleservice.domainconnect.org/sig](https://exampleservice.domainconnect.org/sig), enter:
- **Key**: `_dcpubkeyv1`
- **Domain**: `domainconnect.unkey.com`
- Paste the query string and signature from a generated URL

### Validate template
```bash
go install github.com/Domain-Connect/dc-template-linter@latest
dc-template-linter -cloudflare unkey.com.custom-domain.json
```
diff --git a/docs/engineering/infra/infra-work.mdx b/docs/engineering/infra/infra-work.mdx
deleted file mode 100644
index d925f12d19..0000000000
--- a/docs/engineering/infra/infra-work.mdx
+++ /dev/null
@@ -1,175 +0,0 @@
---
title: Working on Infrastructure
description: Planning, promotions, and production deploys.
---

Since there are only three of us working on infra, spread across timezones, how we plan, communicate, and ship matters more than it would on a colocated team. This document covers how we should approach infra work — from planning through to production.

## Planning work

Not everything in infra is a quick fix. Rolling out to a new region, changing how networking works, adjusting capacity, rethinking cost — these are changes that ripple. They affect multiple clusters, multiple services, and sometimes multiple AWS accounts. They can't be developed and shipped in a day, and they shouldn't be.

Before starting significant work, write up what you're planning and get it in front of the rest of the team. This doesn't need to be formal — a GitHub issue, a doc, even a detailed Slack thread in **#infrastructure** — but it needs to exist somewhere the others can read it async and weigh in. The goal is to make sure we're aligned before anyone starts writing code, not after it's already in a PR.
- -Things worth planning together: - -- **New region rollouts** — cluster provisioning, secret replication, DNS, gossip functionality, environment files for every chart. There's a checklist but it still touches a lot of surface area. -- **Capacity changes** — node group sizing, scaling limits, instance types. These have cost implications and we must be deliberate about them. -- **Cost-related changes** — anything that changes what we're paying for or how much. Reserved instances, savings plans, storage classes, new managed services. -- **Architectural changes** — swapping out a component, changing how traffic routes, adding or removing a dependency. These are the ones where getting a second opinion early saves the most time. -- **Cross-cutting changes** — anything that touches the promotion files, ArgoCD ApplicationSets, or shared Helm chart structure. If it affects how we deploy, we all need to know. - -The point isn't to create bureaucracy. It's that the three of us are rarely online at the same time, and it's a lot cheaper to catch a problem in a plan than in a rollback. - -## Promotions - -This is how we get code from a PR into staging and production. The process is lightweight on purpose — most of the time it stays out of your way. The few rules that do exist are there because most of us have been on the wrong end of a production incident with no context, and none of us want to be there again. - -### Why we do it this way - -A common alternative is a long-running staging branch where merging staging into main means "deploy to production." That works until it doesn't. If five changes land on staging and one of them is broken, we're stuck — we can't promote the four good ones without also shipping the bad one or doing **a lot** of `git cherry-pick`in'. Everything queues up behind the fix to the one bad change. 
The branches drift apart, merge conflicts accumulate, and rolling back means untangling which of a dozen changes in a staging→main merge caused the problem.

By pinning production to a specific commit SHA, staging and production are decoupled. A broken change on staging doesn't touch production at all — production is still sitting on the last known-good SHA while we sort things out. We pick exactly which commit to promote and when, per component if we need to. Rolling back is one revert commit. And there's no second branch to maintain — everything lives on main.

### How it works

Every environment has a promotion file at `promotions/<environment>/<component>.yaml` that controls which Git revision ArgoCD deploys. ArgoCD ApplicationSets read these files and set `targetRevision` on each Application.

- **Staging** tracks `main` — every merge to main auto-deploys within minutes.
- **Production** is pinned to a specific commit SHA. Deploying to production requires explicitly updating this SHA and getting the change reviewed.

### Merging to main

Merges to main **do not require a review** unless the change modifies a file under `eks-cluster/promotions/`. Everything merged to main deploys to staging automatically. We'll put some kind of gating mechanism in place eventually as the infrastructure stabilizes.

### Why good PRs and commit messages matter

A merge to main is the unit of work that eventually gets promoted to production. The PR title, description, and commit messages become the primary audit trail for everything that runs in production.

When something goes wrong at 2am, the first thing we do is check what changed recently. The promotion file points to a SHA, that SHA points to a merge commit, and that merge commit points back to a PR. If the PR says "fix stuff" with no description, whoever is triaging has to read every line of the diff to figure out what changed and whether it could be related.
If the PR clearly explains what was changed, why, and what it affects, we can make that call in seconds.

This matters beyond incidents too. When reviewing a promotion PR, we're deciding whether this change is safe for production. We follow the SHA back to the original PR to understand what we're approving. A well-documented PR lets us review with confidence. A poorly documented one forces us to either rubber-stamp it or spend time reverse-engineering the intent from code.

**Write PRs into main as if one of us will need to triage a production issue using only the PR title and description.** Include:

- What changed and why
- What services or components are affected
- Any risks or things to watch after deployment
- Links to related issues, docs, or prior discussion

### Deploying to production

Production promotions require review. It's on whoever is promoting to make the reviewer's job easy — link back to the original PR, use the right SHA, and make sure the context is all there. If the reviewer has to go digging to understand what they're approving, we've already failed at the process.

#### Step by step

1. **Merge your change to main.** This __really__ should be a PR, but if the change is well-documented — clear title, a description of what changed and why, any relevant context — you can push directly to main. A PR is required for production promotions, always. Violators will be prosecuted to the fullest extent of the law.

2. **Verify it works in staging.** Wait for ArgoCD to deploy the change to staging and confirm it behaves correctly.

3. **Get the commit SHA from the merged PR.** This is the merge commit on main that you validated in staging:

   ```bash
   git log --oneline main
   ```

4. **Run the promote script.** Prefer promoting the specific component you changed over blanket-promoting the whole environment:

   ```bash
   # Promote a single component
   ./scripts/promote production001 frontline <sha>
   ```

5.
**Commit, push, and open a promotion PR.** The promotion PR should:

   - Set the `revision` to the SHA of the change being promoted
   - Link to the original PR that introduced the change (the one merged to main), or write up an explanation
   - Be scoped narrowly — one promotion per PR when possible

   Example commit message and PR body:

   ```
   promote production001 to 311420

   Promotes https://github.com/unkeyed/infra/pull/311 to production.
   ```

6. **Get a review and merge.** You've done the work to make this easy to review — the reviewer just needs to confirm the SHA matches what was tested in staging and that the linked PR tells the full story.

7. **ArgoCD picks up the change** and deploys.

#### What the reviewer should check

- The `revision` value is a real commit SHA (not a branch name)
- The promotion PR links to the original PR that introduced the change
- That change was already deployed and validated in staging
- The scope makes sense (all components vs. single component)

### Rolling back

Revert the promotion commit and push:

```bash
git revert <promotion-commit-sha>
git push
```

ArgoCD rolls back to the prior revision within a few minutes.

### The `promote` script

```bash
# Promote all components in an environment
./scripts/promote <environment> <revision>

# Override a single component
./scripts/promote <environment> <component> <revision>

# Clear a single-component override
./scripts/promote <environment> <component> --clear

# Regenerate per-component files without changing anything
./scripts/promote <environment> --generate
```

The script updates the root promotion file (`promotions/<environment>.yaml`) and regenerates per-component files that ArgoCD reads. You still need to commit and push the result.
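A component override pins a single component to a different revision than the environment default. The lookup rule (override first, environment-wide `revision` second) can be sketched in a few lines of shell; this is only an illustration of the semantics, not the actual promote script or ApplicationSet logic:

```shell
# A promotion file with an environment-wide revision and one component override
promotion='revision: 1d9c3076d63027f5fa770b43d08a4453318b2f8e
overrides:
  frontline: abc123'

# Effective revision for a component: its override if present, else the default
resolve() {
  o=$(printf '%s\n' "$promotion" | awk -v k="$1:" '$1 == k && k != "revision:" { print $2 }')
  if [ -n "$o" ]; then
    echo "$o"
  else
    printf '%s\n' "$promotion" | awk '$1 == "revision:" { print $2 }'
  fi
}

echo "frontline -> $(resolve frontline)"   # the override wins
echo "core      -> $(resolve core)"        # falls back to the default revision
```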

### Component overrides

You can pin a single component to a different revision than the environment default:

```bash
# Pin frontline to a specific SHA in production
./scripts/promote production001 frontline abc123

# Clear the override when done
./scripts/promote production001 frontline --clear
```

Overrides are stored in the root promotion file:

```yaml
revision: 1d9c3076d63027f5fa770b43d08a4453318b2f8e

overrides:
  frontline: abc123
```

Components without overrides use the default `revision`.

### Currently valid components

```
argocd         control-api   control-worker   core
external-dns   frontline     krane            networking
observability  reloader      restate          runtime
sentinel       thanos        vault            vector-logs
```

---

Hopefully, none of this is heavy process. It's a few small habits — plan before building, write a good PR description, link the SHA, get a review for production — that help create a codebase where any of us can pick up the thread at any point and understand what's running and why.

When you need a promotion reviewed and nobody is online yet, or you want a second opinion on an approach before you start, use **#infrastructure** on Slack. It's a lot easier to coordinate async when plans and PRs are well-documented — and when they're not, things fall through the cracks and get harder for everyone.
diff --git a/docs/engineering/infra/legacy-2025/github-actions-deploy-role-setup.mdx b/docs/engineering/infra/legacy-2025/github-actions-deploy-role-setup.mdx
deleted file mode 100644
index be1dc2b443..0000000000
--- a/docs/engineering/infra/legacy-2025/github-actions-deploy-role-setup.mdx
+++ /dev/null
@@ -1,119 +0,0 @@
---
title: Configuring GitHubActionsDeployRole
description: Creating the GitHubActionsDeployRole IAM role.
---

This replaces `UnkeyPulumiAWSExecutor` as we deprecate Pulumi. The trust policies are already created in this repo, so this is mostly just running commands.
- -## Prerequisites - -Grab the `Basic ~/.aws/config for AdministratorAccess` from [1password](https://engineering.unkey.com/infrastructure/1password). - -## Creating the role in each account - -The trust policy files are already in this directory (`github-actions-deploy-role-{sandbox,canary,production001}-trust-policy.json`). They allow the `GitHubActionsOIDCRole` from the management account and the `AdministratorAccess` SSO role to assume this role. - -Create the role in each account... - -``` -for account in sandbox canary production001; do - aws iam create-role \ - --profile "unkey-${account}-admin" \ - --role-name GitHubActionsDeployRole \ - --assume-role-policy-document file://docs/github-actions-deploy-role-${account}-trust-policy.json \ - --no-cli-pager -done -``` - -Now create and attach the permissions policy. This is the same across all accounts. - -``` -for account in sandbox canary production001; do - POLICY_ARN=$(aws iam create-policy \ - --profile "unkey-${account}-admin" \ - --policy-name GitHubActionsDeployPolicy \ - --policy-document file://docs/github-actions-deploy-role-policy.json \ - --query 'Policy.Arn' --output text) - - aws iam attach-role-policy \ - --profile "unkey-${account}-admin" \ - --role-name GitHubActionsDeployRole \ - --policy-arn "${POLICY_ARN}" \ - --no-cli-pager -done -``` - -If you need to update the policy later, create a new version... - -``` -for account in sandbox canary production001; do - POLICY_ARN="arn:aws:iam::$(aws sts get-caller-identity --profile "unkey-${account}-admin" --query Account --output text):policy/GitHubActionsDeployPolicy" - - aws iam create-policy-version \ - --profile "unkey-${account}-admin" \ - --policy-arn "${POLICY_ARN}" \ - --policy-document file://docs/github-actions-deploy-role-policy.json \ - --set-as-default \ - --no-cli-pager -done -``` - -## Update the management account - -The `GitHubActionsOIDCRole` needs permission to assume the new role. Create a new cross-account policy for it... 
- -``` -aws iam create-policy \ - --profile unkey-root-admin \ - --policy-name GitHubActionsDeployCrossAccount \ - --policy-document file://docs/github-actions-deploy-role-cross-account-policy.json - -aws iam attach-role-policy \ - --profile unkey-root-admin \ - --role-name GitHubActionsOIDCRole \ - --policy-arn "arn:aws:iam::333769656712:policy/GitHubActionsDeployCrossAccount" -``` - -## EKS access - -For kubectl to work, the role needs an EKS access entry. Do this for each cluster you want to deploy to (sorry about the names lol) - -For `beautiful-dance-crab` in eu-central-1... - -``` -aws eks create-access-entry \ - --cluster-name beautiful-dance-crab \ - --principal-arn arn:aws:iam::222634365038:role/GitHubActionsDeployRole \ - --type STANDARD \ - --region eu-central-1 \ - --profile unkey-production001-admin - -aws eks associate-access-policy \ - --cluster-name beautiful-dance-crab \ - --principal-arn arn:aws:iam::222634365038:role/GitHubActionsDeployRole \ - --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \ - --access-scope type=cluster \ - --region eu-central-1 \ - --profile unkey-production001-admin -``` - -For `adorable-jazz-gopher` in us-east-1... - -``` -aws eks create-access-entry \ - --cluster-name adorable-jazz-gopher \ - --principal-arn arn:aws:iam::222634365038:role/GitHubActionsDeployRole \ - --type STANDARD \ - --region us-east-1 \ - --profile unkey-production001-admin - -aws eks associate-access-policy \ - --cluster-name adorable-jazz-gopher \ - --principal-arn arn:aws:iam::222634365038:role/GitHubActionsDeployRole \ - --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \ - --access-scope type=cluster \ - --region us-east-1 \ - --profile unkey-production001-admin -``` - -For additional clusters, just change `--cluster-name` and `--region`. 
diff --git a/docs/engineering/infra/legacy-2025/github-oidc.mdx b/docs/engineering/infra/legacy-2025/github-oidc.mdx
deleted file mode 100644
index d2b2882cf2..0000000000
--- a/docs/engineering/infra/legacy-2025/github-oidc.mdx
+++ /dev/null
@@ -1,226 +0,0 @@
---
title: Configuring GitHub OIDC
description: GitHub Actions OIDC setup with AWS.
---

## Methods for getting AWS account IDs

Grab the `Basic ~/.aws/config for AdministratorAccess` from [1password](https://engineering.unkey.com/infrastructure/1password).

## In the management/root account

```
aws iam create-open-id-connect-provider \
  --profile unkey-root-admin \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
```

To get the ARN for the OIDC provider... (you don't need to do this as it's embedded in the policy below, but it's here for reference)

```
aws iam list-open-id-connect-providers --profile unkey-root-admin --output text --no-cli-pager
```

Create a `github-actions-trust-policy.json` for the GitHub Actions role in the root/management account, scoped to the `unkeyed/infra:*` repo for all (`*`) branches. This already has the ARN from the above command in place.

```
cat > github-actions-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::333769656712:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:unkeyed/infra:*"
        }
      }
    }
  ]
}
EOF

aws iam create-role \
  --profile unkey-root-admin \
  --role-name GithubActionsOIDCRole \
  --assume-role-policy-document file://github-actions-trust-policy.json \
  --no-cli-pager
```

Then create a `cross-account-policy.json` that lets this role assume the `UnkeyPulumiAWSExecutor` role in each workload account (repeat the resource ARN for each `<account-id>`):

```
cat > cross-account-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<account-id>:role/UnkeyPulumiAWSExecutor"
    }
  ]
}
EOF
```

Each account has a `pulumi-executor-<account>-trust-policy.json` we'll create... this policy says that anyone with the `AWSReservedSSO_AdministratorAccess_*` role from `<account>` and the `GithubActionsOIDCRole` from the root/management account can assume this role.

```
# I'm sorry for this bash...
for account in sandbox canary production001; do
ROLE_ID=$(aws iam list-roles \
  --profile "unkey-${account}-admin" \
  --query "Roles[?contains(RoleName, 'AWSReservedSSO_AdministratorAccess')].{Arn:Arn}" --output text)
cat > "pulumi-executor-${account}-trust-policy.json" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "${ROLE_ID}",
          "arn:aws:iam::333769656712:role/GithubActionsOIDCRole"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --profile "unkey-${account}-admin" \
  --role-name UnkeyPulumiAWSExecutor \
  --assume-role-policy-document "file://pulumi-executor-${account}-trust-policy.json" \
  --no-cli-pager
done
```

Next, create the `unkey-pulumi-policy.json` permissions policy. It's the same for every account...

```
cat > unkey-pulumi-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:*",
        "cloudwatch:*",
        "ec2:*",
        "ecr:*",
        "ecs:*",
        "elasticache:*",
        "elasticloadbalancing:*",
        "globalaccelerator:*",
        "iam:AttachRolePolicy",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:PassRole",
        "iam:PutRolePolicy",
        "iam:ListRolePolicies",
        "iam:ListAttachedRolePolicies",
        "iam:ListInstanceProfilesForRole",
        "kms:*",
        "logs:*",
        "ssm:*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
```

Now add the policy to each account and attach it to the role...

```
for account in sandbox canary production001; do
POLICY_ARN=$(aws iam create-policy \
  --profile "unkey-${account}-admin" \
  --policy-name UnkeyPulumiPolicy \
  --policy-document file://unkey-pulumi-policy.json \
  --query 'Policy.Arn' --output text)

# Attach the policy to the role
aws iam attach-role-policy \
  --profile "unkey-${account}-admin" \
  --no-cli-pager \
  --role-name UnkeyPulumiAWSExecutor \
  --policy-arn "${POLICY_ARN}";
done
```
diff --git a/docs/engineering/infra/legacy-2025/pulumi-ecs-fargate-api.png b/docs/engineering/infra/legacy-2025/pulumi-ecs-fargate-api.png
deleted file mode 100644
index 0d81a3f17f..0000000000
Binary files a/docs/engineering/infra/legacy-2025/pulumi-ecs-fargate-api.png and /dev/null differ
diff --git a/docs/engineering/infra/legacy-2025/pulumi-iac-esc.mdx b/docs/engineering/infra/legacy-2025/pulumi-iac-esc.mdx
deleted file mode 100644
index 6beacfe9ef..0000000000
--- a/docs/engineering/infra/legacy-2025/pulumi-iac-esc.mdx
+++ /dev/null
@@
-1,101 +0,0 @@
---
title: How Pulumi IaC and ESC Work Together
description: Pulumi stacks, ESC environments, and secrets.
---

## Prerequisites

Install the Pulumi and ESC CLIs:

```
brew update && \
brew install pulumi/tap/esc pulumi/tap/pulumi
```

## Creating a stack and environment

The naming convention for a stack and environment is `unkey/<project>/<aws_account_name>-<aws_region>`.

This document uses the `api` project as an example. Note that there exists a `global` environment for each AWS account to hold configuration items that don't change across regions. Right now this is only the DSN for connecting to Planetscale. We can override these at the region level but we don't need that quite yet.

```
project: api
aws_account_name: canary
aws_region_shorthand: use1
```

Let's start by creating a stack and environment.

```
# Create the stack
pulumi stack init unkey/api/aws-canary-us-east-1

# Create the ESC global environment
esc env init unkey/api/canary-global

# Create the ESC region environment
esc env init unkey/api/aws-canary-us-east-1
```

## Setting secrets

To set a secret named `databasePrimaryDsn` for the `unkey/api/canary-global` environment, you would execute:

```
esc env set --secret unkey/api/canary-global pulumiConfig.api:databasePrimaryDsn "thesecretgoeshere"
```

## Configure the stack to use a secret

In the `projects/api` directory there are a number of `Pulumi.*.yaml` files. Each file name embeds the stack name — the part between `Pulumi.` and `.yaml`.

For example in `Pulumi.aws-canary-us-east-1.yaml` we have:
```
imports:
  - api/canary-global

environment:
  - api/aws-canary-us-east-1

# This is blank as it's inherited via the environment in Pulumi but this can be used
# should we want to track any of these values in git
values:
  pulumiConfig:
```
which imports the `api/canary-global` environment, and sets the stack's environment to `api/aws-canary-us-east-1`.
See [Importing other environments](https://www.pulumi.com/docs/esc/environments/imports/) for more information.

In the Go code, you reference the secret by first creating a config object and then reading the value into a variable, like so:
```go
func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// config holds configuration from Pulumi IaC and ESC
		config := config.New(ctx, "")

		databasePrimaryDsn := config.RequireSecret("databasePrimaryDsn")

		// rest of IaC...
		return nil
	})
}
```

## Inspecting config/secret values

At the global level you can see we have only the `databasePrimaryDsn` defined...

```
$ pulumi env open unkey/api/canary-global -f yaml
pulumiConfig:
  api:databasePrimaryDsn: [secret]
```

But when you render the environment for the `aws-canary-us-east-1` stack, you see the global environment's config merged into the region-specific values:

```
$ pulumi env open unkey/api/aws-canary-us-east-1 -f yaml
pulumiConfig:
  api:databasePrimaryDsn: [secret]
  aws:profile: unkey-canary-admin
  aws:region: us-east-1
```
diff --git a/docs/engineering/infra/legacy-2025/pulumi-infrastructure-architecture.mdx b/docs/engineering/infra/legacy-2025/pulumi-infrastructure-architecture.mdx
deleted file mode 100644
index 9ed620e022..0000000000
--- a/docs/engineering/infra/legacy-2025/pulumi-infrastructure-architecture.mdx
+++ /dev/null
@@ -1,638 +0,0 @@
---
title: Unkey Infrastructure as Code Documentation
description: Pulumi AWS infrastructure architecture and workflows.
---

## Table of Contents
* [Unkey Infrastructure as Code Documentation](#unkey-infrastructure-as-code-documentation)
  * [Table of Contents](#table-of-contents)
  * [Overview](#overview)
  * [Architecture](#architecture)
  * [Authentication & Authorization Flow](#authentication--authorization-flow)
    * [GitHub OIDC Authentication Explained](#github-oidc-authentication-explained)
  * [Setup & Configuration](#setup--configuration)
    * [GitHub OIDC Configuration](#github-oidc-configuration)
    * [Cross-Account Access Policy](#cross-account-access-policy)
    * [Executor Role Trust Policy](#executor-role-trust-policy)
    * [Pulumi Permissions Policy](#pulumi-permissions-policy)
  * [Pulumi Environment, Secrets and Configuration (ESC)](#pulumi-environment-secrets-and-configuration-esc)
    * [ESC Configuration items](#esc-configuration-items)
      * [Required Configuration Items](#required-configuration-items)
      * [Required Secret Configuration Items](#required-secret-configuration-items)
    * [Environment-Specific Items from ESC](#environment-specific-items-from-esc)
      * [Global Environment (e.g., unkey/api/canary-global)](#global-environment-eg-unkeyapicanary-global)
      * [Regional Environment (e.g., unkey/api/aws-canary-us-east-1)](#regional-environment-eg-unkeyapiaws-canary-us-east-1)
  * [Configuration for Database Password Resources](#configuration-for-database-password-resources)
  * [Example Stack Configuration](#example-stack-configuration)
  * [Environment Configurations](#environment-configurations)
    * [Working with Stacks and Environments](#working-with-stacks-and-environments)
    * [Setting Secrets](#setting-secrets)
    * [Accessing Configuration in Code](#accessing-configuration-in-code)
  * [Role Assumption in Pulumi Code](#role-assumption-in-pulumi-code)
  * [Deployed Resources](#deployed-resources)
  * [Making Changes to Infrastructure](#making-changes-to-infrastructure)
    * [Automated Deployment via GitHub Actions](#automated-deployment-via-github-actions)
      * [Prerequisites](#prerequisites)
      * [Workflow](#workflow)
      * [Troubleshooting CI/CD Deployments](#troubleshooting-cicd-deployments)
    * [Manual Deployment by us hoomans](#manual-deployment-by-us-hoomans)
      * [Prerequisites](#prerequisites-1)
      * [Workflow](#workflow-1)
    * [Managing Secrets Manually](#managing-secrets-manually)
      * [Common Manual Operations](#common-manual-operations)
      * [Best Practices for Manual Deployments](#best-practices-for-manual-deployments)
  * [Common Workflows](#common-workflows)
    * [Adding a New Secret](#adding-a-new-secret)
    * [Working with Multiple AWS Accounts through automation](#working-with-multiple-aws-accounts-through-automation)
    * [Troubleshooting](#troubleshooting)
## Overview

This documentation provides a comprehensive guide to Unkey's infrastructure setup using Pulumi for AWS deployments. The infrastructure is designed with a multi-account architecture, leveraging GitHub Actions for CI/CD, and implements cross-account role assumption for secure and scalable infrastructure management.

## Architecture

Unkey's infrastructure is organized into multiple AWS accounts:

- **Management/Root account**: Central account for authentication and cross-account access management
- **Sandbox account**: Development environment
- **Canary account**: Testing environment
- **Production account**: Production environment

Each account has a dedicated role (`UnkeyPulumiAWSExecutor`) that Pulumi can assume to deploy resources. GitHub Actions workflows use OIDC authentication to assume a role in the management account, which then assumes the executor roles in target accounts.

## Authentication & Authorization Flow
### GitHub OIDC Authentication Explained

OpenID Connect (OIDC) integration between GitHub Actions and AWS works as follows:

1. **Token-based Authentication**: Instead of storing long-lived AWS credentials as secrets in GitHub, GitHub Actions generates a short-lived OIDC token during workflow runs.

2. **Trust Relationship**: AWS is configured to trust GitHub Actions as an identity provider. The trust is established through:
   - An OIDC provider in AWS IAM that points to `token.actions.githubusercontent.com`
   - A role (`GitHubActionsOIDCRole`) with a trust policy that validates the OIDC token

3. **Conditional Access**: The trust policy includes conditions that verify:
   - The token audience (`aud`) is `sts.amazonaws.com`
   - The token subject (`sub`) matches `repo:unkeyed/infra:*`, meaning it came from workflows in the specified repository

4. **Role Assumption**: When the GitHub Actions workflow runs, it:
   - Requests an OIDC token from GitHub's token issuer
   - Sends this token to AWS STS (Security Token Service) using the `AssumeRoleWithWebIdentity` API
   - If the token is valid and meets the conditions, AWS returns temporary credentials

This approach eliminates the need for storing AWS access keys, enhances security by using short-lived credentials, and simplifies credential management.

The full authentication flow is:

1. GitHub Actions authenticates via OIDC to AWS
2. GitHub Actions assumes the `GitHubActionsOIDCRole` in the Management Account
3. The `GitHubActionsOIDCRole` assumes the `UnkeyPulumiAWSExecutor` role in the Target Account
4. Pulumi uses the assumed role to deploy AWS resources in the Target Account
## Setup & Configuration

### GitHub OIDC Configuration
The management account is configured with an OIDC provider for GitHub Actions, allowing secure authentication without storing long-lived credentials:
Example policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::333769656712:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:unkeyed/infra:*"
        }
      }
    }
  ]
}
```
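Read as a sketch, the two `Condition` blocks above amount to an equality check on the audience claim and a trailing-wildcard (i.e. prefix) match on the subject claim. This illustrative Go helper mimics that check; it is not AWS's actual policy evaluator, which runs server-side inside STS:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesTrust mimics the trust policy's conditions: StringEquals on the
// audience, and StringLike against "repo:unkeyed/infra:*" on the subject.
// A trailing "*" wildcard reduces to a simple prefix match.
func matchesTrust(aud, sub string) bool {
	if aud != "sts.amazonaws.com" {
		return false
	}
	return strings.HasPrefix(sub, "repo:unkeyed/infra:")
}

func main() {
	fmt.Println(matchesTrust("sts.amazonaws.com", "repo:unkeyed/infra:ref:refs/heads/main")) // true
	fmt.Println(matchesTrust("sts.amazonaws.com", "repo:someone-else/infra:ref:refs/heads/main")) // false
}
```

Note that the subject condition is what pins token exchange to workflows in the `unkeyed/infra` repository — any other repo's token fails the `sub` match even with a valid audience.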
### Cross-Account Access Policy

The `CrossAccountAssumeRole` policy attached to the `GitHubActionsOIDCRole` allows it to assume the `UnkeyPulumiAWSExecutor` role in target accounts:
Example policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": [
        "arn:aws:iam::343218208612:role/UnkeyPulumiAWSExecutor",
        "arn:aws:iam::920373003756:role/UnkeyPulumiAWSExecutor",
        "arn:aws:iam::222634365038:role/UnkeyPulumiAWSExecutor"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeAvailabilityZones", "ec2:DescribeRegions"],
      "Resource": "*"
    }
  ]
}
```
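The `Resource` entries above follow the standard IAM ARN layout, `arn:partition:service:region:account-id:resource`, so the target account is always the fifth colon-separated field. A small illustrative helper (not part of the repo) for pulling the account ID out of an executor role ARN:

```go
package main

import (
	"fmt"
	"strings"
)

// accountID extracts the 12-digit account ID from an IAM role ARN.
// ARNs have the fixed shape arn:partition:service:region:account-id:resource,
// so the account is field index 4 after splitting on ":".
func accountID(arn string) string {
	parts := strings.Split(arn, ":")
	if len(parts) < 6 {
		return "" // malformed ARN
	}
	return parts[4]
}

func main() {
	fmt.Println(accountID("arn:aws:iam::343218208612:role/UnkeyPulumiAWSExecutor")) // 343218208612
}
```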
### Executor Role Trust Policy

The `UnkeyPulumiAWSExecutor` role in each target account has a trust policy allowing assumption by the `GitHubActionsOIDCRole` from the management account and the Administrator role in the target account:
Example policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::222634365038:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_AdministratorAccess_231cb3ec4a4e5945",
          "arn:aws:iam::333769656712:role/GitHubActionsOIDCRole"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
### Pulumi Permissions Policy

The `UnkeyPulumiPolicy` attached to the `UnkeyPulumiAWSExecutor` role grants permissions to manage AWS resources:
`UnkeyPulumiPolicy`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:*",
        "cloudwatch:*",
        "ec2:*",
        "ecr:*",
        "ecs:*",
        "elasticache:*",
        "elasticloadbalancing:*",
        "globalaccelerator:*",
        "iam:AttachRolePolicy",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:PassRole",
        "iam:PutRolePolicy",
        "iam:ListRolePolicies",
        "iam:ListAttachedRolePolicies",
        "iam:ListInstanceProfilesForRole",
        "kms:*",
        "logs:*",
        "route53:*",
        "ssm:*"
      ],
      "Resource": "*"
    }
  ]
}
```
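Note the shape of the action list: broad service-level wildcards like `ec2:*`, but individually enumerated `iam:` actions rather than `iam:*`, which keeps the executor from managing arbitrary IAM principals. As an illustrative sketch (not the real IAM evaluator), the wildcard semantics against such a list look like:

```go
package main

import (
	"fmt"
	"strings"
)

// actionAllowed reports whether an action is covered by a policy's Action
// list, supporting the trailing-* wildcard form IAM uses (e.g. "ec2:*").
// Illustrative only — real evaluation happens inside AWS IAM.
func actionAllowed(actions []string, action string) bool {
	for _, a := range actions {
		if a == action {
			return true
		}
		if strings.HasSuffix(a, "*") && strings.HasPrefix(action, strings.TrimSuffix(a, "*")) {
			return true
		}
	}
	return false
}

func main() {
	policy := []string{"ec2:*", "iam:CreateRole", "iam:PassRole"}
	fmt.Println(actionAllowed(policy, "ec2:DescribeInstances")) // true: covered by ec2:*
	fmt.Println(actionAllowed(policy, "iam:CreateUser"))        // false: iam actions are enumerated
}
```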
## Pulumi Environment, Secrets and Configuration (ESC)

Unkey uses Pulumi ESC to manage configuration and secrets across environments. The naming convention for stacks and environments follows a pattern:

```
unkey/<project>/<cloud>-<account>-<region>
```

Example:

- Project: `api`
- Cloud: `aws`
- AWS Account: `canary`
- Region: `us-east-1`
- Stack name: `unkey/api/aws-canary-us-east-1`
- Environment: `unkey/api/aws-canary-us-east-1`

There is also a global environment for each AWS account to hold configuration items that don't change across regions: `unkey/api/canary-global`.

### ESC Configuration items

The following configuration items are explicitly required in the Pulumi code and would cause a panic if not set. These are retrieved using `config.Require()`, which panics if the value is not found.

#### Required Configuration Items

| Config Key | Description | Usage |
|------------|-------------|-------|
| `roleToAssumeARN` | ARN of the AWS role to assume for deployments | Used to create the privileged AWS provider for cross-account access |
| `cidrBlock` | CIDR block for the VPC | Defines the IP address range for the VPC |
| `hostedZoneID` | Route53 hosted zone ID | Used for DNS record creation for certificate validation |
| `certificateDomain` | Domain name for the SSL certificate | Used to create and validate the ACM certificate |
| `awsRegion` | AWS region for deployment | Used in container environment variables |
| `clickhouseUrl` | URL for ClickHouse database | Passed as environment variable to containers |

#### Required Secret Configuration Items

These items are retrieved using `config.RequireSecret()`, which will also panic if not found:

| Secret Config Key | Description | Usage |
|-------------------|-------------|-------|
| `clickhouseUrl` | URL for the ClickHouse database | Used as an environment variable in the container |
| `OTEL_EXPORTER_OTLP_HEADERS` | Headers for OpenTelemetry exporter | Used as an environment variable in the container |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Endpoint for OpenTelemetry exporter | Used as an environment variable in the container |

### Environment-Specific Items from ESC

These configuration items would be set in Pulumi ESC environments and imported into stacks:

#### Global Environment (e.g., `unkey/api/canary-global`)

| Config Key | Description |
|------------|-------------|
| `databasePrimaryDsn` | Connection string for the primary database |

#### Regional Environment (e.g., `unkey/api/aws-canary-us-east-1`)

| Config Key | Description |
|------------|-------------|
| `aws:profile` | AWS profile name (e.g., `unkey-canary-admin`) |
| `aws:region` | AWS region (e.g., `us-east-1`) |

## Configuration for Database Password Resources

The code also creates database password resources with these parameters:

- Database: `"unkey"`
- Branch: `"main"`
- Name: Dynamically generated based on project and stack
- Replica: Determined by stack name (false for canary stacks)
- Role: `"readwriter"` for primary, `"reader"` for replica

## Example Stack Configuration

Here's an example of what should be in a complete stack configuration file (`Pulumi.aws-canary-us-east-1.yaml`):

```yaml
imports:
  - api/canary-global

environment:
  - api/aws-canary-us-east-1

values:
  pulumiConfig:
```

## Environment Configurations

In the global environment:
```bash
esc env set --secret unkey/api/canary-global pulumiConfig.api:databasePrimaryDsn "yourDsnHere"
esc env set --secret unkey/api/canary-global pulumiConfig.api:clickhouseUrl "yourClickhouseUrlHere"
esc env set --secret unkey/api/canary-global pulumiConfig.api:OTEL_EXPORTER_OTLP_HEADERS "yourHeadersHere"
esc env set --secret unkey/api/canary-global pulumiConfig.api:OTEL_EXPORTER_OTLP_ENDPOINT "yourEndpointHere"
esc env set unkey/api/canary-global pulumiConfig.api:roleToAssumeARN "roleARNHere"
```
- -In the regional environment: -```bash -esc env set unkey/api/aws-canary-us-east-1 pulumiConfig.aws:profile unkey-canary-admin -esc env set unkey/api/aws-canary-us-east-1 pulumiConfig.aws:region us-east-1 -``` - -### Working with Stacks and Environments - -```bash -# Create a stack -pulumi stack init unkey/api/aws-canary-us-east-1 - -# Create the global environment -esc env init unkey/api/canary-global - -# Create the region environment -esc env init unkey/api/aws-canary-us-east-1 -``` - -### Setting Secrets - -Secrets are managed using ESC: - -```bash -esc env set --secret unkey/api/canary-global pulumiConfig.api:databasePrimaryDsn "thesecretgoeshere" -``` - -### Accessing Configuration in Code - -In Go code, configuration values are accessed using the Pulumi config system: - -```go -func main() { - pulumi.Run(func(ctx *pulumi.Context) error { - // Initialize configuration - config := config.New(ctx, "") - - // Access regular config - cidrBlock := config.Require("cidrBlock") - - // Access secrets - databasePrimaryDsn := config.RequireSecret("databasePrimaryDsn") - - // ... rest of code - }) -} -``` - -## Role Assumption in Pulumi Code - -The Pulumi code uses role assumption to obtain the necessary permissions in the target AWS account: - -```go -roleToAssumeARN := config.Require("roleToAssumeARN") - -provider, err := aws.NewProvider(ctx, "privileged", &aws.ProviderArgs{ - AssumeRole: &aws.ProviderAssumeRoleArgs{ - RoleArn: pulumi.StringPtr(roleToAssumeARN), - SessionName: pulumi.String("NameYourSession"), - ExternalId: pulumi.String("SomeNameIsUsefulHere"), - }, - Region: pulumi.String(region), -}) -``` - -## Deployed Resources - -The Pulumi code deploys several AWS resources: - -1. **VPC and Networking**: - - VPC with public subnets - - Security groups for ALB, Fargate tasks, and Redis - -2. **Serverless Redis** (Valkey): - - Elasticache Serverless Cache with Valkey engine - -3. 
**Load Balancing**: - - Application Load Balancer - - Target groups and listeners for HTTP/HTTPS traffic - - SSL/TLS certificate with DNS validation - -4. **ECS Fargate Service**: - - ECS Cluster - - Fargate service with task definition - - Container configuration with environment variables - -5. **Secrets Management**: - - Database credentials for primary and replica databases - -## Making Changes to Infrastructure - -There are two primary methods for deploying infrastructure changes: automated deployment through GitHub Actions and manual deployment by human operators. Each approach has specific workflows and considerations. - -### Automated Deployment via GitHub Actions - -GitHub Actions is the primary method for deploying infrastructure changes to all environments. This approach provides consistency, auditability, and reduces the risk of human error. - -{/* Diagram image removed — original file (Pulumi-ECS-Fargate---API.png) not available */} - -#### Prerequisites - -1. **GitHub Repository Access**: Ensure you have appropriate access to the `unkeyed/infra` repository. -2. **Pull Request Process**: All changes should follow the standard PR review process. - -#### Workflow - -1. **Create a Feature Branch**: - ```bash - git checkout -b feature/my-infrastructure-change - ``` - -2. **Make Your Changes**: - - Update Pulumi code in Go files - - Modify stack configuration in `Pulumi.*.yaml` files - -3. **Test Locally** (if possible): - ```bash - # Preview changes without applying them - pulumi preview --stack unkey/api/aws-sandbox-us-east-1 - ``` - -4. **Commit and Push Changes**: - -You should have git commit signing enabled, and use it! - ```bash - git add . - git commit -S -s -m "fix: description of infrastructure changes" - # follow conventional commits - git push origin feature/my-infrastructure-change - ``` - -5. 
**Create Pull Request**: - - Open a PR against the main branch - - Include a detailed description of the changes - - Request reviews from appropriate team members - -6. **CI/CD Pipeline Execution**: - - Coordinating changes across stacks might be something to consider! - - GitHub Actions will automatically run the Pulumi workflow - - The workflow will: - - Authenticate to AWS using OIDC - - Assume the necessary roles - - Deploy stacks when either: - - The github action workflow file Changes - - Any of the code in the project of the workflows - -7. **Deployment Order**: - - Changes are typically deployed to sandbox first - - Once verified, they're deployed to canary - - Finally, they're deployed to production - -8. **Monitor Deployments**: - - Check GitHub Actions logs for deployment status - - Verify resources in AWS Console - - Check application functionality - -#### Troubleshooting CI/CD Deployments - -- **Authentication Issues**: Verify the OIDC trust relationship is correctly configured -- **Permission Errors**: Check that the assumed roles have the necessary policies -- **Failed Deployments**: Review the GitHub Actions logs for specific error messages - -### Manual Deployment by us hoomans - -In some scenarios, you may need to deploy changes manually. This approach is typically used for emergency fixes or when testing new infrastructure components. - -#### Prerequisites - -1. **AWS CLI and Credentials**: - ```bash - # Install AWS CLI - brew install awscli - - # Configure credentials - aws configure - ``` - -2. **Pulumi and ESC CLI**: - ```bash - brew update && brew install pulumi/tap/esc pulumi/tap/pulumi - ``` - -3. **AWS SSO Access**: Ensure you have SSO access to the relevant AWS accounts with Administrator permissions. - -#### Workflow - -1. **Clone the Repository**: - ```bash - git clone https://github.com/unkeyed/infra.git - cd infra - ``` - -2. **Switch to the Appropriate Branch**: - ```bash - git checkout main # Or feature branch if testing - ``` - -3. 
**Login to AWS SSO**: - ```bash - # Login to the relevant account - aws sso login --profile unkey-sandbox-admin - ``` - -4. **Set Up Pulumi Stack**: - ```bash - # Select the appropriate stack - AWS_PROFILE=unkey-sandbox-admin pulumi stack select unkey/api/aws-sandbox-us-east-1 - ``` - -5. **Configure Role Assumption**: - - The `roleToAssumeARN` config is set at the project's global stack. - - Your SSO role gives you the ability to assume the `UnkeyPulumiAWSExecutor` role. - -6. **Preview Changes**: - ```bash - AWS_PROFILE=unkey-sandbox-admin pulumi preview - ``` - -7. **Apply Changes**: - ```bash - AWS_PROFILE=unkey-sandbox-admin pulumi up - ``` - -8. **Verify Deployment**: - - Check [AWS Console](https://unkey.awsapps.com/start/) in the account you're working in for deployed resources - - Test functionality - - Monitor logs and metrics (ECS tasks for API are a good place to start, for example) - -9. **Document the Changes**: - - Create necessary tickets... - - Update documentation - - Create a PR to formalize the changes once the incident has passed. - -10. **Thoughts for testing**: - - When testing GitHub Actions changes, use a test branch like `workflow-testing` or one of your choosing. - -### Managing Secrets Manually - -When working with secrets manually, use the ESC CLI: - -```bash -# View existing secrets (will not show values) -esc env open unkey/api/aws-sandbox-us-east-1 -f yaml - -# Set a new secret -esc env set --secret unkey/api/aws-sandbox-global pulumiConfig.api:myNewSecret "secretvalue" - -# Update an existing secret -esc env set --secret unkey/api/aws-sandbox-global pulumiConfig.api:existingSecret "newsecretvalue" -``` - -#### Common Manual Operations - -1. **Emergency Resource Updates**: - - If you can't wait around you can run locally.. but ONLY use this in extreme cases. - ```bash - pulumi up --skip-preview -y - ``` - -#### Best Practices for Manual Deployments - -1. 
**Communicate Changes**: Notify @imeyer and @chronark before and after manual deployments -2. **Document Everything**: Record all manual changes in appropriate tickets or documentation -3. **Transfer to CI/CD**: Move manual changes to the CI/CD pipeline as soon as possible -4. **Limit Scope**: Make the smallest possible change needed to resolve the issue -5. **Test First**: Always test changes in sandbox before applying to production -6. **Note how often none of this is followed**: Ahem. - - -## Common Workflows - -### Adding a New Secret - -```bash -# Set the secret in the global environment -esc env set --secret unkey/api/canary-global pulumiConfig.api:newSecret "secretvalue" - -# Access the secret in code -newSecret := config.RequireSecret("newSecret") -``` - -### Working with Multiple AWS Accounts through automation - -The infrastructure is designed to support multiple AWS accounts through role assumption: - -1. GitHub Actions assumes the `GitHubActionsOIDCRole` in the management account -2. The Pulumi code assumes the `UnkeyPulumiAWSExecutor` role in the target account -3. Project resources are created in the target account - -### Troubleshooting - -If you encounter issues: - -1. Check the GitHub Actions workflow logs -2. Verify the role assumption chain is working correctly (TODO) -3. Ensure the `UnkeyPulumiAWSExecutor` role has the necessary permissions - -Number 3 is the most likely to happen right now as things grow. diff --git a/docs/engineering/infra/legacy-2025/pulumi-workflow.mdx b/docs/engineering/infra/legacy-2025/pulumi-workflow.mdx deleted file mode 100644 index 8a1f3c2dff..0000000000 --- a/docs/engineering/infra/legacy-2025/pulumi-workflow.mdx +++ /dev/null @@ -1,78 +0,0 @@ ---- -title: Pulumi Workflow -description: Day-to-day Pulumi deployment workflow. ---- - -## Table of contents - -* [Pulumi Workflow](#pulumi-workflow) - * [Overview](#overview) - * [Prerequisites](#prerequisites) - * [Steps](#steps) - * [1\. 
Create a Feature Branch](#1-create-a-feature-branch) - * [2\. Make Your Changes](#2-make-your-changes) - * [3\. Test Locally (if possible)](#3-test-locally-if-possible) - * [4\. Commit and Push Changes](#4-commit-and-push-changes) - * [5\. Create Pull Request](#5-create-pull-request) - * [6\. CI/CD Pipeline Execution](#6-cicd-pipeline-execution) - * [7\. Deployment Order](#7-deployment-order) - * [8\. Monitor Deployments](#8-monitor-deployments) - -## Overview - -Audience: For day-to-day operations and development... - -{/* Diagram image removed — original file (Pulumi-ECS-Fargate---API.png) not available */} - -### Prerequisites -- **GitHub Repository Access**: Ensure you have appropriate access to the `unkeyed/infra` repository. -- **Pull Request Process**: All changes should follow the standard PR review process. - -## Steps - -### 1. Create a Feature Branch -```bash -git checkout -b feature/my-infrastructure-change -``` - -### 2. Make Your Changes -Each project's deployment workflow file looks for file changes to itself, and changes to the project files. For example, the API workflow has the following... -```yaml -paths: - - .github/workflows/pulumi-deploy-api-infrastructure.yaml - - pulumi/projects/api/** -branches: - - main -``` -Any change to those files results in a workflow trigger, but only if it's to the branch `main`.. so make egregious use of sandbox to validate your Pulumi changes. - -### 3. Test Locally (if possible) -```bash -# Preview changes without applying them -pulumi preview --stack unkey/api/aws-sandbox-us-east-1 -``` - -### 4. Commit and Push Changes -You should have git commit signing enabled, and use it! -```bash -git add . -git commit -S -s -m "fix: description of infrastructure changes" -# follow conventional commits -git push origin feature/my-infrastructure-change -``` - -### 5. 
Create Pull Request - - Open a PR against the main branch - - Include a detailed description of the changes - - Request reviews from appropriate team members - -### 6. CI/CD Pipeline Execution -Coordinating changes across stacks might be something to consider! If you're doing maintenance on a cluster, does it need to be taken out of Global Accelerator first? DNS changes made before work can be made? - -### 7. Deployment Order - - Changes should typically be deployed to sandbox first to ensure the pulumi code doesn't break but go with your gut - -### 8. Monitor Deployments - - Check GitHub Actions logs for deployment status - - Verify resources in AWS Console - - Check application functionality diff --git a/docs/engineering/infra/metering/gauge.mdx b/docs/engineering/infra/metering/gauge.mdx deleted file mode 100644 index 6c3aad25b2..0000000000 --- a/docs/engineering/infra/metering/gauge.mdx +++ /dev/null @@ -1,453 +0,0 @@ ---- -title: metric collection -description: How we can collect resource usage metrics for billing. ---- - -## Overview - -soonTM = I dont have a good name yet. - -**soonTM** is a Go DaemonSet that collects per-deployment CPU, memory, and network egress metrics from Kubernetes nodes and writes them to ClickHouse for billing. -It also tracks deployment lifecycle events (start, stop, scale) with millisecond-precise timestamps. - -soonTM runs once per node in the cluster, on every node that hosts customer deployments (`untrusted` nodepool). - -## Architecture - - - -soonTM scrapes two kubelet endpoints on the local node at a configurable interval (default 15s) and watches pod events via a K8s informer. It also queries Cilium Hubble for public vs internal egress classification. 
| Source | What | Endpoint |
| --- | --- | --- |
| kubelet | CPU + memory | `/metrics/resource` |
| kubelet | Network tx/rx bytes | `/stats/summary` |
| K8s API | Lifecycle events (start/stop/scale) | Pod informer (watch) |
| Cilium Hubble | Public egress classification | gRPC via `hubble.relay` |

Every sample and lifecycle event is written to a **disk WAL** (EBS volume) before anything else. Billing data is never held only in memory.

A background loop reads completed WAL segments and batch-inserts them into ClickHouse. On success, the segment is deleted. On failure, it stays on disk for retry.

If the disk fills up, the oldest segments are uploaded to S3. Once ClickHouse recovers, re-ingest from S3:

```sql
INSERT INTO default.container_resources_raw_v1
SELECT * FROM s3('s3://bucket/metering-wal/**/*.ndjson', 'JSONEachRow')
```

## What soonTM Collects

### Resource Usage Samples (configurable interval, default 15s)

For each krane-managed pod on the node, soonTM records:

| Metric | Source | Description |
| --- | --- | --- |
| `cpu_millicores` | `/metrics/resource` | Actual CPU usage rate (computed from cumulative nanosecond counter delta) |
| `memory_working_set_bytes` | `/metrics/resource` | Actual memory working set |
| `cpu_request_millicores` | Pod spec | Requested CPU (scheduling guarantee) |
| `cpu_limit_millicores` | Pod spec | CPU limit (hard cap) |
| `memory_request_bytes` | Pod spec | Requested memory |
| `memory_limit_bytes` | Pod spec | Memory limit |
| `network_tx_bytes` | `/stats/summary` | Total egress bytes since last sample (delta) |
| `network_tx_bytes_public` | Cilium Hubble | Public-only egress (non-RFC1918 destinations) |

Each sample is tagged with `workspace_id`, `project_id`, `app_id`, `environment_id`, `deployment_id`, `instance_id` (pod name), `region`, and `platform`.
- -### Lifecycle Events (real-time) - -soonTM watches for pod state changes via Kubernetes informers and emits events with millisecond-precise timestamps: - -| Event | When | Why it matters | -| --- | --- | --- | -| `started` | Pod appears and is running | Billing window begins | -| `stopped` | Pod is removed | Billing window ends | -| `scaled` | ReplicaSet replica count changes | Allocated billing changes mid-period | - -Each lifecycle event records the deployment's resource allocation at that moment (replicas, cpu_limit, memory_limit). - -## Why Two Data Streams - -Usage samples and lifecycle events answer different questions: - -- **Usage samples** → "how much CPU/memory did this pod actually consume in this collection interval?" -- **Lifecycle events** → "exactly when did this deployment start, stop, or change its allocation?" - -### Why usage samples alone aren't enough - -Usage samples arrive every collection interval. But deployments don't start and stop on interval boundaries: - -``` -Timeline: - :00.000 ─── nothing ─── - :00.100 ← pod starts (no sample knows about this) - :15.000 ← first usage sample arrives - :30.000 ← second sample - :45.000 ← last sample - :52.300 ← pod stops (no sample captures this either) -``` - - -Without lifecycle events, those first 14.9 seconds and last 7.3 seconds are invisible. For both billing models that's unacceptable — we'd be rounding to 15s boundaries when we have the data to be ms-precise. - - -### What lifecycle events enable - - - -Usage samples give us the actual consumption rate. Lifecycle events give us the exact billing window. We know the pod started at `:00.100` and stopped at `:52.300`, so we prorate the first and last intervals to the millisecond instead of snapping to the nearest 15s sample. - - -Charges for what the customer _reserved_, not what they used. A pod sitting completely idle still costs money because it's holding resources. 
Usage samples would show 0 CPU — but the customer still has 500m CPU reserved. - -Lifecycle events give us: -- Exact start/stop times (ms precision) -- Replica count at each point in time -- Resource limits (cpu_limit, memory_limit) at each point in time - - - -### How allocated billing works - -Each lifecycle event marks a change in what's reserved. Between two events, the allocation is constant: - -``` -Timeline for deployment X: - - 14:00:00.100 started │ 2 replicas × 500m CPU × 256Mi mem - │ - 14:32:17.483 scaled │ 4 replicas × 500m CPU × 256Mi mem - │ - 15:07:44.917 stopped │ 0 - -Billing: - Interval 1: 14:00:00.100 → 14:32:17.483 = 1,937,383ms - allocated_cpu = 2 × 500m × 1,937,383ms = 1,937,383,000 millicores·ms - - Interval 2: 14:32:17.483 → 15:07:44.917 = 2,127,434ms - allocated_cpu = 4 × 500m × 2,127,434ms = 4,254,868,000 millicores·ms - - Total: 6,192,251,000 millicores·ms = ~1.72 CPU-hours -``` - - -The billing service fetches lifecycle events from ClickHouse, walks them chronologically per deployment, and computes `replicas × limit × duration` for each interval. This is done in Go, not SQL — ClickHouse stores the events, the billing service does the math. - - -## How CPU is Measured - -The kernel tracks CPU as a **cumulative nanosecond counter** — it only ever goes up. The kubelet reads this via cAdvisor and exposes it through its API. We don't read cgroups directly. - -Every collection interval, soonTM grabs the counter from the kubelet. With two consecutive readings we compute the actual usage: - -``` -Example: - t0 (00:00:00): counter = 5,000,000,000 ns - t1 (00:00:15): counter = 5,750,000,000 ns - - delta = 750,000,000 ns consumed in 15s - rate = 750,000,000 / 15,000,000,000 = 0.05 cores = 50 millicores -``` - - -This is **not** an instantaneous snapshot — it's exactly how much CPU was consumed between two readings. No spikes are missed, no idle time is overcounted. The kernel counted every nanosecond. 
- - -For billing, we store each of these computed rates as a sample. The total CPU consumed over a billing period is the **sum of all samples** — not an average. Each sample represents actual usage over its collection interval. - -### Edge Windows: Start and Stop - -Computing CPU rate requires two consecutive readings. This creates a blind spot at pod start (no previous reading) and pod stop (no next reading). This is a physical limitation — every metrics tool has it. - - - -We do an **immediate kubelet read** when the pod informer fires `AddFunc`: - -``` -:00.100 pod starts → informer AddFunc fires -:00.120 soonTM immediately reads kubelet → counter = X (reading #1) -:15.000 regular tick → counter = Y (reading #2) - → cpu for :00.120 → :15.000 = (Y - X) / 14.88s ✓ -``` - -The blind spot shrinks from ~15s to milliseconds (however fast we can hit the kubelet API after the informer event). - - -Harder — the container may be gone before we can read. Our approach: - -1. The informer fires `UpdateFunc` when the pod transitions to **Terminating** (before the container is actually killed, during the graceful shutdown window) -2. soonTM races to read the kubelet one last time -3. If we win the race: we get a real final delta, full coverage -4. If we lose (container already dead): fall back to billing the **allocated rate** (`cpu_limit`) for the gap between the last real sample and the lifecycle stop event - -``` -Best case (win the race): - :45.000 regular tick → counter = A - :52.100 SIGTERM → pod becomes Terminating - :52.110 informer fires → soonTM reads kubelet → counter = B ✓ - :52.300 container dies - → :45 → :52.110 = real usage ✓ - → :52.110 → :52.300 = billed at allocated rate - -Worst case (lose the race): - :45.000 regular tick → counter = A (last real sample) - :52.300 container dies before we can read - → :45 → :52.300 = 7.3s at allocated rate -``` - - - - -The worst-case gap is one collection interval billed at the allocated rate instead of actual usage. 
For a pod running hours or days, this is negligible. The allocated rate is also the ceiling — the customer is never charged more than what they reserved.

Even after a container dies, cAdvisor retains its stats in memory for up to 2 minutes ([`--storage_duration`](https://github.com/google/cadvisor/blob/master/docs/runtime_options.md)), so the "race to read on Terminating" has a decent safety margin. The kubelet's container GC used to be tunable via `--minimum-container-ttl-duration`, but that flag has been [removed](https://github.com/kubernetes/kubernetes/issues/40044).

## How Network Egress is Split

Total egress comes from the kubelet Summary API (`txBytes` counter delta). To split internal vs public:

1. soonTM queries **Cilium Hubble** (gRPC API via `hubble.relay`) for each pod's outbound network flows.
2. Flows to RFC1918 destinations (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`) are **internal**. Everything else is **public**.
3. `network_tx_bytes_public` = sum of tx bytes to non-RFC1918 destinations. Internal egress = `network_tx_bytes - network_tx_bytes_public` (computed at query time).

Only public egress is typically billed. Internal egress (pod-to-pod, pod-to-service) is free.

## Durability: Disk WAL + S3 Overflow

Billing data cannot be dropped. Unlike analytics (where losing a few events is acceptable), missing billing data means lost revenue or overcharging.

soonTM uses a **write-ahead log (WAL)** on a dedicated EBS volume:

1. Every metric sample and lifecycle event is written to a **segment file on disk** before anything else.
2. A background drain loop reads completed segments and batch-inserts them into ClickHouse. On success, the segment file is deleted.
3. If ClickHouse is down, the segment stays on disk and is retried on the next loop. Data is never lost.
4. If the disk fills up (EBS volume approaching capacity), the oldest segments are uploaded to S3 and deleted locally.
- - -Once ClickHouse recovers, data from S3 can be re-ingested with zero custom code: - -```sql -INSERT INTO default.container_resources_raw_v1 -SELECT * FROM s3( - 's3://bucket/metering-wal/resources/**/*.ndjson', - 'JSONEachRow' -) -``` - - - -Segment files use **NDJSON format** (one JSON object per line) so ClickHouse can read them directly from S3 without any transformation. - -## ClickHouse Data Model - -### Raw Tables - -| Table | Rows | TTL | Purpose | -| --- | --- | --- | --- | -| `container_resources_raw_v1` | 1 per instance per collection interval | 90 days | Raw usage samples | -| `deployment_lifecycle_events_v1` | 1 per start/stop/scale | 365 days | Lifecycle events | - -### Table Schemas - - -```sql -CREATE TABLE default.container_resources_raw_v1 -( - `time` Int64, -- unix milliseconds - `workspace_id` String, - `project_id` String, - `app_id` String, - `environment_id` String, - `deployment_id` String, - `instance_id` String, -- pod name - `region` LowCardinality(String), - `platform` LowCardinality(String), -- "aws", "gcp", etc. 

    -- Actual usage (from kubelet)
    `cpu_millicores` Float64,             -- computed rate from counter delta
    `memory_working_set_bytes` Int64,

    -- Allocated resources (from pod spec)
    `cpu_request_millicores` Int32,
    `cpu_limit_millicores` Int32,
    `memory_request_bytes` Int64,
    `memory_limit_bytes` Int64,

    -- Network egress (deltas since last sample)
    `network_tx_bytes` Int64,             -- total egress
    `network_tx_bytes_public` Int64       -- public-only (non-RFC1918)
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 90 DAY DELETE
```

```sql
CREATE TABLE default.deployment_lifecycle_events_v1
(
    `time` Int64,                         -- unix milliseconds (ms-precise)
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,
    `region` LowCardinality(String),
    `platform` LowCardinality(String),

    `event` LowCardinality(String),       -- "started", "stopped", "scaled"
    `replicas` Int32,                     -- replica count at this moment
    `cpu_limit_millicores` Int32,         -- per-replica CPU limit
    `memory_limit_bytes` Int64            -- per-replica memory limit
)
ENGINE = MergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
PARTITION BY toDate(fromUnixTimestamp64Milli(time))
TTL toDateTime(fromUnixTimestamp64Milli(time)) + INTERVAL 365 DAY DELETE
```

All four aggregation tables share the same column structure, differing only in time granularity and TTL:

```sql
-- Same structure for per_minute (30d TTL), per_hour (90d TTL), per_day (365d TTL), per_month (no TTL)
CREATE TABLE default.container_resources_per_{minute,hour,day,month}_v1
(
    `time` DateTime,                      -- or Date for day/month
    `workspace_id` String,
    `project_id` String,
    `app_id` String,
    `environment_id` String,
    `deployment_id` String,

    `cpu_millicores_sum` SimpleAggregateFunction(sum, Float64),
    `memory_bytes_max` SimpleAggregateFunction(max, Int64),
    `memory_bytes_sum` SimpleAggregateFunction(sum, Float64),
    `cpu_limit_millicores_max` SimpleAggregateFunction(max, Int32),
    `memory_limit_bytes_max` SimpleAggregateFunction(max, Int64),
    `network_tx_bytes_sum` SimpleAggregateFunction(sum, Int64),
    `network_tx_bytes_public_sum` SimpleAggregateFunction(sum, Int64),
    `sample_count` SimpleAggregateFunction(sum, Int64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (workspace_id, app_id, deployment_id, time)
```

Each level is populated by a materialized view that aggregates from the level below (raw → minute → hour → day → month).

### Materialized View Aggregation Chain

```
container_resources_raw_v1
        │ (MV: group by minute, full hierarchy)
        ▼
container_resources_per_minute_v1      ← 30 day TTL
        │ (MV: group by hour)
        ▼
container_resources_per_hour_v1        ← 90 day TTL
        │ (MV: group by day)
        ▼
container_resources_per_day_v1         ← 365 day TTL
        │ (MV: group by month)
        ▼
container_resources_per_month_v1       ← no TTL
```

All aggregation tables preserve the **full hierarchy**: `workspace_id`, `project_id`, `app_id`, `environment_id`, `deployment_id`. This enables queries at any level — per-deployment, per-app, or per-workspace.

These MVs exist for **dashboard performance** — pre-aggregating so time-series graphs don't scan millions of raw rows. Billing uses the raw tables + lifecycle events directly.
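That lifecycle walk — the allocated-billing computation described earlier — can be sketched in Go. Struct and function names are illustrative; the real billing service fetches these rows from `deployment_lifecycle_events_v1`:

```go
package main

import "fmt"

// LifecycleEvent mirrors a row of deployment_lifecycle_events_v1.
type LifecycleEvent struct {
	TimeMs             int64  // unix milliseconds
	Event              string // "started", "stopped", "scaled"
	Replicas           int32
	CPULimitMillicores int32
}

// allocatedCPU walks a deployment's events chronologically and sums
// replicas × cpu_limit × duration for each interval, in millicore·ms.
func allocatedCPU(events []LifecycleEvent) int64 {
	var total int64
	for i := 0; i < len(events)-1; i++ {
		e, next := events[i], events[i+1]
		if e.Event == "stopped" {
			continue // nothing is allocated between a stop and a later start
		}
		duration := next.TimeMs - e.TimeMs
		total += int64(e.Replicas) * int64(e.CPULimitMillicores) * duration
	}
	return total
}

func main() {
	// The worked example from earlier: 2 replicas at 14:00:00.100, scaled to
	// 4 at 14:32:17.483, stopped at 15:07:44.917 (times as ms since midnight).
	events := []LifecycleEvent{
		{TimeMs: 50_400_100, Event: "started", Replicas: 2, CPULimitMillicores: 500},
		{TimeMs: 52_337_483, Event: "scaled", Replicas: 4, CPULimitMillicores: 500},
		{TimeMs: 54_464_917, Event: "stopped"},
	}
	fmt.Println(allocatedCPU(events)) // 6192251000 millicore·ms ≈ 1.72 CPU-hours
}
```

Each event carries the full allocation state (`replicas`, per-replica limits), so the walk never has to join back to the raw samples.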
- - -### What Each Aggregation Level Stores - -| Column | Meaning | How to use | -| --- | --- | --- | -| `cpu_millicores_sum` | Sum of all CPU samples | `/ sample_count` = avg CPU | -| `memory_bytes_max` | Peak memory in the window | For peak-based billing | -| `memory_bytes_sum` | Sum of all memory samples | `/ sample_count` = avg memory | -| `cpu_limit_millicores_max` | Max allocated CPU | For allocated billing | -| `memory_limit_bytes_max` | Max allocated memory | For allocated billing | -| `network_tx_bytes_sum` | Total egress bytes | For egress billing | -| `network_tx_bytes_public_sum` | Public egress bytes | For public egress billing | -| `sample_count` | Number of samples | For computing averages | - -## Billing Models Supported - - - -Bill for what was actually consumed. The billing service: - -1. Fetches lifecycle events to get the exact billing window (ms-precise start/stop) -2. Queries raw samples within that window for actual CPU/memory/egress consumed -3. Prorates the first and last intervals to the millisecond using the `started`/`stopped` event timestamps -4. For edge windows where no CPU sample exists, bills at the allocated rate (`cpu_limit`) - - -Bill for what was reserved, regardless of actual usage. The billing service: - -1. Fetches lifecycle events for the billing period -2. Walks them chronologically per deployment -3. Each `started`/`scaled` event marks the beginning of an interval with known allocation (`replicas × cpu_limit`) -4. Each `stopped`/`scaled` event marks the end of that interval -5. Computes `allocated_cpu_ms = sum(replicas × cpu_limit_millicores × interval_duration_ms)` across all intervals - -This is done in Go, not SQL — ClickHouse stores the events, the billing service does the math. - - - -## Deployment - -soonTM runs as a **Kubernetes DaemonSet** on all `untrusted` nodes (the same nodes that run customer deployments). 
- -| Resource | Value | -| --- | --- | -| CPU request | 20m | -| CPU limit | 50m | -| Memory request | 32Mi | -| Memory limit | 64Mi | -| Disk (EBS) | 5-10Gi gp3 | -| Collection interval | configurable (default 15s) | - -## Known Considerations - - -Customer deployments run under gVisor (`runtimeClassName: gvisor`). gVisor sandboxes container processes from the host kernel, which can affect cAdvisor's ability to read cgroup metrics. The kubelet's CRI stats provider (used by `/metrics/resource`) works independently of cAdvisor and should report correctly for gVisor pods. - -**This needs to be verified on a staging cluster before shipping.** - - - -- `/metrics/resource` is the officially recommended endpoint for CPU/memory. Future-proof. -- `/stats/summary` has been "planned for deprecation" since 2018 with no concrete action ([kubernetes#106080](https://github.com/kubernetes/kubernetes/issues/106080)). We use it only for network stats (which `/metrics/resource` doesn't provide). If it's ever actually deprecated, we can fall back to Cilium Hubble for all network data. - diff --git a/docs/engineering/infra/observability/alerting.mdx b/docs/engineering/infra/observability/alerting.mdx deleted file mode 100644 index edbe0b5ca9..0000000000 --- a/docs/engineering/infra/observability/alerting.mdx +++ /dev/null @@ -1,311 +0,0 @@ ---- -title: Alerting -description: Alert routing, severities, and adding rules. ---- - -How our alerts work, what severity means, and how to add new ones without making the on-call person's life worse. - -## How alerts get to people - -Prometheus evaluates alerting rules in each cluster. When something fires, Alertmanager sends it to incident.io, which decides what to do based on two things: the **source** (which cluster sent it) and the **severity label** on the alert. 

| Severity | What happens | Route |
|----------|-------------|-------|
| `critical` | Pages on-call via Engineering On-Call escalation path | Production Alerts |
| `warning` | Posts to #alerts in Slack, no page | Production Warnings |
| _(missing)_ | Posts to #alerts with no page — safety net so you know the label is missing | Unrouted Alerts |

There's a catch-all route ("Unrouted Alerts") that picks up Alertmanager alerts with no `severity` label. It posts to #alerts so someone notices and fixes the missing label. It doesn't page anyone. If your alert has a severity value that isn't `critical` or `warning` (like `severity=info`), it won't match any of the three routes and will be silently dropped. Don't do that.

Staging alerts never page anyone regardless of severity. They go to #alerts and that's it.

**Always include a `severity` label set to `critical` or `warning`.** Anything else and the alert is effectively invisible.

## When to use which severity

**`critical`** — would you wake someone up for this? Then it's critical.

- Active customer impact (elevated error rates, full outage)
- Data loss or risk of data loss
- Pods that should be running but aren't
- Anything where waiting until morning makes it worse

**`warning`** — is this something we should know about but can wait?

- Elevated latency that hasn't crossed into "customers are mad" territory
- Resource pressure (high memory, goroutine counts, connection pools filling up)
- Indicators that _could_ become critical if left alone
- Anything you'd look at during business hours but wouldn't lose sleep over

If you're not sure, start with `warning`. It's easy to promote to `critical` later. Going the other direction means someone already got woken up for nothing.
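For reference, a new rule in one of the alert template files might look like this. It's a minimal sketch — the alert name, metric, and thresholds are made up; the part that matters is the `severity` label:

```yaml
groups:
  - name: my-service
    rules:
      - alert: MyServiceErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="my-service",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="my-service"}[5m])) > 0.02
        for: 2m
        labels:
          severity: warning  # must be "critical" or "warning" — anything else is dropped
        annotations:
          summary: "my-service 5xx rate above 2% for 2 minutes"
```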
- -## Current alerts - -### Custom (defined in this repo) - -| Alert | Severity | What it catches | File | -|-------|----------|----------------|------| -| PodNotRunning | critical | Pod stuck in `CrashLoopBackOff`, `ImagePullBackOff`, etc. for 5m | `observability/templates/prometheus-alerts.yaml` | -| **Frontline** | | | | -| FrontlinePlatformErrorRateHigh | warning | Platform (our) error rate > 2% for 2m — excludes customer/user errors | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineRoutingErrorsHigh | warning | Sustained routing errors > 0.5/s for 2m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineProxyConnectionFailures | warning | Connection failures (timeout, refused, reset, DNS) > 5% for 2m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineSentinel5xxRate | warning | Sentinel-sourced 5xx > 5% for 2m (sentinel itself is broken) | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineP99LatencyHigh | warning | P99 latency > 2.5s for 5m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineP95LatencyHigh | warning | P95 latency > 1s for 5m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineRoutingLatencyHigh | warning | Routing P99 > 1.5s for 5m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineBackendLatencyHigh | warning | Backend P99 > 5s for 5m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineHighActiveRequests | warning | Active requests > 100 for 2m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineExcessiveHops | warning | P99 cross-region hops >= 2 for 5m | `observability/templates/prometheus-alerts-frontline.yaml` | -| FrontlineGoroutineLeak | warning | Goroutines > 1000 for 10m | `observability/templates/prometheus-alerts-frontline.yaml` | -| **Sentinel** | | | | -| SentinelPlatformErrorRateHigh | warning | Platform (our) error rate > 
2% for 2m — excludes customer/user errors | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelProxyErrorsHigh | warning | Proxy errors > 1/s for 2m (excl. client cancellations) | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelEngineEvaluationErrors | warning | Policy engine error rate > 1% for 3m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelRoutingFailures | warning | Deployment/instance routing failures > 0.5/s for 3m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelP99LatencyHigh | warning | P99 latency > 2.5s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelP95LatencyHigh | warning | P95 latency > 1s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelEngineEvaluationLatencyHigh | warning | Policy eval P99 > 1.5s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelRoutingLatencyHigh | warning | Routing P99 > 1.5s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelUpstreamLatencyHigh | warning | Upstream P99 > 10s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelHighActiveRequests | warning | Active requests > 100 for 2m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelNoRunningInstancesSustained | warning | "No running instances" errors > 0.5/s for 5m | `observability/templates/prometheus-alerts-sentinel.yaml` |
| SentinelGoroutineLeak | warning | Goroutines > 1000 for 10m | `observability/templates/prometheus-alerts-sentinel.yaml` |

### kube-prometheus-stack built-ins

We get a bunch of default alerting rules from kube-prometheus-stack. We've disabled the ones that don't apply to EKS (etcd, apiserver, scheduler, controller-manager, cert rotation, RAID, bonding — all managed by AWS) and a couple we replaced with custom rules.
The full list of disabled rules is in `values.yaml` under `defaultRules.disabled`. - -Disabled for EKS (AWS manages these): -- All etcd alerts -- KubeAPIDown, KubeAPIErrorBudgetBurn, KubeAPITerminatedRequests -- KubeAggregatedAPIDown, KubeAggregatedAPIErrors -- KubeControllerManagerDown, KubeSchedulerDown, KubeProxyDown -- KubeClientCertificateExpiration, KubeVersionMismatch -- KubeletClient/ServerCertificateExpiration and RenewalErrors -- NodeRAIDDegraded, NodeRAIDDiskFailure, NodeBondingDegraded - -Disabled because we replaced them: -- **KubePodCrashLooping** — replaced by PodNotRunning which excludes krane-managed customer workloads -- **KubePodNotReady** — was firing for customer deployment containers - -Everything below is what's still active. These come from upstream [kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin). We didn't write them, but they fire in our clusters and you should know what they are. Runbook links are included where available. - -#### Kubernetes — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| KubePersistentVolumeErrors | PV provisioning is broken | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors) | -| KubePersistentVolumeFillingUp | PV has < 3% space left. 
Check if storageclass allows expansion, resize PVC, or clean up data | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup) | -| KubePersistentVolumeInodesFillingUp | PV has < 3% inodes left | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup) | - -#### Kubernetes — warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| KubeCPUOvercommit | Cluster CPU requests exceed capacity | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit) | -| KubeCPUQuotaOvercommit | CPU quota overcommitted | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit) | -| KubeClientErrors | API server client is seeing errors | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors) | -| KubeContainerWaiting | Container stuck waiting for > 1 hour. Check events, logs, and resource availability (configmaps, secrets, volumes) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting) | -| KubeDaemonSetMisScheduled | DaemonSet pods landed on wrong nodes | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled) | -| KubeDaemonSetNotScheduled | DaemonSet pods not getting scheduled | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled) | -| KubeDaemonSetRolloutStuck | DaemonSet rollout stalled | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck) | -| KubeDeploymentGenerationMismatch | Deployment generation mismatch — possible failed rollback | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch) | -| KubeDeploymentReplicasMismatch | Deployment doesn't have expected replica count | 
[link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch) | -| KubeDeploymentRolloutStuck | Deployment rollout not progressing | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentrolloutstuck) | -| KubeHpaMaxedOut | HPA running at max replicas | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout) | -| KubeHpaReplicasMismatch | HPA hasn't reached desired replicas | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch) | -| KubeJobFailed | Job failed. Check `kubectl describe job` and pod logs | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed) | -| KubeJobNotCompleted | Job didn't finish in time | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted) | -| KubeMemoryOvercommit | Cluster memory requests exceed capacity | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryovercommit) | -| KubeMemoryQuotaOvercommit | Memory quota overcommitted | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryquotaovercommit) | -| KubeNodeNotReady | Node not ready. 
Check `kubectl get node $NODE -o yaml`, fix or terminate the instance | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready) | -| KubeNodeReadinessFlapping | Node keeps flipping between ready and not ready | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping) | -| KubeNodeUnreachable | Node is unreachable | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable) | -| KubePdbNotEnoughHealthyPods | PDB doesn't have enough healthy pods — blocks voluntary disruptions | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepdbnotenoughhealthypods) | -| KubePersistentVolumeFillingUp | PV filling up (warning threshold, predicted to fill in 4 days) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup) | -| KubePersistentVolumeInodesFillingUp | PV inodes filling up (warning threshold) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup) | -| KubeQuotaExceeded | Namespace quota exceeded | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded) | -| KubeStatefulSetGenerationMismatch | StatefulSet generation mismatch | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch) | -| KubeStatefulSetReplicasMismatch | StatefulSet doesn't have expected replicas | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch) | -| KubeStatefulSetUpdateNotRolledOut | StatefulSet update hasn't rolled out | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout) | - -#### Kubelet — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| KubeletDown | Kubelet target gone | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown) | - -#### Kubelet — 
warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| KubeletPlegDurationHigh | Pod lifecycle event generator taking too long | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh) | -| KubeletPodStartUpLatencyHigh | Pods taking too long to start | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh) | - -#### Node — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| NodeFileDescriptorLimit | Kernel predicted to run out of file descriptors soon | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit) | -| NodeFilesystemAlmostOutOfFiles | < 3% inodes remaining | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles) | -| NodeFilesystemAlmostOutOfSpace | < 3% disk space remaining | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace) | -| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 4 hours | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup) | -| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 4 hours | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup) | - -#### Node — warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| NodeClockNotSynchronising | NTP not syncing | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising) | -| NodeClockSkewDetected | Clock skew on node | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected) | -| NodeDiskIOSaturation | Disk IO queue is high | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodediskiosaturation) | -| NodeFileDescriptorLimit | FD limit approaching (warning threshold) | 
[link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit) | -| NodeFilesystemAlmostOutOfFiles | < 5% inodes remaining | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles) | -| NodeFilesystemAlmostOutOfSpace | < 5% disk space remaining | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace) | -| NodeFilesystemFilesFillingUp | Predicted to run out of inodes in 24 hours | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup) | -| NodeFilesystemSpaceFillingUp | Predicted to run out of space in 24 hours | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup) | -| NodeHighNumberConntrackEntriesUsed | Conntrack table getting full | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused) | -| NodeMemoryHighUtilization | Node running low on memory | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodememoryhighutilization) | -| NodeMemoryMajorPagesFaults | Heavy major page faults — something is swapping hard | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodememorymajorpagesfaults) | -| NodeNetworkInterfaceFlapping | NIC keeps going up and down | [link](https://runbooks.prometheus-operator.dev/runbooks/general/nodenetworkinterfaceflapping) | -| NodeNetworkReceiveErrs | Lots of receive errors on a NIC | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs) | -| NodeNetworkTransmitErrs | Lots of transmit errors on a NIC | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs) | -| NodeSystemSaturation | Load per core is very high | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemsaturation) | -| NodeSystemdServiceCrashlooping | A systemd service keeps restarting | 
[link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemdservicecrashlooping) | -| NodeSystemdServiceFailed | A systemd service has failed | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemdservicefailed) | -| NodeTextFileCollectorScrapeError | Node exporter text file collector failed | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror) | - -#### Alertmanager — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| AlertmanagerClusterCrashlooping | Half or more Alertmanager instances crashlooping | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclustercrashlooping) | -| AlertmanagerClusterDown | Half or more Alertmanager instances down | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterdown) | -| AlertmanagerClusterFailedToSendAlerts | Failed to send to a critical integration | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterfailedtosendalerts) | -| AlertmanagerConfigInconsistent | Alertmanager instances have different configs | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerconfiginconsistent) | -| AlertmanagerFailedReload | Config reload failed | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerfailedreload) | -| AlertmanagerMembersInconsistent | Cluster member can't find other members | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagermembersinconsistent) | - -#### Alertmanager — warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| AlertmanagerClusterFailedToSendAlerts | Failed to send to a non-critical integration | [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerclusterfailedtosendalerts) | -| AlertmanagerFailedToSendAlerts | An instance failed to send notifications 
| [link](https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerfailedtosendalerts) | - -#### Prometheus — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| PrometheusBadConfig | Config reload failed. Check `kubectl logs` on the prometheus pod | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusbadconfig) | -| PrometheusErrorSendingAlertsToAnyAlertmanager | > 3% errors sending alerts to all Alertmanagers | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuserrorsendingalertstoanyalertmanager) | -| PrometheusRemoteStorageFailures | Failing to send samples to remote storage | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotestoragefailures) | -| PrometheusRemoteWriteBehind | Remote write is falling behind | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotewritebehind) | -| PrometheusRuleFailures | Rule evaluations failing | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusrulefailures) | -| PrometheusTargetSyncFailure | Target sync failed | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustargetsyncfailure) | - -#### Prometheus — warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| PrometheusNotConnectedToAlertmanagers | Can't reach any Alertmanagers | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotconnectedtoalertmanagers) | -| PrometheusNotIngestingSamples | Not ingesting samples | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotingestingsamples) | -| PrometheusHighQueryLoad | Hitting max concurrent query capacity | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheushighqueryload) | -| PrometheusDuplicateTimestamps | Dropping samples with duplicate timestamps | 
[link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusduplicatetimestamps) | -| PrometheusOutOfOrderTimestamps | Dropping out-of-order samples | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusoutofordertimestamps) | -| PrometheusNotificationQueueRunningFull | Alert queue predicted to fill up within 30m | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusnotificationqueuerunningfull) | -| PrometheusErrorSendingAlertsToSomeAlertmanagers | Errors sending to some (not all) Alertmanagers | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuserrorsendingalertstosomealertmanagers) | -| PrometheusRemoteWriteDesiredShards | Remote write wants more shards than configured | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusremotewritedesiredshards) | -| PrometheusTSDBCompactionsFailing | Block compaction failing | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustsdbcompactionsfailing) | -| PrometheusTSDBReloadsFailing | Block reload failing | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustsdbreloadsfailing) | -| PrometheusKubernetesListWatchFailures | SD list/watch requests failing | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuskuberneteslistwatchfailures) | -| PrometheusMissingRuleEvaluations | Rule group evaluation too slow, skipping evaluations | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusmissingruleevaluations) | -| PrometheusSDRefreshFailure | Service discovery refresh failing | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheussdrefreshfailure) | -| PrometheusLabelLimitHit | Dropping targets that exceed label limits | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheuslabellimithit) | -| PrometheusTargetLimitHit | Dropping targets that exceed target 
limits | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheustargetlimithit) | -| PrometheusScrapeBodySizeLimitHit | Dropping targets that exceed body size limit | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusscrapebodysizelimithit) | -| PrometheusScrapeSampleLimitHit | Dropping scrapes that exceed sample limit | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus/prometheusscrapesamplelimithit) | - -#### Prometheus Operator — warning - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| ConfigReloaderSidecarErrors | Config reloader sidecar failing for 10m | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/configreloadersidecarerrors) | -| PrometheusOperatorListErrors | List operation errors | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorlisterrors) | -| PrometheusOperatorWatchErrors | Watch operation errors | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorwatcherrors) | -| PrometheusOperatorSyncFailed | Last reconciliation failed | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorsyncfailed) | -| PrometheusOperatorReconcileErrors | Reconciliation errors | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorreconcileerrors) | -| PrometheusOperatorNodeLookupErrors | Node lookup errors during reconciliation | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatornodelookuperrors) | -| PrometheusOperatorNotReady | Operator not ready | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatornotready) | -| PrometheusOperatorRejectedResources | Resources rejected by operator | 
[link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorrejectedresources) | -| PrometheusOperatorStatusUpdateErrors | Status update errors | [link](https://runbooks.prometheus-operator.dev/runbooks/prometheus-operator/prometheusoperatorstatusupdateerrors) | - -#### kube-state-metrics — critical - -| Alert | What it means | Runbook | -|-------|--------------|---------| -| KubeStateMetricsListErrors | List operations failing | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricslisterrors) | -| KubeStateMetricsWatchErrors | Watch operations failing | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricswatcherrors) | -| KubeStateMetricsShardingMismatch | Sharding misconfigured | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardingmismatch) | -| KubeStateMetricsShardsMissing | Shards missing | [link](https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardsmissing) | - -#### General - -| Alert | Severity | What it means | Runbook | -|-------|----------|--------------|---------| -| TargetDown | warning | A Prometheus scrape target is unreachable. Check `/targets` in Prometheus UI, verify ServiceMonitor config and network policies | [link](https://runbooks.prometheus-operator.dev/runbooks/general/targetdown) | -| Watchdog | none | Always-firing deadman switch — if this stops, Alertmanager is broken | [link](https://runbooks.prometheus-operator.dev/runbooks/general/watchdog) | -| InfoInhibitor | none | Suppresses info-level alerts when higher-severity alerts are already firing for the same target | [link](https://runbooks.prometheus-operator.dev/runbooks/general/infoinhibitor) | - -#### Info - -These don't page or post to Slack. They exist for dashboards and as context when other alerts are firing. 
-
-| Alert | What it means | Runbook |
-|-------|--------------|---------|
-| CPUThrottlingHigh | Processes getting CPU-throttled | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh) |
-| KubeNodeEviction | Node is evicting pods | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeeviction) |
-| KubeNodePressure | Node has an active pressure condition (memory, disk, PID) | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodepressure) |
-| KubeQuotaAlmostFull | Namespace quota approaching limit | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull) |
-| KubeQuotaFullyUsed | Namespace quota fully consumed | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused) |
-| KubeletTooManyPods | Kubelet running at pod capacity | [link](https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods) |
-| NodeCPUHighUsage | High CPU usage on node | [link](https://runbooks.prometheus-operator.dev/runbooks/node/nodecpuhighusage) |
-
-## Adding a new alert
-
-Alerts live in PrometheusRule manifests under `eks-cluster/helm-chart/observability/templates/`. Add yours to an existing file if it fits, or create a new `prometheus-alerts-<service>.yaml` if you're adding a group for a new service.
-
-```yaml
-apiVersion: monitoring.coreos.com/v1
-kind: PrometheusRule
-metadata:
-  name: <service>-alerts
-  namespace: {{ .Release.Namespace }}
-  labels:
-    app.kubernetes.io/name: <service>-alerts
-spec:
-  groups:
-    - name: <service>-health
-      rules:
-        - alert: SomethingBad
-          expr: <promql expression>
-          for: 5m
-          labels:
-            severity: warning
-          annotations:
-            summary: "Short description with {{ "{{" }} $labels.region {{ "}}" }}"
-            description: "Longer explanation of what's happening."
-```
-
-### Things to keep in mind
-
-- **Always set `severity`** to `critical` or `warning`. No severity = no routing.
-- **Exclude customer workloads** if your alert is pod-level.
Add `container!="deployment"` to the metric selector and use `unless on(namespace, pod) kube_pod_labels{label_app_kubernetes_io_managed_by="krane"}` to skip krane-managed pods. We don't want to page ourselves for a customer's broken container. -- **Include region in the summary** when the data has it. Getting paged with "SomethingBad (us-east-1 / api)" is a lot more useful than just "SomethingBad." -- **Set a reasonable `for` duration.** Too short and you get flapping alerts. Too long and you find out late. 2-5 minutes is a good starting point for most things. -- **Think about staging vs. production.** The `environment` external label is available on all metrics (`production001` or `staging`). If you want different severity per environment, write two rules with different `environment` filters — one critical for production, one warning for staging. - -## incident.io - -Alert routing config and API usage is documented in [incident.io](/infra/observability/incident-io). diff --git a/docs/engineering/infra/observability/checkly.mdx b/docs/engineering/infra/observability/checkly.mdx deleted file mode 100644 index 9f65d8943e..0000000000 --- a/docs/engineering/infra/observability/checkly.mdx +++ /dev/null @@ -1,104 +0,0 @@ ---- -title: Checkly Alerts -description: Configuration and management of alerts in checklyhq.com ---- - -Synthetic monitoring for the Unkey API. Checkly runs checks from real locations worldwide and alerts via incident.io when something fails. - -## Checks - -### API checks - -These run every minute from multiple regions, hitting the production API and validating the response. 
- -| Check | Endpoint | Locations | Degraded | Max | -|-------|----------|-----------|----------|-----| -| `/v2/keys.verifyKey` | `POST https://api.unkey.com/v2/keys.verifyKey` | 22 | 500ms | 10s | -| `/v2/ratelimit.limit` | `POST https://api.unkey.com/v2/ratelimit.limit` | 7 | 100ms | 1s | -| `/v2/liveness` | `GET https://api.unkey.com/v2/liveness` | 22 | 5s | 20s | - -- **Degraded**: response time above this threshold marks the check as degraded (not failed) -- **Max**: response time above this is a failure - -All three use run-based escalation with a threshold of 1 — a single failed run triggers the alert. Alerts go to incident.io via the webhook channel. - -### Heartbeat checks - -Heartbeats work the other way around — instead of Checkly calling an endpoint, a service pings Checkly on a schedule. If Checkly doesn't hear back within the expected window, it alerts. - -#### Production - -| Check | What it monitors | Period | Grace | Alert channel | -|-------|-----------------|--------|-------|---------------| -| Certificate Workflow | Restate cert renewal cron job | 1 day | 2 hours | Incident.io (Production) | -| Workflows: Refill | Restate key refill cron job | 1 day | 1 hour | none | -| Workflows: Count Keys | Restate count keys job | 5 min | 1 min | none | -| Quota Check | Restate quota check cron job | 1 day | 1 hour | none | - -"Workflows: Refill", "Workflows: Count Keys", and "Quota Check" don't have alert channel subscriptions — they'll show as failed in the Checkly dashboard but won't page anyone. 
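Whether or not a heartbeat has an alert channel, the job should only ping after a clean run, so that a missed ping actually means a failed job. A minimal sketch of that pattern (the wrapper function and the ping URL placeholder are hypothetical, not something that exists in the repo):

```shell
# run_with_heartbeat: run a job command, ping the Checkly heartbeat URL only
# if it succeeded. Illustrative helper — adapt into the CronJob script.
run_with_heartbeat() {
  ping_url="$1"
  shift
  if "$@"; then
    # -f turns HTTP errors into a non-zero exit, -sS stays quiet except on error
    curl -fsS "$ping_url" > /dev/null
  else
    echo "job failed; skipping heartbeat ping" >&2
    return 1
  fi
}

# Example (hypothetical URL and job command):
# run_with_heartbeat "https://ping.checklyhq.com/<check-uuid>" renew-certificates
```

On failure the ping is skipped entirely, so Checkly's grace window expires and the check goes red instead of silently staying green.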
- -#### Staging - -| Check | What it monitors | Period | Grace | Alert channel | -|-------|-----------------|--------|-------|---------------| -| Certificate Workflow (Staging) | Staging cert renewal cron job | 1 day | 2 hours | Incident.io (Staging) | -| Key Refill (Staging) | Staging key refill cron job | 1 day | 1 hour | Incident.io (Staging) | -| Quota Check (Staging) | Staging quota check cron job | 1 day | 1 hour | Incident.io (Staging) | - -Staging heartbeat ping URLs (configure in the staging Restate CronJobs): - -| Check | Ping URL | -|-------|----------| -| Certificate Workflow (Staging) | `https://ping.checklyhq.com/2b65541f-7de4-4fa6-8c07-44105fac729a` | -| Quota Check (Staging) | `https://ping.checklyhq.com/e3979a6b-18d5-4e79-ad0d-9ea36044b922` | -| Key Refill (Staging) | `https://ping.checklyhq.com/8077fc90-3c4a-453a-9b40-ed10c69f7bf7` | - -Each CronJob needs to ping its URL after a successful run — a `curl -s "${PING_URL}"` at the end of the job script. - -## Alert channels - -Two webhook alert channels, one per environment. Both send failure and recovery events (`sendFailure: true`, `sendRecovery: true`). Degraded events are not sent. The payload template includes the check ID as the deduplication key, so incident.io groups alerts per check and auto-resolves when the recovery comes in. - -| Channel | ID | incident.io source | URL | -|---------|----|--------------------|-----| -| Incident.io (Production) | `218874` | Checkly (Production) | `https://api.incident.io/v2/alert_events/checkly/01HZ7GE7CASMF15RWV1QFCMKTQ` | -| Incident.io (Staging) | `273211` | Checkly (Staging) | `https://api.incident.io/v2/alert_events/checkly/01KKZJERF1HQFJRE3XWFZ0NV19` | - -The authorization token for each webhook is the corresponding incident.io alert source's `secret_token`. If you need to rotate one, get the new token from https://app.incident.io/unkey/settings/alert-sources, then update the webhook header in Checkly's alert channel settings. 
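The payload template is shaped roughly like this (a hedged sketch: the field names follow incident.io's HTTP alert events API, and `{{ALERT_TITLE}}`, `{{CHECK_ID}}`, and `{{RESULT_LINK}}` are Checkly webhook template variables — verify both against the respective docs, and note the real template switches `status` between firing and resolved based on the event type):

```json
{
  "title": "{{ALERT_TITLE}}",
  "deduplication_key": "{{CHECK_ID}}",
  "status": "firing",
  "source_url": "{{RESULT_LINK}}"
}
```

Because `deduplication_key` is the check ID, a later recovery event with the same key resolves the existing alert instead of opening a new one.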
- -## How it routes through incident.io - -- **Production checks** → Checkly (Production) source → Production Alerts route → pages on-call -- **Staging checks** → Checkly (Staging) source → Staging Notifications route → Slack #alerts only, no page - -See [incident.io](/infra/observability/incident-io) for the full routing table. - -## Environment variables - -| Variable | Purpose | -|----------|---------| -| `UNKEY_KEY` | API key used by checks to authenticate with the Unkey API | -| `UNKEY_ROOT_KEY` | Root key used by checks that need elevated access | - -Both are marked as secrets in Checkly — the API won't return their values. If you need to rotate them, update them in the Checkly UI under Account Settings → Environment Variables. - -## Backups - -```bash -CHECKLY_API_KEY="cu_..." CHECKLY_ACCOUNT_ID="bff70c2d-4206-4e3f-a447-4eeacd4eb03e" ./contrib/backup-checkly.sh -``` - -Restore: - -```bash -CHECKLY_API_KEY="cu_..." CHECKLY_ACCOUNT_ID="bff70c2d-4206-4e3f-a447-4eeacd4eb03e" ./contrib/restore-checkly.sh -``` - -## Adding a new check - -Easiest through the Checkly UI at https://app.checkly.com. After creating it: - -1. Subscribe it to the right alert channel: - - Production checks → `218874` (Incident.io Production) — pages on-call - - Staging checks → `273211` (Incident.io Staging) — Slack only -2. Run the backup script and commit the updated `backups/checkly/checks.json` diff --git a/docs/engineering/infra/observability/incident-io.mdx b/docs/engineering/infra/observability/incident-io.mdx deleted file mode 100644 index 55786f9a19..0000000000 --- a/docs/engineering/infra/observability/incident-io.mdx +++ /dev/null @@ -1,122 +0,0 @@ ---- -title: incident.io -description: Alert routing config and API usage. ---- - -Alert routing lives in incident.io, not in this repo. If you need to create or modify alert routes, you'll need an API key. - -## Creating an API key - -1. Go to https://app.incident.io/unkey/settings/api-keys/create -2. 
Give it a descriptive name (e.g. "Alert route updates - yourname") -3. Enable these scopes: - - **View data, like public incidents and organization settings** - - **Create and manage on-call resources** -4. Create the key and save it somewhere safe - -That's enough for reading and modifying alert routes. Please revoke it when you're done — don't leave keys sitting around. - -## Current alert routes - -| Route | Source | Condition | Escalation | Slack | -|-------|--------|-----------|------------|-------| -| Production Alerts | Alertmanager (Prod), Checkly (Prod), sev0@, Grafana Cloud | `severity=critical` (Alertmanager only) | Engineering On-Call | #alerts | -| Production Warnings | Alertmanager (Prod) | `severity=warning` | None | #alerts | -| Unrouted Alerts | Alertmanager (Prod) | Labels missing `severity` key | None | #alerts | -| Database Alerts | PlanetScale | `branch=main` | Engineering On-Call | #alerts | -| Staging Notifications | Alertmanager (Staging), Checkly (Staging) | — | None | #alerts | -| Informational | Axiom, Status Page Views | — | None | #alerts | - -## Secrets - -Alertmanager authenticates with incident.io using per-source tokens. Each alert source in incident.io has a `secret_token` — Alertmanager sends this in every webhook payload to prove it's legit. - -### Where the tokens live - -| Secret | Where | What | -|--------|-------|------| -| `unkey/alertmanager-incident-io` | AWS Secrets Manager (production) | Contains `alert_source_token` — the token for the Alertmanager (Production) source | -| `unkey/alertmanager-incident-io` | AWS Secrets Manager (staging) | Same key name, different value — the token for the Alertmanager (Staging) source | - -These are pulled into each cluster as a Kubernetes secret (`alertmanager-incidentio`) by External Secrets Operator. The ExternalSecret is defined in `eks-cluster/helm-chart/observability/templates/incidentio-external-secret.yaml`. 
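The shape of that ExternalSecret is roughly as follows (a sketch, not the exact manifest from the repo: the `ClusterSecretStore` name is an assumption, while the secret names, `refreshInterval`, and the `token` key come from this page):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-incidentio
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager            # assumed store name
  target:
    name: alertmanager-incidentio        # Kubernetes secret ESO creates
  data:
    - secretKey: token                   # file name under the volume mount
      remoteRef:
        key: unkey/alertmanager-incident-io   # AWS Secrets Manager secret
        property: alert_source_token          # JSON property inside it
```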
- -Alertmanager reads the token from a file mount at `/etc/alertmanager/secrets/alertmanager-incidentio/token`. - -### Where to find the tokens - -If you need to rotate or re-create a token: - -1. Go to https://app.incident.io/unkey/settings/alert-sources -2. Click the source (e.g. "Alertmanager (Production)") -3. The `secret_token` is shown in the source config - -Then update the AWS secret: - -```bash -aws secretsmanager put-secret-value \ - --secret-id "unkey/alertmanager-incident-io" \ - --secret-string "{\"alert_source_token\": \"${NEW_TOKEN}\"}" \ - --region us-east-1 \ - --profile "${AWS_PROFILE}" -``` - -External Secrets polls every hour (`refreshInterval: 1h`). To force an immediate sync: - -```bash -kubectl annotate externalsecret alertmanager-incidentio -n monitoring \ - force-sync=$(date +%s) --overwrite -``` - -Then restart Alertmanager to pick up the new token: - -```bash -kubectl delete pod -n monitoring -l app.kubernetes.io/name=alertmanager -``` - -### Alert source URLs - -Each environment's Alertmanager points to a different incident.io alert source. The URL is in the Alertmanager config under `incidentio_configs`: - -| Environment | Alert source | URL | -|-------------|-------------|-----| -| Production | Alertmanager (Production) | `https://api.incident.io/v2/alert_events/alertmanager/01KHSNW67SZJ1KGWEMA9C0GPT7` | -| Staging | Alertmanager (Staging) | `https://api.incident.io/v2/alert_events/alertmanager/01KH7FNXWPMH6KPCQTYJJY947G` | -| Production | Checkly (Production) | `https://api.incident.io/v2/alert_events/checkly/01HZ7GE7CASMF15RWV1QFCMKTQ` | -| Staging | Checkly (Staging) | `https://api.incident.io/v2/alert_events/checkly/01KKZJERF1HQFJRE3XWFZ0NV19` | - -The base values file (`values.yaml`) defaults to the staging Alertmanager source. Production environment files override this to point to the production source. Checkly sources are configured via webhook alert channels in the Checkly UI — see [Checkly](/infra/observability/checkly). 
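Wired together, the production receiver looks roughly like this (a sketch: `incidentio_configs` and the token file mount are described above, but the exact receiver field names — e.g. `alert_source_token_file` — should be checked against your Alertmanager version's configuration reference):

```yaml
receivers:
  - name: incident-io
    incidentio_configs:
      - url: https://api.incident.io/v2/alert_events/alertmanager/01KHSNW67SZJ1KGWEMA9C0GPT7  # production source
        alert_source_token_file: /etc/alertmanager/secrets/alertmanager-incidentio/token
```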
- -## API reference - -The incident.io API docs are at https://docs.incident.io/api-reference/introduction. The alert routes endpoints you'll mostly use: - -- `GET /v2/alert_routes/{id}` — get a route's full config (including the `version` field you need for updates) -- `PUT /v2/alert_routes/{id}` — update a route (requires `version: current_version + 1`) -- `POST /v2/alert_routes` — create a new route -- `GET /v2/alert_routes?page_size=25` — list all routes - -The PUT version field is annoying — the list endpoint returns stale version numbers, so always fetch from the detail endpoint before updating. - -## Backing up route configs - -Before making changes, dump the current state so you can restore if something goes wrong. There's a script for this: - -```bash -INCIDENT_IO_TOKEN="inc_..." ./contrib/backup-incident-io.sh -``` - -By default it writes to `backups/incident.io/`. You can pass a different directory as the first argument: - -```bash -INCIDENT_IO_TOKEN="inc_..." ./contrib/backup-incident-io.sh /tmp/my-backup -``` - -Backups are JSON and should be committed to the repo so we have a record of the state. - -### Restoring - -```bash -INCIDENT_IO_TOKEN="inc_..." ./contrib/restore-incident-io.sh -``` - -It reads from `backups/incident.io/` by default (or pass a different directory), shows you what it's about to restore, and asks for confirmation before overwriting anything. It handles the version bookkeeping automatically. diff --git a/docs/engineering/infra/secrets/aws-secrets.mdx b/docs/engineering/infra/secrets/aws-secrets.mdx deleted file mode 100644 index b8db25c496..0000000000 --- a/docs/engineering/infra/secrets/aws-secrets.mdx +++ /dev/null @@ -1,216 +0,0 @@ ---- -title: AWS Secrets Manager - Required Secrets -description: Required secrets for EKS cluster infrastructure. ---- - -This document lists all AWS Secrets Manager secrets required for the EKS cluster infrastructure. - -**Note:** All secrets are stored as JSON objects. 
ExternalSecrets extract individual properties from these JSON secrets. - -## Shared Secrets - -**AWS Secret Name:** `unkey/shared` - -```json -{ - "UNKEY_DATABASE_PRIMARY": "mysql://...", - "UNKEY_DATABASE_REPLICA": "mysql://...", - "UNKEY_CLICKHOUSE_URL": "https://...", - "UNKEY_VAULT_MASTER_KEYS": "...", - "UNKEY_VAULT_S3_URL": "https://...", - "UNKEY_VAULT_S3_BUCKET": "...", - "UNKEY_VAULT_S3_ACCESS_KEY_ID": "...", - "UNKEY_VAULT_S3_ACCESS_KEY_SECRET": "...", - "UNKEY_REGISTRY_URL": "ghcr.io/...", - "UNKEY_REGISTRY_USERNAME": "...", - "UNKEY_REGISTRY_PASSWORD": "...", - "GRAFANA_ADMIN_USER": "admin", - "GRAFANA_ADMIN_PASSWORD": "..." -} -``` - -| Property | Used By | Description | -|----------|---------|-------------| -| `UNKEY_DATABASE_PRIMARY` | control, frontline, sentinel | Primary database connection string | -| `UNKEY_DATABASE_REPLICA` | frontline, sentinel | Replica database connection string | -| `UNKEY_CLICKHOUSE_URL` | control | ClickHouse analytics database URL | -| `UNKEY_VAULT_MASTER_KEYS` | control, krane, frontline | Vault encryption master keys | -| `UNKEY_VAULT_S3_URL` | control, krane, frontline | Vault S3 storage endpoint URL | -| `UNKEY_VAULT_S3_BUCKET` | control, krane, frontline | Vault S3 bucket name | -| `UNKEY_VAULT_S3_ACCESS_KEY_ID` | control, krane, frontline | Vault S3 access key ID | -| `UNKEY_VAULT_S3_ACCESS_KEY_SECRET` | control, krane, frontline | Vault S3 secret access key | -| `UNKEY_REGISTRY_URL` | control, krane | Container registry URL | -| `UNKEY_REGISTRY_USERNAME` | control, krane | Container registry username | -| `UNKEY_REGISTRY_PASSWORD` | control, krane | Container registry password | -| `GRAFANA_ADMIN_USER` | observability | Grafana admin username | -| `GRAFANA_ADMIN_PASSWORD` | observability | Grafana admin password | - -## Service-Specific Secrets - -### control (Control Plane) - -**AWS Secret Name:** `unkey/control` - -```json -{ - "UNKEY_AUTH_TOKEN": "...", - "UNKEY_BUILD_S3_URL": "https://...", - 
"UNKEY_BUILD_S3_BUCKET": "...", - "UNKEY_BUILD_S3_ACCESS_KEY_ID": "...", - "UNKEY_BUILD_S3_ACCESS_KEY_SECRET": "...", - "UNKEY_RESTATE_API_KEY": "...", - "UNKEY_ACME_ROUTE53_ENABLED": "true", - "UNKEY_ACME_ROUTE53_ACCESS_KEY_ID": "...", - "UNKEY_ACME_ROUTE53_SECRET_ACCESS_KEY": "...", - "UNKEY_ACME_ROUTE53_REGION": "us-east-1" -} -``` - -| Property | Description | -|----------|-------------| -| `UNKEY_AUTH_TOKEN` | Authentication token for control API | -| `UNKEY_BUILD_S3_URL` | Build artifacts S3 endpoint URL | -| `UNKEY_BUILD_S3_BUCKET` | Build artifacts S3 bucket name | -| `UNKEY_BUILD_S3_ACCESS_KEY_ID` | Build S3 access key ID | -| `UNKEY_BUILD_S3_ACCESS_KEY_SECRET` | Build S3 secret access key | -| `UNKEY_RESTATE_API_KEY` | Restate Cloud admin API token | -| `UNKEY_ACME_ROUTE53_ENABLED` | Set to "true" to enable Route53 provider (optional) | -| `UNKEY_ACME_ROUTE53_ACCESS_KEY_ID` | Route53 access key ID for ACME (optional) | -| `UNKEY_ACME_ROUTE53_SECRET_ACCESS_KEY` | Route53 secret access key for ACME (optional) | -| `UNKEY_ACME_ROUTE53_REGION` | Route53 region (optional) | - -### krane - -**AWS Secret Name:** `unkey/krane` - -```json -{ - "UNKEY_CONTROL_PLANE_BEARER": "..." -} -``` - -| Property | Description | -|----------|-------------| -| `UNKEY_CONTROL_PLANE_BEARER` | Bearer token for control plane authentication | - -### argocd - -**AWS Secret Name:** `unkey/argocd` - -```json -{ - "github-webhook-secret": "...", - "slack-token": "...", - "admin.password": "$2a$10$..." -} -``` - -| Property | Used By | Description | -|----------|---------|-------------| -| `github-webhook-secret` | argocd | GitHub webhook secret for ArgoCD notifications | -| `slack-token` | argocd | Slack token for ArgoCD notifications | -| `admin.password` | argocd | bcrypt-hashed admin password (merged into `argocd-secret`) | - -**Note:** `admin.password` must be a bcrypt hash. See [Rotating the ArgoCD Admin Password](#rotating-the-argocd-admin-password) below. 
- -## Secret Usage by Service - -| Service | JSON Secret | Properties Used | -|---------|-------------|-----------------| -| **control** | `unkey/shared` | DATABASE_PRIMARY, CLICKHOUSE_URL, VAULT_*, REGISTRY_* | -| **control** | `unkey/control` | AUTH_TOKEN, BUILD_S3_*, RESTATE_API_KEY, ACME_ROUTE53_* | -| **krane** | `unkey/shared` | VAULT_*, REGISTRY_* | -| **krane** | `unkey/krane` | CONTROL_PLANE_BEARER | -| **frontline** | `unkey/shared` | DATABASE_PRIMARY, DATABASE_REPLICA, VAULT_* | -| **sentinel** | `unkey/shared` | DATABASE_PRIMARY, DATABASE_REPLICA | -| **observability** | `unkey/shared` | GRAFANA_ADMIN_USER, GRAFANA_ADMIN_PASSWORD | -| **restate** | `unkey/control` | RESTATE_API_KEY | -| **argocd** | `unkey/argocd` | github-webhook-secret, slack-token, admin.password | - -## Rotating the ArgoCD Admin Password - -The ArgoCD admin password is stored as a bcrypt hash in AWS Secrets Manager and synced into each cluster by External Secrets Operator. To rotate it: - -### 1. Generate a bcrypt hash of the new password - -```bash -# Using htpasswd (if installed) -HASH=$(htpasswd -nbBC 10 "" "NEW_PASSWORD_HERE" | tr -d ':\n' | sed 's/$2y/$2a/') - -# Or using docker -HASH=$(docker run --rm httpd:2-alpine htpasswd -nbBC 10 "" "NEW_PASSWORD_HERE" | tr -d ':\n' | sed 's/$2y/$2a/') -``` - -### 2. 
Update the AWS secret - -Replace `PROFILE` and `REGION` for your target environment: - -- **Staging:** `--profile unkey-sandbox-admin --region eu-central-1` -- **Production:** `--profile unkey-production001-admin --region us-east-1` - -```bash -CURRENT=$(aws secretsmanager get-secret-value \ - --profile PROFILE --region REGION \ - --secret-id "unkey/argocd" --query SecretString --output text) - -UPDATED=$(echo "$CURRENT" | jq --arg h "$HASH" '.["admin.password"] = $h') - -aws secretsmanager put-secret-value \ - --profile PROFILE --region REGION \ - --secret-id "unkey/argocd" --secret-string "$UPDATED" -``` - -The secret automatically replicates to other regions via AWS-native replication. - -### 3. Wait for ESO to sync (or force it) - -The ArgoCD ExternalSecrets poll every 1 minute (most services use `1m`, some like incident.io and restate use `1h`). Verify the sync: - -```bash -kubectl --context CONTEXT -n argocd get externalsecret argocd-admin-password -``` - -### 4. Restart argocd-server - -ArgoCD caches the password in memory, so a restart is required: - -```bash -kubectl --context CONTEXT -n argocd rollout restart deployment argocd-server -``` - -### 5. Verify - -Log in to the ArgoCD UI with username `admin` and the new password. 
- -## Creating/Updating Secrets - -Create a JSON secret: - -```bash -aws secretsmanager create-secret \ - --name "unkey/shared" \ - --secret-string '{"UNKEY_DATABASE_PRIMARY":"mysql://...","UNKEY_DATABASE_REPLICA":"mysql://..."}' \ - --region us-east-1 -``` - -Update an existing JSON secret: - -```bash -aws secretsmanager put-secret-value \ - --secret-id "unkey/shared" \ - --secret-string '{"UNKEY_DATABASE_PRIMARY":"mysql://...","UNKEY_DATABASE_REPLICA":"mysql://..."}' \ - --region us-east-1 -``` - -To update a single property without replacing the entire secret: - -```bash -# Get current secret -CURRENT=$(aws secretsmanager get-secret-value --secret-id "unkey/shared" --query SecretString --output text) - -# Update single property with jq -UPDATED=$(echo "$CURRENT" | jq '.UNKEY_DATABASE_PRIMARY = "new-value"') - -# Put updated secret -aws secretsmanager put-secret-value --secret-id "unkey/shared" --secret-string "$UPDATED" -```
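The read-modify-write dance above can be wrapped in a small helper (a hypothetical convenience, not an existing script in the repo; splitting out the pure `jq` step keeps the merge logic easy to sanity-check without touching AWS):

```shell
# json_set: return a JSON document with one string property replaced.
# Pure jq, no AWS calls.
json_set() {
  printf '%s' "$1" | jq -c --arg k "$2" --arg v "$3" '.[$k] = $v'
}

# update_secret_property: read-modify-write a single key of a JSON secret.
# Hypothetical helper — add --profile/--region for your target environment.
update_secret_property() {
  current=$(aws secretsmanager get-secret-value \
    --secret-id "$1" --query SecretString --output text)
  aws secretsmanager put-secret-value \
    --secret-id "$1" --secret-string "$(json_set "$current" "$2" "$3")"
}

# Example — equivalent to the three commands above:
# update_secret_property "unkey/shared" "UNKEY_DATABASE_PRIMARY" "new-value"
```

Keeping the merge in `jq` (rather than pasting a whole new JSON blob) means a rotation of one property can never silently drop the others.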