Merged
109 commits
8ae7070
feat: add cloudflare-metrics worker that exports graphql analytics to…
zackpollard Apr 10, 2026
483c6f2
feat(cloudflare-metrics): add grafana overview dashboard
zackpollard Apr 10, 2026
547c1e3
fix(cloudflare-metrics): hardcode analytics read permission group uuid
zackpollard Apr 10, 2026
f993bca
fix(cloudflare-metrics): inject analytics api token instead of creati…
zackpollard Apr 10, 2026
c22bbdc
chore(cloudflare-metrics): defer analytics token to follow-up
zackpollard Apr 10, 2026
8277172
chore(cloudflare-metrics): add empty preview_url output for ci previe…
zackpollard Apr 10, 2026
fb0fe8f
feat(cloudflare-metrics): provision analytics token via terraform usi…
zackpollard Apr 10, 2026
1f25745
fix(cloudflare-metrics): pin api token value in terraform_data to sur…
zackpollard Apr 10, 2026
c2ab590
fix(cloudflare-metrics): force token recreation to capture fresh valu…
zackpollard Apr 10, 2026
c1a57fa
fix(cloudflare-metrics): force api token replacement via generation t…
zackpollard Apr 10, 2026
811d5b7
chore(cloudflare-metrics): bump token generation to trigger rotation
zackpollard Apr 10, 2026
a114cb3
debug(cloudflare-metrics): use token value directly and add length di…
zackpollard Apr 10, 2026
71d0fbc
fix(cloudflare-metrics): mark diagnostic output as sensitive
zackpollard Apr 10, 2026
7b97fe8
debug(cloudflare-metrics): expose token value length via nonsensitive…
zackpollard Apr 10, 2026
69091f2
debug(cloudflare-metrics): expose account id length and env outputs
zackpollard Apr 10, 2026
75d6b1d
chore: ignore local deployment/.env.* files
zackpollard Apr 10, 2026
9132095
debug(cloudflare-metrics): add diagnostic subrequest beacon with bind…
zackpollard Apr 10, 2026
e366336
fix(cloudflare-metrics): look up fetch lazily at call time via global…
zackpollard Apr 10, 2026
80c25de
chore(cloudflare-metrics): remove debug diagnostics now that pipeline…
zackpollard Apr 10, 2026
e9fb58a
feat(cloudflare-metrics): enrich d1/queue/zone metrics with resource …
zackpollard Apr 10, 2026
aea3522
fix(cloudflare-metrics): handle duplicate permission group names in l…
zackpollard Apr 10, 2026
74039d2
fix(cloudflare-metrics): resolve pages zones individually via /zones/…
zackpollard Apr 10, 2026
00c7c79
fix(cloudflare-metrics): grant zone read permission for per-zone lookups
zackpollard Apr 10, 2026
4858b42
fix(cloudflare-metrics): use match-all wildcard for zone read resourc…
zackpollard Apr 10, 2026
55a3f99
feat(cloudflare-metrics): 1-minute granularity and 4 new datasets
zackpollard Apr 10, 2026
85de619
fix(cloudflare-metrics): avoid subrequest limit by caching zones and …
zackpollard Apr 10, 2026
84bd8d0
feat(cloudflare-metrics): batch graphql queries to avoid subrequest l…
zackpollard Apr 10, 2026
0738acb
feat(cloudflare-metrics): add dashboard panels for new datasets
zackpollard Apr 10, 2026
210d356
fix(cloudflare-metrics): remove hardcoded 5m interval from dashboard …
zackpollard Apr 10, 2026
84558ea
docs(cloudflare-metrics): add readme covering scope, datasets, and todos
zackpollard Apr 10, 2026
e4011c4
feat(cloudflare-metrics): every-minute cron, cross-invocation caches,…
zackpollard Apr 10, 2026
9e9e61b
feat(cloudflare-metrics): expand account-scope dataset coverage
zackpollard Apr 10, 2026
497ac30
feat(cloudflare-metrics): add zone-scope dataset coverage
zackpollard Apr 10, 2026
9887753
feat(cloudflare-metrics): restructure dashboards into per-product layout
zackpollard Apr 11, 2026
dc34bf1
fix(cloudflare-metrics): chunk account batches under graphql node limit
zackpollard Apr 11, 2026
bd412f5
fix(cloudflare-metrics): align dashboards with actual emitted metrics
zackpollard Apr 11, 2026
f9f2651
feat(cloudflare-metrics): add grafana alerts for collector and flush …
zackpollard Apr 11, 2026
d5b15c9
fix(cloudflare-metrics): invert "collector not running" alert logic
zackpollard Apr 11, 2026
b8e61f8
fix(cloudflare-metrics): simplify "collector not running" alert
zackpollard Apr 11, 2026
55d1700
feat(cloudflare-metrics): link every alert to its exporter-health panel
zackpollard Apr 11, 2026
03c5d63
refactor(cloudflare-metrics): drop /collect manual trigger endpoint
zackpollard Apr 11, 2026
b42adce
test(cloudflare-metrics): split monolithic test file into per-module …
zackpollard Apr 11, 2026
9720a0b
test(cloudflare-metrics): cover flush self-telemetry and scheduled ha…
zackpollard Apr 11, 2026
ac40872
ci(cloudflare-metrics): run integration tests on path-filtered PRs
zackpollard Apr 11, 2026
dec8c19
fix(cloudflare-metrics): align scheduled handler signature + skip int…
zackpollard Apr 11, 2026
e22a0fd
refactor(cloudflare-metrics): split metrics.ts by concern
zackpollard Apr 12, 2026
c0da950
refactor(cloudflare-metrics): split graphql query builders from trans…
zackpollard Apr 12, 2026
8abe01c
refactor(cloudflare-metrics): split collector into resource-cache + e…
zackpollard Apr 12, 2026
3b30d6a
refactor(cloudflare-metrics): move handlers out of index.ts
zackpollard Apr 12, 2026
a12f263
refactor(cloudflare-metrics): migrate call sites to canonical import …
zackpollard Apr 12, 2026
34a871c
chore: bump github actions to latest + remove unnecessary comments
zackpollard Apr 12, 2026
585ff7a
docs(cloudflare-metrics): rewrite readme to reflect current state
zackpollard Apr 12, 2026
0002809
feat(cloudflare-metrics): emit isolate age metric to track recycling
zackpollard Apr 12, 2026
61c993d
fix(cloudflare-metrics): use range query for collector liveness alert
zackpollard Apr 12, 2026
9daed09
fix(cloudflare-metrics): set usage_model=standard to avoid 50ms cpu l…
zackpollard Apr 12, 2026
d6ae40b
fix(cloudflare-metrics): backfill gaps on cold start + alert on worke…
zackpollard Apr 12, 2026
b08c322
fix: use increase() instead of rate() for dataset errors graph
zackpollard Apr 13, 2026
23cfdbf
fix: gap-aware backfill, raise cpu limit to 30s, add cpu time alert
zackpollard Apr 13, 2026
6e7878c
feat: add error detail tags and error details table to dashboards
zackpollard Apr 13, 2026
3f787d8
fix: retry graphql chunks when all fields error
zackpollard Apr 13, 2026
26a73d5
fix: don't count retried error responses in error_responses metric
zackpollard Apr 13, 2026
3454a9a
fix: reduce retry sleep to 250ms, add retry metrics and error logging
zackpollard Apr 13, 2026
0f7adf8
fix: add zone analytics read permission to api token
zackpollard Apr 14, 2026
4d56c27
fix: trigger worker version replacement on api token recreation
zackpollard Apr 14, 2026
7de6e69
fix: use account analytics read for zone-scoped policy, fix formatting
zackpollard Apr 14, 2026
05665fd
fix: use zone-scoped analytics read permission for zone analytics
zackpollard Apr 14, 2026
19d033b
fix: remove crossZoneSubrequests field from http_requests_detail dataset
zackpollard Apr 14, 2026
43656ce
fix: use date granularity for r2 storage dataset
zackpollard Apr 14, 2026
f59adee
fix: use date granularity for all storage/snapshot datasets
zackpollard Apr 14, 2026
f2816c8
chore: add debug logging for storage dataset collection
zackpollard Apr 14, 2026
8744e44
fix: wrap storage metric queries with last_over_time to fill gaps
zackpollard Apr 14, 2026
a0d499b
feat: add estimated billing panels to all dashboards
zackpollard Apr 14, 2026
54317a0
fix: only collect date-granularity datasets once per hour
zackpollard Apr 14, 2026
44cf5aa
fix: limit date-granularity datasets to 100 rows with DESC order
zackpollard Apr 14, 2026
03e5860
refactor: drop date granularity for storage datasets
zackpollard Apr 14, 2026
ca896b2
perf: reduce cpu usage in hot path
zackpollard Apr 14, 2026
f86a398
fix: prevent backfill death spiral
zackpollard Apr 14, 2026
695f3e3
perf: direct line protocol + tighter flush buffer cap
zackpollard Apr 15, 2026
4c6928e
chore: fix lint errors in line protocol escaping
zackpollard Apr 15, 2026
bb6dfcf
chore: format escape helpers
zackpollard Apr 15, 2026
838bc74
perf: precompute per-dataset tag enricher to avoid per-row switch
zackpollard Apr 15, 2026
3ba4936
chore: bump build marker to force fresh isolate
zackpollard Apr 15, 2026
c30bc43
fix: work around cloudflare terraform provider cpu_ms bug
zackpollard Apr 15, 2026
16608c7
fix: force standard usage_model on cloudflare-metrics service-env
zackpollard Apr 15, 2026
5e5f139
revert: roll back cpu-limit safety hacks now that usage_model is fixed
zackpollard Apr 15, 2026
dd50bc6
chore: set cpu_ms back to 30000
zackpollard Apr 15, 2026
1c375a0
fix: make service-env PATCH non-fatal in deploy
zackpollard Apr 15, 2026
ffbae55
feat: enable workers observability on cloudflare-metrics
zackpollard Apr 15, 2026
5ae515f
fix: use scheduledTime and widen window to prevent cron-miss gaps
zackpollard Apr 15, 2026
e5933b8
fix: subrequests-by-status panel had wrong label and rate() on gauge
zackpollard Apr 15, 2026
716282f
fix: replace rate() with raw metrics on all workers dashboard panels
zackpollard Apr 16, 2026
ce03a34
fix: align exporter-health dashboard queries to 1-minute intervals
zackpollard Apr 16, 2026
df19149
fix: connect data points in workers dashboard timeseries panels
zackpollard Apr 16, 2026
74b2194
fix: sweep all dashboards — remove rate(), connect data points
zackpollard Apr 16, 2026
969f7ed
fix: wrap all dashboard metrics with max_over_time to handle multi-co…
zackpollard Apr 16, 2026
f57810e
fix: use 1m window for max_over_time (was 2m)
zackpollard Apr 16, 2026
8f4f3cb
fix: replace all increase() with sum_over_time() for gauge metrics
zackpollard Apr 16, 2026
66e3f37
fix: correct units on all dashboard panels
zackpollard Apr 16, 2026
b982cf5
fix: guard against NaN/Infinity, div-by-zero, and add missing tests
zackpollard Apr 16, 2026
c33675c
chore: trigger redeploy to test usage_model regression
zackpollard Apr 17, 2026
380ec17
fix: set usage_model=standard on worker_version, drop unreliable post…
zackpollard Apr 17, 2026
953028a
test: add cpu-test worker with no usage_model to verify default
zackpollard Apr 17, 2026
1758b15
test: set usage_model=standard on cpu-test worker_version
zackpollard Apr 17, 2026
d65c0ca
chore: remove cpu-test scaffolding — experiment complete
zackpollard Apr 17, 2026
9f7b07a
test: redeploy cpu-test + force new cloudflare-metrics version
zackpollard Apr 17, 2026
1aeb1e1
test: force cloudflare-metrics redeploy to validate revert pattern
zackpollard Apr 18, 2026
7ba393a
test: redeploy cloudflare-metrics after cloudflare runtime fix
zackpollard Apr 21, 2026
b53dfc5
fix: repair failing format, tsc, and unit test checks
zackpollard Apr 22, 2026
01b5219
chore: remove cpu-test worker and FORCE_NEW_VERSION debug constant
zackpollard Apr 22, 2026
22 changes: 11 additions & 11 deletions .github/workflows/build.yml
@@ -18,13 +18,13 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
           persist-credentials: false
       - name: Setup pnpm
-        uses: pnpm/action-setup@a7487c7e89a18df4991f7f222e4898a00d66ddda # v4.1.0
+        uses: pnpm/action-setup@fc06bc1257f339d1d5d8b3a19a8cae5388b55320 # v5.0.0
       - name: Setup Node
-        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
+        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
         with:
           node-version-file: '.nvmrc'
       - name: Run pnpm install
@@ -34,7 +34,7 @@ jobs:
         run: pnpm build

       - name: Upload build output
-        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: build-output
           path: 'dist'
@@ -52,18 +52,18 @@ jobs:
       OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_TF_DEV_ENV }}
     steps:
       - name: Checkout code
-        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
           persist-credentials: false

       - name: Get build artifact
-        uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4
+        uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
         with:
           name: 'build-output'
           path: '${{ github.workspace }}/dist'

       - name: Install 1Password CLI
-        uses: 1password/install-cli-action@9a0c9dd934086b7ab1d90115d455bda1c53c2bdb # v2.0.2
+        uses: 1password/install-cli-action@8d006a0d0a4fd505af7f7ce589e7f768385ff5e4 # v3.0.0

       - name: Setup Mise
         uses: immich-app/devtools/actions/use-mise@697a75e2c3186d3c037c2c159855cf2d566542ba # use-mise-action-0.0.1
@@ -100,7 +100,7 @@ jobs:

       - name: Comment preview URLs
         if: ${{ github.event_name == 'pull_request' }}
-        uses: actions-cool/maintain-one-comment@4b2dbf086015f892dcb5e8c1106f5fccd6c1476b # v3.2.0
+        uses: actions-cool/maintain-one-comment@772214ac4bff3c578c76eed4ac19d08dc787c050 # v3.2.1
         with:
           number: ${{ github.event.number }}
           body: ${{ steps.preview-urls.outputs.body }}
@@ -120,18 +120,18 @@ jobs:
       OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_TF_PROD_ENV }}
     steps:
       - name: Checkout code
-        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
         with:
           persist-credentials: false

       - name: Get build artifact
-        uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4
+        uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
         with:
           name: 'build-output'
           path: '${{ github.workspace }}/dist'

       - name: Install 1Password CLI
-        uses: 1password/install-cli-action@9a0c9dd934086b7ab1d90115d455bda1c53c2bdb # v2.0.2
+        uses: 1password/install-cli-action@8d006a0d0a4fd505af7f7ce589e7f768385ff5e4 # v3.0.0

       - name: Setup Mise
         uses: immich-app/devtools/actions/use-mise@697a75e2c3186d3c037c2c159855cf2d566542ba # use-mise-action-0.0.1
69 changes: 66 additions & 3 deletions .github/workflows/test.yml
@@ -15,11 +15,13 @@
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          persist-credentials: false
       - name: Setup pnpm
-        uses: pnpm/action-setup@a7487c7e89a18df4991f7f222e4898a00d66ddda # v4.1.0
+        uses: pnpm/action-setup@fc06bc1257f339d1d5d8b3a19a8cae5388b55320 # v5.0.0
       - name: Setup Node
-        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
+        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
         with:
           node-version-file: '.nvmrc'

@@ -41,3 +43,64 @@
       - name: Run unit tests
         run: pnpm run test
         if: ${{ !cancelled() }}
+
+  integration-test-cloudflare-metrics:
+    name: Integration Test (cloudflare-metrics)
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          persist-credentials: false
+
+      - name: Detect relevant changes
+        id: changes
+        uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1
+        with:
+          filters: |
+            cloudflare-metrics:
+              - 'apps/cloudflare-metrics/**'
+              - '.github/workflows/test.yml'
+
+      - name: Setup pnpm
+        if: steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request'
+        uses: pnpm/action-setup@fc06bc1257f339d1d5d8b3a19a8cae5388b55320 # v5.0.0
+
+      - name: Setup Node
+        if: steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request'
+        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
+        with:
+          node-version-file: '.nvmrc'
+
+      - name: Run pnpm install
+        if: steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request'
+        run: pnpm install --frozen-lockfile
+
+      - name: Install 1Password CLI
+        if: steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request'
+        uses: 1password/install-cli-action@8d006a0d0a4fd505af7f7ce589e7f768385ff5e4 # v3.0.0
+
+      - name: Check integration credentials are available
+        id: creds
+        if: steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request'
+        continue-on-error: true
+        env:
+          OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_TF_DEV_ENV }}
+        run: |
+          if op run --env-file=apps/cloudflare-metrics/.integration.env -- sh -c 'test -n "$CLOUDFLARE_API_TOKEN" && test -n "$CLOUDFLARE_ACCOUNT_ID"' 2>/dev/null; then
+            echo "available=true" >> "$GITHUB_OUTPUT"
+          else
+            echo "available=false" >> "$GITHUB_OUTPUT"
+            echo "::warning::Skipping integration test — credentials not configured in 1Password"
+          fi
+
+      - name: Run integration tests
+        if: >-
+          (steps.changes.outputs.cloudflare-metrics == 'true' || github.event_name != 'pull_request')
+          && steps.creds.outputs.available == 'true'
+        env:
+          OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_TF_DEV_ENV }}
+        run: |
+          op run \
+            --env-file=apps/cloudflare-metrics/.integration.env \
+            -- pnpm --filter @immich-services/cloudflare-metrics run test:integration
4 changes: 4 additions & 0 deletions .gitignore
@@ -14,6 +14,9 @@ build/
 .wrangler/
 .dev.vars

+# Local-only env files (the tracked one is `deployment/.env`)
+deployment/.env.*
+
 # Logs
 logs/
 *.log
@@ -42,3 +45,4 @@ npm-debug.log*
 # Misc
 .cache/
 tmp/
+.claude/
12 changes: 12 additions & 0 deletions apps/cloudflare-metrics/.integration.env
@@ -0,0 +1,12 @@
# 1Password-templated env file loaded by `op run` in CI.
#
# To run integration tests locally:
# op run --env-file=apps/cloudflare-metrics/.integration.env -- \
# pnpm --filter @immich-services/cloudflare-metrics run test:integration
#
# The 1Password item and field names below are placeholders. Create them in
# the service account's vault before the CI job can run. The `api_token`
# field must hold a Cloudflare API token with `Account Analytics:Read` and
# `Zone:Read` for the dev account; `account_id` holds that account's ID.
CLOUDFLARE_API_TOKEN="op://services-cf-workers-dev/cloudflare-metrics-integration/api_token"
CLOUDFLARE_ACCOUNT_ID="op://services-cf-workers-dev/cloudflare-metrics-integration/account_id"
190 changes: 190 additions & 0 deletions apps/cloudflare-metrics/README.md
@@ -0,0 +1,190 @@
# cloudflare-metrics

Cloudflare Worker that pulls analytics from the Cloudflare GraphQL API every minute, enriches them with resource names from the REST API, and writes the result to VictoriaMetrics in InfluxDB line protocol.

## Architecture

```
Cloudflare GraphQL Analytics API      ──► CloudflareGraphQLClient
Cloudflare REST API (D1/queues/zones) ──► CloudflareRestClient
CloudflareMetricsCollector
  ├─ ResourceCacheService  (id → name lookups)
  ├─ emit.ts               (row → Metric translation)
  └─ graphql-builders.ts   (query construction)
InfluxMetricsProvider ──► VictoriaMetrics /write
```
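
To make the last hop of the diagram concrete, here is a minimal sketch of rendering one point as InfluxDB line protocol. `Point`, `toLine`, and the escape helpers are hypothetical names chosen for illustration; the worker's own encoder may be structured differently. The sketch assumes numeric fields only, nanosecond timestamps, and that non-finite values are dropped because line protocol cannot represent them.

```typescript
// Illustrative sketch only: Point, toLine, and the escape helpers are hypothetical names.
interface Point {
  measurement: string;
  tags: Record<string, string>;
  fields: Record<string, number>;
  timestampMs: number;
}

// Line protocol escapes commas and spaces in measurement names, and
// commas, equals signs, and spaces in tag keys, tag values, and field keys.
const escapeMeasurement = (s: string): string => s.replace(/[, ]/g, "\\$&");
const escapeKey = (s: string): string => s.replace(/[,= ]/g, "\\$&");

function toLine(p: Point): string | null {
  const tags = Object.entries(p.tags)
    .filter(([, value]) => value !== "") // skip empty tag values
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)) // stable, sorted tag order
    .map(([key, value]) => `,${escapeKey(key)}=${escapeKey(value)}`)
    .join("");

  const fields = Object.entries(p.fields)
    .filter(([, value]) => Number.isFinite(value)) // drop NaN and ±Infinity
    .map(([key, value]) => `${escapeKey(key)}=${value}`)
    .join(",");

  if (fields === "") return null; // a line needs at least one field

  // Milliseconds to nanoseconds by appending six zeros, which avoids
  // exceeding the safe integer range of a JavaScript number.
  const timestampNs = `${Math.trunc(p.timestampMs)}000000`;
  return `${escapeMeasurement(p.measurement)}${tags} ${fields} ${timestampNs}`;
}
```

Returning `null` for a point with no usable fields lets the caller skip it instead of sending a malformed line to `/write`.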

## Collection window

| Setting | Value | Why |
| ------- | ----------- | ------------------------------------------------------------- |
| Cron | `* * * * *` | Every minute |
| Lag | 5 min | Cloudflare's analytics pipeline is 2–5 min behind real time |
| Window | 3 min | Overlaps consecutive ticks so a missed cron doesn't drop data |
| Dedup | Free | VictoriaMetrics dedupes on `(series, timestamp)` |
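
The lag and window settings above can be sketched as a small pure function. `collectionWindow` and its constants are hypothetical names; flooring to the minute and treating the upper bound as exclusive are assumptions, not confirmed details of the worker.

```typescript
// Illustrative sketch only: names are hypothetical, values come from the table above.
const LAG_MS = 5 * 60_000; // how far behind real time the query ends
const WINDOW_MS = 3 * 60_000; // how much data each tick asks for

interface CollectionWindow {
  start: Date; // lower bound of the query (assumed inclusive)
  end: Date; // upper bound of the query (assumed exclusive)
}

/** Derive the query window from the cron's scheduled time, floored to the minute. */
function collectionWindow(scheduledTimeMs: number): CollectionWindow {
  const minute = Math.floor(scheduledTimeMs / 60_000) * 60_000;
  const end = new Date(minute - LAG_MS);
  const start = new Date(end.getTime() - WINDOW_MS);
  return { start, end };
}
```

Under these assumptions a tick scheduled at 12:10 queries 12:02 to 12:05 and the 12:11 tick queries 12:03 to 12:06, so every minute of data is requested by three consecutive ticks. If one tick is missed, its neighbours still cover the same minutes, and the duplicate points are deduplicated downstream.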

## Datasets

78 datasets across account-scope and zone-scope, covering:

- **Workers**: invocations, subrequests, overview, scheduled (client-side aggregated), analytics engine, builds, VPC, placement, workflows
- **D1**: queries (summary + detail with p50/p95/p99), storage
- **R2**: operations, storage, sippy
- **KV**: operations, storage
- **Durable Objects**: invocations, periodic, storage, SQL storage, subrequests
- **Queues**: operations, backlog, consumer concurrency
- **Hyperdrive**: queries (with count), pool sizes
- **HTTP (zone-scope)**: overview, detail (per-zone batched), cache reserve, logpush health
- **Pages Functions**: invocations with CPU/duration p50/p99
- **AI**: inference, gateway (requests/cache/errors/size), search, autoRAG
- **Vectorize**: operations, queries, storage, writes
- **Browser**: rendering API/sessions/time/events, isolation sessions/actions
- **Stream/Video/Calls**: minutes viewed, CMCD, buffer/playback/quality events, live input, realtime kit, calls usage/TURN
- **Images/toMarkdown**: request counts, conversion stats
- **RUM**: pageload, performance (with FCP/page-load p50/p95/p99), web vitals (CLS/FCP/FID/INP/LCP/TTFB averages + p75/p95)
- **Pipelines**: ingestion, delivery, operator, sink
- **Containers, Turnstile**
- **DNS, Email Routing/Sending, DMARC** (zone-scope)
- **API Gateway sessions, Workers Zone invocations/subrequests** (zone-scope)

See `src/datasets.ts` for the full registry. Each dataset declares its GraphQL field, dimensions, aggregation blocks, tag mappings, and field mappings.

### Skipped datasets

| Dataset | Reason |
| ------------------------------------- | -------------------------------------------------------------------------- |
| `firewallEventsAdaptiveGroups` | Plan-gated (Business/Enterprise only) |
| `cdnNetworkAnalyticsAdaptiveGroups` | Plan-gated |
| `zarazTrack/TriggersAdaptiveGroups` | Incompatible filter shape (`datetimeMinute_geq` instead of `datetime_geq`) |
| `cloudchamberMetricsAdaptiveGroups` | Duplicate schema of `containersMetrics` |
| `cacheReserveRequestsAdaptiveGroups` | Plan-gated (zone-scope) |
| `healthCheckEventsAdaptiveGroups` | Plan-gated (zone-scope) |
| `loadBalancingRequestsAdaptiveGroups` | Plan-gated (zone-scope) |
| `nelReportsAdaptiveGroups` | Plan-gated (zone-scope) |
| `pageShieldReportsAdaptiveGroups` | Plan-gated (zone-scope) |
| `waitingRoomAnalyticsAdaptiveGroups` | Plan-gated (zone-scope) |

## Query batching

Cloudflare Workers caps subrequests at 50 per invocation. Account-scope datasets are batched into chunks of 25 using GraphQL aliases. Zone-scope datasets are batched across all zones in a single request per dataset.

| Subrequest source | Cold start | Warm (cached) |
| ------------------------------ | ---------- | -------------------- |
| REST lookups (D1/queues/zones) | 3 | 0 (10-min TTL cache) |
| GraphQL account batches | 3 chunks | 3 chunks |
| GraphQL date-granularity batch | 1 | 1 |
| Zone-scope datasets | ~11 | ~11 |
| Metric flush | 1 | 1 |
| **Total subrequests** | **~19** | **~16** |
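
The alias-based batching described above can be sketched roughly as follows. `DatasetSketch`, `chunk`, and `buildBatchQuery` are hypothetical names, the `limit` value is arbitrary, and the `datetime_lt` upper bound is an assumption; the real builders live in `graphql-builders.ts` and may differ.

```typescript
// Illustrative sketch only: DatasetSketch, chunk, and buildBatchQuery are hypothetical names.
interface DatasetSketch {
  alias: string; // unique GraphQL alias so several datasets can share one response
  field: string; // GraphQL dataset field name
  selection: string; // selection set body, e.g. "sum { requests }"
}

/** Split items into consecutive chunks of at most `size` elements. */
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

/** Build one query that fetches every dataset in a chunk under its own alias. */
function buildBatchQuery(
  accountTag: string,
  startIso: string,
  endIso: string,
  datasets: readonly DatasetSketch[],
): string {
  // Values are inlined as string literals to keep the sketch short;
  // production code would more likely pass them as GraphQL variables.
  const filter = `{ datetime_geq: ${JSON.stringify(startIso)}, datetime_lt: ${JSON.stringify(endIso)} }`;
  const fields = datasets
    .map((d) => `      ${d.alias}: ${d.field}(limit: 1000, filter: ${filter}) { ${d.selection} }`)
    .join("\n");
  return [
    "query {",
    "  viewer {",
    `    accounts(filter: { accountTag: ${JSON.stringify(accountTag)} }) {`,
    fields,
    "    }",
    "  }",
    "}",
  ].join("\n");
}
```

Each chunk costs one subrequest no matter how many datasets it carries, which is what keeps the per-tick total in the table well below the cap.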

## Resource name enrichment

IDs in the analytics API are enriched with human-readable names via REST lookups:

- `database_name` on `cf_d1_*` metrics (from `/accounts/{id}/d1/database`)
- `queue_name` on `cf_queue_*` metrics (from `/accounts/{id}/queues`)
- `zone_name` on `cf_http_*` metrics (from `/zones?account.id={id}` + per-zone fallback for Pages projects)

Module-level caches persist across invocations for as long as the same isolate stays alive (typically 10+ minutes on the paid plan). Entries older than the 10-minute TTL are re-fetched on the next tick. If a re-fetch fails, the worker keeps using the stale cached names.
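
A TTL cache with stale fallback, as described above, might look like the sketch below. `createNameCache` and its parameters are hypothetical names; the injectable clock exists only to make the behaviour easy to test and is not taken from the worker.

```typescript
// Illustrative sketch only: createNameCache and its parameters are hypothetical names.
type NameMap = ReadonlyMap<string, string>;

function createNameCache(
  load: () => Promise<NameMap>, // e.g. a REST call that lists id → name pairs
  ttlMs: number = 10 * 60_000,
  now: () => number = () => Date.now(),
): () => Promise<NameMap> {
  // State held in the closure; declared at module level, it would live
  // for as long as the isolate does.
  let names: NameMap = new Map<string, string>();
  let fetchedAt = Number.NEGATIVE_INFINITY;

  return async function getNames(): Promise<NameMap> {
    if (now() - fetchedAt < ttlMs) return names; // still fresh: no subrequest
    try {
      names = await load();
      fetchedAt = now();
    } catch {
      // Lookup failed: keep serving the stale names and retry on the next call.
    }
    return names;
  };
}
```

Because `fetchedAt` is only advanced on success, a failed refresh is retried on the very next tick rather than after another full TTL.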

## Self-telemetry

The worker emits its own health metrics alongside the Cloudflare data:

- `cloudflare_metrics_cron_summary` — datasets/points/errors per tick
- `cloudflare_metrics_cron_error{reason}` — early-exit errors
- `cloudflare_metrics_collector_dataset{dataset,status}` — per-dataset rows/points/duration/errors
- `cloudflare_metrics_resource_lookup{resource,status}` — REST lookup outcomes
- `cloudflare_metrics_graphql_client` — requests/error_responses per tick
- `cloudflare_metrics_flush{status}` — bytes/duration/pending buffers (from previous tick)
- `cloudflare_metrics_http_response{method,path,status}` — HTTP handler counts
- `cloudflare_metrics_handle_request` — handler duration/invocation
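
The "pending buffers" reported by the flush metric imply a bounded retry buffer. A minimal sketch of that idea follows; `createFlusher`, the `send` callback, and the cap of 5 are all assumptions for illustration and are not taken from `flush-state.ts`.

```typescript
// Illustrative sketch only: createFlusher, Send, and MAX_PENDING are hypothetical.
const MAX_PENDING = 5; // assumed cap, not the worker's actual value

type Send = (body: string) => Promise<void>;

interface FlushResult {
  sent: number; // bodies delivered during this flush
  stashed: number; // bodies kept for the next tick
}

function createFlusher(send: Send, maxPending: number = MAX_PENDING) {
  let pending: string[] = []; // bodies that failed on earlier ticks, oldest first

  async function flush(body: string): Promise<FlushResult> {
    const queue = [...pending, body];
    const stash: string[] = [];
    let sent = 0;
    let failed = false;

    for (const item of queue) {
      if (failed) {
        stash.push(item); // after the first failure, keep the rest in order
        continue;
      }
      try {
        await send(item);
        sent += 1;
      } catch {
        failed = true;
        stash.push(item);
      }
    }

    // Keep only the newest `maxPending` bodies so memory stays bounded.
    pending = stash.slice(Math.max(0, stash.length - maxPending));
    return { sent, stashed: pending.length };
  }

  return { flush };
}
```

Dropping the oldest bodies first trades completeness for a hard memory bound, which is why a growing `stashed` count is worth alerting on.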

## Dashboards

20 Grafana dashboards managed via Terraform, in `deployment/.../dashboards/`:

| Dashboard | Template variables |
| ---------------------------------------------- | ------------------------------- |
| Account Overview | — |
| Workers | `$script_name`, `$status` |
| Workers Scheduled | `$script_name`, `$cron` |
| D1 | `$database_name` |
| R2 | `$bucket_name` |
| KV | `$namespace_id` |
| Durable Objects | `$script_name`, `$namespace_id` |
| Queues | `$queue_name` |
| Hyperdrive | `$config_id` |
| HTTP / Zones | `$zone_name` |
| Pages Functions | `$script_name` |
| AI, Vectorize, Browser, Stream, RUM, Pipelines | — |
| DNS | `$zone_name` |
| Email | `$zone_name` |
| Exporter Health | — |

## Alerts

7 Grafana alert rules in `deployment/.../alerts.tf`, all linked to the exporter-health dashboard:

| Rule | Condition | Severity |
| ---------------------------------- | ---------------------------------- | -------- |
| Collector Not Running | No `cron_summary_datasets` for 10m | 1 |
| Collector Cron Error | `cron_error_count > 0` in 10m | 1 |
| Dataset Errors Sustained | `>10` errors in 15m | 3 |
| Metrics Flush Failing | `>3` flush errors in 15m | 1 |
| Pending Flush Buffer Growing | `>3` stashed bodies for 10m | 3 |
| GraphQL Subrequest Budget Near Cap | `>40` requests/tick for 10m | 3 |
| GraphQL Error Responses Sustained | `>5` error responses in 15m | 3 |

## File structure

```
src/
  index.ts              Entry point (wires handlers)
  handlers/
    http.ts             /health endpoint
    scheduled.ts        Cron handler (collect + flush)
  collector.ts          Orchestration (collectAll → batched account/zone fetches)
  resource-cache.ts     REST resource lookups + module-level caching
  emit.ts               DatasetRow → Metric translation + tag enrichment
  graphql-client.ts     CloudflareGraphQLClient (HTTP transport + chunking)
  graphql-builders.ts   Query construction (pure functions)
  cloudflare-api.ts     CloudflareRestClient (D1/queues/zones REST)
  metrics.ts            CloudflareMetricsRepository facade
  metric.ts             Metric data class
  metric-providers.ts   InfluxMetricsProvider + HeaderMetricsProvider
  flush-state.ts        Retry buffer + last-flush stats (module-level state)
  datasets.ts           78 DatasetQuery definitions
  types.ts              Shared types
  deferred.ts           DeferredRepository (waitUntil helper)
  monitor.ts            monitorAsyncFunction (duration/invocation wrapper)
```

## Development

```bash
pnpm run dev # wrangler dev (local)
pnpm run test # 71 unit tests
pnpm run test:integration # live API tests (needs CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID)
pnpm run check # tsc --noEmit
pnpm run build # wrangler deploy --dry-run
```

Local `.dev.vars`:

```
CLOUDFLARE_API_TOKEN=...
CLOUDFLARE_ACCOUNT_ID=...
VMETRICS_API_TOKEN=...
ENVIRONMENT=dev
```

## Infrastructure

- **Worker**: `apps/cloudflare-metrics/` — TypeScript, Wrangler, every-minute cron
- **Terraform**: `deployment/modules/cloudflare/workers/cloudflare-metrics/` — worker, version, deployment, cron trigger, dashboards, alerts, and a scoped API token with Account Analytics Read + D1 Read + Queues Read + Zone Read
- **CI**: unit tests on every PR, path-filtered integration tests (gated on 1Password credentials), build + deploy-dev on PR, deploy-prod on merge to main
17 changes: 17 additions & 0 deletions apps/cloudflare-metrics/package.json
@@ -0,0 +1,17 @@
{
  "name": "@immich-services/cloudflare-metrics",
  "version": "1.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "wrangler dev",
    "build": "wrangler deploy --dry-run --outdir ../../dist/cloudflare-metrics",
    "tail": "wrangler tail",
    "test": "vitest run --exclude 'src/integration.test.ts'",
    "test:integration": "vitest run --config vitest.integration.config.ts",
    "check": "tsc --noEmit"
  },
  "dependencies": {
    "@influxdata/influxdb-client": "^1.34.0"
  }
}