
BE-519: Set up OpenTelemetry across hash-api and Temporal workers #8681

Merged
TimDiekmann merged 17 commits into main from t/be-519-otel-workers on May 11, 2026

Conversation

@TimDiekmann
Member

🌟 What is the purpose of this PR?

Wires OpenTelemetry through hash-api, hash-ai-worker-ts, and hash-integration-worker so traces, logs, and metrics flow into the existing OTLP collector. Caller-side trace context (Express HTTP spans, Rust temporal-client::start_workflow) is propagated into Temporal workflow start headers so the worker-side RunWorkflow and RunActivity spans chain off the caller's trace in Tempo.

Outbound fetch calls (OpenAI, Anthropic, Linear, …) are now traced via @opentelemetry/instrumentation-undici with a shared peer.service mapping so Tempo's service_graphs processor renders external dependencies as edges in the service map.

🔗 Related links

  • BE-519
  • BE-520 — follow-up for dropping the v1↔v2 SDK adapter once @temporalio/interceptors-opentelemetry-v2 is released
  • SRE-676 — follow-up for Temporal Server self-instrumentation

🚫 Blocked by

  • none

🔍 What does this change?

New shared OTEL module (@local/hash-backend-utils/opentelemetry):

  • registerOpenTelemetry({ endpoint, serviceName, instrumentations }) registers a global NodeTracerProvider / LoggerProvider / MeterProvider against an OTLP/gRPC collector.
  • createUndiciInstrumentation() and httpRequestSpanNameHook are shared across all three services.
  • peer.service mapping (discriminated exact/suffix host rules) for OpenAI, Anthropic, Linear, Google Cloud (see the sketch after this list).
  • BatchSpanProcessor / BatchLogRecordProcessor keep export off the request path. Shutdown surfaces Promise.allSettled rejections and applies a 2-second per-provider timeout.
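
A minimal sketch of the host-rule shape behind that peer.service mapping. The discriminated exact/suffix rules and the resolvePeerService name follow the description above; the specific rule entries are illustrative, not the shipped list.

```ts
// Discriminated host rules: exact hosts and dot-anchored suffixes.
type HostRule =
  | { kind: "exact"; host: string; service: string }
  | { kind: "suffix"; suffix: string; service: string };

// Illustrative entries only; the real mapping lives in
// @local/hash-backend-utils/opentelemetry.
const hostRules: HostRule[] = [
  { kind: "exact", host: "api.openai.com", service: "openai" },
  { kind: "exact", host: "api.anthropic.com", service: "anthropic" },
  { kind: "suffix", suffix: ".linear.app", service: "linear" },
  { kind: "suffix", suffix: ".googleapis.com", service: "google-cloud" },
];

export const resolvePeerService = (hostname: string): string | undefined => {
  for (const rule of hostRules) {
    if (rule.kind === "exact" && hostname === rule.host) {
      return rule.service;
    }
    // The leading dot enforces a label boundary, so lookalike domains such
    // as "evil-linear.app" never match ".linear.app".
    if (rule.kind === "suffix" && hostname.endsWith(rule.suffix)) {
      return rule.service;
    }
  }
  return undefined;
};
```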

Temporal trace propagation:

  • Workflow client interceptor (OpenTelemetryWorkflowClientInterceptor) attached in createTemporalClient injects the active context into workflow start headers (see the sketch after this list).
  • Activity-side interceptors (activity factory shape) extract context and stamp trace_id / span_id on activity log lines for Loki ↔ Tempo correlation.
  • Workflow modules registered both in bundleWorkflowCode (production path) AND in Worker.create (dev path) — without the bundle registration, prebuilt bundles ignore the runtime workflowModules config and prod ships zero workflow spans.
  • wrapWorkflowSpanExporter / makeV2WorkflowSink bridges v1-shaped ReadableSpan from @temporalio/interceptors-opentelemetry@1 to our v2 stack (synthesises instrumentationScope and parentSpanContext); marked TODO(BE-520) and quarantined to one helper.
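
Caller-side wiring, roughly. A sketch assuming the v1 @temporalio/interceptors-opentelemetry API and the ClientOptions.interceptors.workflow slot; the real code lives in createTemporalClient.

```ts
import { Client, Connection } from "@temporalio/client";
import { OpenTelemetryWorkflowClientInterceptor } from "@temporalio/interceptors-opentelemetry";

export const createClientSketch = async (address: string): Promise<Client> => {
  const connection = await Connection.connect({ address });
  return new Client({
    connection,
    interceptors: {
      // Injects the active OTEL context into workflow start headers so the
      // worker-side RunWorkflow span chains off the caller's trace.
      workflow: [new OpenTelemetryWorkflowClientInterceptor()],
    },
  });
};
```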

Rust trace context propagation (hash-temporal-client::ai):

  • Rewrites start_ai_workflow to use the low-level WorkflowService::start_workflow_execution so the proto header field is exposed.
  • build_otel_header() injects the active trace context as a _tracer-data payload; emits a once-per-process tracing::warn! when the carrier is empty (a TypeScript sketch of the payload shape follows this list).
  • Caller span annotated with otel.kind = "producer" for the asynchronous fire-and-forget shape.
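
For reference, the TypeScript equivalent of what build_otel_header() produces. A sketch assuming W3C trace-context propagation; the payload-converter detail reflects the v1 interceptor's _tracer-data convention as we understand it.

```ts
import { context, propagation } from "@opentelemetry/api";
import { defaultPayloadConverter } from "@temporalio/common";

export const buildOtelHeaderSketch = () => {
  // propagation.inject writes e.g. { traceparent: "00-<trace>-<span>-01" }.
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);
  if (Object.keys(carrier).length === 0) {
    // Mirrors the Rust once-per-process warning: an empty carrier means no
    // propagator is configured and workflows would start parent-less.
    console.warn("otel carrier empty; trace context will not propagate");
  }
  return { "_tracer-data": defaultPayloadConverter.toPayload(carrier) };
};
```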

Worker bootstrap helper (@local/hash-backend-utils/temporal/worker-bootstrap):

  • runWorker(opts) collapses the previously-duplicated bootstrap logic in both worker main.ts files. Sentry.init stays per-worker because of ESM import-ordering.
  • WorkflowSource discriminated union ({ kind: "bundle" } | { kind: "path" }) replaces Partial<WorkerOptions>; ExtraWorkerOptions = Omit<WorkerOptions, …> excludes helper-owned and source-owned keys from the escape hatch (see the sketch after this list).
  • SIGTERM/SIGINT handler awaits worker.run() so in-flight activities drain before OTEL providers shut down. Exit code 0 on graceful signals.
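
The bootstrap contract, sketched. Names follow this PR; the fields, elided setup, and shutdown ordering are simplified, not the shipped implementation.

```ts
import { Worker } from "@temporalio/worker";
import type { WorkerOptions } from "@temporalio/worker";

type WorkflowSource =
  | { kind: "bundle"; workflowBundle: NonNullable<WorkerOptions["workflowBundle"]> }
  | { kind: "path"; workflowsPath: string };

interface RunWorkerOptions {
  taskQueue: string;
  workflowSource: WorkflowSource;
}

export const runWorkerSketch = async (opts: RunWorkerOptions) => {
  let shuttingDown = false;
  let worker: Worker | undefined;

  // Registered before any async startup so a SIGTERM during the
  // connect/bundle/create window still exits via the cleanup path instead
  // of hitting Node's default handler.
  const onSignal = () => {
    shuttingDown = true;
    worker?.shutdown();
  };
  process.on("SIGTERM", onSignal);
  process.on("SIGINT", onSignal);

  worker = await Worker.create({
    taskQueue: opts.taskQueue,
    ...(opts.workflowSource.kind === "bundle"
      ? { workflowBundle: opts.workflowSource.workflowBundle }
      : { workflowsPath: opts.workflowSource.workflowsPath }),
  });

  if (!shuttingDown) {
    // Resolves only after in-flight activities drain following shutdown(),
    // so the OTEL flush afterwards never races live spans.
    await worker.run();
  }
  // ... flush OTEL providers here, then exit 0 on graceful signals ...
};
```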

Sentry coexistence:

  • All three services pass skipOpenTelemetrySetup: !!otelSetup so Sentry shares our NodeTracerProvider (sketched below). If Sentry registered its own, it would shadow the global provider and the OTEL workflow client interceptor would silently stop propagating trace context.
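
What that looks like at each service's entrypoint. skipOpenTelemetrySetup is a real @sentry/node option; otelSetup here stands in for whatever our bootstrap returned, undefined when HASH_OTLP_ENDPOINT is unset.

```ts
import * as Sentry from "@sentry/node";

// Stand-in for the value exported by instrument.mjs; undefined when
// HASH_OTLP_ENDPOINT is not configured.
declare const otelSetup: { shutdown: () => Promise<void> } | undefined;

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // When our NodeTracerProvider is registered, Sentry must not install its
  // own: a second global provider would shadow ours and the workflow client
  // interceptor would silently stop propagating trace context.
  skipOpenTelemetrySetup: !!otelSetup,
});
```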

Metrics scrape:

  • apps/hash-external-services/opentelemetry-collector/otel-collector-config.yaml now scrapes Temporal Server's prometheus endpoint (temporal:8000) with a temporal_ prefix relabel.

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • does not modify any publishable blocks or libraries, or modifications do not need publishing

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph

⚠️ Known issues

  • The v1↔v2 SDK adapter is a workaround until @temporalio/interceptors-opentelemetry-v2 lands — tracked in BE-520.
  • Caller→workflow tracing uses parent-child semantics (what @temporalio/interceptors-opentelemetry produces). The OTEL spec for async messaging recommends PRODUCER/CONSUMER + Span Links, but neither v1 nor v2 of the upstream package implement that. Doesn't affect our metrics today: workflow spans are INTERNAL kind and start after the caller span ends, so Tempo's service-graphs processor doesn't inflate edge latencies. Captured as a follow-up note in BE-520.
  • HASH_OTLP_ENDPOINT must be set on ALL backend services or NONE — a partial config silently breaks caller↔worker context propagation. No runtime check (deferred — infra-level concern).

🐾 Next steps

  • BE-520: switch to @temporalio/interceptors-opentelemetry-v2 once released, drop the workflow span adapter.
  • SRE-676: instrument Temporal Server itself.
  • BE-518: broader logger cleanup sweep (separate PR).

🛡 What tests cover this?

23 unit tests in libs/@local/hash-backend-utils/:

  • src/opentelemetry.test.ts (12 tests) — resolvePeerService (exact/suffix matching, lookalike-domain rejection, suffix-dot-boundary), httpRequestSpanNameHook (incoming, outgoing, Express-wrapped, query-stripped, missing-method paths).
  • src/temporal/workflow-span-adapter.test.ts (11 tests) — v1→v2 normalisation (instrumentationLibrary → instrumentationScope, parentSpanId → parentSpanContext), already-v2 passthrough identity check, attribute / event preservation through the rewrite path, mixed batch handling, the parentSpanId === "" edge case, existing-parentSpanContext precedence, result-callback propagation, shutdown / forceFlush delegation.

Drain semantics on SIGTERM are verified manually — TestWorkflowEnvironment integration tests deferred.

❓ How to test this?

  1. Set HASH_OTLP_ENDPOINT=http://localhost:4317 and start the OTEL collector + Tempo + Mimir + Loki via the apps/hash-external-services docker-compose.
  2. yarn dev:backend + yarn start:worker:ai.
  3. From the frontend, create an entity (triggers the updateEntityEmbeddings workflow).
  4. Grafana → Tempo → search by service Node API. Expected trace shape:
    • Node API POST /graphql (~150ms) at the root.
    • start_ai_workflow (~12ms) as child, marked Producer kind.
    • AI Worker RunWorkflow:updateEntityEmbeddings (~2.5s) as child of the producer.
    • RunActivity:createAndStoreEntityEmbeddingsActivity nested under it.
    • OpenAI POST /v1/embeddings as external-service span (undici instrumentation).
  5. Service map should show nodes for Node API, Graph API, AI Worker, OpenAI, Postgres, with an AI Worker → OpenAI edge.
  6. Drain check: kill -TERM <ai-worker-pid> while a workflow is mid-flight. Logs show Received SIGTERM, exiting…, then activity completion logs, then OTEL flush, then exit 0.

📹 Demo

Tempo screenshots from local testing — see Linear BE-519 for image attachments.

Two rounds of multi-agent review surfaced 5 Critical and 8 Important
items, plus a Critical regression introduced by the first cleanup
pass. This commit addresses all of them.

Production telemetry fixes:
- Bundle scripts now register the OTEL workflow interceptor as a
  workflow-bundle module; without this, prod workers shipped no
  workflow spans because `Worker.create.interceptors.workflowModules`
  is ignored when a prebuilt bundle is loaded.
- `BatchSpanProcessor` / `BatchLogRecordProcessor` replace the
  Simple variants so OTLP exports stay off the request path.
- `registerOpenTelemetry` shutdown surfaces `Promise.allSettled`
  rejections to stderr and applies a 2-second per-provider timeout.
- `instrument.mjs` (and the worker shims) wrap the bootstrap in
  try/catch so a misconfigured collector falls back to no telemetry
  rather than crashing the process before the logger is wired.

Worker shutdown:
- Workers now call `worker.shutdown()` and await `worker.run()` from
  the SIGTERM handler so in-flight activities drain cleanly. The
  previous draft called `process.exit` before the drain completed.
- Exit code is 0 on graceful signals, 1 only on actual failure.
- `OpenTelemetryActivityOutboundInterceptor` is wired so log lines
  carry `trace_id` / `span_id` for Loki ↔ Tempo correlation.

Refactor:
- New `runWorker(opts)` helper in `@local/hash-backend-utils/temporal/worker-bootstrap`
  collapses the previously-duplicated bootstrap logic in both worker
  `main.ts` files. Sentry init stays per-worker (ESM ordering).
- `WorkflowSource` discriminated union (`{ kind: "bundle" } | { kind: "path" }`)
  replaces `Partial<WorkerOptions>` for workflow-source config, and
  `ExtraWorkerOptions = Omit<WorkerOptions, ...>` excludes helper-owned
  keys from the escape hatch.
- `makeV2WorkflowSink(setup)` quarantines the `as unknown as Parameters<...>`
  casts to a single helper, ready to drop with BE-520.
- `httpRequestSpanNameHook` is shared across the three Node bootstraps.
- The `getActiveOpenTelemetrySetup` singleton is gone; `instrument.mjs`
  exports `otelSetup` directly.
- Rust `build_otel_header` emits a once-per-process warning when the
  trace-context carrier is empty, surfacing missing-propagator
  configuration instead of silently producing parent-less workflows.

Tests (23 unit tests covering the v1↔v2 normalisation, peer.service
resolution, and HTTP hook behaviour):
- `wrapWorkflowSpanExporter` / `normaliseSpan` against v1-shaped
  spans, mixed batches, the existing-`parentSpanContext` precedence
  case, the `parentSpanId === ""` edge case, and result-callback
  propagation.
- `resolvePeerService` exact / suffix matching and lookalike-domain
  rejection.
- `httpRequestSpanNameHook` for incoming, outgoing, Express-wrapped,
  query-stripped, and missing-method paths.

The drain semantics are verified manually — `TestWorkflowEnvironment`
integration tests would be a follow-up.

@cursor

cursor Bot commented Apr 30, 2026

PR Summary

Medium Risk
Touches cross-cutting telemetry and worker bootstrap/shutdown paths plus Temporal client/workflow interceptors; mistakes could break context propagation or alter worker lifecycle, but changes are largely additive and guarded by HASH_OTLP_ENDPOINT.

Overview
Wires OpenTelemetry (traces + logs + metrics) across hash-api, the AI worker, and the integration worker via a new shared @local/hash-backend-utils/opentelemetry module (HTTP/Express/GraphQL/gRPC/undici instrumentations, peer-service labeling, batched exporters, and graceful shutdown hooks).

Adds end-to-end trace context propagation into Temporal by attaching the OTEL workflow client interceptor in createTemporalClient, bundling/activating workflow-side OTEL interceptors, and exporting workflow-sandbox spans through a v1→v2 span adapter so workflow/activity spans parent off caller requests; both workers are refactored to use a new shared runWorker bootstrap (health check server, signal handling/drain, Sentry+OTEL coexistence).

Updates the Rust hash-temporal-client AI workflow starter to call the low-level Temporal start API so it can inject OTEL trace headers, and extends infra to scrape Temporal server Prometheus metrics (with a temporal_ prefix) in the OTEL collector; also tweaks proxy logging to preserve message bodies for OTLP logs.

Reviewed by Cursor Bugbot for commit 98041ff.

@github-actions Bot added the area/deps, area/apps > hash*, area/apps > hash-api, area/libs, type/eng > backend, and area/apps labels on Apr 30, 2026
@augmentcode

augmentcode Bot commented Apr 30, 2026

🤖 Augment PR Summary

Summary: This PR standardizes OpenTelemetry across the Node API and Temporal workers, and propagates trace context into Temporal workflow starts so caller → workflow/activity spans join into one trace.

Changes:

  • Introduced a shared @local/hash-backend-utils/opentelemetry module to register global trace/log/metric providers, undici fetch instrumentation, consistent HTTP span naming, and peer.service host→service mapping.
  • Updated hash-api to use the shared OTEL bootstrap in instrument.mjs, flush OTEL on graceful shutdown, and improve proxy log forwarding for structured OTLP log bodies.
  • Added OTEL bootstrap entrypoints for hash-ai-worker-ts and hash-integration-worker (imported first) and included the OTEL workflow interceptor in bundled workflow code.
  • Added a shared Temporal worker bootstrap (runWorker) to centralize Runtime telemetry setup, connection creation, health checks, sinks/interceptors, and SIGTERM/SIGINT draining + OTEL flush.
  • Enabled Temporal workflow-start trace propagation in the TypeScript client via OpenTelemetryWorkflowClientInterceptor.
  • Updated the Rust temporal client to start workflows via the low-level API so it can inject the _tracer-data header payload.
  • Extended the local OTEL collector to scrape Temporal server Prometheus metrics and updated docker-compose to expose the Temporal metrics endpoint.
  • Added unit tests for OTEL helpers (span naming / peer.service resolution) and the Temporal workflow span v1→v2 adapter.

Technical Notes: Uses batch processors to keep export off the request path and includes explicit shutdown flushing with per-provider timeouts; Temporal workflow spans are bridged from the v1 interceptor shape to the v2 OTLP exporter shape.


@augmentcode Bot left a comment

Review completed. 2 suggestions posted.

Comment thread libs/@local/hash-backend-utils/src/opentelemetry.ts Outdated
Comment thread libs/@local/hash-backend-utils/src/temporal/worker-bootstrap.ts Outdated
Comment thread libs/@local/hash-backend-utils/src/opentelemetry.ts Fixed
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 26.31579% with 154 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.03%. Comparing base (eeba8a9) to head (98041ff).
⚠️ Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
...ash-backend-utils/src/temporal/worker-bootstrap.ts 0.00% 86 Missing ⚠️
...ibs/@local/hash-backend-utils/src/opentelemetry.ts 42.52% 50 Missing ⚠️
apps/hash-api/src/instrument.mjs 0.00% 11 Missing ⚠️
libs/@local/hash-backend-utils/src/temporal.ts 0.00% 3 Missing ⚠️
...ackend-utils/src/temporal/workflow-span-adapter.ts 90.00% 1 Missing and 1 partial ⚠️
apps/hash-api/src/integrations/linear/webhook.ts 0.00% 1 Missing ⚠️
...c/temporal/interceptors/workflows/opentelemetry.ts 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8681      +/-   ##
==========================================
- Coverage   62.08%   62.03%   -0.05%     
==========================================
  Files        1341     1345       +4     
  Lines      135072   135263     +191     
  Branches     5744     5782      +38     
==========================================
+ Hits        83854    83908      +54     
- Misses      50310    50446     +136     
- Partials      908      909       +1     
Flag Coverage Δ
apps.hash-ai-worker-ts 1.41% <ø> (ø)
apps.hash-api 0.00% <0.00%> (ø)
local.hash-backend-utils 2.81% <27.91%> (+2.81%) ⬆️


@codspeed-hq

codspeed-hq Bot commented Apr 30, 2026

Merging this PR will not alter performance

✅ 80 untouched benchmarks


Comparing t/be-519-otel-workers (98041ff) with main (891f36f)


Comment thread libs/@local/temporal-client/src/ai.rs
- Fix tsc failure in @tests/hash-backend-integration: setup-opentelemetry.ts
  pointed at the removed @apps/hash-api/src/graphql/opentelemetry module;
  migrate it to @local/hash-backend-utils/opentelemetry with the new
  options-object signature.
- Restore the workflow-start identity field that the high-level
  WorkflowClientTrait::start_workflow auto-populated; the low-level
  StartWorkflowExecutionRequest defaulted it to "" so Temporal Server
  could not attribute starts to a client.
- Use URL.hostname (not URL.host) when resolving peer.service so
  outbound origins like https://api.openai.com:443/ still match the
  exact-host rules.
- Fix the worker-bootstrap serviceName doc-comment: it claimed the value
  is also used as Temporal worker identity, but Worker.create keeps the
  SDK default (pid@hostname). Document the actual usage (service.name
  in the OTEL resource).
- Move the OTEL shutdown error logging into the per-target try/catch so
  the label is captured at execution time instead of indexed back out
  of the targets array.
@github-actions Bot added the area/tests and area/tests > integration labels on Apr 30, 2026
Comment thread libs/@local/temporal-client/src/ai.rs Fixed
Comment thread libs/@local/temporal-client/src/ai.rs Fixed
Comment thread apps/hash-api/src/instrument.mjs Outdated
Comment thread libs/@local/hash-backend-utils/src/temporal/worker-bootstrap.ts
@graphite-app Bot requested a review from a team April 30, 2026 12:01
ESLint's no-unnecessary-condition narrows shuttingDown to its initial
false here because the only mutation lives inside the SIGINT/SIGTERM
handler closure, which TypeScript's control-flow analysis does not track
from the surrounding scope (distilled in the sketch below). Same shape
for the rethrow: workerError holds whatever the SDK threw (typed
unknown), so only-throw-error fires on the rethrow even though
preserving the original value is the correct behaviour.
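
Distilled, the pattern looks like this (illustrative, not the shipped code):

```ts
const startupSketch = async () => {
  let shuttingDown = false;
  process.on("SIGTERM", () => {
    shuttingDown = true; // the only mutation, inside the handler closure
  });

  await new Promise((resolve) => setTimeout(resolve, 1_000)); // startup work

  // TypeScript's control-flow analysis still sees the initial `false` here
  // because it does not model the handler firing, so the lint rule reports
  // an always-false condition even though the check is load-bearing.
  // eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
  if (shuttingDown) {
    return;
  }
};
```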
Comment thread libs/@local/hash-backend-utils/src/temporal/worker-bootstrap.ts Outdated
Comment thread libs/@local/hash-backend-utils/src/temporal.ts Outdated
The helper imports DefaultLogger from @temporalio/worker, which bundles
the native Rust core bindings. Keeping it in temporal.ts pulled the
worker package into every consumer of createTemporalClient — including
hash-api, where it has no business being.

worker-bootstrap.ts is the only caller, so the helper moves there and
becomes file-private. temporal.ts is back to a pure @temporalio/client
surface.
- worker-bootstrap.ts: drop the dead try/catch around httpServer.close();
  http.Server.close() reports failures via the optional callback, not
  synchronously. Pass an error-logging callback instead so close failures
  actually surface (sketched after this list).
- opentelemetry.ts: tighten the host vs hostname comment. The example
  api.openai.com:443 was wrong — URL strips the default :443 for https.
  Use collector:4318 instead so the rationale matches what the URL spec
  does.
- hash-api/index.ts: call out the LIFO cleanup order explicitly. Without
  the GracefulShutdown.reverse() reference, a future maintainer reading
  "Flush OpenTelemetry last" alongside addCleanup() calls below would
  reasonably conclude the comment is stale and move the registration.
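
A sketch of the callback form the first bullet describes (logger call illustrative):

```ts
import type { Server } from "node:http";

const closeHealthCheckServer = (httpServer: Server) => {
  // http.Server.close() never throws synchronously for a failed close; the
  // error, if any, arrives via this callback.
  httpServer.close((error) => {
    if (error) {
      console.error(`failed to close health-check server: ${error.message}`);
    }
  });
};
```
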
A SIGTERM during the worker startup window — NativeConnection.connect,
workflow bundle compile, Worker.create — would hit the Node default
handler and terminate the process without flushing OTEL. K8s pod
evictions during startup are a real edge case in rolling deploys.

Move SIGINT/SIGTERM registration to the top of runWorker, before any
async startup work. The handler captures `worker` once it's set; if a
signal arrives earlier, it just flips `shuttingDown` and the linear
cleanup path below handles flush+exit. The worker.run() call gates on
`shuttingDown` so we skip running entirely if startup got interrupted.
Locks down the port-derivation contract that prevents the exporter from
tracing its own outbound traffic. The failure mode is exponential span
amplification per export batch, hard to spot from logs — easier to
catch with a unit test.

Covers: configured gRPC port (4317), non-default port from URL (4318),
fallback when the URL has no explicit port, and fallback on a malformed
endpoint (the helper must not throw on every outgoing request).

Reads the hook back via HttpInstrumentation.getConfig() so the test
exercises the same wiring runtime callers see.
Switching from Omit to Pick means future Temporal SDK additions don't
silently leak through and let callers override helper-owned wiring
(activities, connection, taskQueue, sinks, interceptors,
workflowBundle). The only key currently in use is
maxHeartbeatThrottleInterval; add more on demand (see the sketch below).
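
Sketched, with the key list per this commit; anything helper-owned is simply not expressible:

```ts
import type { WorkerOptions } from "@temporalio/worker";

// With Omit, any option added to WorkerOptions in a future SDK release
// would leak into the escape hatch automatically; Pick keeps it opt-in.
export type ExtraWorkerOptions = Pick<
  WorkerOptions,
  "maxHeartbeatThrottleInterval"
>;
```
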
The previous catch swallowed every throw — including the kind that
matters most in dev / CI: typos in instrumentation construction, bad
endpoint URLs, missing peer deps. A regression that disabled telemetry
on every deploy would slip through.

Keep the fall-through-to-undefined behaviour for production (don't
crash a prod service over the collector layer) but rethrow elsewhere
so the bootstrap error surfaces during development.
@CiaranMn previously approved these changes Apr 30, 2026
Comment thread libs/@local/hash-backend-utils/src/temporal/worker-bootstrap.ts
Comment thread libs/@local/hash-backend-utils/src/opentelemetry.ts Outdated
biome auto-removed the now-redundant as HttpInstrumentationConfig cast
(getConfig() already types it), leaving the import unused.

worker-bootstrap.ts:306: same TS-narrowing-vs-closure-mutation pattern
as the other shuttingDown checks — TS sees the initial false but the
SIGTERM handler can flip it before this line runs.
http.RequestOptions.port is string | number | null | undefined; some
callers pass a string and `"4317" === 4317` is false. The filter would
miss the exporter's own outbound traffic and the very feedback loop
this filter exists to prevent would slip through.

Number(options.port) handles both shapes, undefined → NaN → not equal,
which preserves the original "let unrelated traffic through" path.
Test covers the string and undefined cases the original suite missed.
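
The hook, roughly. ignoreOutgoingRequestHook is the real HttpInstrumentationConfig option; the port derivation is simplified here.

```ts
import type { RequestOptions } from "node:http";

const exporterPort = 4317; // derived from HASH_OTLP_ENDPOINT in the real code

// Return true to skip tracing: this keeps the exporter's own OTLP traffic
// out of the span stream and prevents the export → span → export loop.
export const ignoreOutgoingRequestHook = (options: RequestOptions): boolean => {
  // options.port is string | number | null | undefined; Number() handles
  // both "4317" and 4317, and Number(undefined) is NaN, which never equals
  // exporterPort, so unrelated traffic stays traced.
  return Number(options.port) === exporterPort;
};
```
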
@cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 98041ff.

Comment thread libs/@local/hash-backend-utils/src/temporal.ts
@github-actions
Contributor

Benchmark results

@rust/hash-graph-benches – Integrations

Each entry shows parameters → mean ± standard deviation (change vs. main; CodSpeed-flagged regressions and improvements noted).

policy_resolution_large (resolve_policies_for_actor)

  • user: empty, selectivity: high, policies: 2002 → 27.1 ms ± 149 μs (0.162 %)
  • user: empty, selectivity: low, policies: 1 → 3.49 ms ± 17.1 μs (2.19 %)
  • user: empty, selectivity: medium, policies: 1001 → 12.3 ms ± 87.2 μs (0.872 %)
  • user: seeded, selectivity: high, policies: 3314 → 42.2 ms ± 321 μs (-1.586 %)
  • user: seeded, selectivity: low, policies: 1 → 14.8 ms ± 135 μs (+5.15 %, flagged regression)
  • user: seeded, selectivity: medium, policies: 1526 → 23.6 ms ± 168 μs (-0.855 %)
  • user: system, selectivity: high, policies: 2078 → 28.0 ms ± 174 μs (-2.618 %)
  • user: system, selectivity: low, policies: 1 → 3.81 ms ± 16.9 μs (2.59 %)
  • user: system, selectivity: medium, policies: 1033 → 13.3 ms ± 104 μs (1.15 %)

policy_resolution_medium (resolve_policies_for_actor)

  • user: empty, selectivity: high, policies: 102 → 3.77 ms ± 21.6 μs (-0.010 %)
  • user: empty, selectivity: low, policies: 1 → 2.95 ms ± 14.4 μs (0.117 %)
  • user: empty, selectivity: medium, policies: 51 → 3.32 ms ± 16.5 μs (-0.289 %)
  • user: seeded, selectivity: high, policies: 269 → 5.12 ms ± 25.8 μs (-0.337 %)
  • user: seeded, selectivity: low, policies: 1 → 3.51 ms ± 17.9 μs (0.247 %)
  • user: seeded, selectivity: medium, policies: 107 → 4.10 ms ± 29.2 μs (-0.506 %)
  • user: system, selectivity: high, policies: 133 → 4.39 ms ± 25.2 μs (-0.443 %)
  • user: system, selectivity: low, policies: 1 → 3.44 ms ± 23.4 μs (0.288 %)
  • user: system, selectivity: medium, policies: 63 → 4.05 ms ± 24.1 μs (-0.141 %)

policy_resolution_none (resolve_policies_for_actor)

  • user: empty, selectivity: high, policies: 2 → 2.59 ms ± 16.2 μs (0.045 %)
  • user: empty, selectivity: low, policies: 1 → 2.49 ms ± 14.9 μs (-0.755 %)
  • user: empty, selectivity: medium, policies: 1 → 2.56 ms ± 15.7 μs (-0.204 %)
  • user: system, selectivity: high, policies: 8 → 2.81 ms ± 16.6 μs (-1.128 %)
  • user: system, selectivity: low, policies: 1 → 2.61 ms ± 15.0 μs (-1.378 %)
  • user: system, selectivity: medium, policies: 3 → 2.82 ms ± 19.5 μs (-0.374 %)

policy_resolution_small (resolve_policies_for_actor)

  • user: empty, selectivity: high, policies: 52 → 2.99 ms ± 13.2 μs (-0.375 %)
  • user: empty, selectivity: low, policies: 1 → 2.71 ms ± 13.5 μs (-0.565 %)
  • user: empty, selectivity: medium, policies: 25 → 2.97 ms ± 17.7 μs (0.004 %)
  • user: seeded, selectivity: high, policies: 94 → 3.39 ms ± 19.2 μs (-0.347 %)
  • user: seeded, selectivity: low, policies: 1 → 2.92 ms ± 13.6 μs (-0.493 %)
  • user: seeded, selectivity: medium, policies: 26 → 3.25 ms ± 16.9 μs (-0.767 %)
  • user: system, selectivity: high, policies: 66 → 3.30 ms ± 16.3 μs (-0.398 %)
  • user: system, selectivity: low, policies: 1 → 2.91 ms ± 13.7 μs (-0.435 %)
  • user: system, selectivity: medium, policies: 29 → 3.32 ms ± 19.1 μs (0.448 %)

read_scaling_complete (entity_by_id)

  • one_depth, 1 entities → 54.5 ms ± 326 μs (-1.279 %)
  • one_depth, 10 entities → 46.1 ms ± 216 μs (-0.072 %)
  • one_depth, 25 entities → 49.7 ms ± 222 μs (-1.472 %)
  • one_depth, 5 entities → 44.2 ms ± 189 μs (-0.101 %)
  • one_depth, 50 entities → 61.8 ms ± 346 μs (-1.990 %)
  • two_depth, 1 entities → 61.7 ms ± 336 μs (-0.476 %)
  • two_depth, 10 entities → 55.5 ms ± 268 μs (-0.520 %)
  • two_depth, 25 entities → 102 ms ± 546 μs (-1.358 %)
  • two_depth, 5 entities → 46.3 ms ± 222 μs (-0.795 %)
  • two_depth, 50 entities → 291 ms ± 922 μs (+6.12 %, flagged regression)
  • zero_depth, 1 entities → 19.4 ms ± 143 μs (-0.282 %)
  • zero_depth, 10 entities → 20.0 ms ± 125 μs (0.406 %)
  • zero_depth, 25 entities → 20.4 ms ± 135 μs (1.10 %)
  • zero_depth, 5 entities → 19.5 ms ± 97.4 μs (-0.446 %)
  • zero_depth, 50 entities → 25.1 ms ± 119 μs (-0.118 %)

read_scaling_linkless (entity_by_id)

  • 1 entities → 19.4 ms ± 112 μs (-1.717 %)
  • 10 entities → 19.2 ms ± 97.3 μs (-0.648 %)
  • 100 entities → 19.5 ms ± 123 μs (-0.101 %)
  • 1000 entities → 20.1 ms ± 131 μs (-0.222 %)
  • 10000 entities → 26.7 ms ± 204 μs (1.40 %)

representative_read_entity (entity_by_id, by entity type ID)

  • https://blockprotocol.org/@alice/types/entity-type/block/v/1 → 35.6 ms ± 287 μs (3.76 %)
  • https://blockprotocol.org/@alice/types/entity-type/book/v/1 → 34.7 ms ± 318 μs (2.92 %)
  • https://blockprotocol.org/@alice/types/entity-type/building/v/1 → 35.1 ms ± 348 μs (1.06 %)
  • https://blockprotocol.org/@alice/types/entity-type/organization/v/1 → 34.2 ms ± 289 μs (3.04 %)
  • https://blockprotocol.org/@alice/types/entity-type/page/v/2 → 34.4 ms ± 255 μs (0.491 %)
  • https://blockprotocol.org/@alice/types/entity-type/person/v/1 → 34.5 ms ± 295 μs (0.338 %)
  • https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 → 33.9 ms ± 317 μs (-3.988 %)
  • https://blockprotocol.org/@alice/types/entity-type/song/v/1 → 35.0 ms ± 331 μs (0.370 %)
  • https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 → 34.6 ms ± 368 μs (1.31 %)

representative_read_entity_type

  • get_entity_type_by_id, Account ID: bf5a9ef5-dc3b-43cf-a291-6210c0321eba → 8.59 ms ± 43.3 μs (0.305 %)

representative_read_multiple_entities

  • entity_by_property, traversal_paths=0 0 → 92.9 ms ± 639 μs (0.189 %)
  • entity_by_property, traversal_paths=255 1, resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true → 148 ms ± 497 μs (0.466 %)
  • entity_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false → 99.8 ms ± 492 μs (-0.749 %)
  • entity_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true → 111 ms ± 567 μs (-0.400 %)
  • entity_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true → 118 ms ± 472 μs (-0.309 %)
  • entity_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true → 127 ms ± 514 μs (-0.769 %)
  • link_by_source_by_property, traversal_paths=0 0 → 103 ms ± 520 μs (-0.022 %)
  • link_by_source_by_property, traversal_paths=255 1, resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true → 135 ms ± 543 μs (1.05 %)
  • link_by_source_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false → 109 ms ± 425 μs (-0.846 %)
  • link_by_source_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true → 119 ms ± 484 μs (-0.435 %)
  • link_by_source_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true → 121 ms ± 603 μs (-0.282 %)
  • link_by_source_by_property, traversal_paths=2 1, resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true → 121 ms ± 513 μs (-0.177 %)

scenarios

  • full_test, query-limited → 181 ms ± 723 μs (1.26 %)
  • full_test, query-unlimited → 190 ms ± 1.48 ms (-4.952 %)
  • linked_queries, query-limited → 40.7 ms ± 196 μs (-61.787 %, flagged improvement)
  • linked_queries, query-unlimited → 526 ms ± 781 μs (-7.763 %, flagged improvement)

@TimDiekmann requested a review from vilkinsons May 11, 2026 07:38
@TimDiekmann added this pull request to the merge queue May 11, 2026
Merged via the queue into main with commit f8eae3a May 11, 2026
179 of 181 checks passed
@TimDiekmann deleted the t/be-519-otel-workers branch May 11, 2026 08:23