BE-519: Set up OpenTelemetry across hash-api and Temporal workers #8681
TimDiekmann merged 17 commits into main
Conversation
Two rounds of multi-agent review surfaced 5 Critical and 8 Important
items, plus a Critical regression introduced by the first cleanup
pass. This commit addresses all of them.
Production telemetry fixes:
- Bundle scripts now register the OTEL workflow interceptor as a
  workflow-bundle module; without this, prod workers shipped no
  workflow spans because `Worker.create.interceptors.workflowModules`
  is ignored when a prebuilt bundle is loaded (see the sketch after this list).
- `BatchSpanProcessor` / `BatchLogRecordProcessor` replace the
Simple variants so OTLP exports stay off the request path.
- `registerOpenTelemetry` shutdown surfaces `Promise.allSettled`
rejections to stderr and applies a 2-second per-provider timeout.
- `instrument.mjs` (and the worker shims) wrap the bootstrap in
try/catch so a misconfigured collector falls back to no telemetry
rather than crashing the process before the logger is wired.
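
A minimal sketch of the bundle-time interceptor registration from the first bullet above. The file names and the interceptor module path are illustrative, not the repo's actual paths:

```typescript
// bundle-workflows.mts, roughly — not the repo's actual bundle script.
import { writeFile } from "node:fs/promises";
import { fileURLToPath } from "node:url";
import { bundleWorkflowCode } from "@temporalio/worker";

const { code } = await bundleWorkflowCode({
  workflowsPath: fileURLToPath(new URL("./workflows.ts", import.meta.url)),
  // Interceptors must be baked in at bundle time: when Worker.create is handed
  // a prebuilt `workflowBundle`, the runtime `interceptors.workflowModules`
  // option is not applied, so omitting this line ships zero workflow spans.
  workflowInterceptorModules: [
    // Hypothetical local module exporting a workflow-interceptors factory that
    // instantiates the OTEL inbound/outbound workflow interceptors.
    fileURLToPath(new URL("./workflow-otel-interceptors.ts", import.meta.url)),
  ],
});

await writeFile(new URL("./dist/workflow-bundle.js", import.meta.url), code);
```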
Worker shutdown:
- Workers now call `worker.shutdown()` and await `worker.run()` from
  the SIGTERM handler so in-flight activities drain cleanly (see the
  sketch after this list). The previous draft called `process.exit`
  before the drain completed.
- Exit code is 0 on graceful signals, 1 only on actual failure.
- `OpenTelemetryActivityOutboundInterceptor` is wired so log lines
carry `trace_id` / `span_id` for Loki ↔ Tempo correlation.
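
A sketch of the drain-then-exit shape from the first bullet above; the address, task queue, and workflow path are placeholders:

```typescript
// Illustrative worker entrypoint showing the drain-on-SIGTERM shape.
import { fileURLToPath } from "node:url";
import { NativeConnection, Worker } from "@temporalio/worker";

async function main(): Promise<void> {
  const connection = await NativeConnection.connect({ address: "temporal:7233" });
  const worker = await Worker.create({
    connection,
    taskQueue: "example",
    workflowsPath: fileURLToPath(new URL("./workflows.ts", import.meta.url)),
  });

  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.once(signal, () => {
      // Stop polling for new tasks; in-flight activities keep running until
      // they complete, at which point worker.run() below resolves.
      worker.shutdown();
    });
  }

  try {
    await worker.run(); // resolves after the graceful drain
    process.exitCode = 0; // graceful-signal path
  } catch (error) {
    console.error("Worker failed", error);
    process.exitCode = 1; // only an actual failure exits non-zero
  } finally {
    await connection.close();
    // OTEL providers would be flushed / shut down here, after the drain.
  }
}

await main();
```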
Refactor:
- New `runWorker(opts)` helper in `@local/hash-backend-utils/temporal/worker-bootstrap`
collapses the previously-duplicated bootstrap logic in both worker
`main.ts` files. Sentry init stays per-worker (ESM ordering).
- `WorkflowSource` discriminated union (`{ kind: "bundle" | "path" }`)
  replaces `Partial<WorkerOptions>` for workflow-source config, and
  `ExtraWorkerOptions = Omit<WorkerOptions, ...>` excludes helper-owned
  keys from the escape hatch (sketched after this list).
- `makeV2WorkflowSink(setup)` quarantines the `as unknown as Parameters<...>`
casts to a single helper, ready to drop with BE-520.
- `httpRequestSpanNameHook` is shared across the three Node bootstraps.
- The `getActiveOpenTelemetrySetup` singleton is gone; `instrument.mjs`
exports `otelSetup` directly.
- Rust `build_otel_header` emits a once-per-process warning when the
trace-context carrier is empty, surfacing missing-propagator
configuration instead of silently producing parent-less workflows.
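
Roughly the shapes behind the `WorkflowSource` and `ExtraWorkerOptions` items above; the field names and the excluded-key list are illustrative, not the helper's exact definition:

```typescript
import type { WorkerOptions } from "@temporalio/worker";

// Discriminated union: a worker either loads a prebuilt bundle (production)
// or compiles workflows from a path (development), never both.
type WorkflowSource =
  | { kind: "bundle"; workflowBundlePath: string }
  | { kind: "path"; workflowsPath: string };

// Escape hatch for per-worker tuning. The excluded keys are owned by the
// bootstrap helper (the exact list here is illustrative).
type ExtraWorkerOptions = Omit<
  WorkerOptions,
  | "activities"
  | "connection"
  | "taskQueue"
  | "sinks"
  | "interceptors"
  | "workflowBundle"
  | "workflowsPath"
>;

interface RunWorkerOptions {
  serviceName: string;
  taskQueue: string;
  workflowSource: WorkflowSource;
  activities: WorkerOptions["activities"];
  extraWorkerOptions?: ExtraWorkerOptions;
}
```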
Tests (23 unit tests covering the v1↔v2 normalisation, peer.service
resolution, and HTTP hook behaviour):
- `wrapWorkflowSpanExporter` / `normaliseSpan` against v1-shaped
spans, mixed batches, the existing-`parentSpanContext` precedence
case, the `parentSpanId === ""` edge case, and result-callback
propagation.
- `resolvePeerService` exact / suffix matching and lookalike-domain
  rejection (see the test sketch after this section).
- `httpRequestSpanNameHook` for incoming, outgoing, Express-wrapped,
query-stripped, and missing-method paths.
The drain semantics are verified manually — `TestWorkflowEnvironment`
integration tests would be a follow-up.
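
A vitest-style sketch of the `resolvePeerService` cases, assuming a `(hostname: string) => string | undefined` signature; the specific host-to-service mappings shown are illustrative:

```typescript
import { describe, expect, it } from "vitest";

// Assumed signature: exact/suffix host rules mapping hostnames to peer.service.
import { resolvePeerService } from "./opentelemetry";

describe("resolvePeerService", () => {
  it("matches exact hosts", () => {
    expect(resolvePeerService("api.openai.com")).toBe("openai");
  });

  it("matches suffix rules only on a dot boundary", () => {
    expect(resolvePeerService("uploads.linear.app")).toBe("linear");
    // A lookalike domain must not satisfy a suffix rule.
    expect(resolvePeerService("evil-linear.app")).toBeUndefined();
  });

  it("returns undefined for unknown hosts", () => {
    expect(resolvePeerService("example.com")).toBeUndefined();
  });
});
```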
Cursor Bugbot PR Summary (Medium Risk): adds end-to-end trace context propagation into Temporal by attaching the OTEL workflow client interceptor in `createTemporalClient`, and updates the Rust Temporal client to match. Reviewed by Cursor Bugbot for commit 98041ff.
🤖 Augment PR Summary: This PR standardizes OpenTelemetry across the Node API and Temporal workers, and propagates trace context into Temporal workflow starts so caller → workflow/activity spans join into one trace.
Technical Notes: Uses batch processors to keep export off the request path and includes explicit shutdown flushing with per-provider timeouts; Temporal workflow spans are bridged from the v1 interceptor shape to the v2 OTLP exporter shape.
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## main #8681 +/- ##
==========================================
- Coverage 62.08% 62.03% -0.05%
==========================================
Files 1341 1345 +4
Lines 135072 135263 +191
Branches 5744 5782 +38
==========================================
+ Hits 83854 83908 +54
- Misses 50310 50446 +136
- Partials 908 909 +1
- Fix tsc failure in @tests/hash-backend-integration: setup-opentelemetry.ts pointed at the removed @apps/hash-api/src/graphql/opentelemetry module; migrate it to @local/hash-backend-utils/opentelemetry with the new options-object signature.
- Restore the workflow-start identity field that the high-level WorkflowClientTrait::start_workflow auto-populated; the low-level StartWorkflowExecutionRequest defaulted it to "" so Temporal Server could not attribute starts to a client.
- Use URL.hostname (not URL.host) when resolving peer.service so outbound origins like https://api.openai.com:443/ still match the exact-host rules (see the sketch below).
- Fix the worker-bootstrap serviceName doc-comment: it claimed the value is also used as Temporal worker identity, but Worker.create keeps the SDK default (pid@hostname). Document the actual usage (service.name in the OTEL resource).
- Move the OTEL shutdown error logging into the per-target try/catch so the label is captured at execution time instead of indexed back out of the targets array.
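
For the `URL.hostname` item, a quick illustration of the difference. Note that the WHATWG `URL` parser drops a scheme-default port such as `:443`, so the distinction only bites on non-default ports:

```typescript
const openai = new URL("https://api.openai.com:443/v1/embeddings");
console.log(openai.host, openai.hostname); // "api.openai.com" "api.openai.com" (default port dropped)

const collector = new URL("http://collector:4318/v1/traces");
console.log(collector.host);     // "collector:4318" (would never match an exact-host rule)
console.log(collector.hostname); // "collector" (what the peer.service rules compare against)
```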
ESLint's no-unnecessary-condition flags this check because TS flow analysis narrows shuttingDown to its initial false here: the only mutation lives inside the SIGINT/SIGTERM handler closure, which the flow analysis does not see from the surrounding scope. Same shape for the rethrow: workerError holds whatever the SDK threw (typed unknown), so only-throw-error fires on the rethrow even though preserving the original value is the correct behaviour.
The helper imports DefaultLogger from @temporalio/worker, which bundles the native Rust core bindings. Keeping it in temporal.ts pulled the worker package into every consumer of createTemporalClient — including hash-api, where it has no business being. worker-bootstrap.ts is the only caller, so the helper moves there and becomes file-private. temporal.ts is back to a pure @temporalio/client surface.
- worker-bootstrap.ts: drop the dead try/catch around httpServer.close(); http.Server.close() reports failures via the optional callback, not synchronously. Pass an error-logging callback instead so close failures actually surface (sketched below).
- opentelemetry.ts: tighten the host vs hostname comment. The example api.openai.com:443 was wrong — URL strips the default :443 for https. Use collector:4318 instead so the rationale matches what the URL spec does.
- hash-api/index.ts: call out the LIFO cleanup order explicitly. Without the GracefulShutdown.reverse() reference, a future maintainer reading "Flush OpenTelemetry last" alongside addCleanup() calls below would reasonably conclude the comment is stale and move the registration.
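
A sketch of the callback-based close; the function and message are illustrative:

```typescript
import type { Server } from "node:http";

// http.Server.close() reports failure through its optional callback rather
// than throwing synchronously, so a try/catch around it can never fire.
export function closeHttpServer(httpServer: Server): void {
  httpServer.close((error) => {
    if (error) {
      console.error("Failed to close HTTP server", error);
    }
  });
}
```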
A SIGTERM during the worker startup window — NativeConnection.connect, workflow bundle compile, Worker.create — would hit the Node default handler and terminate the process without flushing OTEL. K8s pod evictions during startup are a real edge case in rolling deploys. Move SIGINT/SIGTERM registration to the top of runWorker, before any async startup work. The handler captures `worker` once it's set; if a signal arrives earlier, it just flips `shuttingDown` and the linear cleanup path below handles flush+exit. The worker.run() call gates on `shuttingDown` so we skip running entirely if startup got interrupted.
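
Roughly that ordering, condensed into a sketch; the connection address, task queue, and workflow path are placeholders:

```typescript
import { fileURLToPath } from "node:url";
import { NativeConnection, Worker } from "@temporalio/worker";

export async function runWorker(): Promise<void> {
  let shuttingDown = false;
  let worker: Worker | undefined;

  // Registered before any async startup work, so a SIGTERM during
  // NativeConnection.connect / bundle compile / Worker.create is not left to
  // Node's default handler (which would exit without flushing OTEL).
  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.once(signal, () => {
      shuttingDown = true;
      worker?.shutdown(); // still undefined if the signal arrives mid-startup
    });
  }

  const connection = await NativeConnection.connect({ address: "temporal:7233" });
  worker = await Worker.create({
    connection,
    taskQueue: "example",
    workflowsPath: fileURLToPath(new URL("./workflows.ts", import.meta.url)),
  });

  // TS flow analysis only sees the initial `false` here (the mutation lives in
  // the signal-handler closure), hence the lint suppressions in the real helper.
  if (!shuttingDown) {
    await worker.run();
  }

  await connection.close();
  // ...flush OTEL providers and set the exit code here...
}
```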
Locks down the port-derivation contract that prevents the exporter from tracing its own outbound traffic. The failure mode is exponential span amplification per export batch, hard to spot from logs — easier to catch with a unit test. Covers: configured gRPC port (4317), non-default port from URL (4318), fallback when the URL has no explicit port, and fallback on a malformed endpoint (the helper must not throw on every outgoing request). Reads the hook back via HttpInstrumentation.getConfig() so the test exercises the same wiring runtime callers see.
Switching from Omit to Pick means future Temporal SDK additions don't silently leak through and let callers override helper-owned wiring (activities, connection, taskQueue, sinks, interceptors, workflowBundle). Only key currently in use is maxHeartbeatThrottleInterval; add more on demand.
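
The Pick-based shape, roughly; only the single key named above is allow-listed today:

```typescript
import type { WorkerOptions } from "@temporalio/worker";

// Allow-list instead of deny-list: keys added by future Temporal SDK versions
// stay blocked until they are deliberately added here.
type ExtraWorkerOptions = Pick<WorkerOptions, "maxHeartbeatThrottleInterval">;
```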
The previous catch swallowed every throw — including the kind that matters most in dev / CI: typos in instrumentation construction, bad endpoint URLs, missing peer deps. A regression that disabled telemetry on every deploy would slip through. Keep the fall-through-to-undefined behaviour for production (don't crash a prod service over the collector layer) but rethrow elsewhere so the bootstrap error surfaces during development.
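
The guarded fall-through might look like this; the environment check and function name are illustrative:

```typescript
// Wrap the OTEL bootstrap: swallow failures only in production, rethrow
// everywhere else so misconfiguration is caught immediately in dev / CI.
export function tryRegisterOpenTelemetry<Setup>(register: () => Setup): Setup | undefined {
  try {
    return register();
  } catch (error) {
    if (process.env.NODE_ENV === "production") {
      // Don't crash a prod service over the collector layer.
      console.error("OpenTelemetry setup failed; continuing without telemetry", error);
      return undefined;
    }
    // Surface bootstrap mistakes (bad endpoint URLs, missing peer deps,
    // instrumentation construction typos) during development.
    throw error;
  }
}
```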
- biome auto-removed the now-redundant `as HttpInstrumentationConfig` cast (getConfig() already types it), leaving the import unused.
- worker-bootstrap.ts:306: same TS-narrowing-vs-closure-mutation pattern as the other shuttingDown checks — TS sees the initial false but the SIGTERM handler can flip it before this line runs.
See the output of git range-diff at https://github.com/hashintel/hash/actions/runs/25175069658
http.RequestOptions.port is string | number | null | undefined; some callers pass a string and `"4317" === 4317` is false. The filter would miss the exporter's own outbound traffic and the very feedback loop this filter exists to prevent would slip through. Number(options.port) handles both shapes, undefined → NaN → not equal, which preserves the original "let unrelated traffic through" path. Test covers the string and undefined cases the original suite missed.
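
A sketch of the coercion, assuming the exporter port has already been derived from the OTLP endpoint elsewhere:

```typescript
import type { RequestOptions } from "node:http";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";

const exporterPort = 4317; // derived from HASH_OTLP_ENDPOINT in the real module

const httpInstrumentation = new HttpInstrumentation({
  ignoreOutgoingRequestHook: (request: RequestOptions): boolean => {
    // request.port may be a string, a number, null, or undefined depending on
    // the caller. Number() handles both string and number; undefined coerces
    // to NaN and null to 0, neither of which equals a real exporter port, so
    // unrelated traffic stays instrumented.
    return Number(request.port) === exporterPort;
  },
});
```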
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Benchmark results

Benchmark groups run in CI (mean timings and flame-graph links were attached to the CI comment and are not reproduced here):
- policy_resolution_medium / policy_resolution_none / policy_resolution_small — `resolve_policies_for_actor` across empty / seeded / system users and low / medium / high selectivity.
- read_scaling_complete / read_scaling_linkless — `entity_by_id` at zero / one / two link depths, from 1 up to 10,000 entities.
- representative_read_entity / representative_read_entity_type / representative_read_multiple_entities — `entity_by_id`, `get_entity_type_by_id`, `entity_by_property`, and `link_by_source_by_property` over the seeded Block Protocol entity types.
- scenarios — `full_test` and `linked_queries`, query-limited and query-unlimited.
🌟 What is the purpose of this PR?
Wires OpenTelemetry through hash-api, hash-ai-worker-ts, and hash-integration-worker so traces, logs, and metrics flow into the existing OTLP collector. Caller-side trace context (Express HTTP spans, Rust `temporal-client::start_workflow`) is propagated into Temporal workflow start headers so the worker-side `RunWorkflow` and `RunActivity` spans chain off the caller's trace in Tempo. Outbound `fetch` calls (OpenAI, Anthropic, Linear, …) are now traced via `@opentelemetry/instrumentation-undici` with a shared `peer.service` mapping so Tempo's `service_graphs` processor renders external dependencies as edges in the service map.
🔗 Related links
- Tracking issue for when `@temporalio/interceptors-opentelemetry-v2` is released
🚫 Blocked by
🔍 What does this change?
New shared OTEL module (`@local/hash-backend-utils/opentelemetry`):
- `registerOpenTelemetry({ endpoint, serviceName, instrumentations })` registers a global `NodeTracerProvider` / `LoggerProvider` / `MeterProvider` against an OTLP/gRPC collector.
- `createUndiciInstrumentation()` and `httpRequestSpanNameHook` shared across all three services.
- `peer.service` mapping (discriminated exact/suffix host rules) for OpenAI, Anthropic, Linear, Google Cloud.
- `BatchSpanProcessor` / `BatchLogRecordProcessor` keep export off the request path. Shutdown surfaces `Promise.allSettled` rejections and applies a 2-second per-provider timeout (see the sketch below).
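
A minimal sketch of that shutdown behaviour, assuming each provider exposes a `shutdown(): Promise<void>` method; the helper names are illustrative:

```typescript
interface ShutdownTarget {
  label: string;
  shutdown: () => Promise<void>;
}

const withTimeout = (promise: Promise<void>, ms: number, label: string): Promise<void> =>
  Promise.race([
    promise,
    new Promise<void>((_resolve, reject) => {
      setTimeout(() => reject(new Error(`${label} shutdown timed out after ${ms}ms`)), ms);
    }),
  ]);

export async function shutdownTelemetry(targets: ShutdownTarget[]): Promise<void> {
  // allSettled is a belt-and-braces guard; each target already catches and
  // logs its own failure, so a hung or failing provider cannot block the rest.
  await Promise.allSettled(
    targets.map(async ({ label, shutdown }) => {
      try {
        await withTimeout(shutdown(), 2_000, label);
      } catch (error) {
        // The label is captured at execution time, per target.
        console.error(`OpenTelemetry shutdown failed for ${label}:`, error);
      }
    }),
  );
}
```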
Temporal trace propagation:
- The workflow client interceptor (`OpenTelemetryWorkflowClientInterceptor`) attached in `createTemporalClient` injects the active context into workflow start headers.
- Activity interceptors (the `activity` factory shape) extract context and stamp `trace_id` / `span_id` on activity log lines for Loki ↔ Tempo correlation.
- The workflow interceptor module is registered in `bundleWorkflowCode` (production path) AND in `Worker.create` (dev path) — without the bundle registration, prebuilt bundles ignore the runtime `workflowModules` config and prod ships zero workflow spans.
- `wrapWorkflowSpanExporter` / `makeV2WorkflowSink` bridges v1-shaped `ReadableSpan` from `@temporalio/interceptors-opentelemetry@1` to our v2 stack (synthesises `instrumentationScope` and `parentSpanContext`); marked `TODO(BE-520)` and quarantined to one helper.
Rust trace context propagation (`hash-temporal-client::ai`):
- Switches `start_ai_workflow` to the low-level `WorkflowService::start_workflow_execution` so the proto `header` field is exposed.
- `build_otel_header()` injects the active trace context as a `_tracer-data` payload; emits a once-per-process `tracing::warn!` when the carrier is empty.
- Spans are marked `otel.kind = "producer"` for the asynchronous fire-and-forget shape.
Worker bootstrap helper (`@local/hash-backend-utils/temporal/worker-bootstrap`):
- `runWorker(opts)` collapses the previously-duplicated bootstrap logic in both worker `main.ts` files (see the sketch below). `Sentry.init` stays per-worker because of ESM import-ordering.
- `WorkflowSource` discriminated union (`{ kind: "bundle" } | { kind: "path" }`) replaces `Partial<WorkerOptions>`; `ExtraWorkerOptions = Omit<WorkerOptions, …>` excludes helper-owned and source-owned keys from the escape hatch.
- The SIGTERM handler awaits `worker.run()` so in-flight activities drain before OTEL providers shut down. Exit code 0 on graceful signals.
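
A rough caller-side shape for `runWorker`; the task queue, bundle path, and option names beyond those mentioned in this PR are illustrative:

```typescript
// hash-ai-worker-ts/src/main.ts, roughly — not the actual file.
import { fileURLToPath } from "node:url";

import { runWorker } from "@local/hash-backend-utils/temporal/worker-bootstrap";

import * as activities from "./activities";

await runWorker({
  serviceName: "hash-ai-worker-ts", // becomes service.name in the OTEL resource
  taskQueue: "ai",                  // illustrative queue name
  workflowSource:
    process.env.NODE_ENV === "production"
      ? { kind: "bundle", workflowBundlePath: "./dist/workflow-bundle.js" }
      : {
          kind: "path",
          workflowsPath: fileURLToPath(new URL("./workflows.ts", import.meta.url)),
        },
  activities,
  extraWorkerOptions: { maxHeartbeatThrottleInterval: "30 seconds" },
});
```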
Sentry coexistence:
- `skipOpenTelemetrySetup: !!otelSetup` so Sentry shares our `NodeTracerProvider`. If Sentry registered its own, it would shadow the global provider and the OTEL workflow client interceptor would silently break trace context propagation.
Metrics scrape:
- `apps/hash-external-services/opentelemetry-collector/otel-collector-config.yaml` now scrapes Temporal Server's Prometheus endpoint (`temporal:8000`) with a `temporal_` prefix relabel.
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
- The workflow span adapter is temporary until `@temporalio/interceptors-opentelemetry-v2` lands — tracked in BE-520.
- Caller → workflow spans use the parent/child shape that `@temporalio/interceptors-opentelemetry` produces. The OTEL spec for async messaging recommends PRODUCER/CONSUMER + Span Links, but neither v1 nor v2 of the upstream package implements that. Doesn't affect our metrics today: workflow spans are `INTERNAL` kind and start after the caller span ends, so Tempo's service-graphs processor doesn't inflate edge latencies. Captured as a follow-up note in BE-520.
- `HASH_OTLP_ENDPOINT` must be set on ALL backend services or NONE — a partial config silently breaks caller↔worker context propagation. No runtime check (deferred — infra-level concern).
🐾 Next steps
- Once `@temporalio/interceptors-opentelemetry-v2` is released, move to it and drop the workflow span adapter.
🛡 What tests cover this?
23 unit tests in `libs/@local/hash-backend-utils/`:
- `src/opentelemetry.test.ts` (12 tests) — `resolvePeerService` (exact/suffix matching, lookalike-domain rejection, suffix-dot-boundary), `httpRequestSpanNameHook` (incoming, outgoing, Express-wrapped, query-stripped, missing-method paths).
- `src/temporal/workflow-span-adapter.test.ts` (11 tests) — v1→v2 normalisation (instrumentationLibrary → instrumentationScope, parentSpanId → parentSpanContext), already-v2 passthrough identity check, attribute / event preservation through the rewrite path, mixed batch handling, the `parentSpanId === ""` edge case, existing-`parentSpanContext` precedence, result-callback propagation, shutdown / forceFlush delegation.

Drain semantics on SIGTERM are verified manually — `TestWorkflowEnvironment` integration tests deferred.
❓ How to test this?
1. Set `HASH_OTLP_ENDPOINT=http://localhost:4317` and start the OTEL collector + Tempo + Mimir + Loki via the `apps/hash-external-services` docker-compose.
2. Run `yarn dev:backend` + `yarn start:worker:ai`.
3. Trigger a workflow from the Node API (e.g. the `updateEntityEmbeddings` workflow). Expected trace shape:
   - `Node API POST /graphql` (~150ms) at the root.
   - `start_ai_workflow` (~12ms) as child, marked `Producer` kind.
   - `AI Worker RunWorkflow:updateEntityEmbeddings` (~2.5s) as child of the producer.
   - `RunActivity:createAndStoreEntityEmbeddingsActivity` nested under it.
   - `OpenAI POST /v1/embeddings` as an external-service span (undici instrumentation).
4. The service graph shows `Node API`, `Graph API`, `AI Worker`, `OpenAI`, `Postgres`, with an `AI Worker → OpenAI` edge.
5. `kill -TERM <ai-worker-pid>` while a workflow is mid-flight. Logs show `Received SIGTERM, exiting…`, then activity completion logs, then OTEL flush, then exit 0.
📹 Demo
Tempo screenshots from local testing — see Linear BE-519 for image attachments.