
feat(telemetry): MVP support for opt-in telemetry of runner crashes #2126

Open
AndreiCravtov wants to merge 20 commits into main from andrei/telem

Collaborator

@AndreiCravtov AndreiCravtov commented May 28, 2026

Motivation

Runner crashes currently leave useful diagnostics only on the local machine. This adds an opt-in path for collecting raw runner stderr logs so crash reports can be inspected centrally without making telemetry part of the critical execution path.

Changes

  • Added --telemetry, disabled by default.
  • Added EXO_TELEMETRY_API_URL, defaulting to https://telemetry.exolabs.net/.
  • Added TelemetryService and TelemetrySink for bounded, non-blocking telemetry submission.
  • Wired telemetry from Node through Worker into RunnerSupervisor.
  • Changed runner stderr logging to write per-runner timestamped files under:
    runner_log/<instance_id>/<runner_id>/<utc_timestamp>.stderr.log
  • Added upload of runner stderr logs, which works by:
    • reading the completed stderr file
    • skipping zero-byte files
    • hashing the raw file bytes with SHA-256
    • requesting a pre-signed upload URL
    • uploading the raw bytes to that URL
  • Telemetry upload failures are logged as a warning and then swallowed.
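
For reviewers who want the upload steps in one place, here is a minimal sketch of the flow listed above. It is not the code in this PR: `upload_runner_stderr`, the two injected callables, and the `sha256`/`size` payload keys are hypothetical names chosen for illustration, and the real presign request shape may differ.

```python
import hashlib
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

# Hypothetical transport hooks. The real HTTP client and presign payload
# used by the PR are not shown in the description, so they are injected here.
RequestPresignedUrl = Callable[[dict[str, object]], str]
PutBytes = Callable[[str, bytes], None]


def upload_runner_stderr(
    log_path: Path,
    request_presigned_url: RequestPresignedUrl,
    put_bytes: PutBytes,
) -> bool:
    """Upload one completed stderr log. Returns True only if bytes were sent."""
    try:
        # 1. Read the completed stderr file as raw bytes.
        data = log_path.read_bytes()
        # 2. Skip zero-byte files: nothing useful to report.
        if not data:
            return False
        # 3. Hash the raw bytes with SHA-256.
        digest = hashlib.sha256(data).hexdigest()
        # 4. Ask the backend for a pre-signed upload URL (assumed payload keys).
        url = request_presigned_url({"sha256": digest, "size": len(data)})
        # 5. Upload the same raw bytes to that URL.
        put_bytes(url, data)
        return True
    except Exception as exc:
        # Best effort: log a warning and swallow the failure.
        logger.warning("telemetry upload failed for %s: %s", log_path, exc)
        return False
```

Injecting the two callables keeps the sketch free of any particular HTTP library and makes the failure path easy to exercise with fakes.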

Why It Works

Telemetry is isolated behind a bounded channel and best-effort background service, so failures, slow uploads, or backpressure do not block runner supervision or node operation.
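
To illustrate the isolation property, here is a rough sketch of a bounded, non-blocking sink. It uses the standard library `queue` and a thread purely for illustration; `BoundedSink`, `try_submit`, and `drain` are hypothetical names and are not the `TelemetrySink` or `TelemetryService` API from this PR, which may use a different concurrency primitive.

```python
import logging
import queue
import threading
from typing import Callable

logger = logging.getLogger(__name__)


class BoundedSink:
    """Hypothetical stand-in for a telemetry sink that never blocks callers."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def try_submit(self, item: str) -> bool:
        """Enqueue without waiting. On backpressure, drop the item."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, handle: Callable[[str], None], stop: threading.Event) -> None:
        """Background loop: process queued items until `stop` is set."""
        while not stop.is_set():
            try:
                item = self._queue.get(timeout=0.1)
            except queue.Empty:
                continue
            try:
                handle(item)
            except Exception as exc:
                # A failed or slow handler only affects this loop,
                # never the code that called try_submit.
                logger.warning("telemetry handler failed: %s", exc)
```

The point of the design is that the producer side only ever calls `try_submit`, which returns immediately whether the queue has room or not, so a full queue costs a dropped report rather than a stalled supervisor.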

Runner stderr files are written per bound instance/runner with UTC timestamped names, avoiding the previous shared append-only log path. Hashing the final file contents before upload gives the backend a stable object identity and lets it verify that uploaded contents match the requested submission.
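
As an illustration of the per-runner layout, the sketch below builds a path of the form `runner_log/<instance_id>/<runner_id>/<utc_timestamp>.stderr.log`. The helper name `runner_stderr_path`, the `root` parameter, and the exact timestamp format are assumptions; the PR description does not specify how the timestamp is rendered.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def runner_stderr_path(
    root: Path,
    instance_id: str,
    runner_id: str,
    now: Optional[datetime] = None,
) -> Path:
    """Return a per-runner, UTC-timestamped stderr log path under `root`."""
    # Use the supplied time if given (pass a timezone-aware value),
    # otherwise the current time, and normalise to UTC.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Assumed format: sortable and filesystem-safe, with microseconds so
    # quick restarts of the same runner are unlikely to collide.
    stamp = moment.strftime("%Y%m%dT%H%M%S.%fZ")
    return root / "runner_log" / instance_id / runner_id / f"{stamp}.stderr.log"
```

Because each start of a runner gets its own file, the uploader can hash and send a file once it is complete without worrying about another runner appending to it.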

Test Plan

Manual Testing

  • Ran exo with --telemetry against the telemetry API.
  • Forced a runner failure and confirmed a non-empty runner stderr log was submitted through the pre-signed upload flow.

Automated Testing

  • Added unit coverage for runner stderr telemetry upload:
    • SHA-256 presign payload generation
    • raw byte upload to the returned URL
    • upload failure handling without propagating exceptions
  • Updated runner supervisor tests to use isolated per-test log paths and a dummy telemetry service.

Comment thread src/exo/worker/runner/supervisor.py Outdated
Comment thread src/exo/main.py
Comment thread src/exo/shared/telemetry.py Outdated
@AndreiCravtov AndreiCravtov requested a review from Evanev7 May 29, 2026 19:16

2 participants