fix(http): apply per-read socket timeout to bound body-stall hangs by nathantournant · Pull Request #590 · DataDog/datadog-sync-cli

nathantournant · 2026-06-04T15:39:30Z

Summary

Adds a sock_read timeout to the aiohttp ClientTimeout used by sync-cli's
HTTP client. Previously, only a bare integer was passed as the timeout
argument; aiohttp wraps this as ClientTimeout(total=<int>). Once response
headers arrive, aiohttp's total timer is effectively disarmed — the
response-body read is unprotected. If the upstream API stalls mid-body for
any length of time, sync-cli waits indefinitely with no surfaceable signal.

This change adds a per-read socket-level deadline so a body-read stall of
more than --http-client-timeout seconds fires asyncio.TimeoutError,
which is caught by the existing per-resource-type handler and surfaced as
a classified failure (failure_class=http_timeout) instead of a silent hang.

Why

Observed: an unbounded GET against a large org's resource list (large
response body with slow server-side delivery) appears to sync-cli as a
silent multi-minute hang. The subprocess produces no stdout during the
body read, so any process running sync-cli as a subprocess (e.g. via
subprocess.Popen + cmd.Wait) cannot observe progress until the body
read either completes or fails.

The fix is strictly additive: total=None is preserved (we explicitly
allow long-but-progressing body reads), and sock_read bounds the
inter-chunk gap. Long successful body reads are unaffected.

Change

# Before
session.get(url, timeout=self.timeout)   # int → ClientTimeout(total=120)

# After — via new _client_timeout() helper
session.get(url, timeout=aiohttp.ClientTimeout(total=None, sock_read=self.timeout))

The helper _client_timeout() is applied to all six HTTP verb call sites
(get, post, put, patch, delete, _post_raw) for consistency.

Two catch chains in resources_handler.py are also tightened from
except TimeoutError to except (asyncio.TimeoutError, TimeoutError).
In Python 3.11+, asyncio.TimeoutError is an alias for TimeoutError
so there is no functional change there; in Python 3.9/3.10 (both in the
supported range >=3.9) the two are distinct classes and the dual catch
ensures the asyncio.TimeoutError raised by aiohttp is handled correctly.

Tests

tests/unit/test_custom_client_timeout.py (new, 12 tests):

TestClientTimeoutShape — verifies _client_timeout() returns
aiohttp.ClientTimeout with total=None and sock_read equal to
self.timeout for several representative timeout values.
TestVerbsUseClientTimeout — mocks the aiohttp session and asserts
that every HTTP verb (get, post, put, patch, delete) forwards
a ClientTimeout rather than a bare integer.
TestTimeoutPropagation — verifies that both asyncio.TimeoutError
and TimeoutError raised during a body read propagate through
request_with_retry to the caller.

tests/unit/test_failure_class.py (one test added):

test_asyncio_timeout_error — confirms _sanitize_reason classifies
asyncio.TimeoutError as failure_class=http_timeout, the same as the
existing TimeoutError branch.

All 687 unit tests pass. Lint (ruff) clean.

Backwards compatibility

total=None (no wall-clock cap on the overall request) — matches the
previously-observed behaviour for slow-streaming responses, so no regression.
sock_read=self.timeout — new bound on per-read gaps. Existing callers
who expect responses with no data gap longer than --http-client-timeout
seconds see no behavioural change.

Out of scope

Pagination of the listing endpoints (the deeper architectural fix for
very large response bodies — separate consideration).
Heartbeat / progress logging during long body reads (operability follow-up).

Previously every HTTP verb passed an integer timeout to aiohttp, which wraps it as ClientTimeout(total=<int>). Once response headers arrive, aiohttp's total timer stops protecting the body read, so a server that streams a large response slowly can hold the connection open for any length of time with no error surfaced. Add _client_timeout() returning ClientTimeout(total=None, sock_read=self.timeout). - total=None preserves the current behaviour: long-but-progressing body reads are not hard-capped, matching what production callers already rely on for large resource listings. - sock_read=self.timeout bounds the inter-chunk gap: if the server stops sending bytes for more than self.timeout seconds, asyncio.TimeoutError is raised, propagates through request_with_retry uncaught, and is caught by _import_get_resources_cb's existing handler where it is classified as failure_class=http_timeout. Also tighten the catch chains in resources_handler.py: - except (asyncio.TimeoutError, TimeoutError) at the _import_get_resources_cb call site (asyncio.TimeoutError is a subclass of TimeoutError in Python 3.11+ but a distinct class in 3.9/3.10; the dual catch is safe on all supported versions). - isinstance check in _sanitize_reason updated to match.

datadog-datadog-prod-us1-2 · 2026-06-04T20:48:09Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 1 Pipeline job failed

Ensure labels | changelog

See error
Missing required workflow permissions. Ensure 'changelog/*' label is added to the pull request.

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 16e771e | Docs | Datadog PR Page | Give us feedback!}

nathantournant marked this pull request as ready for review June 4, 2026 20:45

nathantournant requested a review from a team as a code owner June 4, 2026 20:45

nathantournant added the changelog/Fixed label Jun 4, 2026

michael-richey approved these changes Jun 4, 2026

View reviewed changes

gh-worker-dd-devflow-36fce6 Bot added mergequeue-status: queued mergequeue-status: in_progress and removed mergequeue-status: queued labels Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(http): apply per-read socket timeout to bound body-stall hangs#590

fix(http): apply per-read socket timeout to bound body-stall hangs#590
nathantournant wants to merge 1 commit into
mainfrom
nathan.tournant/fix-sock-read-timeout

nathantournant commented Jun 4, 2026

Uh oh!

datadog-datadog-prod-us1-2 Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nathantournant commented Jun 4, 2026

Summary

Why

Change

Tests

Backwards compatibility

Out of scope

Uh oh!

datadog-datadog-prod-us1-2 Bot commented Jun 4, 2026

⚠️ Warnings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants