Skip to content

fix(http): apply per-read socket timeout to bound body-stall hangs#590

Open
nathantournant wants to merge 1 commit into
mainfrom
nathan.tournant/fix-sock-read-timeout
Open

fix(http): apply per-read socket timeout to bound body-stall hangs#590
nathantournant wants to merge 1 commit into
mainfrom
nathan.tournant/fix-sock-read-timeout

Conversation

@nathantournant
Copy link
Copy Markdown
Member

Summary

Adds a sock_read timeout to the aiohttp ClientTimeout used by sync-cli's
HTTP client. Previously, only a bare integer was passed as the timeout
argument; aiohttp wraps this as ClientTimeout(total=<int>). Once response
headers arrive, aiohttp's total timer is effectively disarmed — the
response-body read is unprotected. If the upstream API stalls mid-body for
any length of time, sync-cli waits indefinitely with no surfaceable signal.

This change adds a per-read socket-level deadline so a body-read stall of
more than --http-client-timeout seconds fires asyncio.TimeoutError,
which is caught by the existing per-resource-type handler and surfaced as
a classified failure (failure_class=http_timeout) instead of a silent hang.

Why

Observed: an unbounded GET against a large org's resource list (large
response body with slow server-side delivery) appears to sync-cli as a
silent multi-minute hang. The subprocess produces no stdout during the
body read, so any process running sync-cli as a subprocess (e.g. via
subprocess.Popen + cmd.Wait) cannot observe progress until the body
read either completes or fails.

The fix is strictly additive: total=None is preserved (we explicitly
allow long-but-progressing body reads), and sock_read bounds the
inter-chunk gap. Long successful body reads are unaffected.

Change

# Before
session.get(url, timeout=self.timeout)   # int → ClientTimeout(total=120)

# After — via new _client_timeout() helper
session.get(url, timeout=aiohttp.ClientTimeout(total=None, sock_read=self.timeout))

The helper _client_timeout() is applied to all six HTTP verb call sites
(get, post, put, patch, delete, _post_raw) for consistency.

Two catch chains in resources_handler.py are also tightened from
except TimeoutError to except (asyncio.TimeoutError, TimeoutError).
In Python 3.11+, asyncio.TimeoutError is an alias for TimeoutError
so there is no functional change there; in Python 3.9/3.10 (both in the
supported range >=3.9) the two are distinct classes and the dual catch
ensures the asyncio.TimeoutError raised by aiohttp is handled correctly.

Tests

tests/unit/test_custom_client_timeout.py (new, 12 tests):

  • TestClientTimeoutShape — verifies _client_timeout() returns
    aiohttp.ClientTimeout with total=None and sock_read equal to
    self.timeout for several representative timeout values.
  • TestVerbsUseClientTimeout — mocks the aiohttp session and asserts
    that every HTTP verb (get, post, put, patch, delete) forwards
    a ClientTimeout rather than a bare integer.
  • TestTimeoutPropagation — verifies that both asyncio.TimeoutError
    and TimeoutError raised during a body read propagate through
    request_with_retry to the caller.

tests/unit/test_failure_class.py (one test added):

  • test_asyncio_timeout_error — confirms _sanitize_reason classifies
    asyncio.TimeoutError as failure_class=http_timeout, the same as the
    existing TimeoutError branch.

All 687 unit tests pass. Lint (ruff) clean.

Backwards compatibility

  • total=None (no wall-clock cap on the overall request) — matches the
    previously-observed behaviour for slow-streaming responses, so no regression.
  • sock_read=self.timeout — new bound on per-read gaps. Existing callers
    who expect responses with no data gap longer than --http-client-timeout
    seconds see no behavioural change.

Out of scope

  • Pagination of the listing endpoints (the deeper architectural fix for
    very large response bodies — separate consideration).
  • Heartbeat / progress logging during long body reads (operability follow-up).

Previously every HTTP verb passed an integer timeout to aiohttp, which
wraps it as ClientTimeout(total=<int>). Once response headers arrive,
aiohttp's total timer stops protecting the body read, so a server that
streams a large response slowly can hold the connection open for any
length of time with no error surfaced.

Add _client_timeout() returning ClientTimeout(total=None, sock_read=self.timeout).

- total=None preserves the current behaviour: long-but-progressing body
  reads are not hard-capped, matching what production callers already
  rely on for large resource listings.
- sock_read=self.timeout bounds the inter-chunk gap: if the server stops
  sending bytes for more than self.timeout seconds, asyncio.TimeoutError
  is raised, propagates through request_with_retry uncaught, and is
  caught by _import_get_resources_cb's existing handler where it is
  classified as failure_class=http_timeout.

Also tighten the catch chains in resources_handler.py:
- except (asyncio.TimeoutError, TimeoutError) at the _import_get_resources_cb
  call site (asyncio.TimeoutError is a subclass of TimeoutError in Python
  3.11+ but a distinct class in 3.9/3.10; the dual catch is safe on all
  supported versions).
- isinstance check in _sanitize_reason updated to match.
@nathantournant nathantournant marked this pull request as ready for review June 4, 2026 20:45
@nathantournant nathantournant requested a review from a team as a code owner June 4, 2026 20:45
@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 1 Pipeline job failed

Ensure labels | changelog   View in Datadog   GitHub Actions

See error Missing required workflow permissions. Ensure 'changelog/*' label is added to the pull request.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 16e771e | Docs | Datadog PR Page | Give us feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants