fix: handle HTTP 413 by splitting and retrying in OTLP HTTP exporters #5032
```diff
@@ -31,6 +31,7 @@
     Compression,
 )
 from opentelemetry.exporter.otlp.proto.http._common import (
+    _is_payload_too_large,
     _is_retryable,
     _load_session_from_envvar,
 )
@@ -41,6 +42,7 @@
 )
 from opentelemetry.sdk._shared_internal import DuplicateFilter
 from opentelemetry.sdk.environment_variables import (
+    _OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413,
     _OTEL_PYTHON_EXPORTER_OTLP_HTTP_LOGS_CREDENTIAL_PROVIDER,
     OTEL_EXPORTER_OTLP_CERTIFICATE,
     OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE,
@@ -69,6 +71,7 @@
 DEFAULT_LOGS_EXPORT_PATH = "v1/logs"
 DEFAULT_TIMEOUT = 10  # in seconds
 _MAX_RETRYS = 6
+_MAX_BISECTS = 5


 class OTLPLogExporter(LogRecordExporter):
```
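Together, `_MAX_RETRYS` and `_MAX_BISECTS` bound the work a single export can do: each bisect halves the batch and decrements the budget, so one oversized export fans out into at most 2**5 = 32 sub-batches. A standalone sketch of that bound (not the exporter code itself):

```python
_MAX_BISECTS = 5


def max_subbatches(batch_len: int, remaining_bisects: int = _MAX_BISECTS) -> int:
    """Worst-case number of leaf batches the bisecting export can produce."""
    if batch_len <= 1 or remaining_bisects == 0:
        # Cannot split further: exported (or dropped) as a single batch.
        return 1
    mid = batch_len // 2
    return max_subbatches(mid, remaining_bisects - 1) + max_subbatches(
        batch_len - mid, remaining_bisects - 1
    )


print(max_subbatches(1000))  # 32: capped by _MAX_BISECTS
print(max_subbatches(8))     # 8: capped by the batch size itself
```

So even a pathological backend that rejects every payload can trigger at most 32 leaf exports per batch.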
```diff
@@ -183,8 +186,17 @@ def export(
             _logger.warning("Exporter already shutdown, ignoring batch")
             return LogRecordExportResult.FAILURE

-        serialized_data = encode_logs(batch).SerializeToString()
         deadline_sec = time() + self._timeout
+        return self._export_batch(batch, deadline_sec, _MAX_BISECTS)
+
+    def _export_batch(
+        self,
+        batch: Sequence[ReadableLogRecord],
+        deadline_sec: float,
+        remaining_bisects: int,
+    ) -> LogRecordExportResult:
+        serialized_data = encode_logs(batch).SerializeToString()

         for retry_num in range(_MAX_RETRYS):
             # multiplying by a random number between .8 and 1.2 introduces a +/20% jitter to each backoff.
             backoff_seconds = 2**retry_num * random.uniform(0.8, 1.2)
```
```diff
@@ -196,12 +208,24 @@ def export(
                 reason = error
                 retryable = isinstance(error, ConnectionError)
                 status_code = None
+                bisectable = False
             else:
                 reason = resp.reason
                 retryable = _is_retryable(resp)
                 status_code = resp.status_code
+                bisectable = (
+                    _is_payload_too_large(resp)
+                    and len(batch) > 1
+                    and remaining_bisects > 0
+                    and environ.get(
+                        _OTEL_PYTHON_EXPERIMENTAL_OTLP_RETRY_ON_413, ""
+                    )
+                    .strip()
+                    .lower()
+                    == "true"
+                )

-            if not retryable:
+            if not retryable and not bisectable:
                 _logger.error(
                     "Failed to export logs batch code: %s, reason: %s",
                     status_code,
```

**Contributor** (on the `environ.get` lookup): This should ideally be bound to an instance variable in the constructor (e.g. …).
|
```diff
@@ -219,6 +243,34 @@ def export(
                     "max retries or shutdown."
                 )
                 return LogRecordExportResult.FAILURE
+
+            if bisectable:
+                if time() >= deadline_sec or self._shutdown:
+                    _logger.error(
+                        "Payload too large but %s, dropping %d log records",
+                        "shutdown in progress"
+                        if self._shutdown
+                        else "deadline expired",
+                        len(batch),
+                    )
+                    return LogRecordExportResult.FAILURE
+
+                mid = len(batch) // 2
+                _logger.warning(
+                    "Payload too large (%d log records), splitting into two batches",
+                    len(batch),
+                )
+                first = self._export_batch(
+                    list(batch[:mid]),
+                    deadline_sec,
+                    remaining_bisects - 1,
+                )
+                if first != LogRecordExportResult.SUCCESS:
+                    return LogRecordExportResult.FAILURE
+                return self._export_batch(
+                    list(batch[mid:]),
+                    deadline_sec,
+                    remaining_bisects - 1,
+                )
             _logger.warning(
                 "Transient error %s encountered while exporting logs batch, retrying in %.2fs.",
                 reason,
```

**Contributor** (on lines +248 to +256): This shouldn't be necessary anymore because of lines 212-215.
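The control flow the PR adds can be sketched in isolation. This is a simplified illustration, not the exporter code: the hypothetical `send` and `encode` callables stand in for the session POST and protobuf serialization, and status-code checks replace `_is_retryable`/`_is_payload_too_large`:

```python
import random
import time
from typing import Callable, Sequence

_MAX_RETRYS = 6
_MAX_BISECTS = 5


def export_batch(
    batch: Sequence,
    send: Callable[[bytes], int],       # returns an HTTP status code
    encode: Callable[[Sequence], bytes],
    deadline_sec: float,
    remaining_bisects: int = _MAX_BISECTS,
) -> bool:
    payload = encode(batch)
    for retry_num in range(_MAX_RETRYS):
        status = send(payload)
        if status in (200, 202):
            return True
        if status == 413:  # payload too large: bisect instead of retrying
            if len(batch) <= 1 or remaining_bisects == 0:
                return False  # cannot split further, drop the batch
            if time.time() >= deadline_sec:
                return False  # deadline expired, drop the batch
            mid = len(batch) // 2
            # `and` short-circuits: if the first half fails, report FAILURE
            # without attempting the second half, mirroring the PR.
            return export_batch(
                batch[:mid], send, encode, deadline_sec, remaining_bisects - 1
            ) and export_batch(
                batch[mid:], send, encode, deadline_sec, remaining_bisects - 1
            )
        if status not in (429, 502, 503, 504):
            return False  # non-retryable error
        # Exponential backoff with +/-20% jitter, as in the exporter.
        backoff = 2**retry_num * random.uniform(0.8, 1.2)
        if time.time() + backoff > deadline_sec:
            return False
        time.sleep(backoff)
    return False
```

A fake `send` that returns 413 for payloads above a size threshold exercises the splitting path: an 8-record batch is bisected twice until each 2-record payload is accepted.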
**Contributor:** Under what scenarios do you need this feature? If you already know the payload limits of your backend, we could split based on a configurable `max_payload_size` and eliminate the 413 errors entirely.

**Contributor:** The reactive approach covers cases where the backend limit isn't known upfront or changes without SDK reconfiguration (e.g., proxies). I can tackle the proactive splitting via `max_payload_size` in a follow-up.
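For reference, the proactive variant discussed above, splitting up-front against a known size limit rather than reacting to a 413, could look roughly like this. Everything here is a hypothetical sketch: `max_payload_size` and `encoded_size` are illustrative names, not part of the SDK, and the greedy packing ignores protobuf framing overhead:

```python
from typing import Callable, Iterator, List, Sequence


def chunk_by_size(
    batch: Sequence,
    encoded_size: Callable,      # item -> serialized size in bytes (assumed)
    max_payload_size: int,
) -> Iterator[List]:
    """Greedily pack items into chunks whose summed size stays under the cap."""
    chunk: List = []
    chunk_size = 0
    for item in batch:
        size = encoded_size(item)
        if chunk and chunk_size + size > max_payload_size:
            yield chunk
            chunk, chunk_size = [], 0
        # An item larger than the cap by itself still gets its own chunk;
        # the backend will reject it, but we never split below one record.
        chunk.append(item)
        chunk_size += size
    if chunk:
        yield chunk
```

Unlike reactive bisection, this never wastes a round-trip on an oversized payload, but it requires knowing the limit and estimating per-record serialized sizes up front.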
**Contributor:** @Krishnachaitanyakc Do you have a custom collector implementation that returns 413 error codes? From my understanding, the vanilla OpenTelemetry Collector does not return 413 errors.

**Contributor:** @herin049 The vanilla Collector returns 400, not 413. The 413 comes from real-world examples such as:

The Collector's own exporter already treats 413 as a permanent error (open-telemetry/opentelemetry-collector#5674). There's also active spec work on payload limits (open-telemetry/opentelemetry-proto#782).
**Contributor:** @Krishnachaitanyakc Right, I'm more asking about the following. The typical flow for telemetry is SDK → Collector → backend. The only way you'd get a 413 error code is if the Collector returned it to the SDK, which from my understanding is not something the Collector currently does.
**Contributor:** In that flow, yes, the vanilla Collector doesn't return 413 to the SDK. But direct-to-backend (no Collector) is a supported deployment pattern, and that is what the original reporter on #4533 seems to be using.