
debug_bundle: support OAUTHBEARER auth in broker-side admin API #30225

Merged
dotnwat merged 1 commit into dev from debug-bundle-oauthbearer-authn
Apr 23, 2026

@david-yu
Contributor

@david-yu david-yu commented Apr 17, 2026

Closes #30222

Why is this needed?

Redpanda's admin API already supports OIDC/bearer tokens for inbound request authentication (docs) — so it's reasonable to ask: why does rpk debug remote-bundle start need any broker-side change to support OAUTHBEARER?

The answer is that this endpoint involves two distinct layers of authentication, and only one of them was already covered:

  1. Caller → admin API (transport-level auth on the inbound request). This is what the authentication docs describe. OIDC/bearer tokens here already work — no change needed.
  2. Creds embedded in the request body, forwarded to a subprocess. The debug bundle endpoint is unusual: the broker spawns an rpk debug bundle subprocess on the node, and that subprocess has to turn around and call the local Kafka API, schema registry, and admin API to collect the bundle. The caller supplies those credentials inside the POST body under an authentication field; the broker forwards them to the subprocess via -X flags.

Before this PR, the in-body authentication variant only modeled SCRAM ({username, password, mechanism}). Even if the caller authenticated their admin API request with a bearer token, there was no way to tell the subprocess "use OAUTHBEARER when you call Kafka" — the broker's JSON parser would reject or mis-parse a {token, mechanism} payload. This PR adds that second variant so the broker can emit -Xpass=token:<TOKEN> -Xsasl.mechanism=OAUTHBEARER to the subprocess.

TL;DR: admin API inbound auth ≠ debug-bundle subprocess auth. The existing OIDC support covers the first; this PR adds the second.

Context

This PR implements the broker-side half of end-to-end OAUTHBEARER support for rpk debug remote-bundle start. The full chain is:

  1. rpk: add OAUTHBEARER SASL mechanism support #30169 (merged/open) — adds OAUTHBEARER SASL support to rpk's Kafka, admin, and schema registry clients; adds an explicit rejection of OAUTHBEARER in rpk debug remote-bundle start with a "not yet supported" error until the broker side is ready
  2. redpanda-data/common-go#165 (prerequisite) — adds rpadmin.WithOAuthBearerAuthentication(token) and the {mechanism, token} JSON payload
  3. This PR — broker-side C++: parses the {mechanism, token} payload and passes -Xpass=token:<TOKEN> -Xsasl.mechanism=OAUTHBEARER to the rpk subprocess
  4. rpk-side follow-up (after common-go releases) — remove the out.Die in start.go and call rpadmin.WithOAuthBearerAuthentication(token) when the profile mechanism is OAUTHBEARER

Summary of changes

  • types.h: Add bearer_creds{token, mechanism}; expand debug_bundle_authn_options to std::variant<scram_creds, bearer_creds>
  • json.h: Add from_json<bearer_creds>; update debug_bundle_authn_options dispatch to select the variant by presence of "token" (OAUTHBEARER) vs "username" (SCRAM) — {mechanism: OAUTHBEARER} with no token field is rejected with invalid_parameters (400)
  • debug_bundle_service.cc: Add bearer_creds arm to the ss::visit that builds rpk subprocess args, emitting -Xpass=token:<TOKEN> and -Xsasl.mechanism=OAUTHBEARER
  • debug_bundle.json: Update authentication schema to document both the SCRAM and OAUTHBEARER variants
  • Tests:
    • json_test.cc: bearer_creds added to typed test suite (BasicType, TypeIsInvalid, ValidateControlCharacters); ParametersWithBearerAuth standalone test; BearerAuthMissingTokenIsRejected standalone test verifying 400 behaviour
    • debug_bundle_service_test.cc: test_bearer_creds_args verifies the subprocess receives -Xpass=token:<TOKEN> -Xsasl.mechanism=OAUTHBEARER; check_clean_up made robust against slow SHA256 I/O
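
To make the parsing and argument-building changes above concrete, here is a minimal, self-contained C++ sketch. It is an illustration under stated assumptions, not the code in this PR: the struct layouts, `json_object`, `require`, `parse_authn`, and `build_rpk_args` are hypothetical stand-ins, `std::visit` is used in place of `ss::visit`, and the SCRAM `-X` flag names are assumed. Only the OAUTHBEARER flags are taken from the description.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the types named in the summary; the real
// definitions in types.h may differ.
struct scram_creds {
    std::string username;
    std::string password;
    std::string mechanism;
};
struct bearer_creds {
    std::string token;
    std::string mechanism;
};
using authn_options = std::variant<scram_creds, bearer_creds>;

// Stand-in for a parsed JSON object (field name -> string value).
using json_object = std::map<std::string, std::string>;

// Return the named field, or throw to mimic an invalid_parameters (400) rejection.
inline const std::string& require(const json_object& obj, const std::string& key) {
    assert(!key.empty()); // callers always name a concrete field
    auto it = obj.find(key);
    if (it == obj.end()) {
        throw std::invalid_argument("missing field: " + key);
    }
    return it->second;
}

// Select the variant by which identifying field is present:
// "token" -> bearer_creds, otherwise the SCRAM fields are required.
inline authn_options parse_authn(const json_object& obj) {
    if (obj.count("token") > 0) {
        return bearer_creds{require(obj, "token"), require(obj, "mechanism")};
    }
    // A {mechanism: OAUTHBEARER} payload without "token" lands here and is
    // rejected because "username" is missing.
    return scram_creds{
      require(obj, "username"), require(obj, "password"), require(obj, "mechanism")};
}

// Helper for visiting a variant with one lambda per alternative.
template<class... Ts>
struct overloaded : Ts... {
    using Ts::operator()...;
};
template<class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Build the credential-related arguments for the rpk subprocess.
inline std::vector<std::string> build_rpk_args(const authn_options& authn) {
    return std::visit(
      overloaded{
        // SCRAM flag names are an assumption for illustration.
        [](const scram_creds& c) -> std::vector<std::string> {
            return {"-Xuser=" + c.username,
                    "-Xpass=" + c.password,
                    "-Xsasl.mechanism=" + c.mechanism};
        },
        // OAUTHBEARER flags as described in this PR.
        [](const bearer_creds& c) -> std::vector<std::string> {
            return {"-Xpass=token:" + c.token, "-Xsasl.mechanism=" + c.mechanism};
        }},
      authn);
}
```

With this shape, a payload containing `token` and `mechanism: OAUTHBEARER` yields the two OAUTHBEARER flags, while a payload that names OAUTHBEARER but omits `token` fails during parsing rather than reaching the subprocess.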

Test fixes (from CI run analysis)

Two pre-existing test issues were exposed by the new test ordering:

test_bearer_creds_args — the original implementation polled rpk_debug_bundle_status in a loop while expecting status == running, breaking out when stdout became non-empty (~1s). At that point the rpk-shim was still sleeping (5s total), so fixture teardown terminated a live process, producing a spurious "Failed to terminate" warning. Rewritten to use run_bundle so the process completes before the stdout is inspected.

check_clean_up — the test waited for the .out file with a 10s budget starting from when the zip file appeared (~T+0). The .out file is only written after the process exits (T+5s) and set_metadata finishes its calculate_sha256_sum call; on slow Bazel sandbox I/O, SHA256 was observed taking >10s, blowing the deadline. Fixed by replacing wait_for_file_to_be_created(.out, 10s) with wait_for_kvstore_to_populate(30s): the kvstore entry is written only after the .out file, making it a reliable completion signal with headroom for SHA256 latency.
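
The check_clean_up fix comes down to waiting on the last-written completion signal with a generous deadline. Below is a minimal C++ sketch of deadline-bounded polling. The `wait_until` helper is hypothetical and uses blocking standard-library calls; the real tests rely on Seastar-based helpers such as wait_for_kvstore_to_populate, whose implementation is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: poll `done` until it returns true or `timeout` elapses.
// Returns true if the condition was observed, false if the deadline passed.
inline bool wait_until(
  const std::function<bool()>& done,
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
    assert(interval.count() > 0); // a zero interval would busy-spin
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}
```

The design point is less the loop than the choice of predicate: checking the artifact that is written last (here, the kvstore entry) means a single wait covers every earlier step, including the slow SHA256 calculation.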

Test plan

  • bazel test //src/v/debug_bundle/tests:json_test
  • bazel test //src/v/debug_bundle/tests:debug_bundle_service_test
  • bazel run //tools:clang_format

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • Add OAUTHBEARER SASL mechanism support to the admin API endpoint used by rpk debug remote-bundle start, enabling remote debug bundle collection against clusters configured with OAUTHBEARER authentication.

Generated with Claude Code

@david-yu
Contributor Author

This should be backported to 26.1.x and 25.3.x

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 20, 2026

CI test results

test results on build#83412

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| FAIL | debug_bundle_service_started_fixture | check_clean_up | | unit | https://buildkite.com/redpanda/redpanda/builds/83412#019dac33-c067-421e-95ef-a944023b2ae7 | 0/1 | | |

test results on build#83430

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| FLAKY(PASS) | InternalTopicProtectionLargeClusterTest | test_consumer_offset_topic | null | integration | https://buildkite.com/redpanda/redpanda/builds/83430#019dae05-cb61-433d-b25a-3b602c98bcb4 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=InternalTopicProtectionLargeClusterTest&test_method=test_consumer_offset_topic |

@tyson-redpanda
Contributor

This should be backported to 26.1.x and 25.3.x

@david-yu please update your PR description to follow the template: https://github.com/redpanda-data/redpanda/blob/dev/.github/pull_request_template.md

If you want a backport, you can check those boxes and it'll happen automatically.

Member

@dotnwat dotnwat left a comment


The debug bundle unit tests failed

| 2026-04-20 18:53:39 UTC | //src/v/debug_bundle/tests:debug_bundle_service_test FAILED in 109.9s |
| -- | -- |
| 2026-04-20 18:53:39 UTC | /root/.cache/bazel/_bazel_root/661a3f61959a8d0bfbf25f4f3b68e0c0/execroot/_main/bazel-out/aarch64-dbg/testlogs/src/v/debug_bundle/tests/debug_bundle_service_test/test.log |

The logs for the test failure are available as artifacts on the run if you click through to the buildkite job. Happy to help navigate that with you if needed!

@david-yu david-yu requested a review from dotnwat April 21, 2026 04:11
@david-yu
Contributor Author

Failures resolved; awaiting review. No rush, though.

@david-yu
Contributor Author

david-yu commented Apr 22, 2026

From talking to @mattschumpert: there are two separate endpoints to authenticate against, the brokers and the admin API. We should try to do service discovery for both. I will need to dig into whether this is needed.

Member

@dotnwat dotnwat left a comment


Can we clean up the commit history a bit? There are three issues with it:

  1. In general we want to avoid having commits in a PR that fix a bug that was introduced in a previous commit in the same PR. So for example, the commit debug_bundle: fix test_bearer_creds_args and check_clean_up appears to fix issues introduced in the first commit of the PR.
  2. That same commit, debug_bundle: fix test_bearer_creds_args and check_clean_up, also fixes an apparent issue with the "check_clean_up" test that appears to be unrelated to the OAUTHBEARER changes. It would be nice to factor that fix out into its own commit so that we can discuss it separately.
  3. We'll want to remove the merge commits from the commit history. This can generally be done right before merging with a git rebase.

On testing: since #30169 merged, it looks like we can now add a ducktape test so that we get end-to-end coverage for the feature. The unit tests in this PR are one part of the testing strategy, but they don't exercise the feature end to end over the network. Is that possible, or are there still pieces missing before we can have an e2e test? Manual testing is another option if we don't expect to derive value from an e2e test.

Extends the debug bundle SASL credential path to handle OAUTHBEARER in
addition to the existing SCRAM variants.

The admin API already accepts OIDC/bearer tokens for inbound request
authentication. However, `rpk debug remote-bundle start` involves a
second layer of auth: the broker spawns an `rpk debug bundle`
subprocess, which must connect back to the local Kafka, schema
registry, and admin APIs to collect the bundle. The caller supplies
those credentials inside the POST body under an `authentication`
field; the broker forwards them to the subprocess via -X flags. Until
now, that in-body `authentication` variant only modeled SCRAM
(`{username, password, mechanism}`) — there was no way to instruct
the subprocess to use OAUTHBEARER when calling Kafka.

- types.h: add bearer_creds{token, mechanism} and expand
  debug_bundle_authn_options to std::variant<scram_creds, bearer_creds>
- json.h: add from_json<bearer_creds>; update debug_bundle_authn_options
  dispatch to select the variant by the presence of "token" (OAUTHBEARER)
  vs "username" (SCRAM), so a missing "token" field on an OAUTHBEARER
  payload is rejected with invalid_parameters (400)
- debug_bundle_service.cc: add bearer_creds arm to the ss::visit that
  builds rpk subprocess args; emits -Xpass=token:<TOKEN> and
  -Xsasl.mechanism=OAUTHBEARER, which rpk already accepts
- debug_bundle.json: update the authentication schema to document both
  the SCRAM and OAUTHBEARER variants
- tests: extend json_test.cc typed suite with bearer_creds and update
  debug_bundle_authn_options cases; add standalone tests for parameters
  with OAUTHBEARER auth and for rejection of a missing token field; add
  test_bearer_creds_args to the service test to verify correct
  subprocess argument emission

Test fixes surfaced by the new test ordering:

- test_bearer_creds_args: the previous implementation polled
  rpk_debug_bundle_status in a loop while expecting the process to
  still be running. When stdout became non-empty (~1s after start),
  the loop broke and the test body finished while the rpk-shim was
  still sleeping, causing the fixture to tear down a live process.
  Rewritten to use run_bundle so the process completes before checking
  the output.
- check_clean_up: the test waited for the .out file with a 10s budget
  starting from when the zip appeared (~T+0). The .out file is written
  only after the process exits (T+5s) and set_metadata finishes its
  SHA256 calculation; on slow sandbox I/O this SHA256 step was observed
  to take >10s. Replace wait_for_file_to_be_created(.out, 10s) with
  wait_for_kvstore_to_populate(30s): the kvstore is updated only after
  the .out file is written, making it a reliable completion signal
  with headroom for SHA256 latency.

Also fixes an ill-formed duplicate default argument on
wait_for_kvstore_to_populate: the forward declaration already specifies
the default, and repeating it in the definition is ill-formed in C++.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
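
The duplicate default argument mentioned at the end of the commit message follows from a general C++ rule: a default argument may be given in the declaration or in the definition, but not repeated in both within the same scope. Here is a minimal sketch, assuming a hypothetical function `wait_for_signal` rather than the real wait_for_kvstore_to_populate signature:

```cpp
#include <cassert>
#include <chrono>

// The forward declaration carries the default argument.
bool wait_for_signal(std::chrono::seconds timeout = std::chrono::seconds(30));

// The definition must not repeat it. Writing
//   bool wait_for_signal(std::chrono::seconds timeout = std::chrono::seconds(30)) { ... }
// here would redefine the default argument and be rejected by the compiler.
bool wait_for_signal(std::chrono::seconds timeout) {
    assert(timeout.count() >= 0); // negative budgets make no sense
    // Placeholder body: report whether any time budget was granted.
    return timeout.count() > 0;
}
```

Callers that omit the argument still get the 30s default, because the default is looked up from the declaration that is visible at the call site.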
@david-yu david-yu force-pushed the debug-bundle-oauthbearer-authn branch from b982155 to c45b786 Compare April 22, 2026 23:37
@david-yu
Contributor Author

Yes, a ducktape e2e test is feasible. The missing pieces are: (1) tagging common-go#165 as a new rpadmin release (currently rpk is on v0.2.5 which doesn't have WithOAuthBearerAuthentication), and (2) a follow-up rpk PR to remove the out.Die guard in start.go:87 and add the bearer arm in toRpadminOptions. Once those are done the ducktape test can follow the DebugBundleSCRAMAuthn pattern + KeycloakService (already used in redpanda_oauth_test.py). I'll open the rpk PR to track this and tag common-go#165 first, then the ducktape test can land together with the rpk changes.

Member

@dotnwat dotnwat left a comment


LGTM. In the future, please split changes into separate commits (e.g. the "check_clean_up" fix should be in its own commit).

@dotnwat dotnwat merged commit f853ae2 into dev Apr 23, 2026
23 checks passed
@dotnwat dotnwat deleted the debug-bundle-oauthbearer-authn branch April 23, 2026 19:06
david-yu added a commit that referenced this pull request Apr 23, 2026
Add DebugBundleOAuthBearerAuthn, an end-to-end ducktape test that
exercises the full OAUTHBEARER forwarding path introduced in #30225
and #30277.

The test spins up a Keycloak OIDC provider alongside a single Redpanda
broker configured with SASL OAUTHBEARER.  It issues a client credentials
token from Keycloak, then POSTs a debug bundle start request with
authentication: {mechanism: OAUTHBEARER, token: <JWT>}.  The broker
forwards the token to the rpk subprocess via -Xsasl.mechanism and
-Xpass=token:..., and the subprocess authenticates to Kafka using the
JWT.  The test verifies the bundle completes successfully and the
expected topic appears in kafka.json.

Supporting changes:
- redpanda_types.py: add OAuthBearerCredentials dataclass, which
  serializes to {mechanism, token} (handled by the existing
  DebugBundleEncoder dataclass branch without any special-casing)
- admin.py: widen DebugBundleStartConfigParams.authentication to accept
  OAuthBearerCredentials alongside the existing SaslCredentials

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@david-yu
Contributor Author

Will do, thank you!

david-yu added a commit that referenced this pull request Apr 27, 2026
Remove the early-exit guard that rejected OAUTHBEARER profiles in
'rpk debug remote-bundle start'. The broker-side admin API now accepts
a bearer_creds payload (#30225), and rpadmin v0.2.6 exposes
WithOAuthBearerAuthentication so rpk can forward the token.

toRpadminOptions now dispatches on mechanism: OAUTHBEARER profiles
call WithOAuthBearerAuthentication(token), all other SASL profiles
fall through to the existing WithSCRAMAuthentication path.

Add DebugBundleOAuthBearerAuthn, an end-to-end ducktape test that
exercises the full OAUTHBEARER forwarding path. The test spins up a
Keycloak OIDC provider alongside a single Redpanda broker configured
with SASL OAUTHBEARER. It issues a client credentials token from
Keycloak, then POSTs a debug bundle start request with authentication:
{mechanism: OAUTHBEARER, token: <JWT>}. The broker forwards the token
to the rpk subprocess via -Xsasl.mechanism and -Xpass=token:..., and
the subprocess authenticates to Kafka using the JWT. The test verifies
the bundle completes successfully and the expected topic appears in
kafka.json.

Supporting changes:
- redpanda_types.py: add OAuthBearerCredentials dataclass, which
  serializes to {mechanism, token}
- admin.py: widen DebugBundleStartConfigParams.authentication to accept
  OAuthBearerCredentials alongside the existing SaslCredentials

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

debug_bundle: support OAUTHBEARER auth in broker-side admin API

4 participants