Skip to content

tools: run clang-format/clang-tidy in a separate bazel output_base#30213

Merged
travisdowns merged 1 commit intoredpanda-data:devfrom
travisdowns:td-noblock-tools
Apr 20, 2026
Merged

tools: run clang-format/clang-tidy in a separate bazel output_base#30213
travisdowns merged 1 commit intoredpanda-data:devfrom
travisdowns:td-noblock-tools

Conversation

@travisdowns
Copy link
Copy Markdown
Member

Invoking tools/clang-format or tools/clang-tidy currently uses the main
bazel server's output_base, which means the tool invocation acquires the
same server lock as builds. If a build is in progress, the tool just sits
there waiting — potentially for many minutes. Fast, interactive operations
(formatting one file, running tidy on a diff) turn into long blocks, and
it's especially confusing when driven by Claude because the command looks
indistinguishable from a hang (which it already sometimes is for other
reasons).

This PR introduces an opt-in BAZEL_TRAMPOLINE_PARALLEL env var. When set,
bazel-tool-trampoline.sh passes --output_base=<default>_rptool so the
tool runs against a dedicated bazel server that can operate concurrently
with the main one. clang-tool.sh (the shim behind tools/clang-format
and tools/clang-tidy) defaults it to 1, so those tools opt in
automatically; callers can still set BAZEL_TRAMPOLINE_PARALLEL=0 to
restore the old behavior.

Computing the default output_base is itself awkward, because
bazel info output_base also takes the main lock. The script caches the
value at tools/bazel-tools/.cache/output_base and refreshes it
opportunistically with bazel --noblock_for_lock info output_base
(exit 9 = lock held, fall back to cache). Only a cold cache forces a
blocking query.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@travisdowns
Copy link
Copy Markdown
Member Author

travisdowns commented Apr 17, 2026

The other implementation i could think of is executing the clang-format binary directly, instead of using bazel run, which would also probably require some similar caching nonsense (to remember the path of the tool), and maybe a few more footguns.

This separate output base takes ~802 MB on my machine.

It actually seems to speed up tool execution from ~200ms to ~100ms, probably because bazel run happens against a server with much less loaded state ("or something").

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in mechanism for Bazel tool trampolines (and enables it by default for clang tools) to run using a dedicated Bazel output_base, avoiding lock contention with the primary Bazel server and improving interactive latency.

Changes:

  • Add BAZEL_TRAMPOLINE_PARALLEL support to run bazel run with a separate --output_base and cache the default output_base.
  • Default tools/clang-format / tools/clang-tidy shim behavior to enable parallel mode.
  • Ignore the new .cache/ directory used for output_base caching.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
tools/clang-tool.sh Defaults clang tool invocations to opt into parallel Bazel output base behavior.
tools/bazel-tools/support/bazel-tool-trampoline.sh Implements parallel output_base selection + caching and threads --output_base into bazel run.
tools/bazel-tools/.gitignore Ignores the .cache/ directory created by the trampoline.

Comment on lines +97 to +100
tool_output_base="${default_output_base}_rptool"
output_base_flag=(--output_base="$tool_output_base")

debug "tool output_base : $tool_output_base"
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When BAZEL_TRAMPOLINE_PARALLEL is enabled, bazel run executes against a different --output_base, but the clang-tidy plugin path (LOADS) is still derived from bazel info bazel-bin using the default output base (earlier in this script). That can cause clang-tidy to fail to find plugins.so (or load a plugin built in a different output base) on a clean checkout or when the main output base hasn't built it yet. Consider computing bazel-bin using the same --output_base as the tool invocation (or restructuring so output_base_flag is decided before resolving BAZEL_BIN/LOADS).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — this was the reason I originally left the whole thing opt-in. Rather than try to compute plugin paths against the _rptool output_base (which would force clang-tidy to rebuild them), I split the shim: clang-format now goes through a new tools/clang-tool-parallel.sh that sets BAZEL_TRAMPOLINE_PARALLEL=1, while clang-tidy still points at clang-tool.sh and therefore keeps using the default output_base where plugins.so already lives.

Comment on lines +66 to +68
# - Cache exists: try `bazel --block_for_lock=false info output_base` to
# opportunistically refresh the cache; on lock-held (exit 9) just use the
# cached value.
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments refer to --block_for_lock=false, but the actual command uses --noblock_for_lock. To avoid confusing future maintainers, align the documentation with the flag that is actually used (and the exit-code semantics described).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned the comments to reference --noblock_for_lock (the form actually used in the code) in the amended commit.

Comment on lines +83 to +93
fresh=$(bazel --noblock_for_lock info output_base 2>/dev/null) || rc=$?
if ((rc == 0)); then
debug "output_base from: live bazel query (cache refreshed)"
default_output_base=$fresh
echo "$default_output_base" >"$cache_file"
elif ((rc == 9)); then
default_output_base="$(cat "$cache_file")"
debug "output_base from: cache ($cache_file, lock held)"
else
echo "ERROR: bazel-tool-trampoline.sh: bazel info output_base failed (exit $rc)" >&2
exit "$rc"
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-blocking bazel ... info output_base call discards stderr (2>/dev/null). For failures other than lock-held (rc=9), this makes the surfaced error much harder to diagnose because only a generic message is printed. Consider capturing stderr and printing it when rc != 0 && rc != 9 (while still keeping the lock-held path quiet).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, if it becomes a problem we can fix it

Comment on lines +74 to +90
if [[ ! -f $cache_file ]]; then
debug "output_base from: blocking bazel query (no cache yet)"
default_output_base="$(bazel info output_base)"
echo "$default_output_base" >"$cache_file"
else
# Bazel exits 9 (LOCK_HELD_NOBLOCK_FOR_LOCK) when --block_for_lock=false
# cannot acquire the server lock. Any other non-zero exit is a real
# error and should surface rather than silently falling back to cache.
rc=0
fresh=$(bazel --noblock_for_lock info output_base 2>/dev/null) || rc=$?
if ((rc == 0)); then
debug "output_base from: live bazel query (cache refreshed)"
default_output_base=$fresh
echo "$default_output_base" >"$cache_file"
elif ((rc == 9)); then
default_output_base="$(cat "$cache_file")"
debug "output_base from: cache ($cache_file, lock held)"
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache updates write directly to output_base (echo ... >"$cache_file"). If the process is interrupted mid-write, the cache can become truncated/empty and later produce an invalid --output_base like "_rptool". Consider writing to a temp file and atomically mv-ing into place, and/or validating the cached value is non-empty before using it.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to be pretty unlucky to get a truncated write of a single line :)

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Retry command for Build#83315

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

CI test results

test results on build#83315
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL ClusterConfigTest test_valid_settings null integration https://buildkite.com/redpanda/redpanda/builds/83315#019d9be1-f3d5-4f78-9a6b-408f99ea3d5f 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_valid_settings

# their output base. disk_cache will in general obviate the need to actually
# rebuild anything.
#
# Determining the default output_base is tricky: `bazel info output_base` itself
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the default output_base at all? Isn't that equivalent to running with no args?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We build the 2nd base output based on the first one, by appending _rptool, so it goes in the same directory, gets wiped when you wipe that, on the same mount, etc.

Instead of just dumping it somehwere in /tmp (which maybe also isn't a terrible idea, but it will accumulate).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I see, guess if it works then we might as well just go with it now.

Previously, invoking `tools/clang-format` or `tools/clang-tidy` used the
main bazel server's output_base, which meant the tool invocation would
block on the main server's lock whenever a build was in progress. This
turned simple, fast operations (formatting a single file, running tidy
on a diff) into multi-minute waits — very annoying interactively, and
especially confusing when driven by Claude since the command just looks
like it's hanging (which it already sometimes does for other reasons).

Add a BAZEL_TRAMPOLINE_PARALLEL env var that, when set, causes
bazel-tool-trampoline.sh to pass `--output_base=<default>_rptool` so the
tool runs under a dedicated bazel server. Default it to 1 in
clang-tool.sh so clang-format and clang-tidy opt in automatically.

Determining the default output_base is itself fiddly because
`bazel info output_base` also takes the main lock. We cache the value in
tools/bazel-tools/.cache/output_base, refreshing it opportunistically
via `bazel --noblock_for_lock info output_base` (exit 9 = lock held) and
only blocking when the cache doesn't yet exist.
@travisdowns travisdowns merged commit 3a600a1 into redpanda-data:dev Apr 20, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants