base: finer grained log filtering in vlog #30202

Open
WillemKauf wants to merge 6 commits into redpanda-data:dev from WillemKauf:log_filter

@WillemKauf
Contributor

@WillemKauf WillemKauf commented Apr 16, 2026

Adds per-callsite toggling to our existing seastar vlog macros. Requires changes in Seastar made in redpanda-data/seastar#272.

This feature allows users to force log lines on or off at a directory, file, or line granularity. When a log line is forced on, it is logged regardless of the globally configured seastar log level (e.g. a TRACE-level line still appears even when the configured level is INFO).

Similarly, when a log line is forced off, it is not shown, regardless of the globally configured seastar log level.
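
The on/off semantics described above can be summarised as a small decision function. This is only a sketch: `log_level`, `filter_state` and `should_emit` are hypothetical names chosen for illustration, not the types this PR adds, and the ordering of the level enum (more severe first) is an assumption.

```cpp
#include <cstdint>

// Hypothetical names for illustration; not the PR's actual types.
// Assumed ordering: lower values are more severe.
enum class log_level : std::uint8_t { error, warn, info, debug, trace };
enum class filter_state : std::uint8_t { inherited, force_on, force_off };

// Returns true when a line at `line_level` should be emitted, given the
// globally configured level and the callsite's filter state.
constexpr bool should_emit(
  filter_state state, log_level line_level, log_level configured) {
    switch (state) {
    case filter_state::force_on:
        return true; // bypasses the level gate entirely
    case filter_state::force_off:
        return false; // suppressed even if the level gate would pass
    case filter_state::inherited:
        return line_level <= configured; // the ordinary level check
    }
    return false; // unreachable; keeps compilers quiet
}
```

With these assumptions, a forced-on TRACE line is emitted at a configured level of INFO, while a forced-off ERROR line is not.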

Some examples of the available API through the admin endpoint:

  • Force every vlog callsite in src/v/storage/disk_log_impl.cc on:
  curl -sS -u admin:admin -X POST \
    http://localhost:9644/redpanda.core.admin.v2.internal.LogFilterService/SetLogFilter \
    -H 'content-type: application/json' \
    -d '{"rules":[
      {"file":"src/v/storage/disk_log_impl.cc",
       "state":"LOG_FILTER_STATE_FORCE_ON"}]}'
  • Silence a specific noisy line (file + line number):
  curl -sS -u admin:admin -X POST \
    http://localhost:9644/redpanda.core.admin.v2.internal.LogFilterService/SetLogFilter \
    -H 'content-type: application/json' \
    -d '{"rules":[
      {"file":"src/v/cluster/partition_balancer_backend.cc","line":[423],
       "state":"LOG_FILTER_STATE_FORCE_OFF"}
    ]}'

  • Force on loggers across a range of lines:
  -d '{"rules":[{"file":"src/v/raft/consensus.cc","line":[1200,1450],"state":"LOG_FILTER_STATE_FORCE_ON"}]}'
  • Force on a whole directory, but leave one file alone:
  -d '{"rules":[
    {"file":"src/v/storage/*","state":"LOG_FILTER_STATE_FORCE_ON"},
    {"file":"src/v/storage/spill_key_index.cc","state":"LOG_FILTER_STATE_INHERITED"}
  ]}'

(Rules apply in order; last match wins.)

  • Match by format-string substring:
  -d '{"rules":[{"contains":"slow consumer","state":"LOG_FILTER_STATE_FORCE_OFF"}]}'
  • Inspect current rules:
  curl -sS -u admin:admin -X POST \
    http://localhost:9644/redpanda.core.admin.v2.internal.LogFilterService/GetLogFilter \
    -H 'content-type: application/json' -d '{}'
  • List every registered callsite (add file_filter to narrow):
  curl -sS -u admin:admin -X POST \
    http://localhost:9644/redpanda.core.admin.v2.internal.LogFilterService/ListLogCallsites \
    -H 'content-type: application/json' \
    -d '{"file_filter":"src/v/raft/*"}'
  • Clear all rules:
  curl -sS -u admin:admin -X POST \
    http://localhost:9644/redpanda.core.admin.v2.internal.LogFilterService/ResetLogFilter \
    -H 'content-type: application/json' -d '{}'
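
To make the "rules apply in order; last match wins" note concrete, here is a minimal sketch of how such an evaluation could work. The `rule` struct, `glob_match`, `matches` and `evaluate` below are hypothetical illustrations rather than the PR's implementation. The glob semantics (`*` matches any run of characters, including `/`) and the inclusive `[lo, hi]` line range are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

enum class filter_state { inherited, force_on, force_off };

// Hypothetical rule shape mirroring the JSON fields used in the examples.
struct rule {
    std::string file;      // glob; empty matches any file
    std::vector<int> line; // empty = any, {n} = exact, {lo, hi} = range
    std::string contains;  // format-string substring; empty = any
    filter_state state;
};

// Minimal glob: '*' matches any run of characters (assumed semantics).
bool glob_match(std::string_view pat, std::string_view s) {
    constexpr auto npos = std::string_view::npos;
    std::size_t p = 0, i = 0, star = npos, mark = 0;
    while (i < s.size()) {
        if (p < pat.size() && pat[p] == '*') {
            star = p++; // remember the star and try matching zero chars
            mark = i;
        } else if (p < pat.size() && pat[p] == s[i]) {
            ++p;
            ++i;
        } else if (star != npos) {
            p = star + 1; // backtrack: let the star absorb one more char
            i = ++mark;
        } else {
            return false;
        }
    }
    while (p < pat.size() && pat[p] == '*') {
        ++p;
    }
    return p == pat.size();
}

bool matches(
  const rule& r, std::string_view file, int line, std::string_view fmt) {
    if (!r.file.empty() && !glob_match(r.file, file)) return false;
    if (r.line.size() == 1 && line != r.line[0]) return false;
    if (r.line.size() == 2 && (line < r.line[0] || line > r.line[1]))
        return false;
    if (!r.contains.empty() && fmt.find(r.contains) == std::string_view::npos)
        return false;
    return true;
}

// Walks the rules in order; the last matching rule decides the state.
filter_state evaluate(
  const std::vector<rule>& rules,
  std::string_view file,
  int line,
  std::string_view fmt) {
    filter_state result = filter_state::inherited;
    for (const auto& r : rules) {
        if (matches(r, file, line, fmt)) result = r.state;
    }
    return result;
}
```

Under these assumptions, the "whole directory, but leave one file alone" example resolves `spill_key_index.cc` to inherited, because the later, more specific rule overrides the earlier glob.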

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings April 16, 2026 21:08
@WillemKauf WillemKauf changed the title from "base: finer grained log filtering" to "base: finer grained log filtering in vlog" Apr 16, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds runtime, per-callsite filtering for the vlog macro family, plus an internal admin RPC service to apply/reset/list log filter rules on a live node.

Changes:

  • Introduces a vlog::detail::callsite registry and rule engine (apply_rules/reset_rules/for_each_callsite) to enable/disable individual vlog callsites at runtime.
  • Updates vlog/vlogl/vloglr macros to gate formatting/argument evaluation behind a per-callsite atomic<bool>.
  • Adds LogFilterService (proto + admin service impl) with RPCs to set/reset rules and list registered callsites; includes a new unit test target for the filtering logic.
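
For readers unfamiliar with the pattern, a self-registering callsite list gated by an atomic flag might look roughly like the sketch below. It is a simplified illustration with hypothetical names (`callsite`, `registry_head`, `register_once`, `count_callsites`) and is not the code in vlog_callsite.cc; only the `for_each_callsite` name is taken from the summary above, and its signature here is assumed.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical simplified callsite; the real type lives in vlog::detail.
struct callsite {
    callsite(const char* f, int l)
      : file(f)
      , line(l) {}

    const char* file;
    int line;
    std::atomic<bool> enabled{true};     // the per-callsite gate
    std::atomic<bool> registered{false}; // guards one-time registration
    callsite* next{nullptr};             // intrusive list link
};

// Head of a push-only, lock-free singly linked list.
std::atomic<callsite*> registry_head{nullptr};

// Pushes `cs` onto the list at most once, using a CAS loop on the head.
void register_once(callsite& cs) {
    if (cs.registered.exchange(true, std::memory_order_acq_rel)) {
        return; // already registered by an earlier hit
    }
    callsite* head = registry_head.load(std::memory_order_relaxed);
    do {
        cs.next = head;
    } while (!registry_head.compare_exchange_weak(
      head, &cs, std::memory_order_release, std::memory_order_relaxed));
}

// Visits every registered callsite; nodes are never removed.
template<typename Fn>
void for_each_callsite(Fn&& fn) {
    for (callsite* p = registry_head.load(std::memory_order_acquire);
         p != nullptr;
         p = p->next) {
        fn(*p);
    }
}

std::size_t count_callsites() {
    std::size_t n = 0;
    for_each_callsite([&n](callsite&) { ++n; });
    return n;
}
```

Because nodes are only ever pushed and never unlinked, traversal needs no locking; a rule engine can walk the list and flip `enabled` on matching entries.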

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/v/redpanda/application_admin.cc Registers the new internal log filter admin service.
src/v/redpanda/admin/services/internal/log_filter.h Declares the LogFilterService implementation class.
src/v/redpanda/admin/services/internal/log_filter.cc Implements RPC handlers translating wire rules to vlog::apply_rules and listing callsites.
src/v/redpanda/admin/services/internal/BUILD Adds a new library target for the log filter admin service.
src/v/redpanda/BUILD Links the new internal log filter service into the redpanda target.
src/v/base/vlog_filter.h Defines rule schema and the public filter API (apply_rules/reset_rules/for_each_callsite).
src/v/base/vlog_callsite.h Introduces the per-callsite data structure and enabled flag.
src/v/base/vlog_callsite.cc Implements rule matching/evaluation, rules snapshot publication, and the lock-free callsite registry.
src/v/base/vlog.h Updates vlog* macros to consult the per-callsite enabled flag before formatting/log dispatch.
src/v/base/tests/vlog_filter_test.cc Adds unit tests for rule matching, ordering, registration timing, and defaults.
src/v/base/tests/BUILD Adds a new gtest target for vlog filtering.
src/v/base/BUILD Adds the new callsite source and headers to the base library.
proto/redpanda/core/admin/internal/v1/log_filter.proto Adds the RPC/service and message schema for log filtering and callsite listing.
proto/redpanda/core/admin/internal/v1/BUILD Adds build targets for the new proto.

Comment on lines +38 to +39
// Enabled value written to every site the rule matches.
bool enabled = 4;

Copilot AI Apr 16, 2026

LogFilterRule.enabled is a non-optional proto3 bool, so if a client omits it the default will be false (disabling matching callsites). That conflicts with the in-process default (vlog::rule::enabled = true) and makes it easy for partially-specified rules (e.g. {file:"..."}) to unintentionally disable logs. Consider making this field optional bool enabled = 4 (so presence is tracked) and defaulting to true server-side when it is not set, or otherwise redesigning the API so the default behavior is unambiguous.

Suggested change
// Enabled value written to every site the rule matches.
bool enabled = 4;
// Enabled value written to every site the rule matches. If omitted,
// the server should default this to true.
optional bool enabled = 4;
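
The presence problem raised in this comment can be modelled in plain C++ with `std::optional<bool>`. The `wire_rule` struct and both helper functions are hypothetical stand-ins for illustration, not the generated protobuf types or their accessors.

```cpp
#include <optional>

// Hypothetical stand-in for a decoded rule; not the generated protobuf type.
struct wire_rule {
    std::optional<bool> enabled; // empty when the client omitted the field
};

// With presence tracking, an omitted field can default to true server-side.
inline bool resolve_enabled(const wire_rule& r) {
    return r.enabled.value_or(true);
}

// Without presence tracking, an omitted scalar is indistinguishable from an
// explicit false: the receiver only ever sees the zero value.
inline bool resolve_enabled_plain(bool wire_value) { return wire_value; }
```

Here a default-constructed `wire_rule` resolves to true, whereas the plain-bool variant resolves an omitted (zero-valued) field to false, which is the surprising behaviour the comment warns about.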

// A registered vlog callsite and its current enabled state.
message LogCallsiteInfo {
// Source-file basename (path stripped at compile time).

Copilot AI Apr 16, 2026

The LogCallsiteInfo.file field comment says the value is a basename with the path stripped, but cs.file() comes from __FILE__ (and the PR description suggests matching on paths like src/v/storage/*). Either update the comment to describe the actual value (e.g. __FILE__ path as compiled), or change the implementation to strip to basename before returning it to clients so the API matches its documentation.

Suggested change
// Source-file basename (path stripped at compile time).
// Callsite source file path as compiled into __FILE__.

@WillemKauf WillemKauf force-pushed the log_filter branch 2 times, most recently from 71bd540 to 6a4f868 Compare April 16, 2026 22:44
@vbotbuildovich
Collaborator

CI test results

test results on build#83283
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL MasterTestSuite quota_manager_fetch_throttling unit https://buildkite.com/redpanda/redpanda/builds/83283#019d9878-5fc2-406d-af39-527572979852 0/1
FLAKY(PASS) TxAtomicProduceConsumeTest test_basic_tx_consumer_transform_produce {"with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/83283#019d9893-76da-4561-acb5-875f73822942 10/11 Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce

@WillemKauf
Contributor Author

/ci-repeat 1

Each vlog/vlogl/vloglr invocation now emits a static vlog::detail::callsite
that self-registers in a lock-free intrusive list at first hit and gates
the log emission on a relaxed atomic<bool>. Runtime filter rules
(file-glob, file+line, format-substring) flip those flags through
vlog::apply_rules, which snapshots the current rule set as a shared_ptr
swapped atomically via std::atomic_load/store free functions. A cold
callsite evaluates its initial enabled state against the currently-
published rules so filters set before a line first fires take effect on
its very first invocation.

The enabled-line cost is an extra relaxed load plus a well-predicted
branch; the disabled-line cost drops from full format-argument evaluation
to just that branch.
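
As a rough illustration of the snapshot scheme this commit message describes, the sketch below publishes an immutable rule set through the `std::atomic_load` / `std::atomic_store` free functions for `shared_ptr` (deprecated since C++20, but the functions the message names). The `rule` shape, `publish`, `snapshot` and `initial_enabled` are hypothetical and matching is reduced to exact file equality for brevity.

```cpp
#include <atomic>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced rule: exact file name plus enabled flag.
struct rule {
    std::string file;
    bool enabled;
};
using rule_set = std::vector<rule>;

// Currently published rules; only accessed via the atomic free functions.
std::shared_ptr<const rule_set> published_rules;

// Builds an immutable snapshot and swaps it in atomically.
void publish(rule_set rules) {
    auto snap = std::make_shared<const rule_set>(std::move(rules));
    std::atomic_store(&published_rules, std::move(snap));
}

// Readers take a reference-counted copy of the pointer, so a concurrent
// publish() cannot free the rule set they are iterating over.
std::shared_ptr<const rule_set> snapshot() {
    return std::atomic_load(&published_rules);
}

// What a cold callsite could do on first hit: evaluate its initial state
// against whatever rules are published at that moment (last match wins).
bool initial_enabled(const std::string& file) {
    bool enabled = true;
    if (auto rules = snapshot()) {
        for (const auto& r : *rules) {
            if (r.file == file) enabled = r.enabled;
        }
    }
    return enabled;
}
```

This is what lets a filter set before a line first fires apply on its very first invocation: the callsite does not need to exist in the registry when the rules are published.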

`base`: promote vlog callsite gate from bool to 4-value state enum

Callsite state becomes {uninit, default_, force_on, force_off}.
Filter rules carry a state field in place of bool enabled.
evaluate() returns the resolved state; apply_rules / reset_rules
write state values via set_state. The vlog.h macro layer and the
existing tests still compile against bool enabled — both are
updated in subsequent commits.
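
A minimal sketch of a four-value gate with lazy initialisation follows. The enum values and the `resolved_state`, `set_state` and `slow_init` names come from the commit messages on this page, but the `callsite_gate` class, the function-pointer hook standing in for rule evaluation, and the CAS-based race handling are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>

enum class state : std::uint8_t { uninit, default_, force_on, force_off };

class callsite_gate {
public:
    // `initial` is a hypothetical hook standing in for "evaluate the
    // currently published rules for this callsite".
    explicit callsite_gate(state (*initial)())
      : _initial(initial) {}

    // Fast path: one relaxed load. Only the first hit takes slow_init().
    state resolved_state() {
        state s = _state.load(std::memory_order_relaxed);
        if (s == state::uninit) {
            s = slow_init();
        }
        return s;
    }

    // Used by the rule engine to overwrite the state at any time.
    void set_state(state s) { _state.store(s, std::memory_order_relaxed); }

private:
    state slow_init() {
        state expected = state::uninit;
        const state desired = _initial();
        // If someone else initialised or overrode the state first, keep
        // their value instead of clobbering it.
        if (_state.compare_exchange_strong(
              expected, desired, std::memory_order_relaxed)) {
            return desired;
        }
        return expected;
    }

    std::atomic<state> _state{state::uninit};
    state (*_initial)();
};
```

After the first call, `resolved_state()` never returns `uninit`, which is consistent with the later note that the `uninit` switch case is unreachable.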

base/tests: rewrite vlog_filter_test to state enum schema

Mechanical translation of bool enabled() and rule{.enabled=...}
to resolved_state() and rule{.state=...}. Semantics preserved:
the prior 'enabled = false' is 'state = force_off', the prior
'enabled = true' used as a later-rule override is 'state = default_'.
Test coverage for force_on and for state=default_ as a carve-out
follows in a separate commit.

base/tests: cover force_on, force_off, default_ rule states

Drives vlog(...) against a captured seastar logger with the level
set to warn, verifying force_on emits a trace line and force_off
suppresses it regardless. A final test exercises apply_rules
ordering across all three states against a static callsite whose
resolved state is observed directly.

vlog / vlogl / vloglr macros switch on the callsite's resolved
state: default_ keeps today's path, force_on routes through the
seastar force_tag overloads to bypass the level gate, force_off
short-circuits before argument evaluation. The uninit case is
unreachable after resolved_state()'s slow_init — listed only to
satisfy exhaustive-switch warnings.

Wrapper logger types that are used with vlog (raft::ctx_log,
prefix_logger, kafka group::ctx_log, connection_context::ctx_log,
basic_retry_chain_logger) each receive matching force_tag overloads
so that the force_on path compiles throughout the codebase. The
one callsite that used an ad-hoc lambda as the method argument is
converted to vloglr, which carries force semantics natively.

The vlogl and vloglr macro parameters are renamed to _vlog_lgr_ to
avoid token collision with the "logger" in ::seastar::logger::force
during macro expansion.
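
The three-way dispatch can be illustrated with a toy macro. Everything here is hypothetical (`toy_logger`, `TOY_VLOG`, the parameter names); it only demonstrates the property the commit message relies on: in the force_off branch the arguments are never evaluated, because macro arguments are substituted textually and that branch does not mention them.

```cpp
enum class state { default_, force_on, force_off };

// Hypothetical toy logger: `level_enabled` models the global level gate.
struct toy_logger {
    bool level_enabled = false;
    int emitted = 0;

    // Normal path: subject to the level gate.
    void log(int /*value*/) {
        if (level_enabled) ++emitted;
    }
    // Forced path: bypasses the level gate.
    void log_forced(int /*value*/) { ++emitted; }
};

// Parameter names carry a toy_ prefix and trailing underscore so they are
// unlikely to collide with tokens in the expansion, in the same spirit as
// the rename described in the commit message.
#define TOY_VLOG(toy_lgr_, toy_state_, ...)                                    \
    do {                                                                       \
        switch (toy_state_) {                                                  \
        case state::default_:                                                  \
            (toy_lgr_).log(__VA_ARGS__);                                       \
            break;                                                             \
        case state::force_on:                                                  \
            (toy_lgr_).log_forced(__VA_ARGS__);                                \
            break;                                                             \
        case state::force_off:                                                 \
            break; /* __VA_ARGS__ is not expanded here: no evaluation */       \
        }                                                                      \
    } while (0)
```

Passing a call with a side effect as the argument makes the difference observable: with `state::force_off` the side effect does not happen, with `state::force_on` it happens once and the line is emitted even though the toy level gate is closed.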

utils/truncating_logger: add force_tag overloads

Follow-up to the vlog 3-way dispatch change. truncating_logger
was not in the set of wrapper loggers updated in the preceding
commit, but kafka::kwire (a truncating_logger) is passed to vlog
macros so its force_on branch requires a force_tag overload on
each convenience method. Mirrors the pattern used for the other
wrapper loggers.

The force path pre-formats into a fmt::memory_buffer with the
same truncation logic as the normal path, then emits via
_logger.log(lvl, force, ...) to bypass the level gate.
SEASTAR_LOGGER_COMPILE_TIME_FMT is handled via #ifdef since the
two seastar code paths differ in how runtime strings are passed
to format_info.

proto/admin: log filter gains LogFilterState enum

Replaces the bool enabled field on LogFilterRule and
LogCallsiteInfo with a LogFilterState enum carrying
INHERITED/FORCE_ON/FORCE_OFF. The UNSPECIFIED = 0 proto3 default is
rejected by SetLogFilter and never returned by ListLogCallsites.
LOG_FILTER_STATE_DEFAULT was renamed to LOG_FILTER_STATE_INHERITED
because the protobuf C++ generator strips the enum name prefix,
producing a symbol named 'default' which is a reserved keyword.
Branch-local schema break; no clients exist yet.

admin/log_filter: translate LogFilterState enum both ways

SetLogFilter rejects LOG_FILTER_STATE_UNSPECIFIED with
INVALID_ARGUMENT and refuses any unknown enum value.
ListLogCallsites reports the resolved callsite state via
to_wire_state. The proto enum's INHERITED maps to the C++
callsite_base::state::default_ (keyword-collision workaround
from the preceding proto commit).
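
A sketch of the two-way translation described above, with rejection of the zero value and of unknown values. The enums are hypothetical mirrors: the numeric wire values are assumed, `from_wire_state` is an invented name, and only `to_wire_state` is named in the commit message (its signature here is a guess). Mapping `uninit` to inherited is likewise an assumption.

```cpp
#include <optional>

// Hypothetical mirror of the proto enum; numeric values are assumed.
enum class wire_state : int {
    unspecified = 0,
    inherited = 1,
    force_on = 2,
    force_off = 3,
};

// Hypothetical mirror of the C++ callsite state.
enum class callsite_state { uninit, default_, force_on, force_off };

// Returns nullopt for UNSPECIFIED and for any unknown value; a handler
// would turn that into an INVALID_ARGUMENT response.
std::optional<callsite_state> from_wire_state(int raw) {
    switch (static_cast<wire_state>(raw)) {
    case wire_state::inherited:
        return callsite_state::default_;
    case wire_state::force_on:
        return callsite_state::force_on;
    case wire_state::force_off:
        return callsite_state::force_off;
    case wire_state::unspecified:
        break;
    }
    return std::nullopt;
}

// Never yields `unspecified`; `uninit` is folded into inherited here.
wire_state to_wire_state(callsite_state s) {
    switch (s) {
    case callsite_state::force_on:
        return wire_state::force_on;
    case callsite_state::force_off:
        return wire_state::force_off;
    case callsite_state::default_:
    case callsite_state::uninit:
        break;
    }
    return wire_state::inherited;
}
```

Returning `std::optional` keeps the rejection decision at the RPC boundary, so the rule engine never sees an unspecified state.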
seastar::logger::set_ostream(_captured);
seastar::logger::set_ostream_enabled(true);
}
~logger_capture() { seastar::logger::set_ostream(std::cerr); }
Member

This seems like it would be wrong if the output stream wasn't cerr (e.g., because the user passed some command-line args which changed it).

Perhaps the set_ostream should return the old stream so it can be restored? Or are there lifetime issues?

Alternately maybe we need a reset_ostream which sets the stream back to the default.
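
To illustrate the first suggestion, here is a sketch of a capture guard that restores whatever stream was installed before it, assuming a setter that returns the previous stream. `toy_logger` is a hypothetical facade written for this example; it is not seastar::logger, whose API (per this thread) would need a change to support this.

```cpp
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical logger facade whose setter returns the previous stream.
class toy_logger {
public:
    static std::ostream* set_ostream(std::ostream& out) {
        std::ostream* prev = _out;
        _out = &out;
        return prev;
    }
    static std::ostream& current() { return *_out; }

private:
    static inline std::ostream* _out = &std::cerr;
};

// RAII guard: redirects output into a buffer for its lifetime, then puts
// back the stream that was active before, whatever it was.
class logger_capture {
public:
    logger_capture()
      : _previous(toy_logger::set_ostream(_captured)) {}
    ~logger_capture() { toy_logger::set_ostream(*_previous); }

    logger_capture(const logger_capture&) = delete;
    logger_capture& operator=(const logger_capture&) = delete;

    std::string str() const { return _captured.str(); }

private:
    std::ostringstream _captured; // declared first so it outlives the swap
    std::ostream* _previous;
};
```

The lifetime concern raised above still applies: the guard only stores a pointer, so the previously installed stream must outlive the guard.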

Contributor Author

Yeah agreed. Unfortunately that will take some API changes to seastar... the tests are nice to have though, so wondering which direction you think we should go here (straight to upstream seastar changes, or changes to our fork)

Contributor Author

Otherwise, seastar::logger::set_ostream_enabled(false) might be okay, since the logging would be limited to this binary.

4 participants