feat: eBPF event deduplication before CEL rule evaluation#762
feat: eBPF event deduplication before CEL rule evaluation#762
Conversation
Add a lock-free dedup cache that prevents structurally identical eBPF events from reaching the expensive CEL rule engine. Events are keyed by type-specific fields (mntns + pid + relevant attributes) using xxhash, with per-event-type TTL windows (2-10s). The cache uses packed atomic uint64 slots (48-bit key + 16-bit expiry bucket) for zero-lock concurrent access from the 3,000-goroutine worker pool. Consumers opt in to skipping duplicates: RuleManager, ContainerProfileManager, and MalwareManager skip; Metrics, DNSManager, NetworkStream, and RulePolicy always process. No events are dropped — the Duplicate flag is advisory. Benchmarks: cache check ~7ns/op, key computation 24-52ns/op, 0 allocations. Implements design/node-agent-performance-epic/ebpf-event-deduplication.md §1.3 (targets 10% of the 20% CPU reduction goal). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an optional event deduplication subsystem: a lock-free DedupCache, xxhash-based key generators, integration into the containerwatcher pipeline (marking duplicates and skipping selected handlers), config + metrics additions, tests/benchmarks, docs, and CI benchmark workflows. Max 50 words. Changes
Sequence DiagramsequenceDiagram
participant EventSrc as Event Source
participant CWatcher as Container Watcher
participant EHandler as Event Handler Factory
participant DCache as Dedup Cache
participant Metrics as Metrics Manager
participant RuleMgr as Rule Manager
EventSrc->>CWatcher: Emit raw event
CWatcher->>CWatcher: Enrich event, set DedupBucket
CWatcher->>EHandler: enrichAndProcess(enrichedEvent)
EHandler->>EHandler: Compute dedup key & TTL
EHandler->>DCache: CheckAndSet(key, ttlBuckets, currentBucket)
alt Duplicate
DCache-->>EHandler: true
EHandler->>EHandler: enrichedEvent.Duplicate = true
EHandler->>Metrics: ReportDedupEvent(type, duplicate=true)
else Not duplicate
DCache-->>EHandler: false
EHandler->>Metrics: ReportDedupEvent(type, duplicate=false)
end
EHandler->>RuleMgr: ReportEnrichedEvent(enrichedEvent)
alt enrichedEvent.Duplicate == true
RuleMgr-->>EHandler: return (skip processing)
else
RuleMgr-->>EHandler: continue (enrich, evaluate, export)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
pkg/metricsmanager/metrics_manager_mock.go (1)
67-67: Consider recording dedup results in the mock.This keeps the interface satisfied, but it also makes the new metric path unassertable in unit tests. Tracking duplicate/passed counts here would keep dedup observability testable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metricsmanager/metrics_manager_mock.go` at line 67, The MetricsMock.ReportDedupEvent stub should record dedup outcomes so tests can assert on them: update the MetricsMock struct to include fields (e.g., DedupDuplicateCount, DedupPassedCount and/or a DedupEvents slice) and a sync.Mutex for thread-safety, then implement ReportDedupEvent(eventType utils.EventType, duplicate bool) to lock, increment the appropriate counter (or append a record with eventType and duplicate), and unlock; expose simple getter/accessor methods or make the fields public so unit tests can inspect counts/events.pkg/metricsmanager/prometheus/prometheus.go (1)
395-400: Avoid per-event label-map allocations here.
ReportDedupEventruns on the dedup hot path, butWith(prometheus.Labels{...})creates a fresh map on every call. That adds avoidable allocation/GC pressure right where this PR is trying to save CPU.WithLabelValues(string(eventType), result)or a tiny cached lookup would keep this path much cheaper.♻️ Minimal change
- p.dedupEventCounter.With(prometheus.Labels{eventTypeLabel: string(eventType), "result": result}).Inc() + p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/metricsmanager/prometheus/prometheus.go` around lines 395 - 400, ReportDedupEvent currently allocates a new map on every hot-path call by using p.dedupEventCounter.With(prometheus.Labels{...}), causing GC/alloc pressure; change it to use the zero-allocation variant WithLabelValues(string(eventType), result) or implement a small cached map/observer keyed by eventType/result to avoid per-call map allocations - update the ReportDedupEvent function to call p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc() (or use a precomputed label handle cache) and keep references to dedupEventCounter and eventTypeLabel names to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/config/config.go`:
- Around line 28-32: The EventDedupConfig.SlotsExponent is used unchecked when
creating the dedup cache and can cause an overflow/panic in
dedupcache.NewDedupCache; update LoadConfig to validate SlotsExponent (e.g.,
ensure 0 <= SlotsExponent <= 31) before using it, returning an error if out of
range. Locate the struct EventDedupConfig and the code path in LoadConfig (and
any other places that call dedupcache.NewDedupCache) and add explicit bounds
checking for SlotsExponent, with a clear error message referencing the invalid
value, so callers cannot pass an exponent >31 (or your chosen safe max) into
dedupcache.NewDedupCache.
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Line 473: The DedupBucket is being recomputed per event which can change
mid-batch; compute the bucket once for the entire queue batch and reuse it. Move
the calculation (uint16(time.Now().UnixNano() / (64 * 1_000_000))) out of the
per-event code and store it in a local variable (e.g., dedupBucket) at the start
of the batch processing routine, then assign enrichedEvent.DedupBucket =
dedupBucket for each event; ensure the calculation and assignment use the same
expression so behavior is unchanged except now stable across the batch.
In `@pkg/containerwatcher/v2/event_handler_factory.go`:
- Around line 193-196: The dedupSkipSet currently contains
containerProfileAdapter (and malwareManager), which causes ReportDroppedEvent
calls within containerProfileAdapter to be skipped for duplicate events; modify
the event flow so dropped-event reporting is always executed before
deduplication: extract the dropped-event handling path (calls to
ReportDroppedEvent and checks of enrichedEvent.HasDroppedEvents) out of
containerProfileAdapter and invoke it unconditionally prior to evaluating
factory.dedupSkipSet, or alternatively call the adapter's ReportDroppedEvent
logic explicitly before the duplicate-skip check; ensure the same change is
applied to the second occurrence referenced around the other dedupSkipSet
insertion so kernel-drop notifications are never suppressed.
In `@pkg/dedupcache/dedup_cache.go`:
- Around line 36-46: The expiry comparison in DedupCache.CheckAndSet uses a
wrapping uint16 bucket which breaks across the 65535→0 rollover; change the
expiry to a non-wrapping time base by extending the stored expiry field to a
larger integer (e.g., use a uint32/uint64 expiry inside the packed value),
update pack and unpack to encode/decode that wider expiry, ensure CheckAndSet
stores pack(key, uint64(currentBucket)+uint64(ttlBuckets)) and compares
storedExpiry > uint64(currentBucket) using the wider type, and add a regression
test that simulates buckets across the 65535→0 boundary to verify dedup
behavior; alternatively, if you prefer epoching, implement explicit rollover
detection in CheckAndSet that clears/epochs slots when currentBucket <
lastSeenBucket and add the same wrap test.
- Around line 16-21: NewDedupCache currently computes size := uint64(1) <<
slotsExponent without validating slotsExponent which can yield size==0 (when
slotsExponent>=64) or unreasonably large sizes (e.g., >30) causing panics or
huge allocations; add a bounds check at the start of NewDedupCache to reject
invalid exponents (e.g., require slotsExponent > 0 and <= 30 or another
project-safe maximum), and return an error or panic with a clear message instead
of constructing a zero/huge slice; ensure the check prevents creating a
DedupCache with zero-length slots and protects callers of CheckAndSet from
out-of-range accesses.
In `@pkg/dedupcache/keys.go`:
- Around line 41-48: ComputeNetworkKey (and the other key-builder helpers that
hash multiple variable-length strings such as the capability+syscall builder,
the host+path+rawQuery builder, and the oldPath+newPath builder) currently write
adjacent strings directly into the xxhash stream which can cause ambiguous
collisions; update each function (starting with ComputeNetworkKey) to prefix
every string component with its length (or another unambiguous delimiter) before
calling h.WriteString so the concatenation is unambiguous—e.g., call
writeUint32/Uint16/Uint64 for the string length then write the string itself for
each variable-length field; apply the same change to all similar helpers
mentioned in the comment so every multi-string hash is length-delimited.
---
Nitpick comments:
In `@pkg/metricsmanager/metrics_manager_mock.go`:
- Line 67: The MetricsMock.ReportDedupEvent stub should record dedup outcomes so
tests can assert on them: update the MetricsMock struct to include fields (e.g.,
DedupDuplicateCount, DedupPassedCount and/or a DedupEvents slice) and a
sync.Mutex for thread-safety, then implement ReportDedupEvent(eventType
utils.EventType, duplicate bool) to lock, increment the appropriate counter (or
append a record with eventType and duplicate), and unlock; expose simple
getter/accessor methods or make the fields public so unit tests can inspect
counts/events.
In `@pkg/metricsmanager/prometheus/prometheus.go`:
- Around line 395-400: ReportDedupEvent currently allocates a new map on every
hot-path call by using p.dedupEventCounter.With(prometheus.Labels{...}), causing
GC/alloc pressure; change it to use the zero-allocation variant
WithLabelValues(string(eventType), result) or implement a small cached
map/observer keyed by eventType/result to avoid per-call map allocations -
update the ReportDedupEvent function to call
p.dedupEventCounter.WithLabelValues(string(eventType), result).Inc() (or use a
precomputed label handle cache) and keep references to dedupEventCounter and
eventTypeLabel names to locate the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d640beeb-a69e-4b2d-9fa9-a9145caec339
📒 Files selected for processing (14)
go.modpkg/config/config.gopkg/config/config_test.gopkg/containerwatcher/v2/container_watcher.gopkg/containerwatcher/v2/event_handler_factory.gopkg/dedupcache/dedup_cache.gopkg/dedupcache/dedup_cache_test.gopkg/dedupcache/keys.gopkg/dedupcache/keys_test.gopkg/ebpf/events/enriched_event.gopkg/metricsmanager/metrics_manager_interface.gopkg/metricsmanager/metrics_manager_mock.gopkg/metricsmanager/prometheus/prometheus.gopkg/rulemanager/rule_manager.go
Benchmark comparing v0.3.71 (baseline) vs dedup branch on a kind cluster with 1000 open/s, 100 http/s load. Results show -16% avg CPU, -29% peak CPU, with 91-99% dedup ratios on high-frequency event types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/dedup-testing/benchmark-results.md`:
- Line 87: Replace the misspelled directory name "perfornamce" with the correct
spelling "performance" in the markdown line that references the directory (the
string "perfornamce" in the diff); update any other occurrences of the same typo
in the document (search for "perfornamce") so all directory references use
"performance".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4a3b9631-0ff1-4184-a380-896c0145c38b
⛔ Files ignored due to path filters (4)
docs/dedup-testing/after_cpu_usage.pngis excluded by!**/*.pngdocs/dedup-testing/after_memory_usage.pngis excluded by!**/*.pngdocs/dedup-testing/before_cpu_usage.pngis excluded by!**/*.pngdocs/dedup-testing/before_memory_usage.pngis excluded by!**/*.png
📒 Files selected for processing (1)
docs/dedup-testing/benchmark-results.md
Add benchmark scripts and CI workflow that runs A/B performance benchmarks on Kind clusters, comparing baseline vs PR node-agent images. Posts results as PR comments and uploads artifacts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
.github/workflows/incluster-comp-pr-merged.yaml (1)
339-345: Benchmark doesn't gate release promotion.
create-release-and-retagstill does not depend on this new job, so the image can be retagged before the benchmark finishes or even if it fails. If this prerelease benchmark is supposed to validate the image before promotion, addbenchmarkto the downstreamneeds/condition.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/incluster-comp-pr-merged.yaml around lines 339 - 345, The benchmark job is not gating release promotion: ensure the downstream job that performs promotion (create-release-and-retag) depends on the new benchmark job and only runs when it succeeds; update the workflow so the create-release-and-retag job's needs includes "benchmark" (or add a conditional that checks success of needs.benchmark) so the image cannot be retagged before or if the benchmark fails.benchmark/dedup-bench.sh (1)
64-72: Use the pinned benchmark requirements here.
benchmark/requirements.txtalready pins the Python stack, but local runs bypass it and install unpinnedpandas matplotlib requestsinstead. That lets the same benchmark run against different dependency versions locally vs CI, and the existing.venvis never refreshed when the pins change.Possible fix
# Set up Python venv with deps local venv_dir="${SCRIPT_DIR}/.venv" if [[ ! -d "$venv_dir" ]]; then - log "Creating Python venv and installing dependencies..." + log "Creating Python venv..." python3 -m venv "$venv_dir" - "$venv_dir/bin/pip" install pandas matplotlib requests fi + log "Installing benchmark Python dependencies..." + "$venv_dir/bin/pip" install -r "$SCRIPT_DIR/requirements.txt" # Use the venv python for the rest of the script export PATH="${venv_dir}/bin:${PATH}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/dedup-bench.sh` around lines 64 - 72, The script creates a venv in venv_dir but installs unpinned packages; change the install step to install from the pinned requirements file (benchmark/requirements.txt) instead of hardcoded packages and ensure pip runs with upgrade to pick up changed pins (replace the "$venv_dir/bin/pip" install pandas matplotlib requests call with an install from "${SCRIPT_DIR}/requirements.txt" using --upgrade/--requirement). Keep the rest of the venv handling (export PATH with venv_dir) the same so local runs match CI and pinned deps are applied when the venv is created or refreshed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/benchmark.yaml:
- Around line 38-46: The workflow currently downloads kind from
https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64 and pipes Helm's installer
from https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
directly in the "Install Kind" and "Install Helm" steps; replace these with
pinned, verifiable installs: for the "Install Kind" step either use a maintained
setup action (e.g., actions/setup-kind@vX) or download the tagged release asset
and verify it against a trusted SHA256 checksum before installing; for the
"Install Helm" step use a pinned setup action (e.g., azure/setup-helm@vX or
another official action) or call the Helm install script from a specific release
tag (not main) and verify its checksum/signature, ensuring both steps log
failures and do not pipe unverified scripts into bash.
- Around line 113-119: The workflow uses peter-evans/create-or-update-comment@v4
with an unsupported input `comment-tag`, causing new comments each run; remove
`comment-tag` and implement a sticky-comment flow by first using an action like
peter-evans/find-comment to search for the existing comment via `body-includes`
(e.g., matching text from `report.md`), capture its `comment-id`, then pass that
`comment-id` into create-or-update-comment (or use `edit-mode`) along with
`body-path: report.md` and `issue-number` to update the existing comment;
alternatively, switch the job trigger to `pull_request_target` to avoid fork PR
permission issues as documented in the create-or-update-comment README.
In `@benchmark/compare-metrics.py`:
- Around line 37-42: compute_resource_stats currently filters node-agent rows
but computes avg/peak across pods (per-pod), undercounting total footprint on
multi-agent clusters; change it to first aggregate Value across node-agent pods
per timestamp (e.g., group the filtered na DataFrame by the timestamp column,
such as "Timestamp", and sum the "Value" for each timestamp) and then return
{"avg": summed_values.mean(), "peak": summed_values.max()} while keeping the
initial filter (na = df[df["Pod"].str.contains("node-agent", na=False)]) and the
same return shape.
In `@benchmark/dedup-bench.sh`:
- Around line 338-345: The loop over queries writes Prometheus responses without
checking HTTP or Prometheus-level errors; update the block that performs
requests.get (inside the for name, query in queries.items() loop) to call
resp.raise_for_status() after the request and then parse resp.json() into a
variable and verify it contains "status" == "success" before writing the file to
output_dir; if either the HTTP status or the Prometheus "status" is an error,
print a clear error including name and response content and fail the benchmark
run (e.g., raise an exception or call sys.exit(1)) instead of silently
continuing.
- Around line 267-269: The remove_load_simulator() function currently deletes
the load-simulator namespace with --wait=false which returns immediately and
causes the after phase to recreate the namespace while it may still be
Terminating; change remove_load_simulator() to wait for the namespace to be
fully deleted before returning (e.g., use kubectl wait --for=delete
namespace/load-simulator or poll kubectl get namespace load-simulator until it
no longer exists) so subsequent apply steps (the recreate/apply logic) cannot
run against a terminating namespace.
In `@benchmark/load-simulator/daemonset.yaml`:
- Around line 15-16: The load-simulator container image is not pinned (image
field under the container named "load-simulator"), causing implicit :latest
resolution and potential benchmark drift; update the image reference in the
daemonset to an immutable reference by replacing
quay.io/benarmosec/load-simulator-f866d4884d08e4a0d1907b094978b48d with either a
specific versioned tag (e.g., :v1.2.3) or, preferably, a content digest
(quay.io/benarmosec/load-simulator@sha256:<digest>) so deployments from
benchmark/dedup-bench.sh are reproducible.
---
Nitpick comments:
In @.github/workflows/incluster-comp-pr-merged.yaml:
- Around line 339-345: The benchmark job is not gating release promotion: ensure
the downstream job that performs promotion (create-release-and-retag) depends on
the new benchmark job and only runs when it succeeds; update the workflow so the
create-release-and-retag job's needs includes "benchmark" (or add a conditional
that checks success of needs.benchmark) so the image cannot be retagged before
or if the benchmark fails.
In `@benchmark/dedup-bench.sh`:
- Around line 64-72: The script creates a venv in venv_dir but installs unpinned
packages; change the install step to install from the pinned requirements file
(benchmark/requirements.txt) instead of hardcoded packages and ensure pip runs
with upgrade to pick up changed pins (replace the "$venv_dir/bin/pip" install
pandas matplotlib requests call with an install from
"${SCRIPT_DIR}/requirements.txt" using --upgrade/--requirement). Keep the rest
of the venv handling (export PATH with venv_dir) the same so local runs match CI
and pinned deps are applied when the venv is created or refreshed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a349e24d-9f4e-433d-8cd5-a515c1a65c9b
📒 Files selected for processing (7)
.github/workflows/benchmark.yaml.github/workflows/incluster-comp-pr-merged.yamlbenchmark/compare-metrics.pybenchmark/dedup-bench.shbenchmark/get_data_from_prometheus.pybenchmark/load-simulator/daemonset.yamlbenchmark/requirements.txt
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
- Validate slotsExponent range [10,30] in NewDedupCache and LoadConfig - Fix uint16 expiry wrap-around using signed subtraction - Length-prefix strings in hash keys to prevent adjacent-field collisions - Extract dropped-event reporting from dedup-skippable adapter - Guard req.URL nil dereference in HTTP dedup key computation - Use WithLabelValues for dedup metrics (avoids map alloc on hot path) - Fix benchmark: use requirements.txt, add resp.raise_for_status(), fix typo Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
benchmark/dedup-bench.sh (2)
267-270:⚠️ Potential issue | 🟠 MajorWait for namespace deletion before redeploying load simulator.
Line 269 uses
--wait=false, but the after phase redeploy starts immediately (Line 419). This can race against aTerminatingnamespace and make benchmark runs flaky.Suggested fix
remove_load_simulator() { log "Removing load simulator..." - kubectl delete namespace load-simulator --wait=false 2>/dev/null || true + kubectl delete namespace load-simulator --wait=true --timeout=300s 2>/dev/null || true }Also applies to: 413-419
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/dedup-bench.sh` around lines 267 - 270, The remove_load_simulator() function currently deletes the load-simulator namespace with --wait=false which can race with the subsequent redeploy; change it to wait for the namespace to be fully deleted (either remove --wait=false or use --wait=true / kubectl wait --for=delete namespace/load-simulator --timeout=<reasonable>) so the namespace is confirmed gone before deploy_load_simulator runs; update remove_load_simulator() and ensure any after-phase logic that calls deploy_load_simulator waits for that deletion to complete.
338-350:⚠️ Potential issue | 🟠 MajorFail the benchmark on Prometheus query errors (and fix quote-unsafe status access).
This block still degrades to warnings and continues, so bad/partial metrics can produce misleading comparisons. Also, Line 344 uses a quote pattern that is shell-fragile in a
python3 -c "..."block.Suggested fix
for name, query in queries.items(): - try: - resp = requests.get(f'{url}/api/v1/query', params={'query': query, 'time': end.isoformat()}, timeout=30) - resp.raise_for_status() - data = resp.json() - if data.get('status') != 'success': - print(f'Warning: {name}: Prometheus returned status={data.get("status")}') - with open(os.path.join(output_dir, f'{name}.json'), 'w') as f: - json.dump(data, f, indent=2) - print(f'Collected {name}') - except Exception as e: - print(f'Warning: {name}: {e}') + resp = requests.get( + f'{url}/api/v1/query', + params={'query': query, 'time': end.isoformat()}, + timeout=30, + ) + resp.raise_for_status() + data = resp.json() + if data.get('status') != 'success': + raise RuntimeError( + f"{name}: Prometheus status={data.get('status')} body={resp.text[:500]}" + ) + with open(os.path.join(output_dir, f'{name}.json'), 'w') as f: + json.dump(data, f, indent=2) + print(f'Collected {name}')#!/bin/bash set -euo pipefail # Inspect the exact block under review sed -n '338,350p' benchmark/dedup-bench.sh # Confirm fail-open warning paths currently exist rg -n "Prometheus returned status|Warning: \{name\}" benchmark/dedup-bench.sh # Validate quote fragility warning if shellcheck exists in the environment if command -v shellcheck >/dev/null 2>&1; then shellcheck -s bash benchmark/dedup-bench.sh | rg -n 'SC2140|344' || true fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmark/dedup-bench.sh` around lines 338 - 350, The loop that queries Prometheus (variables name, query, url, end and the requests.get call) should fail the whole benchmark on any query error or non-success response instead of just printing warnings: when resp.raise_for_status() or data.get("status") != "success" occurs, print an error and exit non‑zero (e.g., exit 1) so bad/partial metrics abort the run; also fix the quote-unsafe access by using double quotes for the JSON key (data.get("status")) so the Python snippet is safe inside a shell-wrapped python3 -c "...".
🧹 Nitpick comments (1)
pkg/containerwatcher/v2/event_handler_factory.go (1)
213-216: Consider extracting PID extraction into a helper.The pattern for extracting PID via type assertion to
utils.EnrichEventis repeated 8 times throughout this function. A small helper could reduce duplication:func extractPID(event utils.K8sEvent) uint32 { if ee, ok := event.(utils.EnrichEvent); ok { return ee.GetPID() } return 0 }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/containerwatcher/v2/event_handler_factory.go` around lines 213 - 216, The PID extraction via type assertion to utils.EnrichEvent is duplicated multiple times in event_handler_factory.go; create a small helper function (e.g., extractPID(event utils.K8sEvent) uint32) that performs the assertion and returns ee.GetPID() or 0, then replace all repeated blocks that check "if ee, ok := event.(utils.EnrichEvent); ok { pid = ee.GetPID() }" with calls to extractPID(event); ensure the helper is visible to the functions that need it and keep the exact types and method names (utils.EnrichEvent, utils.K8sEvent, GetPID) referenced so behavior is unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@benchmark/dedup-bench.sh`:
- Around line 267-270: The remove_load_simulator() function currently deletes
the load-simulator namespace with --wait=false which can race with the
subsequent redeploy; change it to wait for the namespace to be fully deleted
(either remove --wait=false or use --wait=true / kubectl wait --for=delete
namespace/load-simulator --timeout=<reasonable>) so the namespace is confirmed
gone before deploy_load_simulator runs; update remove_load_simulator() and
ensure any after-phase logic that calls deploy_load_simulator waits for that
deletion to complete.
- Around line 338-350: The loop that queries Prometheus (variables name, query,
url, end and the requests.get call) should fail the whole benchmark on any query
error or non-success response instead of just printing warnings: when
resp.raise_for_status() or data.get("status") != "success" occurs, print an
error and exit non‑zero (e.g., exit 1) so bad/partial metrics abort the run;
also fix the quote-unsafe access by using double quotes for the JSON key
(data.get("status")) so the Python snippet is safe inside a shell-wrapped
python3 -c "...".
---
Nitpick comments:
In `@pkg/containerwatcher/v2/event_handler_factory.go`:
- Around line 213-216: The PID extraction via type assertion to
utils.EnrichEvent is duplicated multiple times in event_handler_factory.go;
create a small helper function (e.g., extractPID(event utils.K8sEvent) uint32)
that performs the assertion and returns ee.GetPID() or 0, then replace all
repeated blocks that check "if ee, ok := event.(utils.EnrichEvent); ok { pid =
ee.GetPID() }" with calls to extractPID(event); ensure the helper is visible to
the functions that need it and keep the exact types and method names
(utils.EnrichEvent, utils.K8sEvent, GetPID) referenced so behavior is unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5b6e5340-9d03-4c58-98f9-3bf51f30e6ed
📒 Files selected for processing (9)
benchmark/dedup-bench.shdocs/dedup-testing/benchmark-results.mdpkg/config/config.gopkg/containerwatcher/v2/event_handler_factory.gopkg/dedupcache/dedup_cache.gopkg/dedupcache/dedup_cache_test.gopkg/dedupcache/keys.gopkg/dedupcache/keys_test.gopkg/metricsmanager/prometheus/prometheus.go
✅ Files skipped from review due to trivial changes (1)
- docs/dedup-testing/benchmark-results.md
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/dedupcache/keys_test.go
- pkg/metricsmanager/prometheus/prometheus.go
- pkg/dedupcache/dedup_cache.go
- pkg/dedupcache/dedup_cache_test.go
- pkg/dedupcache/keys.go
Integrate pprof.Do event labeling from main with dedup skip logic from feature branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben <ben@armosec.io>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/containerwatcher/v2/container_watcher.go (1)
454-463:⚠️ Potential issue | 🟠 MajorKeep one stable dedup bucket per batch.
Line 470 still recomputes
DedupBucketper event. If batch drain crosses a 64ms boundary, same-batch duplicates can miss dedup.🛠️ Suggested fix
func (cw *ContainerWatcher) processQueueBatch() { batchSize := cw.cfg.EventBatchSize processedCount := 0 + dedupBucket := uint16(time.Now().UnixNano() / (64 * 1_000_000)) for !cw.orderedEventQueue.Empty() && processedCount < batchSize { event, ok := cw.orderedEventQueue.PopEvent() if !ok { break } - cw.enrichAndProcess(event) + cw.enrichAndProcess(event, dedupBucket) processedCount++ } } -func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry) { +func (cw *ContainerWatcher) enrichAndProcess(entry EventEntry, dedupBucket uint16) { enrichedEvent := cw.eventEnricher.EnrichEvents(entry) - enrichedEvent.DedupBucket = uint16(time.Now().UnixNano() / (64 * 1_000_000)) + enrichedEvent.DedupBucket = dedupBucketAlso applies to: 468-470
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/containerwatcher/v2/container_watcher.go` around lines 454 - 463, In processQueueBatch, avoid recomputing the DedupBucket per event (which currently happens inside the loop) by computing a single stable dedup bucket for the entire batch before draining events; compute DedupBucket once (e.g., dedupBucket := generateDedupBucket() or cw.newDedupBucket()) prior to the for loop in ContainerWatcher.processQueueBatch and pass that stable bucket into enrichAndProcess (or attach it to the event) instead of letting enrichAndProcess derive it per event, so duplicates within the same batch that cross the 64ms boundary are consistently deduplicated.
🧹 Nitpick comments (1)
pkg/config/config.go (1)
255-259: Avoid duplicated dedup exponent bounds across packages.
10and30are duplicated here and inpkg/dedupcache/dedup_cache.go; consider sharing constants to prevent future drift.♻️ Minimal refactor direction
- if config.EventDedup.Enabled && (config.EventDedup.SlotsExponent < 10 || config.EventDedup.SlotsExponent > 30) { + if config.EventDedup.Enabled && (config.EventDedup.SlotsExponent < dedupcache.MinSlotsExponent || config.EventDedup.SlotsExponent > dedupcache.MaxSlotsExponent) {// pkg/dedupcache/dedup_cache.go const ( MinSlotsExponent uint8 = 10 MaxSlotsExponent uint8 = 30 defaultExponent uint8 = 18 )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/config/config.go` around lines 255 - 259, The numeric bounds for EventDedup.SlotsExponent are duplicated; replace the hardcoded 10 and 30 in the config validation with shared constants from the dedup cache package (e.g. use pkg/dedupcache.MinSlotsExponent and MaxSlotsExponent), update types if needed (cast or change to matching uint8/int), and remove the literal values so both pkg/config (validation in the block checking config.EventDedup.SlotsExponent) and pkg/dedupcache/dedup_cache.go reference the same exported constants to avoid drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/containerwatcher/v2/container_watcher.go`:
- Around line 454-463: In processQueueBatch, avoid recomputing the DedupBucket
per event (which currently happens inside the loop) by computing a single stable
dedup bucket for the entire batch before draining events; compute DedupBucket
once (e.g., dedupBucket := generateDedupBucket() or cw.newDedupBucket()) prior
to the for loop in ContainerWatcher.processQueueBatch and pass that stable
bucket into enrichAndProcess (or attach it to the event) instead of letting
enrichAndProcess derive it per event, so duplicates within the same batch that
cross the 64ms boundary are consistently deduplicated.
---
Nitpick comments:
In `@pkg/config/config.go`:
- Around line 255-259: The numeric bounds for EventDedup.SlotsExponent are
duplicated; replace the hardcoded 10 and 30 in the config validation with shared
constants from the dedup cache package (e.g. use pkg/dedupcache.MinSlotsExponent
and MaxSlotsExponent), update types if needed (cast or change to matching
uint8/int), and remove the literal values so both pkg/config (validation in the
block checking config.EventDedup.SlotsExponent) and
pkg/dedupcache/dedup_cache.go reference the same exported constants to avoid
drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d638d2cd-c1cb-4ec2-8b88-d351ed07235f
📒 Files selected for processing (5)
go.modpkg/config/config.gopkg/containerwatcher/v2/container_watcher.gopkg/containerwatcher/v2/event_handler_factory.gopkg/rulemanager/rule_manager.go
✅ Files skipped from review due to trivial changes (2)
- pkg/rulemanager/rule_manager.go
- go.mod
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/containerwatcher/v2/event_handler_factory.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
matthyx
left a comment
There was a problem hiding this comment.
- seems very good and well engineered
- why did we stop checking for dropped events?
- let's check the Rabbit comment in container_watcher.go and use an official load simulator image
- benchmark: use --wait=true for namespace deletion to prevent race where the "after" run tries to create resources in a terminating namespace - dedup: remove containerProfileAdapter from dedupSkipSet so the profile builder sees all events (fixes Test_11_EndpointTest regression where HTTP endpoint headers from repeated requests were lost) - benchmark: use official load-simulator image with digest pin Signed-off-by: Ben <ben@armosec.io>
…dupSigned-off-by: Ben <ben@armosec.io>
MetricsNoop was added to main but was missing the ReportDedupEvent method added by the dedup feature branch, causing build failures. Signed-off-by: Ben <ben@armosec.io>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Pull the load-simulator image with docker and load into kind before deploying, so it doesn't depend on in-cluster pulls from quay.io. Signed-off-by: Ben <ben@armosec.io>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Add --check flag to compare-metrics.py that fails with exit code 1 if any node-agent CPU or memory metric degrades by more than 10% compared to the baseline. Added as a workflow step that runs after the benchmark completes. Signed-off-by: Ben <ben@armosec.io>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Summary
pkg/dedupcache/) using packedatomic.Uint64slots (48-bit xxhash key + 16-bit expiry bucket at 64ms granularity)EventHandlerFactory.ProcessEvent()— duplicate events skip RuleManager (CEL evaluation), ContainerProfileManager, and MalwareManager while still flowing to metrics, DNS manager, and network streamEventDedupConfig(enabled by default, 2^18 = 262K slots ≈ 2MB)node_agent_dedup_events_total{event_type, result}for observabilityMotivation
Node-Agent Performance Epic §1.3 — targets 10% of the 20% CPU reduction goal by deduplicating repetitive eBPF events (e.g., repeated file opens, DNS lookups) before they reach the expensive CEL rule evaluation engine.
Design
See
design/node-agent-performance-epic/ebpf-event-deduplication.mdfor the full design document.Benchmarks
DedupCache.CheckAndSet: ~6ns/op miss, ~0.8ns/op hit, 0 allocationsTest plan
go test ./...passes (pre-existing eBPF permission failures only)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Metrics
Tests
Documentation
Chores