docs: add benchmarking blog posts and performance reference page by SamBarker · Pull Request #254 · kroxylicious/kroxylicious.github.io

SamBarker · 2026-05-13T04:00:17Z

Summary

Adds two blog posts about benchmarking Kroxylicious proxy overhead:
- [May 1] "Does my proxy look big in this cluster?" — operator-focused: methodology, passthrough and encryption results, sizing guidance
- [May 8] "Benchmarking a Kafka proxy: the engineering story" — engineer-focused: OMB harness, flamegraphs (interactive iframes), bugs found in own tooling, cluster incident
Adds a /performance/ reference page summarising key numbers and linking to both posts
Adds interactive async-profiler flamegraphs as self-contained HTML assets
Updates overview.markdown with headline performance figures and a link to the reference page

Status

Draft — the posts are first drafts. Known open items:

- Per-connection scaling section in Post 1 needs the TODO placeholder replaced once 4-core sweep data is available and the scaling picture is better understood
- Post 2 has a stub section for 4-core validation results (pending sweep completion)
- Post 2 tone has not yet received the same voice treatment as Post 1

Test plan

- Run `./run.sh` and verify the site renders at http://127.0.0.1:4000/
- Check both blog posts render correctly, including flamegraph iframes
- Check the /performance/ page renders with correct tables
- Check cross-links between posts and to /performance/ work

🤖 Generated with Claude Code

tombentley · 2026-05-13T04:26:14Z

+| Kroxylicious proxy | 1.4% |
+| GC | 0.1% |
+
+The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4%. It really is a TCP relay with protocol awareness.


I wonder how much that's down to the decode predicate thing -- basically we know the filter chain, and what each filter in it wants to intercept, and I think we avoid doing the request/response decoding when we know nothing is interested. That was code that was in there from the beginning, but I don't actually know how relevant it is -- maybe some of the internal filters mean we're always decoding requests and responses, in which case 1.4% is impressive. Or maybe we're acting more like an L4 proxy most of the time, in which case 1.4% is not quite as impressive.

Great question — this is actually a stronger story than the original prose suggested. The default infrastructure filters (BrokerAddressFilter, TopicNameCacheFilter, ApiVersionsIntersect) are doing genuine L7 work: metadata, FindCoordinator, and API version exchanges are fully decoded for address rewriting and version negotiation. But the high-volume produce/consume traffic hits the decode predicate and passes through without full deserialisation. So the proxy is selectively L7 — real protocol awareness where it needs it, L4-like passthrough on the hot path. The 1.4% is the cost of that design, and it validates it. Updating the prose to make this explicit.
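
To make the decode-predicate idea in this thread concrete, here is a minimal Python sketch. Kroxylicious itself is written in Java and this is not its code. It assumes each filter declares the Kafka API keys it wants to intercept, and that a frame is fully decoded only when at least one filter in the chain is interested. The `Filter` class, the `should_decode` function, and the mapping of filters to API keys are hypothetical illustrations; the real infrastructure filters may register different interests.

```python
from dataclasses import dataclass

# Kafka protocol API keys, as assigned by the Kafka protocol specification.
PRODUCE, FETCH, METADATA, FIND_COORDINATOR, API_VERSIONS = 0, 1, 3, 10, 18


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: a name plus the API keys it wants to intercept."""
    name: str
    interested_in: frozenset


def should_decode(api_key, chain):
    """Return True only if some filter in the chain wants this API key.

    When this returns False the frame can be forwarded as opaque bytes,
    which is the L4-like passthrough path discussed above.
    """
    return any(api_key in f.interested_in for f in chain)


# Assumed interests, for illustration only.
chain = [
    Filter("address-rewriting", frozenset({METADATA, FIND_COORDINATOR})),
    Filter("version-negotiation", frozenset({API_VERSIONS})),
]

for key, label in [(METADATA, "Metadata"), (PRODUCE, "Produce"), (FETCH, "Fetch")]:
    path = "decode" if should_decode(key, chain) else "passthrough"
    print(f"{label}: {path}")
```

Under these assumed interests, adding a filter that intercepts Produce (as record encryption must) would move the hot path from passthrough to decode, which is consistent with the extra cost reported for the encryption scenario.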

tombentley · 2026-05-13T04:30:16Z

+
+The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too:
+
+- **Buffer management (+5.8%)**: encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying


Did we ever figure out how to reuse the buffers more? I think that was a TODO at one point.

Correct — the TODO was never addressed. A BufferPool class existed at one point but was deleted as unused in early 2024. Cipher instances are still created fresh per operation. These remain genuine open optimisation opportunities.

PaulRMellor

I read through the first post. Reads well overall, and the tone is nice and approachable. I left a few suggestions, particularly in places where the AI-assisted wording feels a bit noticeable.

PaulRMellor · 2026-05-15T13:04:19Z

+categories: benchmarking performance
+---
+
+All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.


Suggested change

All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.

All good benchmarking stories start with a hunch. I was confident Kroxylicious was cheap to run — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.

PaulRMellor · 2026-05-15T13:05:58Z

+
+All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
+
+There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.


Suggested change

There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.

There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just another way of asking: "is this thing going to slow down my Kafka?" We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true, but not especially useful.

PaulRMellor · 2026-05-15T13:07:27Z

+
+There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.
+
+So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.


Suggested change

So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.

So instead of saying "it depends", we built something measurable you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.

"got off the fence" might be a bit colloquial for non-native speakers

PaulRMellor · 2026-05-15T13:08:26Z

+
+## Test environment
+
+No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.


Suggested change

No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.

We ran the benchmarks on a realistic deployment rather than a local development machine: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform, providing a controlled test environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.

PaulRMellor · 2026-05-15T13:10:05Z

+| E2E latency p99 | 499.00 ms | 499.00 ms | 0 |
+| Publish rate | 500 msg/s | 500 msg/s | 0 |
+
+**The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.**


Suggested change

**The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.**

**The headline: ~0.2 ms additional average publish latency. Measured throughput was unaffected.**

PaulRMellor · 2026-05-15T13:14:33Z

+
+## Record encryption: now we're doing real work
+
+Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).


Suggested change

Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).

[Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) is a more representative workload because the proxy actively processes each record. Record encryption uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).

Is "understand" the right word here: "push the proxy to understand each record it receives"?
Would "parse" suffice?

PaulRMellor · 2026-05-15T13:19:14Z

+
+### Latency at sub-saturation rates
+
+A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.


Suggested change

A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.

A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Average latency can hide tail latency effects; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
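
As a small illustration of why the average and the p99 can tell different stories, here is a sketch using the nearest-rank percentile definition. The latency samples are made up for the example and are not benchmark data; the benchmark tooling may compute percentiles differently (for instance from histograms).

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic data: 98 fast requests and two slow outliers.
latencies_ms = [2.0] * 98 + [50.0, 400.0]

average_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"average: {average_ms:.2f} ms, p99: {p99_ms:.2f} ms")
```

With this synthetic data the average stays in single-digit milliseconds while the p99 reflects the slow tail, which is the effect the paragraph describes.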

PaulRMellor · 2026-05-15T13:22:27Z

+
+   Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).
+
+2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.


Suggested change

2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.

2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — providing sufficient headroom keeps the latency overhead relatively small.
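
The worked example quoted in this diff excerpt (100k msg/s at 1 KB, giving roughly 2600m) can be reproduced with a short sketch. The 20 millicores per MB/s of produce traffic comes from the post's sizing formula; the 1.3 headroom factor is an assumption inferred from the step from 2000m to ~2600m, and treating 1 KB as 1000 bytes is an assumption implied by "100k msg/s at 1 KB = 100 MB/s". The function name and parameters are hypothetical.

```python
def proxy_cpu_millicores(msgs_per_s, msg_bytes, mc_per_mb_s=20.0, headroom=1.3):
    """Estimate the proxy CPU budget in millicores for an encryption workload.

    mc_per_mb_s: millicores per MB/s of produce traffic (the post's 1:1 shorthand).
    headroom:    multiplier for safety margin (assumed, inferred from the example).
    """
    produce_mb_per_s = msgs_per_s * msg_bytes / 1_000_000
    return produce_mb_per_s * mc_per_mb_s * headroom


# Worked example from the post: 100k msg/s at 1 KB per message.
estimate = proxy_cpu_millicores(100_000, 1000)
print(f"~{estimate:.0f}m (~{estimate / 1000:.1f} cores)")
```

The coefficient was derived on one cluster with one message size, so the result is a starting point for your own measurements rather than a guarantee.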

PaulRMellor · 2026-05-15T13:26:46Z

+
+## Caveats and next steps
+
+These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck:


Suggested change

These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck:

These are real results from real hardware, but they do not necessarily reflect your workload characteristics. A few things worth knowing before you put these numbers in a slide deck:

showuon

Thanks for the blog post. It's good to see some real numbers from the kroxy benchmark. I left some comments on the "Does my proxy look big in this cluster?" post.

showuon · 2026-05-21T06:08:52Z

+| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) |
+| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) |
+| E2E latency avg | 253.16 ms | 253.76 ms | +0.60 ms (+0.2%) |
+| E2E latency p99 | 499.00 ms | 499.00 ms | 0 |


It'd be good to explain the difference between publish latency and E2E latency.

showuon · 2026-05-21T06:14:52Z

+
+The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. Think of the proxy as independent circuits on a distribution board — switching the breaker for lights doesn't cut power to the fridge. A Kafka broker is more like the mains supply itself — every circuit draws from the same source, so heavy load anywhere reduces what's available everywhere. Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it.
+
+The end-to-end p99 figure is dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99.


This is unclear to me. Is this a confirmed result or speculation? It would be good to elaborate on it, or to remove the sentence altogether, since it does not seem to be the main point here.

showuon · 2026-05-21T06:40:14Z

+
+The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**.
+
+### The ceiling scales with CPU budget


I'd be interested to know whether increasing memory helps here, in addition to CPU. I believe it would, since you mention GC overhead elsewhere.

showuon · 2026-05-21T06:44:31Z

+| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) |
+| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) |
+
+So we know that somewhere above 34k we're hitting a limit. Time to hunt out exactly where — enter the rate-sweep.


Reading this, I'm confused about the numbers below 34,000 msg/s. Does it mean that as long as our rate stays below 34,000 msg/s we don't need to worry about latency? Or does the latency overhead stay at the same percentage as we see at 34k here?

showuon · 2026-05-21T06:49:45Z

+|-----------|---------|
+| CPU | AMD EPYC-Rome, 2 GHz |
+| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
+| Kafka | 3-broker Strimzi cluster, replication factor 3 |


It'd be good to know the Kafka version you're running.

showuon · 2026-05-21T06:52:38Z

+| Cluster | 8-node OpenShift (5 workers, 3 masters), RHCOS 9.6 |
+| Kafka | 3-broker Strimzi cluster, replication factor 3 |
+| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit |
+| KMS | HashiCorp Vault (in-cluster) |


Same for the Vault version.

showuon · 2026-05-21T07:00:52Z

+| CPU limit | Comfortable ceiling | Saturation point |
+|-----------|--------------------|--------------------|
+| 1000m | ~80k msg/s | ~126k msg/s |
+| 2000m | ~80k msg/s | above 160k msg/s |
+| 4000m | ~160k msg/s | above 321k msg/s |


I know we use 1 KB messages, but it's hard to see the real throughput here. It would be good to append the data rate as well, e.g. ~126k msg/s (~126 MB/s). Same for the other tables.
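
The annotation requested here is a simple unit conversion. A minimal sketch, assuming 1 KB messages are 1000 bytes and MB means 10^6 bytes (the helper name is hypothetical):

```python
def with_throughput(msgs_per_s, msg_bytes=1000):
    """Format a message rate together with its approximate data rate."""
    mb_per_s = msgs_per_s * msg_bytes / 1_000_000
    return f"~{msgs_per_s // 1000}k msg/s (~{mb_per_s:.0f} MB/s)"


# Rates taken from the table in the diff excerpt above.
for rate in (80_000, 126_000, 160_000, 321_000):
    print(with_throughput(rate))
```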

showuon · 2026-05-21T07:02:58Z

+
+Numbers without guidance aren't very useful, so here's how to translate these results into pod specs.
+
+**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers.


A stupid question: What does rate sweep tool mean here?

showuon · 2026-05-21T07:04:16Z

+
+**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers.
+
+**With record encryption:**


I believe record encryption is just an example here, so maybe say something like: "With filters, e.g. record encryption".

showuon · 2026-05-21T07:20:26Z

+So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.
+
+<!-- FIXME: verify all numbers against final benchmark run before publish -->
+**TL;DR**: A passthrough Kroxylicious proxy adds ~0.2 ms to average publish latency with no throughput impact. Add record encryption and expect a ~25% throughput reduction and 0.2–3 ms of additional latency at comfortable rates. The throughput ceiling scales linearly with CPU: budget 10 millicores per MB/s of total proxy traffic. The full benchmark harness is open source — run it on your own cluster for numbers that reflect your workload.


nit: Since this summarizes the article, it would be clearer as a bulleted list.

SamBarker · 2026-05-21T07:58:23Z

+
+| Rate | Metric | Baseline | Encryption | Delta |
+|------|--------|----------|------------|-------|
+| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) |


Fixed — post uses msg/s consistently throughout.

Covers methodology, test environment, passthrough proxy results, encryption latency and throughput ceiling, the per-connection scaling insight, and sizing guidance. Includes a TODO placeholder for the connection sweep results before publication. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Covers why we chose OMB over Kafka's own tools, the benchmark harness we built (Helm chart, orchestration scripts, JBang result processors), workload design rationale, CPU flamegraphs with embedded interactive iframes, the per-connection ceiling discovery, bugs found in our own tooling, and the cluster recovery incident. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Adds /performance/ as a dedicated quick-reference page with headline benchmark numbers, comparison tables, and sizing guidance, linked from both blog posts. Updates the existing Performance section in overview.markdown with the key headline numbers and a link to the full reference page. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

…aming

- Shift publication dates to May 21 and May 28
- Replace speculative per-connection ceiling explanation with empirical finding: encryption throughput ceiling scales linearly with CPU budget (validated at 1000m, 2000m, 4000m)
- Add sizing formula: CPU (mc) = 20 × produce_MB_per_s, with worked example
- Add RF=3 masking caveat: initial 1-topic sweeps conflated Kafka replication ceiling with proxy CPU ceiling; coefficient derived from RF=1 multi-topic workloads
- Post 2: add full investigation narrative (workload isolation approach, coefficient derivation, 4-core confirmation, and 2-core prediction/validation)
- Drop stale "future work" items that are now complete

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

The proxy is selectively L7: default infrastructure filters do genuine Kafka protocol work (address rewriting, API version negotiation, metadata caching) while high-volume produce/consume traffic bypasses full deserialisation via the decode predicate. The 1.4% proxy CPU share validates this design, not just reflects it. Also drop the Fyre cluster upgrade section — OCP-internal incident with no relevance to readers. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Warm up test environment intro: realistic deployment framing - Add conversational lead-in to sizing guidance in both documents - Improve caveats opener in Post 1 - Add caveats section to performance page (RF=3 masking, message size, horizontal scaling) Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- New opening: laptop/codebase/confidence → harness/cluster/nuance
- Why not Kafka tools: add coordinated omission bullet with voice
- What we built: reframe around two experimental questions (rate sweep, connection sweep) before tooling details; add two-dimensions framing
- Banishing click-ops: replace dry Helm section with Red Hat/operator motivation and all-your-CRs joke
- JSON always comes in megabytes: replace docs dump with signal/noise framing; sharpen Comparator vs Summariser distinction
- Following the ceiling: rewrite as investigation arc (spare CPU → what were we hitting? → RF=3 masking → connection sweep → coefficient)
- Rename Post 2 title to "How hard can it be??? Maxing out a Kroxylicious instance"
- Revert slug rename (benchmarking-the-proxy-under-the-hood stays)
- Update performance.markdown cross-links to match

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Replaces dry methodology notes with a fuller narrative arc: - Opens with the representative vs repeatable tension in benchmarking - Explains the single-partition choice and why it makes the author wince - Justifies RF=3: proxy adds one real hop, but RF=1 would double the hop count — not a fair production comparison - Multi-topic runs reconnect to representative: baseline tax at normal load - Rate sweep methodology explained as technique, not run-specific numbers Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Format all narrator asides as *(italic brackets)* to distinguish narrator voice from main text - Fix coordinated omission bullet missing bold formatting - Fix "tracking...tracking" redundancy in OMB paragraph - "it made me wince" → "*(I had to squirm to type it)*" — more honest, author reached for single-partition deliberately Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Reframe the takeaway: the proxy boils the latency-sensitive path to near-TCP-stack overhead while operating at Layer 7 — that's the win - Add paragraph explaining why overhead holds across 10/100 topics: the proxy doesn't contend between topics (unlike a broker which juggles disk I/O, partition leaders, and replication); the connection sweep validates linear throughput scaling Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Full investigation arc: spare CPU shock → NIC elimination → 4-producer test → anti-affinity attempt (3 nodes, 3 brokers, nowhere to go) → new cluster → baseline shock → RTT math reveals co-location → second penny drops on OMB scheduling → RF=1 unlocks proxy CPU ceiling → coefficient → prediction. Corrects several issues in the prior draft: Netty theory discarded (proxy metrics showed minimal back pressure); co-location framed at pod/node level not VM level; 37k flagged as the only figure from the original cluster; all coefficient and sweep numbers confirmed as coming from the new distributed cluster. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Applies accurate numbers from the distributed 8-node cluster (5 workers, 3 masters) across all three files, replacing figures from the original co-located cluster:

- Cluster description: 6-node → 8-node (5 workers, 3 masters)
- RF=3 throughput ceiling: 37.2k→14,600 msg/s (encryption), 50-52k→19,400 msg/s (baseline), 26%→25% reduction
- Coefficient: 12.5 mc/MB/s → 9.7 measured / 10 mc/MB/s operator formula
- Formula: expose general form (10 × total proxy MB/s) with fan-out explanation; 20 × produce MB/s remains the 1:1 shorthand
- 1-core RF=1: ~40k ceiling replaced with safe at 80k (91ms p99), saturating at ~126k
- 4-core validation: 447ms→247ms at 160k; catastrophic→elevated at 321k (1,706ms); saturation above 321k
- 2-core: comfortable at 80k (850ms), sustaining at 160k (720ms); saturation not yet measured, consistent with model
- Netty aside corrected: thread count scales with availableProcessors() (CPU limit), not fixed at 4

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Rewrites flamegraph intro with personal motivation: hot path minimalism, Amdahl's law framing, and honest admission that the full sweep story didn't come together - Adds forward reference to bugs section to stitch the structure together - Moves OSS transparency point into "Run it yourself" where it naturally belongs, with a TODO placeholder for the raw data link - Drops duplicate "we share our workings" phrase from flamegraph prose Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Fix punctuation on OMB methodology comparability sentence
- Fix repeated "We leaned towards repeatable" in workload design section
- Fix tense: "will make" -> "makes" for workload design aside
- Fix typo: "died in the wool" -> "dyed in the wool"
- Add closing paragraph to flamegraph section: proxy wins are real but we aren't going to make AES faster
- Replace stale 36k msg/s flamegraph references with FIXME pending new profiler runs

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Change "All good" to "Every good benchmarking story starts" (Bob's suggestion) - Add TL;DR paragraph with key numbers and sizing formula; flagged with FIXME comment pending final benchmark run Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Fix lone space-hyphen-space to em dash in OMB description - Add runtime warning (~14 hours) before benchmark commands with link to the full blog post reproduction script as a gist Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Explains why MWU testing was added (PhD teammate asked "is the difference real?"), how check-significance.sh works (per-window p99, ~30 samples, p < 0.05), and the honest caveat that per-window samples aren't fully uncorrelated. Distinguishes clearly between what MWU covers (latency delta realness) and what the coefficient derivation doesn't (n=4, no significance test, untested across message sizes). Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Move p99 explanation before first passthrough table where percentiles are first encountered; remove duplicate from encryption section
- Expand Layer 7 point with one sentence of context for non-technical readers: most Kafka proxies operate at L4, Kroxylicious parses every message yet still adds only 0.2 ms
- Add distribution board analogy for independent connection handling vs broker shared resource contention
- Simplify replication factor caveat to one sentence, linking to companion post for detail
- Fix "Most proxies" → "Most proxies operate on Kafka" for accuracy

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Post 2 is dated 2026-05-28 — Jekyll skips future posts by default, causing post_url resolution to fail at build time. Replace linked references with plain "companion post" text; links will be restored via a follow-up PR when Post 2 goes live. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Convert TL;DR from prose to bulleted list (S1) - Soften "dominated by Kafka consumer fetch timeouts" to "likely dominated by" — this is an inference, not a measured fact (S5) - Inline definition of rate sweep at first use in sizing guidance (S9) - Broaden "With record encryption" to "With filters (record encryption is the representative example here)" (S10) Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Rename files and update front matter dates; update post_url reference in Post 2 to match Post 1's new filename. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

- Fix "opevators" → "operators" - Add .DS_Store and .op/ to .gitignore - Remove accidentally committed macOS metadata and 1Password plugin files Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

These files are generated by running Jekyll locally and should not be committed. Add glob patterns to .gitignore to prevent recurrence. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

tombentley reviewed May 13, 2026

Comment thread _posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md Outdated

PaulRMellor reviewed May 15, 2026

SamBarker force-pushed the blog/benchmarking-the-proxy branch from 6d70f9c to e814763 Compare May 21, 2026 04:28

showuon reviewed May 21, 2026

SamBarker commented May 21, 2026

SamBarker added 23 commits May 22, 2026 10:39

WIP: Redrafting the engineering deep dive

ee44ce7

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Standardise on msg/s throughout (was mixed msg/s and msg/sec)

837e3d6

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

Reschedule posts to 26 May and 2 June

429c635

Rename files and update front matter dates; update post_url reference in Post 2 to match Post 1's new filename.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

SamBarker added 2 commits May 22, 2026 10:39

SamBarker force-pushed the blog/benchmarking-the-proxy branch from 7c62876 to 6c40ee4 Compare May 21, 2026 22:39


		The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too:

		- Buffer management (+5.8%): encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying

	All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.
	All good benchmarking stories start with a hunch. I was confident Kroxylicious was cheap to run — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly.


		There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us.


		So we stopped saying "it depends", and got off the fence: we built something you can run yourselves on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.

	So we stopped saying "it depends", and got off the fence: we built something you can run yourselves on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.
	So instead of saying "it depends", we built something measurable you can run yourselves on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours.


		## Test environment

		No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.

	No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.
	We ran the benchmarks on a realistic deployment rather than a local development machine: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform, providing a controlled test environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit.

	The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.
	The headline: ~0.2 ms additional average publish latency. Measured throughput was unaffected.


		## Record encryption: now we're doing real work

		Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects).


		### Latency at sub-saturation rates

		A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters.
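To make the difference between an average and a p99 concrete, here is a minimal sketch using the nearest-rank method. The latency samples are invented for illustration and are not taken from the benchmark runs; the `percentile` helper is a hypothetical name, not part of any benchmarking harness.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented data: 98 fast requests at 2 ms and 2 slow requests at 50 ms.
latencies_ms = [2.0] * 98 + [50.0] * 2

mean_ms = statistics.mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.2f} ms, p99 = {p99_ms:.1f} ms")
```

With this invented data the mean stays under 3 ms while the p99 lands on the slow requests, which is why the two numbers are reported separately.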


		Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores).

		2. Latency budget: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it.
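The worked example's arithmetic can be sketched as a small helper. The 20 millicores per MB/s coefficient is read directly off the example above; the 30% headroom factor is an assumption inferred from the step from 2000m to roughly 2600m, and the function name is hypothetical. Treat the result as a starting point to validate on your own cluster, not a guarantee.

```python
MILLICORES_PER_MBPS = 20  # from the worked example: 100 MB/s -> 2000m
HEADROOM_FACTOR = 1.3     # assumption: inferred from 2000m -> ~2600m


def proxy_cpu_millicores(msgs_per_sec, msg_size_bytes, headroom=HEADROOM_FACTOR):
    """Estimate a proxy CPU budget in millicores from produce throughput."""
    # 1 KB is treated as 1000 bytes, matching "100k msg/s at 1 KB = 100 MB/s".
    produce_mbps = msgs_per_sec * msg_size_bytes / 1_000_000
    return round(produce_mbps * MILLICORES_PER_MBPS * headroom)


estimate = proxy_cpu_millicores(100_000, 1_000)
print(f"{estimate}m (~{estimate / 1000:.1f} cores)")
```

Passing `headroom=1.0` returns the raw figure before any safety margin, which is handy when comparing against a measured ceiling.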


		## Caveats and next steps

		These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck:


		The overhead holding across 10 and 100 topics makes sense for the same reason: the proxy doesn't contend between topics. Think of the proxy as independent circuits on a distribution board — switching the breaker for lights doesn't cut power to the fridge. A Kafka broker is more like the mains supply itself — every circuit draws from the same source, so heavy load anywhere reduces what's available everywhere. Topics don't contend for shared resources: throughput scales linearly across them, and the connection sweep validates it.

		The end-to-end p99 figure is dominated by Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99.


		The transition wasn't a clean cliff edge — the proxy alternated between sustaining and saturating in a narrow band just above the ceiling. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Stay below 14k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage will vary.
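One conservative way to read a ceiling out of rate-sweep data with this kind of alternating band is to take the highest target rate reached before the first shortfall, ignoring any later rates that happened to sustain. The sketch below assumes the sweep is available as (target, achieved) pairs in msg/s; the data, the 1% tolerance, and the function name are all hypothetical and not taken from the benchmark harness.

```python
def sustained_ceiling(sweep, tolerance=0.01):
    """Return the highest target rate sustained before the first shortfall.

    sweep: iterable of (target_rate, achieved_rate) pairs in msg/s.
    A rate counts as sustained if achieved >= target * (1 - tolerance).
    """
    ceiling = None
    for target, achieved in sorted(sweep):
        if achieved < target * (1 - tolerance):
            break  # first saturating rate: stop, even if later rates recover
        ceiling = target
    return ceiling


# Invented sweep: 15k saturates, 16k happens to sustain, 17k saturates again.
sweep = [
    (10_000, 10_000),
    (12_000, 11_990),
    (14_000, 13_950),
    (15_000, 14_100),
    (16_000, 15_900),
    (17_000, 14_300),
]

print(sustained_ceiling(sweep))
```

Stopping at the first shortfall deliberately discards the lucky 16k run, since a rate that only sometimes sustains is not one to size against.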

		### The ceiling scales with CPU budget


		Numbers without guidance aren't very useful, so here's how to translate these results into pod specs.

		Passthrough proxy: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers.


		With record encryption:

4 participants