redpanda: Add support to map code into hugepages#30190
StephanDollberg wants to merge 1 commit into dev
Conversation
Please see the paragraph about shared libs, which relates to the very same question that came up in standup.
Pull request overview
Adds an opt-in mechanism to promote Redpanda’s executable mappings into transparent hugepages to reduce iTLB/sTLB misses, including a runtime toggle and linker alignment changes to make the main binary eligible for PMD-sized mappings.
Changes:
- Introduces `syschecks::promote_code_to_hugepages()`/`demote_code_from_hugepages()` using `madvise(MADV_HUGEPAGE)` and `madvise(MADV_COLLAPSE)`.
- Adds a new cluster config property `code_hugepages_enabled` and wires it into startup + a live watcher.
- Forces 2MiB segment alignment for the `redpanda` binary via linker flags.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/syschecks/hugepages.h | Declares promote/demote APIs for hugepage mapping of code segments. |
| src/v/syschecks/hugepages.cc | Implements ELF segment iteration + madvise-based promote/demote logic and logging. |
| src/v/syschecks/BUILD | Adds hugepages implementation to the syschecks library build. |
| src/v/redpanda/application.h | Adds a config binding field to watch the new setting at runtime. |
| src/v/redpanda/application.cc | Calls promote at startup and toggles promote/demote via a config watcher. |
| src/v/redpanda/BUILD | Adds linker flags to align PT_LOAD segments to 2MiB boundaries. |
| src/v/config/configuration.h | Adds the code_hugepages_enabled property declaration. |
| src/v/config/configuration.cc | Registers the new property with metadata and default value. |
```
# ensures each PT_LOAD gets its own mmap at a 2 MB boundary rather
# than packing segments into a single mapping.
"-Wl,-z,max-page-size=2097152",
"-Wl,-z,separate-loadable-segments",
```
how many PT_LOAD segments are there? i.e., are we wasting a lot of "space" if we don't fill them out to the next 2MB boundary
There are 4 in the redpanda binary. We are adding ~2MB of total padding (just coincidentally similar to the 2MiB alignment). This is about a 2% binary size increase.
From when I looked into this earlier:
PT_LOAD layout

Aligned:

```
LOAD off 0x0000000 filesz 0x1fff980 R
LOAD off 0x2000000 filesz 0x4f8b920 R E
LOAD off 0x7000000 filesz 0x02630a0 RW
LOAD off 0x7400000 filesz 0x0011698 RW
```

Baseline:

```
LOAD off 0x0000000 filesz 0x1fff980 R
LOAD off 0x1fff980 filesz 0x4f8b920 R E
LOAD off 0x6f8b2c0 filesz 0x02630a0 RW
LOAD off 0x71ee360 filesz 0x0011698 RW
```
Gap between segments
Aligned:
- R -> RX: 1,664 bytes
- RX -> RW: 476,896 bytes
- RW -> RW: 1,691,488 bytes
- total: 2,170,048 bytes
Baseline:
- R -> RX: 0
- RX -> RW: 32
- RW -> RW: 0
- total: 32 bytes
Interesting, looks like we have a test that turns all configs on, and the prefaulting seems to be incompatible with ASAN. Will add a define guard for it.
```
    marked_bytes += len;
}
```

```
// Fault in all pages so MADV_COLLAPSE has something to work with.
```
Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here? It's a lot of memory to waste.
Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here?
For performance reason? Best to take the page faults now? So from that POV I am on the no side.
(Note if we don't want this that then also rules out MADV_COLLAPSE altogether as per the comment).
It's a lot of memory to waste.
It's like 120MiB or so? I mean sure you are probably never going to need every single code page but I don't see much point in saving a few XX MB?
Is that how big the executable segments covered are? Yeah I guess it's not too much. Do you happen to have any timings?
Best to take the page faults now?
Maybe yes, but this is more obvious if you are going to take them all eventually anyway (like the heap, and currently we don't even enable this lock-memory option for the heap). If you would only take 5% of them over the lifetime of the process then this looks less appealing. I have no idea if it's 5% or 95% though (evidently depends at least a bit on workload).
Anyway I think it's fine.
One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?
Is that how big the executable segments covered are?
Yeah
Do you happen to have any timings?
Timings of what sorry?
One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?
Yeah exactly, MADV_HUGEPAGE should only affect the latter. I don't think it hurts?
We could do the whole `if (collapse succeeds) {} else { madv_hugepage }` dance, but I am not entirely sure about all the MADV_COLLAPSE return value semantics (e.g. per the docs one "area" in the range might fail to map, which will already make it not return clean, so we would do both in that case anyway).
No strong feelings though.
```
// cop out so we are just explicit in any case.
auto* base = static_cast<volatile const char*>(addr);
for (size_t off = 0; off < len; off += 4096) {
    [[maybe_unused]] char c = base[off];
```
FWIW just base[off] works also
Adds support to map code into hugepages.
Mapping the code/binary into hugepages decreases iTLB/sTLB misses which
helps with frontend-boundness.
This patch maps the main redpanda binary into hugepages combining two
approaches:
- Using THP for file-backed mappings, i.e.: the binary
- Using MADV_COLLAPSE to force a sync mapping of the binary pages into
  hugepages, avoiding lazy paging by khugepaged.
This patch also makes this runtime switchable by forcing the pages off
hugepages if the feature is being disabled. This is mostly useful for
testing.
The two things above rely on a few things being available:
- Linux build flag CONFIG_READ_ONLY_THP_FOR_FS: Allows THP backing for
  files.
- MADV_COLLAPSE being available: Requires Linux 6.1+
The latter is more and more common but the former is currently still
disabled in some major distributions (Ubuntu, RHEL) hence it will
generally be not very available. Because of this we disable the feature
by default. Not because it would break on these systems but mostly
because it means we don't get much testing.
There is work in progress to remove the CONFIG_READ_ONLY_THP_FOR_FS
kernel flag and instead link it to filesystem support. See
https://lwn.net/Articles/1066582/. This would bring coverage out of the
box to XFS and ext4. We can reevaluate turning this on by default in the
future.
Note independent of kernel support the code pages need to be 2MiB
aligned as well to get successfully mapped.
To enforce this we unconditionally add a linker flag to the redpanda
binary which forces this alignment.
When doing the iteration with dl_iterate_phdr we would naturally include
linked shared libs. However, we currently don't add the aforementioned
linker flag to those and hence those will not map into hugepages.
I think this is a fair tradeoff as those would bloat quite a bit and
shouldn't matter that much from a perf POV. openssl is probably the only
interesting one but I'd claim anything openssl is either crypto compute
bound or syscall bound.
Backports Required
Release Notes