Skip to content

redpanda: Add support to map code into hugepages#30190

Open
StephanDollberg wants to merge 1 commit intodevfrom
stephan/hugepage-code2
Open

redpanda: Add support to map code into hugepages#30190
StephanDollberg wants to merge 1 commit intodevfrom
stephan/hugepage-code2

Conversation

@StephanDollberg
Copy link
Copy Markdown
Member

Adds support to map code into hugepages.

Mapping the code/binary into hugepages decreases iTLB/sTLB misses which
helps with frontend-boundness.

This patch maps the main redpanda binary into hugepages combining two
approaches:

  • Using THP for filed-back mappings, i.e.: the binary
  • Using MADV_COLLAPSE to force a sync mapping of the binary pages into
    hugepages. Avoiding lazy paging by khugepaged.

This patch also makes this runtime switchable by forcing the pages off
hugepages if the feature is being disabled. This is mostly useful for
testing.

The two things above rely on a few things being available:

  • Linux build flag CONFIG_READ_ONLY_THP_FOR_FS: Allows THP backing for
    files.
  • MADV_COLLAPSE being available: Requires Linux 6.1+

The latter is more and more common but the former is currently still
disabled in some major distributions (Ubuntu, RHEL) hence it will
generally be not very available. Because of this we disable the feature
by default. Not because it would break on these systems but mostly
because it means we don't get much testing.

There is work in progress to remove the CONFIG_READ_ONLY_THP_FOR_FS
kernel flag and instead link it to filesystem support. See
https://lwn.net/Articles/1066582/. This would bring coverage out of the
box to XFS and ext4. We can reevaluate turning this on by default in the
future.

Note independent of kernel support the code pages need to be 2MiB
aligned as well to get successfully mapped.

To enforce this we unconditionally add a linker flag to the redpanda
binary which forces this alignment.

When doing the iteration with dl_iterate_phdr we would naturally include
linked shared libs. However, we currently don't add the aforementioned
linker flag to those and hence those will not map into hugepages.

I think this is a fair tradeoff as those would bloat quite a bit and
shouldn't matter that much from a perf POV. openssl is probably the only
interesting one but I'd claim anything openssl is either crypto compute
bound or syscall bound.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@StephanDollberg StephanDollberg requested a review from a team as a code owner April 16, 2026 14:01
Copilot AI review requested due to automatic review settings April 16, 2026 14:01
@StephanDollberg
Copy link
Copy Markdown
Member Author

StephanDollberg commented Apr 16, 2026

Please see the paragraph about shared libs which relates to the very same quesiton that came up in standup.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in mechanism to promote Redpanda’s executable mappings into transparent hugepages to reduce iTLB/sTLB misses, including a runtime toggle and linker alignment changes to make the main binary eligible for PMD-sized mappings.

Changes:

  • Introduces syschecks::promote_code_to_hugepages() / demote_code_from_hugepages() using madvise(MADV_HUGEPAGE) and madvise(MADV_COLLAPSE).
  • Adds a new cluster config property code_hugepages_enabled and wires it into startup + a live watcher.
  • Forces 2MiB segment alignment for the redpanda binary via linker flags.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/v/syschecks/hugepages.h Declares promote/demote APIs for hugepage mapping of code segments.
src/v/syschecks/hugepages.cc Implements ELF segment iteration + madvise-based promote/demote logic and logging.
src/v/syschecks/BUILD Adds hugepages implementation to the syschecks library build.
src/v/redpanda/application.h Adds a config binding field to watch the new setting at runtime.
src/v/redpanda/application.cc Calls promote at startup and toggles promote/demote via a config watcher.
src/v/redpanda/BUILD Adds linker flags to align PT_LOAD segments to 2MiB boundaries.
src/v/config/configuration.h Adds the code_hugepages_enabled property declaration.
src/v/config/configuration.cc Registers the new property with metadata and default value.

Comment thread src/v/syschecks/hugepages.cc
Comment thread src/v/syschecks/hugepages.cc
Comment thread src/v/syschecks/hugepages.cc
Comment thread src/v/syschecks/BUILD
Comment thread src/v/redpanda/application.cc
Comment thread src/v/redpanda/BUILD
Comment thread src/v/syschecks/hugepages.h
Comment thread src/v/syschecks/hugepages.cc
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Retry command for Build#83247

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Apr 16, 2026

CI test results

test results on build#83247
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL ClusterConfigTest test_valid_settings null integration https://buildkite.com/redpanda/redpanda/builds/83247#019d96b1-d602-4e63-b823-9494ef9eb7c4 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_valid_settings
test results on build#83256
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL ClusterConfigTest test_valid_settings null integration https://buildkite.com/redpanda/redpanda/builds/83256#019d9700-62f6-4d0c-bdea-a888650857fb 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_valid_settings

@StephanDollberg
Copy link
Copy Markdown
Member Author

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Retry command for Build#83256

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings

Comment thread src/v/redpanda/BUILD
# ensures each PT_LOAD gets its own mmap at a 2 MB boundary rather
# than packing segments into a single mapping.
"-Wl,-z,max-page-size=2097152",
"-Wl,-z,separate-loadable-segments",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how many PT_LOAD segments are there? i.e., are we wasting a lot of "space" if we don't fill them out to the next 2MB boundary

Copy link
Copy Markdown
Member Author

@StephanDollberg StephanDollberg Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 4 in the redpanda binary. We adding ~2MB (just conicidental similar to the 2MiB alignment) of total padding. This is about a 2% binary size increase.

From when I looked into this earlier.

PT_LOAD layout

Aligned:
LOAD  off 0x0000000  filesz 0x1fff980  R
LOAD  off 0x2000000  filesz 0x4f8b920  R E
LOAD  off 0x7000000  filesz 0x02630a0  RW
LOAD  off 0x7400000  filesz 0x0011698  RW

Baseline:
LOAD  off 0x0000000  filesz 0x1fff980  R
LOAD  off 0x1fff980  filesz 0x4f8b920  R E
LOAD  off 0x6f8b2c0  filesz 0x02630a0  RW
LOAD  off 0x71ee360  filesz 0x0011698  RW

Gap between segments
Aligned:
- R -> RX: 1,664 bytes
- RX -> RW: 476,896 bytes
- RW -> RW: 1,691,488 bytes
- total: 2,170,048 bytes

Baseline:
- R -> RX: 0
- RX -> RW: 32
- RW -> RW: 0
- total: 32 bytes

@StephanDollberg
Copy link
Copy Markdown
Member Author

StephanDollberg commented Apr 16, 2026

Interesting, looks like we have a test that turns all configs on and the prefaulting seems to be incompatible with ASAN. Will define guard it.

marked_bytes += len;
}

// Fault in all pages so MADV_COLLAPSE has something to work with.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here? It's a lot of memory to waste.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here?

For performance reason? Best to take the page faults now? So from that POV I am on the no side.

(Note if we don't want this that then also rules out MADV_COLLAPSE altogether as per the comment).

It's a lot of memory to waste.

It's like 120MiB or so? I mean sure you are probably never going to need every single code page but I don't see much point in saving a few XX MB?

Copy link
Copy Markdown
Member

@travisdowns travisdowns Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that how big the executable segments covered are? Yeah I guess it's not too much. Do you happen to have any timings?

Best to take the page faults now?

Maybe yes, but this is more obvious if you are going to take them all eventually anyway (like the heap, and currently we don't even enable this lock-meomry option for the heap). If you would only take 5% of them over the lifetime of the process then this looks less appealing. I have no idea if it's 5% or 95% though (evidently depends at least a bit on workload).

Anyway I think it's fine.

One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that how big the executable segments covered are?

Yeah

Do you happen to have any timings?

Timings of what sorry?

One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?

Yeah exactly, MADV_HUGEPAGE should only affect the latter. I don't think it hurts?

We could do the whole if (not collapse fails) else { madv_hugepage } dance but I am not entirely sure about all the MADV_COLLAPSE return value semantics (.e.g.: as per docs one "area" in the range might fail to map which will already make it not return clean so we would do both in that case anyway).

No strong feelings though.

// cop out so we are just explicit in any case.
auto* base = static_cast<volatile const char*>(addr);
for (size_t off = 0; off < len; off += 4096) {
[[maybe_unused]] char c = base[off];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW just base[off] works also

Adds support to map code into hugepages.

Mapping the code/binary into hugepages decreases iTLB/sTLB misses which
helps with frontend-boundness.

This patch maps the main redpanda binary into hugepages combining two
approaches:

 - Using THP for filed-back mappings, i.e.: the binary
 - Using MADV_COLLAPSE to force a sync mapping of the binary pages into
   hugepages. Avoiding lazy paging by khugepaged.

This patch also makes this runtime switchable by forcing the pages off
hugepages if the feature is being disabled. This is mostly useful for
testing.

The two things above rely on a few things being available:

 - Linux build flag CONFIG_READ_ONLY_THP_FOR_FS: Allows THP backing for
   files.
 - MADV_COLLAPSE being available: Requires Linux 6.1+

The latter is more and more common but the former is currently still
disabled in some major distributions (Ubuntu, RHEL) hence it will
generally be not very available. Because of this we disable the feature
by default.  Not because it would break on these systems but mostly
because it means we don't get much testing.

There is work in progress to remove the CONFIG_READ_ONLY_THP_FOR_FS
kernel flag and instead link it to filesystem support. See
https://lwn.net/Articles/1066582/. This would bring coverage out of the
box to XFS and ext4. We can reevaluate turning this on by default in the
future.

Note independent of kernel support the code pages need to be 2MiB
aligned as well to get successfully mapped.

To enforce this we unconditionally add a linker flag to the redpanda
binary which forces this alignment.

When doing the iteration with dl_iterate_phdr we would naturally include
linked shared libs. However, we currently don't add the aforementioned
linker flag to those and hence those will not map into hugepages.

I think this is a fair tradeoff as those would bloat quite a bit and
shouldn't matter that much from a perf POV. openssl is probably the only
interesting one but I'd claim anything openssl is either crypto compute
bound or syscall bound.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants