redpanda: Add support to map code into hugepages#30190
StephanDollberg wants to merge 1 commit into dev
Conversation
Please see the paragraph about shared libs, which relates to the very same question that came up in standup.
Pull request overview
Adds an opt-in mechanism to promote Redpanda’s executable mappings into transparent hugepages to reduce iTLB/sTLB misses, including a runtime toggle and linker alignment changes to make the main binary eligible for PMD-sized mappings.
Changes:
- Introduces `syschecks::promote_code_to_hugepages()`/`demote_code_from_hugepages()` using `madvise(MADV_HUGEPAGE)` and `madvise(MADV_COLLAPSE)`.
- Adds a new cluster config property `code_hugepages_enabled` and wires it into startup + a live watcher.
- Forces 2MiB segment alignment for the `redpanda` binary via linker flags.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/syschecks/hugepages.h | Declares promote/demote APIs for hugepage mapping of code segments. |
| src/v/syschecks/hugepages.cc | Implements ELF segment iteration + madvise-based promote/demote logic and logging. |
| src/v/syschecks/BUILD | Adds hugepages implementation to the syschecks library build. |
| src/v/redpanda/application.h | Adds a config binding field to watch the new setting at runtime. |
| src/v/redpanda/application.cc | Calls promote at startup and toggles promote/demote via a config watcher. |
| src/v/redpanda/BUILD | Adds linker flags to align PT_LOAD segments to 2MiB boundaries. |
| src/v/config/configuration.h | Adds the code_hugepages_enabled property declaration. |
| src/v/config/configuration.cc | Registers the new property with metadata and default value. |
```
# ensures each PT_LOAD gets its own mmap at a 2 MB boundary rather
# than packing segments into a single mapping.
"-Wl,-z,max-page-size=2097152",
"-Wl,-z,separate-loadable-segments",
```
how many PT_LOAD segments are there? i.e., are we wasting a lot of "space" if we don't fill them out to the next 2MB boundary
There are 4 in the redpanda binary. We are adding ~2MB of total padding (just coincidentally similar to the 2MiB alignment). This is about a 2% binary size increase.
From when I looked into this earlier:
PT_LOAD layout

Aligned:

```
LOAD off 0x0000000 filesz 0x1fff980 R
LOAD off 0x2000000 filesz 0x4f8b920 R E
LOAD off 0x7000000 filesz 0x02630a0 RW
LOAD off 0x7400000 filesz 0x0011698 RW
```

Baseline:

```
LOAD off 0x0000000 filesz 0x1fff980 R
LOAD off 0x1fff980 filesz 0x4f8b920 R E
LOAD off 0x6f8b2c0 filesz 0x02630a0 RW
LOAD off 0x71ee360 filesz 0x0011698 RW
```
Gap between segments
Aligned:
- R -> RX: 1,664 bytes
- RX -> RW: 476,896 bytes
- RW -> RW: 1,691,488 bytes
- total: 2,170,048 bytes
Baseline:
- R -> RX: 0
- RX -> RW: 32
- RW -> RW: 0
- total: 32 bytes
Interesting, looks like we have a test that turns all configs on, and the prefaulting seems to be incompatible with ASAN. Will add a define guard for it.
```
    marked_bytes += len;
}
```

```
// Fault in all pages so MADV_COLLAPSE has something to work with.
```
Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here? It's a lot of memory to waste.
Why do we want to sync fault in all huge pages? Should we still want on-demand mapping here?
For performance reason? Best to take the page faults now? So from that POV I am on the no side.
(Note if we don't want this that then also rules out MADV_COLLAPSE altogether as per the comment).
It's a lot of memory to waste.
It's like 120MiB or so? I mean sure you are probably never going to need every single code page but I don't see much point in saving a few XX MB?
Is that how big the executable segments covered are? Yeah I guess it's not too much. Do you happen to have any timings?
Best to take the page faults now?
Maybe yes, but this is more obvious if you are going to take them all eventually anyway (like the heap, and currently we don't even enable this lock-memory option for the heap). If you would only take 5% of them over the lifetime of the process then this looks less appealing. I have no idea if it's 5% or 95% though (evidently depends at least a bit on workload).
Anyway I think it's fine.
One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?
Is that how big the executable segments covered are?
Yeah
Do you happen to have any timings?
Timings of what sorry?
One other thing though: why are we doing the MADV_HUGEPAGE and MADV_COLLAPSE? ISTM you only want one or the other: the former for lazy, the latter for sync full mapping. I.e., I feel like MADV_HUGEPAGE does nothing now, unless it's for kernels that support MADV_HUGEPAGE and not MADV_COLLAPSE?
Yeah exactly, MADV_HUGEPAGE should only affect the latter. I don't think it hurts?
We could do the whole `if (collapse succeeds) {} else { madv_hugepage }` dance, but I am not entirely sure about all the MADV_COLLAPSE return value semantics (e.g. per the docs one "area" in the range might fail to map, which will already make it not return clean, so we would do both in that case anyway).
No strong feelings though.
```
// cop out so we are just explicit in any case.
auto* base = static_cast<volatile const char*>(addr);
for (size_t off = 0; off < len; off += 4096) {
    [[maybe_unused]] char c = base[off];
```
FWIW just base[off] works also
Adds support to map code into hugepages.
Mapping the code/binary into hugepages decreases iTLB/sTLB misses which
helps with frontend-boundness.
This patch maps the main redpanda binary into hugepages combining two
approaches:
- Using THP for file-backed mappings, i.e.: the binary
- Using MADV_COLLAPSE to force a sync mapping of the binary pages into
  hugepages, avoiding lazy paging by khugepaged.
This patch also makes this runtime switchable by forcing the pages off
hugepages if the feature is being disabled. This is mostly useful for
testing.
The two things above rely on a few things being available:
- Linux build flag CONFIG_READ_ONLY_THP_FOR_FS: Allows THP backing for
  files.
- MADV_COLLAPSE being available: Requires Linux 6.1+
The latter is more and more common but the former is currently still
disabled in some major distributions (Ubuntu, RHEL) hence it will
generally be not very available. Because of this we disable the feature
by default. Not because it would break on these systems but mostly
because it means we don't get much testing.
There is work in progress to remove the CONFIG_READ_ONLY_THP_FOR_FS
kernel flag and instead link it to filesystem support. See
https://lwn.net/Articles/1066582/. This would bring coverage out of the
box to XFS and ext4. We can reevaluate turning this on by default in the
future.
Note independent of kernel support the code pages need to be 2MiB
aligned as well to get successfully mapped.
To enforce this we unconditionally add a linker flag to the redpanda
binary which forces this alignment.
When doing the iteration with dl_iterate_phdr we would naturally include
linked shared libs. However, we currently don't add the aforementioned
linker flag to those and hence those will not map into hugepages.
I think this is a fair tradeoff as those would bloat quite a bit and
shouldn't matter that much from a perf POV. openssl is probably the only
interesting one but I'd claim anything openssl is either crypto compute
bound or syscall bound.
Backports Required
Release Notes