optionally use MADV_GUARD_INSTALL for large allocation guard pages #341
Open
thomasbuilds wants to merge 1 commit into
Contributor
This produces a regression for this program. When CONFIG_GUARD_PAGES_USE_MADVISE is false, the program runs normally, but when CONFIG_GUARD_PAGES_USE_MADVISE is true, the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT); fails with errno=22 (EINVAL).
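The program referred to in this comment is not reproduced here. A minimal sketch of the kind of reproducer described, assuming it amounts to one large `malloc` issued after `mlockall(MCL_FUTURE | MCL_ONFAULT)`, might look like the following. The function name `malloc_after_mlockall` and the return-value convention are illustrative assumptions, not taken from the report.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical reproducer: lock all future mappings, then make one allocation.
 * Returns 0 if the allocation succeeded, the errno left by malloc if it failed,
 * or -1 if mlockall itself was refused (for example EPERM).
 */
int malloc_after_mlockall(size_t size) {
    if (mlockall(MCL_FUTURE | MCL_ONFAULT) != 0) {
        fprintf(stderr, "mlockall: %s\n", strerror(errno));
        return -1;
    }

    errno = 0;
    void *p = malloc(size);
    int err = 0;
    if (p == NULL) {
        /* Report the errno from the failed allocation; default to ENOMEM. */
        err = errno ? errno : ENOMEM;
        fprintf(stderr, "malloc: errno=%d\n", err);
    }

    free(p);
    munlockall(); /* undo MCL_FUTURE so later mappings are not locked */
    return err;
}
```

Calling `malloc_after_mlockall(1 << 20)` from `main` with the allocator under test linked or preloaded would be expected to return 22 in the failing configuration described above, and 0 otherwise, provided the size is large enough to take the large-allocation path and `RLIMIT_MEMLOCK` does not interfere.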
9e3e3a6 to f54ee16
f54ee16 to ded5838
Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install large allocation guard regions with MADV_GUARD_INSTALL (Linux 6.13+) inside a single read-write mapping instead of separate PROT_NONE mappings, keeping each large allocation to one VMA instead of three.

This is preserved through allocate_pages(), allocate_pages_aligned(), the region quarantine and in-place realloc shrink so it holds under allocation churn, including under CONFIG_LABEL_MEMORY where the quarantined region is named as a whole to avoid splitting the VMA.

Kernel support is probed and cached at runtime. Guard installation is best-effort: it falls back to the PROT_NONE scheme whenever madvise fails, including EINVAL on VM_LOCKED mappings from mlockall(MCL_FUTURE), which also latches the feature off to avoid retrying on every allocation.

It is off by default because the guard bytes are then accounted as committed memory (resident memory and total address space are unchanged), which regresses strict overcommit (vm.overcommit_memory=2).
ded5838 to 35a0009
Addresses the high-VMA-count concern from `KERNEL_FEATURE_WISHLIST.md` (see #258). `MADV_GUARD_INSTALL` (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate `PROT_NONE` VMAs.

Change
Adds `CONFIG_GUARD_PAGES_USE_MADVISE` (default false). When enabled, guard regions for large allocations are installed with `MADV_GUARD_INSTALL` inside one read-write mapping rather than carved out as separate `PROT_NONE` mappings, keeping each large allocation to a single VMA instead of three. This is applied in `allocate_pages()`, `allocate_pages_aligned()`, the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations.

Kernel support is probed once at runtime and cached, and guard installation is best-effort: any `madvise` failure falls back to the existing `PROT_NONE` scheme rather than failing the allocation. In particular, `MADV_GUARD_INSTALL` returns `EINVAL` on `VM_LOCKED` mappings (e.g. under `mlockall(MCL_FUTURE)`, which locks all future mappings); that falls back and latches the feature off so it isn't retried per allocation (the `mlockall` regression reported by @rdevshp), preserving `errno` across the fallback. Under `CONFIG_LABEL_MEMORY` the quarantined region is labeled as a whole so `PR_SET_VMA_ANON_NAME` does not split the single VMA back into three.

Why off by default
In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured: the guard bytes become committed/accounted memory once they are part of a RW mapping. Resident memory and the total reserved address space are both unchanged (so `RLIMIT_AS` is unaffected), but the private-writable commit charge grows by the guard size, which regresses strict overcommit (`vm.overcommit_memory=2`). Hence opt-in rather than a default behavior change.

Measurements
2000 concurrently-live 256 KiB large allocations:
RSS and VmSize are unchanged; only the committed/writable accounting grows, by roughly the total guard size (~260 MiB here). Aligned large allocations go from ~2.9 to ~1.0 VMAs/alloc.
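How the VMA counts were collected is not stated. One way to obtain a comparable VMAs-per-allocation figure, assuming Linux with `/proc` mounted, is to count the lines of `/proc/self/maps` (one per VMA) before and after making the allocations and divide the difference by the number of allocations. The helper name `count_vmas` is hypothetical.

```c
#include <stdio.h>

/*
 * Count the VMAs of the calling process: /proc/self/maps has one line per VMA.
 * Returns -1 if the file cannot be opened (for example, not on Linux).
 */
long count_vmas(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) {
        return -1;
    }

    long n = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            n++;
        }
    }

    fclose(f);
    return n;
}
```

Sampling `count_vmas()` around a batch of N live large allocations gives `(after - before) / N` as an estimate. Adjacent mappings with identical attributes can merge, so the figure is approximate.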
Under heavy churn (~1280 quarantined 1 MiB regions):
Earlier the quarantine `PROT_NONE`'d the body and split the single VMA into three, so the scheme regressed under churn (≈3200 vs 661 without labeling, and no better than `PROT_NONE` with labeling). Guard-installing the quarantined body keeps it one VMA, so the madvise scheme is now <= `PROT_NONE` in every config, including a ~3.5x reduction under `CONFIG_LABEL_MEMORY` (the Android default). (Single run; counts vary with the randomized guard sizes.)

Verification
- Builds with `-Werror`, feature off and on, with and without `CONFIG_LABEL_MEMORY`; CI matrix covers clang and musl.
- Test suite passes feature off and on.
- On a `MADV_GUARD_INSTALL` (6.13+) kernel: large-allocation guards fault on overflow, underflow, use-after-free (quarantine), and after in-place realloc shrink, both with and without `CONFIG_LABEL_MEMORY`; quarantined and shrunk regions stay single-VMA.
- `mlockall(MCL_FUTURE)` falls back to the `PROT_NONE` scheme, latches the feature off, and preserves `errno`.
- … `madvise`'s return value, so the feature must be validated on a real kernel; qemu-user silently no-ops `MADV_GUARD_INSTALL`.

Open questions

`Android.bp` is intentionally not wired up; it falls back to the `PROT_NONE` scheme via the default define.