optionally use MADV_GUARD_INSTALL for large allocation guard pages by thomasbuilds · Pull Request #341 · GrapheneOS/hardened_malloc

thomasbuilds · 2026-05-29T20:45:59Z

Addresses the high-VMA-count concern from KERNEL_FEATURE_WISHLIST.md (see #258). MADV_GUARD_INSTALL (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate PROT_NONE VMAs.

Change

Adds CONFIG_GUARD_PAGES_USE_MADVISE (default false). When enabled, guard regions for large allocations are installed with MADV_GUARD_INSTALL inside one read-write mapping rather than carved out as separate PROT_NONE mappings, keeping each large allocation to a single VMA instead of three. This is applied in allocate_pages(), allocate_pages_aligned(), the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations.
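For readers unfamiliar with the flag, the single-mapping layout described above could look roughly like the following. This is a minimal sketch and not the code in this PR: `map_with_madvise_guards` is a hypothetical helper, and the fallback `#define` assumes the value 102 from the Linux 6.13 UAPI headers for toolchains whose libc headers predate the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: value taken from the Linux 6.13 UAPI headers. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/*
 * Hypothetical helper: map guard + usable + guard as ONE read-write mapping,
 * then mark both ends as guard regions at the page-table level.
 * Returns the usable pointer, or NULL with errno set. A real allocator would
 * fall back to separate PROT_NONE mappings instead of failing.
 */
static void *map_with_madvise_guards(size_t usable, size_t guard) {
    size_t total = usable + 2 * guard;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    if (madvise(base, guard, MADV_GUARD_INSTALL) != 0 ||
        madvise(base + guard + usable, guard, MADV_GUARD_INSTALL) != 0) {
        int saved = errno; /* keep the madvise error across munmap */
        munmap(base, total);
        errno = saved;
        return NULL;
    }
    return base + guard;
}
```

Both sizes are expected to be multiples of the page size. On a kernel without the feature the helper returns NULL, typically with errno set to EINVAL.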

Kernel support is probed once at runtime and cached. Guard installation is best-effort: any madvise failure falls back to the existing PROT_NONE scheme rather than failing the allocation. In particular, MADV_GUARD_INSTALL returns EINVAL on VM_LOCKED mappings (e.g. under mlockall(MCL_FUTURE), which locks all future mappings). That case falls back, preserves errno across the fallback, and latches the feature off so it isn't retried on every allocation; this addresses the mlockall regression reported by @rdevshp. Under CONFIG_LABEL_MEMORY, the quarantined region is labeled as a whole so that PR_SET_VMA_ANON_NAME does not split the single VMA back into three.
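The probe, fallback and latch behaviour can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: the names `guard_state`, `guard_madvise_usable` and `install_guard` are hypothetical, the fallback `#define` assumes the Linux 6.13 UAPI value, and latching only on EINVAL is my reading of the description.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: value taken from the Linux 6.13 UAPI headers. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

enum { GUARD_UNKNOWN, GUARD_ON, GUARD_OFF };

/* The only cross-thread state: a single atomic flag. */
static _Atomic int guard_state = GUARD_UNKNOWN;

/* Probe kernel support once and cache the answer; errno is left untouched. */
static bool guard_madvise_usable(void) {
    int state = atomic_load_explicit(&guard_state, memory_order_relaxed);
    if (state != GUARD_UNKNOWN) {
        return state == GUARD_ON;
    }
    int saved = errno;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int result = GUARD_OFF;
    void *probe = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (probe != MAP_FAILED) {
        if (madvise(probe, page, MADV_GUARD_INSTALL) == 0) {
            result = GUARD_ON;
        }
        munmap(probe, page);
    }
    errno = saved;
    /* Racing probers: the first to publish wins, the others adopt its value. */
    int expected = GUARD_UNKNOWN;
    if (!atomic_compare_exchange_strong(&guard_state, &expected, result)) {
        result = expected;
    }
    return result == GUARD_ON;
}

/* Install a guard over [addr, addr + len); never fails because of madvise. */
static int install_guard(void *addr, size_t len) {
    if (guard_madvise_usable()) {
        int saved = errno;
        if (madvise(addr, len, MADV_GUARD_INSTALL) == 0) {
            return 0;
        }
        if (errno == EINVAL) {
            /* e.g. a VM_LOCKED mapping: latch off so it is not retried */
            atomic_store_explicit(&guard_state, GUARD_OFF, memory_order_relaxed);
        }
        errno = saved; /* callers never observe the madvise error */
    }
    return mprotect(addr, len, PROT_NONE); /* existing PROT_NONE scheme */
}
```

A caller would use `install_guard(base, guard_size)` on each end of a mapping. The result is the same from the caller's point of view either way; only the VMA count differs.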

Why off by default

In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured: the guard bytes become committed/accounted memory once they are part of a RW mapping. Resident memory and the total reserved address space are both unchanged (so RLIMIT_AS is unaffected), but the private-writable commit charge grows by the guard size, which regresses strict overcommit (vm.overcommit_memory=2). Hence opt-in rather than a default behavior change.
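The accounting difference can be seen without MADV_GUARD_INSTALL at all, since it comes from the guard bytes living in a private read-write mapping. A minimal sketch, assuming a Linux `/proc` filesystem; `vmdata_kib` and `vmdata_growth_kib` are hypothetical helpers, and VmData is used because it is the metric the measurements below label as committed:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read VmData (in KiB) from /proc/self/status without calling malloc. */
static long vmdata_kib(void) {
    char buf[4096];
    int fd = open("/proc/self/status", O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);
    if (n <= 0) {
        return -1;
    }
    buf[n] = '\0';
    const char *field = strstr(buf, "VmData:");
    if (field == NULL) {
        return -1;
    }
    return strtol(field + strlen("VmData:"), NULL, 10);
}

/* VmData growth caused by one anonymous private mapping; -1 on error. */
static long vmdata_growth_kib(int prot, size_t len) {
    long before = vmdata_kib();
    if (before < 0) {
        return -1;
    }
    void *p = mmap(NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    long after = vmdata_kib();
    munmap(p, len);
    return after < 0 ? -1 : after - before;
}
```

Calling `vmdata_growth_kib(PROT_NONE, len)` and `vmdata_growth_kib(PROT_READ | PROT_WRITE, len)` with the same length shows that only the writable mapping is charged, which is why moving guards inside the read-write mapping raises the figure by the guard size.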

Measurements

2000 concurrently-live 256 KiB large allocations:

| metric | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| VMAs | +4003 | +4 |
| VmRSS | +8108 KiB | +8108 KiB |
| VmSize (RLIMIT_AS) | +778064 KiB | +778360 KiB |
| VmData (committed) | +512092 KiB | +778452 KiB |

RSS is unchanged and VmSize is effectively unchanged (it differs by 296 KiB, under 0.04%); only the committed/writable accounting grows, by roughly the total guard size (778452 - 512092 = 266360 KiB, ~260 MiB here). Aligned large allocations go from ~2.9 to ~1.0 VMAs/alloc.
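How the VMA counts were collected is not stated; one way to reproduce this kind of number is to count the lines of `/proc/self/maps`, one line per VMA. The sketch below assumes a Linux `/proc` filesystem, and both `count_vmas` and `vma_delta_for_split` are hypothetical helpers. It reads with a stack buffer so that the measurement itself does not allocate.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the VMAs of the calling process (one line per VMA); -1 on error. */
static long count_vmas(void) {
    char buf[4096];
    long lines = 0;
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0) {
            break;
        }
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == '\n') {
                lines++;
            }
        }
    }
    close(fd);
    return lines;
}

/*
 * Show how a protection change inside a mapping splits it: map three
 * read-write pages, make the middle one PROT_NONE, and report how many
 * VMAs that added. Returns -1 on error.
 */
static long vma_delta_for_split(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    long before = count_vmas();
    int rc = mprotect(p + page, page, PROT_NONE);
    long after = count_vmas();
    munmap(p, 3 * page);
    return (rc != 0 || before < 0 || after < 0) ? -1 : after - before;
}
```

The same counter can be sampled before and after a batch of large allocations to obtain deltas like the ones in the table.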

VMA counts under heavy churn (~1280 quarantined 1 MiB regions):

| config | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| default (no labeling) | 661 | 638 |
| `CONFIG_LABEL_MEMORY` | 3605 | 1040 |

In an earlier revision, the quarantine PROT_NONE'd the body, which split the single VMA into three, so the madvise scheme regressed under churn (≈3200 vs 661 without labeling, and no better than PROT_NONE with labeling). Guard-installing the quarantined body keeps it as one VMA, so the madvise scheme is now at or below PROT_NONE in every config, including a ~3.5x reduction under CONFIG_LABEL_MEMORY (the Android default). These numbers are from a single run; counts vary with the randomized guard sizes.

Verification

- Builds cleanly under gcc with -Werror, with the feature off and on, and with and without CONFIG_LABEL_MEMORY; the CI matrix covers clang and musl. The test suite passes with the feature off and on.
- On a real MADV_GUARD_INSTALL (6.13+) kernel: large-allocation guards fault on overflow, underflow, use-after-free (quarantine), and after in-place realloc shrink, both with and without CONFIG_LABEL_MEMORY; quarantined and shrunk regions stay single-VMA.
- mlockall(MCL_FUTURE) falls back to the PROT_NONE scheme, latches the feature off, and preserves errno.
- UBSan is clean (suite + large-alloc churn + realloc shrink/grow). A multithreaded TSan stress run (8 threads churning large alloc/realloc-shrink/free, racing the one-time probe) is clean, including the probe's compare-exchange; the only cross-thread state is that single atomic flag.
- The probe trusts madvise's return value, so the feature must be validated on a real kernel; qemu-user silently no-ops MADV_GUARD_INSTALL.
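Because a successful return value does not prove that the guard is enforced, a validation run can check that touching a guarded page actually faults. A minimal sketch, assuming the Linux 6.13 UAPI value for the flag; `guard_install_faults` is a hypothetical helper, and the access happens in a forked child so the caller survives the SIGSEGV.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: value taken from the Linux 6.13 UAPI headers. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/*
 * Returns 1 if a write to a guard-installed page faults with SIGSEGV,
 * 0 if madvise reported success but the page stayed accessible (a silent
 * no-op), and -1 if the guard could not be installed or the check failed.
 */
static int guard_install_faults(void) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED ||
            madvise(p, (size_t)page, MADV_GUARD_INSTALL) != 0) {
            _exit(2); /* unsupported or refused */
        }
        *(volatile char *)p = 1; /* must fault if the guard is real */
        _exit(0);                /* reached only if the guard was a no-op */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) {
        return 1;
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return -1;
}
```

A result of 0 is the case described for qemu-user: the call appears to succeed but provides no protection.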

Open questions

- The default value, and whether requiring overcommit is acceptable.
- Android.bp is intentionally not wired up; it falls back to the PROT_NONE scheme via the default define.

rdevshp · 2026-05-30T12:34:14Z

#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>

int main(void) {
    const size_t size = 256 * 1024;

    errno = 0;
    void *warm = malloc(size);
    if (warm == NULL) {
        printf("warmup_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 2;
    }
    printf("warmup_large_malloc=ok ptr=%p\n", warm);

    errno = 0;
    int lock_ret = mlockall(MCL_FUTURE | MCL_ONFAULT);
    printf("mlockall_mcl_future_ret=%d errno=%d (%s)\n", lock_ret, errno, strerror(errno));

    errno = 0;
    void *after = malloc(size);
    if (after == NULL) {
        printf("post_mlock_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 1;
    }

    printf("post_mlock_large_malloc=ok ptr=%p\n", after);
    return 0;
}

This PR introduces a regression for the program above. With CONFIG_GUARD_PAGES_USE_MADVISE set to false the program runs normally, but with it set to true the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT) fails with errno=22 (EINVAL).

Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install large allocation guard regions with MADV_GUARD_INSTALL (Linux 6.13+) inside a single read-write mapping instead of separate PROT_NONE mappings, keeping each large allocation to one VMA instead of three. This is preserved through allocate_pages(), allocate_pages_aligned(), the region quarantine and in-place realloc shrink so it holds under allocation churn, including under CONFIG_LABEL_MEMORY where the quarantined region is named as a whole to avoid splitting the VMA. Kernel support is probed and cached at runtime. Guard installation is best-effort: it falls back to the PROT_NONE scheme whenever madvise fails, including EINVAL on VM_LOCKED mappings from mlockall(MCL_FUTURE), which also latches the feature off to avoid retrying on every allocation. It is off by default because the guard bytes are then accounted as committed memory (resident memory and total address space are unchanged), which regresses strict overcommit (vm.overcommit_memory=2).

thomasbuilds marked this pull request as draft May 30, 2026 19:21

thomasbuilds force-pushed the madvise-guard-install branch from 9e3e3a6 to f54ee16 Compare May 30, 2026 19:43

thomasbuilds marked this pull request as ready for review June 6, 2026 08:36

thomasbuilds force-pushed the madvise-guard-install branch from f54ee16 to ded5838 Compare June 6, 2026 11:48

thomasbuilds force-pushed the madvise-guard-install branch from ded5838 to 35a0009 Compare June 6, 2026 12:08

thomasbuilds wants to merge 1 commit into GrapheneOS:main from thomasbuilds:madvise-guard-install
