
optionally use MADV_GUARD_INSTALL for large allocation guard pages #341

Open
thomasbuilds wants to merge 1 commit into GrapheneOS:main from thomasbuilds:madvise-guard-install

Conversation

thomasbuilds (Contributor) commented May 29, 2026

Addresses the high-VMA-count concern from KERNEL_FEATURE_WISHLIST.md (see #258). MADV_GUARD_INSTALL (Linux 6.13+) lets guard regions live inside a single read-write mapping at the page-table level instead of as separate PROT_NONE VMAs.

Change

Adds CONFIG_GUARD_PAGES_USE_MADVISE (default false). When enabled, guard regions for large allocations are installed with MADV_GUARD_INSTALL inside one read-write mapping rather than carved out as separate PROT_NONE mappings, keeping each large allocation to a single VMA instead of three. This is applied in allocate_pages(), allocate_pages_aligned(), the region quarantine, and the in-place realloc shrink, so the single-VMA property holds under allocation churn rather than only for live allocations.
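
The difference between the two schemes can be sketched outside the allocator. This is a minimal illustration, not the code from this PR: the helper names `map_prot_none_guards` and `map_madvise_guards` are hypothetical, sizes are assumed to be multiples of the page size, and the fallback definition of `MADV_GUARD_INSTALL` assumes the value from the Linux 6.13 uapi headers for libc headers that predate it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux 6.13 uapi value, for older libc headers */
#endif

/* PROT_NONE scheme: reserve the whole range inaccessible, then open up the
 * middle. The kernel tracks guard | usable | guard as up to three VMAs. */
static void *map_prot_none_guards(size_t usable, size_t guard) {
    size_t total = guard + usable + guard;
    char *base = mmap(NULL, total, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (mprotect(base + guard, usable, PROT_READ | PROT_WRITE) != 0) {
        int saved = errno;
        munmap(base, total);
        errno = saved;
        return NULL;
    }
    return base + guard;
}

/* madvise scheme: one read-write mapping, with both ends marked as guard
 * regions at the page-table level, so the range can stay a single VMA.
 * Returns NULL if the kernel rejects MADV_GUARD_INSTALL (e.g. pre-6.13). */
static void *map_madvise_guards(size_t usable, size_t guard) {
    size_t total = guard + usable + guard;
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (madvise(base, guard, MADV_GUARD_INSTALL) != 0 ||
        madvise(base + guard + usable, guard, MADV_GUARD_INSTALL) != 0) {
        int saved = errno;
        munmap(base, total);
        errno = saved;
        return NULL;
    }
    return base + guard;
}
```

In both cases the caller receives a pointer to the usable middle; only the way the surrounding guard bytes are represented in the kernel differs.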

Kernel support is probed once at runtime and cached, and guard installation is best-effort: any madvise failure falls back to the existing PROT_NONE scheme rather than failing the allocation. In particular, MADV_GUARD_INSTALL returns EINVAL on VM_LOCKED mappings (e.g. under mlockall(MCL_FUTURE), which locks all future mappings). That case falls back to PROT_NONE, preserves errno across the fallback, and latches the feature off so it isn't retried on every allocation; this addresses the mlockall regression reported by @rdevshp. Under CONFIG_LABEL_MEMORY the quarantined region is labeled as a whole so PR_SET_VMA_ANON_NAME does not split the single VMA back into three.
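
A probe-once, best-effort install along these lines might look as follows. This is a sketch under stated assumptions rather than the PR's implementation: the names `guard_state`, `guard_madvise_usable` and `install_guard` are hypothetical, the region passed to `install_guard` is assumed to be page-aligned and inside a read-write mapping, and latching off only on EINVAL is one reading of the description above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux 6.13 uapi value, for older libc headers */
#endif

enum { GUARD_UNKNOWN, GUARD_AVAILABLE, GUARD_UNAVAILABLE };

/* The only cross-thread state: a single atomic flag. */
static _Atomic int guard_state = GUARD_UNKNOWN;

/* Probe kernel support once and cache the answer. errno is left untouched. */
static bool guard_madvise_usable(void) {
    int state = atomic_load(&guard_state);
    if (state != GUARD_UNKNOWN) return state == GUARD_AVAILABLE;

    int saved = errno;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int result = GUARD_UNAVAILABLE;
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        if (madvise(p, page, MADV_GUARD_INSTALL) == 0) result = GUARD_AVAILABLE;
        munmap(p, page);
    }
    errno = saved;

    /* Racing threads may both probe; the first compare-exchange wins. */
    int expected = GUARD_UNKNOWN;
    if (atomic_compare_exchange_strong(&guard_state, &expected, result))
        return result == GUARD_AVAILABLE;
    return expected == GUARD_AVAILABLE;
}

/* Turn [addr, addr + len) into a guard region, falling back to PROT_NONE. */
static int install_guard(void *addr, size_t len) {
    int saved = errno;
    if (guard_madvise_usable()) {
        if (madvise(addr, len, MADV_GUARD_INSTALL) == 0) return 0;
        /* EINVAL is what a VM_LOCKED mapping produces: latch the feature off
         * so later allocations skip straight to the fallback. */
        if (errno == EINVAL) atomic_store(&guard_state, GUARD_UNAVAILABLE);
        errno = saved; /* do not leak the madvise error to the caller */
    }
    return mprotect(addr, len, PROT_NONE);
}
```

The fallback still yields an inaccessible region, so a caller only sees a failure if mprotect itself fails.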

Why off by default

In #258 it was noted this would "require having full overcommit enabled if it doesn't reduce the accounted memory", and that is what I measured: the guard bytes become committed/accounted memory once they are part of a RW mapping. Resident memory and the total reserved address space are both unchanged (so RLIMIT_AS is unaffected), but the private-writable commit charge grows by the guard size, which regresses strict overcommit (vm.overcommit_memory=2). Hence opt-in rather than a default behavior change.

Measurements

2000 concurrently-live 256 KiB large allocations:

| metric | PROT_NONE | MADV_GUARD_INSTALL |
| --- | --- | --- |
| VMAs | +4003 | +4 |
| VmRSS | +8108 KiB | +8108 KiB |
| VmSize (RLIMIT_AS) | +778064 KiB | +778360 KiB |
| VmData (committed) | +512092 KiB | +778452 KiB |

RSS and VmSize are unchanged; only the committed/writable accounting grows, by roughly the total guard size (~260 MiB here). Aligned large allocations go from ~2.9 to ~1.0 VMAs/alloc.
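
VMA deltas like the ones above can be reproduced by counting lines in /proc/self/maps before and after a batch of allocations, since each line corresponds to one VMA. The following is a rough sketch, not the harness used for these numbers; `count_vmas` and `vma_delta` are hypothetical names, and the file is read with raw `read` calls into a stack buffer so the measurement does not itself allocate.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of VMAs in the calling process, or -1 if /proc is unreadable. */
static long count_vmas(void) {
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0) return -1;
    char buf[4096];
    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == '\n') lines++;
        }
    }
    close(fd);
    return n < 0 ? -1 : lines;
}

/* Allocate `count` blocks of `size` bytes into ptrs[] and store the change
 * in VMA count in *delta. Returns 0 on success, -1 on any failure (blocks
 * allocated so far remain in ptrs[]; unfilled slots are NULL). */
static int vma_delta(size_t count, size_t size, void **ptrs, long *delta) {
    for (size_t i = 0; i < count; i++) ptrs[i] = NULL;
    long before = count_vmas();
    if (before < 0) return -1;
    for (size_t i = 0; i < count; i++) {
        ptrs[i] = malloc(size);
        if (ptrs[i] == NULL) return -1;
    }
    long after = count_vmas();
    if (after < 0) return -1;
    *delta = after - before;
    return 0;
}
```

Calling `vma_delta(2000, 256 * 1024, ptrs, &delta)` with the allocator preloaded would approximate the first row of the table; the exact figure depends on how adjacent mappings merge.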

Under heavy churn (~1280 quarantined 1 MiB regions):

| config | VMAs (PROT_NONE) | VMAs (MADV_GUARD_INSTALL) |
| --- | --- | --- |
| default (no labeling) | 661 | 638 |
| CONFIG_LABEL_MEMORY | 3605 | 1040 |

In an earlier revision, the quarantine applied PROT_NONE to the body, which split the single VMA into three, so the madvise scheme regressed under churn (≈3200 vs 661 without labeling, and no better than PROT_NONE with labeling). Guard-installing the quarantined body keeps it one VMA, so the madvise scheme is now <= PROT_NONE in every config, including a ~3.5x reduction under CONFIG_LABEL_MEMORY (the Android default). (Single run; counts vary with the randomized guard sizes.)
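
The labeling interaction comes from the kernel treating the anonymous name as part of a VMA's identity: naming only a sub-range forces a split, while naming the whole range does not. A minimal sketch of whole-region labeling is below. `label_region` and `demo_label_whole` are hypothetical helpers, the name string is made up, and the fallback constants assume the values from the Linux uapi headers; the call fails on kernels built without anonymous VMA naming.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Attach a name to [addr, addr + len). Returns 0 on success, -1 on error. */
static int label_region(void *addr, size_t len, const char *name) {
    return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                 (unsigned long)addr, (unsigned long)len, (unsigned long)name);
}

/* Map guard | body | guard as one read-write mapping and name all of it at
 * once, so the name alone gives the kernel no reason to split the VMA.
 * Naming only the body (base + guard, body bytes) would split it in three. */
static int demo_label_whole(size_t guard, size_t body) {
    size_t total = guard + body + guard;
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return -1;
    int ret = label_region(base, total, "example_quarantine");
    munmap(base, total);
    return ret;
}
```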

Verification

  • Builds clean under gcc with -Werror, feature off and on, with and without CONFIG_LABEL_MEMORY; CI matrix covers clang and musl. Test suite passes feature off and on.
  • On a real MADV_GUARD_INSTALL (6.13+) kernel: large-allocation guards fault on overflow, underflow, use-after-free (quarantine), and after in-place realloc shrink, both with and without CONFIG_LABEL_MEMORY; quarantined and shrunk regions stay single-VMA.
  • mlockall(MCL_FUTURE) falls back to the PROT_NONE scheme, latches the feature off, and preserves errno.
  • UBSan clean (suite + large-alloc churn + realloc shrink/grow). TSan multithreaded stress (8 threads churning large alloc/realloc-shrink/free, racing the one-time probe) is clean, including the probe's compare-exchange; the only cross-thread state is that single atomic flag.
  • The probe trusts madvise's return value, so the feature must be validated on a real kernel; qemu-user silently no-ops MADV_GUARD_INSTALL.
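
Because a successful return value alone does not prove that guards are enforced, a test can check the behavior directly. The sketch below is one way to do that and is not taken from the PR's test suite: `access_faults` and `check_guard_install` are hypothetical helpers, and the fallback `MADV_GUARD_INSTALL` definition assumes the Linux 6.13 uapi value. It forks a child that writes to the address and reports whether the child died with SIGSEGV.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102 /* Linux 6.13 uapi value, for older libc headers */
#endif

/* True if writing to addr kills a forked child with SIGSEGV. */
static bool access_faults(volatile char *addr) {
    pid_t pid = fork();
    if (pid < 0) return false;
    if (pid == 0) {
        *addr = 1; /* expected to fault when addr is guarded */
        _exit(0);  /* reached only if the write succeeded */
    }
    int status;
    if (waitpid(pid, &status, 0) != pid) return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}

/* Returns 1 if a guard was installed and faults on access, 0 if the kernel
 * rejected the request, and -1 if madvise reported success but the page was
 * still writable (the silent no-op case). */
static int check_guard_install(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 0;
    int result = 0;
    if (madvise(p, page, MADV_GUARD_INSTALL) == 0)
        result = access_faults(p) ? 1 : -1;
    munmap(p, page);
    return result;
}
```

A fork per check is far too expensive for the allocator's own probe, so this kind of check fits a test harness rather than the runtime path.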

Open questions

  • The default value, and whether requiring overcommit is acceptable.
  • Android.bp is intentionally not wired up; it falls back to the PROT_NONE scheme via the default define.

rdevshp (Contributor) commented May 30, 2026

#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>

int main(void) {
    const size_t size = 256 * 1024;

    errno = 0;
    void *warm = malloc(size);
    if (warm == NULL) {
        printf("warmup_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 2;
    }
    printf("warmup_large_malloc=ok ptr=%p\n", warm);

    errno = 0;
    int lock_ret = mlockall(MCL_FUTURE | MCL_ONFAULT);
    printf("mlockall_mcl_future_ret=%d errno=%d (%s)\n", lock_ret, errno, strerror(errno));

    errno = 0;
    void *after = malloc(size);
    if (after == NULL) {
        printf("post_mlock_large_malloc=failed errno=%d (%s)\n", errno, strerror(errno));
        return 1;
    }

    printf("post_mlock_large_malloc=ok ptr=%p\n", after);
    return 0;
}

This PR regresses the program above. With CONFIG_GUARD_PAGES_USE_MADVISE set to false the program runs normally, but with CONFIG_GUARD_PAGES_USE_MADVISE set to true the malloc after mlockall(MCL_FUTURE | MCL_ONFAULT) fails with errno=22 (EINVAL).

thomasbuilds marked this pull request as draft May 30, 2026 19:21
thomasbuilds force-pushed the madvise-guard-install branch from 9e3e3a6 to f54ee16 May 30, 2026 19:43
thomasbuilds marked this pull request as ready for review June 6, 2026 08:36
thomasbuilds force-pushed the madvise-guard-install branch from f54ee16 to ded5838 June 6, 2026 11:48

Add CONFIG_GUARD_PAGES_USE_MADVISE (default false) to install large allocation
guard regions with MADV_GUARD_INSTALL (Linux 6.13+) inside a single read-write
mapping instead of separate PROT_NONE mappings, keeping each large allocation to
one VMA instead of three. This is preserved through allocate_pages(),
allocate_pages_aligned(), the region quarantine and in-place realloc shrink so it
holds under allocation churn, including under CONFIG_LABEL_MEMORY where the
quarantined region is named as a whole to avoid splitting the VMA.

Kernel support is probed and cached at runtime. Guard installation is
best-effort: it falls back to the PROT_NONE scheme whenever madvise fails,
including EINVAL on VM_LOCKED mappings from mlockall(MCL_FUTURE), which also
latches the feature off to avoid retrying on every allocation.

It is off by default because the guard bytes are then accounted as committed
memory (resident memory and total address space are unchanged), which regresses
strict overcommit (vm.overcommit_memory=2).

thomasbuilds force-pushed the madvise-guard-install branch from ded5838 to 35a0009 June 6, 2026 12:08
2 participants