perform size checks on memcpy/memmove/memset by SkewedZeppelin · Pull Request #252 · GrapheneOS/hardened_malloc

SkewedZeppelin · 2025-03-21T22:51:09Z

for #231

largely works system wide on fedora 41 & 42

task list

compared to others:

redfat: memchr, memcmp, memcpy, memmove, memrchr, memset, strcasecmp, strcasestr, strcat, strchr, strchrnul, strcmp, strcpy, strlen, strncasecmp, strncat, strncmp, strncpy, strnlen, strrchr, strstr
this patchset: bcopy, memccpy, memcpy, memmove, mempcpy, memset, swab, wmemcpy, wmemmove, wmempcpy, wmemset
isoalloc: memcpy, memmove, memset
snmalloc: memcpy, and previously memmove (disabled due to issue found by fuzzing)

SkewedZeppelin · 2025-03-22T01:47:42Z

@thestinger is it expected that malloc_object_size would return negative values?

thestinger · 2025-03-22T01:49:41Z

@SkewedZeppelin It returns size_t so it can't be negative. When it doesn't know the answer, it returns SIZE_MAX. Are you treating it as signed somewhere?

SkewedZeppelin · 2025-03-22T02:03:09Z

hardened_malloc/h_malloc.c

Line 1852 in 4fe9018

size_t size = region == NULL ? SIZE_MAX : region->size;

is negative sometimes

thestinger · 2025-03-22T02:04:47Z

@SkewedZeppelin It's a size_t, it can't be negative. SIZE_MAX is the maximum size_t value. You must be printing it as a signed integer where it would be -1.

SkewedZeppelin · 2025-03-22T02:13:17Z

apologies I'm dumb and was using %ld not %lu

thestinger · 2025-03-22T02:15:35Z

@SkewedZeppelin It's not particularly important but you should use %zu for size_t.

thestinger · 2025-03-22T02:16:26Z

It matters on Windows, where long is 32-bit even on 64-bit targets, and that could at least theoretically be the case elsewhere.
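As a minimal sketch of the point about format specifiers (the helper name `format_size` is made up for illustration and is not part of hardened_malloc), `%zu` prints a `size_t` at its native width, so the "unknown" sentinel `SIZE_MAX` appears as a large positive number rather than -1:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: format a size_t into buf using the matching
// conversion specifier. Returns what snprintf returns, i.e. the number
// of characters needed, excluding the terminating NUL.
static int format_size(char *buf, size_t buf_len, size_t value) {
    assert(buf != NULL && buf_len > 0);
    // %zu is the C99 specifier for size_t. %ld and %lu assume size_t has
    // the width of long, which does not hold on 64-bit Windows.
    return snprintf(buf, buf_len, "%zu", value);
}
```

Calling `format_size(buf, sizeof(buf), SIZE_MAX)` on a typical 64-bit Linux target produces `18446744073709551615`, the same bit pattern that `%ld` rendered as -1.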

SkewedZeppelin · 2025-03-22T02:16:54Z

is it worth it to zero the remainder of dst? my only issue is that sometimes it is close to size_max/unknown

thestinger · 2025-03-22T02:20:56Z

@SkewedZeppelin That wouldn't be safe since it's not known what they're doing with it. They could be intentionally only copying to part of it. It can be a copy to the middle of it, etc.
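To illustrate why zeroing the rest of the destination would break valid callers, here is a minimal sketch of a caller that deliberately copies into the middle of a larger buffer and relies on the surrounding bytes staying intact. The function name and buffer layout are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical caller: overwrite one field in the middle of a record and
// expect the bytes before and after it to be left untouched.
static void patch_field(char *record, size_t record_len,
                        size_t offset, const char *field, size_t field_len) {
    assert(offset <= record_len && field_len <= record_len - offset);
    // Only [offset, offset + field_len) is written. A wrapper that zeroed
    // the remainder of the allocation would destroy the tail of the record.
    memcpy(record + offset, field, field_len);
}
```

Patching 4 bytes at offset 4 of a 12-byte record must leave the last 4 bytes as they were, which a "zero the remainder" policy would violate.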

agnosticlines · 2026-04-29T19:03:06Z

Hi! I've been following this PR for a while and would love to see it land. I saw the performance concerns and took a look into them. It seems you wrote off the "figure out if it is possible to use the real underlying functions for better per-arch performance" item from the original issue; is there a specific reason for that beyond it feeling unsafe?

I wrote some code that uses dlsym(RTLD_NEXT, ...) to resolve the functions relevant to the hot paths

#include <dlfcn.h>

static void *(*real_memcpy)(void *restrict, const void *restrict, size_t) = musl_memcpy;
static void *(*real_memmove)(void *, const void *, size_t) = musl_memmove;
static void *(*real_memset)(void *, int, size_t) = musl_memset;

__attribute__((constructor(102)))
static void resolve_block_ops(void) {
    void *sym;
    sym = dlsym(RTLD_NEXT, "memcpy");
    if (sym && sym != memcpy) real_memcpy = sym;
    sym = dlsym(RTLD_NEXT, "memmove");
    if (sym && sym != memmove) real_memmove = sym;
    sym = dlsym(RTLD_NEXT, "memset");
    if (sym && sym != memset) real_memset = sym;
}

From here you'd just call the real_* functions instead of the musl_* variants on the hot paths. The musl implementations would stay for h_memcpy_internal and as a pre-constructor fallback (in case something calls memcpy before the constructor runs, maybe for symbol resolution? Not sure if that's possible in practice, but I figured it's better to be safe than sorry).

I applied this locally (and ran the test suite which passed 52/52) and wrote a small benchmark:

benchmark source + output

// bench.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;
void *(*volatile mset)(void*, int, size_t) = memset;

int main(void) {
    size_t sizes[] = { 4<<20, 8<<20, 16<<20, 32<<20, 64<<20, 128<<20, 256<<20, 512<<20, 1024<<20 };
    int n = sizeof(sizes)/sizeof(sizes[0]);
    int iters = 50;
    for (int i = 0; i < n; i++) {
        void *src = malloc(sizes[i]);
        void *dst = malloc(sizes[i]);
        if (!src || !dst) break;
        mset(src, 0xAA, sizes[i]);
        // warmup
        for (int j = 0; j < 3; j++) {
            mcpy(dst, src, sizes[i]);
            asm volatile("" : : "r"(dst) : "memory");
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < iters; j++) {
            mcpy(dst, src, sizes[i]);
            asm volatile("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = ((t1.tv_sec-t0.tv_sec)*1000.0 + (t1.tv_nsec-t0.tv_nsec)/1e6) / iters;
        printf("%4zuMB = %.3f ms\n", sizes[i]>>20, ms);
        free(src);
        free(dst);
    }
}

$ gcc -O3 -fno-builtin -o bench bench.c

$ ./bench
   4MB = 0.270 ms
   8MB = 0.545 ms
  16MB = 1.435 ms
  32MB = 2.976 ms
  64MB = 6.080 ms
 128MB = 12.225 ms
 256MB = 23.759 ms
 512MB = 43.041 ms
1024MB = 90.316 ms

$ LD_PRELOAD=libhardened_malloc_musl.so ./bench
   4MB = 0.270 ms
   8MB = 0.564 ms
  16MB = 1.875 ms
  32MB = 4.815 ms
  64MB = 10.517 ms
 128MB = 22.669 ms
 256MB = 41.873 ms
 512MB = 106.389 ms
1024MB = 169.493 ms

$ LD_PRELOAD=libhardened_malloc_dlsym.so ./bench
   4MB = 0.262 ms
   8MB = 0.524 ms
  16MB = 1.146 ms
  32MB = 2.215 ms
  64MB = 5.637 ms
 128MB = 12.210 ms
 256MB = 23.888 ms
 512MB = 44.732 ms
1024MB = 92.406 ms

| Size | glibc | PR (musl) | PR + dlsym |
| --- | --- | --- | --- |
| 4MB | 0.270ms | 0.270ms | 0.262ms |
| 8MB | 0.545ms | 0.564ms | 0.524ms |
| 16MB | 1.435ms | 1.875ms (1.31x) | 1.146ms (0.80x) |
| 32MB | 2.976ms | 4.815ms (1.62x) | 2.215ms (0.74x) |
| 64MB | 6.080ms | 10.517ms (1.73x) | 5.637ms (0.93x) |
| 128MB | 12.225ms | 22.669ms (1.85x) | 12.210ms (1.00x) |
| 256MB | 23.759ms | 41.873ms (1.76x) | 23.888ms (1.01x) |
| 512MB | 43.041ms | 106.389ms (2.47x) | 44.732ms (1.04x) |
| 1024MB | 90.316ms | 169.493ms (1.88x) | 92.406ms (1.02x) |

The perf hit comes almost entirely from using the unoptimised musl memcpy instead of the AVX-optimised glibc one; malloc_object_size itself only costs roughly 10-15 ns per call:

malloc_object_size overhead measurement covering slab + large

// overhead.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

size_t malloc_object_size(void *p);

int main(void) {
    char *p = malloc(4 << 20);
    if (!p) return 1;

    int iters = 100000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t s = malloc_object_size(p);
        asm volatile("" : : "r"(s) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
    printf("malloc_object_size on 4MB alloc: %.0f ns/call\n", ns);

    char *q = malloc(64);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t s = malloc_object_size(q);
        asm volatile("" : : "r"(s) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
    printf("malloc_object_size on 64B alloc: %.0f ns/call\n", ns);

    free(p);
    free(q);
    return 0;
}

$ gcc -O3 -fno-builtin -o overhead overhead.c -lhardened_malloc
$ ./overhead
malloc_object_size on 4MB alloc: 10 ns/call
malloc_object_size on 64B alloc: 13 ns/call

To address some of your concerns: regarding "dlsym doesn't seem to work with all program such as mutter-x11-frames", you say you can't reproduce this anymore, so I'm going to assume it was a transient issue?

"this doesn't necessarily pull from libc, but can pull from other libraries": I tested this with a second entry in the LD_PRELOAD chain which I assume is the concern you had:

test source + output

// other.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

void *memcpy(void *restrict dst, const void *restrict src, size_t n) {
    static void *(*next)(void*, const void*, size_t) = NULL;
    if (!next) next = dlsym(RTLD_NEXT, "memcpy");
    return next(dst, src, n);
}

// overflow_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;

int main(void) {
    char *a = malloc(16);
    char *b = malloc(8);
    memset(b, 'B', 8);
    mcpy(a, b, 16);
    printf("no abort\n");
    return 0;
}

$ LD_PRELOAD="libhardened_malloc_dlsym.so libother.so" ./overflow_test
fatal allocator error: memcpy read overflow
Aborted
exit: 134

$ LD_PRELOAD="libhardened_malloc_dlsym.so libother.so" ./bench
   4MB = 0.274 ms
   8MB = 0.561 ms
  16MB = 1.686 ms
  32MB = 3.009 ms
  64MB = 6.071 ms
 128MB = 12.473 ms
 256MB = 26.687 ms
 512MB = 45.956 ms
1024MB = 99.686 ms

"it feels unsafe": every failure mode falls back to musl which is correct just slow:

constructor hasn't fired yet -> use musl_memcpy from static init
dlsym returns NULL -> if (sym && ...) fails -> keeps musl_memcpy
dlsym returns our own symbol -> sym == memcpy -> keeps musl_memcpy
dlsym internally calls memcpy during resolution -> wrapper fires -> calls real_memcpy which is still musl_memcpy at that point -> no recursion
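The fallback cases listed above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the PR's code: `fallback_copy` stands in for `musl_memcpy`, `libc_copy` stands in for whatever a `dlsym(RTLD_NEXT, "memcpy")` lookup would return, and `select_copy` is a hypothetical helper that mirrors the `if (sym && sym != memcpy)` test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *restrict, const void *restrict, size_t);

// Stand-in for the bundled portable implementation (musl_memcpy in the PR).
static void *fallback_copy(void *restrict dst, const void *restrict src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

// Stand-in for the optimised routine a dlsym lookup might return.
static void *libc_copy(void *restrict dst, const void *restrict src, size_t n) {
    return memcpy(dst, src, n);
}

// Statically initialised, so it is usable before any constructor runs.
static copy_fn active_copy = fallback_copy;

// Hypothetical helper: `candidate` is the lookup result, `self` is the
// address of the interposing wrapper. The fallback stays in place when
// the lookup failed or resolved back to the wrapper itself.
static void select_copy(copy_fn candidate, copy_fn self) {
    assert(self != NULL);
    if (candidate != NULL && candidate != self) {
        active_copy = candidate;
    }
}
```

Because `active_copy` always holds a valid function, a call made during symbol resolution simply takes the slower path instead of recursing or crashing.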

For environmental concerns (some of these are stretches just to show I've thought about all the possible failure modes):

musl is the system libc (alpine, etc) -> sym != memcpy (they are different addresses: musl's impl vs our impl) -> we resolve a slower memcpy
statically linked hardened_malloc -> dlsym returns NULL -> keeps musl_memcpy
library providing resolved memcpy gets dlclosed (stretch) -> can't dlclose libc
IFUNC not resolved yet (stretch) -> IFUNC resolvers run at load time (before constructors)
signal during constructor (definitely a stretch but I couldn't think of any other failure modes) -> pointer write is naturally aligned, atomic on both x86 and arm64
android/bionic support -> not tested at all but CONFIG_BLOCK_OPS_CHECK_SIZE defaults to false in Android.bp, though bionic supports dlsym(RTLD_NEXT, ...) so it should work if enabled
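The signal/atomicity item in the list above relies on the hardware performing an aligned pointer store atomically. If a guarantee from the language standard were wanted as well, a C11 atomic would be one option. The following is a sketch of that alternative, not what the PR does, and all names in it are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef void *(*set_fn)(void *, int, size_t);

// Stand-in for the bundled portable memset.
static void *fallback_set(void *dst, int value, size_t n) {
    unsigned char *d = dst;
    while (n--) *d++ = (unsigned char)value;
    return dst;
}

// The active implementation, readable and writable without a data race.
static _Atomic(set_fn) active_set = fallback_set;

// Publish a resolved implementation, e.g. publish_set(memset) once the
// libc routine has been looked up.
static void publish_set(set_fn candidate) {
    assert(candidate != NULL);
    atomic_store_explicit(&active_set, candidate, memory_order_release);
}

// Hot path: load the current implementation once, then call it.
static void *dispatch_set(void *dst, int value, size_t n) {
    set_fn fn = atomic_load_explicit(&active_set, memory_order_acquire);
    return fn(dst, value, n);
}
```

On x86-64 and arm64 these operations are expected to compile down to ordinary loads and stores, so the extra guarantee should cost little, though that has not been measured here.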

If the deal breaker for this PR was the perf hit I believe this approach would solve it and hopefully get this landed! (great work btw!)

SkewedZeppelin · 2026-04-29T20:03:59Z

I tested this with a second entry in the LD_PRELOAD chain which I assume is the concern you had

it is weird, see #252 (comment)
I rebased and pushed your code as a separate commit for anyone to test with.

there was also a force-pushed commit somewhere a bit back where I did import some assembly versions from musl, but my issue with that was the added maintenance burden for per-arch code and any changes

deal breaker

it still doesn't work under clang for some reason (at least as of a year ago or whatever)

agnosticlines · 2026-04-29T21:30:10Z

Amazing decision from GitHub to hide comments like that, though it's partially on me as I've been following this thread for a while and should have remembered.

So, the clang issues: you've mentioned a bunch of things and I think this may fix all of them. I don't have a non-headless Linux system atm so I can't test chrome/gnome/etc, you'll need to do that, sorry :(.

I tried to build it locally and had to pass -fuse-ld=lld because clang LTO doesn't link with GNU ld (I assume Fedora links fine with the gold plugin, which Ubuntu doesn't seem to have). That worked, but it segfaulted immediately. I tracked the issue down to the LoopIdiomRecognize pass, which sees the byte/word copy loop inside musl_memcpy and replaces it with call memcpy@plt, which causes infinite recursion.

Can you build it locally with -fno-builtin -mllvm -disable-loop-idiom-all on the musl files and see if the issues you described (size max, gnome app crashes, mutter not working) still occur?

Not sure if I'm just fixing an issue with my setup or if this is the bigger underlying issue for clang but I can see a possible path that would cause this bug to present in the ways you described, either way with those options it builds fine and everything seems to work: 52/52, overflow detection works, malloc_object_size() works.

It's a pretty interesting bug I might dig into a little more, if it doesn't work let me know and I'll take a look tomorrow.

SkewedZeppelin · 2026-04-29T21:37:49Z

last I tried it wasn't recursing, it did in earlier versions, but it was instead effectively no-op since the size was always max instead of the actual allocation size

agnosticlines · 2026-04-29T21:54:48Z

Ah... I think I scrolled past you talking about that, sorry about that. I can't test gnome etc but I tried rustc mentioned earlier as not working and it worked fine for me.

What specifically wasn't working for you in clang? Because it works fine for me, so the next thing to work out is what's different about our setups.

agnosticlines · 2026-04-30T12:07:05Z

last I tried it wasn't recursing, it did in earlier versions, but it was instead effectively no-op since the size was always max instead of the actual allocation size

the theory I originally had before I found the recursion was that clang's LTO was constant folding ro.slab_region_start as 0, since it's set at runtime in the constructor but first initialised to 0 statically, which would optimise the slab region check to always false. Every pointer would then fall through to regions_find -> NULL -> SIZE_MAX

fwiw enforce_init isn't present in malloc_object_size, so if the allocator was in a broken state it would silently return SIZE_MAX instead of aborting, which would match what you saw. I think you'd notice failures sooner though; what test code were you running for malloc_object_size? The only way I can think of that being a problem is if the SIGSEGV from the recursion was caught by another signal handler (gnome-session/GLib crash handling?) and the allocator continued on in a broken state.

Both of these are speculation, if you can get a reproduction or steps so I can build it on the same setup as you I can try reproduce and look into it some more, because again, the clang build works fine for me currently.

SkewedZeppelin · 2026-04-30T18:29:26Z

reproduction

the included test cases:

gcc 16.0.1
....................................................
----------------------------------------------------------------------
Ran 52 tests in 0.108s

OK


clang 22.1.4
....................................................
----------------------------------------------------------------------
Ran 52 tests in 0.108s

OK


gcc+bosc
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.134s

OK


clang+bosc
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF..FF.F.FFFFFFFFFFFFFFFFFFFFF..FF.
----------------------------------------------------------------------
Ran 64 tests in 0.664s

FAILED (failures=57)

agnosticlines · 2026-04-30T18:50:32Z

Oh, sorry I thought I was running them already but they were all commented out in the branch. I uncommented all 12 and ran them again and got 60/64.

Looked into the ones that failed (memcpy and memmove) and saw the memcpy wrapper wasn't there, clang had inlined it.

Fixed by adding: -fno-builtin to SHARED_FLAGS.

64/64, works fine.

................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.854s

OK

Tested clang 18, 20 and 23. Did you try building/testing on a Debian-based distro? Also, was "clang+bosc" in your testing output built with the extra flags? Because without the original fix it will fail in exactly the same way:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF..FF.F.FFFFFFFFFFFFFFFFFFFFF..FF.
----------------------------------------------------------------------
Ran 64 tests in 1.376s

FAILED (failures=57)

The 7 tests that pass all expect a SIGSEGV, so they pass by accident:

getrandom("\x00\x85\xc0\x70...", 40, 0)                              = 40
mmap(NULL, 271971237888, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...)   = 0x7e823aba6000
mprotect(0x7e8f82d4f000, 25862144, PROT_READ|PROT_WRITE)             = 0
mprotect(0x7e8f845f9000, 3072, PROT_READ|PROT_WRITE)                 = 0
mmap(NULL, 13469017440256, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...) = 0x72423aa00000
...
mprotect(0x72663c6cf000, 4096, PROT_READ|PROT_WRITE)                 = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ee3a7eb2ff8} ---
+++ killed by SIGSEGV +++

Disassembly:

0000000000000000 <musl_memcpy>:
   0:   f3 0f 1e fa             endbr64
   4:   41 57                   push   %r15
   6:   41 56                   push   %r14
   8:   41 54                   push   %r12
   a:   53                      push   %rbx
   b:   50                      push   %rax
   c:   49 89 d6                mov    %rdx,%r14
   f:   49 89 f7                mov    %rsi,%r15
  12:   48 89 fb                mov    %rdi,%rbx
  15:   41 f6 c7 03             test   $0x3,%r15b
  19:   0f 84 e6 00 00 00       je     105 <musl_memcpy+0x105>
  ...
  47:   ff 15 00 00 00 00       call   *0x0(%rip)
                    49: R_X86_64_GOTPCRELX    memcpy-0x4

memcpy is resolved through the GOT so we land back in the wrapper again, which calls musl_memcpy again, which lands us here again.

You don't need -mllvm -disable-loop-idiom-all, I just tested it again with just -fno-builtin and it works fine, I must have been using old object files and forgot to make clean or something.

Also your tests won't pass on Ubuntu because it injects -D_FORTIFY_SOURCE=3 which replaces memcpy with __memcpy_chk bypassing the wrapper.

Patch:

fix makefiles

diff --git a/Makefile b/Makefile
index b8ce9dd..90444f4 100644
--- a/Makefile
+++ b/Makefile
@@ -146,17 +146,17 @@ $(OUT)/util.o: util.c util.h $(CONFIG_FILE) | $(OUT)
 	$(COMPILE.c) $(OUTPUT_OPTION) $<
 
 $(OUT)/memcpy.o: memcpy.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memccpy.o: memccpy.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memmove.o: memmove.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memset.o: memset.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/swab.o: swab.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/wmemset.o: wmemset.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin $(OUTPUT_OPTION) $<
 
 check: tidy
 
diff --git a/test/Makefile b/test/Makefile
index 80221cc..6596801 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -16,7 +16,7 @@ CPPFLAGS := \
     -DSLAB_CANARY=$(CONFIG_SLAB_CANARY) \
     -DCONFIG_EXTENDED_SIZE_CLASSES=$(CONFIG_EXTENDED_SIZE_CLASSES)
 
-SHARED_FLAGS := -O3
+SHARED_FLAGS := -O3 -fno-builtin
 
 CFLAGS := -std=c17 $(SHARED_FLAGS) -Wmissing-prototypes
 CXXFLAGS := -std=c++17 -fsized-deallocation $(SHARED_FLAGS)

Test results (with -U_FORTIFY_SOURCE in SHARED_FLAGS as well)

gcc:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.972s
OK

clang 18:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.984s
OK

clang 23:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.905s
OK

In the tests I also added -U_FORTIFY_SOURCE in SHARED_FLAGS but this won't change anything, just a workaround for the tests not working on my dev ubuntu box.

agnosticlines · 2026-05-05T12:57:29Z

I was having another look and noticed a few things

h_memcpy_internal is defined in two places (random.h and include/h_malloc.h). It doesn't error because the two declarations are identical, but if one is changed and not the other it would cause a build error due to -Werror. I know why it's being done, but it might be worth extracting into a musl.h? It probably won't change much though, so I'm not sure.

In the memccpy implementation:

EXPORT void *memccpy(void *restrict dst, const void *restrict src, int value, size_t len) {
...
    if (unlikely(len > malloc_object_size(src) && value != 0)) {
        fatal_error("memccpy read overflow");
...

The && value != 0 invalidates the read overflow check on the src when value is 0:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    char *src = malloc(8);
    char *dst = malloc(128);
    if (!src || !dst) return 1;

    memset(src, 'A', 8);

    fprintf(stderr, "calling memccpy(dst, src, 0, 64)\n");
    memccpy(dst, src, 0, 64);
    fprintf(stderr, "survived\n");

    free(src);
    free(dst);
    return 0;
}

$ gcc -O0 -o memccpy_bypass memccpy_bypass.c
$ LD_PRELOAD=libhardened_malloc.so ./memccpy_bypass
calling memccpy(dst, src, 0, 64)
survived

What happened:

memccpy(dst, src, 0, 64) where src is 8 bytes, dst is 128 bytes
the wrapper checks if len > malloc_object_size(src) -> 64 > 8 -> true -> read overflow detected, should abort
then checks value != 0 -> 0 != 0 -> false
true && false -> false -> read overflow detection ignored, no longer aborts
musl_memccpy runs, copies bytes from src looking for byte 0
if src doesn't contain 0 then memccpy reads 64 bytes (56 bytes past the allocation)

The fix is to remove && value != 0.

You also use !firstbuffer && !secondbuffer in all the tests instead of !firstbuffer || !secondbuffer, so they only exit if both mallocs fail rather than if either one fails. Not sure if that was intentional or not, but I wanted to flag it.

agnosticlines · 2026-05-05T14:45:02Z

Also while I'm here, can I ask why CONFIG_BLOCK_OPS_CHECK_SIZE is mutually exclusive with MTE? Is it just performance? (Also do we have exact numbers on real world workloads?)

Would it be worth supporting the ability to use both for users building it themselves? For example on ARM servers where the performance tradeoff is worth it. Looking at recently disclosed/patched bugs in server software a decent chunk of them are things like attacker controlled data being passed as a length to memcpy which would be better covered with deterministic software protection vs probabilistic hardware protection (through MTE).

This is especially important because it's likely in these cases that crashes will just trigger an automatic restart giving the attacker another chance to try again.

There's also cases where MTE alone just won't cover, like a 1 byte overflow within the same 16 byte granule won't trigger MTE but would trigger the block ops check.

I know there's a bunch of caveats where even the classes I just mentioned wouldn't be caught by these size checks, even if they do go through the wrappers, like if the memory corruption happens as a result of an out of bounds indexed array access or if the destination isn't allocator aware and the size check returns SIZE_MAX, though hopefully these would be caught by MTE.

I'm not suggesting we use this instead of MTE, just that they should be allowed to stack if you accept the performance hit.

Currently there's no way to even opt in if you're building it yourself as it's gated by:

#if CONFIG_BLOCK_OPS_CHECK_SIZE && !defined(HAS_ARM_MTE)

Maybe a question for @thestinger?

thestinger · 2026-05-05T14:50:03Z

@agnosticlines MTE provides deterministic protection against linear and small overflows with hardened_malloc since there are guaranteed to be distinct tags for adjacent allocations. MTE being enabled is dynamic though and it needs to be handled similarly to the write-after-free check which is similarly deterministically caught by MTE instead via the dedicated free tag.

agnosticlines · 2026-05-05T19:51:35Z

Btw @SkewedZeppelin, small nit: can you remove the comment I added? I just wanted to explain to anyone reviewing why there was an arbitrary priority (102), but I phrased it badly. It's more appropriate to say "Allocator's own early init path uses priority 101, so we use the next available number"
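For context on the priority argument: with GCC and Clang, a constructor with a smaller priority number runs before one with a larger number, and priorities up to 100 are reserved for the implementation. A small sketch of that ordering follows; the function names and the `init_order` array are made up for this example.

```c
#include <assert.h>

static int init_order[2];
static int init_count;

// Stands in for the allocator's own early-init constructor.
__attribute__((constructor(101)))
static void early_init(void) {
    assert(init_count < 2);
    init_order[init_count++] = 101;
}

// Runs afterwards, like the resolver that looks up the libc routines.
__attribute__((constructor(102)))
static void resolve_after_init(void) {
    assert(init_count < 2);
    init_order[init_count++] = 102;
}
```

By the time `main` starts, both constructors have run and `init_order` holds 101 followed by 102.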

SkewedZeppelin · 2026-05-05T20:08:13Z

invalidates the read overflow

see this case: #252 (comment)

agnosticlines · 2026-05-05T20:12:55Z

invalidates the read overflow

see this case: #252 (comment)

Okay, GitHub is awful... I only had one set of hidden comments with 68 entries; now that I've clicked the comment I have two (the other with 87), but if I refresh it disappears again, and the numbers keep changing across refreshes??? This platform is genuinely terrible... I'm really sorry about that! I think this may be why I missed the original discussion about dlsym too, as I only saw it when I clicked your comment link...

Looks like it's a known issue lol https://github.com/orgs/community/discussions/193340

agnosticlines · 2026-05-18T13:39:13Z

@SkewedZeppelin sorry to ping again but did you test the clang fix? if this works on clang now with the extra compiler flag (which it has in my tests) is there anything left before it can be reviewed/merged? This is a meaningful security improvement for non MTE devices

Signed-off-by: Tavi <tavi@divested.dev>
Co-authored-by: Christian Göttsche <cgzones@googlemail.com>

nohm
4 MB = 0.091741 ms
8 MB = 0.186662 ms
16 MB = 0.375295 ms
32 MB = 0.643256 ms
64 MB = 1.293962 ms
128 MB = 2.658412 ms
256 MB = 5.288432 ms

hm
4 MB = 0.091336 ms
8 MB = 0.187152 ms
16 MB = 0.343821 ms
32 MB = 0.638406 ms
64 MB = 1.281708 ms
128 MB = 2.563310 ms
256 MB = 5.109415 ms

hm+bosc
4 MB = 0.092013 ms
8 MB = 0.185993 ms
16 MB = 0.360132 ms
32 MB = 0.941173 ms
64 MB = 2.724979 ms
128 MB = 6.140287 ms
256 MB = 12.867246 ms

hm+bosc+dlsym
4 MB = 0.091810 ms
8 MB = 0.188023 ms
16 MB = 0.375594 ms
32 MB = 0.647143 ms
64 MB = 1.288610 ms
128 MB = 2.557970 ms
256 MB = 5.114027 ms

Signed-off-by: Tavi <tavi@divested.dev>

agnosticlines · 2026-06-07T00:23:00Z

Got another email and took another look, and there are a few places where you use len * sizeof(wchar_t) without checking whether it overflows.

I'm on my phone right now or I'd submit a patch but in wmemset:

if (unlikely((len * sizeof(wchar_t)) > malloc_object_size(dst))) {
    fatal_error("wmemset buffer overflow");
}
return musl_wmemset(dst, value, len);

Which calls into musl_wmemset to perform the write:

wchar_t *musl_wmemset(wchar_t *d, wchar_t c, size_t n)
{
    wchar_t *ret = d;
    while (n--) *d++ = c;
    return ret;
}

The flow is:

len is passed in with some huge value
len * sizeof(wchar_t) wraps to 0
bounds check sees 0 bytes and passes when it should fail
musl_wmemset still receives the huge len and writes len wchar_ts

This same issue is present in wmemcpy, wmemmove and wmempcpy with varying failure modes.
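One way to avoid the wrap-around is to divide the object size instead of multiplying the length, so the comparison itself cannot overflow. The sketch below assumes the object size is already known to the caller (in the PR it would come from `malloc_object_size`); `wide_len_fits` and `checked_wmemset` are hypothetical names, and the wrapper returns NULL where the PR would call `fatal_error`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

// True when `len` wide characters fit in an object of `object_size`
// bytes. No multiplication is performed, so a huge `len` cannot wrap.
static bool wide_len_fits(size_t len, size_t object_size) {
    return len <= object_size / sizeof(wchar_t);
}

// Hypothetical checked wrapper: returns NULL instead of aborting so the
// sketch stays easy to test.
static wchar_t *checked_wmemset(wchar_t *dst, size_t object_size,
                                wchar_t value, size_t len) {
    assert(dst != NULL);
    if (!wide_len_fits(len, object_size)) {
        return NULL;
    }
    return wmemset(dst, value, len);
}
```

With an unknown object size of `SIZE_MAX`, the check still rejects any `len` whose byte count would not fit in a `size_t`, which no valid object could satisfy anyway.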

I think there's also a way to solve the memccpy(value == 0) "bypass", depending on how much you care about it. I can share more once I've run the benchmarks.

Also, memccpy(len == 0) returns the wrong value: it returns dst when it should return NULL.
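The approach hinted at for the memccpy case is not shown in the thread. Purely as an illustration, one conceivable precise read check is to accept an oversized `len` only when the stop byte actually occurs inside the source object. The sketch assumes the source size is already known, and `memccpy_read_in_bounds` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// True when memccpy(dst, src, value, len) would only read bytes inside
// a source object of `src_size` bytes.
static bool memccpy_read_in_bounds(const void *src, size_t src_size,
                                   int value, size_t len) {
    assert(src != NULL);
    if (len <= src_size) {
        return true; // cannot read past the object regardless of content
    }
    // len exceeds the object: the copy is only safe if it stops early,
    // i.e. the stop byte is found within the first src_size bytes.
    return memchr(src, value, src_size) != NULL;
}
```

This keeps the common case of a NUL-terminated source with a generous `len` working while still flagging the 8-byte source without a terminator from the reproducer above. The trade-off is an extra scan of the source whenever `len` exceeds its size, and the cost of that has not been benchmarked here.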

SkewedZeppelin marked this pull request as draft March 21, 2025 22:51

SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from 39b1b13 to b2fbe6e Compare March 22, 2025 00:30

SkewedZeppelin changed the title ~~override memcpy and perform sanity checks~~ override memcpy and perform size checks Mar 22, 2025

SkewedZeppelin force-pushed the memcpy-sanity branch from b2fbe6e to e5bf6a4 Compare March 22, 2025 04:05

SkewedZeppelin changed the title ~~override memcpy and perform size checks~~ perform size checks on memcpy/memmove/memset Mar 22, 2025

SkewedZeppelin marked this pull request as ready for review March 22, 2025 04:07

SkewedZeppelin force-pushed the memcpy-sanity branch 9 times, most recently from b8cdf16 to 849055d Compare March 22, 2025 09:01

jvoisin reviewed Mar 22, 2025

Comment thread README.md Outdated

Comment thread h_malloc.c Outdated

SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from 4461652 to e38da0d Compare March 23, 2025 01:19

SkewedZeppelin marked this pull request as draft March 23, 2025 02:33

SkewedZeppelin force-pushed the memcpy-sanity branch from e38da0d to 291c6c4 Compare March 23, 2025 03:02

SkewedZeppelin force-pushed the memcpy-sanity branch from f9b1948 to 29753ce Compare March 2, 2026 20:45

SkewedZeppelin force-pushed the memcpy-sanity branch from 29753ce to 4c3dab3 Compare April 4, 2026 16:13

thestinger force-pushed the main branch 4 times, most recently from 9d5802c to 074d47a Compare April 24, 2026 14:53

SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from abaf70b to 584d965 Compare April 29, 2026 19:57

thestinger force-pushed the main branch from 4e91a2d to 1250cae Compare May 4, 2026 10:20

SkewedZeppelin force-pushed the memcpy-sanity branch from 584d965 to 956a661 Compare May 5, 2026 19:40

SkewedZeppelin force-pushed the memcpy-sanity branch from 956a661 to 5c815c8 Compare May 5, 2026 20:11

SkewedZeppelin force-pushed the memcpy-sanity branch from 5c815c8 to 6126477 Compare June 6, 2026 14:30

SkewedZeppelin and others added 2 commits June 6, 2026 10:32

perform size checks on various operations

8f2f14e

Signed-off-by: Tavi <tavi@divested.dev>
Co-authored-by: Christian Göttsche <cgzones@googlemail.com>
