perform size checks on memcpy/memmove/memset#252

Draft
SkewedZeppelin wants to merge 2 commits into GrapheneOS:main from SkewedZeppelin:memcpy-sanity


@SkewedZeppelin SkewedZeppelin commented Mar 21, 2025

for #231

  • largely works system wide on fedora 41 & 42

task list

  • gcc compiler doing weird things
    • no longer an issue with using real underlying functions
  • clang compiler doing weird things
    • it runs now, but size checks are always max
  • whole object size checks (fast path)
  • object remaining size checks (non-fast path)
  • optimized assembly functions
  • memcpy
  • memccpy
    • handle common read overflow case
  • memmove
  • memset
  • wmemcpy
  • wmemmove
  • wmemset
  • bypass overrides for self
  • licensing
  • makefile bits
  • readme
    • could be expanded on
  • test case for memcpy
    • overlap test
  • test case for memccpy
  • test case for memmove
  • test case for memset
  • test case for wmemcpy
  • test case for wmemmove
  • test case for wmemset
  • run all the test cases
    • the feature is default disabled so they can't be run without failing
  • figure out why test cases fail under CI when enabled
    • they all pass on my end
    • still not working on latest patchset
  • figure out why so many gnome apps crash
    • fatal allocator error: invalid malloc_object_size
    • conflict with gjs/mozjs?
    • crashes under f42, but not f41: clocks, calculator, baobab, fileroller, logs
    • crashes under f41, but not f42: gnome-shell when clicking top bar controls
    • can't reproduce anymore, unsure why
  • figure out how to handle chromium/electron crash/conflict
    • can't reproduce anymore, only happens on fast path
  • figure out if it is possible to use the real underlying functions for better per-arch performance
    • dlsym doesn't seem to work with all programs, such as mutter-x11-frames
      • can't reproduce anymore
    • this doesn't necessarily pull from libc, but can pull from other libraries
    • it feels unsafe

compared to others:

  • redfat: memchr, memcmp, memcpy, memmove, memrchr, memset, strcasecmp, strcasestr, strcat, strchr, strchrnul, strcmp, strcpy, strlen, strncasecmp, strncat, strncmp, strncpy, strnlen, strrchr, strstr
  • this patchset: bcopy, memccpy, memcpy, memmove, mempcpy, memset, swab, wmemcpy, wmemmove, wmempcpy, wmemset
  • isoalloc: memcpy, memmove, memset
  • snmalloc: memcpy, and previously memmove (disabled due to issue found by fuzzing)

@SkewedZeppelin SkewedZeppelin marked this pull request as draft March 21, 2025 22:51
@SkewedZeppelin SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from 39b1b13 to b2fbe6e Compare March 22, 2025 00:30
@SkewedZeppelin SkewedZeppelin changed the title override memcpy and perform sanity checks override memcpy and perform size checks Mar 22, 2025
@SkewedZeppelin
Author

@thestinger is it expected that malloc_object_size would return negative values?

@thestinger
Member

@SkewedZeppelin It returns size_t so it can't be negative. When it doesn't know the answer, it returns SIZE_MAX. Are you treating it as signed somewhere?

@SkewedZeppelin
Author

size_t size = region == NULL ? SIZE_MAX : region->size;
is negative sometimes

@thestinger
Member

@SkewedZeppelin It's a size_t, it can't be negative. SIZE_MAX is the maximum size_t value. You must be printing it as a signed integer where it would be -1.

@SkewedZeppelin
Author

apologies I'm dumb and was using %ld not %lu

@thestinger
Member

@SkewedZeppelin It's not particularly important but you should use %zu for size_t.

@thestinger
Member

It matters on Windows, where long is 32-bit even on 64-bit targets, and that could at least theoretically be the case elsewhere.
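
The mix-up above is easy to reproduce. Here is a minimal sketch, assuming a typical target where ptrdiff_t has the same width as size_t and out-of-range conversions wrap, as they do with GCC and Clang. The helper names are made up for illustration. SIZE_MAX formatted with %zu shows the real value, while the same bits viewed through a signed type show up as -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a size_t with the matching %zu specifier. */
static int format_size(char *buf, size_t cap, size_t value) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "%zu", value);
}

/* Hypothetical helper: view the same bits as a signed integer, which is
 * roughly what printing a size_t through a signed specifier ends up doing.
 * The conversion is implementation-defined; GCC and Clang wrap it. */
static int format_size_as_signed(char *buf, size_t cap, size_t value) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "%td", (ptrdiff_t)value);
}
```

Passing a size_t directly to %ld is undefined behaviour wherever long and size_t differ in width, which is why the sketch converts explicitly instead of mismatching the specifier.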

@SkewedZeppelin
Author

SkewedZeppelin commented Mar 22, 2025

is it worth it to zero the remainder of dst? my only issue is that sometimes it is close to size_max/unknown

@thestinger
Member

@SkewedZeppelin That wouldn't be safe since it's not known what they're doing with it. They could be intentionally only copying to part of it. It can be a copy to the middle of it, etc.
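
To illustrate the point, here is a small sketch of a caller that legitimately writes only into the middle of an object. The name patch_middle and the buffer layout are hypothetical. If a wrapper zeroed everything after the copied range, the trailing bytes that this caller expects to keep would be lost.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite len bytes at offset inside a larger object, the way a header
 * patch or ring buffer write might. Returns 0 on success, or -1 if the
 * write would not fit inside the object. */
static int patch_middle(char *obj, size_t obj_size, size_t offset,
                        const char *src, size_t len) {
    assert(obj != NULL && src != NULL);
    if (offset > obj_size || len > obj_size - offset) {
        return -1;
    }
    memcpy(obj + offset, src, len); /* bytes after offset + len stay untouched */
    return 0;
}
```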

@SkewedZeppelin SkewedZeppelin changed the title override memcpy and perform size checks perform size checks on memcpy/memmove/memset Mar 22, 2025
@SkewedZeppelin SkewedZeppelin marked this pull request as ready for review March 22, 2025 04:07
@SkewedZeppelin SkewedZeppelin force-pushed the memcpy-sanity branch 9 times, most recently from b8cdf16 to 849055d Compare March 22, 2025 09:01
Comment thread README.md Outdated
Comment thread h_malloc.c Outdated
@SkewedZeppelin SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from 4461652 to e38da0d Compare March 23, 2025 01:19
@SkewedZeppelin SkewedZeppelin marked this pull request as draft March 23, 2025 02:33
@agnosticlines

agnosticlines commented Apr 29, 2026

Hi! Been following this PR for a while and would love to see it landed. I saw the performance concerns and took a look into them. It seems you wrote off the "figure out if it is possible to use the real underlying functions for better per-arch performance" item from the original issue; is there a specific reason for that beyond "it feels unsafe"?

I wrote some code that uses dlsym(RTLD_NEXT, ...) to resolve the functions relevant to the hot paths

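#include <dlfcn.h> /* RTLD_NEXT requires _GNU_SOURCE on glibc */

/* Start on the bundled musl implementations so that any call made before
 * the constructor runs still has a working target. */
static void *(*real_memcpy)(void *restrict, const void *restrict, size_t) = musl_memcpy;
static void *(*real_memmove)(void *, const void *, size_t) = musl_memmove;
static void *(*real_memset)(void *, int, size_t) = musl_memset;

/* The allocator's own early init path uses priority 101, so use the next available number. */
__attribute__((constructor(102)))
static void resolve_block_ops(void) {
    void *sym;
    /* Switch to the next definition in lookup order, unless the lookup
     * failed or returned our own wrapper. */
    sym = dlsym(RTLD_NEXT, "memcpy");
    if (sym && sym != memcpy) real_memcpy = sym;
    sym = dlsym(RTLD_NEXT, "memmove");
    if (sym && sym != memmove) real_memmove = sym;
    sym = dlsym(RTLD_NEXT, "memset");
    if (sym && sym != memset) real_memset = sym;
}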

From here you'd just call the real_* functions instead of the musl_* variants on the hot paths, the musl implementations would stay for h_memcpy_internal and as a pre constructor fallback (in case something calls memcpy before the constructor runs, maybe for symbol resolution? not sure if that's possible in practice but figured it's better to be safe than sorry)

I applied this locally (and ran the test suite which passed 52/52) and wrote a small benchmark:

benchmark source + output
// bench.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;
void *(*volatile mset)(void*, int, size_t) = memset;

int main(void) {
    size_t sizes[] = { 4<<20, 8<<20, 16<<20, 32<<20, 64<<20, 128<<20, 256<<20, 512<<20, 1024<<20 };
    int n = sizeof(sizes)/sizeof(sizes[0]);
    int iters = 50;
    for (int i = 0; i < n; i++) {
        void *src = malloc(sizes[i]);
        void *dst = malloc(sizes[i]);
        if (!src || !dst) break;
        mset(src, 0xAA, sizes[i]);
        // warmup
        for (int j = 0; j < 3; j++) {
            mcpy(dst, src, sizes[i]);
            asm volatile("" : : "r"(dst) : "memory");
        }
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int j = 0; j < iters; j++) {
            mcpy(dst, src, sizes[i]);
            asm volatile("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = ((t1.tv_sec-t0.tv_sec)*1000.0 + (t1.tv_nsec-t0.tv_nsec)/1e6) / iters;
        printf("%4zuMB = %.3f ms\n", sizes[i]>>20, ms);
        free(src);
        free(dst);
    }
}
$ gcc -O3 -fno-builtin -o bench bench.c

$ ./bench
   4MB = 0.270 ms
   8MB = 0.545 ms
  16MB = 1.435 ms
  32MB = 2.976 ms
  64MB = 6.080 ms
 128MB = 12.225 ms
 256MB = 23.759 ms
 512MB = 43.041 ms
1024MB = 90.316 ms

$ LD_PRELOAD=libhardened_malloc_musl.so ./bench
   4MB = 0.270 ms
   8MB = 0.564 ms
  16MB = 1.875 ms
  32MB = 4.815 ms
  64MB = 10.517 ms
 128MB = 22.669 ms
 256MB = 41.873 ms
 512MB = 106.389 ms
1024MB = 169.493 ms

$ LD_PRELOAD=libhardened_malloc_dlsym.so ./bench
   4MB = 0.262 ms
   8MB = 0.524 ms
  16MB = 1.146 ms
  32MB = 2.215 ms
  64MB = 5.637 ms
 128MB = 12.210 ms
 256MB = 23.888 ms
 512MB = 44.732 ms
1024MB = 92.406 ms
| Size | glibc | PR (musl) | PR + dlsym |
| --- | --- | --- | --- |
| 4MB | 0.270ms | 0.270ms | 0.262ms |
| 8MB | 0.545ms | 0.564ms | 0.524ms |
| 16MB | 1.435ms | 1.875ms (1.31x) | 1.146ms (0.80x) |
| 32MB | 2.976ms | 4.815ms (1.62x) | 2.215ms (0.74x) |
| 64MB | 6.080ms | 10.517ms (1.73x) | 5.637ms (0.93x) |
| 128MB | 12.225ms | 22.669ms (1.85x) | 12.210ms (1.00x) |
| 256MB | 23.759ms | 41.873ms (1.76x) | 23.888ms (1.01x) |
| 512MB | 43.041ms | 106.389ms (2.47x) | 44.732ms (1.04x) |
| 1024MB | 90.316ms | 169.493ms (1.88x) | 92.406ms (1.02x) |

The perf hit comes almost entirely from using the unoptimised musl libc memcpy instead of the AVX-optimised glibc one. malloc_object_size itself only costs roughly 10 to 15 ns per call:

malloc_object_size overhead measurement covering slab + large
// overhead.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

size_t malloc_object_size(void *p);

int main(void) {
    char *p = malloc(4 << 20);
    if (!p) return 1;

    int iters = 100000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t s = malloc_object_size(p);
        asm volatile("" : : "r"(s) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
    printf("malloc_object_size on 4MB alloc: %.0f ns/call\n", ns);

    char *q = malloc(64);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        size_t s = malloc_object_size(q);
        asm volatile("" : : "r"(s) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
    printf("malloc_object_size on 64B alloc: %.0f ns/call\n", ns);

    free(p);
    free(q);
    return 0;
}
$ gcc -O3 -fno-builtin -o overhead overhead.c -lhardened_malloc
$ ./overhead
malloc_object_size on 4MB alloc: 10 ns/call
malloc_object_size on 64B alloc: 13 ns/call

To address some of your concerns "dlsym doesn't seem to work with all program such as mutter-x11-frames", you say you can't reproduce this anymore so I'm going to assume it was a transient issue?

"this doesn't necessarily pull from libc, but can pull from other libraries": I tested this with a second entry in the LD_PRELOAD chain which I assume is the concern you had:

test source + output
// other.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

void *memcpy(void *restrict dst, const void *restrict src, size_t n) {
    static void *(*next)(void*, const void*, size_t) = NULL;
    if (!next) next = dlsym(RTLD_NEXT, "memcpy");
    return next(dst, src, n);
}
// overflow_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;

int main(void) {
    char *a = malloc(16);
    char *b = malloc(8);
    memset(b, 'B', 8);
    mcpy(a, b, 16);
    printf("no abort\n");
    return 0;
}
$ LD_PRELOAD="libhardened_malloc_dlsym.so libother.so" ./overflow_test
fatal allocator error: memcpy read overflow
Aborted
exit: 134

$ LD_PRELOAD="libhardened_malloc_dlsym.so libother.so" ./bench
   4MB = 0.274 ms
   8MB = 0.561 ms
  16MB = 1.686 ms
  32MB = 3.009 ms
  64MB = 6.071 ms
 128MB = 12.473 ms
 256MB = 26.687 ms
 512MB = 45.956 ms
1024MB = 99.686 ms

"it feels unsafe": every failure mode falls back to musl which is correct just slow:

  • constructor hasn't fired yet -> use musl_memcpy from static init
  • dlsym returns NULL -> if (sym && ...) fails -> keeps musl_memcpy
  • dlsym returns our own symbol -> sym == memcpy -> keeps musl_memcpy
  • dlsym internally calls memcpy during resolution -> wrapper fires -> calls real_memcpy which is still musl_memcpy at that point -> no recursion
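
The fallback rules in this list boil down to a single selection step. Here is a sketch with hypothetical names: pick_impl, slow_copy, fast_copy and wrapper_copy are stand-ins, and the candidate pointer is passed in as a parameter rather than obtained from dlsym so that the example runs without any preloading.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Stand-in for the bundled portable implementation (musl_memcpy in the PR). */
static void *slow_copy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

/* Stand-in for an optimised libc implementation found at runtime. */
static void *fast_copy(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}

/* Stand-in for the exported wrapper; only its address matters here. */
static void *wrapper_copy(void *dst, const void *src, size_t n) {
    return slow_copy(dst, src, n);
}

/* Keep the fallback unless the lookup produced a usable function that is
 * not the wrapper itself. */
static copy_fn pick_impl(copy_fn candidate, copy_fn self, copy_fn fallback) {
    assert(fallback != NULL);
    if (candidate == NULL || candidate == self) {
        return fallback;
    }
    return candidate;
}
```

Because the active pointer is initialised to the fallback before any lookup happens, a call that arrives early (or from inside the lookup) still reaches a working implementation.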

For environmental concerns (some of these are stretches just to show I've thought about all the possible failure modes):

  • musl is the system libc (alpine, etc) -> sym != memcpy (they are different addresses: musl's impl vs our impl) -> we resolve a slower memcpy
  • statically linked hardened_malloc -> dlsym returns NULL -> keeps musl_memcpy
  • library providing resolved memcpy gets dlclosed (stretch) -> can't dlclose libc
  • IFUNC not resolved yet (stretch) -> IFUNC resolvers run at load time (before constructors)
  • signal during constructor (definitely a stretch but I couldn't think of any other failure modes) -> pointer write is naturally aligned, atomic on both x86 and arm64
  • android/bionic support -> not tested at all but CONFIG_BLOCK_OPS_CHECK_SIZE defaults to false in Android.bp, though bionic supports dlsym(RTLD_NEXT, ...) so it should work if enabled

If the deal breaker for this PR was the perf hit I believe this approach would solve it and hopefully get this landed! (great work btw!)

@SkewedZeppelin SkewedZeppelin force-pushed the memcpy-sanity branch 2 times, most recently from abaf70b to 584d965 Compare April 29, 2026 19:57
@SkewedZeppelin
Author

SkewedZeppelin commented Apr 29, 2026

> I tested this with a second entry in the LD_PRELOAD chain which I assume is the concern you had

it is weird, see #252 (comment)
I rebased and pushed your code as a separate commit for anyone to test with.

there was also a force-pushed commit somewhere a bit back where I did import some assembly versions from musl, but my issue with that was the added maintenance burden for per-arch and any changes

> deal breaker

it still doesn't work under clang for some reason (at least as of a year ago or whatever)

@agnosticlines

agnosticlines commented Apr 29, 2026

Amazing decision from Github to hide comments like that, though it's partially on me as I've been following this thread for a while and should have remembered.

So the clang issues, you've mentioned a bunch of things and I think this may fix all of them, I don't have a non headless Linux system atm so I can't test chrome/gnome/etc, you'll need to do that sorry :(.

I tried to build it locally and had to pass -fuse-ld=lld because clang LTO doesn't link with GNU ld (I assume Fedora links fine with the gold plugin, which Ubuntu doesn't seem to have). That worked, but it segfaulted immediately. I tracked the issue down to the LoopIdiomRecognize pass, which sees the byte/word copy loop inside musl_memcpy and replaces it with call memcpy@plt, which causes infinite recursion.

Can you build it locally with -fno-builtin -mllvm -disable-loop-idiom-all on the musl files and see if the issues you described (size max, gnome app crashes, mutter not working) still occur?

Not sure if I'm just fixing an issue with my setup or if this is the bigger underlying issue for clang but I can see a possible path that would cause this bug to present in the ways you described, either way with those options it builds fine and everything seems to work: 52/52, overflow detection works, malloc_object_size() works.

It's a pretty interesting bug I might dig into a little more, if it doesn't work let me know and I'll take a look tomorrow.

@SkewedZeppelin
Author

last I tried it wasn't recursing, it did in earlier versions, but it was instead effectively no-op since the size was always max instead of the actual allocation size

@agnosticlines

agnosticlines commented Apr 29, 2026

Ah... I think I scrolled past you talking about that, sorry about that. I can't test gnome etc but I tried rustc mentioned earlier as not working and it worked fine for me.

What specifically wasn't working for you in clang? Because it works fine for me, so the next thing to work out is what's different about our setups.

@agnosticlines

agnosticlines commented Apr 30, 2026

> last I tried it wasn't recursing, it did in earlier versions, but it was instead effectively no-op since the size was always max instead of the actual allocation size

the theory I originally had before I found the recursion was that clang's LTO was constant folding ro.slab_region_start as 0 since it's set at runtime in the constructor, but first initialised to 0 statically, which would optimise the slab region check to always false. every pointer would fall through to regions_find -> NULL -> SIZE_MAX

fwiw enforce_init isn't present in malloc_object_size, so if the allocator was in a broken state it would silently return SIZE_MAX instead of aborting, which would match what you saw, but I think you'd notice failures sooner, what test code were you running for malloc_object_size? The only way I can think of that being a problem is if the SIGSEGV from the recursion was caught by another signal handler (gnome-session/Glib crash handling?) and the allocator continued on in a broken state.

Both of these are speculation, if you can get a reproduction or steps so I can build it on the same setup as you I can try reproduce and look into it some more, because again, the clang build works fine for me currently.

@SkewedZeppelin
Author

SkewedZeppelin commented Apr 30, 2026

> reproduction

the included test cases:

gcc 16.0.1
....................................................
----------------------------------------------------------------------
Ran 52 tests in 0.108s

OK


clang 22.1.4
....................................................
----------------------------------------------------------------------
Ran 52 tests in 0.108s

OK


gcc+bosc
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.134s

OK


clang+bosc
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF..FF.F.FFFFFFFFFFFFFFFFFFFFF..FF.
----------------------------------------------------------------------
Ran 64 tests in 0.664s

FAILED (failures=57)

@agnosticlines

agnosticlines commented Apr 30, 2026

Oh, sorry I thought I was running them already but they were all commented out in the branch. I uncommented all 12 and ran them again and got 60/64.

Looked into the ones that failed (memcpy and memmove) and saw the memcpy wrapper wasn't there, clang had inlined it.

Fixed by adding: -fno-builtin to SHARED_FLAGS.

64/64, works fine.

................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.854s

OK

Tested clang 18, 20 and 23. Did you try building/testing on a debian distro? Also was "clang+bosc" in your testing output with the extra flags? Because without the original fix it will fail in exactly the same way:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF..FF.F.FFFFFFFFFFFFFFFFFFFFF..FF.
----------------------------------------------------------------------
Ran 64 tests in 1.376s

FAILED (failures=57)

The 7 tests that pass all expect a SIGSEGV, so they pass by accident:

getrandom("\x00\x85\xc0\x70...", 40, 0)                              = 40
mmap(NULL, 271971237888, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...)   = 0x7e823aba6000
mprotect(0x7e8f82d4f000, 25862144, PROT_READ|PROT_WRITE)             = 0
mprotect(0x7e8f845f9000, 3072, PROT_READ|PROT_WRITE)                 = 0
mmap(NULL, 13469017440256, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, ...) = 0x72423aa00000
...
mprotect(0x72663c6cf000, 4096, PROT_READ|PROT_WRITE)                 = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x7ee3a7eb2ff8} ---
+++ killed by SIGSEGV +++

Disassembly:

0000000000000000 <musl_memcpy>:
   0:   f3 0f 1e fa             endbr64
   4:   41 57                   push   %r15
   6:   41 56                   push   %r14
   8:   41 54                   push   %r12
   a:   53                      push   %rbx
   b:   50                      push   %rax
   c:   49 89 d6                mov    %rdx,%r14
   f:   49 89 f7                mov    %rsi,%r15
  12:   48 89 fb                mov    %rdi,%rbx
  15:   41 f6 c7 03             test   $0x3,%r15b
  19:   0f 84 e6 00 00 00       je     105 <musl_memcpy+0x105>
  ...
  47:   ff 15 00 00 00 00       call   *0x0(%rip)
                    49: R_X86_64_GOTPCRELX    memcpy-0x4

memcpy is resolved through the GOT so we land back in the wrapper again, which calls musl_memcpy again, which lands us here again.
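
For context, the kind of loop that triggers this is tiny. Here is a sketch, assuming a clang build at -O2 or higher. The name byte_copy is made up and is not a function from the PR. The optimizer is allowed to recognise this loop as a memcpy idiom and emit a call to memcpy. That is harmless in an ordinary program, but becomes recursive when the loop itself is the implementation behind an interposed memcpy. In the tests above, building the affected files with -fno-builtin was enough to stop the transformation.

```c
#include <assert.h>
#include <stddef.h>

/* A plain byte-by-byte copy. With optimisation enabled, the compiler may
 * replace the whole loop with a call to memcpy. */
static void *byte_copy(void *restrict dst, const void *restrict src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));
    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
    return dst;
}
```

Whether the call appears can be checked by disassembling the object file and looking for a memcpy relocation, as in the listing above.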

You don't need -mllvm -disable-loop-idiom-all, I just tested it again with just -fno-builtin and it works fine, I must have been using old object files and forgot to make clean or something.

Also your tests won't pass on Ubuntu because it injects -D_FORTIFY_SOURCE=3 which replaces memcpy with __memcpy_chk bypassing the wrapper.

Patch:

fix makefiles
diff --git a/Makefile b/Makefile
index b8ce9dd..90444f4 100644
--- a/Makefile
+++ b/Makefile
@@ -146,17 +146,17 @@ $(OUT)/util.o: util.c util.h $(CONFIG_FILE) | $(OUT)
 	$(COMPILE.c) $(OUTPUT_OPTION) $<
 
 $(OUT)/memcpy.o: memcpy.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memccpy.o: memccpy.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memmove.o: memmove.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/memset.o: memset.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/swab.o: swab.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
 $(OUT)/wmemset.o: wmemset.c musl.h $(CONFIG_FILE) | $(OUT)
-	$(COMPILE.c) $(OUTPUT_OPTION) $<
+	$(COMPILE.c) -fno-builtin $(OUTPUT_OPTION) $<
 
 check: tidy
 
diff --git a/test/Makefile b/test/Makefile
index 80221cc..6596801 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -16,7 +16,7 @@ CPPFLAGS := \
     -DSLAB_CANARY=$(CONFIG_SLAB_CANARY) \
     -DCONFIG_EXTENDED_SIZE_CLASSES=$(CONFIG_EXTENDED_SIZE_CLASSES)
 
-SHARED_FLAGS := -O3
+SHARED_FLAGS := -O3 -fno-builtin
 
 CFLAGS := -std=c17 $(SHARED_FLAGS) -Wmissing-prototypes
 CXXFLAGS := -std=c++17 -fsized-deallocation $(SHARED_FLAGS) 

Test results (with -U_FORTIFY_SOURCE in SHARED_FLAGS as well)

gcc:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.972s
OK

clang 18:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.984s
OK

clang 23:
................................................................
----------------------------------------------------------------------
Ran 64 tests in 0.905s
OK

In the tests I also added -U_FORTIFY_SOURCE in SHARED_FLAGS but this won't change anything, just a workaround for the tests not working on my dev ubuntu box.

@agnosticlines

agnosticlines commented May 5, 2026

I was having another look and noticed a few things

h_memcpy_internal is defined in two places (random.h and include/h_malloc.h), it doesn't error because it's the same structure but if it's changed in one place and not the other it would cause a build error due to -Werror, I know why it's being done but it might be worth extracting into a musl.h? Though it probably won't change much so I'm not sure.

In the memccpy implementation:

EXPORT void *memccpy(void *restrict dst, const void *restrict src, int value, size_t len) {
...
    if (unlikely(len > malloc_object_size(src) && value != 0)) {
        fatal_error("memccpy read overflow");
...

The && value != 0 invalidates the read overflow check on the src when value is 0:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    char *src = malloc(8);
    char *dst = malloc(128);
    if (!src || !dst) return 1;

    memset(src, 'A', 8);

    fprintf(stderr, "calling memccpy(dst, src, 0, 64)\n");
    memccpy(dst, src, 0, 64);
    fprintf(stderr, "survived\n");

    free(src);
    free(dst);
    return 0;
}
$ gcc -O0 -o memccpy_bypass memccpy_bypass.c
$ LD_PRELOAD=libhardened_malloc.so ./memccpy_bypass
calling memccpy(dst, src, 0, 64)
survived

What happened:

  • memccpy(dst, src, 0, 64) where src is 8 bytes, dst is 128 bytes
  • the wrapper checks if len > malloc_object_size(src) -> 64 > 8 -> true -> read overflow detected, should abort
  • then checks value != 0 -> 0 != 0 -> false
  • true && false -> false -> read overflow detection ignored, no longer aborts
  • musl_memccpy runs, copies bytes from src looking for byte 0
  • if src doesn't contain 0 then memccpy reads 64 bytes (56 bytes past the allocation)

The fix is to remove && value != 0.
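
The truth table above can be checked in isolation. This is a sketch with hypothetical helper names; the source object size is passed in as a parameter in place of the malloc_object_size(src) call. Note that the PR author's reply further down points to an earlier discussion of this case, so the current condition appears to be deliberate rather than an oversight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Condition as currently written in the wrapper. */
static bool read_overflow_current(size_t len, size_t src_size, int value) {
    return len > src_size && value != 0;
}

/* Condition with the "&& value != 0" clause removed. */
static bool read_overflow_strict(size_t len, size_t src_size, int value) {
    (void)value; /* the search byte no longer influences the decision */
    return len > src_size;
}

/* The case from the reproducer: 8-byte source, len 64, searching for 0. */
static void check_reproducer_case(void) {
    assert(!read_overflow_current(64, 8, 0)); /* not flagged */
    assert(read_overflow_strict(64, 8, 0));   /* flagged */
}
```

The strict form also flags calls that pass the destination size as len while copying a shorter NUL-terminated string, which is the trade-off between the two variants.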

You also use !firstbuffer && !secondbuffer for all the tests instead of !firstbuffer || !secondbuffer so they only exit if both mallocs fail instead of if any of them fail, not sure if that was intentional or not but just wanted to flag it.
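
The difference between the two conditions only shows when exactly one allocation fails. A small sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bail out only when both allocations failed (the pattern in the tests). */
static bool should_bail_both(const void *first, const void *second) {
    return !first && !second;
}

/* Bail out when either allocation failed. */
static bool should_bail_either(const void *first, const void *second) {
    return !first || !second;
}

/* One buffer allocated, the other not: only the || form stops the test. */
static void check_single_failure(void) {
    int placeholder = 0;
    assert(!should_bail_both(&placeholder, NULL));
    assert(should_bail_either(&placeholder, NULL));
}
```

With the && form, a test would continue and pass a NULL buffer to the function under test.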

@agnosticlines
Copy link
Copy Markdown

agnosticlines commented May 5, 2026

Also while I'm here, can I ask why CONFIG_BLOCK_OPS_CHECK_SIZE is mutually exclusive with MTE? Is it just performance? (Also do we have exact numbers on real world workloads?)

Would it be worth supporting the ability to use both for users building it themselves? For example on ARM servers where the performance tradeoff is worth it. Looking at recently disclosed/patched bugs in server software a decent chunk of them are things like attacker controlled data being passed as a length to memcpy which would be better covered with deterministic software protection vs probabilistic hardware protection (through MTE).

This is especially important because it's likely in these cases that crashes will just trigger an automatic restart giving the attacker another chance to try again.

There are also cases that MTE alone just won't cover, like a 1 byte overflow within the same 16 byte granule, which won't trigger MTE but would trigger the block ops check.

I know there's a bunch of caveats where even the classes I just mentioned wouldn't be caught by these size checks, even if they do go through the wrappers, like if the memory corruption happens as a result of an out of bounds indexed array access or if the destination isn't allocator aware and the size check returns SIZE_MAX, though hopefully these would be caught by MTE.

I'm not suggesting we use this instead of MTE, just that they should be allowed to stack if you accept the performance hit.

Currently there's no way to even opt in if you're building it yourself as it's gated by:

#if CONFIG_BLOCK_OPS_CHECK_SIZE && !defined(HAS_ARM_MTE)

Maybe a question for @thestinger?

@thestinger
Member

@agnosticlines MTE provides deterministic protection against linear and small overflows with hardened_malloc since there are guaranteed to be distinct tags for adjacent allocations. MTE being enabled is dynamic though and it needs to be handled similarly to the write-after-free check which is similarly deterministically caught by MTE instead via the dedicated free tag.

@agnosticlines

agnosticlines commented May 5, 2026

Btw @SkewedZeppelin small nit, can you remove the comment I added? I just wanted to explain why there was an arbitrary priority (102) to anyone reviewing, but I phrased it badly. It's more appropriate to say "Allocator's own early init path uses priority 101, so we use the next available number"

@SkewedZeppelin
Author

> invalidates the read overflow

see this case: #252 (comment)

@agnosticlines

agnosticlines commented May 5, 2026

> invalidates the read overflow
>
> see this case: #252 (comment)

Okay, Github is awful... I only had one set of hidden comments with 68 entries, now that I've clicked the comment I have two (the other with 87) but if I refresh it disappears again, and also the numbers keep changing across refreshes??? This platform is genuinely terrible... I'm really sorry about that! I think this may have been why I missed the original discussion about dlsym too as I only saw it when I clicked your comment link...

Looks like it's a known issue lol https://github.com/orgs/community/discussions/193340

@agnosticlines

agnosticlines commented May 18, 2026

@SkewedZeppelin sorry to ping again but did you test the clang fix? if this works on clang now with the extra compiler flag (which it has in my tests) is there anything left before it can be reviewed/merged? This is a meaningful security improvement for non MTE devices

SkewedZeppelin and others added 2 commits June 6, 2026 10:32
Signed-off-by: Tavi <tavi@divested.dev>
Co-authored-by: Christian Göttsche <cgzones@googlemail.com>
nohm
 4 MB = 0.091741 ms
 8 MB = 0.186662 ms
16 MB = 0.375295 ms
32 MB = 0.643256 ms
64 MB = 1.293962 ms
128 MB = 2.658412 ms
256 MB = 5.288432 ms

hm
 4 MB = 0.091336 ms
 8 MB = 0.187152 ms
16 MB = 0.343821 ms
32 MB = 0.638406 ms
64 MB = 1.281708 ms
128 MB = 2.563310 ms
256 MB = 5.109415 ms

hm+bosc
 4 MB = 0.092013 ms
 8 MB = 0.185993 ms
16 MB = 0.360132 ms
32 MB = 0.941173 ms
64 MB = 2.724979 ms
128 MB = 6.140287 ms
256 MB = 12.867246 ms

hm+bosc+dlsym
 4 MB = 0.091810 ms
 8 MB = 0.188023 ms
16 MB = 0.375594 ms
32 MB = 0.647143 ms
64 MB = 1.288610 ms
128 MB = 2.557970 ms
256 MB = 5.114027 ms

Signed-off-by: Tavi <tavi@divested.dev>
@agnosticlines

agnosticlines commented Jun 7, 2026

Got another email and took another look, and there are a few places where you use len * sizeof(wchar_t) without checking if it overflows.

I'm on my phone right now or I'd submit a patch but in wmemset:

if (unlikely((len * sizeof(wchar_t)) > malloc_object_size(dst))) {
    fatal_error("wmemset buffer overflow");
}
return musl_wmemset(dst, value, len);

Which calls into musl_wmemset to perform the write:

wchar_t *musl_wmemset(wchar_t *d, wchar_t c, size_t n)
{
    wchar_t *ret = d;
    while (n--) *d++ = c;
    return ret;
}

The flow is:

  • len is passed in with some huge value
  • len * sizeof(wchar_t) wraps to 0
  • bounds check sees 0 bytes and passes when it should fail
  • musl_wmemset still receives the huge len and writes len wchar_ts

This same issue is present in wmemcpy, wmemmove and wmempcpy with varying failure modes.
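
One way to avoid the wrap is to compare element counts instead of byte counts. This is a sketch with hypothetical helper names; the object size is passed in as a parameter in place of the malloc_object_size(dst) call, and it is not taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Check as quoted above: the multiplication can wrap around to a small value. */
static bool wide_overflow_naive(size_t len, size_t object_size) {
    return len * sizeof(wchar_t) > object_size;
}

/* Wrap-free variant: divide the object size instead of multiplying len. */
static bool wide_overflow_checked(size_t len, size_t object_size) {
    return len > object_size / sizeof(wchar_t);
}

/* A len chosen so that len * sizeof(wchar_t) wraps to 0. */
static void check_wrapping_len(void) {
    size_t huge = SIZE_MAX / sizeof(wchar_t) + 1;
    assert(!wide_overflow_naive(huge, 16));  /* wrapped product slips through */
    assert(wide_overflow_checked(huge, 16)); /* caught */
}
```

When the object size is unknown (SIZE_MAX), the checked form still accepts every len whose byte count is representable. On GCC and Clang, __builtin_mul_overflow would be another option.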

I think there's also a way to solve the memccpy(value == 0) "bypass", depending on how much you care about it. I can share more once I've run the benchmarks.

Also memccpy(len == 0) returns the wrong value. It returns dst when it should return NULL.
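
As a rough illustration of both points, here is one possible shape for a bounded memccpy. It is only a sketch: bounded_memccpy is a hypothetical name, the source object size is passed in as a parameter, the destination is not checked at all, and this is not necessarily the approach the commenter has in mind. It searches only the readable part of the source first, reports an overflow only when the search byte is not inside the object, and returns NULL whenever the byte is not found, including for len == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy like memccpy, but never read past src_size bytes of src. Returns a
 * pointer just past the copied search byte, or NULL if the byte was not
 * found in the first len bytes. Sets *overflow when finding out would
 * require reading beyond the source object; nothing is copied in that case. */
static void *bounded_memccpy(void *restrict dst, const void *restrict src,
                             int value, size_t len, size_t src_size,
                             bool *overflow) {
    assert(overflow != NULL);
    *overflow = false;
    size_t readable = len < src_size ? len : src_size;
    const unsigned char *hit = readable ? memchr(src, value, readable) : NULL;
    if (hit != NULL) {
        size_t n = (size_t)(hit - (const unsigned char *)src) + 1;
        memcpy(dst, src, n);
        return (unsigned char *)dst + n;
    }
    if (len > src_size) {
        *overflow = true; /* search byte is not inside the source object */
        return NULL;
    }
    memcpy(dst, src, len); /* copies nothing when len == 0 */
    return NULL;           /* not found, which includes the len == 0 case */
}
```

The extra memchr pass costs time on every call, so whether this is acceptable would depend on benchmark results.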
