perform size checks on memcpy/memmove/memset#252
Conversation
39b1b13 to
b2fbe6e
Compare
|
@thestinger is it expected that malloc_object_size would return negative values? |
|
@SkewedZeppelin It returns |
|
Line 1852 in 4fe9018 |
|
@SkewedZeppelin It's a |
|
apologies I'm dumb and was using %ld not %lu |
|
@SkewedZeppelin It's not particularly important but you should use |
|
It matters on Windows where |
|
is it worth it to zero the remainder of dst? my only issue is that sometimes it is close to size_max/unknown |
|
@SkewedZeppelin That wouldn't be safe since it's not known what they're doing with it. They could be intentionally only copying to part of it. It can be a copy to the middle of it, etc. |
b2fbe6e to
e5bf6a4
Compare
b8cdf16 to
849055d
Compare
4461652 to
e38da0d
Compare
e38da0d to
291c6c4
Compare
f9b1948 to
29753ce
Compare
29753ce to
4c3dab3
Compare
9d5802c to
074d47a
Compare
|
Hi! Been following this PR for a while and would love to see it get landed, I saw the performance concerns and took a look into them, it seems you wrote off the "figure out if it is possible to use the real underlying functions for better per-arch performance" from the original issue, is there a specific reason for that beyond it feels unsafe? I wrote some code that uses #include <dlfcn.h>
static void *(*real_memcpy)(void *restrict, const void *restrict, size_t) = musl_memcpy;
static void *(*real_memmove)(void *, const void *, size_t) = musl_memmove;
static void *(*real_memset)(void *, int, size_t) = musl_memset;
__attribute__((constructor(102)))
static void resolve_block_ops(void) {
void *sym;
sym = dlsym(RTLD_NEXT, "memcpy");
if (sym && sym != memcpy) real_memcpy = sym;
sym = dlsym(RTLD_NEXT, "memmove");
if (sym && sym != memmove) real_memmove = sym;
sym = dlsym(RTLD_NEXT, "memset");
if (sym && sym != memset) real_memset = sym;
}From here you'd just call the I applied this locally (and ran the test suite which passed 52/52) and wrote a small benchmark: benchmark source + output// bench.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;
void *(*volatile mset)(void*, int, size_t) = memset;
int main(void) {
size_t sizes[] = { 4<<20, 8<<20, 16<<20, 32<<20, 64<<20, 128<<20, 256<<20, 512<<20, 1024<<20 };
int n = sizeof(sizes)/sizeof(sizes[0]);
int iters = 50;
for (int i = 0; i < n; i++) {
void *src = malloc(sizes[i]);
void *dst = malloc(sizes[i]);
if (!src || !dst) break;
mset(src, 0xAA, sizes[i]);
// warmup
for (int j = 0; j < 3; j++) {
mcpy(dst, src, sizes[i]);
asm volatile("" : : "r"(dst) : "memory");
}
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int j = 0; j < iters; j++) {
mcpy(dst, src, sizes[i]);
asm volatile("" : : "r"(dst) : "memory");
}
clock_gettime(CLOCK_MONOTONIC, &t1);
double ms = ((t1.tv_sec-t0.tv_sec)*1000.0 + (t1.tv_nsec-t0.tv_nsec)/1e6) / iters;
printf("%4zuMB = %.3f ms\n", sizes[i]>>20, ms);
free(src);
free(dst);
}
}
The perf hit is coming pretty much entirely from using the unoptimised musl libc malloc_object_size overhead measurement covering slab + large// overhead.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
size_t malloc_object_size(void *p);
int main(void) {
char *p = malloc(4 << 20);
if (!p) return 1;
int iters = 100000;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < iters; i++) {
size_t s = malloc_object_size(p);
asm volatile("" : : "r"(s) : "memory");
}
clock_gettime(CLOCK_MONOTONIC, &t1);
double ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
printf("malloc_object_size on 4MB alloc: %.0f ns/call\n", ns);
char *q = malloc(64);
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < iters; i++) {
size_t s = malloc_object_size(q);
asm volatile("" : : "r"(s) : "memory");
}
clock_gettime(CLOCK_MONOTONIC, &t1);
ns = ((t1.tv_sec-t0.tv_sec)*1e9 + (t1.tv_nsec-t0.tv_nsec)) / iters;
printf("malloc_object_size on 64B alloc: %.0f ns/call\n", ns);
free(p);
free(q);
return 0;
}To address some of your concerns "dlsym doesn't seem to work with all program such as mutter-x11-frames", you say you can't reproduce this anymore so I'm going to assume it was a transient issue? "this doesn't necessarily pull from libc, but can pull from other libraries": I tested this with a second entry in the LD_PRELOAD chain which I assume is the concern you had: test source + output// other.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
void *memcpy(void *restrict dst, const void *restrict src, size_t n) {
static void *(*next)(void*, const void*, size_t) = NULL;
if (!next) next = dlsym(RTLD_NEXT, "memcpy");
return next(dst, src, n);
}// overflow_test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void *(*volatile mcpy)(void*, const void*, size_t) = memcpy;
int main(void) {
char *a = malloc(16);
char *b = malloc(8);
memset(b, 'B', 8);
mcpy(a, b, 16);
printf("no abort\n");
return 0;
}"it feels unsafe": every failure mode falls back to musl which is correct just slow:
For environmental concerns (some of these are stretches just to show I've thought about all the possible failure modes):
If the deal breaker for this PR was the perf hit I believe this approach would solve it and hopefully get this landed! (great work btw!) |
abaf70b to
584d965
Compare
it is weird, see #252 (comment) there was also force pushed commit somewhere a bit back where I did import some assembly versions from musl, but my issue with that was the added maintenance burden for per-arch and any changes
it still doesn't work under clang for some reason (at least as of a year ago or whatever) |
|
Amazing decision from Github to hide comments like that, though it's partially on me as I've been following this thread for a while and should have remembered. So the clang issues, you've mentioned a bunch of things and I think this may fix all of them, I don't have a non headless Linux system atm so I can't test chrome/gnome/etc, you'll need to do that sorry :(. I tried to build it locally and had to pass Can you build it locally with Not sure if I'm just fixing an issue with my setup or if this is the bigger underlying issue for clang but I can see a possible path that would cause this bug to present in the ways you described, either way with those options it builds fine and everything seems to work: 52/52, overflow detection works, It's a pretty interesting bug I might dig into a little more, if it doesn't work let me know and I'll take a look tomorrow. |
|
last I tried it wasn't recursing, it did in earlier versions, but it was instead effectively no-op since the size was always max instead of the actual allocation size |
|
Ah... I think I scrolled past you talking about that, sorry about that. I can't test gnome etc but I tried rustc mentioned earlier as not working and it worked fine for me. What specifically wasn't working for you in clang? Because it works fine for me, so the next thing to work out is what's different about our setups. |
the theory I originally had before I found the recursion was that clangs LTO was constant folding fwiw Both of these are speculation, if you can get a reproduction or steps so I can build it on the same setup as you I can try reproduce and look into it some more, because again, the clang build works fine for me currently. |
the included test cases: |
|
Oh, sorry I thought I was running them already but they were all commented out in the branch. I uncommented all 12 and ran them again and got 60/64. Looked into the ones that failed ( Fixed by adding: 64/64, works fine. Tested clang 18, 20 and 23. Did you try building/testing on a debian distro? Also was "clang+bosc" in your testing output with the extra flags? Because without the original fix it will fail in exactly the same way: The 7 tests that pass all expect a SIGSEGV, so they pass by accident: Disassembly:
You don't need Also your tests won't pass on Ubuntu because it injects Patch: fix makefilesdiff --git a/Makefile b/Makefile
index b8ce9dd..90444f4 100644
--- a/Makefile
+++ b/Makefile
@@ -146,17 +146,17 @@ $(OUT)/util.o: util.c util.h $(CONFIG_FILE) | $(OUT)
$(COMPILE.c) $(OUTPUT_OPTION) $<
$(OUT)/memcpy.o: memcpy.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
$(OUT)/memccpy.o: memccpy.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
$(OUT)/memmove.o: memmove.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
$(OUT)/memset.o: memset.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
$(OUT)/swab.o: swab.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) -Wno-cast-align $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin -Wno-cast-align $(OUTPUT_OPTION) $<
$(OUT)/wmemset.o: wmemset.c musl.h $(CONFIG_FILE) | $(OUT)
- $(COMPILE.c) $(OUTPUT_OPTION) $<
+ $(COMPILE.c) -fno-builtin $(OUTPUT_OPTION) $<
check: tidy
diff --git a/test/Makefile b/test/Makefile
index 80221cc..6596801 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -16,7 +16,7 @@ CPPFLAGS := \
-DSLAB_CANARY=$(CONFIG_SLAB_CANARY) \
-DCONFIG_EXTENDED_SIZE_CLASSES=$(CONFIG_EXTENDED_SIZE_CLASSES)
-SHARED_FLAGS := -O3
+SHARED_FLAGS := -O3 -fno-builtin
CFLAGS := -std=c17 $(SHARED_FLAGS) -Wmissing-prototypes
CXXFLAGS := -std=c++17 -fsized-deallocation $(SHARED_FLAGS) Test results (with In the tests I also added |
|
I was having another look and noticed a few things
In the EXPORT void *memccpy(void *restrict dst, const void *restrict src, int value, size_t len) {
...
if (unlikely(len > malloc_object_size(src) && value != 0)) {
fatal_error("memccpy read overflow");
...The #include <stdlib.h>
#include <string.h>
#include <stdio.h>
int main(void) {
char *src = malloc(8);
char *dst = malloc(128);
if (!src || !dst) return 1;
memset(src, 'A', 8);
fprintf(stderr, "calling memccpy(dst, src, 0, 64)\n");
memccpy(dst, src, 0, 64);
fprintf(stderr, "survived\n");
free(src);
free(dst);
return 0;
}What happened:
The fix is to remove You also use |
|
Also while I'm here, can I ask why Would it be worth supporting the ability to use both for users building it themselves? For example on ARM servers where the performance tradeoff is worth it. Looking at recently disclosed/patched bugs in server software a decent chunk of them are things like attacker controlled data being passed as a length to This is especially important because it's likely in these cases that crashes will just trigger an automatic restart giving the attacker another chance to try again. There's also cases where MTE alone just won't cover, like a 1 byte overflow within the same 16 byte granule won't trigger MTE but would trigger the block ops check. I know there's a bunch of caveats where even the classes I just mentioned wouldn't be caught by these size checks, even if they do go through the wrappers, like if the memory corruption happens as a result of an out of bounds indexed array access or if the destination isn't allocator aware and the size check returns I'm not suggesting we use this instead of MTE, just that they should be allowed to stack if you accept the performance hit. Currently there's no way to even opt in if you're building it yourself as it's gated by: #if CONFIG_BLOCK_OPS_CHECK_SIZE && !defined(HAS_ARM_MTE)
Maybe a question for @thestinger? |
|
@agnosticlines MTE provides deterministic protection against linear and small overflows with hardened_malloc since there are guaranteed to be distinct tags for adjacent allocations. MTE being enabled is dynamic though and it needs to be handled similarly to the write-after-free check which is similarly deterministically caught by MTE instead via the dedicated free tag. |
584d965 to
956a661
Compare
|
Btw @SkewedZeppelin small nit, can you remove the comment I added, I just wanted to explain why there was an arbitrary priority (102) to anyone reviewing, I phrased it badly though, it's more appropriate to say "Allocators own early init path uses priority 101, so we use the next available number" |
see this case: #252 (comment) |
956a661 to
5c815c8
Compare
Okay, Github is awful... I only had one set of hidden comments with 68 entries, now that I've clicked the comment I have two (the other with 87) but if I refresh it disappears again, and also the numbers keep changing across refreshes??? This platform is genuinely terrible... I'm really sorry about that! I think this may have been why I missed the original discussion about dlsym too as I only saw it when I clicked your comment link... Looks like it's a known issue lol https://github.com/orgs/community/discussions/193340 |
|
@SkewedZeppelin sorry to ping again but did you test the clang fix? if this works on clang now with the extra compiler flag (which it has in my tests) is there anything left before it can be reviewed/merged? This is a meaningful security improvement for non MTE devices |
5c815c8 to
6126477
Compare
Signed-off-by: Tavi <tavi@divested.dev> Co-authored-by: =?UTF-8?q?Christian=20G=C3=B6ttsche?= <cgzones@googlemail.com>
nohm 4 MB = 0.091741 ms 8 MB = 0.186662 ms 16 MB = 0.375295 ms 32 MB = 0.643256 ms 64 MB = 1.293962 ms 128 MB = 2.658412 ms 256 MB = 5.288432 ms hm 4 MB = 0.091336 ms 8 MB = 0.187152 ms 16 MB = 0.343821 ms 32 MB = 0.638406 ms 64 MB = 1.281708 ms 128 MB = 2.563310 ms 256 MB = 5.109415 ms hm+bosc 4 MB = 0.092013 ms 8 MB = 0.185993 ms 16 MB = 0.360132 ms 32 MB = 0.941173 ms 64 MB = 2.724979 ms 128 MB = 6.140287 ms 256 MB = 12.867246 ms hm+bosc+dlsym 4 MB = 0.091810 ms 8 MB = 0.188023 ms 16 MB = 0.375594 ms 32 MB = 0.647143 ms 64 MB = 1.288610 ms 128 MB = 2.557970 ms 256 MB = 5.114027 ms Signed-off-by: Tavi <tavi@divested.dev>
|
Got another email and took another look and there's a few places where you use I'm on my phone right now or I'd submit a patch but in if (unlikely((len * sizeof(wchar_t)) > malloc_object_size(dst))) {
fatal_error("wmemset buffer overflow");
}
return musl_wmemset(dst, value, len);Which calls into wchar_t *musl_wmemset(wchar_t *d, wchar_t c, size_t n)
{
wchar_t *ret = d;
while (n--) *d++ = c;
return ret;
}The flow is:
This same issue is present in I think there's also a way to solve the Also |
for #231
task list
fatal allocator error: invalid malloc_object_sizecompared to others: