Redesign `intrinsic-test` to use simple comparison by sayantn · Pull Request #2063 · rust-lang/stdarch

sayantn · 2026-03-16T00:10:47Z

Currently intrinsic-test prints the outputs and then compares the outputs manually. This PR uses a different approach -- generate C wrappers for the intrinsics, link to them from Rust, and then just use simple rust tests to compare outputs

Blocked on ~~#2095~~ ~~#2096~~ ~~#2099~~ ~~#2100~~ ~~#2101~~ ~~#2105~~ #2106 ~~#2107~~ ~~#2108~~ #2111 #2112

edit: I won't split this any further, so after #2106, #2111 and #2112 merges, I will mark this ready for review

sayantn · 2026-03-16T01:23:31Z

---- test_vdupq_n_f16 stdout ----

thread 'test_vdupq_n_f16' (2187) panicked at mod_0/src/lib.rs:13773:17:
assertion `left == right` failed: 
  left: [NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0)]
 right: [NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(0.0), NiceF16(1.43e-5), NiceF16(0.0), NiceF16(-50430.0), NiceF16(1.79e-5)]

This seems weird (left is the Rust output, right is the C one, and NiceF16 is a wrapper which implements PartialEq as a == b || (a.is_nan() && b.is_nan())). This looks like ABI-related issue. For reference, the declaration looks like

unsafe extern "C" {
    fn vdup_n_f16_wrapper(value: f16) -> float16x4_t;
}

In fact most f16 tests fail in armv7. @folkertdev can you help?

Edit:

To work around this issue I have modified the tool to communicate with C via pointers (e.g. the C wrapper for _mm_add_ps looks like void _mm_add_ps_wrapper(__m128 *dst, const __m128* a, const __m128* b). This fixed the AArch64 and ARMv7 problems, but now the AArch64BE tests are failing, because apparently C and Rust have different pointer load semantics for matrix-like vectors (e.g. uint64x2x2_t) https://godbolt.org/z/j1d16z1P9

@Amanieu is this intended behavior or a bug?

sayantn · 2026-03-16T08:46:41Z

Btw the time gains are significant, it reduces the Arm and aarch64 times to 2-3 minutes, and the full x86 run (we did 20% previously) to around 12 mins for release and 17 mins for dev

folkertdev · 2026-03-16T19:40:36Z

Great work. Quick sanity check on f16, perhaps we're still using LLVM 21 to compile the C? If LLVM 21 is used (and also on windows apparently still with LLVM 22) then on some targets the ABI is inconsistent.

sayantn · 2026-03-16T20:06:47Z

@folkertdev ooh, that makes sense. I don't particularly care about windows, but we are using LLVM20 in the CI. I can change it to use the build from kernel.org

folkertdev · 2026-03-16T20:10:12Z

I'm seeing clang-18 here even https://triage.rust-lang.org/gha-logs/rust-lang/stdarch/67182959392?pr=2063. I'm not sure what the best solution is really. You could ask T-infra if they have ideas.

sayantn · 2026-03-16T20:14:58Z

yeah, but I can use the LLVM github builds or the kernel.org builds

tgross35 · 2026-03-16T20:56:46Z

Can f16 tests just be gated with #[cfg(target_has_reliable_f16)]? That's likely easier than working around the Windows and old LLVM failures.

sayantn · 2026-03-16T20:58:30Z

@tgross35 the f16 tests are mostly fine now. More concerning is that a lot of tests are failing in all 3 arm archs, e.g. vzipq. The C version seems to return all zeros

edit: sorry, my mistake, they are still failing in ARMv7. I will gate them against the flag

folkertdev · 2026-03-16T21:20:26Z

With LLVM 22 f16 should work on armv7 though?

tgross35 · 2026-03-16T21:27:37Z

FTZ/DAZ-related perhaps?

sayantn · 2026-03-17T06:01:39Z

FTZ/DAZ-related perhaps?

I don't really think so, the outputs seem completely distinct.

I noticed that vzipq etc was failing so I tried out the assemblies.
In aarch64_be-unknown-linux-gnu,

use core::arch::aarch64::*;

#[unsafe(no_mangle)]
#[target_feature(enable = "neon")]
pub unsafe extern "C" fn foo(dst: *mut uint8x16x2_t, a: *const uint8x16_t, b: *const uint8x16_t) {
    unsafe {
        *dst = vzipq_u8(*a, *b);
    }
}

produces

foo:
        ld1 { v0.16b }, [x1]
        ld1 { v1.16b }, [x2]
        add x8, x0, #16
        zip1 v2.16b, v0.16b, v1.16b
        zip2 v0.16b, v0.16b, v1.16b
        st1 { v2.16b }, [x0]
        st1 { v0.16b }, [x8]
        ret

But the C code seemingly has different behavior on GCC and clang https://godbolt.org/z/T3YnrejjG

@adamgemmell can you help in this?

adamgemmell · 2026-03-18T14:55:11Z

I'm not sure it will fix your issue but the difference in instructions comes from the fact that in arm_neon.h, they reverse every vector before and after the operation on big endian. It's not always actually necessary so we only do it if it's broken without it - however, the intrinsic test tool doesn't detect the difference in behaviour because both arguments it picks are identical.

e.g.:

| a               | 15 | 14 | 13 | ... | 2  | 1  | 0  |
| b               | 31 | 30 | 29 | ... | 18 | 17 | 16 |
| a = rev(a)      | 0  | 1  | 2  | ... | 13 | 14 | 15 |
| b = rev(b)      | 16 | 17 | 18 | ... | 29 | 30 | 31 |
| ret = zip(a, b) | 0  | 16 | 1  | ... | 30 | 15 | 31 |
| rev(ret)        | 31 | 15 | 30 | ... | 1  | 16 | 0  |

adamgemmell · 2026-03-18T15:53:48Z

You can try adding big_endian_inverse: true to the definition in the yaml, regenerating to see if it changes anything. We can probably offset the pointers to the values array for arguments slightly to ensure they're different

Also I don't actually see vzipq_u8 on the latest CI run, why is that?

sayantn · 2026-03-18T16:25:37Z

Also I don't actually see vzipq_u8 on the latest CI run, why is that?

I have no idea, I can confirm that locally the test is generated and run.

You can try adding big_endian_inverse: true to the definition in the yaml, regenerating to see if it changes anything. We can probably offset the pointers to the values array for arguments slightly to ensure they're different

I will check. Thanks

Edit: @adamgemmell adding big_endian_inverse: true to the functions seem to work. The only question remains is that is it the correct behavior, or is clang buggy here

Edit2: sorry, vzipq_u8 is not getting tested even locally. I will look into it

adamgemmell · 2026-03-18T17:36:27Z

None of the unsigned variants of vzipq seem to be seen there, weird.

I'd quite like to know why this patch detects the difference - when I looked locally the codegen of the tests seemed very similar

sayantn · 2026-03-18T18:19:32Z

Yeah I fixed the test not being included, I used / instead of div_ceil when computing the chunk sizes, so the last module wasn't getting included 😅

This reverts commit 5f676a5.

This enables the `big_endian_inverse` transformation to apply to some of these intrinsics only when appropriate.

…sted specifically

Instead of doing an `as {i,u}16` cast, we can use the same LLVM intrinsics as when converting to `{i,u}{32,64}`, which is what Clang does and ensures the intrinsic result matches.

…files wrappers

sayantn force-pushed the intrinsic-test branch 4 times, most recently from feb1dcd to 6ef8b8f Compare March 16, 2026 00:59

sayantn force-pushed the intrinsic-test branch 4 times, most recently from e2346ff to db1b2ca Compare March 16, 2026 06:16

sayantn force-pushed the intrinsic-test branch 2 times, most recently from ce53e81 to 76dd339 Compare March 16, 2026 20:48

sayantn force-pushed the intrinsic-test branch from 76dd339 to c4b138f Compare March 17, 2026 06:18