GH-47769: [C++] SVE dynamic dispatch#49756
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
2925550 to
ff8566b
Compare
pitrou
left a comment
There was a problem hiding this comment.
Thanks for doing this! Here are a number of comments, questions, and suggestions.
| // under the License. | ||
|
|
||
| #if defined(ARROW_HAVE_NEON) | ||
| # define UNPACK_PLATFORM unpack_neon |
There was a problem hiding this comment.
Can we just include bpacking_simd_internal.h and reuse the UNPACK_ARCH128 macro?
There was a problem hiding this comment.
Possibly, though I thought it best that the macro is #undef at the end of the header (making it useless here).
We can make it more explicit (ARROW_BPACKING_UNPACK_ARCH128) an not undefining it.
| #if defined(ARROW_HAVE_NEON) | ||
| # define UNPACK_ARCH128 unpack_neon | ||
| #elif defined(ARROW_HAVE_SSE4_2) | ||
| # define UNPACK_ARCH128 unpack_sse4_2 | ||
| #endif |
There was a problem hiding this comment.
Relying on ARROW_HAVE_NEON etc. is why we need the "128 alt" case, right?
Perhaps we can also depend on which target the file is being compiled for.
For example we could have:
macro(append_runtime_sve128_src SRCS SRC)
if(ARROW_HAVE_RUNTIME_SVE128)
list(APPEND ${SRCS} ${SRC})
set_source_files_properties(${SRC}
PROPERTIES COMPILE_OPTIONS "${ARROW_SVE128_FLAGS}"
COMPILE_DEFINITIONS
"ARROW_COMPILING_FOR_SVE128")
endif()
endmacro()and then:
#if defined(ARROW_COMPILING_FOR_SVE128)
# define UNPACK_ARCH128 unpack_sve128
#elif defined(ARROW_HAVE_NEON)
# define UNPACK_ARCH128 unpack_neon
#elif defined(ARROW_HAVE_SSE4_2)
# define UNPACK_ARCH128 unpack_sse4_2
#endifThere was a problem hiding this comment.
The issue is we need the file compiled twice in ARM (Neon + sve128).
I think it is not possible directly in CMake. The solution is copy to build tree then compile each with different flags.
Is a CMake-only solution satisfactory?
There was a problem hiding this comment.
The issue is we need the file compiled twice in ARM (Neon + sve128).
I think it is not possible directly in CMake.
The easy workaround is to have the same .h file included in two different stub .cc files.
For example have bpacking_simd128_internal.h included by both bpacking_neon.cc and bpacking_sve128.cc.
|
@pitrou I definitely agree with the duplication of the different files, it's pretty tedious. |
ff8566b to
23dc5f0
Compare
|
Something isn't quite right on ARM64 Ubuntu and ARM64 macOS. |
| return dispatch.func(in, out, opts); | ||
| #endif | ||
| auto constexpr kImplementations = UnpackDynamicFunction<Uint>::implementations(); | ||
| if constexpr (kImplementations.size() == 1) { |
There was a problem hiding this comment.
Is this condition actually useful? I guess it's a shortcut, but it's not obvious that it applies to common cases (x86 or ARM with default SIMD options).
At worse, this could be added generically to DynamicDispatch instead. But I doubt it's worth it.
There was a problem hiding this comment.
It is worth it to avoid additional #ifdef, for instance on Macos there is only neon and no SVE (no need to dyn dispatch).
Previously we'd exclude the Neon version from the dynamic dispatch and go #ifdef ARROW_HAVE_NEON then go straight to Neon implementation.
At worse, this could be added generically to
DynamicDispatchinstead. But I doubt it's worth it.
Actually done in GH-49840 so either way here (we'd need to adapt the PR that is not merged first).
There was a problem hiding this comment.
Actually done in GH-49840 so either way here
That PR might prove difficult to adapt for all the lousy compilers we have to support, so I'd rather focus on this one first :)
23dc5f0 to
a8f0ddf
Compare
a8f0ddf to
80e0ab2
Compare
| ->ArgsProduct(kBitWidthsNumValues64); | ||
| #endif | ||
|
|
||
| #if defined(ARROW_HAVE_RUNTIME_SVE128) |
There was a problem hiding this comment.
I wonder if there's an easy way to reduce the duplication we're doing for each runtime SIMD level?
For example if we could write something like:
BENCHMARK_SIMD_UNPACK(Bool, bool, SVE128, Sve128, sve128);and it would expand to:
BENCHMARK_CAPTURE(BM_UnpackBool, Sve128Unaligned, false, &bpacking::unpack_sve128<bool>,
!CpuInfo::GetInstance()->IsSupported(CpuInfo::SVE128),
"Sve128 not available")
->ArgsProduct(kBitWidthsNumValues<bool>);There was a problem hiding this comment.
You mean with a macro?
|
@AntoinePrv Is it possible to run some ARM benchmarks and paste the results somewhere once you're satisfied with the PR? |
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
7ecbd23 to
cc15bf1
Compare
|
Bad manipulation... I pushed a commit with more aggressive inlining strategy. All SIMD code seems to be doing better wrt to scala, but the previous conclusions remain: |
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit e11d5ee. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete. |
|
It seems that
I think the solution is to add |
|
Just for reference Did a quick poke with AI coding agent. It analyzed the reason why Neon code is not inlined and proposed a fix to xsimd: neon-bitcast-inline.patch Unit test passed. Neon code is slightly faster than SVE128, matches expectation. I only tested one case. I suspected xsimd bitcast Neon code may be too complicated for compiler to inline (maybe related to my old PR to fix an issue, but I forgot details). |
|
Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit e11d5ee. There were 27 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
Our ARM benchmarking platform, which has NEON but not SVE (I think it's a Graviton 2 machine), shows very nice speedups on some Parquet reading/decoding benchmarks. 🎉 |
@cyb70289 for unrelated reasons (building WIN ARM64), I ended up flattening this implementation as well in xtensor-stack/xsimd#1317. We should be merging it and releasing soon after so we'll see how it performs. |
Rationale for this change
Just like we dynamically dispatch to AVX2 on x86 CPUs, we want to dynamically dispatch to more advanced SIMD extension on ARM64 chips.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
No.