GH-47769: [C++] SVE dynamic dispatch #49756

Open
AntoinePrv wants to merge 21 commits into apache:main from AntoinePrv:sve-dispatch

Contributor

@AntoinePrv AntoinePrv commented Apr 15, 2026

Rationale for this change

Just like we dynamically dispatch to AVX2 on x86 CPUs, we want to dynamically dispatch to more advanced SIMD extensions on ARM64 chips.

What changes are included in this PR?

  • A new macro to enable selecting the runtime SVE version
  • Detection of the ARM64 CPU features available at runtime
  • Adding SVE to the dynamic dispatch for bit unpacking algorithms.
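
To illustrate the dispatch idea in the list above, here is a minimal sketch; it is not the PR's code. Every name in it (`SimdLevel`, `Implementation`, `SelectBest`, the stub kernels, and the two fake CPU predicates) is a hypothetical stand-in. Each candidate is tagged with the SIMD level it requires, and the most advanced candidate that the running CPU reports as supported is selected at runtime.

```cpp
#include <array>
#include <cstddef>

// Hypothetical SIMD levels, ordered from least to most advanced.
enum class SimdLevel { kNone = 0, kNeon = 1, kSve128 = 2 };

// Stub kernels: each returns a tag so the selection is observable.
inline int UnpackScalarStub() { return 0; }
inline int UnpackNeonStub() { return 1; }
inline int UnpackSve128Stub() { return 2; }

using UnpackFn = int (*)();

struct Implementation {
  SimdLevel required_level;
  UnpackFn func;
};

// Candidates assumed to be compiled into the binary (an assumption of this sketch).
inline constexpr std::array<Implementation, 3> kImplementations = {{
    {SimdLevel::kNone, &UnpackScalarStub},
    {SimdLevel::kNeon, &UnpackNeonStub},
    {SimdLevel::kSve128, &UnpackSve128Stub},
}};

// Fake runtime detection results for two kinds of machines.
inline bool NeonOnlyCpu(SimdLevel level) { return level <= SimdLevel::kNeon; }
inline bool SveCapableCpu(SimdLevel level) { return level <= SimdLevel::kSve128; }

// Pick the supported candidate with the highest required level.
// Returns nullptr if no candidate is supported.
template <std::size_t N>
UnpackFn SelectBest(const std::array<Implementation, N>& impls,
                    bool (*is_supported)(SimdLevel)) {
  const Implementation* best = nullptr;
  for (const auto& impl : impls) {
    if (!is_supported(impl.required_level)) continue;
    if (best == nullptr || impl.required_level > best->required_level) {
      best = &impl;
    }
  }
  return best != nullptr ? best->func : nullptr;
}
```

In this sketch, `SelectBest(kImplementations, &NeonOnlyCpu)` yields the Neon stub and `SelectBest(kImplementations, &SveCapableCpu)` yields the SVE128 stub. A real dispatcher would typically resolve the function pointer once and cache it rather than search on every call.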

Are these changes tested?

Are there any user-facing changes?

No.

@AntoinePrv AntoinePrv changed the title from "Sve dynamic dispatch" to "GH-47769: [C++] Sve dynamic dispatch" Apr 15, 2026
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@github-actions

⚠️ GitHub issue #47769 has been automatically assigned in GitHub to PR creator.

@AntoinePrv AntoinePrv force-pushed the sve-dispatch branch 4 times, most recently from 2925550 to ff8566b April 21, 2026 12:47
@AntoinePrv AntoinePrv marked this pull request as ready for review April 21, 2026 14:01
@pitrou pitrou changed the title from "GH-47769: [C++] Sve dynamic dispatch" to "GH-47769: [C++] SVE dynamic dispatch" Apr 21, 2026
Member

@pitrou pitrou left a comment

Thanks for doing this! Here are a number of comments, questions, and suggestions.

Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
Comment thread cpp/cmake_modules/DefineOptions.cmake
Comment thread cpp/cmake_modules/DefineOptions.cmake
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake
Comment thread cpp/src/arrow/util/bpacking.cc
// under the License.

#if defined(ARROW_HAVE_NEON)
# define UNPACK_PLATFORM unpack_neon
Member

Can we just include bpacking_simd_internal.h and reuse the UNPACK_ARCH128 macro?

Contributor Author

Possibly, though I thought it best that the macro be #undef'd at the end of the header (making it useless here).
We can give it a more explicit name (ARROW_BPACKING_UNPACK_ARCH128) and not undefine it.

Comment on lines 27 to +31
#if defined(ARROW_HAVE_NEON)
# define UNPACK_ARCH128 unpack_neon
#elif defined(ARROW_HAVE_SSE4_2)
# define UNPACK_ARCH128 unpack_sse4_2
#endif
Member

Relying on ARROW_HAVE_NEON etc. is why we need the "128 alt" case, right?

Perhaps we can also depend on which target the file is being compiled for.
For example we could have:

macro(append_runtime_sve128_src SRCS SRC)
  if(ARROW_HAVE_RUNTIME_SVE128)
    list(APPEND ${SRCS} ${SRC})
    set_source_files_properties(${SRC}
                                PROPERTIES COMPILE_OPTIONS "${ARROW_SVE128_FLAGS}"
                                           COMPILE_DEFINITIONS
                                           "ARROW_COMPILING_FOR_SVE128")
  endif()
endmacro()

and then:

#if defined(ARROW_COMPILING_FOR_SVE128)
#  define UNPACK_ARCH128 unpack_sve128
#elif defined(ARROW_HAVE_NEON)
#  define UNPACK_ARCH128 unpack_neon
#elif defined(ARROW_HAVE_SSE4_2)
#  define UNPACK_ARCH128 unpack_sse4_2
#endif

Contributor Author

The issue is we need the file compiled twice in ARM (Neon + sve128).
I think it is not possible directly in CMake. The solution is to copy the file to the build tree, then compile each copy with different flags.
Is a CMake-only solution satisfactory?

Member

> The issue is we need the file compiled twice in ARM (Neon + sve128).
> I think it is not possible directly in CMake.

The easy workaround is to have the same .h file included in two different stub .cc files.

For example have bpacking_simd128_internal.h included by both bpacking_neon.cc and bpacking_sve128.cc.

Comment thread cpp/src/arrow/util/bpacking_test.cc
Comment thread cpp/src/arrow/util/cpu_info.h
Comment thread cpp/src/arrow/util/dispatch_internal.h
@github-actions github-actions Bot added the "awaiting committer review" label and removed the "awaiting review" label Apr 21, 2026
@AntoinePrv
Contributor Author

@pitrou I definitely agree about the duplication across the different files, it's pretty tedious.
I think addressing it would make this PR too large, but we should definitely think of something, including providing some CMake utilities in xsimd.

@pitrou pitrou added the "CI: Extra: C++" label Apr 22, 2026
@pitrou
Member

pitrou commented Apr 22, 2026

Something isn't quite right on ARM64 Ubuntu and ARM64 macOS. -march=armv8-a+sve is added to the default compiler flags even though we have ARROW_SIMD_LEVEL=NEON.

Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake
return dispatch.func(in, out, opts);
#endif
auto constexpr kImplementations = UnpackDynamicFunction<Uint>::implementations();
if constexpr (kImplementations.size() == 1) {
Member

Is this condition actually useful? I guess it's a shortcut, but it's not obvious that it applies to common cases (x86 or ARM with default SIMD options).

At worst, this could be added generically to DynamicDispatch instead. But I doubt it's worth it.

Contributor Author

It is worth it to avoid an additional #ifdef: for instance, on macOS there is only Neon and no SVE, so there is no need for dynamic dispatch.
Previously we'd exclude the Neon version from the dynamic dispatch, use #ifdef ARROW_HAVE_NEON, and go straight to the Neon implementation.

> At worst, this could be added generically to DynamicDispatch instead. But I doubt it's worth it.

Actually done in GH-49840 so either way here (we'd need to adapt the PR that is not merged first).

Member

> Actually done in GH-49840 so either way here

That PR might prove difficult to adapt for all the lousy compilers we have to support, so I'd rather focus on this one first :)
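
The shortcut discussed in this thread can be sketched as follows. This is an illustration only: `SingleImplBuild`, `MultiImplBuild`, `Call`, and the stub kernels are hypothetical names, and the runtime index stands in for whatever selection the real `DynamicDispatch` performs. The point is that when the compile-time list of implementations has exactly one entry, `if constexpr` lets the call go straight to it without any `#ifdef` and without a runtime lookup.

```cpp
#include <array>
#include <cstddef>

using KernelFn = int (*)(int);

// Stub kernels standing in for real SIMD implementations.
inline int KernelNeonStub(int x) { return x + 1; }
inline int KernelSveStub(int x) { return x + 2; }

// Hypothetical build where only one implementation is compiled in
// (for example, a platform with Neon but no SVE).
struct SingleImplBuild {
  static constexpr std::array<KernelFn, 1> implementations() {
    return {{&KernelNeonStub}};
  }
};

// Hypothetical build where several implementations are available.
struct MultiImplBuild {
  static constexpr std::array<KernelFn, 2> implementations() {
    return {{&KernelNeonStub, &KernelSveStub}};
  }
};

template <typename Build>
int Call(int x, [[maybe_unused]] std::size_t runtime_choice) {
  constexpr auto kImplementations = Build::implementations();
  if constexpr (kImplementations.size() == 1) {
    // Only one candidate: call it directly, no runtime selection needed.
    return kImplementations[0](x);
  } else {
    // Several candidates: fall back to a runtime choice.
    return kImplementations[runtime_choice](x);
  }
}
```

With these stubs, `Call<SingleImplBuild>(10, 0)` takes the direct path, while `Call<MultiImplBuild>(10, 1)` goes through the runtime index.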

Comment thread cpp/src/arrow/util/bpacking_benchmark.cc Outdated
Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
->ArgsProduct(kBitWidthsNumValues64);
#endif

#if defined(ARROW_HAVE_RUNTIME_SVE128)
Member

I wonder if there's an easy way to reduce the duplication we're doing for each runtime SIMD level?

For example if we could write something like:

BENCHMARK_SIMD_UNPACK(Bool, bool, SVE128, Sve128, sve128);

and it would expand to:

BENCHMARK_CAPTURE(BM_UnpackBool, Sve128Unaligned, false, &bpacking::unpack_sve128<bool>,
                  !CpuInfo::GetInstance()->IsSupported(CpuInfo::SVE128),
                  "Sve128 not available")
    ->ArgsProduct(kBitWidthsNumValues<bool>);

Contributor Author

You mean with a macro?

Member

Yes!
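
A macro along the lines suggested in this thread might look like the sketch below. To keep it self-contained, Google Benchmark and Arrow's `CpuInfo` are replaced by hypothetical stand-ins (`Registry`, `Register`, `IsSupported`, `CpuFeature`, and the stub `unpack_*` templates); only the token-pasting and stringification pattern is the point, and the macro name `BENCHMARK_SIMD_UNPACK_SKETCH` is invented here.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins so the sketch compiles without Google Benchmark.
enum class CpuFeature { NEON, SVE128 };

// Pretend we are running on a NEON-only machine.
inline bool IsSupported(CpuFeature feature) { return feature == CpuFeature::NEON; }

using UnpackFn = int (*)(int);
template <typename T> int unpack_neon(int n) { return n; }
template <typename T> int unpack_sve128(int n) { return n; }

struct Registration {
  std::string name;
  UnpackFn func;
  bool skip;
  std::string skip_message;
};

inline std::vector<Registration>& Registry() {
  static std::vector<Registration> registry;
  return registry;
}

inline bool Register(Registration registration) {
  Registry().push_back(std::move(registration));
  return true;
}

// One line per SIMD level: NAME and LABEL are stringified into the benchmark
// name, SUFFIX is pasted onto the kernel name, FEATURE selects the CPU check.
#define BENCHMARK_SIMD_UNPACK_SKETCH(NAME, TYPE, FEATURE, LABEL, SUFFIX)   \
  static const bool kRegistered_##NAME##_##LABEL = Register(               \
      {"BM_Unpack" #NAME "/" #LABEL "Unaligned", &unpack_##SUFFIX<TYPE>,   \
       !IsSupported(CpuFeature::FEATURE), #LABEL " not available"})

BENCHMARK_SIMD_UNPACK_SKETCH(Bool, bool, NEON, Neon, neon);
BENCHMARK_SIMD_UNPACK_SKETCH(Bool, bool, SVE128, Sve128, sve128);
```

Each invocation replaces one hand-written registration block. In the real benchmark file, the macro body would presumably wrap `BENCHMARK_CAPTURE(...)->ArgsProduct(...)` as in the expansion quoted above, instead of the `Register` stand-in.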

Comment thread cpp/src/arrow/util/bpacking_test.cc Outdated
@pitrou
Member

pitrou commented Apr 23, 2026

@AntoinePrv Is it possible to run some ARM benchmarks and paste the results somewhere once you're satisfied with the PR?

@github-actions github-actions Bot removed the "CI: Extra: C++" label Apr 23, 2026
@AntoinePrv
Contributor Author

Bad manipulation...

I pushed a commit with a more aggressive inlining strategy. All SIMD code seems to be doing better with respect to scalar, but the previous conclusions remain: sve128 is still better than Neon, and sometimes scalar is the best.
The fix was in the prologue/epilogue, so this should also benefit x86.

@pitrou
Member

pitrou commented Apr 27, 2026

@ursabot please benchmark lang=C++

@rok
Member

rok commented Apr 27, 2026

Benchmark runs are scheduled for commit e11d5ee. Watch https://buildkite.com/apache-arrow and https://conbench.arrow-dev.org for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Apr 27, 2026

It seems that __forceinline on MSVC implies inline and so the compiler complains when you combine them:

D:\a\arrow\arrow\cpp\src\arrow/util/bpacking_dispatch_internal.h(63): error C2220: the following warning is treated as an error
D:\a\arrow\arrow\cpp\src\arrow/util/bpacking_dispatch_internal.h(63): warning C4141: 'inline': used more than once

I think the solution is to add inline to the ARROW_FORCE_INLINE definition on non-MSVC compilers.
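
A sketch of what that could look like, under assumptions: the macro name `SKETCH_FORCE_INLINE` is a stand-in for `ARROW_FORCE_INLINE`, and the exact attribute spelling used in Arrow's own macro header may differ. The idea is that the macro carries `inline` itself on GCC and Clang, so call sites write only the macro and never combine it with a separate `inline` keyword.

```cpp
// On MSVC, __forceinline is used alone; adding `inline` next to it is what
// triggers warning C4141 in the log above.
#if defined(_MSC_VER)
#  define SKETCH_FORCE_INLINE __forceinline
// On GCC and Clang, always_inline does not imply `inline`, so the macro adds it.
#elif defined(__GNUC__) || defined(__clang__)
#  define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
// Fallback for other compilers: a plain inline hint.
#else
#  define SKETCH_FORCE_INLINE inline
#endif

// Call sites use the macro alone, without an extra `inline`.
SKETCH_FORCE_INLINE int AddOne(int x) { return x + 1; }
```

With this shape, a declaration such as `SKETCH_FORCE_INLINE int AddOne(int x)` compiles the same way on all three branches, and the duplicate-`inline` warning cannot arise from the call site.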

@cyb70289
Contributor

Just for reference

I did a quick poke with an AI coding agent. It analyzed why the Neon code is not inlined and proposed a fix to xsimd: neon-bitcast-inline.patch

Unit test passed. Neon code is slightly faster than SVE128, which matches expectations. I only tested one case.

# neon
BM_UnpackBool/NeonUnaligned/1/32       6.56 ns         6.56 ns    107048724 items_per_second=4.87937G/s

# sve128
BM_UnpackBool/Sve128Unaligned/1/32       7.06 ns         7.06 ns     99251620 items_per_second=4.53545G/s

I suspected xsimd bitcast Neon code may be too complicated for compiler to inline (maybe related to my old PR to fix an issue, but I forgot details).
Debug report from coding agent (I haven't read it carefully): findings.md

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit e11d5ee.

There were 27 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou
Member

pitrou commented Apr 27, 2026

Our ARM benchmarking platform, which has NEON but not SVE (I think it's a Graviton 2 machine), shows very nice speedups on some Parquet reading/decoding benchmarks. 🎉

@AntoinePrv
Contributor Author

> I suspected xsimd bitcast Neon code may be too complicated for compiler to inline (maybe related to my old PR to fix an issue, but I forgot details).

@cyb70289 for unrelated reasons (building WIN ARM64), I ended up flattening this implementation as well in xtensor-stack/xsimd#1317. We should be merging it and releasing soon after so we'll see how it performs.


4 participants