Add SIMD optimization for int_to_float conversion #580

Merged
veluca93 merged 14 commits into libjxl:main from hjanuschka:simd-int-to-float
Jan 21, 2026

hjanuschka (Collaborator) commented Dec 22, 2025

This PR adds SIMD fast paths for the int_to_float function, which converts custom bit-depth floats stored as i32 back to f32.

32-bit float: straightforward bitcast via SIMD.
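As a rough illustration of what the 32-bit passthrough amounts to, here is a scalar sketch. The function name `bitcast_i32_to_f32` is hypothetical and not taken from the PR; a SIMD backend would do the same reinterpretation on a whole vector at once.

```rust
/// Reinterpret each i32 as the f32 that has the same bit pattern.
/// Hypothetical helper for illustration; not the PR's actual code.
fn bitcast_i32_to_f32(input: &[i32], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    for (dst, &src) in output.iter_mut().zip(input) {
        // `as u32` keeps the bits unchanged; `from_bits` reinterprets them.
        *dst = f32::from_bits(src as u32);
    }
}
```

No arithmetic is involved, which is why this case maps directly onto a vector bitcast.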

16-bit float (f16): SIMD handles normal values, zeros, and inf/nan. Subnormals fall back to scalar since they need a variable-iteration normalization loop.
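To make the subnormal case concrete, the following scalar sketch converts IEEE 754 binary16 bits held in an i32 to f32. It is an illustration under the standard f16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits), not the code from this PR, and the name `f16_bits_to_f32` is made up here. The `while` loop in the last branch is the variable-iteration normalization that is awkward to vectorize.

```rust
/// Convert binary16 bits (low 16 bits of `bits`) to f32.
/// Illustrative sketch only; the name is hypothetical.
fn f16_bits_to_f32(bits: i32) -> f32 {
    let h = bits as u32 & 0xFFFF;
    let sign = (h >> 15) << 31;
    let exp = (h >> 10) & 0x1F;
    let mant = h & 0x3FF;

    let out = if exp == 0x1F {
        // Inf or NaN: all-ones exponent, mantissa moved into the top bits.
        sign | 0x7F80_0000 | (mant << 13)
    } else if exp != 0 {
        // Normal: rebias the exponent from 15 to 127 (difference 112).
        sign | ((exp + 112) << 23) | (mant << 13)
    } else if mant == 0 {
        // Signed zero.
        sign
    } else {
        // Subnormal: shift left until the implicit leading bit (bit 10)
        // appears. The iteration count depends on the value.
        let mut m = mant;
        let mut e: u32 = 113; // biased f32 exponent of 2^-14
        while m & 0x400 == 0 {
            m <<= 1;
            e -= 1;
        }
        sign | (e << 23) | ((m & 0x3FF) << 13)
    };
    f32::from_bits(out)
}
```

The first three branches are fixed shift-and-mask operations, so they can be evaluated per lane and blended; only the last branch needs a data-dependent loop.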

Waiting for perf CI to see the impact.

github-actions (Bot) commented Dec 22, 2025

Benchmark @ c1f5321

MULTI-FILE BENCHMARK RESULTS (4 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.04). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: 47e5c029 (Base) vs 37f8a2a9 (PR)

File                         Base (MP/s)  PR (MP/s)  Δ%
bike.jxl                     23.506       23.416     -0.38% ±2.1%
green_queen_modular_e3.jxl   7.724        7.830      +1.36% ±1.4%
green_queen_vardct_e3.jxl    21.171       21.140     -0.14% ±1.3%
sunset_logo.jxl              2.240        2.234      -0.27% ±0.4%

Comment thread jxl/src/render/stages/convert.rs
Comment thread jxl_simd/src/x86_64/avx.rs Outdated
Comment thread jxl_simd/src/aarch64/neon.rs Outdated
Comment thread jxl_simd/src/x86_64/avx512.rs Outdated
Comment thread jxl_simd/src/x86_64/avx512.rs Outdated
Comment thread jxl_simd/src/x86_64/avx512.rs Outdated
Comment thread jxl_simd/src/x86_64/avx512.rs Outdated
Comment thread jxl_simd/src/scalar.rs Outdated
Comment thread jxl/src/render/stages/convert.rs Outdated
Comment thread jxl/src/render/stages/convert.rs Outdated
Comment thread jxl/src/render/stages/convert.rs
Comment thread jxl_simd/src/x86_64/avx.rs Outdated
Comment thread jxl_simd/src/x86_64/avx.rs Outdated
Comment thread jxl_simd/src/lib.rs Outdated
Comment thread jxl/src/render/stages/convert.rs
Comment thread jxl/src/render/stages/convert.rs Outdated
hjanuschka and others added 12 commits January 21, 2026 13:25
Add SIMD fast paths for converting custom bit-depth floats to f32:
- 32-bit float passthrough: Simple bitcast using SIMD
- 16-bit float (f16/half-precision): SIMD conversion with scalar fallback
  for subnormal values

The 16-bit float SIMD path handles normal, zero, and inf/nan cases directly,
falling back to scalar for the rare subnormal case which requires
variable-iteration normalization.

Also adds BitDepth::f16() test helper and comprehensive unit tests for
the conversion functions.

Address veluca93 review: add load_f16_bits() and store_f16() methods
to F32SimdVec trait instead of implementing conversion in convert.rs.

- AVX2+F16C: Hardware _mm256_cvtph_ps/_mm256_cvtps_ph
- AVX-512: Hardware _mm512_cvtph_ps/_mm512_cvtps_ph
- SSE4.2/NEON/Scalar: Scalar fallback

Simplifies convert.rs by ~100 lines.
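For readers unfamiliar with the F16C intrinsic named above, here is a standalone sketch of the load side. It is not the `F32SimdVec` implementation from this PR: the function names are hypothetical, and it uses runtime feature detection with a scalar fallback so it builds on any target. `_mm256_cvtph_ps` converts eight f16 values per call; leftover elements go through the scalar path.

```rust
/// Scalar f16-bits -> f32 (hypothetical helper). Normals and subnormals are
/// handled by shifting into f32 position and multiplying by 2^112.
fn f16_to_f32_scalar(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let mag = (h as u32) & 0x7FFF;
    if mag >= 0x7C00 {
        // Inf or NaN: force the f32 exponent to all ones.
        return f32::from_bits(sign | 0x7F80_0000 | ((mag & 0x3FF) << 13));
    }
    let scaled = f32::from_bits(mag << 13) * f32::from_bits(0x7780_0000); // * 2^112
    f32::from_bits(scaled.to_bits() | sign)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn f16_to_f32_f16c(input: &[u16], output: &mut [f32]) {
    use std::arch::x86_64::*;
    let mut src = input.chunks_exact(8);
    let mut dst = output.chunks_exact_mut(8);
    for (s, d) in (&mut src).zip(&mut dst) {
        // SAFETY: `s` holds 8 u16 (16 bytes) and `d` holds 8 f32 (32 bytes);
        // both intrinsics accept unaligned pointers.
        unsafe {
            let half = _mm_loadu_si128(s.as_ptr() as *const __m128i);
            _mm256_storeu_ps(d.as_mut_ptr(), _mm256_cvtph_ps(half));
        }
    }
    // Elements that do not fill a whole vector use the scalar conversion.
    for (s, d) in src.remainder().iter().zip(dst.into_remainder()) {
        *d = f16_to_f32_scalar(*s);
    }
}

/// Dispatch to the hardware path when available (assumed design, not the PR's).
fn f16_to_f32(input: &[u16], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            // SAFETY: the required CPU features were detected just above.
            unsafe { f16_to_f32_f16c(input, output) };
            return;
        }
    }
    for (d, &s) in output.iter_mut().zip(input) {
        *d = f16_to_f32_scalar(s);
    }
}
```

The hardware instruction handles subnormals itself, so on this path no separate subnormal fallback is needed.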

- AVX: Always require f16c for AVX2 path (removes runtime check)
- AVX512: Restructure inner functions to not be unsafe, only wrap
  memory operations in unsafe blocks with SAFETY comments
- NEON: Use inline ASM for f16 conversion (fcvtl/fcvtn) since
  stdarch incorrectly requires fp16 feature for basic conversion
- Add f16 type module to jxl_simd and use it instead of u16/standalone
  functions throughout the crate

…mainder

- Add I32Vec::store_u16() method to extract lower 16 bits from each i32 lane
  and store as u16 values, implemented for all SIMD backends
- Remove scalar remainder handling in int_to_float functions since
  render pipeline buffers are always padded to SIMD width
- Use div_ceil pattern consistent with other SIMD functions in convert.rs
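The two ideas in this commit can be sketched in scalar form. Everything below is an assumption for illustration: `LANES`, `padded_len`, and `store_low_u16` are hypothetical names, and the real `I32Vec::store_u16()` operates on a SIMD vector rather than an array.

```rust
/// Assumed vector width in lanes (8 matches AVX2 with 32-bit lanes).
const LANES: usize = 8;

/// Round a row length up to a whole number of vectors, as a padded
/// buffer would be sized.
fn padded_len(n: usize) -> usize {
    n.div_ceil(LANES) * LANES
}

/// Scalar stand-in for storing the low 16 bits of each i32 lane as u16.
fn store_low_u16(lanes: &[i32; LANES], out: &mut [u16]) {
    for (o, &v) in out[..LANES].iter_mut().zip(lanes) {
        // `as u16` truncates to the low 16 bits.
        *o = v as u16;
    }
}
```

With a buffer padded to `padded_len(n)` elements, a loop over `n.div_ceil(LANES)` vectors can read and write full vectors without a separate remainder step.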

The method takes &mut [u16] (raw bits), so the name should match
load_f16_bits for consistency.

The SIMD conversion functions were using chunks_exact() which only
processes complete SIMD vectors, leaving remainder elements unprocessed.
This caused test failures when the row size wasn't divisible by the
SIMD width (e.g., 244 pixels with AVX2 width of 8).

Fix by adding scalar fallback loops to handle remainder elements for
both 32-bit float passthrough and 16-bit float conversion paths.

Also use const assert to verify the buffer size assumption at compile
time rather than runtime.
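The remainder issue described in this commit can be reproduced with a small sketch. The code below is not the PR's code: `LANES` and `convert_row` are hypothetical, and the inner loop stands in for one SIMD operation. With 244 elements and a width of 8, `chunks_exact` yields 30 full chunks (240 elements) and leaves 4 elements for the scalar loop.

```rust
/// Assumed vector width in lanes (8 matches AVX2 with 32-bit lanes).
const LANES: usize = 8;

/// Bitcast a row of i32 to f32, full chunks first, then the remainder.
fn convert_row(input: &[i32], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    let mut src = input.chunks_exact(LANES);
    let mut dst = output.chunks_exact_mut(LANES);
    for (s, d) in (&mut src).zip(&mut dst) {
        // Stand-in for one full-width SIMD load, bitcast, and store.
        for i in 0..LANES {
            d[i] = f32::from_bits(s[i] as u32);
        }
    }
    // Without this loop, the last `len % LANES` elements stay unprocessed.
    for (s, d) in src.remainder().iter().zip(dst.into_remainder()) {
        *d = f32::from_bits(*s as u32);
    }
}
```

Dropping the second loop reproduces the symptom from the commit message: the final `244 % 8 = 4` outputs keep whatever value the buffer held before the call.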
@veluca93 veluca93 enabled auto-merge (squash) January 21, 2026 17:45
@veluca93 veluca93 merged commit 3619cd9 into libjxl:main Jan 21, 2026
18 checks passed