Add SIMD optimization for int_to_float conversion by hjanuschka · Pull Request #580 · libjxl/jxl-rs

hjanuschka · 2025-12-22T09:02:40Z

Adds SIMD fast paths for the `int_to_float` function, which converts custom bit-depth floats stored as i32 back to f32.

32-bit float: straightforward bitcast via SIMD.
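In the 32-bit case the stored i32 already holds the IEEE 754 bit pattern, so the conversion is a reinterpretation of bits, not a numeric cast (`x as f32` would turn the integer 1065353216 into 1065353216.0 instead of 1.0). A minimal sketch that models one 8-lane vector as an array. On x86 the corresponding AVX intrinsic `_mm256_castsi256_ps` typically compiles to no instruction at all. The function name `bitcast_lanes` is hypothetical, not the PR's code.

```rust
/// Models one 8-lane vector: turning i32 lanes into f32 lanes is a pure
/// reinterpretation of the bits, with no arithmetic involved.
fn bitcast_lanes(v: [i32; 8]) -> [f32; 8] {
    v.map(|x| f32::from_bits(x as u32))
}
```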

16-bit float (f16): SIMD handles normal values, zeros, and inf/nan. Subnormals fall back to scalar since they need a variable-iteration normalization loop.
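A scalar model of the split described above, as a sketch rather than the PR's implementation. `fast_lane` is a per-lane select formula that is correct for normal values, signed zeros and inf/nan. `f16_bits_to_f32` is a full decoder whose subnormal branch runs the data-dependent normalization loop. `LANES = 8`, all function names, and the choice to fall back for a whole chunk are assumptions made for illustration.

```rust
/// Hypothetical vector width: 8 x 32-bit lanes, as with AVX2.
const LANES: usize = 8;

/// Full scalar f16 -> f32 decode, including subnormals.
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h & 0x8000) as u32) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let mut mant = (h & 0x3ff) as u32;
    let bits = if exp == 0x1f {
        sign | 0x7f80_0000 | (mant << 13) // inf / nan
    } else if exp != 0 {
        sign | ((exp + 112) << 23) | (mant << 13) // normal: rebias 15 -> 127
    } else if mant == 0 {
        sign // signed zero
    } else {
        // Subnormal: shift left until the implicit bit (bit 10) is set.
        // The iteration count depends on the value, which is what makes
        // this case awkward to vectorize.
        let mut e: u32 = 113; // biased f32 exponent of 2^-14
        while (mant & 0x400) == 0 {
            mant <<= 1;
            e -= 1;
        }
        sign | (e << 23) | ((mant & 0x3ff) << 13)
    };
    f32::from_bits(bits)
}

/// Per-lane formula for everything except subnormals. In a real SIMD
/// version the `if` chain would be vector compares plus blends.
fn fast_lane(h: u32) -> f32 {
    let exp = (h >> 10) & 0x1f;
    let f32_exp = if exp == 0 {
        0 // zero (subnormals are excluded by the caller)
    } else if exp == 0x1f {
        0xff // inf / nan
    } else {
        exp + 112
    };
    f32::from_bits(((h & 0x8000) << 16) | (f32_exp << 23) | ((h & 0x3ff) << 13))
}

/// Converts f16 values held in the low 16 bits of each i32 to f32.
/// A chunk containing any subnormal lane takes the scalar path.
fn f16_row_to_f32(input: &[i32], output: &mut [f32]) {
    assert_eq!(input.len(), output.len());
    for (src, dst) in input.chunks(LANES).zip(output.chunks_mut(LANES)) {
        let is_subnormal = |v: i32| {
            let h = (v as u32) & 0xffff;
            (h & 0x7c00) == 0 && (h & 0x3ff) != 0
        };
        let any_subnormal = src.iter().any(|&v| is_subnormal(v));
        for (d, &s) in dst.iter_mut().zip(src) {
            let h = (s as u32) & 0xffff;
            *d = if any_subnormal { f16_bits_to_f32(h as u16) } else { fast_lane(h) };
        }
    }
}
```

Because subnormals are rare in practice, checking once per chunk keeps the common path free of per-lane branching.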

Waiting for perf CI to see the impact.

github-actions · 2025-12-22T09:23:18Z

Benchmark @ `c1f5321`

MULTI-FILE BENCHMARK RESULTS (4 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.04). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: 47e5c029 (Base) vs 37f8a2a9 (PR)

File	Base (MP/s)	PR (MP/s)	Δ%
bike.jxl	23.506	23.416	-0.38% ±2.1%
green_queen_modular_e3.jxl	7.724	7.830	+1.36% ±1.4%
green_queen_vardct_e3.jxl	21.171	21.140	-0.14% ±1.3%
sunset_logo.jxl	2.240	2.234	-0.27% ±0.4%

Add SIMD fast paths for converting custom bit-depth floats to f32:
- 32-bit float passthrough: simple bitcast using SIMD
- 16-bit float (f16/half-precision): SIMD conversion with scalar fallback for subnormal values

The 16-bit float SIMD path handles normal, zero, and inf/nan cases directly, falling back to scalar for the rare subnormal case, which requires variable-iteration normalization.

Also adds a BitDepth::f16() test helper and comprehensive unit tests for the conversion functions.

Address veluca93 review: add load_f16_bits() and store_f16() methods to the F32SimdVec trait instead of implementing the conversion in convert.rs.
- AVX2+F16C: hardware _mm256_cvtph_ps/_mm256_cvtps_ph
- AVX-512: hardware _mm512_cvtph_ps/_mm512_cvtps_ph
- SSE4.2/NEON/Scalar: scalar fallback

Simplifies convert.rs by ~100 lines.
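On the backends that use a scalar fallback, the `store_f16` direction needs an f32 → binary16 encoder. A minimal sketch assuming IEEE 754 round-to-nearest, ties-to-even. The hardware `_mm256_cvtps_ph` instead takes its rounding mode as an immediate argument, and which mode the PR passes is not stated here. The name `f32_to_f16_bits` is hypothetical.

```rust
/// Encodes an f32 as IEEE 754 binary16 bits, rounding to nearest, ties to even.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let mant = bits & 0x7f_ffff;

    if exp == 0xff {
        // Inf stays inf; NaN keeps a nonzero (quiet) mantissa.
        let nan: u16 = if mant != 0 { 0x0200 } else { 0 };
        return sign | 0x7c00 | nan | (mant >> 13) as u16;
    }
    let e = exp - 127 + 15; // rebias the exponent from f32 to f16
    if e >= 0x1f {
        return sign | 0x7c00; // too large: overflow to infinity
    }
    if e <= 0 {
        // The result is an f16 subnormal (or zero).
        if e < -10 {
            return sign; // below half the smallest subnormal: rounds to zero
        }
        let m = mant | 0x80_0000; // restore the implicit leading 1
        let shift = (14 - e) as u32; // 14..=24
        let mut h = m >> shift;
        let rem = m & ((1u32 << shift) - 1);
        let half = 1u32 << (shift - 1);
        if rem > half || (rem == half && (h & 1) == 1) {
            h += 1; // may carry into the smallest normal, which is correct
        }
        return sign | h as u16;
    }
    // Normal f16: keep the top 10 mantissa bits and round on the other 13.
    let mut h = ((e as u32) << 10) | (mant >> 13);
    let rem = mant & 0x1fff;
    if rem > 0x1000 || (rem == 0x1000 && (h & 1) == 1) {
        h += 1; // a carry into the exponent correctly rounds up to inf
    }
    sign | h as u16
}
```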

- AVX: always require f16c for the AVX2 path (removes the runtime check)
- AVX512: restructure inner functions so they are not unsafe; only wrap memory operations in unsafe blocks with SAFETY comments
- NEON: use inline ASM for f16 conversion (fcvtl/fcvtn), since stdarch incorrectly requires the fp16 feature for basic conversion
- Add an f16 type module to jxl_simd and use it instead of u16/standalone functions throughout the crate

…mainder
- Add an I32Vec::store_u16() method that extracts the lower 16 bits of each i32 lane and stores them as u16 values, implemented for all SIMD backends
- Remove scalar remainder handling in the int_to_float functions, since render pipeline buffers are always padded to the SIMD width
- Use the div_ceil pattern, consistent with other SIMD functions in convert.rs
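A scalar model of the `store_u16` behavior described above (truncate each i32 lane to its low 16 bits), together with a `div_ceil` loop bound that lets a padded row be processed without a remainder loop. `LANES`, the free function `store_u16`, and `pack_row` are hypothetical stand-ins, not the crate's `I32Vec` API.

```rust
use std::convert::TryInto;

/// Hypothetical vector width: 8 x i32 lanes, as in a 256-bit register.
const LANES: usize = 8;

/// Writes the low 16 bits of every i32 lane to `dst` as u16.
fn store_u16(lanes: &[i32; LANES], dst: &mut [u16]) {
    for (d, &v) in dst[..LANES].iter_mut().zip(lanes.iter()) {
        *d = v as u16; // `as` truncates, keeping only the low 16 bits
    }
}

/// Packs the first `len` values of a row, one whole vector at a time.
/// Both slices are assumed to be padded to a multiple of LANES, so the
/// last partial vector can be processed in full and no scalar tail is needed.
fn pack_row(src: &[i32], dst: &mut [u16], len: usize) {
    let num_vectors = len.div_ceil(LANES);
    assert!(src.len() >= num_vectors * LANES && dst.len() >= num_vectors * LANES);
    for i in 0..num_vectors {
        let lanes: &[i32; LANES] = src[i * LANES..(i + 1) * LANES].try_into().unwrap();
        store_u16(lanes, &mut dst[i * LANES..]);
    }
}
```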

The method takes &mut [u16] (raw bits), so the name should match load_f16_bits for consistency.

The SIMD conversion functions were using chunks_exact(), which only processes complete SIMD vectors and leaves the remainder elements unprocessed. This caused test failures when the row size wasn't divisible by the SIMD width (e.g., 244 pixels with an AVX2 width of 8). Fix this by adding scalar fallback loops that handle the remainder elements in both the 32-bit float passthrough and 16-bit float conversion paths. Also use a const assert to verify the buffer size assumption at compile time rather than at runtime.
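The remainder problem can be sketched with the standard library's `chunks_exact`/`chunks_exact_mut`, whose `remainder()`/`into_remainder()` expose the elements that did not fill a whole vector. `LANES`, `convert_row`, and the closure-based split between vector and scalar work are hypothetical. The const assertion only illustrates the compile-time technique and is not the PR's actual check.

```rust
use std::convert::TryInto;

/// Hypothetical vector width (8 x 32-bit lanes, e.g. AVX2).
const LANES: usize = 8;

// Illustrative compile-time check (not the PR's actual assertion): a
// bitcast from i32 to f32 relies on both types having the same size.
const _: () = assert!(std::mem::size_of::<i32>() == std::mem::size_of::<f32>());

/// Runs `vector_op` on every full chunk of LANES elements and `scalar_op`
/// on the leftover tail, so a row of e.g. 244 elements (30 full vectors
/// of 8 plus 4 remaining) is converted completely.
fn convert_row(
    input: &[i32],
    output: &mut [f32],
    vector_op: impl Fn(&[i32; LANES], &mut [f32; LANES]),
    scalar_op: impl Fn(i32) -> f32,
) {
    assert_eq!(input.len(), output.len());
    let mut in_chunks = input.chunks_exact(LANES);
    let mut out_chunks = output.chunks_exact_mut(LANES);
    for (src, dst) in (&mut in_chunks).zip(&mut out_chunks) {
        let src: &[i32; LANES] = src.try_into().unwrap();
        let dst: &mut [f32; LANES] = dst.try_into().unwrap();
        vector_op(src, dst);
    }
    // chunks_exact never yields the tail; it has to be fetched explicitly.
    for (d, &s) in out_chunks.into_remainder().iter_mut().zip(in_chunks.remainder()) {
        *d = scalar_op(s);
    }
}
```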

veluca93 reviewed Dec 22, 2025

Comment thread jxl/src/render/stages/convert.rs

veluca93 reviewed Jan 17, 2026

hjanuschka force-pushed the simd-int-to-float branch from 7d1f674 to 4c8acef on January 20, 2026, 21:18