Sync with Microsoft ONNX Runtime - 16052026#1091

Open
ai-fw-intg wants to merge 12 commits into ovep-develop from sync_msft_16052026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

tianleiwu and others added 12 commits May 14, 2026 14:08
… folding output size limit (microsoft#28055)

### Description

Harden the constant folding optimizer and the `Expand` CPU kernel
against integer overflow attacks from crafted ONNX models.

**Problem:** The `Expand::Compute()` kernel performs cumulative
dimension multiplications (`input_count *= input_dim`, `output_count *=
output_dim`) using raw `int64_t` arithmetic. When triggered during
constant folding at `CreateSession()` time via a crafted model with
extreme shape values, signed integer overflow can produce corrupted
values used for buffer offset calculations and `memcpy` lengths,
creating a potential out-of-bounds write. The downstream SafeInt check
in the allocator catches overflow only when the total byte count wraps,
but carefully chosen dimensions can make the overflowed value appear
valid.

Additionally, the constant folding optimizer has no output size budget —
any deterministic node with constant inputs is eligible for constant
folding regardless of output size, enabling memory exhaustion attacks at
model load time.

### Key Changes

**1. SafeInt-protected arithmetic in `expand.cc`**

Wraps all dimension accumulation and offset/length calculations with
`SafeInt<int64_t>` or `SafeInt<size_t>` to catch overflow before it can
corrupt buffer arithmetic:

| Location | Before | After |
|---|---|---|
| Accumulator loop (L97-98) | `input_count *= input_dim` | `SafeInt<int64_t>(input_count) * input_dim` |
| Accumulator loop (L109) | `last_dim_size *= expand_dim_size[...]` | `SafeInt<int64_t>(last_dim_size) * ...` |
| copy_byte (L116) | `copy_len * sizeof(T)` | `SafeInt<size_t>(copy_len) * sizeof(T)` |
| input_offset (L122) | `i * copy_len` | `SafeInt<int64_t>(i) * copy_len` |
| output_offset (L126) | `output_offset += current_count * ...` | `SafeInt<int64_t>(output_offset) + SafeInt<int64_t>(current_count) * ...` |

**2. Constant folding output size limit in `constant_folding.cc`**

- **Pre-execution check**: `EstimateNodeOutputSizeInBytes()` uses shape
inference results with SafeInt-protected arithmetic to estimate total
output bytes. Nodes exceeding the limit are skipped.
- **Post-execution check**: After `kernel->Compute()`, actual output
`SizeInBytes()` is verified against the limit (catches cases where shape
inference couldn't determine output size).
- **Exception isolation**: `kernel->Compute()` is wrapped in `try/catch`
so that SafeInt overflow exceptions from individual nodes skip the node
rather than aborting the entire optimization pass.
- **Configurable limit**: New session option
`optimization.constant_folding_max_output_size_in_bytes` (default: 1 GB,
`"0"` to disable).

**3. Session option**

New key `kOrtSessionOptionsConstantFoldingMaxOutputSizeInBytes` in
`onnxruntime_session_options_config_keys.h`.

### Motivation and Context

This addresses a security vulnerability where a malicious ONNX model can
cause signed integer overflow in the Expand kernel during constant
folding at model load time (`CreateSession()`), potentially leading to
out-of-bounds memory writes. The constant folding size limit provides
defense-in-depth against memory exhaustion attacks from untrusted
models.

### Testing

- `ConstantFoldingOutputSizeLimit` — Verifies 4 MB Expand is blocked at
1 MB limit, allowed at 8 MB limit.
- `ConstantFoldingDefaultLimitBlocksLargeExpand` — Verifies 1 GB
ConstantOfShape is blocked at 512 MB limit.
- `ConstantFoldingSmallOutputAllowed` — Verifies small Expand (64 bytes)
is still folded normally.
- `ConstantFoldingExpandOverflowDimsSkipped` — Verifies Expand with
`[2^32, 2^32]` dimensions (int64 overflow) is gracefully skipped during
constant folding.
…icrosoft#28481)

This pull request improves the handling of pre-allocated output buffers
in ONNX Runtime, especially for models with dynamic output shapes. The
changes ensure that when a user provides an output buffer whose shape
does not match the computed output shape, the library returns a clear
error message. Additionally, the error handling and testing around this
scenario are strengthened.

The most important changes are:

**Pre-allocated Output Buffer Shape Validation:**
* Enhanced the logic in `IExecutionFrame::GetOrCreateNodeOutputMLValue`
to check if the shape of a pre-allocated output OrtValue matches the
computed output shape. If there is a mismatch (typically due to dynamic
shapes), the code now returns an explicit `INVALID_ARGUMENT` error with
a detailed message, guiding the user to fix their usage.

**API and Error Handling Improvements:**
* Updated `OpKernelContext::OutputMLValue` to throw an exception with
the detailed error message if output OrtValue allocation fails, ensuring
that shape mismatches are surfaced clearly to the caller.
* Added a catch block for `OnnxRuntimeException` in
`sequential_executor.cc` to convert exceptions into proper `Status`
objects, improving robustness and error propagation.

**Testing and Regression Coverage:**
* Added a comprehensive regression test
(`ExecutionFrameTestInit.FetchWithMismatchedDynamicShapes`) to verify
correct handling of pre-allocated outputs with mismatched shapes,
including both error and success cases. This test covers scenarios where
output buffers are reused across runs with different dynamic shapes,
ensuring the new logic works as intended.
…icrosoft#26834)

* Modification to the CPU EP to specify channels_last when the data
format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can
be removed if needed)
* Added unit tests

### Description
Currently ONNX Runtime uses NCHW as the default data layout. For
optimisations and kernels that operate better in NHWC layout, or where
the data layout is NHWC in the first place, Transpose nodes are added
around the layers. This patch seeks to eliminate them for convolutions
where they would cause a performance decrease.


### Motivation and Context
KleidiAI-specific implementation of this feature. Only supports
convolutions; DepthWise to follow. Currently a little strict with the
filters as a result.

---------

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Apply microsoft#27780 for NVIDIA.

I see a 10-15% performance improvement for prefill on an RTX 5060 Ti.
…ntion; harden integer arithmetic in QAttention and AttentionBase (microsoft#28480)

### Description

The CPU `QAttention` kernel did not validate the shape of per-column
`weight_scale` and `weight_zero_point` inputs against the expected `3 *
hidden_size`. A model could supply a per-column
tensor smaller than expected, causing the GEMM dequantization loop to
read past the end of the buffer (offsets up to `~3 * hidden_size -
head_size`).

This PR adds the missing shape validation and, while in the area,
hardens integer arithmetic across `QAttention` and `AttentionBase`
against malformed shape attributes / dimensions.

### Changes

**`onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc`**
- Validate per-column `weight_scale` and `weight_zero_point` are 1-D
with size `3 * hidden_size`; reject otherwise.
- Use `narrow<int>` / `narrow<size_t>` when converting `int64_t` shape
dims, so out-of-range values throw rather than silently truncating.
- Use `SafeInt` for multiplications whose operands are not provably
bounded by upstream validation (`loop_len`, `input_offset`,
`qkv_offset`, the gemm allocation, and
`packed_weights_data_size` in `PrePack`).
- Refactor the gemm allocation and Q/K/V pointer arithmetic to share a
single `SafeInt`-validated `batch_size * sequence_length * hidden_size`
value.
- Drop a few redundant `static_cast<int>`s in the per-iteration index
math.
- Remove the `hidden_size_x3 % 3 == 0` and `hidden_size % num_heads_ ==
0` checks here; they are now enforced uniformly in
`AttentionBase::CheckInputs` with clearer error messages.

**`onnxruntime/contrib_ops/cpu/bert/attention_base.h`**
- Replace `static_cast<int>` with `narrow<int>` for `num_heads_`,
`rotary_embedding_`, the `parameters` struct outputs, and `GetPresent`'s
`past_sequence_length`. Without this, any
`int64_t` value outside the `int` range (e.g., a `num_heads` attribute
of `2^31`, or a `past` sequence length of `2^31`) silently truncates to
an unrelated `int` value that is then
propagated to downstream kernels and used in arithmetic, enabling
division by zero, sign flips, or out-of-bounds indexing.
- Drop the `static_cast<int>` from the `past_dims[2]` / `past_dims[4]`
shape comparisons so the equality check uses the full `int64_t` value;
previously a `past` tensor whose dim's low 32
bits happened to match `num_heads_` (or `k_hidden_size / num_heads_`)
would pass validation despite having the wrong physical shape.
- In `CheckInputs`, when `require_same_hidden_size_` is true, reject
`bias_dims[0]` not a multiple of 3 with a clear error (Q, K, V are
packed and share a hidden size).
- In `CheckInputs`, when `qkv_hidden_sizes` is not set, also reject
`q_hidden_size % num_heads_ != 0` (mirrors the existing check on the
`qkv_hidden_sizes` path).

**`onnxruntime/test/contrib_ops/quantize_attention_op_test.cc`**
- 4 regression tests for the per-column shape validation:
  - `InvalidWeightScalePerColumnShape`
  - `InvalidWeightScalePerColumnRank`
  - `InvalidWeightZeroPointPerColumnShape`
  - `InvalidWeightZeroPointPerColumnRank`
- 3 regression tests for the divisibility / narrowing checks (sharing a
`RunQAttentionExpectFailure` helper):
  - `InvalidBiasDimNotMultipleOfThree`
  - `InvalidHiddenSizeNotDivisibleByNumHeads`
- `InvalidNumHeadsOverflowsInt` (`num_heads = INT_MAX + 1` triggers
`gsl::narrowing_error`)

### Testing

All `QAttention*` / `AttentionTest*` / `MultiHeadAttention*` tests
(97/97) pass locally on CPU Release build.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…pset 22 (microsoft#27737)

### Description

Extends LSTM CUDA kernel registration from opset 14 to opset 22.

- **`lstm.cc`**: Cap existing opset 14 kernel to versioned 14–21, add
new non-versioned kernel at opset 22
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries accordingly (versioned 14–21 +
non-versioned 22) for all three types (`float`, `double`, `MLFloat16`)
- **`deep_cpu_lstm_op_test.cc`**: Add
`ONNXRuntime_TestLSTMForward_OpSet22_CUDA` test targeting the new
registration
- **`docs/OperatorKernels.md`**: Update CUDA LSTM entry from `14+` to
`[14, 21]` and `22+`

No spec-level behavior changes between opsets 14 and 22 for LSTM — this
is purely a registration gap fill so the CUDA EP correctly claims nodes
exported at newer opset versions.

### Motivation and Context

LSTM CUDA kernel was registered only up to opset 14 while the ONNX spec
defines LSTM through opset 22. Models exported at opset ≥15 would fall
back to CPU. Follows the same pattern established by other opset gap PRs
(ConvTranspose, MaxPool, Pad, etc.) referenced in microsoft#27729.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
While building onnxruntime from source for AIX, I ran into pre-defined
macro errors on POWER10 and POWER11 machines. This patch resolves the
issue.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#28509)

### Description

Adds a Windows-only fallback to `WeaklyCanonicalPath` (in
`onnxruntime/core/framework/tensorprotoutils.cc`) for use inside Windows
AppContainers, where `std::filesystem::weakly_canonical` always fails
with `ERROR_ACCESS_DENIED` because the underlying
`GetFinalPathNameByHandleW(VOLUME_NAME_DOS)` call goes through the
Volume Mount Manager, which AppContainer tokens cannot query regardless
of file ACL grants.

On `ERROR_ACCESS_DENIED`, fall back to a manual canonicalization that
uses `GetFinalPathNameByHandleW(FILE_NAME_NORMALIZED | VOLUME_NAME_NT)`
and prefixes the result with `\\?\GLOBALROOT` so it remains a valid
Win32 path. All other error paths, non-Windows builds, and
non-AppContainer Windows runs are unchanged.

`VOLUME_NAME_NT` (not `VOLUME_NAME_NONE`) is required: it preserves
volume identity, so the cross-volume escape rejection in
`ValidateExternalDataPath` introduced by microsoft#26776 continues to hold.

8 new unit tests cover the fallback helper directly (existing dir/file,
non-existent leaf, multi-component miss, all-non-existent → false,
equivalence with `weakly_canonical`, symlink resolution, `..` collapse).
The AppContainer trigger itself cannot be reproduced in a unit test
environment.

### Motivation and Context

Fixes microsoft#28508.

Regression introduced in v1.24.1 by microsoft#26776 (`ValidateExternalDataPath`);
current `WeaklyCanonicalPath` wrapper from microsoft#27539 in v1.25.0. Loading
any ONNX model with external data fails inside a Windows AppContainer
with:

```
Failed to get the weakly canonical path: "<path>" - Access is denied.
```

Affected callers have no in-process workaround. Downstream report:
microsoft/Foundry-Local#709.

CC @yuslepukhin (microsoft#26776), @adrianlizarraga (microsoft#27539), @tianleiwu
(microsoft#27374).

---------

Co-authored-by: Brenden Sosnader <brsosnad@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n graph transformer (microsoft#28465)

### Description

Validate that `frame_step` and `dft_size` (derived from `frame_length`)
are positive before they are used in buffer sizing and loop arithmetic.
Use SafeInt for the weight buffer allocation to guard against overflow.
Also fix an unconditional dereference of `window_recipient` which is
nullptr when the STFT node has no window input.

### Motivation and Context

A model with non-positive initializer values for frame_length or
frame_step causes signed-to-unsigned wrapping in size computations,
leading to out-of-bounds writes during graph optimization. The nullptr
dereference is a crash on any windowless STFT node.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
- The ETW log level takes precedence over the QNN log level
- This is now changed to use the ETW log level only when it provides
higher fidelity

### Motivation and Context
- The ETW log level takes precedence over the QNN log level
- If the ETW log level = Basic and the QNN log level = Detailed, the
QNN EP still picks basic logging over detailed
…soft#28507)

### Description

The outer `#ifndef __clang__` in `mlasi_sve.h` (line 20 to line 679) was
intended to wrap the GCC-specific `#pragma GCC` directives, but it also
ends up hiding every SVE kernel declaration and the typedefs from clang.
The `#ifdef __clang__` block that defines `MLAS_SVE_TARGET` for clang's
per-function `__attribute__((target("...")))` syntax is unreachable for
the same reason.

This moves the closing `#endif` up to right after the GCC pragmas so
only the pragmas are GCC-only, and the rest of the header (typedefs,
kernel declarations, MLAS_SVE_TARGET) is visible to both compilers.

### Motivation and Context

Without this, building MLAS with clang for aarch64 fails at
platform.cpp: the `MLAS_USE_SVE` runtime-dispatch block references
`MlasSveErfKernel`, `MlasSveLogisticKernel`, and friends, all of which
are undeclared from clang's perspective. Confirmed working with clang
20.1.8.
