Sync with Microsoft ONNX Runtime - 16052026#1091

Open
ai-fw-intg wants to merge 12 commits into ovep-develop from sync_msft_16052026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

tianleiwu and others added 12 commits May 14, 2026 14:08
… folding output size limit (microsoft#28055)

### Description

Harden the constant folding optimizer and the `Expand` CPU kernel
against integer overflow attacks from crafted ONNX models.

**Problem:** The `Expand::Compute()` kernel performs cumulative
dimension multiplications (`input_count *= input_dim`, `output_count *=
output_dim`) using raw `int64_t` arithmetic. When triggered during
constant folding at `CreateSession()` time via a crafted model with
extreme shape values, signed integer overflow can produce corrupted
values used for buffer offset calculations and `memcpy` lengths,
creating a potential out-of-bounds write. The downstream SafeInt check
in the allocator catches overflow only when the total byte count wraps,
but carefully chosen dimensions can make the overflowed value appear
valid.

Additionally, the constant folding optimizer has no output size budget —
any deterministic node with constant inputs is eligible for constant
folding regardless of output size, enabling memory exhaustion attacks at
model load time.

### Key Changes

**1. SafeInt-protected arithmetic in `expand.cc`**

Wraps all dimension accumulation and offset/length calculations with
`SafeInt<int64_t>` or `SafeInt<size_t>` to catch overflow before it can
corrupt buffer arithmetic:

| Location | Before | After |
|---|---|---|
| Accumulator loop (L97-98) | `input_count *= input_dim` | `SafeInt<int64_t>(input_count) * input_dim` |
| Accumulator loop (L109) | `last_dim_size *= expand_dim_size[...]` | `SafeInt<int64_t>(last_dim_size) * ...` |
| copy_byte (L116) | `copy_len * sizeof(T)` | `SafeInt<size_t>(copy_len) * sizeof(T)` |
| input_offset (L122) | `i * copy_len` | `SafeInt<int64_t>(i) * copy_len` |
| output_offset (L126) | `output_offset += current_count * ...` | `SafeInt<int64_t>(output_offset) + SafeInt<int64_t>(current_count) * ...` |

**2. Constant folding output size limit in `constant_folding.cc`**

- **Pre-execution check**: `EstimateNodeOutputSizeInBytes()` uses shape
inference results with SafeInt-protected arithmetic to estimate total
output bytes. Nodes exceeding the limit are skipped.
- **Post-execution check**: After `kernel->Compute()`, actual output
`SizeInBytes()` is verified against the limit (catches cases where shape
inference couldn't determine output size).
- **Exception isolation**: `kernel->Compute()` is wrapped in `try/catch`
so that SafeInt overflow exceptions from individual nodes skip the node
rather than aborting the entire optimization pass.
- **Configurable limit**: New session option
`optimization.constant_folding_max_output_size_in_bytes` (default: 1 GB,
`"0"` to disable).

**3. Session option**

New key `kOrtSessionOptionsConstantFoldingMaxOutputSizeInBytes` in
`onnxruntime_session_options_config_keys.h`.

### Motivation and Context

This addresses a security vulnerability where a malicious ONNX model can
cause signed integer overflow in the Expand kernel during constant
folding at model load time (`CreateSession()`), potentially leading to
out-of-bounds memory writes. The constant folding size limit provides
defense-in-depth against memory exhaustion attacks from untrusted
models.

### Testing

- `ConstantFoldingOutputSizeLimit` — Verifies 4 MB Expand is blocked at
1 MB limit, allowed at 8 MB limit.
- `ConstantFoldingDefaultLimitBlocksLargeExpand` — Verifies 1 GB
ConstantOfShape is blocked at 512 MB limit.
- `ConstantFoldingSmallOutputAllowed` — Verifies small Expand (64 bytes)
is still folded normally.
- `ConstantFoldingExpandOverflowDimsSkipped` — Verifies Expand with
`[2^32, 2^32]` dimensions (int64 overflow) is gracefully skipped during
constant folding.
…icrosoft#28481)

This pull request improves the handling of pre-allocated output buffers
in ONNX Runtime, especially for models with dynamic output shapes. The
changes ensure that when a user provides an output buffer whose shape
does not match the computed output shape, the library returns a clear
error message. Additionally, the error handling and testing around this
scenario are strengthened.

The most important changes are:

**Pre-allocated Output Buffer Shape Validation:**
* Enhanced the logic in `IExecutionFrame::GetOrCreateNodeOutputMLValue`
to check if the shape of a pre-allocated output OrtValue matches the
computed output shape. If there is a mismatch (typically due to dynamic
shapes), the code now returns an explicit `INVALID_ARGUMENT` error with
a detailed message, guiding the user to fix their usage.

**API and Error Handling Improvements:**
* Updated `OpKernelContext::OutputMLValue` to throw an exception with
the detailed error message if output OrtValue allocation fails, ensuring
that shape mismatches are surfaced clearly to the caller.
* Added a catch block for `OnnxRuntimeException` in
`sequential_executor.cc` to convert exceptions into proper `Status`
objects, improving robustness and error propagation.

**Testing and Regression Coverage:**
* Added a comprehensive regression test
(`ExecutionFrameTestInit.FetchWithMismatchedDynamicShapes`) to verify
correct handling of pre-allocated outputs with mismatched shapes,
including both error and success cases. This test covers scenarios where
output buffers are reused across runs with different dynamic shapes,
ensuring the new logic works as intended.
…icrosoft#26834)

* Modification to the CPU EP to specify channels_last when the data
format is NHWC
* Added a FusedNhwcConv kernel
* Implementation of the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can
be removed if needed)
* Added unit tests

### Description
Currently ONNX Runtime uses NCHW as the default data layout. For
optimisations and kernels that operate better in NHWC layout, or where
the data layout is NHWC in the first place, Transpose nodes are added
around the layers. This patch seeks to eliminate them for convolutions
where they would cause a performance decrease.


### Motivation and Context
KleidiAI-specific implementation of this feature. Only supports
convolutions; DepthWise to follow. Currently a little strict with the
filters as a result.

---------

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Apply microsoft#27780 for NVIDIA.

I see a 10-15% performance improvement for prefill on an RTX 5060 Ti.
…ntion; harden integer arithmetic in QAttention and AttentionBase (microsoft#28480)

### Description

The CPU `QAttention` kernel did not validate the shape of per-column
`weight_scale` and `weight_zero_point` inputs against the expected `3 *
hidden_size`. A model could supply a per-column
tensor smaller than expected, causing the GEMM dequantization loop to
read past the end of the buffer (offsets up to `~3 * hidden_size -
head_size`).

This PR adds the missing shape validation and, while in the area,
hardens integer arithmetic across `QAttention` and `AttentionBase`
against malformed shape attributes / dimensions.

### Changes

**`onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc`**
- Validate per-column `weight_scale` and `weight_zero_point` are 1-D
with size `3 * hidden_size`; reject otherwise.
- Use `narrow<int>` / `narrow<size_t>` when converting `int64_t` shape
dims, so out-of-range values throw rather than silently truncating.
- Use `SafeInt` for multiplications whose operands are not provably
bounded by upstream validation (`loop_len`, `input_offset`,
`qkv_offset`, the gemm allocation, and
`packed_weights_data_size` in `PrePack`).
- Refactor the gemm allocation and Q/K/V pointer arithmetic to share a
single `SafeInt`-validated `batch_size * sequence_length * hidden_size`
value.
- Drop a few redundant `static_cast<int>`s in the per-iteration index
math.
- Remove the `hidden_size_x3 % 3 == 0` and `hidden_size % num_heads_ ==
0` checks here; they are now enforced uniformly in
`AttentionBase::CheckInputs` with clearer error messages.

**`onnxruntime/contrib_ops/cpu/bert/attention_base.h`**
- Replace `static_cast<int>` with `narrow<int>` for `num_heads_`,
`rotary_embedding_`, the `parameters` struct outputs, and `GetPresent`'s
`past_sequence_length`. Without this, any
`int64_t` value outside the `int` range (e.g., a `num_heads` attribute
of `2^31`, or a `past` sequence length of `2^31`) silently truncates to
an unrelated `int` value that is then
propagated to downstream kernels and used in arithmetic, enabling
division by zero, sign flips, or out-of-bounds indexing.
- Drop the `static_cast<int>` from the `past_dims[2]` / `past_dims[4]`
shape comparisons so the equality check uses the full `int64_t` value;
previously a `past` tensor whose dim's low 32
bits happened to match `num_heads_` (or `k_hidden_size / num_heads_`)
would pass validation despite having the wrong physical shape.
- In `CheckInputs`, when `require_same_hidden_size_` is true, reject
`bias_dims[0]` not a multiple of 3 with a clear error (Q, K, V are
packed and share a hidden size).
- In `CheckInputs`, when `qkv_hidden_sizes` is not set, also reject
`q_hidden_size % num_heads_ != 0` (mirrors the existing check on the
`qkv_hidden_sizes` path).

**`onnxruntime/test/contrib_ops/quantize_attention_op_test.cc`**
- 4 regression tests for the per-column shape validation:
  - `InvalidWeightScalePerColumnShape`
  - `InvalidWeightScalePerColumnRank`
  - `InvalidWeightZeroPointPerColumnShape`
  - `InvalidWeightZeroPointPerColumnRank`
- 3 regression tests for the divisibility / narrowing checks (sharing a
`RunQAttentionExpectFailure` helper):
  - `InvalidBiasDimNotMultipleOfThree`
  - `InvalidHiddenSizeNotDivisibleByNumHeads`
- `InvalidNumHeadsOverflowsInt` (`num_heads = INT_MAX + 1` triggers
`gsl::narrowing_error`)

### Testing

All `QAttention*` / `AttentionTest*` / `MultiHeadAttention*` tests
(97/97) pass locally on CPU Release build.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…pset 22 (microsoft#27737)

### Description

Extends LSTM CUDA kernel registration from opset 14 to opset 22.

- **`lstm.cc`**: Cap existing opset 14 kernel to versioned 14–21, add
new non-versioned kernel at opset 22
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries accordingly (versioned 14–21 +
non-versioned 22) for all three types (`float`, `double`, `MLFloat16`)
- **`deep_cpu_lstm_op_test.cc`**: Add
`ONNXRuntime_TestLSTMForward_OpSet22_CUDA` test targeting the new
registration
- **`docs/OperatorKernels.md`**: Update CUDA LSTM entry from `14+` to
`[14, 21]` and `22+`

No spec-level behavior changes between opsets 14 and 22 for LSTM — this
is purely a registration gap fill so the CUDA EP correctly claims nodes
exported at newer opset versions.

### Motivation and Context

LSTM CUDA kernel was registered only up to opset 14 while the ONNX spec
defines LSTM through opset 22. Models exported at opset ≥15 would fall
back to CPU. Follows the same pattern established by other opset gap PRs
(ConvTranspose, MaxPool, Pad, etc.) referenced in microsoft#27729.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
While building onnxruntime from source for AIX, I ran into pre-defined
macro errors on POWER10 and POWER11 machines. This patch resolves the
issue.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#28509)

### Description

Adds a Windows-only fallback to `WeaklyCanonicalPath` (in
`onnxruntime/core/framework/tensorprotoutils.cc`) for use inside Windows
AppContainers, where `std::filesystem::weakly_canonical` always fails
with `ERROR_ACCESS_DENIED` because the underlying
`GetFinalPathNameByHandleW(VOLUME_NAME_DOS)` call goes through the
Volume Mount Manager, which AppContainer tokens cannot query regardless
of file ACL grants.

On `ERROR_ACCESS_DENIED`, fall back to a manual canonicalization that
uses `GetFinalPathNameByHandleW(FILE_NAME_NORMALIZED | VOLUME_NAME_NT)`
and prefixes the result with `\\?\GLOBALROOT` so it remains a valid
Win32 path. All other error paths, non-Windows builds, and
non-AppContainer Windows runs are unchanged.

`VOLUME_NAME_NT` (not `VOLUME_NAME_NONE`) is required: it preserves
volume identity, so the cross-volume escape rejection in
`ValidateExternalDataPath` introduced by microsoft#26776 continues to hold.

8 new unit tests cover the fallback helper directly (existing dir/file,
non-existent leaf, multi-component miss, all-non-existent → false,
equivalence with `weakly_canonical`, symlink resolution, `..` collapse).
The AppContainer trigger itself cannot be reproduced in a unit test
environment.

### Motivation and Context

Fixes microsoft#28508.

Regression introduced in v1.24.1 by microsoft#26776 (`ValidateExternalDataPath`);
current `WeaklyCanonicalPath` wrapper from microsoft#27539 in v1.25.0. Loading
any ONNX model with external data fails inside a Windows AppContainer
with:

```
Failed to get the weakly canonical path: "<path>" - Access is denied.
```

Affected callers have no in-process workaround. Downstream report:
microsoft/Foundry-Local#709.

CC @yuslepukhin (microsoft#26776), @adrianlizarraga (microsoft#27539), @tianleiwu
(microsoft#27374).

---------

Co-authored-by: Brenden Sosnader <brsosnad@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n graph transformer (microsoft#28465)

### Description

Validate that `frame_step` and `dft_size` (derived from `frame_length`)
are positive before they are used in buffer sizing and loop arithmetic.
Use SafeInt for the weight buffer allocation to guard against overflow.
Also fix an unconditional dereference of `window_recipient` which is
nullptr when the STFT node has no window input.

### Motivation and Context

A model with non-positive initializer values for frame_length or
frame_step causes signed-to-unsigned wrapping in size computations,
leading to out-of-bounds writes during graph optimization. The nullptr
dereference is a crash on any windowless STFT node.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
- The ETW log level takes precedence over the QNN log level
- This is now changed to use the ETW log level only when it provides
higher fidelity

### Motivation and Context
- The ETW log level takes precedence over the QNN log level
- If the ETW log level = Basic and the QNN log level = Detailed, the
QNN EP still picks basic logging over detailed
…soft#28507)

### Description

The outer `#ifndef __clang__` in `mlasi_sve.h` (line 20 to line 679) was
intended to wrap the GCC-specific `#pragma GCC` directives, but it also
ends up hiding every SVE kernel declaration and the typedefs from clang.
The `#ifdef __clang__` block that defines `MLAS_SVE_TARGET` for clang's
per-function `__attribute__((target("...")))` syntax is unreachable for
the same reason.

This moves the closing `#endif` up to right after the GCC pragmas so
only the pragmas are GCC-only, and the rest of the header (typedefs,
kernel declarations, MLAS_SVE_TARGET) is visible to both compilers.

### Motivation and Context

Without this, building MLAS with clang for aarch64 fails at
platform.cpp: the `MLAS_USE_SVE` runtime-dispatch block references
`MlasSveErfKernel`, `MlasSveLogisticKernel`, and friends, all of which
are undeclared from clang's perspective. Confirmed working with clang
20.1.8.
