Sync with Microsoft ONNX Runtime - 07052026 by ai-fw-intg · Pull Request #1075 · intel/onnxruntime

ai-fw-intg · 2026-05-06T20:32:32Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

### Description Bump version to 1.27.0.

### Description  Bump plugin-ep-webgpu/VERSION_NUMBER to 0.2.0. ### Motivation and Context  Version bump after creating the release branch.

…t#28294) ### Description Avoid having to depend on setup job/task for build date/time. Use pipeline var & runtime expression instead. ### Motivation and Context Faster, no need to wait for an ad-hoc job to set pipeline variables. Easier to read/reason about, reduces cross-stage deps.

### Description  Add release info doc for WebGPU plugin EP. ### Motivation and Context  Document release info.

…r dispatch with causal alignment fix (microsoft#27992) ## Motivation Eliminate the legacy MHA Unfused path (`QkvToContext` in `attention_impl.cu`) from the ONNX standard Attention op, simplifying the CUDA dispatch to a clean 3-tier cascade. ## Design ``` Flash Attention → Memory-Efficient Attention (MEA) → Unified Unfused Attention ``` - **Flash**: Handles fp16/bf16 with head_size ≤ 256, no explicit attn_mask. Fastest path. - **MEA (CUTLASS)**: Handles cases Flash cannot (explicit masks, softcap+mask combos). Requires head_size % 8 == 0. - **Unified Unfused**: Fallback for everything else — fp32, small heads, H≠H_v, output_qk. Handles both MHA and GQA via FP32 QK accumulation. The legacy `RunUnfusedAttention` wrapper (which called contrib ops `QkvToContext`) is deleted. The contrib MHA op is unaffected. ## Key Behavior Changes - **Unified unfused kernel** replaces separate GQA-only and MHA-only unfused paths - **Causal alignment**: lower-right when past_key is present, upper-left otherwise (per ONNX spec) - **H≠H_v + past KV** now supported (separate K/V concat calls) - **output_qk (mode 0)** supported in unified kernel via `ScaledCopyQkKernel` - **29 ONNX backend test filters removed** — tests now pass natively ## Testing All existing tests pass (40 C++ attention tests, 215 Python parametrized cases) plus new coverage for causal alignment on CPU EP and softcap ordering verification. Closes microsoft#27880. Related: microsoft#27516, microsoft#28198. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
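The causal alignment rule described above (lower-right when `past_key` is present, upper-left otherwise) can be illustrated with a small mask builder. This is a minimal sketch in plain Python, not the CUDA kernel; the function names `causal_mask` and `use_lower_right` are hypothetical and exist only for this illustration.

```python
def causal_mask(q_len: int, kv_len: int, lower_right: bool) -> list[list[bool]]:
    """Return mask[i][j] == True where query row i may attend to key column j."""
    # Lower-right alignment shifts the diagonal so that the last query row
    # sees every key; upper-left keeps the diagonal anchored at (0, 0).
    offset = kv_len - q_len if lower_right else 0
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


def use_lower_right(has_past_key: bool) -> bool:
    # Rule as stated in the commit message: lower-right when past_key is
    # present, upper-left otherwise.
    return has_past_key


if __name__ == "__main__":
    # Two new query tokens attending over four keys (two of them cached).
    for row in causal_mask(2, 4, use_lower_right(True)):
        print(["x" if allowed else "." for allowed in row])
```

With two query rows and four keys, lower-right alignment lets the first row see three keys and the second row see all four, while upper-left alignment would let them see one and two keys respectively.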

### Description Shifting by >= the bit width of an unsigned type is undefined behavior in C++. On x86-64, the hardware masks 64-bit shift amounts to 6 bits, so `x >> 64` silently becomes `x >> 0`, returning the original value instead of 0. Added `SafeShiftLeft`/`SafeShiftRight` helpers that return 0 when `shift >= sizeof(T) * 8`, applied across all three broadcast code paths (scalar-X, scalar-Y, element-wise). ```cpp template <typename T> inline T SafeShiftRight(T value, T shift) { return shift >= sizeof(T) * 8 ? T{0} : value >> shift; } ``` Added tests covering: - Shift by exact bit width (32, 64) for `uint32_t` and `uint64_t` - Shift by more than bit width (65, 128) - All three broadcast paths (scalar-X, scalar-Y, element-wise) - New tests are excluded for DirectML EP, which has the same hardware-level shift masking behavior ### Motivation and Context `BitShift` with `direction="RIGHT"` on `uint64` inputs with shift amount 64 returns the original values instead of zeros. Reproduces with `CPUExecutionProvider` and `ORT_DISABLE_ALL` (constant folding masks the bug under `ORT_ENABLE_ALL`). --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
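To make the failure mode concrete, here is a minimal Python sketch that models fixed-width unsigned shifts. `hardware_shift_right` imitates the x86-style masking of the shift count described above, and the two `safe_*` functions mirror the behaviour of the `SafeShiftLeft`/`SafeShiftRight` helpers. The Python names and the explicit `bits` parameter are assumptions made for this illustration (Python integers have no fixed width), and `bits` is assumed to be a power of two.

```python
def hardware_shift_right(value: int, shift: int, bits: int) -> int:
    """Model a shift where only the low log2(bits) bits of the count are used."""
    mask = (1 << bits) - 1
    return (value & mask) >> (shift & (bits - 1))


def safe_shift_right(value: int, shift: int, bits: int) -> int:
    """Return 0 when the shift amount is at least the bit width."""
    mask = (1 << bits) - 1
    return 0 if shift >= bits else (value & mask) >> shift


def safe_shift_left(value: int, shift: int, bits: int) -> int:
    """Return 0 when the shift amount is at least the bit width."""
    mask = (1 << bits) - 1
    return 0 if shift >= bits else (value << shift) & mask


if __name__ == "__main__":
    # Shifting a 64-bit value by 64: the masked shift returns the input,
    # the guarded shift returns 0.
    print(hardware_shift_right(0xFF, 64, 64), safe_shift_right(0xFF, 64, 64))
```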

…icrosoft#28300) test run: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=1201168&view=results --------- Co-authored-by: Prathik Rao <prathikrao@microsoft.com>

…ybind11 3.0 (microsoft#28251) ## Summary - Probes `-Wmaybe-uninitialized` via `check_cxx_compiler_flag` and applies `-Wno-maybe-uninitialized` only to the `onnxruntime_pybind11_state` target when the compiler accepts it. - Fixes the GCC build break introduced when ORT is compiled against pybind11 3.0, currently blocking Fedora's pybind11 3.0 package update. ## Motivation pybind11 3.0 rewrote `def_readwrite` to use a `property_cpp_function_classic` template that generates a lambda capturing a member pointer by value. GCC's `-Wmaybe-uninitialized` flow analysis flags that lambda inside pybind11's own headers, so any consumer compiling ORT's Python bindings against system pybind11 3.0 fails the build. This is a header-side false positive — there is no real uninitialized read in ORT code or in pybind11. Fixes microsoft#25681 ## Changes - `cmake/CMakeLists.txt`: add `check_cxx_compiler_flag(-Wno-maybe-uninitialized HAS_NO_MAYBE_UNINITIALIZED)` next to the existing `HAS_CAST_FUNCTION_TYPE` probe. - `cmake/onnxruntime_python.cmake`: when `HAS_NO_MAYBE_UNINITIALIZED` is set, append `-Wno-maybe-uninitialized` to the `onnxruntime_pybind11_state` target's private compile options. Mirrors the established `HAS_CAST_FUNCTION_TYPE` pattern in the same file. The suppression is target-scoped (only the Python binding shared library), compiler-scoped (only when the flag is accepted — effectively GCC), and warning-scoped (only the flow-sensitive `-Wmaybe-uninitialized`, not the strict `-Wuninitialized`). ## Test Plan - [x] `lintrunner -a` clean on the diff. - [ ] CI: confirm Linux GCC builds remain green. - [ ] Downstream verification: Fedora packagers can rebuild ORT against system pybind11 3.0 without `-Wmaybe-uninitialized` errors (per issue reporter).

… dims (microsoft#28349) ### Description `ReshapeFusion::FuseContiguousReshapes` collapses a chain of `Reshape` / `Squeeze` / `Unsqueeze` nodes into a single `Reshape` whose shape data is taken verbatim from the fully-inferred output shape of the last node in the chain. The new node is created without an `allowzero` attribute, so it defaults to `allowzero = 0`. When that inferred shape contains a literal `0` dim (legitimate when the original chain used `allowzero=1`, or when intermediate tensors had zero-sized dimensions), the fused `Reshape` misinterprets the `0` as "copy the corresponding dim from the input tensor" — but the input here is the original input of the *first* reshape in the chain, with unrelated dims. The result is a silently wrong output shape (and a benign-looking `MergeShapeInfo` warning at graph load). ### Repro (before the fix) ```python import numpy as np, onnx, onnxruntime as ort, onnx.reference from onnx import helper, TensorProto X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [0, 6, 2]) Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [None, None, None]) s1 = helper.make_tensor("s1", TensorProto.INT64, [3], [3, 2, -1]) s2 = helper.make_tensor("s2", TensorProto.INT64, [3], [0, 0, 3]) n1 = helper.make_node("Reshape", ["X", "s1"], ["mid"]) n2 = helper.make_node("Reshape", ["mid", "s2"], ["Y"], allowzero=1) m = helper.make_model(helper.make_graph([n1, n2], "g", [X], [Y], initializer=[s1, s2]), opset_imports=[helper.make_opsetid("", 18)]) inp = np.random.default_rng(7).random((0, 6, 2), dtype=np.float32) print("REF:", onnx.reference.ReferenceEvaluator(m).run(None, {"X": inp})[0].shape) print("ORT:", ort.InferenceSession(m.SerializeToString(), providers=["CPUExecutionProvider"]).run(None, {"X": inp})[0].shape) ``` Output on `main` (`40c9f85f69`): ``` REF: (0, 0, 3) [W ... graph.cc:122 MergeShapeInfo] Error merging shape info for output. 'Y' source:{0,6,3} target:{0,0,3}. Falling back to lenient merge. 
ORT: (0, 6, 3) ❌ ``` ### Fix Setting `allowzero=1` on the fused node would also work but requires opset >= 14, which this transformer cannot assume (it accepts `Reshape` opset 5+). Bail out of fusion conservatively when `shape_value` contains any literal `0` dim. ### Test Adds `ReshapeFusionContiguousReshapesWithZeroDim` that builds the bug repro programmatically and asserts: - the two reshapes are NOT collapsed - the inferred output shape stays `(0, 0, 3)` The existing happy-path test `ReshapeFusion_Contiguous_Reshape` (added in microsoft#22494) is unaffected — its inferred output shape `(2, 1, 64, 32)` contains no zero dims, so the new guard does not trigger. ### Provenance `FuseContiguousReshapes` was introduced in microsoft#22494 (Feb 2025). The bug has been latent in `main` since then. ### Motivation and Context Found while reviewing microsoft/onnxscript#2907 — the rewriter rule under test there is semantically correct, but its numerical-equivalence check using ORT as the oracle fails because of this fusion bug. Fixes microsoft#28348. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
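The shape mismatch in the repro follows directly from how `Reshape` interprets a `0` entry. The sketch below is a plain-Python model of that rule, not ORT's implementation; `resolve_reshape` is a hypothetical helper written for this illustration.

```python
from math import prod


def resolve_reshape(input_shape, target, allowzero=0):
    """Resolve a Reshape target shape against an input shape."""
    out = []
    for i, dim in enumerate(target):
        if dim == 0 and not allowzero:
            # allowzero=0: a literal 0 copies the corresponding input dim.
            if i >= len(input_shape):
                raise ValueError("0 dim has no corresponding input dim")
            out.append(input_shape[i])
        else:
            out.append(dim)
    if out.count(-1) > 1:
        raise ValueError("at most one -1 is allowed")
    total = prod(input_shape)
    if -1 in out:
        known = prod(d for d in out if d != -1)
        if known == 0 or total % known != 0:
            raise ValueError("cannot infer the -1 dim")
        out[out.index(-1)] = total // known
    if prod(out) != total:
        raise ValueError("element count mismatch")
    return tuple(out)


if __name__ == "__main__":
    x = (0, 6, 2)
    mid = resolve_reshape(x, (3, 2, -1))                 # first Reshape
    print(resolve_reshape(mid, (0, 0, 3), allowzero=1))  # original chain
    print(resolve_reshape(x, (0, 0, 3), allowzero=0))    # fused node
```

The original chain resolves to `(0, 0, 3)`, while applying the inferred shape to the first input with the default `allowzero=0` copies dim 1 from `X` and yields `(0, 6, 3)`, which matches the REF and ORT outputs in the repro.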

## Summary - Defer `sympy` import so `import onnxruntime.quantization` succeeds without sympy installed - Move `SymbolicShapeInference` import in `quant_pre_process` behind `skip_symbolic_shape` gate - Defer sympy-dependent imports in `transformers.onnx_model` and `transformers.shape_infer_helper` - Raise a clear, actionable `ImportError` instructing users to install sympy when needed ## Motivation Fixes microsoft#24872. `sympy` (~29 MB plus `mpmath` ~2 MB) was a hard runtime dependency even though it is only needed for symbolic shape inference. Pure-inference users — the common case — pay the install/import cost for functionality they do not use. `setup.py` already declares sympy as an optional extra (`"symbolic": ["sympy"]`), but top-level imports forced it to load unconditionally. ## Changes - `onnxruntime/python/tools/quantization/shape_inference.py`: move `from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference` from module top-level into `quant_pre_process`, guarded by `if not skip_symbolic_shape`. Wrap in `try/except ImportError` that re-raises with install instructions. - `onnxruntime/python/tools/transformers/onnx_model.py`: move the `from shape_infer_helper import SymbolicShapeInferenceHelper` from module top-level into the two methods that instantiate it. Add `TYPE_CHECKING`-guarded import for type annotations. - `onnxruntime/python/tools/transformers/shape_infer_helper.py`: wrap the import of `symbolic_shape_infer` in `try/except ImportError`. The `SymbolicShapeInferenceHelper.__init__` now raises a clear `ImportError` when sympy is unavailable, instead of failing at module load time. - `onnxruntime/test/python/quantization/test_quant_preprocess.py`: add `test_skip_symbolic_shape_does_not_require_sympy` which removes sympy from `sys.modules` and verifies `quant_pre_process(..., skip_symbolic_shape=True)` completes successfully. No public API signatures change. 
Users who want symbolic shape inference install sympy as before (`pip install sympy` or `pip install onnxruntime[symbolic]`). ## Test Plan - `python -m pytest onnxruntime/test/python/quantization/test_quant_preprocess.py -v` — all tests pass including the new coverage. - Smoke-tested locally: `import onnxruntime.quantization` no longer pulls `sympy` into `sys.modules`. - `lintrunner -a` clean on all changed files. Fixes microsoft#24872
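The deferred-import pattern described in this change can be sketched as follows. This is an illustration only, not the code in `shape_inference.py`: `load_optional` and `preprocess` are hypothetical names, and the module name is a parameter here so the behaviour can be exercised without depending on whether sympy is installed.

```python
import importlib


def load_optional(module_name: str, feature: str, install_hint: str):
    """Import a module on demand and raise an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires '{module_name}'. Install it with: {install_hint}"
        ) from exc


def preprocess(skip_symbolic_shape: bool, symbolic_module: str = "sympy") -> list[str]:
    """Return the steps that would run; the optional import happens only if needed."""
    steps = []
    if not skip_symbolic_shape:
        # The optional dependency is touched only on this branch.
        load_optional(
            symbolic_module,
            "Symbolic shape inference",
            f"pip install {symbolic_module}",
        )
        steps.append("symbolic_shape_inference")
    steps.append("onnx_shape_inference")
    return steps


if __name__ == "__main__":
    # Skipping symbolic shape inference never touches the optional module.
    print(preprocess(skip_symbolic_shape=True, symbolic_module="not_installed_module"))
```

Keeping the import inside the branch means a missing optional package only surfaces when the feature that needs it is actually requested.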

microsoft#28346) ### Description Fix `ReleaseVersionSuffix` passing in 'Nuget Test' pipeline.

### Description Rename `ort_api_1_to_26` -> `ort_api_1_to_27`. ### Motivation and Context This should have been done in microsoft#28324, but we wanted to merge ASAP.

…n GQA (microsoft#28358)

## Problem

The Memory-Efficient Attention (MEA) path crashes with `cudaErrorMisalignedAddress` when:

- GQA mode (`q_num_heads != kv_num_heads`)
- `head_size != v_head_size` (e.g., Q.head_dim=256, K.head_dim=512)
- `seq_len >= 4` (Flash Attention not eligible due to attention mask)

MEA's `LaunchUngroup` requires equal head sizes, but the dispatch logic checked this constraint only for the past_key case (line 1380), not for the general GQA case.

## Fix

Skip MEA for GQA when head sizes differ. The Unfused Attention fallback handles this correctly.

## Affected Models

Gemma 4 was not affected; the graph that triggered the crash was incorrect. The fix is still worth having because it improves robustness.

~~**Gemma4** (google/gemma-4-e2b-it) with KV sharing:~~

- Layers 15-34 borrow K,V from source layers
- Q projection: 1536 → 2048 (8 heads × 256)
- K/V from source: [batch, 1, seq, 512]
- `head_size = 256`, `v_head_size = 512`

## Testing

Minimal repro (from microsoft#28357):

```python
# Attention(Q=[1,S,2048], K=[1,S,512], V=[1,S,512], q_num_heads=8, kv_num_heads=1)
# Before fix: seq=4+ crashes with misaligned address
# After fix: all seq lengths work
```

Full Gemma4 decoder (35 layers, 15 GQA + 20 standard Attention):

- Prefill seq=32: ✅
- Decode seq=1: ✅

Fixes microsoft#28357

Signed-off-by: Justin Chu <justinchu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
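The added guard amounts to one extra eligibility check before choosing MEA. The following Python sketch models only that single condition and deliberately ignores every other eligibility rule in the real dispatch; `can_use_mea` and `pick_kernel` are hypothetical names for this illustration.

```python
def can_use_mea(q_num_heads: int, kv_num_heads: int,
                head_size: int, v_head_size: int) -> bool:
    """Model only the new guard: GQA with unequal head sizes must skip MEA."""
    is_gqa = q_num_heads != kv_num_heads
    if is_gqa and head_size != v_head_size:
        return False
    # All other MEA eligibility checks are out of scope for this sketch.
    return True


def pick_kernel(q_num_heads: int, kv_num_heads: int,
                head_size: int, v_head_size: int) -> str:
    """Fall back to the unfused path when the guard rejects MEA."""
    if can_use_mea(q_num_heads, kv_num_heads, head_size, v_head_size):
        return "mea"
    return "unfused"


if __name__ == "__main__":
    # Shapes from the minimal repro: 8 query heads, 1 KV head, 256 vs 512.
    print(pick_kernel(8, 1, 256, 512))
```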

### Description

When loading tensor data from memory buffers that come from external files, byteswap it if necessary. Also fix the deleter used when byteswapping: keep a copy of the `AllocatorPtr` instead of a reference.

### Motivation and Context

While trying to set up a local s390x CI, I found 4 more tests that fail on s390x:

- `CApiTest.TestLoadModelFromArrayWithExternalInitializerFromFileArray`
- `CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileArray`
- `CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileArrayPathRobust`
- `CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileMmap`
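Why a byteswap is needed on s390x: ONNX stores raw tensor data in little-endian order, while s390x is big-endian, so each element's bytes must be reversed before the host can interpret them. The sketch below shows the per-element swap in plain Python; `byteswap_elements` and `load_little_endian` are hypothetical helpers, not ORT functions.

```python
import sys


def byteswap_elements(buf: bytes, elem_size: int) -> bytes:
    """Reverse the byte order of every elem_size-byte element in buf."""
    if elem_size <= 0 or len(buf) % elem_size != 0:
        raise ValueError("buffer length must be a multiple of elem_size")
    out = bytearray(len(buf))
    for start in range(0, len(buf), elem_size):
        out[start:start + elem_size] = buf[start:start + elem_size][::-1]
    return bytes(out)


def load_little_endian(buf: bytes, elem_size: int,
                       host_byteorder: str = sys.byteorder) -> bytes:
    """Return buf in host byte order, swapping only on big-endian hosts."""
    if host_byteorder == "big":
        return byteswap_elements(buf, elem_size)
    return bytes(buf)


if __name__ == "__main__":
    raw = (1).to_bytes(4, "little")  # a little-endian int32 with value 1
    native = load_little_endian(raw, 4, host_byteorder="big")
    print(int.from_bytes(native, "big"))
```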

…es/vite-default (microsoft#28304)

Bumps [postcss](https://github.com/postcss/postcss) from 8.5.3 to 8.5.13.

Release notes (sourced from [postcss's releases](https://github.com/postcss/postcss/releases)):

- 8.5.13: Fixed `postcss-scss` commend regression.
- 8.5.12: Fixed reading any file via user-generated CSS. Added `opts.unsafeMap` to disable checks.
- 8.5.11: Fixed nested brackets parsing performance (by [@offset](https://github.com/offset)).
- 8.5.10: Fixed XSS via unescaped `</style>` in non-bundler cases (by [@TharVid](https://github.com/TharVid)).
- 8.5.9: Speed up source map encoding paring in case of the error.
- 8.5.8: Fixed `Processor#version`.
- 8.5.7: Improved source map annotation cleaning performance (by CodeAnt AI).
- 8.5.6: Fixed `ContainerWithChildren` type discriminating (by [@Goodwine](https://github.com/Goodwine)).
- 8.5.5: Fixed `package.json`→`exports` compatibility with some tools (by [@JounQin](https://github.com/JounQin)).
- 8.5.4: Fixed Parcel compatibility issue (by [@git-sumitchaudhary](https://github.com/git-sumitchaudhary)).

The [postcss changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md) lists the same entries.

Commits:

- [`af58cf1`](https://github.com/postcss/postcss/commit/af58cf1b7af02e9b9fcb138a4a2d7ef3450158b1) Release 8.5.13 version
- [`f227dbd`](https://github.com/postcss/postcss/commit/f227dbd0e9443e5f33e18e633b8b4d2b55aac5ee) Temporary ignore pnpm 11 config
- [`d3abd40`](https://github.com/postcss/postcss/commit/d3abd40d723cf3559e5ddb5fc738b7cb64e92bb0) Update dependencies
- [`dd06c3e`](https://github.com/postcss/postcss/commit/dd06c3e11362087bc18f9c20cee30fd82bda3de9) Revert stringifier changes because of the conflict with postcss-scss
- [`ae889c8`](https://github.com/postcss/postcss/commit/ae889c815fb88d785401a88f1a7dfc8cb11915fb) Try to fix CI
- [`e0093e4`](https://github.com/postcss/postcss/commit/e0093e49bcf00347383a13e40bb1f67bc823ca15) Move to pnpm 11
- [`9bc81c4`](https://github.com/postcss/postcss/commit/9bc81c48f054a630c9a2e3868263b7ad4fc15013) Release 8.5.12 version
- [`85c4d7d`](https://github.com/postcss/postcss/commit/85c4d7dab830be366f8a96047f9e5b7944e101d8) Another try to fix coverage
- [`94484ca`](https://github.com/postcss/postcss/commit/94484cae6d4308167939f2ac888d166bd80dff01) Try to fix coverage
- [`c64b748`](https://github.com/postcss/postcss/commit/c64b7488d2731dfa16213739b42c34faf5a9eba3) Load only .map source maps
- Additional commits viewable in [compare view](https://github.com/postcss/postcss/compare/8.5.3...8.5.13)

Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…crosoft#28350)

### Description

Fix `onnxruntime-node` and `onnxruntime-common` NPM packages lacking an RC suffix when built in Release + RC mode. The result is not ideal: the suffix has the form `-QUAL.DATE-COMMIT`, so the publishing pipeline will break if the packaging pipelines (zip-nuget and NPM) span more than a single day, because of the same-version checks.

### Motivation and Context

A missing RC qualifier/suffix fails the NPM publish pipeline, which correctly asserts that the packages (`onnxruntime-node`, `onnxruntime-common`, and `onnxruntime-web`) do not share a common version specifier.

…ize op (microsoft#28345) ### Description `ROUND_PREFER_CEIL` in the Resize operator used bare `std::round`/`roundf`, which rounds away from zero. This is correct for positive halfway values (e.g., `round(0.5) = 1 = ceil(0.5)`) but wrong for negative halfway values (e.g., `round(-0.5) = -1`, but `ceil(-0.5) = 0`). Negative coordinates occur naturally with the `half_pixel` coordinate transformation mode for the first output pixels when upsampling. Added an explicit negative-halfway check, mirroring the existing positive-halfway check in `ROUND_PREFER_FLOOR`: ```cpp // CPU (upsamplebase.h) case ROUND_PREFER_CEIL: return [](float x_original, bool) { if (x_original == static_cast<int64_t>(x_original) - 0.5f) { return static_cast<int64_t>(std::ceil(x_original)); } return static_cast<int64_t>(std::round(x_original)); }; ``` Same fix applied to the CUDA implementation (`resize_impl.cu`). Added two test cases in `resize_op_test.cc`: 1. `ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel` — exercises non-integer scale (26→64) from the original issue report, verifying correct source pixel selection at fractional boundaries. 2. `ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel_2x2to7x8` — exercises a positive 0.5 boundary where `round_prefer_ceil` selects ceiling. ### Motivation and Context The `round_prefer_floor` path already had an explicit halfway-case override (for positive values where `std::round` disagrees with floor). The `round_prefer_ceil` path was missing the symmetric fix for negative values, violating the ONNX spec semantics of "at ties, prefer ceiling." --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
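The tie-breaking difference can be checked with a small Python port of the logic. `c_round` emulates `std::round` (halfway cases away from zero), because Python's built-in `round` uses round-half-to-even instead, and `round_prefer_ceil` mirrors the check in the C++ snippet above. Both Python function names are assumptions made for this illustration.

```python
import math


def c_round(x: float) -> int:
    """Emulate std::round: halfway cases round away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def round_prefer_ceil(x: float) -> int:
    """Round to nearest, resolving ties toward the ceiling."""
    # math.trunc matches static_cast<int64_t>: truncation toward zero, so
    # this condition is true exactly for negative halfway values.
    if x == math.trunc(x) - 0.5:
        return math.ceil(x)
    return c_round(x)


if __name__ == "__main__":
    for value in (-1.5, -0.5, 0.5, 2.5):
        print(value, c_round(value), round_prefer_ceil(value))
```

For the negative ties the two functions disagree (`c_round(-0.5)` is `-1`, `round_prefer_ceil(-0.5)` is `0`), while for positive ties they agree.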

### Description Add Python wheel packaging support for the CUDA plugin EP, following the WebGPU plugin EP packaging pattern from microsoft#28226. Changes include: - Add `plugin-ep-cuda/python` packaging sources for the `onnxruntime-ep-cuda` wheel. - Add helper APIs to locate/register the CUDA plugin EP shared library. - Add Linux and Windows x64 Python package jobs that consume the CUDA plugin binary artifacts. - Extend plugin package version setup to emit a PEP 440-compatible `PluginPythonPackageVersion`. - Add a Linux Docker helper script to build the CUDA plugin Python wheel inside the manylinux CUDA image. ### Validation - Parsed touched Azure pipeline YAML files with PyYAML. - Ran Python syntax checks for the new package helper and wheel builder. ### Notes The Linux Python package job is limited to x64 for now, matching the existing x64 plugin artifact packaging flow. --------- Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com> Signed-off-by: bfilipek <bartlomiej.filipek@intel.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com> Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Xiaoxi Han <xiha@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com> Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com> Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com> Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com> Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com> Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com> Co-authored-by: n1harika <niharika.sathish@intel.com> Co-authored-by: 
TejalKhade28 <tejal.khade@intel.com> Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com> Co-authored-by: liang <gxgaoliang@126.com> Co-authored-by: Javier Martinez <javier.e.martinez@intel.com> Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com> Co-authored-by: sfatimar <sahar.fatima@intel.com> Co-authored-by: Garth Long <garth.long@intel.com> Co-authored-by: Eric Crawford <eric.r.crawford@intel.com> Co-authored-by: derdeljan-msft <derdeljan@microsoft.com> Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com> Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com> Co-authored-by: Christopher Warrington <chwarr@microsoft.com> Co-authored-by: Ishwar Raut <iraut@nvidia.com> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Xinpeng Dou <15529241576@163.com> Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com> Co-authored-by: adrastogi <aditya.rastogi@microsoft.com> Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com> Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com> Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com> Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com> Co-authored-by: Adam Pocock <adam.pocock@oracle.com> Co-authored-by: Changming Sun <chasun@microsoft.com> Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com> Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com> Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com> Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com> Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com> Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com> Co-authored-by: Kotomi-Du <yaru.du@intel.com> Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com> Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com> Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com> Co-authored-by: bopeng1234 <bo.peng@intel.com> 
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com> Co-authored-by: Wenqin Yang <wenqin.yang@intel.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: xieofxie <xieofxie@126.com> Co-authored-by: hualxie <hualxie@microsoft.com> Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com> Co-authored-by: Joshua Lochner <admin@xenova.com> Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com> Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com> Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com> Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Jiawei Shao <jiawei.shao@intel.com> Co-authored-by: czekun <chen.zekun@intel.com> Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com> Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com> Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com> Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com> Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com> Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

### Description  This pull request adds C#/.NET (NuGet) packaging support for the WebGPU plugin Execution Provider, including all necessary project files, documentation, and helper code. It introduces a new NuGet package (`Microsoft.ML.OnnxRuntime.EP.WebGpu`), updates the main plugin documentation to reflect C# support, and provides detailed instructions and code samples for building, packaging, and using the provider in .NET applications. It also has some minor changes for the existing Python packaging setup. The most important changes are: **C#/.NET Packaging Infrastructure:** - Added the `Microsoft.ML.OnnxRuntime.EP.WebGpu` project (`.csproj`) for NuGet packaging, including metadata, dependency management, and logic to read the minimum ONNX Runtime version from a shared file. Native binaries are included per platform, and the README is bundled in the package. - Introduced the `WebGpuEp.cs` helper class to resolve the native library path and EP name at runtime, simplifying registration and usage in .NET. **Documentation:** - Added a detailed `README.md` for the C# package, including usage instructions, supported platforms, and example code for registering and using the WebGPU EP in .NET. - Added a top-level `csharp/README.md` with instructions for building, packaging, and testing the NuGet package, as well as information on CI integration and native binary requirements. ### Motivation and Context  Create WebGPU plugin EP NuGet package.

…to 25 (microsoft#27744) ### Description Extends CUDA Cast kernel registration to cover opset 25 (latest ONNX spec). The existing non-versioned opset 23 registration is capped to VERSIONED (23, 24), and a new non-versioned opset 25 registration is added for all type specializations. **`cast_op.cc`**: - `REGISTER_KERNEL_TYPED(T)`: opset 23 → VERSIONED (23, 24), added non-versioned opset 25 - Renamed `REGISTER_KERNEL_TYPED_23` → `REGISTER_KERNEL_TYPED_23_TO_24` (VERSIONED) - Added `REGISTER_KERNEL_TYPED_25` macro (non-versioned) - Renamed `SPECIALIZE_IMPL_19_TO_23` → `SPECIALIZE_IMPL_19_TO_25`, covering Float8 types through opset 25 - Updated Float4E2M1x2 registration to use new versioned/non-versioned macros **`cuda_execution_provider.cc`**: - Forward declarations: all opset 23 Cast entries → VERSIONED (23, 24), added opset 25 non-versioned entries (all 16 types: 13 standard + 2 Float8 + 1 Float4) - `BuildKernelCreateInfo`: same pattern — capped 23 to (23, 24), added opset 25 block ### Motivation and Context CUDA Cast operator was registered up to opset 23, but ONNX spec defines Cast through opset 25. This gap can cause kernel lookup failures when running models exported at opset 25. Part of the broader CUDA opset gap-filling effort tracked in microsoft#27729. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com> Co-authored-by: Copilot <copilot@github.com>
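The reason for capping the old registration and adding a new open-ended one can be sketched with a toy lookup. The matching rule below is a simplified assumption about kernel lookup (an open-ended registration matches only the schema version it starts at, while a versioned registration matches any schema version inside its range); it is not a copy of ORT's registry code, and `kernel_matches` and `find_kernel` are hypothetical names.

```python
from typing import Optional

# A registration is (start_version, end_version); end_version None = open-ended.
Registration = tuple[int, Optional[int]]


def kernel_matches(reg: Registration, node_since_version: int) -> bool:
    """Simplified matching rule assumed for this sketch."""
    start, end = reg
    if end is None:
        # Open-ended kernels are tied to the schema version they start at.
        return start == node_since_version
    return start <= node_since_version <= end


def find_kernel(regs: list[Registration],
                node_since_version: int) -> Optional[Registration]:
    """Return the first registration that matches, or None."""
    for reg in regs:
        if kernel_matches(reg, node_since_version):
            return reg
    return None


if __name__ == "__main__":
    before = [(23, None)]            # open-ended opset 23 only
    after = [(23, 24), (25, None)]   # versioned (23, 24) plus opset 25
    print(find_kernel(before, 25), find_kernel(after, 25))
```

Under this assumption, a node whose schema resolves to version 25 finds no kernel in the old registration list and finds the new open-ended entry after the change.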

…27755) ### Description Extends CUDA ReduceMax and ReduceMin kernel registrations from opset 18 to opset 20. - **`reduction_ops.cc`**: Added `REGISTER_KERNEL_VERSIONED_RANGE_AXES_INPUT_TYPED` macro for versioned ranges requiring `InputMemoryType(OrtMemTypeCPUInput, 1)`. Split both operators from 2-way (1–17, 18+) to 3-way (1–17, 18–19, 20+). - **`cuda_execution_provider.cc`**: Capped opset 18 forward declarations and `BuildKernelCreateInfo` entries to versioned 18–19. Added opset 20 non-versioned entries for both operators. Type coverage maintained as-is: ReduceMax (float, double, MLFloat16, int32_t, int64_t), ReduceMin adds int8_t, uint8_t. ### Motivation and Context ReduceMax and ReduceMin CUDA registrations stopped at opset 18; ONNX latest is opset 20. Models exported with opset 19–20 could fail to find a matching CUDA kernel for these ops. Follows the same pattern used in microsoft#27735 (TopK) and other opset gap PRs tracked in microsoft#27729. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

### Description Extends RNN CUDA kernel registration from opset 14 to opset 22, following the standard opset gap-filling pattern: - **`rnn.cc`**: Cap existing opset 14 non-versioned kernel to versioned 14–21; add new non-versioned kernel at opset 22 - **`cuda_execution_provider.cc`**: Update forward declarations and `BuildKernelCreateInfo` entries to match (versioned 14–21 + non-versioned 22); remove duplicate GRU opset 22 entries introduced during merge - **`OperatorKernels.md`**: Update CUDA RNN entry to reflect three tiers: `[7,13]`, `[14,21]`, `22+` No behavioral changes — the operator implementation is identical across opset 14–22. This is a registration-only change. ### Motivation and Context RNN CUDA operator was registered at opset 14 while ONNX defines it through opset 22, causing models exported at newer opsets to fall back to CPU. Part of the broader CUDA EP opset gap effort tracked in microsoft#27729. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com> Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

### Description

Extends the Reshape CUDA kernel registration from opset 23 to opset 25, following the same pattern used in microsoft#27728.

- **`reshape.cc`**: Cap existing non-versioned opset 23 kernel → versioned (23, 24); add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declaration and `BuildKernelCreateInfo` for versioned (23, 24); add opset 25 entries
- **`docs/OperatorKernels.md`**: Update Reshape CUDA EP entry from `23+` to `25+` and add `[23, 24]` versioned range row

No functional changes to the kernel itself: the opset 25 schema is backward-compatible with opset 23.

### Motivation and Context

Reshape is listed as a P1 gap in microsoft#27729 (CUDA max opset 23, ONNX latest opset 25). Models exported at opset 25 would fail to find a matching Reshape kernel on the CUDA EP.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Bumps two versions of [brace-expansion](https://github.com/juliangruber/brace-expansion). These dependencies needed to be updated together.

Updates `brace-expansion` from 1.1.11 to 1.1.13

Release notes, sourced from [brace-expansion's releases](https://github.com/juliangruber/brace-expansion/releases):

> ## v1.1.12
>
> - pkg: publish on tag 1.x c460dbd
> - fmt ccb8ac6
> - Fix potential ReDoS Vulnerability or Inefficient Regular Expression ([#65](https://redirect.github.com/juliangruber/brace-expansion/issues/65)) c3c73c8
>
> https://github.com/juliangruber/brace-expansion/compare/v1.1.11...v1.1.12

Commits:

- [`6c353ca`](https://github.com/juliangruber/brace-expansion/commit/6c353caf23beb9644f858eb3fe38d43a68b82898) 1.1.13
- [`7fd684f`](https://github.com/juliangruber/brace-expansion/commit/7fd684f89fdde3549563d0a6522226a9189472a2) Backport fix for GHSA-f886-m6hf-6m8v ([#95](https://redirect.github.com/juliangruber/brace-expansion/issues/95))
- [`44f33b4`](https://github.com/juliangruber/brace-expansion/commit/44f33b47c5c6a965d507421af43e86cf5971d711) 1.1.12
- [`c460dbd`](https://github.com/juliangruber/brace-expansion/commit/c460dbd68e428d147b2080622d8ce126c7a08570) pkg: publish on tag 1.x
- [`ccb8ac6`](https://github.com/juliangruber/brace-expansion/commit/ccb8ac6d4292b7661b677fe048ba6690c877f51f) fmt
- [`c3c73c8`](https://github.com/juliangruber/brace-expansion/commit/c3c73c8b088defc70851843be88ccc3af08e7217) Fix potential ReDoS Vulnerability or Inefficient Regular Expression ([#65](https://redirect.github.com/juliangruber/brace-expansion/issues/65))
- See full diff in [compare view](https://github.com/juliangruber/brace-expansion/compare/1.1.11...v1.1.13)

Updates `brace-expansion` from 2.0.1 to 2.0.3

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

### Description

Caps existing non-versioned CUDA kernel registrations and adds new registrations at the latest ONNX opset:

- **Round**: opset 11 (non-versioned) → versioned 11–21 + new opset 22
- **Equal**: opset 13 (non-versioned) → versioned 13–18 + new opset 19

Changes across three files:

- `unary_elementwise_ops.cc`: `UNARY_OP_HFD(Round, 11)` → `UNARY_OP_VERSIONED_HFD` + `UNARY_OP_HFD`
- `binary_elementwise_ops.cc`: `BINARY_LOGICALOP_REGISTER_UZILHFD(Equal, 13)` → versioned 13–18 + new 19 (same for `bool` typed registration)
- `cuda_execution_provider.cc`: corresponding forward declarations and `BuildKernelCreateInfo` entries

No type changes; both operators retain their existing CUDA type support at the new opsets.

### Motivation and Context

Part of the ongoing effort to close ONNX opset coverage gaps in the CUDA execution provider (microsoft#27729). Without these registrations, models targeting opset 19+ (Equal) or 22+ (Round) fall back from CUDA to CPU.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>

…crosoft#28380)

## Summary

- Remove the `bias == nullptr` requirement from `CanApplyFlashAttention`, enabling FlashAttention for MultiHeadAttention nodes with QKV bias (e.g., whisper decoder).
- Apply `TransferBSDToBNSH` to add bias and transpose Q/K/V to BNSH format before calling FlashAttention.
- Handle cross-attention (only Q needs bias+transpose, K/V already BNSH from encoder) and self-attention (all Q/K/V need bias+transpose) separately.

## Motivation

Whisper decoder's MultiHeadAttention nodes all have QKV bias, which previously forced them into the slower unfused attention path. Enabling FlashAttention for these nodes yields ~45% speedup on whisper-tiny-int4 (~92 → ~134 tokens/s).

## Test plan

- [x] Existing MHA unit tests with bias data now exercise the FlashAttention path on WebGPU with Subgroups support
- [x] whisper-tiny-int4 end-to-end: correct transcription at ~134 tps (vs ~92 tps baseline)
- [x] clang-format passes
- [x] D3D12 build succeeds
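The bias-add plus BSD-to-BNSH transpose described in the summary can be illustrated with a plain CPU reference. This is a minimal sketch, not the WebGPU shader behind `TransferBSDToBNSH`; the function name `AddBiasAndTransposeBsdToBnsh` and its signature are hypothetical, and it assumes the input is laid out as `[batch, seq, num_heads * head_size]` with one bias value per hidden channel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference sketch: adds a per-channel bias to an input laid out as
// [batch, seq, num_heads * head_size] (BSD) and writes the result as
// [batch, num_heads, seq, head_size] (BNSH).
std::vector<float> AddBiasAndTransposeBsdToBnsh(const std::vector<float>& input,
                                                const std::vector<float>& bias,
                                                std::size_t batch,
                                                std::size_t seq,
                                                std::size_t num_heads,
                                                std::size_t head_size) {
  const std::size_t hidden = num_heads * head_size;
  assert(input.size() == batch * seq * hidden);
  assert(bias.size() == hidden);

  std::vector<float> output(input.size());
  for (std::size_t b = 0; b < batch; ++b) {
    for (std::size_t s = 0; s < seq; ++s) {
      for (std::size_t n = 0; n < num_heads; ++n) {
        for (std::size_t h = 0; h < head_size; ++h) {
          // Source index in BSD layout.
          const std::size_t src = (b * seq + s) * hidden + n * head_size + h;
          // Destination index in BNSH layout.
          const std::size_t dst = ((b * num_heads + n) * seq + s) * head_size + h;
          output[dst] = input[src] + bias[n * head_size + h];
        }
      }
    }
  }
  return output;
}
```

For cross-attention as described above, a helper like this would be applied to Q only, since K/V already arrive in BNSH; for self-attention it would be applied to Q, K, and V.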

…t#28123)

### Description

The OrtModelEditorApi C API functions (AddNodeToGraph, AddGraphToModel, SetGraphInputs/SetGraphOutputs) take raw pointers and wrap them in unique_ptr to transfer ownership. Without guards, callers can pass the same pointer twice or call Release after ownership transfer, causing double-free on destruction.

### Changes

- **AddInitializerToGraph**: Copy OrtValue internally instead of taking raw pointer ownership. OrtValue uses shared_ptr for its data, so copying is cheap (refcount increment). The caller retains ownership and is responsible for releasing. This eliminates the double-free class entirely for initializers.
- **AddNodeToGraph**: Add `owned_` flag to ModelEditorNode to reject double-add, add null check
- **AddGraphToModel**: Reject if model already has a graph, add null check for model. Add `owned_` flag to ModelEditorGraph to reject same graph added to two models.
- **SetGraphInputs/SetGraphOutputs**: Add `owned_` flag to ModelEditorValueInfo to reject already-owned ValueInfos. Detect duplicate pointers in input arrays. Pre-allocate vector capacity before ownership-transfer loop for exception safety.
- **ReleaseNode/ReleaseGraph/ReleaseValueInfo**: Check `owned_` flag before deleting. If already owned by a graph/model, the release is a safe no-op.
- **C++ wrapper**: Remove initializer.release() in AddInitializer to match copy semantics.
- **Regression tests**: Tests covering ownership-transfer guard paths, release-after-ownership, and duplicate detection.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
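The ownership guard listed above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual OrtModelEditorApi implementation: `Node`, `Graph`, and the `bool` return values are hypothetical stand-ins (the real API reports errors through its own status type), and only the node case is shown.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the editor types; names are illustrative only.
struct Node {
  bool owned = false;  // set once a graph has taken ownership
};

struct Graph {
  std::vector<std::unique_ptr<Node>> nodes;
};

// Takes ownership of `node` unless it is null or already owned.
// Returns false when the add is rejected.
bool AddNodeToGraph(Graph& graph, Node* node) {
  if (node == nullptr || node->owned) return false;
  // Allocate storage before the transfer so a failed allocation cannot leave
  // the node marked as owned without an owner.
  graph.nodes.reserve(graph.nodes.size() + 1);
  node->owned = true;
  graph.nodes.emplace_back(node);
  return true;
}

// Deletes the node only if no graph owns it; otherwise a safe no-op.
// Returns true when the node was actually deleted.
bool ReleaseNode(Node* node) {
  if (node == nullptr || node->owned) return false;
  delete node;
  return true;
}
```

With this guard, adding the same pointer twice is rejected on the second call, and calling `ReleaseNode` on a node that a graph already owns leaves the deletion to the graph's `unique_ptr`.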

…28250)

- Wrap 8x16x16 MatMulNBits(SubgroupMatrix) kernel body in M-tile loop using uniforms.m_tiles_per_wg for tile assignment per workgroup
- Cap dispatch_y on Xe2/3-LPG when M > 2k, with occupancy factor 16x
- Non-Intel or small-M paths pass m_tiles_per_wg=1 (no behavior change)
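How a capped `dispatch_y` and a per-workgroup tile loop can still cover every M tile is sketched below on the host side. This is an assumption-laden illustration, not the shader or the dispatch code from the change: `DispatchPlan`, `PlanDispatch`, and `TilesForWorkgroup` are hypothetical names, tiles are assumed to be assigned contiguously per workgroup, and the way the cap itself is derived (the Xe2/3-LPG check and the 16x occupancy factor) is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical dispatch description: number of workgroups along Y and the
// number of M tiles each workgroup loops over.
struct DispatchPlan {
  std::uint32_t dispatch_y;
  std::uint32_t m_tiles_per_wg;
};

// If the tile count fits under the cap (or no cap is given, encoded as 0),
// each workgroup handles exactly one tile. Otherwise tiles are spread over at
// most `max_dispatch_y` workgroups.
DispatchPlan PlanDispatch(std::uint32_t m_tiles, std::uint32_t max_dispatch_y) {
  if (max_dispatch_y == 0 || m_tiles <= max_dispatch_y) {
    return {m_tiles, 1};
  }
  const std::uint32_t per_wg = (m_tiles + max_dispatch_y - 1) / max_dispatch_y;
  const std::uint32_t dispatch_y = (m_tiles + per_wg - 1) / per_wg;
  return {dispatch_y, per_wg};
}

// Tiles visited by workgroup `wg_y`, assuming contiguous assignment. The
// bounds check mirrors the early exit a kernel-side tile loop would need for
// the last, partially filled workgroup.
std::vector<std::uint32_t> TilesForWorkgroup(const DispatchPlan& plan,
                                             std::uint32_t m_tiles,
                                             std::uint32_t wg_y) {
  std::vector<std::uint32_t> tiles;
  for (std::uint32_t i = 0; i < plan.m_tiles_per_wg; ++i) {
    const std::uint32_t tile = wg_y * plan.m_tiles_per_wg + i;
    if (tile >= m_tiles) break;
    tiles.push_back(tile);
  }
  return tiles;
}
```

For example, 10 tiles with a cap of 4 gives 4 workgroups with 3 tiles each, where the last workgroup processes only tile 9; with the cap disabled the plan degenerates to one tile per workgroup, matching the `m_tiles_per_wg=1` case.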

eserscor and others added 29 commits May 4, 2026 09:34

Bump version for 1.27.0 (microsoft#28324)

b81e0c6

### Description

Bump version to 1.27.0.

adds foundry local packaging to webgpu plugin ep packaging pipeline (m…

a1aa3bb

…icrosoft#28300)

test run: https://dev.azure.com/aiinfra/Lotus/_build/results?buildId=1201168&view=results

---------

Co-authored-by: Prathik Rao <prathikrao@microsoft.com>

fix(ci): test pipeline didn't correctly specify ReleaseVersionSuffix (

ebee606

microsoft#28346)

### Description

Fix how `ReleaseVersionSuffix` is passed in the 'Nuget Test' pipeline.

chore: rename ort_api_1_to_26 to ort_api_1_to_27 (microsoft#28341)

ef44604

### Description

Rename `ort_api_1_to_26` -> `ort_api_1_to_27`.

### Motivation and Context

This should have been done in microsoft#28324, but we wanted to merge ASAP.

Merge remote-tracking branch 'origin/master' into sync_msft_07052026

336c469

ai-fw-intg requested a review from ankitm3k May 6, 2026 20:32

ai-fw-intg requested review from Jaswanth51, jatinwadhwa921 and vthaniel May 6, 2026 20:32

Sync with Microsoft ONNX Runtime - 07052026 #1075

ai-fw-intg wants to merge 29 commits into ovep-develop from sync_msft_07052026

ai-fw-intg commented May 6, 2026

15 participants
