Sync with Microsoft ONNX Runtime - 07052026 #1075

Open
ai-fw-intg wants to merge 29 commits into ovep-develop from sync_msft_07052026

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

eserscor and others added 29 commits May 4, 2026 09:34
### Description

Bump version to 1.27.0.
### Description

Bump plugin-ep-webgpu/VERSION_NUMBER to 0.2.0.

### Motivation and Context

Version bump after creating the release branch.
…t#28294)

### Description

Avoid having to depend on setup job/task for build date/time.
Use pipeline var & runtime expression instead.

### Motivation and Context

Faster, no need to wait for an ad-hoc job to set pipeline variables.
Easier to read/reason about, reduces cross-stage deps.
### Description

Add release info doc for WebGPU plugin EP.

### Motivation and Context

Document release info.
…r dispatch with causal alignment fix (microsoft#27992)

## Motivation

Eliminate the legacy MHA Unfused path (`QkvToContext` in
`attention_impl.cu`) from the ONNX standard Attention op, simplifying
the CUDA dispatch to a clean 3-tier cascade.

## Design

```
Flash Attention → Memory-Efficient Attention (MEA) → Unified Unfused Attention
```

- **Flash**: Handles fp16/bf16 with head_size ≤ 256, no explicit
attn_mask. Fastest path.
- **MEA (CUTLASS)**: Handles cases Flash cannot (explicit masks,
softcap+mask combos). Requires head_size % 8 == 0.
- **Unified Unfused**: Fallback for everything else — fp32, small heads,
H≠H_v, output_qk. Handles both MHA and GQA via FP32 QK accumulation.

The legacy `RunUnfusedAttention` wrapper (which called contrib ops
`QkvToContext`) is deleted. The contrib MHA op is unaffected.
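
The cascade can be pictured as a first-match selection. The following is only a rough sketch under simplifying assumptions: `AttentionCase` and `select_attention_path` are hypothetical names, and the sketch treats the fallback cases listed above (fp32, H≠H_v, output_qk) as disqualifying both fused paths. The real CUDA dispatch checks more conditions than these.

```python
from dataclasses import dataclass


@dataclass
class AttentionCase:
    """Hypothetical description of one Attention call (not an ORT type)."""
    dtype: str                  # "fp16", "bf16" or "fp32"
    head_size: int
    v_head_size: int
    has_attn_mask: bool = False
    output_qk: bool = False


def select_attention_path(case: AttentionCase) -> str:
    """Return the first tier whose simplified constraints are met."""
    # Assumption: the cases listed for the unfused fallback rule out both fused paths.
    fused_ok = (
        case.dtype in ("fp16", "bf16")
        and case.head_size == case.v_head_size
        and not case.output_qk
    )
    if fused_ok and case.head_size <= 256 and not case.has_attn_mask:
        return "flash"
    if fused_ok and case.head_size % 8 == 0:
        return "mea"
    return "unfused"
```

For example, an fp16 call with `head_size=128` and no mask would pick `"flash"`, the same call with an explicit mask would pick `"mea"`, and any fp32 call would fall through to `"unfused"`.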

## Key Behavior Changes

- **Unified unfused kernel** replaces separate GQA-only and MHA-only
unfused paths
- **Causal alignment**: lower-right when past_key is present, upper-left
otherwise (per ONNX spec)
- **H≠H_v + past KV** now supported (separate K/V concat calls)
- **output_qk (mode 0)** supported in unified kernel via
`ScaledCopyQkKernel`
- **29 ONNX backend test filters removed** — tests now pass natively

## Testing

All existing tests pass (40 C++ attention tests, 215 Python parametrized
cases) plus new coverage for causal alignment on CPU EP and softcap
ordering verification.

Closes microsoft#27880. Related: microsoft#27516, microsoft#28198.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

Shifting by >= the bit width of an unsigned type is undefined behavior
in C++. On x86-64, the hardware masks 64-bit shift amounts to 6 bits, so
`x >> 64` silently becomes `x >> 0`, returning the original value
instead of 0.

Added `SafeShiftLeft`/`SafeShiftRight` helpers that return 0 when `shift
>= sizeof(T) * 8`, applied across all three broadcast code paths
(scalar-X, scalar-Y, element-wise).

```cpp
template <typename T>
inline T SafeShiftRight(T value, T shift) {
  return shift >= sizeof(T) * 8 ? T{0} : value >> shift;
}
```
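
The difference between the masked hardware behavior and the intended result can be emulated in Python, whose integers have no fixed width. This is an illustrative sketch only; `hardware_shift_right` and `safe_shift_right` are hypothetical names, and the masking model assumes a power-of-two bit width, as on x86-64.

```python
def hardware_shift_right(value: int, shift: int, bits: int = 64) -> int:
    """Emulate a shift whose amount is masked to log2(bits) bits."""
    width_mask = (1 << bits) - 1
    return (value & width_mask) >> (shift & (bits - 1))


def safe_shift_right(value: int, shift: int, bits: int = 64) -> int:
    """Return 0 once the shift amount reaches the bit width."""
    width_mask = (1 << bits) - 1
    return 0 if shift >= bits else (value & width_mask) >> shift


x = 0xFF
masked = hardware_shift_right(x, 64)  # shift amount wraps to 0
fixed = safe_shift_right(x, 64)       # all bits are shifted out
```

Here `masked` keeps the original value because `64 & 63` is `0`, while `fixed` is `0`, which is the result the operator is expected to produce.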

Added tests covering:
- Shift by exact bit width (32, 64) for `uint32_t` and `uint64_t`
- Shift by more than bit width (65, 128)
- All three broadcast paths (scalar-X, scalar-Y, element-wise)
- New tests are excluded for DirectML EP, which has the same
hardware-level shift masking behavior

### Motivation and Context

`BitShift` with `direction="RIGHT"` on `uint64` inputs with shift amount
64 returns the original values instead of zeros. Reproduces with
`CPUExecutionProvider` and `ORT_DISABLE_ALL` (constant folding masks the
bug under `ORT_ENABLE_ALL`).

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
…ybind11 3.0 (microsoft#28251)

## Summary
- Probes `-Wmaybe-uninitialized` via `check_cxx_compiler_flag` and
applies `-Wno-maybe-uninitialized` only to the
`onnxruntime_pybind11_state` target when the compiler accepts it.
- Fixes the GCC build break introduced when ORT is compiled against
pybind11 3.0, currently blocking Fedora's pybind11 3.0 package update.

## Motivation
pybind11 3.0 rewrote `def_readwrite` to use a
`property_cpp_function_classic` template that generates a lambda
capturing a member pointer by value. GCC's `-Wmaybe-uninitialized` flow
analysis flags that lambda inside pybind11's own headers, so any
consumer compiling ORT's Python bindings against system pybind11 3.0
fails the build. This is a header-side false positive — there is no real
uninitialized read in ORT code or in pybind11.

Fixes microsoft#25681

## Changes
- `cmake/CMakeLists.txt`: add
`check_cxx_compiler_flag(-Wno-maybe-uninitialized
HAS_NO_MAYBE_UNINITIALIZED)` next to the existing
`HAS_CAST_FUNCTION_TYPE` probe.
- `cmake/onnxruntime_python.cmake`: when `HAS_NO_MAYBE_UNINITIALIZED` is
set, append `-Wno-maybe-uninitialized` to the
`onnxruntime_pybind11_state` target's private compile options. Mirrors
the established `HAS_CAST_FUNCTION_TYPE` pattern in the same file.

The suppression is target-scoped (only the Python binding shared
library), compiler-scoped (only when the flag is accepted — effectively
GCC), and warning-scoped (only the flow-sensitive
`-Wmaybe-uninitialized`, not the strict `-Wuninitialized`).

## Test Plan
- [x] `lintrunner -a` clean on the diff.
- [ ] CI: confirm Linux GCC builds remain green.
- [ ] Downstream verification: Fedora packagers can rebuild ORT against
system pybind11 3.0 without `-Wmaybe-uninitialized` errors (per issue
reporter).
… dims (microsoft#28349)

### Description

`ReshapeFusion::FuseContiguousReshapes` collapses a chain of `Reshape` /
`Squeeze` / `Unsqueeze` nodes into a single `Reshape` whose shape data
is taken verbatim from the fully-inferred output shape of the last node
in the chain. The new node is created without an `allowzero` attribute,
so it defaults to `allowzero = 0`.

When that inferred shape contains a literal `0` dim (legitimate when the
original chain used `allowzero=1`, or when intermediate tensors had
zero-sized dimensions), the fused `Reshape` misinterprets the `0` as
"copy the corresponding dim from the input tensor" — but the input here
is the original input of the *first* reshape in the chain, with
unrelated dims. The result is a silently wrong output shape (and a
benign-looking `MergeShapeInfo` warning at graph load).
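
The effect of `allowzero` on a literal `0` can be sketched in plain Python. `resolve_reshape` below is a hypothetical helper that follows the ONNX `Reshape` shape rules in simplified form; it is not ORT code.

```python
from math import prod


def resolve_reshape(input_shape, target, allowzero=0):
    """Compute a Reshape output shape under simplified ONNX rules."""
    out = []
    for axis, dim in enumerate(target):
        if dim == 0 and not allowzero:
            # allowzero=0: a 0 means "copy this dim from the input tensor".
            out.append(input_shape[axis])
        else:
            out.append(dim)
    if -1 in out:
        known = prod(d for d in out if d != -1)
        if known == 0:
            raise ValueError("cannot infer -1 when another dim is zero")
        out[out.index(-1)] = prod(input_shape) // known
    return tuple(out)


# The fused node sees the chain's original input and the final inferred shape.
with_allowzero = resolve_reshape((0, 6, 2), (0, 0, 3), allowzero=1)
fused_default = resolve_reshape((0, 6, 2), (0, 0, 3), allowzero=0)
```

With the chain's original input shape `(0, 6, 2)`, the default `allowzero=0` copies `6` into the second position, which is how a shape such as `(0, 6, 3)` can arise in place of the intended `(0, 0, 3)`.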

### Repro (before the fix)

```python
import numpy as np, onnx, onnxruntime as ort, onnx.reference
from onnx import helper, TensorProto

X  = helper.make_tensor_value_info("X", TensorProto.FLOAT, [0, 6, 2])
Y  = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [None, None, None])
s1 = helper.make_tensor("s1", TensorProto.INT64, [3], [3, 2, -1])
s2 = helper.make_tensor("s2", TensorProto.INT64, [3], [0, 0, 3])

n1 = helper.make_node("Reshape", ["X",   "s1"], ["mid"])
n2 = helper.make_node("Reshape", ["mid", "s2"], ["Y"], allowzero=1)
m  = helper.make_model(helper.make_graph([n1, n2], "g", [X], [Y], initializer=[s1, s2]),
                       opset_imports=[helper.make_opsetid("", 18)])

inp = np.random.default_rng(7).random((0, 6, 2), dtype=np.float32)
print("REF:", onnx.reference.ReferenceEvaluator(m).run(None, {"X": inp})[0].shape)
print("ORT:", ort.InferenceSession(m.SerializeToString(),
                                   providers=["CPUExecutionProvider"]).run(None, {"X": inp})[0].shape)
```

Output on `main` (`40c9f85f69`):

```
REF: (0, 0, 3)
[W ... graph.cc:122 MergeShapeInfo] Error merging shape info for output. 'Y' source:{0,6,3} target:{0,0,3}. Falling back to lenient merge.
ORT: (0, 6, 3)   ❌
```

### Fix

Setting `allowzero=1` on the fused node would also work but requires
opset >= 14, which this transformer cannot assume (it accepts `Reshape`
opset 5+). Bail out of fusion conservatively when `shape_value` contains
any literal `0` dim.
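
The guard amounts to a single check on the inferred shape. A minimal sketch, using the hypothetical name `should_fuse_contiguous_reshapes` (the actual transformer is C++ and performs other checks as well):

```python
def should_fuse_contiguous_reshapes(shape_value) -> bool:
    """Skip fusion when any literal 0 would be reinterpreted by allowzero=0."""
    return all(dim != 0 for dim in shape_value)
```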

### Test

Adds `ReshapeFusionContiguousReshapesWithZeroDim` that builds the bug
repro programmatically and asserts:
- the two reshapes are NOT collapsed
- the inferred output shape stays `(0, 0, 3)`

The existing happy-path test `ReshapeFusion_Contiguous_Reshape` (added
in microsoft#22494) is unaffected — its inferred output shape `(2, 1, 64, 32)`
contains no zero dims, so the new guard does not trigger.

### Provenance

`FuseContiguousReshapes` was introduced in microsoft#22494 (Feb 2025). The bug
has been latent in `main` since then.

### Motivation and Context

Found while reviewing microsoft/onnxscript#2907
— the rewriter rule under test there is semantically correct, but its
numerical-equivalence check using ORT as the oracle fails because of
this fusion bug.

Fixes microsoft#28348.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
## Summary
- Defer `sympy` import so `import onnxruntime.quantization` succeeds
without sympy installed
- Move `SymbolicShapeInference` import in `quant_pre_process` behind
`skip_symbolic_shape` gate
- Defer sympy-dependent imports in `transformers.onnx_model` and
`transformers.shape_infer_helper`
- Raise a clear, actionable `ImportError` instructing users to install
sympy when needed

## Motivation
Fixes microsoft#24872. `sympy` (~29 MB plus `mpmath` ~2 MB) was a hard runtime
dependency even though it is only needed for symbolic shape inference.
Pure-inference users — the common case — pay the install/import cost for
functionality they do not use. `setup.py` already declares sympy as an
optional extra (`"symbolic": ["sympy"]`), but top-level imports forced
it to load unconditionally.

## Changes
- `onnxruntime/python/tools/quantization/shape_inference.py`: move `from
onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference`
from module top-level into `quant_pre_process`, guarded by `if not
skip_symbolic_shape`. Wrap in `try/except ImportError` that re-raises
with install instructions.
- `onnxruntime/python/tools/transformers/onnx_model.py`: move the `from
shape_infer_helper import SymbolicShapeInferenceHelper` from module
top-level into the two methods that instantiate it. Add
`TYPE_CHECKING`-guarded import for type annotations.
- `onnxruntime/python/tools/transformers/shape_infer_helper.py`: wrap
the import of `symbolic_shape_infer` in `try/except ImportError`. The
`SymbolicShapeInferenceHelper.__init__` now raises a clear `ImportError`
when sympy is unavailable, instead of failing at module load time.
- `onnxruntime/test/python/quantization/test_quant_preprocess.py`: add
`test_skip_symbolic_shape_does_not_require_sympy` which removes sympy
from `sys.modules` and verifies `quant_pre_process(...,
skip_symbolic_shape=True)` completes successfully.

No public API signatures change. Users who want symbolic shape inference
install sympy as before (`pip install sympy` or `pip install
onnxruntime[symbolic]`).
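
The deferred-import pattern described above looks roughly like the following. This is a generic sketch, not the code from the PR: `require_optional` and `preprocess` are hypothetical names, and `preprocess` only stands in for a function with a `skip_symbolic_shape`-style gate.

```python
import importlib


def require_optional(module_name: str, install_hint: str):
    """Import a module on demand, raising an actionable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: {install_hint}"
        ) from exc


def preprocess(skip_symbolic_shape: bool = True) -> str:
    """Only touch the optional dependency when the gated step actually runs."""
    if not skip_symbolic_shape:
        require_optional("sympy", "pip install sympy")
    return "done"
```

Because the import happens inside the gated branch, calling `preprocess(skip_symbolic_shape=True)` never attempts to load the optional dependency.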

## Test Plan
- `python -m pytest
onnxruntime/test/python/quantization/test_quant_preprocess.py -v` — all
tests pass including the new coverage.
- Smoke-tested locally: `import onnxruntime.quantization` no longer
pulls `sympy` into `sys.modules`.
- `lintrunner -a` clean on all changed files.

Fixes microsoft#24872
microsoft#28346)

### Description

Fix `ReleaseVersionSuffix` passing in 'Nuget Test' pipeline.
### Description

Rename `ort_api_1_to_26` -> `ort_api_1_to_27`.

### Motivation and Context

This should have been done in microsoft#28324, but we wanted to merge ASAP.
…n GQA (microsoft#28358)

## Summary


## Problem

The Memory-Efficient Attention (MEA) path crashes with
`cudaErrorMisalignedAddress` when:
- GQA mode (`q_num_heads != kv_num_heads`)
- `head_size != v_head_size` (e.g., Q.head_dim=256, K.head_dim=512)
- `seq_len >= 4` (Flash Attention not eligible due to attention mask)

This is because MEA's `LaunchUngroup` requires equal head sizes, but the
dispatch logic only checked this constraint for the past_key case (line
1380), not the general GQA case.

## Fix

Skip MEA for GQA when head sizes differ. The Unfused Attention fallback
handles this correctly.
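
The added condition can be sketched as a predicate. `mea_allowed_for_head_sizes` is a hypothetical name, and the sketch covers only this one guard, not the other MEA eligibility checks:

```python
def mea_allowed_for_head_sizes(q_num_heads: int, kv_num_heads: int,
                               head_size: int, v_head_size: int) -> bool:
    """Reject MEA for GQA calls whose Q/K and V head sizes differ."""
    is_gqa = q_num_heads != kv_num_heads
    return not (is_gqa and head_size != v_head_size)
```

For the configuration in the problem statement (`q_num_heads=8`, `kv_num_heads=1`, `head_size=256`, `v_head_size=512`) the predicate is false, so dispatch would move on to the unfused fallback.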

## Affected Models

Gemma 4 was not affected; the failing case came from a previously
incorrect graph. The fix is still worth having because it improves
robustness.

~~**Gemma4** (google/gemma-4-e2b-it) with KV sharing:~~
- Layers 15-34 borrow K,V from source layers
- Q projection: 1536 → 2048 (8 heads × 256)
- K/V from source: [batch, 1, seq, 512]
- `head_size = 256`, `v_head_size = 512`

## Testing

Minimal repro (from microsoft#28357):
```python
# Attention(Q=[1,S,2048], K=[1,S,512], V=[1,S,512], q_num_heads=8, kv_num_heads=1)
# Before fix: seq=4+ crashes with misaligned address
# After fix: all seq lengths work
```

Full Gemma4 decoder (35 layers, 15 GQA + 20 standard Attention):
- Prefill seq=32: ✅
- Decode seq=1: ✅

Fixes microsoft#28357

Signed-off-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
When loading data into tensors from memory buffers from external files,
byteswap it if necessary.

Also add a fix for deleter when byteswapping: keep copy of AllocatorPtr
instead of reference.
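
As an illustration of the byteswap step (not the C++ change itself), the sketch below assumes the external file stores 32-bit floats in little-endian order and swaps them only on a big-endian host such as s390x. `load_little_endian_floats` is a hypothetical helper.

```python
import struct
import sys
from array import array


def load_little_endian_floats(buffer: bytes) -> array:
    """Interpret a little-endian float32 buffer in host byte order."""
    values = array("f")
    values.frombytes(buffer)
    if sys.byteorder == "big":
        # The host reads bytes in the opposite order, so swap each element.
        values.byteswap()
    return values


raw = struct.pack("<2f", 1.0, 2.5)  # explicit little-endian encoding
decoded = list(load_little_endian_floats(raw))
```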

### Motivation and Context
While trying to set up local s390x CI, I found 4 more tests that fail
on s390x:

- CApiTest.TestLoadModelFromArrayWithExternalInitializerFromFileArray
- CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileArray
- CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileArrayPathRobust
- CApiTest.TestLoadModelFromArrayWithExternalInitializersFromFileMmap
…es/vite-default (microsoft#28304)

Bumps [postcss](https://github.com/postcss/postcss) from 8.5.3 to
8.5.13.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/postcss/postcss/releases">postcss's
releases</a>.</em></p>
<blockquote>
<h2>8.5.13</h2>
<ul>
<li>Fixed <code>postcss-scss</code> commend regression.</li>
</ul>
<h2>8.5.12</h2>
<ul>
<li>Fixed reading any file via user-generated CSS.</li>
<li>Added <code>opts.unsafeMap</code> to disable checks.</li>
</ul>
<h2>8.5.11</h2>
<ul>
<li>Fixed nested brackets parsing performance (by <a
href="https://github.com/offset"><code>@​offset</code></a>).</li>
</ul>
<h2>8.5.10</h2>
<ul>
<li>Fixed XSS via unescaped <code>&lt;/style&gt;</code> in non-bundler
cases (by <a
href="https://github.com/TharVid"><code>@​TharVid</code></a>).</li>
</ul>
<h2>8.5.9</h2>
<ul>
<li>Speed up source map encoding paring in case of the error.</li>
</ul>
<h2>8.5.8</h2>
<ul>
<li>Fixed <code>Processor#version</code>.</li>
</ul>
<h2>8.5.7</h2>
<ul>
<li>Improved source map annotation cleaning performance (by CodeAnt
AI).</li>
</ul>
<h2>8.5.6</h2>
<ul>
<li>Fixed <code>ContainerWithChildren</code> type discriminating (by <a
href="https://github.com/Goodwine"><code>@​Goodwine</code></a>).</li>
</ul>
<h2>8.5.5</h2>
<ul>
<li>Fixed <code>package.json</code>→<code>exports</code> compatibility
with some tools (by <a
href="https://github.com/JounQin"><code>@​JounQin</code></a>).</li>
</ul>
<h2>8.5.4</h2>
<ul>
<li>Fixed Parcel compatibility issue (by <a
href="https://github.com/git-sumitchaudhary"><code>@​git-sumitchaudhary</code></a>).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/postcss/postcss/commit/af58cf1b7af02e9b9fcb138a4a2d7ef3450158b1"><code>af58cf1</code></a>
Release 8.5.13 version</li>
<li><a
href="https://github.com/postcss/postcss/commit/f227dbd0e9443e5f33e18e633b8b4d2b55aac5ee"><code>f227dbd</code></a>
Temporary ignore pnpm 11 config</li>
<li><a
href="https://github.com/postcss/postcss/commit/d3abd40d723cf3559e5ddb5fc738b7cb64e92bb0"><code>d3abd40</code></a>
Update dependencies</li>
<li><a
href="https://github.com/postcss/postcss/commit/dd06c3e11362087bc18f9c20cee30fd82bda3de9"><code>dd06c3e</code></a>
Revert stringifier changes because of the conflict with
postcss-scss</li>
<li><a
href="https://github.com/postcss/postcss/commit/ae889c815fb88d785401a88f1a7dfc8cb11915fb"><code>ae889c8</code></a>
Try to fix CI</li>
<li><a
href="https://github.com/postcss/postcss/commit/e0093e49bcf00347383a13e40bb1f67bc823ca15"><code>e0093e4</code></a>
Move to pnpm 11</li>
<li><a
href="https://github.com/postcss/postcss/commit/9bc81c48f054a630c9a2e3868263b7ad4fc15013"><code>9bc81c4</code></a>
Release 8.5.12 version</li>
<li><a
href="https://github.com/postcss/postcss/commit/85c4d7dab830be366f8a96047f9e5b7944e101d8"><code>85c4d7d</code></a>
Another try to fix coverage</li>
<li><a
href="https://github.com/postcss/postcss/commit/94484cae6d4308167939f2ac888d166bd80dff01"><code>94484ca</code></a>
Try to fix coverage</li>
<li><a
href="https://github.com/postcss/postcss/commit/c64b7488d2731dfa16213739b42c34faf5a9eba3"><code>c64b748</code></a>
Load only .map source maps</li>
<li>Additional commits viewable in <a
href="https://github.com/postcss/postcss/compare/8.5.3...8.5.13">compare
view</a></li>
</ul>
</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…crosoft#28350)

### Description

Fix `onnxruntime-node` and `onnxruntime-common` NPM packages lacking an
RC suffix when built in Release + RC mode.
The resulting suffix, of the form `-QUAL.DATE-COMMIT`, is not ideal:
because of same-version checks/enforcement, it will break the publishing
pipeline if the packaging pipelines (zip-nuget and NPM) span more than a
single day.

### Motivation and Context

Missing the RC qualifier/suffix fails the NPM publish pipeline, which
correctly asserts that the packages (onnxruntime-node,
onnxruntime-common, and onnxruntime-web) do not share a common version
specifier.
…ize op (microsoft#28345)

### Description

`ROUND_PREFER_CEIL` in the Resize operator used bare
`std::round`/`roundf`, which rounds away from zero. This is correct for
positive halfway values (e.g., `round(0.5) = 1 = ceil(0.5)`) but wrong
for negative halfway values (e.g., `round(-0.5) = -1`, but `ceil(-0.5) =
0`).

Negative coordinates occur naturally with the `half_pixel` coordinate
transformation mode for the first output pixels when upsampling.

Added an explicit negative-halfway check, mirroring the existing
positive-halfway check in `ROUND_PREFER_FLOOR`:

```cpp
// CPU (upsamplebase.h)
case ROUND_PREFER_CEIL:
  return [](float x_original, bool) {
    if (x_original == static_cast<int64_t>(x_original) - 0.5f) {
      return static_cast<int64_t>(std::ceil(x_original));
    }
    return static_cast<int64_t>(std::round(x_original));
  };
```

Same fix applied to the CUDA implementation (`resize_impl.cu`).
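
The rounding rule can be checked quickly in Python. The sketch below mirrors the C++ snippet above under the assumption that `std::round` rounds halfway cases away from zero; `round_half_away_from_zero` and `round_prefer_ceil` are hypothetical helper names.

```python
import math


def round_half_away_from_zero(x: float) -> int:
    """Emulate C++ std::round, which rounds ties away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def round_prefer_ceil(x: float) -> int:
    """Round to nearest, choosing the ceiling at exact .5 ties."""
    # Negative ties look like trunc(x) - 0.5, e.g. -0.5 or -1.5.
    if x == math.trunc(x) - 0.5:
        return math.ceil(x)
    return round_half_away_from_zero(x)
```

Without the tie check, `-0.5` would round to `-1`; with it, both `-0.5` and `0.5` round toward the ceiling (`0` and `1` respectively).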

Added two test cases in `resize_op_test.cc`:
1. `ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel` — exercises
non-integer scale (26→64) from the original issue report, verifying
correct source pixel selection at fractional boundaries.
2. `ResizeOpNearestUpSample_RoundPreferCeil_HalfPixel_2x2to7x8` —
exercises a positive 0.5 boundary where `round_prefer_ceil` selects
ceiling.

### Motivation and Context

The `round_prefer_floor` path already had an explicit halfway-case
override (for positive values where `std::round` disagrees with floor).
The `round_prefer_ceil` path was missing the symmetric fix for negative
values, violating the ONNX spec semantics of "at ties, prefer ceiling."

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Add Python wheel packaging support for the CUDA plugin EP, following the
WebGPU plugin EP packaging pattern from microsoft#28226.

Changes include:
- Add `plugin-ep-cuda/python` packaging sources for the
`onnxruntime-ep-cuda` wheel.
- Add helper APIs to locate/register the CUDA plugin EP shared library.
- Add Linux and Windows x64 Python package jobs that consume the CUDA
plugin binary artifacts.
- Extend plugin package version setup to emit a PEP 440-compatible
`PluginPythonPackageVersion`.
- Add a Linux Docker helper script to build the CUDA plugin Python wheel
inside the manylinux CUDA image.

### Validation

- Parsed touched Azure pipeline YAML files with PyYAML.
- Ran Python syntax checks for the new package helper and wheel builder.

### Notes

The Linux Python package job is limited to x64 for now, matching the
existing x64 plugin artifact packaging flow.

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Xiaoxi Han <xiha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: Kotomi-Du <yaru.du@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com>
Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
### Description

This pull request adds C#/.NET (NuGet) packaging support for the WebGPU
plugin Execution Provider, including all necessary project files,
documentation, and helper code. It introduces a new NuGet package
(`Microsoft.ML.OnnxRuntime.EP.WebGpu`), updates the main plugin
documentation to reflect C# support, and provides detailed instructions
and code samples for building, packaging, and using the provider in .NET
applications.

It also has some minor changes for the existing Python packaging setup.

The most important changes are:

**C#/.NET Packaging Infrastructure:**
- Added the `Microsoft.ML.OnnxRuntime.EP.WebGpu` project (`.csproj`) for
NuGet packaging, including metadata, dependency management, and logic to
read the minimum ONNX Runtime version from a shared file. Native
binaries are included per platform, and the README is bundled in the
package.
- Introduced the `WebGpuEp.cs` helper class to resolve the native
library path and EP name at runtime, simplifying registration and usage
in .NET.

**Documentation:**
- Added a detailed `README.md` for the C# package, including usage
instructions, supported platforms, and example code for registering and
using the WebGPU EP in .NET.
- Added a top-level `csharp/README.md` with instructions for building,
packaging, and testing the NuGet package, as well as information on CI
integration and native binary requirements.

### Motivation and Context

Create WebGPU plugin EP NuGet package.
…to 25 (microsoft#27744)

### Description

Extends CUDA Cast kernel registration to cover opset 25 (latest ONNX
spec). The existing non-versioned opset 23 registration is capped to
VERSIONED (23, 24), and a new non-versioned opset 25 registration is
added for all type specializations.

**`cast_op.cc`**:
- `REGISTER_KERNEL_TYPED(T)`: opset 23 → VERSIONED (23, 24), added
non-versioned opset 25
- Renamed `REGISTER_KERNEL_TYPED_23` → `REGISTER_KERNEL_TYPED_23_TO_24`
(VERSIONED)
- Added `REGISTER_KERNEL_TYPED_25` macro (non-versioned)
- Renamed `SPECIALIZE_IMPL_19_TO_23` → `SPECIALIZE_IMPL_19_TO_25`,
covering Float8 types through opset 25
- Updated Float4E2M1x2 registration to use new versioned/non-versioned
macros

**`cuda_execution_provider.cc`**:
- Forward declarations: all opset 23 Cast entries → VERSIONED (23, 24),
added opset 25 non-versioned entries (all 16 types: 13 standard + 2
Float8 + 1 Float4)
- `BuildKernelCreateInfo`: same pattern — capped 23 to (23, 24), added
opset 25 block

### Motivation and Context

CUDA Cast operator was registered up to opset 23, but ONNX spec defines
Cast through opset 25. This gap can cause kernel lookup failures when
running models exported at opset 25. Part of the broader CUDA opset
gap-filling effort tracked in microsoft#27729.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Copilot <copilot@github.com>
…27755)

### Description

Extends CUDA ReduceMax and ReduceMin kernel registrations from opset 18
to opset 20.

- **`reduction_ops.cc`**: Added
`REGISTER_KERNEL_VERSIONED_RANGE_AXES_INPUT_TYPED` macro for versioned
ranges requiring `InputMemoryType(OrtMemTypeCPUInput, 1)`. Split both
operators from 2-way (1–17, 18+) to 3-way (1–17, 18–19, 20+).
- **`cuda_execution_provider.cc`**: Capped opset 18 forward declarations
and `BuildKernelCreateInfo` entries to versioned 18–19. Added opset 20
non-versioned entries for both operators.

Type coverage is unchanged: ReduceMax supports float, double, MLFloat16,
int32_t, and int64_t; ReduceMin supports the same types plus int8_t and uint8_t.
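
The 2-way to 3-way split can be sketched as a small transformation on opset ranges. This is an illustrative model only; `cap_and_extend` and `is_contiguous` are hypothetical helpers, not part of ONNX Runtime, and `None` stands in for an open-ended (non-versioned) registration.

```python
from typing import List, Optional, Tuple

OpsetRange = Tuple[int, Optional[int]]  # (start, inclusive end or None for open-ended)


def cap_and_extend(ranges: List[OpsetRange], new_opset: int) -> List[OpsetRange]:
    """Cap the trailing open-ended range just below new_opset and append a new open-ended one."""
    closed = list(ranges[:-1])
    last_start, last_end = ranges[-1]
    if last_end is not None:
        raise ValueError("the last registration must be open-ended")
    if new_opset <= last_start:
        raise ValueError("new opset must be later than the current open-ended start")
    return closed + [(last_start, new_opset - 1), (new_opset, None)]


def is_contiguous(ranges: List[OpsetRange]) -> bool:
    """Check that ranges leave no gap or overlap and end with an open-ended entry."""
    for (_, end), (next_start, _) in zip(ranges, ranges[1:]):
        if end is None or end + 1 != next_start:
            return False
    return ranges[-1][1] is None


# The ReduceMax/ReduceMin layout before and after the change described above.
two_way = [(1, 17), (18, None)]
three_way = cap_and_extend(two_way, 20)
```

Here `three_way` holds the ranges 1–17, 18–19, and 20+, matching the split listed for `reduction_ops.cc`.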

### Motivation and Context

ReduceMax and ReduceMin CUDA registrations stopped at opset 18; ONNX
latest is opset 20. Models exported with opset 19–20 could fail to find
a matching CUDA kernel for these ops.

Follows the same pattern used in microsoft#27735 (TopK) and other opset gap PRs
tracked in microsoft#27729.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description

Extends RNN CUDA kernel registration from opset 14 to opset 22,
following the standard opset gap-filling pattern:

- **`rnn.cc`**: Cap existing opset 14 non-versioned kernel to versioned
14–21; add new non-versioned kernel at opset 22
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries to match (versioned 14–21 +
non-versioned 22); remove duplicate GRU opset 22 entries introduced
during merge
- **`OperatorKernels.md`**: Update CUDA RNN entry to reflect three
tiers: `[7,13]`, `[14,21]`, `22+`

No behavioral changes — the operator implementation is identical across
opset 14–22. This is a registration-only change.

### Motivation and Context

RNN CUDA operator was registered at opset 14 while ONNX defines it
through opset 22, causing models exported at newer opsets to fall back
to CPU. Part of the broader CUDA EP opset gap effort tracked in microsoft#27729.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
### Description

Extends the Reshape CUDA kernel registration from opset 23 to opset 25,
following the same pattern used in microsoft#27728.

- **`reshape.cc`**: Cap existing non-versioned opset 23 kernel →
versioned (23, 24); add new non-versioned kernel at opset 25
- **`cuda_execution_provider.cc`**: Update forward declaration and
`BuildKernelCreateInfo` for versioned (23, 24); add opset 25 entries
- **`docs/OperatorKernels.md`**: Update Reshape CUDA EP entry from `23+`
to `25+` and add `[23, 24]` versioned range row

No functional changes to the kernel itself — the opset 25 schema is
backward-compatible with opset 23.

### Motivation and Context

Reshape is listed as a P1 gap in microsoft#27729 (CUDA max opset 23, ONNX latest
opset 25). Models exported at opset 25 would fail to find a matching
Reshape kernel on the CUDA EP.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Bumps two pinned versions of
[brace-expansion](https://github.com/juliangruber/brace-expansion).
These dependencies needed to be updated together.
Updates `brace-expansion` from 1.1.11 to 1.1.13
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/juliangruber/brace-expansion/releases">brace-expansion's
releases</a>.</em></p>
<blockquote>
<h2>v1.1.12</h2>
<ul>
<li>pkg: publish on tag 1.x  c460dbd</li>
<li>fmt  ccb8ac6</li>
<li>Fix potential ReDoS Vulnerability or Inefficient Regular Expression
(<a
href="https://redirect.github.com/juliangruber/brace-expansion/issues/65">#65</a>)
c3c73c8</li>
</ul>
<hr />
<p><a
href="https://github.com/juliangruber/brace-expansion/compare/v1.1.11...v1.1.12">https://github.com/juliangruber/brace-expansion/compare/v1.1.11...v1.1.12</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/6c353caf23beb9644f858eb3fe38d43a68b82898"><code>6c353ca</code></a>
1.1.13</li>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/7fd684f89fdde3549563d0a6522226a9189472a2"><code>7fd684f</code></a>
Backport fix for GHSA-f886-m6hf-6m8v (<a
href="https://redirect.github.com/juliangruber/brace-expansion/issues/95">#95</a>)</li>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/44f33b47c5c6a965d507421af43e86cf5971d711"><code>44f33b4</code></a>
1.1.12</li>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/c460dbd68e428d147b2080622d8ce126c7a08570"><code>c460dbd</code></a>
pkg: publish on tag 1.x</li>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/ccb8ac6d4292b7661b677fe048ba6690c877f51f"><code>ccb8ac6</code></a>
fmt</li>
<li><a
href="https://github.com/juliangruber/brace-expansion/commit/c3c73c8b088defc70851843be88ccc3af08e7217"><code>c3c73c8</code></a>
Fix potential ReDoS Vulnerability or Inefficient Regular Expression (<a
href="https://redirect.github.com/juliangruber/brace-expansion/issues/65">#65</a>)</li>
<li>See full diff in <a
href="https://github.com/juliangruber/brace-expansion/compare/1.1.11...v1.1.13">compare
view</a></li>
</ul>
</details>
<br />

Updates `brace-expansion` from 2.0.1 to 2.0.3



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description

Caps existing non-versioned CUDA kernel registrations and adds new
registrations at the latest ONNX opset:

- **Round**: opset 11 (non-versioned) → versioned 11–21 + new opset 22
- **Equal**: opset 13 (non-versioned) → versioned 13–18 + new opset 19

Changes across three files:
- `unary_elementwise_ops.cc` — `UNARY_OP_HFD(Round, 11)` →
`UNARY_OP_VERSIONED_HFD` + `UNARY_OP_HFD`
- `binary_elementwise_ops.cc` —
`BINARY_LOGICALOP_REGISTER_UZILHFD(Equal, 13)` → versioned 13–18 + new
19 (same for `bool` typed registration)
- `cuda_execution_provider.cc` — corresponding forward declarations and
`BuildKernelCreateInfo` entries

No type changes; both operators retain their existing CUDA type support
at the new opsets.

### Motivation and Context

Part of the ongoing effort to close ONNX opset coverage gaps in the
CUDA execution provider (microsoft#27729). Without these
registrations, models targeting opset 19+ (Equal) or 22+ (Round) fall
back from CUDA to CPU.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…crosoft#28380)

## Summary

- Remove the `bias == nullptr` requirement from
`CanApplyFlashAttention`, enabling FlashAttention for MultiHeadAttention
nodes with QKV bias (e.g., whisper decoder).
- Apply `TransferBSDToBNSH` to add bias and transpose Q/K/V to BNSH
format before calling FlashAttention.
- Handle cross-attention (only Q needs bias+transpose, K/V already BNSH
from encoder) and self-attention (all Q/K/V need bias+transpose)
separately.
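
The bias-add plus layout change can be pictured with a small pure-Python model. This is a sketch of the data movement only, not the WebGPU shader: it assumes the hidden dimension is laid out as `num_heads` contiguous blocks of `head_size`, and `add_bias_bsd_to_bnsh` is a hypothetical name. BSD means (batch, sequence, hidden) and BNSH means (batch, num_heads, sequence, head_size).

```python
from typing import List

Tensor3 = List[List[List[float]]]        # [batch][sequence][hidden]
Tensor4 = List[List[List[List[float]]]]  # [batch][num_heads][sequence][head_size]


def add_bias_bsd_to_bnsh(x: Tensor3, bias: List[float], num_heads: int) -> Tensor4:
    """Add a per-channel bias, then regroup each row's channels by attention head."""
    hidden = len(bias)
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    head_size = hidden // num_heads
    return [
        [
            [
                [row[n * head_size + h] + bias[n * head_size + h] for h in range(head_size)]
                for row in batch  # one row per sequence position
            ]
            for n in range(num_heads)
        ]
        for batch in x
    ]


# batch=1, sequence=2, hidden=4, split into 2 heads of size 2
q = [[[1, 2, 3, 4], [5, 6, 7, 8]]]
q_bias = [10, 20, 30, 40]
q_bnsh = add_bias_bsd_to_bnsh(q, q_bias, num_heads=2)
```

In the cross-attention case described above, only Q would go through a step like this, since K and V already arrive in BNSH layout from the encoder.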

## Motivation

Whisper decoder's MultiHeadAttention nodes all have QKV bias, which
previously forced them into the slower unfused attention path. Enabling
FlashAttention for these nodes yields ~45% speedup on whisper-tiny-int4
(~92 → ~134 tokens/s).

## Test plan

- [x] Existing MHA unit tests with bias data now exercise the
FlashAttention path on WebGPU with Subgroups support
- [x] whisper-tiny-int4 end-to-end: correct transcription at ~134 tps
(vs ~92 tps baseline)
- [x] clang-format passes
- [x] D3D12 build succeeds
…t#28123)

### Description

The OrtModelEditorApi C API functions (AddNodeToGraph, AddGraphToModel,
SetGraphInputs/SetGraphOutputs) take raw pointers and wrap them in
unique_ptr to transfer ownership. Without guards, callers can pass the
same pointer twice or call Release after ownership transfer, causing
double-free on destruction.

### Changes
- **AddInitializerToGraph**: Copy OrtValue internally instead of taking
raw pointer ownership. OrtValue uses shared_ptr for its data, so copying
is cheap (refcount increment). The caller retains ownership and is
responsible for releasing. This eliminates the double-free class
entirely for initializers.
- **AddNodeToGraph**: Add `owned_` flag to ModelEditorNode to reject
double-add, add null check
- **AddGraphToModel**: Reject if model already has a graph, add null
check for model. Add `owned_` flag to ModelEditorGraph to reject same
graph added to two models.
- **SetGraphInputs/SetGraphOutputs**: Add `owned_` flag to
ModelEditorValueInfo to reject already-owned ValueInfos. Detect
duplicate pointers in input arrays. Pre-allocate vector capacity before
ownership-transfer loop for exception safety.
- **ReleaseNode/ReleaseGraph/ReleaseValueInfo**: Check `owned_` flag
before deleting. If already owned by a graph/model, the release is a
safe no-op.
- **C++ wrapper**: Remove initializer.release() in AddInitializer to
match copy semantics.
- **Regression tests**: Tests covering ownership-transfer guard paths,
release-after-ownership, and duplicate detection.
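
The guard pattern in the list above can be modelled in a few lines. This is a language-neutral sketch written in Python, not the C API: `Handle`, `Graph`, `release`, and `OwnershipError` are hypothetical names, and the `owned` attribute stands in for the `owned_` flag.

```python
class OwnershipError(Exception):
    """Raised when a handle would end up with two owners."""


class Handle:
    """Stands in for a node or value-info object created by the caller."""

    def __init__(self, name):
        self.name = name
        self.owned = False     # set once a graph takes ownership
        self.released = False  # set when the caller's release actually frees it


class Graph:
    def __init__(self):
        self.nodes = []
        self.inputs = []

    def add_node(self, node):
        if node is None:
            raise OwnershipError("node is null")
        if node.owned:
            raise OwnershipError("node is already owned")  # rejects double-add
        node.owned = True
        self.nodes.append(node)

    def set_inputs(self, infos):
        # Validate everything first so a failure leaves no handle half-transferred.
        if len({id(i) for i in infos}) != len(infos):
            raise OwnershipError("duplicate pointer in inputs")
        if any(i.owned for i in infos):
            raise OwnershipError("value info is already owned")
        for info in infos:
            info.owned = True
        self.inputs = list(infos)


def release(handle):
    """Free the handle unless a graph owns it; returns True only if it was freed."""
    if handle.owned:
        return False  # safe no-op: the owner will destroy it
    handle.released = True
    return True
```

With this shape, adding the same handle twice fails loudly, and a release after the transfer does nothing instead of freeing memory the graph still owns.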

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…28250)

- Wrap 8x16x16 MatMulNBits(SubgroupMatrix) kernel body in M-tile loop
using uniforms.m_tiles_per_wg for tile assignment per workgroup
- Cap dispatch_y on Xe2/3-LPG when M > 2k, with occupancy factor 16x
- Non-Intel or small-M paths pass m_tiles_per_wg=1 (no behavior change)
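
A rough sketch of how a capped `dispatch_y` and an M-tile loop fit together is given below. It is an assumption-laden model, not the shader or the host code: `plan_dispatch` and `tiles_for_workgroup` are hypothetical helpers, `tile_m` and `max_workgroups_y` are placeholder parameters, and the occupancy factor and the Xe2/3-LPG detection mentioned above are not modelled.

```python
def plan_dispatch(m, tile_m, max_workgroups_y=None):
    """Return (dispatch_y, m_tiles_per_wg) so that every M tile is covered."""
    total_tiles = (m + tile_m - 1) // tile_m  # ceiling division
    if max_workgroups_y is None or total_tiles <= max_workgroups_y:
        return total_tiles, 1  # uncapped path: one tile per workgroup
    m_tiles_per_wg = (total_tiles + max_workgroups_y - 1) // max_workgroups_y
    dispatch_y = (total_tiles + m_tiles_per_wg - 1) // m_tiles_per_wg
    return dispatch_y, m_tiles_per_wg


def tiles_for_workgroup(wg_y, m_tiles_per_wg, total_tiles):
    """Tile indices one workgroup loops over; the last workgroup may get fewer."""
    first = wg_y * m_tiles_per_wg
    return list(range(first, min(first + m_tiles_per_wg, total_tiles)))


# Example with placeholder numbers: 4096 rows, 8-row tiles, at most 128 workgroups in y.
dispatch_y, per_wg = plan_dispatch(4096, 8, max_workgroups_y=128)
covered = [t for wg in range(dispatch_y) for t in tiles_for_workgroup(wg, per_wg, 512)]
```

When no cap applies, the plan degenerates to one tile per workgroup, which corresponds to the `m_tiles_per_wg=1` case in the last bullet.
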
@ai-fw-intg ai-fw-intg requested a review from ankitm3k May 6, 2026 20:32