
SYCL: add oneMKL GEMM flash attention for XMX-accelerated prompt proc…#25025

Open
johnkarlhill wants to merge 1 commit into ggml-org:master from johnkarlhill:sycl-mkl-flash-attn

@johnkarlhill

Adds a flash attention path that routes Q·K^T and S·V matrix multiplies
through oneMKL GEMM, enabling XMX hardware acceleration on Intel GPUs.

Motivation

The existing SYCL flash attention kernels (VEC, TILE) run entirely in
SYCL subgroup operations. On Intel Arc GPUs with XMX matrix engines
(Battlemage and later), oneMKL GEMM can process the large matmuls in
attention significantly faster — particularly at high context lengths
where the KV cache is quantized.

When it activates

  • KV cache is quantized (q8_0, q4_0, q4_1, q5_0, q5_1, or any K-quant)
  • K sequence length ≥ 1024 tokens (covers the full --batch-size)
  • Q sequence length ≥ 128

These thresholds route prompt processing through MKL while leaving
single-token decode to the existing TG-optimized kernels. The path is
never activated for f16/bf16 KV cache — those already perform well
with the TILE kernel and graph capture.
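
The activation rule above can be read as a single predicate. The following is a minimal sketch under stated assumptions: the `KvType` enum, the constant names, and `use_mkl_flash_attn` are hypothetical illustrations and are not taken from fattn-mkl.cpp; only the thresholds (1024 and 128) and the "quantized only" rule come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the KV cache element types; not ggml's real enum.
enum class KvType { F16, BF16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, KQuant };

constexpr std::int64_t kMinKvLen = 1024; // K sequence length threshold from the list above
constexpr std::int64_t kMinQLen  = 128;  // Q sequence length threshold from the list above

// f16/bf16 caches are left to the existing kernels.
bool is_quantized(KvType t) {
    return t != KvType::F16 && t != KvType::BF16;
}

// Returns true when all three conditions from the list hold.
bool use_mkl_flash_attn(KvType kv_type, std::int64_t n_kv, std::int64_t n_q) {
    assert(n_kv >= 0 && n_q >= 0);
    return is_quantized(kv_type) && n_kv >= kMinKvLen && n_q >= kMinQLen;
}
```

Under this reading, a single-token decode step (`n_q == 1`) never satisfies the predicate, which matches the stated intent of leaving token generation to the existing kernels.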

Implementation

All logic is in one new file, fattn-mkl.cpp (567 lines). The pipeline:

  1. Dequantize K/V to fp16
  2. For each KV head: pack all GQA query heads into a single fp16 buffer
  3. Chunked KV loop (8192-token chunks):
    • MKL GEMM: KQ = Q_batched × K_chunk^T
    • Online softmax SYCL kernel (row-wise, with running max/sum)
    • MKL GEMM: VKQ_chunk = S × V_chunk
    • Accumulate: VKQ_accum += VKQ_chunk
  4. Normalize each GQA head by KQ_sum and scatter to output
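
To make the chunked loop concrete, here is a CPU-only sketch of the same structure for a single head. It is an illustration under assumptions, not the PR's code: plain nested loops stand in for the two oneMKL GEMM calls, `float` stands in for fp16, there is no mask, and the function name `chunked_attention` and the row-major layouts are hypothetical. The accumulator rescaling by `exp(old_max - new_max)` is the standard online-softmax correction; how the PR folds it into its softmax kernel is not shown in the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Q is [n_q x d], K and V are [n_kv x d], all row-major. Returns O as [n_q x d].
std::vector<float> chunked_attention(const std::vector<float>& Q,
                                     const std::vector<float>& K,
                                     const std::vector<float>& V,
                                     std::size_t n_q, std::size_t n_kv, std::size_t d,
                                     float scale, std::size_t chunk) {
    assert(chunk > 0 && n_kv > 0 && d > 0);
    assert(Q.size() == n_q * d && K.size() == n_kv * d && V.size() == n_kv * d);

    std::vector<float> acc(n_q * d, 0.0f);  // running, unnormalized output
    std::vector<float> row_max(n_q, -std::numeric_limits<float>::infinity());
    std::vector<float> row_sum(n_q, 0.0f);  // running softmax denominator
    std::vector<float> S;                   // scores for the current chunk

    for (std::size_t k0 = 0; k0 < n_kv; k0 += chunk) {
        const std::size_t len = std::min(chunk, n_kv - k0);
        S.assign(n_q * len, 0.0f);

        // Stand-in for GEMM 1: S = scale * Q x K_chunk^T
        for (std::size_t i = 0; i < n_q; ++i) {
            for (std::size_t j = 0; j < len; ++j) {
                float dot = 0.0f;
                for (std::size_t c = 0; c < d; ++c) {
                    dot += Q[i * d + c] * K[(k0 + j) * d + c];
                }
                S[i * len + j] = scale * dot;
            }
        }

        // Online softmax: update the running max and sum, rescale old state.
        for (std::size_t i = 0; i < n_q; ++i) {
            float m_new = row_max[i];
            for (std::size_t j = 0; j < len; ++j) {
                m_new = std::max(m_new, S[i * len + j]);
            }
            const float corr = std::exp(row_max[i] - m_new); // 0 on the first chunk
            row_sum[i] *= corr;
            for (std::size_t c = 0; c < d; ++c) {
                acc[i * d + c] *= corr;
            }
            for (std::size_t j = 0; j < len; ++j) {
                S[i * len + j] = std::exp(S[i * len + j] - m_new);
                row_sum[i] += S[i * len + j];
            }
            row_max[i] = m_new;
        }

        // Stand-in for GEMM 2 plus accumulation: acc += S x V_chunk
        for (std::size_t i = 0; i < n_q; ++i) {
            for (std::size_t j = 0; j < len; ++j) {
                for (std::size_t c = 0; c < d; ++c) {
                    acc[i * d + c] += S[i * len + j] * V[(k0 + j) * d + c];
                }
            }
        }
    }

    // Final normalization by the softmax denominator.
    for (std::size_t i = 0; i < n_q; ++i) {
        for (std::size_t c = 0; c < d; ++c) {
            acc[i * d + c] /= row_sum[i];
        }
    }
    return acc;
}
```

Because the correction factor keeps the running state consistent, the result should not depend on the chunk size beyond floating-point rounding, which is what allows a fixed chunk length to bound the size of the intermediate score buffer.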

GQA groups sharing a KV head are batched into single GEMM calls —
6 query heads × 1020 tokens = 6120 rows in one MKL call, amortizing
launch overhead.
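
The packing step can be sketched as a gather into one contiguous buffer. This is a hedged illustration only: the input layout `q[token][head][dim]`, the mapping of query heads `kv_head * n_gqa ... kv_head * n_gqa + n_gqa - 1` to one KV head, and the names `pack_gqa_queries` and `gemm_rows` are assumptions made for the example, and `float` stands in for fp16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed input layout:  q[(token * n_head + head) * d + c]
// Packed output layout:  packed[((g * n_tokens) + token) * d + c], g in [0, n_gqa)
std::vector<float> pack_gqa_queries(const std::vector<float>& q,
                                    std::size_t n_tokens, std::size_t n_head,
                                    std::size_t n_head_kv, std::size_t d,
                                    std::size_t kv_head) {
    assert(n_head_kv > 0 && n_head % n_head_kv == 0);
    assert(kv_head < n_head_kv);
    assert(q.size() == n_tokens * n_head * d);

    const std::size_t n_gqa = n_head / n_head_kv; // query heads per KV head
    std::vector<float> packed(n_gqa * n_tokens * d);

    for (std::size_t g = 0; g < n_gqa; ++g) {
        const std::size_t head = kv_head * n_gqa + g;
        for (std::size_t t = 0; t < n_tokens; ++t) {
            for (std::size_t c = 0; c < d; ++c) {
                packed[(g * n_tokens + t) * d + c] = q[(t * n_head + head) * d + c];
            }
        }
    }
    return packed;
}

// Row count of the left-hand GEMM operand after packing.
std::size_t gemm_rows(std::size_t n_gqa, std::size_t n_tokens) {
    return n_gqa * n_tokens;
}
```

With 6 query heads per KV head and 1020 tokens, `gemm_rows(6, 1020)` gives the 6120 rows quoted above, so each KV head needs one GEMM call per chunk rather than six.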

Performance (Arc Pro B70 / BMG-G21, 32 GB)

| Context | KV cache | PP t/s | TG t/s |
|---------|----------|--------|--------|
| 8K      | q8_0     | ~812   | ~23    |
| 110K    | q8_0     | ~335   | ~17    |

For comparison, stock bf16 KV cache with FA off on the same GPU achieves
~822 t/s PP at 8K. The MKL path with q8_0 (~812 t/s) is about 1.2% slower
while using a quantized KV cache.

Testing

  • test-backend-ops: 3605/3605 FLASH_ATTN_EXT tests pass (all
    quant types, head sizes 64–512, causal/non-causal masks, sinks,
    max_bias, GQA ratios, multi-batch)
  • Multi-batch: parallel-2 at 32K context, stable throughput,
    no coherence errors
  • Build isolation: all code is SYCL-only, gated behind
    BEST_FATTN_KERNEL_MKL enum; other backends and non-quantized
    paths are completely unaffected

Known limitations

  • No graph capture: MKL GEMM's internal queue management is
    incompatible with SYCL command graph replay. The existing
    GGML_SYCL_DISABLE_GRAPH default (1) handles this.
  • No ALiBi: max_bias == 0.0f asserted; models needing ALiBi
    will fall through to the TILE kernel.
  • No sinks tensor: dst->src[4] is not yet supported.

Debug output

Timing instrumentation is gated behind MKL_FA_DEBUG=1. In normal
operation the MKL path produces no output.
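
One way such gating could look is an environment check plus a timer that stays silent unless enabled. This is a sketch under assumptions, not the PR's instrumentation: only the variable name `MKL_FA_DEBUG` and the value `1` come from the text; `debug_flag_set`, `mkl_fa_debug_enabled`, `ScopedTimer`, and the log format are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Assumed semantics: only the exact string "1" enables debug output.
bool debug_flag_set(const char* value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

bool mkl_fa_debug_enabled() {
    return debug_flag_set(std::getenv("MKL_FA_DEBUG"));
}

// Hypothetical RAII timer: prints elapsed milliseconds to stderr when the
// scope ends, and prints nothing when disabled.
class ScopedTimer {
public:
    ScopedTimer(const char* label, bool enabled)
        : label_(label), enabled_(enabled), start_(std::chrono::steady_clock::now()) {
        assert(label != nullptr);
    }
    ~ScopedTimer() {
        if (!enabled_) {
            return;
        }
        const auto end = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(end - start_).count();
        std::fprintf(stderr, "[mkl-fa] %s: %.3f ms\n", label_, ms);
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    const char* label_;
    bool enabled_;
    std::chrono::steady_clock::time_point start_;
};
```

A caller would wrap a stage as `{ ScopedTimer t("kq_gemm", mkl_fa_debug_enabled()); /* work */ }`. Note that host-side timing of an asynchronous queue only measures submission unless the queue is waited on inside the scope.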

AI disclosure

Claude Code was used for SYCL boilerplate (ND-range kernel launches,
ggml_sycl_pool_alloc patterns) and initial drafting of the chunked KV
loop. All algorithmic decisions — oneMKL GEMM integration, online
softmax with GQA batching, activation thresholds, chunk sizing — were
human-directed. Comprehensive testing (3605 test-backend-ops,
multi-quant and multi-batch coherence validation, performance
benchmarking at contexts up to 110K) was performed manually.


🤖 Generated with Claude Code using DeepSeek-V4-Pro

@johnkarlhill johnkarlhill requested a review from a team as a code owner June 26, 2026 01:24
@github-actions github-actions Bot added the ggml and SYCL labels Jun 26, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 26, 2026

Hi @johnkarlhill, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@arthw arthw left a comment

Contributor

@johnkarlhill

It's good to see this PR enabling XMX in FA.

Could you share which LLMs show a good performance increase with this PR?
I use Qwen3.6 and can't trigger the oneMKL path in FA.

Thank you!

@johnkarlhill

Author

Adding before-and-after results, both compiled with arch flags, for a side-by-side comparison. Compiling without arch flags will lower performance relative to these numbers, but it should still beat stock.
PR25025 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt
b9752 - Qwen3.6-27B-MTP-UD-Q5_K_XL on B70.txt

I'll add more models if needed.

@arthw

arthw commented Jun 26, 2026

Contributor

@johnkarlhill
1. Could you provide a smaller LLM case that shows the perf increase from this PR, including the whole command?

2. As a user, how do I trigger the new code?

Thank you!
