Agentic SYCL support by danielfleischer · Pull Request #43 · IntelLabs/Xe-Forge

danielfleischer · 2026-06-08T10:35:27Z

Claude Code engine for SYCL XPU kernels (+ TFLOPS reporting, VTune profiling)

Summary

Adds a Claude Code optimization engine for SYCL/CUTLASS GEMM kernels on
Intel Xe, alongside the existing Triton path. The agent rewrites a whole .cpp
kernel each trial, then compiles (icpx) + benchmarks it via the existing
SyclExecutor, with correctness checked against a golden PyTorch/numpy
reference computed on the same bit-identical inputs. The change also adds a
TFLOPS / peak-utilization reporting convention and an independent VTune
gpu-hotspots profiler for SYCL kernels.

All GPU paths were developed and verified on real hardware — an Intel Arc Pro
B70 (Battlemage) with icpx + VTune 2026.0.

What's included

SYCL Claude engine (core)

SyclExecutor.compare_with_reference() — runs the optimized kernel, loads
D2.bin, reshapes to the golden's shape, compares with spec rtol/atol.
The single mockable GPU seam. Also maps "Arc Pro B70" → the bmg-g31 AOT
target.
benchmark skill split into _run_triton / _run_sycl on --dsl; the SYCL
path computes the golden via the PyTorch reference on the same
A.bin/B0.bin inputs and prints the uniform baseline_us/triton_us line
(the triton_us= token is kept verbatim so trial tooling parses uniformly
across DSLs).
cli reads {input}_pytorch.py as the golden reference for SYCL.
validator gains SYCL static checks: missing_main, missing_io_contract,
no_cutlass_include.
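The golden-reference check described for compare_with_reference() can be sketched as follows. This is a minimal illustration rather than the PR's implementation: the function name compare_with_golden, its return shape, and the stand-in "kernel output" written with tofile are assumptions; only the D2.bin f32 output, the reshape to the golden's shape, and the rtol/atol comparison come from the description above.

```python
import tempfile
from pathlib import Path

import numpy as np


def compare_with_golden(d2_path, golden, rtol=1e-2, atol=1e-2):
    """Load a raw f32 D2.bin, reshape it to the golden's shape, and compare.

    Hypothetical helper: the real compare_with_reference() lives in
    SyclExecutor and may differ in signature and return value.
    """
    out = np.fromfile(d2_path, dtype=np.float32)
    if out.size != golden.size:
        return False, f"size mismatch: got {out.size}, expected {golden.size}"
    out = out.reshape(golden.shape)
    golden32 = golden.astype(np.float32)
    ok = bool(np.allclose(out, golden32, rtol=rtol, atol=atol))
    max_abs = float(np.max(np.abs(out - golden32)))
    return ok, f"max abs diff {max_abs:.3e}"


# Stand-in for a kernel run: write the f32 matmul result as if it were D2.bin.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 32)).astype(np.float32)
b = rng.standard_normal((32, 48)).astype(np.float32)
golden = a @ b

with tempfile.TemporaryDirectory() as tmp:
    d2 = Path(tmp) / "D2.bin"
    golden.tofile(d2)
    ok, detail = compare_with_golden(d2, golden)

print(ok, detail)
```

Keeping the file read and the comparison in one function is what makes the seam easy to mock: a test can write any bytes to D2.bin and exercise the pass, fail, and size-mismatch branches without a GPU.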

Workspace generation

generator selects SYCL templates and a .cpp extension for dsl == sycl,
substitutes a compilable starter stub when the input has no #include, and
always writes the PyTorch golden reference as {name}_pytorch.py.
New templates: CLAUDE.sycl.md.j2, optimize-kernel.sycl.md.j2,
starter_kernel.sycl.cpp.j2 (a file-IO-honouring CUTLASS BMG GEMM).
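The generator's selection rule above is small enough to sketch directly. The helper names below (select_kernel_source, golden_reference_name) are hypothetical, and the ".py" extension for the non-SYCL path is an assumption; the "#include" test, the .cpp extension, and the {name}_pytorch.py convention are taken from the description.

```python
def select_kernel_source(dsl, input_text, starter_stub):
    """Return (extension, source) for the kernel file written to the workspace.

    Hypothetical sketch of the rule described above, not the generator's code.
    """
    if dsl != "sycl":
        # Assumption: the Triton path keeps a Python kernel file.
        return ".py", input_text
    if "#include" in input_text:
        # A real C++ input is kept verbatim.
        return ".cpp", input_text
    # PyTorch-only input: fall back to the compilable starter stub.
    return ".cpp", starter_stub


def golden_reference_name(name):
    """The PyTorch golden reference is always written next to the kernel."""
    return f"{name}_pytorch.py"


ext, source = select_kernel_source("sycl", "import torch\n", "// starter stub\n")
print(ext, golden_reference_name("gemm"))
```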

Knowledge base

sycl_io_contract.yaml — the runner harness contract (CLI args, A/B0/B1 .bin
layout with bf16-as-int16, D2.bin f32 output, perf-line format, CUTLASS
correctness constraints).
sycl_vtune.yaml — VTune verbs + metric→CUTLASS-knob mapping.
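To make the "bf16-as-int16" layout concrete, here is a minimal NumPy roundtrip. It assumes the .bin files hold the raw upper 16 bits of each float32 value with no header, and it uses round-to-nearest-even when narrowing; the PR does not state the rounding mode, so treat that as an assumption, as are the helper names.

```python
import tempfile
from pathlib import Path

import numpy as np


def f32_to_bf16_bits(x):
    """Narrow float32 to bf16 and return the bit patterns as int16.

    Assumption: round-to-nearest-even (NaN inputs are not special-cased).
    """
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    rounding = ((u >> 16) & 1) + np.uint32(0x7FFF)
    return ((u + rounding) >> 16).astype(np.uint16).view(np.int16)


def bf16_bits_to_f32(bits):
    """Widen int16-stored bf16 bit patterns back to float32."""
    u = np.ascontiguousarray(bits, dtype=np.int16).view(np.uint16)
    return (u.astype(np.uint32) << 16).view(np.float32)


# Values chosen to be exactly representable in bf16, so the roundtrip is exact.
a = np.array([[1.0, -2.5], [0.15625, 3.0]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "A.bin"
    f32_to_bf16_bits(a).tofile(path)               # write raw int16 bits
    bits = np.fromfile(path, dtype=np.int16).reshape(a.shape)
    restored = bf16_bits_to_f32(bits)

print(np.array_equal(restored, a))
```

Computing the golden from the widened values, rather than from the original float32 data, is what keeps the reference and the kernel on bit-identical inputs.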

TFLOPS / utilization reporting

DeviceConfig.peak_tflops (default 160.0, PEAK_TFLOPS env override).
The benchmark's Performance: line appends tflops=<f>, util=<f>% on both DSL
paths, where util = tflops / peak_tflops; the trial tree stores --tflops and
shows it in trial status; the tool-runner agent surfaces it.

VTune profiling (independent gpu-hotspots module)

SyclProfiler (core/sycl_profiler.py) — compiles via SyclExecutor, runs
the binary under vtune -collect gpu-hotspots (characterization), parses the
per-task report, and emits recommendations (memory-bound, low occupancy, high
idle, low XMX/DPAS, L3 thrashing). Kept fully separate from the Triton
gpu-offload profiler. profile skill dispatches on --dsl.
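A rough sketch of the collection step and its graceful-degradation path is shown below. The function names, the result dictionary, and the space-separated form of the kernel arguments are assumptions; -collect gpu-hotspots and -result-dir are standard VTune CLI options, and the kernel flags are the ones named in the file-IO contract. Any extra knobs the real SyclProfiler passes are not reproduced here.

```python
import shutil
import subprocess


def build_vtune_command(binary, result_dir, m, n, k, input_dir, output_dir,
                        vtune="vtune"):
    """Assemble a gpu-hotspots collection command for a compiled kernel."""
    kernel_args = [
        str(binary),
        "--m", str(m), "--n", str(n), "--k", str(k),
        "--input_dir", str(input_dir),
        "--output_dir", str(output_dir),
    ]
    return [vtune, "-collect", "gpu-hotspots",
            "-result-dir", str(result_dir), "--", *kernel_args]


def collect(cmd):
    """Run the collection, returning an error result instead of raising."""
    if shutil.which(cmd[0]) is None:
        return {"ok": False, "error": f"{cmd[0]} not found on PATH"}
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return {"ok": False, "error": proc.stderr.strip() or "collection failed"}
    return {"ok": True, "result_dir": cmd[cmd.index("-result-dir") + 1]}


cmd = build_vtune_command("./gemm", "vtune_r000", 1024, 1024, 1024, "in", "out")
print(" ".join(cmd))
```

Returning a plain error dictionary keeps the profile skill usable on machines without VTune, and gives tests a single subprocess seam to mock.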

Example

examples/sycl/ — hardware-verified worked example: baseline (t0, 256×256×32)
and trial (t1, 128×128×32) CUTLASS GEMM kernels, a bf16 spec, the PyTorch
golden reference, and a README with reproduction commands.

Design notes

DSL-agnostic engine. ClaudeEngine / create_engine are unchanged; the
work is DSL-aware workspace generation, a SYCL branch in the benchmark/profile
skills, the golden-reference comparison, and SYCL templates/KB.
Independent SYCL paths. SYCL kernels are compiled binaries with no
PyTorch Model, so the benchmark and VTune paths don't reuse the Python-runner
Triton code — they share data-layout conventions, not control flow.
The file-IO contract is the load-bearing piece. No pre-existing .cpp
implemented the A.bin/B0.bin/D2.bin + --input_dir/--output_dir
contract; it lived only inside SyclExecutor. The starter kernel implements
it and the KB documents it so the agent can produce a runnable kernel from
scratch.

Verification (Intel Arc Pro B70, real hardware)

Golden-reference correctness: bf16 GEMM vs f32 PyTorch matmul on identical
inputs passes np.allclose at rtol=1e-2, atol=1e-2 (max abs diff ~2e-5).
Benchmark skill: t0 compiles baseline+trial, checks vs golden →
Correctness: PASSED, e.g. baseline_us=194.20, triton_us=83.10,
speedup=2.34x, tflops=25.83, util=16.1%. The cached-baseline (t1+) path
skips the baseline rerun.
Workspace generation: .cpp input kept verbatim; PyTorch-only input
substitutes the starter stub; SYCL CLAUDE.md / --dsl sycl command / golden
ref / KB symlink all correct.
VTune profiling: xe-forge-skill profile --dsl sycl produces real Xe
hardware metrics (XVE Active/Stalled/Idle, occupancy, XMX/DPAS, L3) and
correct recommendations.
Unit tests: 52 pass (uv run pytest), platform-independent — the
icpx/GPU/vtune seams are mocked. Ruff lint + format clean.

Hardware/tooling gotchas surfaced (documented in VTUNE.md / KB)

VTune works on this box after the 2025.3→2026 upgrade, but
vtune-self-checker.sh gives a false negative (its bundled DPC++ test app
won't launch) — a real AOT bmg-g31 kernel profiles fine.
The B70 uses the xe kernel driver (not i915); gpu-hotspots works.
VTune column names containing a literal comma (GPU Memory Bandwidth, GB/sec:Read) break the -column parser; bandwidth is requested via the
comma-free substring GB/sec in a separate report pass.
_detect_device_target didn't recognize "Arc Pro B70" → added a bmg-g31
mapping.

Scope / limitations (v1)

GEMM-family kernels only (the file-IO inputs are GEMM-shaped). Non-GEMM
ops are out of scope.
Golden reference located by convention: {baseline_stem}_pytorch.py.
VTune gpu-hotspots splits metrics across sampling passes, so memory-bandwidth
can read near-zero for sub-millisecond kernels (sampling noise, not a parse
bug); the stable signals are XVE/occupancy/XMX.

Commits

Add SYCL golden-reference benchmarking infrastructure
Add SYCL workspace generation to the Claude engine
Add SYCL file-IO contract to the knowledge base
Add SYCL Claude-engine GEMM example
Add tests for the SYCL Claude engine
Report TFLOPS and peak utilization across benchmark and trials
Add SYCL VTune profiling via an independent gpu-hotspots module

Check CUTLASS SYCL kernels against a PyTorch/numpy golden reference computed on bit-identical inputs, rather than original-vs-optimized.
- sycl_executor: add compare_with_reference(), which runs the kernel, loads D2.bin, reshapes to the golden's shape, and compares with spec rtol/atol; map Arc Pro B70 to the bmg-g31 AOT target.
- benchmark skill: split run() into _run_triton / _run_sycl on --dsl; the SYCL path computes the golden via the PyTorch ref on the same A.bin/B0.bin inputs and prints the uniform baseline_us/triton_us line.
- validator: add missing_main, missing_io_contract, no_cutlass_include checks for SYCL sources.
- cli: read {input}_pytorch.py as the golden reference for SYCL.

Generate a DSL-aware agentic workspace for SYCL kernels.
- generator: select SYCL templates and a .cpp extension for dsl==sycl; substitute a compilable starter stub when the input has no #include; always write the PyTorch golden reference as {name}_pytorch.py.
- templates: CLAUDE.sycl.md.j2 and optimize-kernel.sycl.md.j2 (C++ workflow, file-IO contract, --dsl sycl / bench-xpu, CUTLASS rules), and starter_kernel.sycl.cpp.j2 (file-IO-honouring CUTLASS BMG GEMM).

knowledge_base/sycl/xpu/sycl_io_contract.yaml: the runner harness contract the agent must follow. It covers the CLI args (--m/--n/--k/--input_dir/--output_dir/--iterations/--verify), the A/B0/B1 .bin input layout (bf16 as int16 bits), the D2.bin f32 output, the perf-line format, and SYCL/CUTLASS correctness constraints (tile/subgroup/atom consistency, SLM budget, f32 accumulate). Loaded as optimization guidance alongside the existing cutlass_sycl_framework / xetla_patterns entries.

Hardware-verified worked example on Intel Arc Pro B70: baseline (t0, 256x256x32) and trial (t1, 128x128x32) CUTLASS GEMM kernels honouring the file-IO contract, a bf16 spec, the PyTorch golden reference, and a README with reproduction commands.

Platform-independent unit tests mocking the icpx/GPU seam:
- test_generator_sycl: .cpp emission, starter-stub substitution, SYCL CLAUDE.md content, Triton regression.
- test_benchmark_skill_routing: --dsl dispatch, cached-baseline skip.
- test_sycl_golden_reference: compare_with_reference pass/fail/compile-error/missing-D2 cases, bf16 .bin roundtrip, output-format regex.
- test_validator: SYCL missing_main / missing_io_contract / cutlass.

Surface achieved throughput and its percentage of the device's theoretical peak so results read as utilization, not just relative speedup (the B70's bf16 peak is ~160 TFLOPS).
- config: add DeviceConfig.peak_tflops (default 160.0), overridable via the PEAK_TFLOPS env var.
- benchmark skill: append `tflops=<f>, util=<f>%` to the Performance: line via a shared _perf_line() helper, on both the Triton and SYCL paths; util = tflops / peak_tflops. The us/speedup prefix is unchanged so existing parsers keep working.
- trial tree: add --tflops to `trial result`, persist it per trial, and show it in `trial status` (e.g. "18.9 TFLOPS").
- tool-runner agent: extract tflops/util from benchmark output.
- SYCL CLAUDE.md: record --tflops and reason about utilization vs the ~160 TFLOPS peak when choosing the next tile.
- examples/sycl/README: show the tflops/util fields and PEAK_TFLOPS.

Profile compiled CUTLASS SYCL kernels with VTune, mapping Intel Xe hardware metrics to CUTLASS tuning knobs. Kept separate from the Triton profiler: SYCL kernels are compiled binaries, so there is no Python runner and a different VTune analysis (gpu-hotspots characterization, not gpu-offload) is used.
- core/sycl_profiler.py: new SyclProfiler, which compiles via SyclExecutor, generates the same deterministic file-IO inputs as the benchmark, runs the binary under `vtune -collect gpu-hotspots`, parses the per-task hotspots report, and emits recommendations (memory-bound, low occupancy, high idle, low XMX/DPAS, L3 thrashing). Degrades gracefully to an error result when VTune is absent or collection fails.
- profile skill: split run() into _profile_triton / _profile_sycl on --dsl; add the previously-missing --dsl flag to the profile subparser.
- CLAUDE.sycl.md: wire the Profile step to `--dsl sycl --variant bench-xpu` and have the agent reason about the XVE/occupancy/XMX metrics when choosing the next tile.
- knowledge_base/sycl/xpu/sycl_vtune.yaml: the collect/report verbs and the metric -> CUTLASS-knob mapping.
- VTUNE.md: SYCL profiling section (verbs, metrics, knob table, and the xe-driver / self-checker-false-negative notes).
- tests: test_sycl_profiler.py and test_profile_skill_routing.py mock the vtune subprocess + compile seams (no VTune/GPU needed).

Verified on VTune 2026.0 + Arc Pro B70. Column names containing a literal comma ("GPU Memory Bandwidth, GB/sec:Read") break VTune's -column parser, so bandwidth is requested via the comma-free substring "GB/sec" in a separate report pass.

danielfleischer added 7 commits June 8, 2026 00:40

Add SYCL Claude-engine GEMM example

bce1d75

danielfleischer wants to merge 7 commits into main from df/sycl
