Agentic SYCL support #43
Draft
danielfleischer wants to merge 7 commits into
Check CUTLASS SYCL kernels against a PyTorch/numpy golden reference
computed on bit-identical inputs, rather than original-vs-optimized.
- sycl_executor: add compare_with_reference() — run the kernel, load
D2.bin, reshape to the golden's shape, compare with spec rtol/atol;
map Arc Pro B70 to the bmg-g31 AOT target.
- benchmark skill: split run() into _run_triton / _run_sycl on --dsl;
the SYCL path computes the golden via the PyTorch ref on the same
A.bin/B0.bin inputs and prints the uniform baseline_us/triton_us line.
- validator: add missing_main, missing_io_contract, no_cutlass_include
checks for SYCL sources.
- cli: read {input}_pytorch.py as the golden reference for SYCL.
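The golden-reference check described in this commit can be sketched roughly as below. This is a minimal stdlib-only illustration, not the actual `compare_with_reference()` code: the helper names `load_f32` and `compare_with_golden` are hypothetical, little-endian storage for `D2.bin` is an assumption, and the tolerance test mirrors the criterion `np.allclose` uses.

```python
import math
import struct
from pathlib import Path


def load_f32(path, count):
    """Read `count` float32 values from a raw .bin file (assumed little-endian)."""
    data = Path(path).read_bytes()
    if len(data) != 4 * count:
        raise ValueError(f"{path}: expected {4 * count} bytes, got {len(data)}")
    return list(struct.unpack(f"<{count}f", data))


def compare_with_golden(output, golden, rtol, atol):
    """Return (passed, max_abs_diff) using |out - ref| <= atol + rtol * |ref|."""
    if len(output) != len(golden):
        return False, math.inf
    passed = True
    max_abs = 0.0
    for out, ref in zip(output, golden):
        diff = abs(out - ref)
        max_abs = max(max_abs, diff)
        # Written as "not <=" so that a NaN difference also counts as a failure.
        if not diff <= atol + rtol * abs(ref):
            passed = False
    return passed, max_abs
```

Because the golden is computed from the same `A.bin`/`B0.bin` bytes the kernel reads, any mismatch beyond the spec tolerances points at the kernel rather than at input generation.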
Generate a DSL-aware agentic workspace for SYCL kernels.
- generator: select SYCL templates and a .cpp extension for dsl==sycl;
substitute a compilable starter stub when the input has no #include;
always write the PyTorch golden reference as {name}_pytorch.py.
- templates: CLAUDE.sycl.md.j2 and optimize-kernel.sycl.md.j2 (C++
workflow, file-IO contract, --dsl sycl / bench-xpu, CUTLASS rules),
and starter_kernel.sycl.cpp.j2 (file-IO-honouring CUTLASS BMG GEMM).
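The DSL-aware file selection above might look something like the following sketch. The function name `plan_workspace_files`, the placeholder `STARTER_STUB`, and the returned dictionary shape are all assumptions made for illustration; the real generator renders Jinja templates and is not reproduced here.

```python
# Placeholder standing in for the rendered starter_kernel.sycl.cpp.j2 template.
STARTER_STUB = "#include <cutlass/cutlass.h>\nint main() { return 0; }\n"


def plan_workspace_files(name, dsl, source_text, golden_text):
    """Return {filename: contents} for the kernel files of a generated workspace."""
    files = {}
    if dsl == "sycl":
        # An input without any #include is treated as not-yet-C++ (for example a
        # PyTorch-only input), so a compilable starter stub is substituted.
        kernel = source_text if "#include" in source_text else STARTER_STUB
        files[f"{name}.cpp"] = kernel
        # The PyTorch golden reference is always written for SYCL workspaces.
        files[f"{name}_pytorch.py"] = golden_text
    else:
        # Triton path: the input is kept as a Python file (assumed behaviour).
        files[f"{name}.py"] = source_text
    return files
```

A `.cpp` input that already contains an `#include` passes through verbatim, which matches the behaviour the PR description reports for the generator.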
knowledge_base/sycl/xpu/sycl_io_contract.yaml documents the runner harness contract the agent must follow: CLI args (--m/--n/--k/--input_dir/--output_dir/--iterations/--verify), the A/B0/B1 .bin input layout (bf16 as int16 bits), the D2.bin f32 output, the perf-line format, and SYCL/CUTLASS correctness constraints (tile/subgroup/atom consistency, SLM budget, f32 accumulate). Loaded as optimization guidance alongside the existing cutlass_sycl_framework / xetla_patterns entries.
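To make the "bf16 as int16 bits" layout concrete, here is a small stdlib-only sketch of writing and reading such a buffer. The function names are hypothetical, little-endian byte order and round-to-nearest-even are assumptions, and NaN/overflow handling is deliberately left out.

```python
import struct


def f32_to_bf16_bits(x):
    """Round a float to bf16 (round-to-nearest-even) and return its 16 raw bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; NaN/inf not special-cased
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(bits16):
    """Widen 16 bf16 bits back to a float by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x


def pack_bf16(values):
    """Serialize floats as bf16 bit patterns stored in signed int16 slots."""
    raw = [f32_to_bf16_bits(v) for v in values]
    signed = [b - 0x10000 if b >= 0x8000 else b for b in raw]
    return struct.pack(f"<{len(signed)}h", *signed)


def unpack_bf16(data):
    """Inverse of pack_bf16: int16 slots back to Python floats."""
    count = len(data) // 2
    return [bf16_bits_to_f32(b) for b in struct.unpack(f"<{count}h", data)]
```

Values that are exactly representable in bf16 survive the roundtrip unchanged; anything else comes back with a relative error of at most about 2^-8, which is why the reference and the kernel must both start from the same stored bits.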
Hardware-verified worked example on Intel Arc Pro B70: baseline (t0, 256x256x32) and trial (t1, 128x128x32) CUTLASS GEMM kernels honouring the file-IO contract, a bf16 spec, the PyTorch golden reference, and a README with reproduction commands.
Platform-independent unit tests mocking the icpx/GPU seam:
- test_generator_sycl: .cpp emission, starter-stub substitution, SYCL CLAUDE.md content, Triton regression.
- test_benchmark_skill_routing: --dsl dispatch, cached-baseline skip.
- test_sycl_golden_reference: compare_with_reference pass/fail/compile-error/missing-D2 cases, bf16 .bin roundtrip, output-format regex.
- test_validator: SYCL missing_main / missing_io_contract / cutlass.
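The three SYCL validator checks exercised by test_validator could be approximated as below. This is only a sketch: the function name `validate_sycl_source` is hypothetical, and the tokens used to detect the IO contract are an assumption based on the CLI flags and `.bin` names listed in the knowledge base, not the validator's actual rules.

```python
import re


def validate_sycl_source(source):
    """Return the names of the static checks that fail for a SYCL source string."""
    issues = []
    # missing_main: the harness runs the compiled binary, so main() must exist.
    if not re.search(r"\bint\s+main\s*\(", source):
        issues.append("missing_main")
    # missing_io_contract: assumed markers for the file-IO contract.
    required = ("--input_dir", "--output_dir", "A.bin", "D2.bin")
    if not all(token in source for token in required):
        issues.append("missing_io_contract")
    # no_cutlass_include: at least one header from the cutlass/ tree.
    if not re.search(r'#\s*include\s*[<"]cutlass/', source):
        issues.append("no_cutlass_include")
    return issues
```

Checks like these need neither `icpx` nor a GPU, which is what lets them run in the platform-independent test suite.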
Surface achieved throughput and its percentage of the device's theoretical peak so results read as utilization, not just relative speedup (the B70's bf16 peak is ~160 TFLOPS).
- config: add DeviceConfig.peak_tflops (default 160.0), overridable via the PEAK_TFLOPS env var.
- benchmark skill: append `tflops=<f>, util=<f>%` to the Performance: line via a shared _perf_line() helper, on both the Triton and SYCL paths; util = tflops / peak_tflops. The us/speedup prefix is unchanged so existing parsers keep working.
- trial tree: add --tflops to `trial result`, persist it per trial, and show it in `trial status` (e.g. "18.9 TFLOPS").
- tool-runner agent: extract tflops/util from benchmark output.
- SYCL CLAUDE.md: record --tflops and reason about utilization vs the ~160 TFLOPS peak when choosing the next tile.
- examples/sycl/README: show the tflops/util fields and PEAK_TFLOPS.
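As an illustration of the reporting convention, the sketch below formats and parses a perf line with the appended fields. It is not the project's `_perf_line()` helper: the names `perf_line` and `parse_tflops` are hypothetical, the exact field layout is inferred from the commit text, and the FLOP count is left to the caller (for a plain GEMM, 2*M*N*K is the usual assumption).

```python
import os
import re


def perf_line(baseline_us, kernel_us, flops, peak_tflops=None):
    """Format a Performance: line with throughput and peak utilization appended."""
    if peak_tflops is None:
        # Default peak and env override as described in the commit message.
        peak_tflops = float(os.environ.get("PEAK_TFLOPS", "160.0"))
    tflops = flops / kernel_us / 1e6  # FLOP per microsecond -> TFLOP per second
    util = 100.0 * tflops / peak_tflops
    return (
        f"Performance: baseline_us={baseline_us:.2f}, triton_us={kernel_us:.2f}, "
        f"speedup={baseline_us / kernel_us:.2f}x, "
        f"tflops={tflops:.2f}, util={util:.1f}%"
    )


PERF_RE = re.compile(r"tflops=(?P<tflops>[\d.]+), util=(?P<util>[\d.]+)%")


def parse_tflops(line):
    """Extract (tflops, util_percent) from a perf line, or None if absent."""
    m = PERF_RE.search(line)
    return (float(m["tflops"]), float(m["util"])) if m else None


line = perf_line(200.0, 100.0, flops=2 * 1000**3, peak_tflops=160.0)
print(line)
```

Because the new fields are only appended, a parser that stops after `speedup=` keeps working, while newer tooling can pick up `tflops` and `util` with a pattern like `PERF_RE`.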
Profile compiled CUTLASS SYCL kernels with VTune, mapping Intel Xe
hardware metrics to CUTLASS tuning knobs. Kept separate from the Triton
profiler: SYCL kernels are compiled binaries, so there is no Python
runner and a different VTune analysis (gpu-hotspots characterization,
not gpu-offload) is used.
- core/sycl_profiler.py: new SyclProfiler — compiles via SyclExecutor,
generates the same deterministic file-IO inputs as the benchmark, runs
the binary under `vtune -collect gpu-hotspots`, parses the per-task
hotspots report, and emits recommendations (memory-bound, low
occupancy, high idle, low XMX/DPAS, L3 thrashing). Degrades gracefully
to an error result when VTune is absent or collection fails.
- profile skill: split run() into _profile_triton / _profile_sycl on
--dsl; add the previously-missing --dsl flag to the profile subparser.
- CLAUDE.sycl.md: wire the Profile step to `--dsl sycl --variant
bench-xpu` and have the agent reason about the XVE/occupancy/XMX
metrics when choosing the next tile.
- knowledge_base/sycl/xpu/sycl_vtune.yaml: the collect/report verbs and
the metric -> CUTLASS-knob mapping.
- VTUNE.md: SYCL profiling section (verbs, metrics, knob table, and the
xe-driver / self-checker-false-negative notes).
- tests: test_sycl_profiler.py and test_profile_skill_routing.py mock the
vtune subprocess + compile seams (no VTune/GPU needed).
Verified on VTune 2026.0 + Arc Pro B70. Column names containing a
literal comma ("GPU Memory Bandwidth, GB/sec:Read") break VTune's
-column parser, so bandwidth is requested via the comma-free substring
"GB/sec" in a separate report pass.
Claude Code engine for SYCL XPU kernels (+ TFLOPS reporting, VTune profiling)
Summary
Adds a Claude Code optimization engine for SYCL/CUTLASS GEMM kernels on
Intel Xe, alongside the existing Triton path. The agent rewrites a whole
`.cpp` kernel each trial, then compiles (`icpx`) and benchmarks it via the
existing `SyclExecutor`, with correctness checked against a golden
PyTorch/numpy reference computed on the same bit-identical inputs. The
change also adds a TFLOPS / peak-utilization reporting convention and an
independent VTune `gpu-hotspots` profiler for SYCL kernels.

All GPU paths were developed and verified on real hardware: an Intel Arc Pro
B70 (Battlemage) with `icpx` + VTune 2026.0.

What's included
SYCL Claude engine (core)
- `SyclExecutor.compare_with_reference()`: runs the optimized kernel, loads `D2.bin`, reshapes to the golden's shape, compares with spec `rtol`/`atol`. The single mockable GPU seam. Also maps "Arc Pro B70" → the `bmg-g31` AOT target.
- `benchmark` skill split into `_run_triton` / `_run_sycl` on `--dsl`; the SYCL path computes the golden via the PyTorch reference on the same `A.bin`/`B0.bin` inputs and prints the uniform `baseline_us`/`triton_us` line (the `triton_us=` token is kept verbatim so trial tooling parses uniformly across DSLs).
- `cli` reads `{input}_pytorch.py` as the golden reference for SYCL.
- `validator` gains SYCL static checks: `missing_main`, `missing_io_contract`, `no_cutlass_include`.

Workspace generation
- `generator` selects SYCL templates and a `.cpp` extension for `dsl == sycl`, substitutes a compilable starter stub when the input has no `#include`, and always writes the PyTorch golden reference as `{name}_pytorch.py`.
- `CLAUDE.sycl.md.j2`, `optimize-kernel.sycl.md.j2`, `starter_kernel.sycl.cpp.j2` (a file-IO-honouring CUTLASS BMG GEMM).

Knowledge base
- `sycl_io_contract.yaml`: the runner harness contract (CLI args, A/B0/B1 `.bin` layout with bf16-as-int16, `D2.bin` f32 output, perf-line format, CUTLASS correctness constraints).
- `sycl_vtune.yaml`: VTune verbs + metric→CUTLASS-knob mapping.

TFLOPS / utilization reporting
- `DeviceConfig.peak_tflops` (default 160.0, `PEAK_TFLOPS` env override).
- `Performance:` line appends `tflops=<f>, util=<f>%` on both DSL paths; the trial tree stores `--tflops` and shows it in `trial status`; the tool-runner agent surfaces it.
VTune profiling (independent gpu-hotspots module)
- `SyclProfiler` (`core/sycl_profiler.py`): compiles via `SyclExecutor`, runs the binary under `vtune -collect gpu-hotspots` (characterization), parses the per-task report, and emits recommendations (memory-bound, low occupancy, high idle, low XMX/DPAS, L3 thrashing). Kept fully separate from the Triton `gpu-offload` profiler.
- `profile` skill dispatches on `--dsl`.

Example
- `examples/sycl/` is a hardware-verified worked example: baseline (t0, 256×256×32) and trial (t1, 128×128×32) CUTLASS GEMM kernels, a bf16 spec, the PyTorch golden reference, and a README with reproduction commands.
Design notes
- `ClaudeEngine`/`create_engine` are unchanged; the work is DSL-aware workspace generation, a SYCL branch in the benchmark/profile skills, the golden-reference comparison, and SYCL templates/KB.
- PyTorch `Model`, so the benchmark and VTune paths don't reuse the Python-runner Triton code; they share data-layout conventions, not control flow.
- `.cpp` implemented the `A.bin`/`B0.bin`/`D2.bin` + `--input_dir`/`--output_dir` contract; it lived only inside `SyclExecutor`. The starter kernel implements it and the KB documents it so the agent can produce a runnable kernel from scratch.
Verification (Intel Arc Pro B70, real hardware)
- inputs passes `np.allclose` at `rtol=1e-2, atol=1e-2` (max abs diff ~2e-5).
- `Correctness: PASSED`, e.g. `baseline_us=194.20, triton_us=83.10, speedup=2.34x, tflops=25.83, util=16.1%`. Cached-baseline (t1+) path skips the baseline rerun.
- `.cpp` input kept verbatim; PyTorch-only input substitutes the starter stub; SYCL `CLAUDE.md` / `--dsl sycl` command / golden ref / KB symlink all correct.
- `xe-forge-skill profile --dsl sycl` produces real Xe hardware metrics (XVE Active/Stalled/Idle, occupancy, XMX/DPAS, L3) and correct recommendations.
- `uv run pytest`), platform-independent: the `icpx`/GPU/`vtune` seams are mocked. Ruff lint + format clean.

Hardware/tooling gotchas surfaced (documented in VTUNE.md / KB)
- `vtune-self-checker.sh` gives a false negative (its bundled DPC++ test app won't launch); a real AOT `bmg-g31` kernel profiles fine.
- `xe` kernel driver (not `i915`); `gpu-hotspots` works.
- Column names containing a literal comma (`GPU Memory Bandwidth, GB/sec:Read`) break the `-column` parser; bandwidth is requested via the comma-free substring `GB/sec` in a separate report pass.
- `_detect_device_target` didn't recognize "Arc Pro B70" → added a `bmg-g31` mapping.
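The comma-in-column-name workaround can be pictured with the sketch below, which plans one `vtune -report` invocation for the comma-free columns and a separate one per comma-free fragment. Everything here is illustrative: `build_report_passes` and `_report_argv` are hypothetical names, the `computing-task` grouping and CSV options are assumptions about how the report is requested, and the real `SyclProfiler` may build its commands differently.

```python
def _report_argv(result_dir, columns):
    """Build one report command; -column takes a comma-separated substring list."""
    return [
        "vtune", "-report", "hotspots",
        "-result-dir", result_dir,
        "-group-by=computing-task",  # assumed per-task grouping
        "-format=csv",
        "-csv-delimiter=comma",
        "-column=" + ",".join(columns),
    ]


def build_report_passes(result_dir, columns):
    """Return one argv per report pass, keeping comma-named columns apart."""
    safe = [c for c in columns if "," not in c]
    fragments = []
    for col in columns:
        if "," in col:
            # Use the text between the last comma and the first following colon,
            # e.g. "GPU Memory Bandwidth, GB/sec:Read" -> "GB/sec".
            fragment = col.rsplit(",", 1)[1].split(":", 1)[0].strip()
            if fragment not in fragments:
                fragments.append(fragment)
    passes = [_report_argv(result_dir, safe)] if safe else []
    passes.extend(_report_argv(result_dir, [f]) for f in fragments)
    return passes
```

Running the bandwidth columns in their own pass keeps a single malformed `-column` value from discarding the XVE, occupancy, and XMX columns that the recommendations rely on.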
Scope / limitations (v1)
- ops are out of scope.
- `{baseline_stem}_pytorch.py`.
- `gpu-hotspots` splits metrics across sampling passes, so memory-bandwidth can read near-zero for sub-millisecond kernels (sampling noise, not a parse bug); the stable signals are XVE/occupancy/XMX.
Commits