Agentic SYCL support #43

Draft: danielfleischer wants to merge 7 commits into main from df/sycl

Claude Code engine for SYCL XPU kernels (+ TFLOPS reporting, VTune profiling)

Summary

Adds a Claude Code optimization engine for SYCL/CUTLASS GEMM kernels on
Intel Xe, alongside the existing Triton path. The agent rewrites a whole .cpp
kernel each trial, then compiles (icpx) + benchmarks it via the existing
SyclExecutor, with correctness checked against a golden PyTorch/numpy
reference computed on the same bit-identical inputs. The change also adds a
TFLOPS / peak-utilization reporting convention and an independent VTune
gpu-hotspots profiler for SYCL kernels.

All GPU paths were developed and verified on real hardware: an Intel Arc Pro
B70 (Battlemage) with icpx + VTune 2026.0.

What's included

SYCL Claude engine (core)

  • SyclExecutor.compare_with_reference() — runs the optimized kernel, loads
    D2.bin, reshapes to the golden's shape, compares with spec rtol/atol.
    The single mockable GPU seam. Also maps "Arc Pro B70" → the bmg-g31 AOT
    target.
  • benchmark skill split into _run_triton / _run_sycl on --dsl; the SYCL
    path computes the golden via the PyTorch reference on the same
    A.bin/B0.bin inputs and prints the uniform baseline_us/triton_us line
    (the triton_us= token is kept verbatim so trial tooling parses uniformly
    across DSLs).
  • cli reads {input}_pytorch.py as the golden reference for SYCL.
  • validator gains SYCL static checks: missing_main, missing_io_contract,
    no_cutlass_include.
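The golden-reference comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `compare_with_golden`, its return shape, and the choice to report a size mismatch as a failure are assumptions. Only the D2.bin f32 output, the reshape to the golden's shape, and the rtol/atol comparison come from the description.

```python
from pathlib import Path

import numpy as np


def compare_with_golden(d2_path: Path, golden: np.ndarray,
                        rtol: float = 1e-2, atol: float = 1e-2) -> tuple[bool, float]:
    """Load the kernel's raw f32 output and compare it with the golden reference.

    Returns (passed, max_abs_diff). Hypothetical helper for illustration only.
    """
    out = np.fromfile(d2_path, dtype=np.float32)
    if out.size != golden.size:
        # Assumption: a wrong element count counts as a failed comparison.
        return False, float("inf")
    out = out.reshape(golden.shape)
    golden32 = golden.astype(np.float32)
    max_abs_diff = float(np.max(np.abs(out - golden32))) if out.size else 0.0
    passed = bool(np.allclose(out, golden32, rtol=rtol, atol=atol))
    return passed, max_abs_diff
```

Because the function only touches a file path and a NumPy array, a unit test can feed it a hand-written D2.bin without a GPU, which is in the spirit of the "single mockable GPU seam" above.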

Workspace generation

  • generator selects SYCL templates and a .cpp extension for dsl == sycl,
    substitutes a compilable starter stub when the input has no #include, and
    always writes the PyTorch golden reference as {name}_pytorch.py.
  • New templates: CLAUDE.sycl.md.j2, optimize-kernel.sycl.md.j2,
    starter_kernel.sycl.cpp.j2 (a file-IO-honouring CUTLASS BMG GEMM).
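As a rough sketch of the selection logic in the first bullet, the function below decides which files a workspace would contain. The name `plan_workspace`, its arguments, and the `.py` extension on the non-SYCL branch are assumptions made for illustration; the `.cpp` extension, the `#include` test, and the `{name}_pytorch.py` filename are taken from the description.

```python
def plan_workspace(dsl: str, name: str, input_source: str,
                   pytorch_reference: str, starter_stub: str) -> dict[str, str]:
    """Map output filenames to contents for a generated workspace (hypothetical helper)."""
    files: dict[str, str] = {}
    if dsl == "sycl":
        # Keep a real C++ input verbatim; otherwise fall back to the starter stub.
        has_cpp = "#include" in input_source
        files[f"{name}.cpp"] = input_source if has_cpp else starter_stub
        # The PyTorch golden reference is always written next to the kernel.
        files[f"{name}_pytorch.py"] = pytorch_reference
    else:
        # Assumption: the non-SYCL (Triton) path keeps a Python kernel file.
        files[f"{name}.py"] = input_source
    return files
```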

Knowledge base

  • sycl_io_contract.yaml — the runner harness contract (CLI args, A/B0/B1 .bin
    layout with bf16-as-int16, D2.bin f32 output, perf-line format, CUTLASS
    correctness constraints).
  • sycl_vtune.yaml — VTune verbs + metric→CUTLASS-knob mapping.
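To make the "bf16-as-int16" layout concrete, here is a minimal NumPy sketch of packing f32 values into bf16 bit patterns stored as int16, and unpacking them again. The helper names are hypothetical, and two details are assumptions rather than facts from the contract: round-to-nearest-even on conversion and native (little-endian) byte order in the .bin file. NaN inputs are not handled.

```python
import numpy as np


def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round float32 to bf16 (ties to even) and return the raw bits as int16."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Adding 0x7FFF plus the lowest kept bit rounds to nearest, ties to even.
    lsb = (u >> np.uint32(16)) & np.uint32(1)
    rounded = u + (np.uint32(0x7FFF) + lsb)
    return (rounded >> np.uint32(16)).astype(np.uint16).view(np.int16)


def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Expand int16-stored bf16 bits back to float32 by zero-filling the low half."""
    u = np.ascontiguousarray(bits, dtype=np.int16).view(np.uint16).astype(np.uint32)
    return (u << np.uint32(16)).view(np.float32)
```

A file round trip is then `f32_to_bf16_bits(a).tofile(path)` followed by `bf16_bits_to_f32(np.fromfile(path, dtype=np.int16))`. Computing the golden from the decoded values keeps the reference and the kernel on identical inputs.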

TFLOPS / utilization reporting

  • DeviceConfig.peak_tflops (default 160.0, PEAK_TFLOPS env override).
  • Benchmark Performance: line appends tflops=<f>, util=<f>% on both DSL
    paths; the trial tree stores --tflops and shows it in trial status; the
    tool-runner agent surfaces it.
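The reporting convention can be sketched as below. The FLOP count of 2·M·N·K is the usual GEMM convention and is assumed here rather than quoted from the PR; the function names and the exact spelling of the line are likewise illustrative. The 160.0 default, the PEAK_TFLOPS override, and the field order follow the description.

```python
import os


def peak_tflops(default: float = 160.0) -> float:
    """Device peak in TFLOPS, overridable through the PEAK_TFLOPS environment variable."""
    return float(os.environ.get("PEAK_TFLOPS", default))


def gemm_tflops(m: int, n: int, k: int, time_us: float) -> float:
    """Achieved TFLOPS, assuming one multiply and one add per M*N*K element."""
    flops = 2.0 * m * n * k
    return flops / (time_us * 1e-6) / 1e12


def perf_line(baseline_us: float, trial_us: float,
              m: int, n: int, k: int, peak: float) -> str:
    """Format a Performance line; the us/speedup prefix comes first, new fields last."""
    speedup = baseline_us / trial_us
    tflops = gemm_tflops(m, n, k, trial_us)
    util = 100.0 * tflops / peak
    return (f"Performance: baseline_us={baseline_us:.2f}, triton_us={trial_us:.2f}, "
            f"speedup={speedup:.2f}x, tflops={tflops:.2f}, util={util:.1f}%")
```

Appending the new fields after the speedup keeps the older prefix stable, so a parser that only knows the first three fields still matches.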

VTune profiling (independent gpu-hotspots module)

  • SyclProfiler (core/sycl_profiler.py) — compiles via SyclExecutor, runs
    the binary under vtune -collect gpu-hotspots (characterization), parses the
    per-task report, and emits recommendations (memory-bound, low occupancy, high
    idle, low XMX/DPAS, L3 thrashing). Kept fully separate from the Triton
    gpu-offload profiler. profile skill dispatches on --dsl.
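One way to picture the recommendation step is a small rule table over the parsed metrics. Everything numeric here is hypothetical: the thresholds, the field names, and the pairing of each metric with a label are assumptions chosen only to make the idea concrete, and none of them are taken from SyclProfiler. Only the five recommendation labels come from the description.

```python
from dataclasses import dataclass


@dataclass
class XeMetrics:
    """Percentages in [0, 100]; field names are illustrative, not VTune column names."""
    xve_active: float
    xve_stalled: float
    xve_idle: float
    occupancy: float
    xmx_active: float
    l3_miss: float


# Hypothetical thresholds; real values would need tuning against measured kernels.
RULES = [
    ("memory-bound", lambda m: m.xve_stalled > 50.0),
    ("low occupancy", lambda m: m.occupancy < 50.0),
    ("high idle", lambda m: m.xve_idle > 30.0),
    ("low XMX/DPAS", lambda m: m.xmx_active < 20.0),
    ("L3 thrashing", lambda m: m.l3_miss > 50.0),
]


def recommend(metrics: XeMetrics) -> list[str]:
    """Return the label of every rule whose predicate fires, in table order."""
    return [label for label, fires in RULES if fires(metrics)]
```

Keeping the rules as data makes each threshold easy to test in isolation with mocked metrics, without VTune or a GPU.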

Example

  • examples/sycl/ — hardware-verified worked example: baseline (t0, 256×256×32)
    and trial (t1, 128×128×32) CUTLASS GEMM kernels, a bf16 spec, the PyTorch
    golden reference, and a README with reproduction commands.

Design notes

  • DSL-agnostic engine. ClaudeEngine / create_engine are unchanged; the
    work is DSL-aware workspace generation, a SYCL branch in the benchmark/profile
    skills, the golden-reference comparison, and SYCL templates/KB.
  • Independent SYCL paths. SYCL kernels are compiled binaries with no
    PyTorch Model, so the benchmark and VTune paths don't reuse the Python-runner
    Triton code — they share data-layout conventions, not control flow.
  • The file-IO contract is the load-bearing piece. No pre-existing .cpp
    implemented the A.bin/B0.bin/D2.bin + --input_dir/--output_dir
    contract; it lived only inside SyclExecutor. The starter kernel implements
    it and the KB documents it so the agent can produce a runnable kernel from
    scratch.

Verification (Intel Arc Pro B70, real hardware)

  • Golden-reference correctness: bf16 GEMM vs f32 PyTorch matmul on identical
    inputs passes np.allclose at rtol=1e-2, atol=1e-2 (max abs diff ~2e-5).
  • Benchmark skill: t0 compiles baseline+trial, checks vs golden →
    Correctness: PASSED, e.g. baseline_us=194.20, triton_us=83.10,
    speedup=2.34x, tflops=25.83, util=16.1%. Cached-baseline (t1+) path
    skips the baseline rerun.
  • Workspace generation: .cpp input kept verbatim; PyTorch-only input
    substitutes the starter stub; SYCL CLAUDE.md / --dsl sycl command / golden
    ref / KB symlink all correct.
  • VTune profiling: xe-forge-skill profile --dsl sycl produces real Xe
    hardware metrics (XVE Active/Stalled/Idle, occupancy, XMX/DPAS, L3) and
    correct recommendations.
  • Unit tests: 52 pass (uv run pytest), platform-independent — the
    icpx/GPU/vtune seams are mocked. Ruff lint + format clean.

Hardware/tooling gotchas surfaced (documented in VTUNE.md / KB)

  • VTune works on this box after the 2025.3→2026 upgrade, but
    vtune-self-checker.sh gives a false negative (its bundled DPC++ test app
    won't launch) — a real AOT bmg-g31 kernel profiles fine.
  • The B70 uses the xe kernel driver (not i915); gpu-hotspots works.
  • VTune column names containing a literal comma ("GPU Memory Bandwidth,
    GB/sec:Read") break the -column parser; bandwidth is requested via the
    comma-free substring GB/sec in a separate report pass.
  • _detect_device_target didn't recognize "Arc Pro B70" → added a bmg-g31
    mapping.

Scope / limitations (v1)

  • GEMM-family kernels only (the file-IO inputs are GEMM-shaped). Non-GEMM
    ops are out of scope.
  • Golden reference located by convention: {baseline_stem}_pytorch.py.
  • VTune gpu-hotspots splits metrics across sampling passes, so memory-bandwidth
    can read near-zero for sub-millisecond kernels (sampling noise, not a parse
    bug); the stable signals are XVE/occupancy/XMX.

Commits

  • Add SYCL golden-reference benchmarking infrastructure
  • Add SYCL workspace generation to the Claude engine
  • Add SYCL file-IO contract to the knowledge base
  • Add SYCL Claude-engine GEMM example
  • Add tests for the SYCL Claude engine
  • Report TFLOPS and peak utilization across benchmark and trials
  • Add SYCL VTune profiling via an independent gpu-hotspots module

Add SYCL golden-reference benchmarking infrastructure

Check CUTLASS SYCL kernels against a PyTorch/numpy golden reference
computed on bit-identical inputs, rather than original-vs-optimized.

- sycl_executor: add compare_with_reference() — run the kernel, load
  D2.bin, reshape to the golden's shape, compare with spec rtol/atol;
  map Arc Pro B70 to the bmg-g31 AOT target.
- benchmark skill: split run() into _run_triton / _run_sycl on --dsl;
  the SYCL path computes the golden via the PyTorch ref on the same
  A.bin/B0.bin inputs and prints the uniform baseline_us/triton_us line.
- validator: add missing_main, missing_io_contract, no_cutlass_include
  checks for SYCL sources.
- cli: read {input}_pytorch.py as the golden reference for SYCL.
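The three static checks named above could look something like the following. The heuristics are assumptions for illustration (a regex for `int main(`, substring tests for the file-IO tokens, a regex for a `cutlass/` include); the actual validator may test different things. Only the three issue names come from the commit message.

```python
import re

# Tokens a kernel is assumed to mention if it honours the file-IO contract.
IO_TOKENS = ("--input_dir", "--output_dir", "A.bin", "B0.bin", "D2.bin")


def sycl_static_issues(source: str) -> list[str]:
    """Return the names of static checks that a SYCL source fails (hypothetical helper)."""
    issues = []
    if not re.search(r"\bint\s+main\s*\(", source):
        issues.append("missing_main")
    if not all(token in source for token in IO_TOKENS):
        issues.append("missing_io_contract")
    if not re.search(r'#\s*include\s*[<"]cutlass/', source):
        issues.append("no_cutlass_include")
    return issues
```

Checks like these are cheap text scans, so they can reject an obviously unusable rewrite before any icpx compile is attempted.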

Add SYCL workspace generation to the Claude engine

Generate a DSL-aware agentic workspace for SYCL kernels.

- generator: select SYCL templates and a .cpp extension for dsl==sycl;
  substitute a compilable starter stub when the input has no #include;
  always write the PyTorch golden reference as {name}_pytorch.py.
- templates: CLAUDE.sycl.md.j2 and optimize-kernel.sycl.md.j2 (C++
  workflow, file-IO contract, --dsl sycl / bench-xpu, CUTLASS rules),
  and starter_kernel.sycl.cpp.j2 (file-IO-honouring CUTLASS BMG GEMM).

Add SYCL file-IO contract to the knowledge base

knowledge_base/sycl/xpu/sycl_io_contract.yaml: the runner harness
contract the agent must follow — CLI args (--m/--n/--k/--input_dir/
--output_dir/--iterations/--verify), the A/B0/B1 .bin input layout
(bf16 as int16 bits), the D2.bin f32 output, the perf-line format, and
SYCL/CUTLASS correctness constraints (tile/subgroup/atom consistency,
SLM budget, f32 accumulate). Loaded as optimization guidance alongside
the existing cutlass_sycl_framework / xetla_patterns entries.

Add SYCL Claude-engine GEMM example

Hardware-verified worked example on Intel Arc Pro B70: baseline (t0,
256x256x32) and trial (t1, 128x128x32) CUTLASS GEMM kernels honouring
the file-IO contract, a bf16 spec, the PyTorch golden reference, and a
README with reproduction commands.

Add tests for the SYCL Claude engine

Platform-independent unit tests mocking the icpx/GPU seam:
- test_generator_sycl: .cpp emission, starter-stub substitution, SYCL
  CLAUDE.md content, Triton regression.
- test_benchmark_skill_routing: --dsl dispatch, cached-baseline skip.
- test_sycl_golden_reference: compare_with_reference
  pass/fail/compile-error/missing-D2 cases, bf16 .bin roundtrip,
  output-format regex.
- test_validator: SYCL missing_main / missing_io_contract / cutlass.

Report TFLOPS and peak utilization across benchmark and trials

Surface achieved throughput and its percentage of the device's
theoretical peak so results read as utilization, not just relative
speedup (the B70's bf16 peak is ~160 TFLOPS).

- config: add DeviceConfig.peak_tflops (default 160.0), overridable via
  the PEAK_TFLOPS env var.
- benchmark skill: append `tflops=<f>, util=<f>%` to the Performance:
  line via a shared _perf_line() helper, on both the Triton and SYCL
  paths; util = tflops / peak_tflops. The us/speedup prefix is unchanged
  so existing parsers keep working.
- trial tree: add --tflops to `trial result`, persist it per trial, and
  show it in `trial status` (e.g. "18.9 TFLOPS").
- tool-runner agent: extract tflops/util from benchmark output.
- SYCL CLAUDE.md: record --tflops and reason about utilization vs the
  ~160 TFLOPS peak when choosing the next tile.
- examples/sycl/README: show the tflops/util fields and PEAK_TFLOPS.

Add SYCL VTune profiling via an independent gpu-hotspots module

Profile compiled CUTLASS SYCL kernels with VTune, mapping Intel Xe
hardware metrics to CUTLASS tuning knobs. Kept separate from the Triton
profiler: SYCL kernels are compiled binaries, so there is no Python
runner and a different VTune analysis (gpu-hotspots characterization,
not gpu-offload) is used.

- core/sycl_profiler.py: new SyclProfiler — compiles via SyclExecutor,
  generates the same deterministic file-IO inputs as the benchmark, runs
  the binary under `vtune -collect gpu-hotspots`, parses the per-task
  hotspots report, and emits recommendations (memory-bound, low
  occupancy, high idle, low XMX/DPAS, L3 thrashing). Degrades gracefully
  to an error result when VTune is absent or collection fails.
- profile skill: split run() into _profile_triton / _profile_sycl on
  --dsl; add the previously-missing --dsl flag to the profile subparser.
- CLAUDE.sycl.md: wire the Profile step to `--dsl sycl --variant
  bench-xpu` and have the agent reason about the XVE/occupancy/XMX
  metrics when choosing the next tile.
- knowledge_base/sycl/xpu/sycl_vtune.yaml: the collect/report verbs and
  the metric -> CUTLASS-knob mapping.
- VTUNE.md: SYCL profiling section (verbs, metrics, knob table, and the
  xe-driver / self-checker-false-negative notes).
- tests: test_sycl_profiler.py and test_profile_skill_routing.py mock the
  vtune subprocess + compile seams (no VTune/GPU needed).

Verified on VTune 2026.0 + Arc Pro B70. Column names containing a
literal comma ("GPU Memory Bandwidth, GB/sec:Read") break VTune's
-column parser, so bandwidth is requested via the comma-free substring
"GB/sec" in a separate report pass.