IntelLabs · danielfleischer · Jun 8, 2026
diff --git a/VTUNE.md b/VTUNE.md
@@ -1,6 +1,9 @@
 # VTune GPU Profiling
 
-Hardware-counter profiling for Triton kernels on Intel XPU using Intel VTune Profiler.
+Hardware-counter profiling on Intel XPU using Intel VTune Profiler. Two paths:
+**Triton/PyTorch** kernels (`gpu-offload` around a generated Python runner) and
+**SYCL/CUTLASS** `.cpp` kernels (`gpu-hotspots` characterization on the compiled
+binary). The path is selected by `--dsl`; see [SYCL kernels](#sycl-kernels) below.
 
 ---
 
@@ -125,6 +128,67 @@ xe-forge -i kernel.py -s spec.yaml --vtune --engine claude --workspace ./workspa
 
 ---
 
+## SYCL kernels
+
+SYCL/CUTLASS kernels are compiled `.cpp` binaries, so they are profiled
+differently from Triton: there is no Python runner. `SyclProfiler`
+(`core/sycl_profiler.py`) compiles the kernel via `SyclExecutor`, generates the
+same deterministic file-IO inputs the benchmark uses, and runs the binary
+directly under VTune `gpu-hotspots` in **characterization** mode — which exposes
+richer Intel Xe metrics than `gpu-offload`.
+
+```bash
+# Profile a SYCL kernel (point --vtune-bin at a 2026.x build if needed)
+xe-forge-skill profile examples/sycl/gemm.cpp \
+    --spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu \
+    --iters 200 --vtune-bin /data/swtools/intel/vtune/2026.0/bin64/vtune
+```
+
+Under the hood:
+
+```bash
+vtune -collect gpu-hotspots \
+    -knob gpu-profiling-mode=characterization \
+    -knob characterization-mode=overview \
+    -result-dir <dir> \
+    -- <binary> --m=M --n=N --k=K --input_dir=<in> --output_dir=<out> \
+       --iterations=200 --verify=0
+```
+
+### Metrics collected (SYCL)
+
+| Metric | Meaning |
+|--------|---------|
+| XVE Active / Stalled / Idle | Xe Vector Engine execution / stall / idle time |
+| Peak XVE Threads Occupancy | Thread occupancy (with Work-Size / SLM / Barrier sub-limiters) |
+| XMX (DPAS) Active | Fraction of time the matrix engine is busy — the key GEMM-efficiency signal |
+| GPU L3 Miss Ratio | L3 cache miss ratio |
+| GPU Memory Bandwidth Read/Write | GB/s to/from GPU memory |
+
+### Metric → CUTLASS knob (SYCL)
+
+| Condition | Diagnosis | Action | KB |
+|-----------|-----------|--------|----|
+| XVE Stalled > Active | Memory-bound mainloop | ↑ PipelineStages; 2D-block/VNNI copy atoms; ↓ TileK | `sycl_vtune.yaml` |
+| Peak XVE Threads Occupancy < 50% | Grid too small / register pressure | Smaller TileShape (256→128); check 256-GRF | `sycl_vtune.yaml` |
+| XVE Idle > 30% | Work-distribution / tail | TileShape vs M/N; stream-K / persistent scheduler | `sycl_vtune.yaml` |
+| XMX (DPAS) Active < 20% | Matrix engine underutilized | Larger N-per-subgroup; SubgroupLayout vs DPAS atom | `sycl_vtune.yaml` |
+| L3 miss > 50% | Cache thrashing | Reduce tiles; improve K-blocking/reuse | `sycl_vtune.yaml` |
+| Mem BW ≈ peak, low TFLOPS | Bandwidth-bound | Accept, or change algorithm | `sycl_vtune.yaml` |
+
+### SYCL-specific notes
+
+- **Self-checker false negative**: `vtune-self-checker.sh` may report GPU
+  profiling as unsupported (its bundled DPC++ app fails to launch), yet a real
+  AOT-compiled `bmg-g31` kernel profiles fine. Don't gate on the self-checker.
+- **`xe` kernel driver** (newer than `i915`) is supported by VTune 2026 for
+  `gpu-hotspots`; setting the `kernel.perf_event_paranoid` sysctl to `0` helps.
+- **VTune version**: the config default `vtune_bin` may point at an older build;
+  pass `--vtune-bin /data/swtools/intel/vtune/2026.0/bin64/vtune` (or set
+  `VTUNE_BIN`) to use 2026.x.
+
+---
+
 ## Troubleshooting
 
 **"VTune not found"** -- Ensure `vtune` is on `$PATH` after sourcing the oneAPI environment, or specify the path with `--vtune-bin`.
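
Note on the "Metric → CUTLASS knob (SYCL)" table in the hunk above: its threshold rows can be read as a small triage function. The following is a minimal Python sketch, not code from this PR. The metric keys (`xve_active`, `peak_occupancy`, ...), the `Rule` type, and `triage` are hypothetical names, all values are assumed to be percentages, and the sample input is made up for illustration. The bandwidth row is left out because the table gives no numeric threshold for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """One row of the metric-to-knob table (hypothetical representation)."""
    condition: str
    diagnosis: str
    action: str
    applies: Callable[[Dict[str, float]], bool]


# Thresholds are copied from the table; dictionary keys are assumed names.
RULES: List[Rule] = [
    Rule("XVE Stalled > Active", "Memory-bound mainloop",
         "Raise PipelineStages; 2D-block/VNNI copy atoms; lower TileK",
         lambda m: m["xve_stalled"] > m["xve_active"]),
    Rule("Peak XVE Threads Occupancy < 50%", "Grid too small / register pressure",
         "Smaller TileShape (256 -> 128); check 256-GRF",
         lambda m: m["peak_occupancy"] < 50.0),
    Rule("XVE Idle > 30%", "Work-distribution / tail",
         "TileShape vs M/N; stream-K / persistent scheduler",
         lambda m: m["xve_idle"] > 30.0),
    Rule("XMX (DPAS) Active < 20%", "Matrix engine underutilized",
         "Larger N-per-subgroup; SubgroupLayout vs DPAS atom",
         lambda m: m["xmx_active"] < 20.0),
    Rule("GPU L3 Miss Ratio > 50%", "Cache thrashing",
         "Reduce tiles; improve K-blocking/reuse",
         lambda m: m["l3_miss"] > 50.0),
]


def triage(metrics: Dict[str, float]) -> List[Rule]:
    """Return every rule whose condition holds for the given metrics."""
    return [rule for rule in RULES if rule.applies(metrics)]


# Made-up values for illustration only; these are not measurements.
sample = {
    "xve_active": 30.0, "xve_stalled": 45.0, "xve_idle": 25.0,
    "peak_occupancy": 62.0, "xmx_active": 15.0, "l3_miss": 20.0,
}

for rule in triage(sample):
    print(f"{rule.condition}: {rule.diagnosis} -> {rule.action}")
```

Keeping the rules as data rather than an `if` chain means several diagnoses can fire for one profile, which matches how the table is meant to be read.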

diff --git a/examples/sycl/README.md b/examples/sycl/README.md
@@ -0,0 +1,80 @@
+# SYCL Claude Engine Example (Intel Xe / Battlemage)
+
+A worked, hardware-verified example of the **Claude Code engine for SYCL XPU
+kernels**: the agent rewrites a whole CUTLASS SYCL `.cpp` GEMM each trial, then
+compiles (`icpx`) + benchmarks it via `SyclExecutor`, with correctness checked
+against a **golden PyTorch reference** (`numpy.allclose` on the kernel's dumped
+`D2.bin`).
+
+| File | Role |
+|------|------|
+| `gemm.cpp` | Baseline kernel (t0). CUTLASS BMG GEMM `D = A·B0`, tile `256×256×32`. Honours the file-IO contract. |
+| `gemm_t1.cpp` | One optimization trial (t1). Same kernel with tile `128×128×32`; ~1.76× faster at 1024³ on an Arc Pro B70. |
+| `gemm.yaml` | KernelBench-style spec: GEMM dims `M=N=K=1024`, bf16, `rtol=atol=0.02`. |
+| `gemm_pytorch.py` | Golden PyTorch reference: `Model.forward(A, B0) -> A.float() @ B0.float()`. |
+
+## The file-IO contract
+
+Every SYCL kernel optimized by this engine is a standalone executable invoked as:
+
+```
+./kernel --m=<M> --n=<N> --k=<K> --input_dir=<dir> --output_dir=<dir> --iterations=<int> --verify=<int>
+```
+
+It reads `A.bin` `[M,K]` and `B0.bin` `[K,N]` (raw row-major, bf16 stored as
+int16 bits) from `--input_dir`, computes `D = A·B0`, writes `D2.bin` `[M,N]`
+(float32, row-major) to `--output_dir`, and prints a `… TFlop/s … ms` line.
+Full spec: [`knowledge_base/sycl/xpu/sycl_io_contract.yaml`](../../knowledge_base/sycl/xpu/sycl_io_contract.yaml).
+
+## Environment (Intel XPU box)
+
+```bash
+export SYCL_TLA_DIR=/path/to/sycl-tla          # CUTLASS SYCL checkout
+export AIBENCH_SYCL_TARGET=bmg-g31             # AOT target (Battlemage: B580/B570/B70)
+export MKL_INCLUDE=/path/to/oneapi/include
+export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
+export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
+export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file -gline-tables-only"
+```
+
+## Reproduce the benchmark
+
+t0 — compiles both kernels, times the baseline, checks the trial vs the golden ref:
+
+```bash
+xe-forge-skill benchmark examples/sycl/gemm.cpp examples/sycl/gemm_t1.cpp \
+    --spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu
+```
+
+```
+Correctness: PASSED
+Performance: baseline_us=193.80, triton_us=109.90, speedup=1.76x, tflops=19.54, util=12.2%
+```
+
+t1+ — reuse the cached baseline (no baseline recompile/rerun):
+
+```bash
+xe-forge-skill benchmark examples/sycl/gemm.cpp examples/sycl/gemm_t1.cpp \
+    --spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu --baseline-us 193.80
+```
+
+The `triton_us=` token is kept verbatim across DSLs so the trial tooling parses
+uniformly; for SYCL it carries the optimized kernel's time in microseconds.
+`tflops=` is the achieved throughput and `util=` is its percentage of the
+device's theoretical peak (`peak_tflops`, default 160 TFLOPS bf16 for the B70;
+override with the `PEAK_TFLOPS` env var) — so `util=12.2%` means this trial
+reaches 12.2% of peak.
+
+## Generate an agentic workspace
+
+```bash
+python -m xe_forge.cli --input examples/sycl/gemm.cpp --name gemm \
+    --dsl sycl --engine claude --spec examples/sycl/gemm.yaml \
+    --variant bench-xpu --workspace /tmp/ws_sycl
+```
+
+This scaffolds a SYCL `CLAUDE.md`, an `/optimize-kernel` command wired with
+`--dsl sycl`, the kernel + `gemm_pytorch.py` golden reference under
+`test_kernels/`, and a `knowledge_base/` symlink. Passing a PyTorch-only `.py`
+input instead substitutes a compilable starter `.cpp` (a copy of `gemm.cpp`)
+and uses the `.py` as the golden reference.
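
Note on the "Reproduce the benchmark" section above: the fields of the `Performance:` line can be cross-checked by hand. The sketch below is not part of the PR. It assumes the usual GEMM convention of `2·M·N·K` floating-point operations, which happens to reproduce the printed `tflops=` value. `parse_performance` and `derive` are hypothetical helper names. The 160 TFLOPS default is the `peak_tflops` figure quoted in the README.

```python
import re
from typing import Dict

# The sample output line quoted in the README.
LINE = ("Performance: baseline_us=193.80, triton_us=109.90, "
        "speedup=1.76x, tflops=19.54, util=12.2%")


def parse_performance(line: str) -> Dict[str, float]:
    """Extract key=value pairs, dropping unit suffixes such as 'x' and '%'."""
    return {key: float(value)
            for key, value in re.findall(r"(\w+)=([0-9.]+)", line)}


def derive(m: int, n: int, k: int, baseline_us: float, trial_us: float,
           peak_tflops: float = 160.0) -> Dict[str, float]:
    """Recompute speedup, throughput and utilisation from raw timings."""
    flops = 2.0 * m * n * k                      # assumed GEMM flop count
    tflops = flops / (trial_us * 1e-6) / 1e12    # microseconds -> seconds
    return {
        "speedup": baseline_us / trial_us,
        "tflops": tflops,
        "util": 100.0 * tflops / peak_tflops,    # percent of theoretical peak
    }


reported = parse_performance(LINE)
derived = derive(1024, 1024, 1024,
                 reported["baseline_us"], reported["triton_us"])

for name in ("speedup", "tflops", "util"):
    print(f"{name}: reported={reported[name]}, derived={derived[name]:.4f}")
```

For `M=N=K=1024` and the timings above, the derived values round to the reported ones: a speedup of 1.76, 19.54 TFLOPS, and 12.2% utilisation.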