66 changes: 65 additions & 1 deletion VTUNE.md
@@ -1,6 +1,9 @@
# VTune GPU Profiling

Hardware-counter profiling for Triton kernels on Intel XPU using Intel VTune Profiler.
Hardware-counter profiling on Intel XPU using Intel VTune Profiler. Two paths:
**Triton/PyTorch** kernels (`gpu-offload` around a generated Python runner) and
**SYCL/CUTLASS** `.cpp` kernels (`gpu-hotspots` characterization on the compiled
binary). The path is selected by `--dsl`; see [SYCL kernels](#sycl-kernels) below.

---

@@ -125,6 +128,67 @@ xe-forge -i kernel.py -s spec.yaml --vtune --engine claude --workspace ./workspa

---

## SYCL kernels

SYCL/CUTLASS kernels are compiled `.cpp` binaries, so they are profiled
differently from Triton: there is no Python runner. `SyclProfiler`
(`core/sycl_profiler.py`) compiles the kernel via `SyclExecutor`, generates the
same deterministic file-IO inputs the benchmark uses, and runs the binary
directly under VTune `gpu-hotspots` in **characterization** mode, which exposes
richer Intel Xe metrics than `gpu-offload`.

```bash
# Profile a SYCL kernel (point --vtune-bin at a 2026.x build if needed)
xe-forge-skill profile examples/sycl/gemm.cpp \
--spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu \
--iters 200 --vtune-bin /data/swtools/intel/vtune/2026.0/bin64/vtune
```

Under the hood:

```bash
vtune -collect gpu-hotspots \
-knob gpu-profiling-mode=characterization \
-knob characterization-mode=overview \
-result-dir <dir> \
-- <binary> --m=M --n=N --k=K --input_dir=<in> --output_dir=<out> \
--iterations=200 --verify=0
```
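The command above can be assembled programmatically. The following is a minimal sketch, not the actual `SyclProfiler` code: the helper name `build_vtune_cmd` and the example paths are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_vtune_cmd(vtune_bin, result_dir, binary, m, n, k,
                    input_dir, output_dir, iterations=200):
    """Assemble the gpu-hotspots characterization command as an argv list.

    Hypothetical helper: it mirrors the documented command line only.
    """
    return [
        vtune_bin,
        "-collect", "gpu-hotspots",
        "-knob", "gpu-profiling-mode=characterization",
        "-knob", "characterization-mode=overview",
        "-result-dir", str(result_dir),
        "--",  # everything after this is the profiled binary and its arguments
        str(binary),
        f"--m={m}", f"--n={n}", f"--k={k}",
        f"--input_dir={input_dir}", f"--output_dir={output_dir}",
        f"--iterations={iterations}", "--verify=0",
    ]


# Example paths are placeholders, not paths used by the project.
cmd = build_vtune_cmd("vtune", "/tmp/vtune_result", "./gemm",
                      1024, 1024, 1024, "/tmp/in", "/tmp/out")
print(shlex.join(cmd))
```

Building an argv list (rather than a shell string) lets the command be passed straight to `subprocess.run` without quoting concerns.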

### Metrics collected (SYCL)

| Metric | Meaning |
|--------|---------|
| XVE Active / Stalled / Idle | Xe Vector Engine execution / stall / idle time |
| Peak XVE Threads Occupancy | Thread occupancy (with Work-Size / SLM / Barrier sub-limiters) |
| XMX (DPAS) Active | Fraction of time the matrix engine is busy — the key GEMM-efficiency signal |
| GPU L3 Miss Ratio | L3 cache miss ratio |
| GPU Memory Bandwidth Read/Write | GB/s to/from GPU memory |

### Metric → CUTLASS knob (SYCL)

| Condition | Diagnosis | Action | KB |
|-----------|-----------|--------|----|
| XVE Stalled > Active | Memory-bound mainloop | ↑ PipelineStages; 2D-block/VNNI copy atoms; ↓ TileK | `sycl_vtune.yaml` |
| Peak occupancy < 50% | Grid too small / register pressure | Smaller TileShape (256→128); check 256-GRF | `sycl_vtune.yaml` |
| XVE Idle > 30% | Work-distribution / tail | TileShape vs M/N; stream-K / persistent scheduler | `sycl_vtune.yaml` |
| XMX active < 20% | Matrix engine underutilized | Larger N-per-subgroup; SubgroupLayout vs DPAS atom | `sycl_vtune.yaml` |
| L3 miss > 50% | Cache thrashing | Reduce tiles; improve K-blocking/reuse | `sycl_vtune.yaml` |
| Mem BW ≈ peak, low TFLOPS | Bandwidth-bound | Accept, or change algorithm | `sycl_vtune.yaml` |
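The threshold rows of the table can be read as a simple rule list. The sketch below is illustrative only: the function name `diagnose` and the dictionary keys are hypothetical, all values are assumed to be percentages, and the bandwidth row is left out because it needs the device's peak bandwidth and TFLOPS, which the table does not give.

```python
def diagnose(m):
    """Return (diagnosis, action) for every table row whose condition holds.

    `m` maps hypothetical metric names to percentages (0-100).
    """
    rules = [
        (m["xve_stalled"] > m["xve_active"],
         "Memory-bound mainloop",
         "Raise PipelineStages; 2D-block/VNNI copy atoms; lower TileK"),
        (m["occupancy"] < 50,
         "Grid too small / register pressure",
         "Smaller TileShape (256->128); check 256-GRF"),
        (m["xve_idle"] > 30,
         "Work-distribution / tail",
         "TileShape vs M/N; stream-K / persistent scheduler"),
        (m["xmx_active"] < 20,
         "Matrix engine underutilized",
         "Larger N-per-subgroup; SubgroupLayout vs DPAS atom"),
        (m["l3_miss"] > 50,
         "Cache thrashing",
         "Reduce tiles; improve K-blocking/reuse"),
    ]
    return [(diag, action) for hit, diag, action in rules if hit]


# Made-up sample values, not measurements.
sample = {"xve_active": 30, "xve_stalled": 55, "xve_idle": 15,
          "occupancy": 80, "xmx_active": 12, "l3_miss": 20}
for diag, action in diagnose(sample):
    print(f"{diag}: {action}")
```

Several rows can fire at once, so the function returns every match rather than the first.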

### SYCL-specific notes

- **Self-checker false negative**: `vtune-self-checker.sh` may report GPU
profiling as unsupported (its bundled DPC++ app fails to launch), yet a real
AOT-compiled `bmg-g31` kernel profiles fine. Don't gate on the self-checker.
- **`xe` kernel driver** (newer than `i915`) is supported by VTune 2026 for
`gpu-hotspots`; setting the `kernel.perf_event_paranoid` sysctl to `0` helps.
- **VTune version**: the config default `vtune_bin` may point at an older build;
pass `--vtune-bin /data/swtools/intel/vtune/2026.0/bin64/vtune` (or set
`VTUNE_BIN`) to use 2026.x.

---

## Troubleshooting

**"VTune not found"** -- Ensure `vtune` is on `$PATH` after sourcing the oneAPI environment, or specify the path with `--vtune-bin`.
80 changes: 80 additions & 0 deletions examples/sycl/README.md
@@ -0,0 +1,80 @@
# SYCL Claude Engine Example (Intel Xe / Battlemage)

A worked, hardware-verified example of the **Claude Code engine for SYCL XPU
kernels**: the agent rewrites a whole CUTLASS SYCL `.cpp` GEMM each trial, then
compiles (`icpx`) + benchmarks it via `SyclExecutor`, with correctness checked
against a **golden PyTorch reference** (`numpy.allclose` on the kernel's dumped
`D2.bin`).

| File | Role |
|------|------|
| `gemm.cpp` | Baseline kernel (t0). CUTLASS BMG GEMM `D = A·B0`, tile `256×256×32`. Honours the file-IO contract. |
| `gemm_t1.cpp` | One optimization trial (t1). Same kernel with tile `128×128×32`; ~1.76× faster at 1024³ on an Arc Pro B70. |
| `gemm.yaml` | KernelBench-style spec: GEMM dims `M=N=K=1024`, bf16, `rtol=atol=0.02`. |
| `gemm_pytorch.py` | Golden PyTorch reference: `Model.forward(A, B0) -> A.float() @ B0.float()`. |
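The tolerance check follows the `numpy.allclose` rule, `|actual - expected| <= atol + rtol * |expected|`, applied element-wise. Below is a dependency-free sketch of that rule using the spec's `rtol=atol=0.02`. It assumes the golden reference plays the role of `expected`, and the function name is hypothetical.

```python
def allclose(actual, expected, rtol=0.02, atol=0.02):
    """Element-wise numpy.allclose-style check over two flat sequences.

    Assumption: `expected` is the golden reference, so the relative term
    scales with its magnitude.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


print(allclose([1.0, 100.0], [1.01, 101.5]))  # both within tolerance
print(allclose([0.0], [0.05]))                # 0.05 exceeds 0.02 + 0.02 * 0.05
```

Because the relative term grows with the reference value, large outputs tolerate larger absolute errors, which suits bf16 inputs accumulated in float32.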

## The file-IO contract

Every SYCL kernel optimized by this engine is a standalone executable invoked as:

```
./kernel --m=<M> --n=<N> --k=<K> --input_dir=<dir> --output_dir=<dir> --iterations=<int> --verify=<int>
```

It reads `A.bin` `[M,K]` and `B0.bin` `[K,N]` (raw row-major, bf16 stored as
int16 bits) from `--input_dir`, computes `D = A·B0`, writes `D2.bin` `[M,N]`
(float32, row-major) to `--output_dir`, and prints a `… TFlop/s … ms` line.
Full spec: [`knowledge_base/sycl/xpu/sycl_io_contract.yaml`](../../knowledge_base/sycl/xpu/sycl_io_contract.yaml).
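To make the binary layout concrete, here is a small standard-library sketch that writes a bf16 input file and reads back a float32 output file. It is not the project's input generator. The helper names are hypothetical, and little-endian byte order and round-to-nearest-even conversion are assumptions; the linked spec is authoritative.

```python
import os
import struct
import tempfile


def f32_to_bf16_bits(x):
    """Convert a float to its 16-bit bf16 pattern (NaN handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even (assumed)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Expand a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def write_bf16_matrix(path, rows):
    """Write a row-major matrix as raw 16-bit bf16 words."""
    flat = [f32_to_bf16_bits(v) for row in rows for v in row]
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(flat)}H", *flat))


def read_f32_matrix(path, m, n):
    """Read an [m, n] row-major float32 matrix such as D2.bin."""
    with open(path, "rb") as f:
        flat = struct.unpack(f"<{m * n}f", f.read(4 * m * n))
    return [list(flat[i * n:(i + 1) * n]) for i in range(m)]


with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "A.bin")
    write_bf16_matrix(a_path, [[1.0, -2.0], [0.5, 3.0]])
    a_size = os.path.getsize(a_path)  # 4 elements x 2 bytes each

    d_path = os.path.join(tmp, "D2.bin")
    with open(d_path, "wb") as f:  # stand-in for the kernel's output
        f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
    d = read_f32_matrix(d_path, 2, 2)

print(a_size, d)
```

The words are written as unsigned 16-bit values; the bytes on disk are identical to the int16 view the contract describes.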

## Environment (Intel XPU box)

```bash
export SYCL_TLA_DIR=/path/to/sycl-tla # CUTLASS SYCL checkout
export AIBENCH_SYCL_TARGET=bmg-g31 # AOT target (Battlemage: B580/B570/B70)
export MKL_INCLUDE=/path/to/oneapi/include
export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file -gline-tables-only"
```

## Reproduce the benchmark

t0: compiles both kernels, times the baseline, and checks the trial against the golden reference:

```bash
xe-forge-skill benchmark examples/sycl/gemm.cpp examples/sycl/gemm_t1.cpp \
--spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu
```

```
Correctness: PASSED
Performance: baseline_us=193.80, triton_us=109.90, speedup=1.76x, tflops=19.54, util=12.2%
```

t1+ — reuse the cached baseline (no baseline recompile/rerun):

```bash
xe-forge-skill benchmark examples/sycl/gemm.cpp examples/sycl/gemm_t1.cpp \
--spec examples/sycl/gemm.yaml --dsl sycl --variant bench-xpu --baseline-us 193.80
```

The `triton_us=` token is kept verbatim across DSLs so the trial tooling parses
uniformly; for SYCL it carries the optimized kernel's time in microseconds.
`tflops=` is the achieved throughput and `util=` is its percentage of the
device's theoretical peak (`peak_tflops`, default 160 TFLOPS bf16 for the B70;
override with the `PEAK_TFLOPS` env var) — so `util=12.2%` means this trial
reaches 12.2% of peak.
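The figures on the `Performance:` line can be reproduced from the timings. The sketch below parses the line and recomputes speedup, throughput, and utilization. It assumes the conventional GEMM count of `2·M·N·K` floating-point operations, which matches the sample output; the function names are hypothetical and this is not the tool's own parser.

```python
import re

LINE = ("Performance: baseline_us=193.80, triton_us=109.90, "
        "speedup=1.76x, tflops=19.54, util=12.2%")


def parse_performance(line):
    """Extract key=value numbers, dropping unit suffixes such as 'x' and '%'."""
    return {key: float(val) for key, val in re.findall(r"(\w+)=([0-9.]+)", line)}


def derived(baseline_us, kernel_us, m, n, k, peak_tflops=160.0):
    """Recompute (speedup, tflops, util %) assuming 2*M*N*K FLOPs per GEMM."""
    flops = 2 * m * n * k
    tflops = flops / (kernel_us * 1e-6) / 1e12
    return baseline_us / kernel_us, tflops, 100.0 * tflops / peak_tflops


perf = parse_performance(LINE)
speedup, tflops, util = derived(perf["baseline_us"], perf["triton_us"],
                                1024, 1024, 1024)
print(f"speedup={speedup:.2f}x tflops={tflops:.2f} util={util:.1f}%")
```

With `M=N=K=1024` and the default 160 TFLOPS peak, the recomputed values agree with the reported `1.76x`, `19.54`, and `12.2%`.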

## Generate an agentic workspace

```bash
python -m xe_forge.cli --input examples/sycl/gemm.cpp --name gemm \
--dsl sycl --engine claude --spec examples/sycl/gemm.yaml \
--variant bench-xpu --workspace /tmp/ws_sycl
```

This scaffolds a SYCL `CLAUDE.md`, an `/optimize-kernel` command wired with
`--dsl sycl`, the kernel + `gemm_pytorch.py` golden reference under
`test_kernels/`, and a `knowledge_base/` symlink. Passing a PyTorch-only `.py`
input instead substitutes a compilable starter `.cpp` (a copy of `gemm.cpp`)
and uses the `.py` as the golden reference.