58 changes: 58 additions & 0 deletions skills/competition/nineops-skills/.vscode/settings.json
@@ -0,0 +1,58 @@
{
    "chat.advanced.cli.mcp.enabled": true,
    "chat.cli.mcp.enabled": true,
    "chat.mcp.access": "all",
    "chat.tools.terminal.autoApprove": {
        "/^bash\\b/": true,
        "/^cat\\b/": true,
        "/^cd\\b/": true,
        "/^chmod\\b/": true,
        "/^cp\\b/": true,
        "/^curl\\b/": true,
        "/^diff\\b/": true,
        "/^echo\\b/": true,
        "/^find\\b/": true,
        "/^git\\b/": true,
        "/^go\\b/": true,
        "/^grep\\b/": true,
        "/^head\\b/": true,
        "/^ls\\b/": true,
        "/^mkdir\\b/": true,
        "/^mv\\b/": true,
        "/^node\\b/": true,
        "/^npm\\b/": true,
        "/^pip/": true,
        "/^printf\\b/": true,
        "/^pwd\\b/": true,
        "/^python/": true,
        "/^rm\\b/": true,
        "/^sed\\b/": true,
        "/^sh\\b/": true,
        "/^sort\\b/": true,
        "/^tail\\b/": true,
        "/^tee\\b/": true,
        "/^test\\b/": true,
        "/^touch\\b/": true,
        "/^wc\\b/": true,
        "/^wget\\b/": true,
        "/^which\\b/": true
    },
    "chat.tools.terminal.blockDetectedFileWrites": false,
    "chat.tools.terminal.ignoreDefaultAutoApproveRules": true,
    "github.copilot.chat.additionalReadAccessPaths": [
        "C:\\",
        "D:\\",
        "E:\\",
        "F:\\"
    ],
    "github.copilot.enable": {
        "*": false,
        "markdown": false,
        "plaintext": false
    },
    "github.copilot.nextEditSuggestions.enabled": false,
    "github.copilot.nextEditSuggestions.fixes": false,
    "python-envs.defaultEnvManager": "ms-python.python:conda",
    "python-envs.defaultPackageManager": "ms-python.python:conda",
    "python-envs.pythonProjects": []
}
@@ -0,0 +1,5 @@
{
    "python-envs.defaultEnvManager": "ms-python.python:conda",
    "python-envs.defaultPackageManager": "ms-python.python:conda",
    "python-envs.pythonProjects": []
}
50 changes: 50 additions & 0 deletions skills/competition/nineops-skills/CLAUDE.md
@@ -0,0 +1,50 @@
# ninetoothed DSL Operator Development: Agent Quick Reference

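> **Workflow guide**: When you receive an operator development task, first read `skill/SKILL.md` §0 (Workflow Overview, 工作流总览) in full, then work through the four phases: develop → test → diagnose → fix.
> The key templates are in `skill/templates/`; for troubleshooting, see `skill/references/failure_diagnosis.md`.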

## Key Lessons (from the Add / ReLU / GELU Implementations)

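### AST Tracing Pitfalls (the most important trap)
Code inside `application()` is traced through its AST and embedded verbatim into the generated Triton code, and the Triton environment has no access to the standard Python library.
- **Forbidden**: `math.*`, `torch.*`, `numpy.*` → use literal numeric values instead
- **Forbidden**: references to module-level variables (the variable name is embedded verbatim, which causes a `NameError`)
- **Forbidden**: the `**` operator (Triton tensors have no `__pow__`) → write `x * x * x`
- **Forbidden**: `ntl.tanh` (does not exist) → `(exp(t)-exp(-t))/(exp(t)+exp(-t))`
- **Allowed**: `ntl.*` functions, literal numeric values, and the four basic arithmetic operations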
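
The `ntl.tanh` replacement and the `x * x * x` rewrite can be sanity-checked on the host before they go into `application()`. The sketch below is plain Python, not kernel code: `math.exp` stands in for `ntl.exp`, and the helper names `tanh_from_exp`, `tanh_from_exp_stable`, and `cube` are hypothetical. The second tanh form is an algebraically equivalent rewrite offered as an assumption, not something taken from these notes; whether it behaves better for large `|t|` on the target hardware would need to be verified.

```python
import math


def tanh_from_exp(t):
    # Direct transcription of (exp(t)-exp(-t))/(exp(t)+exp(-t)).
    # math.exp is a host-side stand-in for ntl.exp (assumption).
    return (math.exp(t) - math.exp(-t)) / (math.exp(t) + math.exp(-t))


def tanh_from_exp_stable(t):
    # Equivalent form 1 - 2/(exp(2t)+1); a hypothetical alternative that
    # needs only one exponential.
    return 1.0 - 2.0 / (math.exp(2.0 * t) + 1.0)


def cube(x):
    # x ** 3 written with multiplication only, as the rule above requires.
    return x * x * x


if __name__ == "__main__":
    for t in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(t, tanh_from_exp(t), tanh_from_exp_stable(t), math.tanh(t))
```

For moderate inputs both forms should agree with `math.tanh` to within floating-point rounding.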

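### Non-Contiguous Tensor Support (key fix)
- **Don't** call `flatten()` → it discards the strides, so writes to a transposed tensor land in the wrong positions
- **Do** use `tile(tuple(1 for _ in range(ndim-1)) + (block_size,))` → this preserves the strides
- The `ndim` in `Tensor(ndim)` must match the actual number of tensor dimensions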
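
To see why ignoring strides misplaces writes, it can help to model the memory offsets directly. The following is a minimal pure-Python sketch, not ninetoothed code; `memory_offsets` is a hypothetical helper, and it only models the symptom rather than what `flatten()` does internally.

```python
from itertools import product


def memory_offsets(shape, strides):
    # Offsets (in elements) visited when walking the logical tensor in
    # row-major order, given per-dimension strides.
    return [
        sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in shape))
    ]


# A contiguous 2x3 tensor has strides (3, 1).
contiguous = memory_offsets((2, 3), (3, 1))

# Its transpose is a 3x2 view over the same storage with strides (1, 3).
transposed = memory_offsets((3, 2), (1, 3))

# A kernel that assumes a flat, contiguous layout would write logical
# element k to offset k, which is wrong for the transposed view.
flat_assumption = list(range(6))

print(contiguous)                     # [0, 1, 2, 3, 4, 5]
print(transposed)                     # [0, 3, 1, 4, 2, 5]
print(transposed == flat_assumption)  # False
```

Tiling only the last dimension keeps each dimension's stride available to the generated code, which is the point of the `tile(...)` rule above.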

### General Arrangement Pattern for Element-Wise Operators
```python
def _element_wise_arrangement(*tensors, block_size):
    # All non-scalar tensors must share the same rank; 0-d tensors pass through.
    ndim = max(tensor.ndim for tensor in tensors)
    assert all(tensor.ndim == ndim or tensor.ndim == 0 for tensor in tensors)
    # Tile only the last dimension so the original strides are preserved.
    tile_shape = tuple(1 for _ in range(ndim - 1)) + (block_size,)
    return tuple(
        tensor.tile(tile_shape) if tensor.ndim != 0 else tensor
        for tensor in tensors
    )
```
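
As a quick illustration of the tile shape this arrangement produces, here is a small host-side sketch. The helper names `element_wise_tile_shape` and `num_tiles` are hypothetical, and counting one tile per partial edge block is an assumption about how the tiling rounds up.

```python
import math


def element_wise_tile_shape(ndim, block_size):
    # Same expression as in _element_wise_arrangement.
    return tuple(1 for _ in range(ndim - 1)) + (block_size,)


def num_tiles(shape, tile_shape):
    # Ceil-divide each dimension by its tile extent and multiply.
    # Assumption: a partial tile at the edge still counts as one tile.
    return math.prod(-(-dim // tile) for dim, tile in zip(shape, tile_shape))


tile_shape = element_wise_tile_shape(ndim=3, block_size=1024)
print(tile_shape)                           # (1, 1, 1024)
print(num_tiles((4, 8, 3000), tile_shape))  # 4 * 8 * 3 = 96
```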

### GELU Implementation Notes
- **Approximate version**: `0.5*x*(1+tanh(sqrt(2/pi)*(x+0.044715*x^3)))`
  - `sqrt(2/pi)` = `0.7978845608028654` (written as a literal)
  - `x^3` = `x * x * x`
  - `tanh` = built by hand from `ntl.exp`
  - Test against: `torch.nn.functional.gelu(x, approximate='tanh')`
- **Exact version**: `x * 0.5 * (1 + erf(x / sqrt(2)))`
  - Uses `ntl.erf` and `ntl.sqrt`
  - Test against: `torch.nn.functional.gelu(x)`
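
Both formulas can be compared on the host with nothing but the standard library. This is a sketch for checking the arithmetic, not kernel code: `math.exp`, `math.erf`, and `math.sqrt` stand in for the corresponding `ntl.*` functions (an assumption), and the function names `gelu_tanh` and `gelu_exact` are hypothetical.

```python
import math


def gelu_tanh(x):
    # Only literals and arithmetic, mirroring what application() allows;
    # math.exp stands in for ntl.exp (assumption).
    t = 0.7978845608028654 * (x + 0.044715 * x * x * x)
    tanh_t = (math.exp(t) - math.exp(-t)) / (math.exp(t) + math.exp(-t))
    return 0.5 * x * (1.0 + tanh_t)


def gelu_exact(x):
    # math.erf and math.sqrt stand in for ntl.erf and ntl.sqrt (assumption).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, gelu_tanh(x), gelu_exact(x))
```

The two versions are close but not identical, which is why each one is tested against its own PyTorch reference.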

### Data Type Support
- fp32: atol=1e-5, rtol=1e-5
- fp16: atol=1e-3, rtol=1e-3 (expect some precision loss)
- bf16: similar to fp16
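
One way to keep these tolerances in a single place is a small lookup table. The sketch below is an assumption about how a test might be organized, not code from the skill package; `TOLERANCES` and `is_close` are hypothetical names, and the scalar check uses the criterion documented for `torch.allclose`.

```python
TOLERANCES = {
    "fp32": {"atol": 1e-5, "rtol": 1e-5},
    "fp16": {"atol": 1e-3, "rtol": 1e-3},
    # The notes only say bf16 is "similar to fp16"; reusing the fp16
    # values here is an assumption. bf16 has fewer mantissa bits than
    # fp16, so a looser tolerance may turn out to be necessary.
    "bf16": {"atol": 1e-3, "rtol": 1e-3},
}


def is_close(actual, expected, dtype):
    # |actual - expected| <= atol + rtol * |expected|
    tol = TOLERANCES[dtype]
    return abs(actual - expected) <= tol["atol"] + tol["rtol"] * abs(expected)


print(is_close(1.0005, 1.0, "fp16"))  # True
print(is_close(1.0005, 1.0, "fp32"))  # False
```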

### Broadcasting
- Create a stride=0 view with `expand_as`
- Triton handles the broadcast reads from HBM automatically
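
The effect of a stride=0 view can be modeled without a GPU. The following pure-Python sketch uses a hypothetical helper, `memory_offsets`, to show which stored elements an expanded view reads; it illustrates the stride arithmetic only and says nothing about how Triton schedules the loads.

```python
from itertools import product


def memory_offsets(shape, strides):
    # Offsets (in elements) visited when walking the logical tensor in
    # row-major order, given per-dimension strides.
    return [
        sum(i * s for i, s in zip(index, strides))
        for index in product(*(range(n) for n in shape))
    ]


# A bias of shape (3,) expanded to (2, 3): the new leading dimension gets
# stride 0, so every row reads the same three stored elements.
expanded = memory_offsets((2, 3), (0, 1))
print(expanded)  # [0, 1, 2, 0, 1, 2]
```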
131 changes: 131 additions & 0 deletions skills/competition/nineops-skills/README.md
@@ -0,0 +1,131 @@
# .skill — ninetoothed DSL Agent Workspace

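> This `.skill` workspace is a **Skill Package** for AI agents. It enables an agent to efficiently implement, test, benchmark, and diagnose GPU operators built on the **ninetoothed DSL**. All documentation, templates, scripts, and examples are bundled, so the agent can carry out the full workflow, from implementation to report, on its own.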

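## Overview

| Directory | Purpose |
|------|------|
| `references/` | DSL patterns, testing patterns, benchmark patterns, repo index, AOT guide, failure diagnosis |
| `scripts/` | Executable scripts for correctness testing, benchmarking, source inspection, and log collection |
| `templates/` | Templates for the agent task report, benchmark report, and failure diagnosis report |
| `examples/` | 4 complete example projects (source + benchmark) |
| `tests/` | Agent trigger prompts, self-check tasks, expected-output reference |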

## Quick Start

### 1. Implement an Operator

```bash
# 1a. Consult the DSL patterns
cat references/dsl_patterns.md

# 1b. Consult an existing example (e.g. element-wise addition)
cat examples/elementwise_broadcast_add/run.py

# 1c. Implement your own kernel
```

### 2. Run the Correctness Tests

```bash
scripts/run_correctness.sh examples/elementwise_broadcast_add
```

### 3. Run the Benchmark

```bash
scripts/run_benchmark.sh examples/elementwise_broadcast_add
```

### 4. Inspect the Generated Source

```bash
scripts/inspect_generated_source.sh examples/elementwise_broadcast_add/run.py
```

### 5. Collect Logs

```bash
python scripts/collect_task_log.py --dir . --output task_log.json
```

## File Structure

```
.skill/
├── README.md                             ← this document
├── SKILL.md                              ← complete DSL API reference
├── references/
│   ├── repo_index.md                     ← index of the ninetoothed repository layout
│   ├── dsl_patterns.md                   ← 7 DSL implementation patterns
│   ├── testing_patterns.md               ← 4-dimension test coverage strategy
│   ├── benchmark_patterns.md             ← 8-element benchmark design
│   ├── generated_source_and_aot.md       ← codegen inspection + AOT compilation
│   └── failure_diagnosis.md              ← diagnosis guide for 4 failure classes
├── scripts/
│   ├── validate_skill_package.py         ← structural integrity check
│   ├── run_correctness.sh                ← correctness test runner
│   ├── run_benchmark.sh                  ← benchmark runner
│   ├── inspect_generated_source.sh       ← generated source viewer
│   └── collect_task_log.py               ← task log collector
├── templates/
│   ├── operator_task_report_template.md  ← operator task report template
│   ├── benchmark_report_template.md      ← benchmark report template
│   └── failure_diagnosis_template.md     ← failure diagnosis template
├── examples/
│   ├── elementwise_broadcast_add/        ← addition kernel (elementwise_1d)
│   ├── reduction_softmax/                ← softmax kernel (reduction_2d)
│   ├── non_contiguous_stride_case/       ← non-contiguous stride test
│   └── performance_regression_case/      ← BLOCK_SIZE regression diagnosis
└── tests/
    ├── trigger_prompts.md                ← agent trigger prompts
    ├── selftest_tasks.md                 ← self-check tasks
    └── expected_outputs.md               ← expected-output reference
```

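## Agent Workflow

When the agent receives a request such as "implement an XX operator", the typical workflow is:

```
1. Understand the request ──→ open references/dsl_patterns.md and match a pattern
2. Review the template    ──→ open templates/operator_task_report_template.md
3. Study the examples     ──→ look at implementations of the same pattern under examples/
4. Implement the code     ──→ write kernel.py + run.py + benchmark.py
5. Correctness testing    ──→ verify with scripts/run_correctness.sh
6. Benchmark              ──→ compare performance with scripts/run_benchmark.sh
7. Inspect the source     ──→ check with scripts/inspect_generated_source.sh
8. Failure diagnosis      ──→ (on errors) consult failure_diagnosis.md
9. Generate the report    ──→ fill in operator_task_report_template to finish
```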
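
The scripted steps (5 to 7, plus log collection) could be chained from a small driver. The sketch below only builds the command lines; the script paths and arguments are the ones shown in Quick Start, while the function name `workflow_commands` and the idea of running them in this fixed order are assumptions.

```python
import shlex


def workflow_commands(example_dir):
    # Script paths come from the Quick Start section; chaining them in
    # this order is an assumption, not a documented entry point.
    return [
        ["scripts/run_correctness.sh", example_dir],
        ["scripts/run_benchmark.sh", example_dir],
        ["scripts/inspect_generated_source.sh", f"{example_dir}/run.py"],
        ["python", "scripts/collect_task_log.py",
         "--dir", ".", "--output", "task_log.json"],
    ]


for command in workflow_commands("examples/elementwise_broadcast_add"):
    print(shlex.join(command))
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(command, check=True)` so that a failing step stops the sequence.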

## Requirements

- Python 3.10+
- PyTorch 2.0+ (CUDA)
- ninetoothed (git@github.com:QuantumIntelligence/ninetoothed.git)
- NVIDIA GPU with CUDA support

## Structure Validation

```bash
python scripts/validate_skill_package.py
```

Expected output:
```
✅ .skill structure OK (所有 5 个目录和核心文件均存在)
```

## License

Internal — Qiyuan Competition