Explicit cleanup #2756

Draft
hubert-marek wants to merge 12 commits into feat/ephemeral-mm-pixels from explicit-cleanup

@hubert-marek

Contributor

No description provided.

andre-fu and others added 12 commits June 5, 2026 20:21
The aarch64 host install path was broken: `uv sync` installs flash-attn
from PyPI source but pyproject sets FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE,
so the compiled extension never builds. `scripts/docker-arm64-post-install.sh`
fixed it for Docker GB200 builds but hardcoded sm_100 and /app/.venv,
leaving Hopper hosts (H100/H200/GH200) without a recipe.

Changes:
- `scripts/docker-arm64-post-install.sh`: auto-detect compute capability
  via nvidia-smi when available; parameterize venv path. Preserves the
  sm_100 default when no GPU is visible (Docker buildx).
- `scripts/install.sh`: call the post-install for aarch64 hosts after
  `uv sync --all-extras`. Previously the script ran uv sync and exited,
  leaving aarch64 users with a broken venv.
- `README.md`: document the aarch64 post-install step (mirrors the
  existing 3.1 Flash Attention 3 pattern).
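
The auto-detection described above could look roughly like the following. This is a Python sketch of the idea only, not the contents of `scripts/docker-arm64-post-install.sh` (which is a shell script); the function names `arch_from_compute_cap` and `detect_arch` are hypothetical, and it assumes a driver recent enough that `nvidia-smi --query-gpu=compute_cap` is supported.

```python
import shutil
import subprocess

# Fallback named in the commit message for GPU-less builds (Docker buildx).
DEFAULT_ARCH = "sm_100"


def arch_from_compute_cap(text, default=DEFAULT_ARCH):
    """Map nvidia-smi compute_cap output (e.g. "9.0") to an "sm_XY" string.

    Uses the first non-empty line; returns `default` if nothing parses.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return default
    major, _, minor = lines[0].partition(".")
    if not (major.isdigit() and minor.isdigit()):
        return default
    return f"sm_{major}{minor}"


def detect_arch(default=DEFAULT_ARCH):
    """Query nvidia-smi when it is on PATH; otherwise keep the default."""
    if shutil.which("nvidia-smi") is None:
        return default
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver missing, query unsupported, or the call timed out.
        return default
    return arch_from_compute_cap(result.stdout, default)


if __name__ == "__main__":
    print(detect_arch())
```

On a Hopper host the query reports `9.0`, which this sketch maps to `sm_90`; with no GPU visible it falls back to `sm_100`, matching the behavior the commit says is preserved.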

Validated on GH200 (sm_90, aarch64):
- forward + backward parity vs torch SDPA (max diff < 0.05 / 0.25)
- 383/384 unit tests pass (the 1 failure is unrelated TileLang/MoE)
- SFT trainer smoke test (5 steps, Qwen3-0.6B) runs with flash_attention_2

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(fp8): fuse transpose into the dX block-fp8 weight cast

The dX backward built `weight.transpose(0, 1).contiguous()` and re-cast it to
fp8 every step, materializing a full bf16 transpose buffer plus an extra
read/write pass. Add `per_block_cast_to_fp8_tp_triton`, which produces the
block-fp8 of `weight.T` directly by reusing the existing per-block kernel with
swapped output/scale strides — no intermediate buffer.

128x128 block quantization is transpose-symmetric, so the result is
bit-identical to casting the materialized transpose; DeepGEMM receives an
identical B tensor. Verified byte-for-byte across shapes; ~14x faster on a
4096x4096 weight (373 -> 27 us).
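
The transpose-symmetry argument can be checked on a toy example. The sketch below is pure Python and is not the Triton kernel from this PR: `block_cast` is a hypothetical helper, the block size is 2 instead of 128, and integer rounding stands in for the fp8 cast. It only illustrates why writing through swapped output/scale strides gives the same buffers as casting a materialized transpose.

```python
BLOCK = 2     # stand-in for the 128x128 blocks used in the commit
QMAX = 448.0  # largest finite float8_e4m3fn value, used to derive scales


def block_cast(w, out_stride, scale_stride):
    """Quantize `w` (list of rows) per BLOCK x BLOCK tile.

    Element (i, j) is written to out[i * out_stride[0] + j * out_stride[1]],
    and the scale of tile (bi, bj) to the analogous position in `scales`.
    Rounding to an integer is an assumption standing in for the fp8 cast.
    """
    rows, cols = len(w), len(w[0])
    out = [0] * (rows * cols)
    scales = [0.0] * ((rows // BLOCK) * (cols // BLOCK))
    for bi in range(rows // BLOCK):
        for bj in range(cols // BLOCK):
            tile = [
                (i, j)
                for i in range(bi * BLOCK, (bi + 1) * BLOCK)
                for j in range(bj * BLOCK, (bj + 1) * BLOCK)
            ]
            amax = max(abs(w[i][j]) for i, j in tile)
            scale = amax / QMAX if amax > 0 else 1.0
            scales[bi * scale_stride[0] + bj * scale_stride[1]] = scale
            for i, j in tile:
                out[i * out_stride[0] + j * out_stride[1]] = round(w[i][j] / scale)
    return out, scales


R, C = 4, 6
w = [[float((i * 7 + j * 3) % 11 - 5) for j in range(C)] for i in range(R)]

# Reference path: materialize the C x R transpose, then cast it row-major.
w_t = [[w[i][j] for i in range(R)] for j in range(C)]
direct = block_cast(w_t, out_stride=(R, 1), scale_stride=(R // BLOCK, 1))

# Fused path: cast `w` itself, but write through swapped strides.
fused = block_cast(w, out_stride=(1, R), scale_stride=(1, R // BLOCK))

print(fused == direct)
```

Each square tile of `w` holds the same elements as the mirrored tile of `w.T`, so both paths compute the same per-block maximum and scale; only the write addresses differ, and the swapped strides account for that.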

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* deslopified

* also add fused implementation for per-token

* Fix: skip tests on <Hopper

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
… source (#2733)

The [tool.uv.sources] override for prime-pydantic-config was being ignored
because it was only a transitive dependency (via prime-rl-configs). uv only
applies source overrides for packages that appear in project.dependencies.
Adding it as a direct dependency makes uv resolve from the local editable
path (deps/pydantic-config) instead of PyPI.
* orch improvements

* fixes
Remove the configs/private submodule (research-configs) and all
references to it throughout the codebase:

- Remove submodule from .gitmodules and git tracking
- Simplify install.sh: use plain git submodule update --init --recursive
  now that no private submodule can fail for users without access
- Update skills/install/SKILL.md to reflect simplified submodule init
- Remove configs/private/ entry from skills/configs/SKILL.md key files
- Simplify test_configs.py: no longer need to filter out private/ path
* update deps

* update deps

* update deps

10 participants