Explicit cleanup #2756
Draft
hubert-marek wants to merge 12 commits into
The aarch64 host install path was broken: `uv sync` installs flash-attn from PyPI source but pyproject sets FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE, so the compiled extension never builds. `scripts/docker-arm64-post-install.sh` fixed it for Docker GB200 builds but hardcoded sm_100 and /app/.venv, leaving Hopper hosts (H100/H200/GH200) without a recipe.

Changes:
- `scripts/docker-arm64-post-install.sh`: auto-detect compute capability via nvidia-smi when available; parameterize venv path. Preserves the sm_100 default when no GPU is visible (Docker buildx).
- `scripts/install.sh`: call the post-install for aarch64 hosts after `uv sync --all-extras`. Previously the script ran uv sync and exited, leaving aarch64 users with a broken venv.
- `README.md`: document the aarch64 post-install step (mirrors the existing 3.1 Flash Attention 3 pattern).

Validated on GH200 (sm_90, aarch64):
- forward + backward parity vs torch SDPA (max diff < 0.05 / 0.25)
- 383/384 unit tests pass (the 1 failure is unrelated TileLang/MoE)
- SFT trainer smoke test (5 steps, Qwen3-0.6B) runs with flash_attention_2

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
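For reviewers who want to see the detection logic in isolation, here is a minimal Python sketch of the "auto-detect compute capability, else fall back to sm_100" behaviour described above. It is not the code in `scripts/docker-arm64-post-install.sh`: the names `parse_compute_cap`, `detect_arch`, and `DEFAULT_ARCH` are hypothetical, and the exact `nvidia-smi` query the script issues is assumed to be `--query-gpu=compute_cap`.

```python
import shutil
import subprocess

# Fallback named in the commit message: sm_100 when no GPU is visible
# (for example inside a Docker buildx build).
DEFAULT_ARCH = "100"


def parse_compute_cap(output: str, default: str = DEFAULT_ARCH) -> str:
    """Turn nvidia-smi output such as '9.0' into an arch string such as '90'."""
    for line in output.splitlines():
        major, dot, minor = line.strip().partition(".")
        if dot and major.isdigit() and minor.isdigit():
            return major + minor  # use the first GPU that reports a value
    return default


def detect_arch(default: str = DEFAULT_ARCH) -> str:
    """Query nvidia-smi if it exists; otherwise keep the default arch."""
    if shutil.which("nvidia-smi") is None:
        return default
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        # Driver missing, no device, or the query timed out: keep the default.
        return default
    return parse_compute_cap(result.stdout, default)


if __name__ == "__main__":
    print(f"sm_{detect_arch()}")
```

Keeping the parsing separate from the subprocess call means the fallback path can be exercised on machines without a GPU.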
* feat(fp8): fuse transpose into the dX block-fp8 weight cast

  The dX backward built `weight.transpose(0, 1).contiguous()` and re-cast it to fp8 every step, materializing a full bf16 transpose buffer plus an extra read/write pass.

  Add `per_block_cast_to_fp8_tp_triton`, which produces the block-fp8 of `weight.T` directly by reusing the existing per-block kernel with swapped output/scale strides, with no intermediate buffer. 128x128 block quantization is transpose-symmetric, so the result is bit-identical to casting the materialized transpose; DeepGEMM receives an identical B tensor.

  Verified byte-for-byte across shapes; ~14x faster on a 4096x4096 weight (373 -> 27 us).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* deslopified
* also add fused implementation for per-token
* Fix: skip tests on <Hopper

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: S1ro1 <matej.sirovatka@gmail.com>
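The transpose-symmetry claim in the first commit can be checked on a toy example. The sketch below is plain Python, not the Triton kernel: `block_quantize` and `transpose` are hypothetical helpers, integer rounding to ±127 stands in for the fp8 cast, and a 4x4 block stands in for 128x128. It relies only on the blocks being square, so that block (i, j) of `w` and block (j, i) of `w.T` hold the same values and therefore share one absmax scale.

```python
import random


def block_quantize(w, block=4, qmax=127):
    """Per-block absmax quantization of a 2-D list; returns (q, scales)."""
    rows, cols = len(w), len(w[0])
    assert rows % block == 0 and cols % block == 0
    q = [[0] * cols for _ in range(rows)]
    scales = [[0.0] * (cols // block) for _ in range(rows // block)]
    for bi in range(rows // block):
        for bj in range(cols // block):
            r0, c0 = bi * block, bj * block
            cells = [(r, c) for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            amax = max(abs(w[r][c]) for r, c in cells)
            scale = amax / qmax if amax > 0 else 1.0  # one scale per block
            scales[bi][bj] = scale
            for r, c in cells:
                q[r][c] = round(w[r][c] / scale)
    return q, scales


def transpose(m):
    return [list(col) for col in zip(*m)]


random.seed(0)
w = [[random.uniform(-1.0, 1.0) for _ in range(12)] for _ in range(8)]

# Path 1: materialize the transpose, then quantize it.
q_t, s_t = block_quantize(transpose(w))
# Path 2: quantize the original, then only swap the indexing.
q, s = block_quantize(w)

symmetric = q_t == transpose(q) and s_t == transpose(s)
print(symmetric)
```

Path 2 is the analogue of swapping output and scale strides: the quantized values and scales are the same numbers, written to transposed positions.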
… source (#2733) The [tool.uv.sources] override for prime-pydantic-config was being ignored because it was only a transitive dependency (via prime-rl-configs). uv only applies source overrides for packages that appear in project.dependencies. Adding it as a direct dependency makes uv resolve from the local editable path (deps/pydantic-config) instead of PyPI.
* orch improvements
* fixes
Remove the configs/private submodule (research-configs) and all references to it throughout the codebase:
- Remove submodule from .gitmodules and git tracking
- Simplify install.sh: use plain git submodule update --init --recursive now that no private submodule can fail for users without access
- Update skills/install/SKILL.md to reflect simplified submodule init
- Remove configs/private/ entry from skills/configs/SKILL.md key files
- Simplify test_configs.py: no longer need to filter out private/ path
* update deps
* update deps
* update deps
No description provided.