Add Ascend NPU support for the Domino training pipeline (Qwen3.5-4B) #584

Merged: jiapingW merged 12 commits into sgl-project:main from curnane-lab:domino_npu on Jun 24, 2026

@curnane-lab curnane-lab commented Jun 18, 2026

Summary

This PR brings the Domino training pipeline up on Ascend NPU. The adaptation is intentionally narrow: it keeps the existing CUDA path byte-for-byte identical and only swaps in NPU equivalents when the active accelerator is detected as npu. No new public APIs are added to the user-facing training entrypoint.

Scope

| Area | Change |
| --- | --- |
| specforge/modeling/target/target_head.py | Accelerator-aware device + text_config fallback |
| specforge/core/domino.py, specforge/core/dflash.py | NPU bf16 GRU workaround, FlexAttention guard |
| scripts/train_domino.py | Replace hard-coded cuda literals |
| configs/qwen3.5-4b-domino.json, examples/run_qwen3.5_4b_domino_npu.sh | Qwen3.5-4B Domino draft config and Ascend launcher |

Design notes

Device / backend resolution

A small helper pair lives in specforge.utils:

get_device_type() -> str          # cuda | npu | cpu
get_local_device() -> torch.device

Resolution order is SPECFORGE_DEVICE env var ➜ torch.cuda ➜ torch.npu ➜ cpu. init_distributed() consumes these helpers and selects nccl / hccl / gloo accordingly; init_device_mesh and DeviceMesh.from_group use the resolved device type instead of a hard-coded "cuda". A SPECFORGE_DIST_BACKEND env var is honored as an explicit override.
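The resolution order can be sketched roughly as follows. This is a minimal illustration, not the code in specforge.utils: the function names `resolve_device_type` and `resolve_dist_backend` are hypothetical, and the availability probes are passed in as plain booleans so the logic can be read (and run) without any accelerator installed.

```python
import os

# Device type -> torch.distributed backend, as described in the text above.
_BACKEND_BY_DEVICE = {"cuda": "nccl", "npu": "hccl", "cpu": "gloo"}


def resolve_device_type(env=None, cuda_available=False, npu_available=False):
    """Hypothetical helper: env override first, then cuda, then npu, then cpu."""
    env = os.environ if env is None else env
    override = env.get("SPECFORGE_DEVICE", "").strip().lower()
    if override:
        return override
    if cuda_available:
        return "cuda"
    if npu_available:
        return "npu"
    return "cpu"


def resolve_dist_backend(device_type, env=None):
    """Hypothetical helper: explicit override wins, otherwise map by device."""
    env = os.environ if env is None else env
    override = env.get("SPECFORGE_DIST_BACKEND", "").strip().lower()
    if override:
        return override
    # Assumption: unknown device types fall back to gloo.
    return _BACKEND_BY_DEVICE.get(device_type, "gloo")
```

In a real process the two booleans would presumably come from `torch.cuda.is_available()` and, when torch_npu is installed, `torch.npu.is_available()`. Checking CUDA first is what keeps the existing CUDA path unchanged on machines where both are visible.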

yunchang is imported softly because the package is currently CUDA only. When it is unavailable the seq-parallel groups fall back to the draft SP group; set_seq_parallel_pg becomes a no-op.
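A soft import of this kind might look like the sketch below. The helper names `soft_import`, `set_seq_parallel_pg_or_noop`, and `pick_seq_parallel_group` are hypothetical and chosen for illustration; only the `yunchang` package name, `set_seq_parallel_pg`, and the "fall back to the draft SP group" behavior come from the description above.

```python
import importlib


def soft_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# On CUDA-only packages such as yunchang this is None on Ascend installs.
yunchang = soft_import("yunchang")


def set_seq_parallel_pg_or_noop(*args, **kwargs):
    """Forward to yunchang when present; otherwise do nothing."""
    if yunchang is None:
        return None
    return yunchang.set_seq_parallel_pg(*args, **kwargs)


def pick_seq_parallel_group(seq_parallel_group, draft_sp_group):
    """Assumed fallback: use the draft SP group when no dedicated group exists."""
    return seq_parallel_group if seq_parallel_group is not None else draft_sp_group
```

Catching only ImportError, rather than a bare Exception, keeps genuine failures inside an installed package visible instead of silently disabling sequence parallelism.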

Domino bf16 GRU workaround

Ascend's DynamicGRU kernel does not yet support bfloat16. To keep training in bf16 end-to-end without breaking FSDP's mixed-dtype guard, OnlineDominoModel constructs a private float16 nn.GRU sandbox and attaches it via object.__setattr__ so it is not registered as a submodule. On every forward we sync the prefix-GRU weights into the sandbox, run inference in fp16, and cast back to the input dtype. The overhead is a single weight copy per step.
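The sandbox pattern described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code in OnlineDominoModel: the class name `PrefixGRUWithSandbox`, the attribute `_gru_sandbox`, and the `sandbox_dtype` parameter are hypothetical, and the sketch does not address how gradients reach `prefix_gru`.

```python
import torch
from torch import nn


class PrefixGRUWithSandbox(nn.Module):
    """Hypothetical stand-in for the prefix-GRU part of a Domino-style model."""

    def __init__(self, input_size, hidden_size, sandbox_dtype=torch.float16):
        super().__init__()
        # Registered as usual, so it follows the module's dtype (e.g. bf16).
        self.prefix_gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.sandbox_dtype = sandbox_dtype

        sandbox = nn.GRU(input_size, hidden_size, batch_first=True).to(sandbox_dtype)
        sandbox.requires_grad_(False)
        # Bypass nn.Module.__setattr__ so the sandbox is NOT registered as a
        # submodule: it stays out of parameters(), state_dict(), and wrappers
        # that inspect registered parameters.
        object.__setattr__(self, "_gru_sandbox", sandbox)

    @torch.no_grad()
    def _sync_sandbox(self):
        # In-place copy; copy_ casts from the source dtype to the sandbox dtype.
        for dst, src in zip(self._gru_sandbox.parameters(), self.prefix_gru.parameters()):
            dst.copy_(src)

    def forward(self, x):
        self._sync_sandbox()
        out, h_n = self._gru_sandbox(x.to(self.sandbox_dtype))
        # Cast results back to the caller's dtype.
        return out.to(x.dtype), h_n.to(x.dtype)
```

Because the sandbox is unregistered, calls such as `.to(device)` or `.bfloat16()` on the outer module do not touch it, so it has to be created on (or moved to) the right device explicitly. The `sandbox_dtype` parameter exists only so the sketch can be exercised on backends where a float16 GRU kernel is unavailable.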

FlexAttention on NPU

FLEX_ATTENTION_AVAILABLE is forced to False at module-import time when running on NPU. OnlineDFlashModel.forward already raises a descriptive error when the backend is not available, which now also fires on NPU and tells the user to choose --attention-backend sdpa or --attention-backend eager.
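A guard of this shape could look roughly like the sketch below. The function names and the backend string "flex_attention" are assumptions made for illustration; the source only states that the flag is forced to False on NPU and that the error points users to sdpa or eager.

```python
def flex_attention_available(device_type, flex_importable=True):
    """Hypothetical flag computation: never available on NPU."""
    return flex_importable and device_type != "npu"


def require_attention_backend(attention_backend, device_type, flex_importable=True):
    """Raise a descriptive error when FlexAttention is requested but unusable."""
    # Assumption: the FlexAttention backend is selected by the name "flex_attention".
    if attention_backend == "flex_attention" and not flex_attention_available(
        device_type, flex_importable
    ):
        raise RuntimeError(
            f"FlexAttention is not available on device type '{device_type}'. "
            "Use --attention-backend sdpa or --attention-backend eager instead."
        )
    return attention_backend
```

Evaluating the flag once and failing at the point of use means an NPU run with an unsupported backend stops with an actionable message rather than a kernel-level error.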

TargetHead

TargetHead.from_pretrained now uses get_local_device() instead of .cuda(). The class also restores the upstream text_config / hidden_size / vocab_size fallback so VLM-style configs continue to load (the fallback is a no-op for plain text models such as Qwen3.5).
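The text_config fallback amounts to something like the following sketch. The helper names `resolve_text_config` and `head_dims` are hypothetical, and the numbers in the usage example are placeholders rather than real model dimensions.

```python
from types import SimpleNamespace


def resolve_text_config(config):
    """Return config.text_config when present (VLM-style), else config itself."""
    text_config = getattr(config, "text_config", None)
    return text_config if text_config is not None else config


def head_dims(config):
    """Read the sizes a target head needs from whichever config carries them."""
    text_config = resolve_text_config(config)
    return text_config.hidden_size, text_config.vocab_size


# Placeholder values, for illustration only.
plain_config = SimpleNamespace(hidden_size=8, vocab_size=32)
vlm_config = SimpleNamespace(text_config=SimpleNamespace(hidden_size=16, vocab_size=64))
plain_dims = head_dims(plain_config)
vlm_dims = head_dims(vlm_config)
```

For a config without a text_config attribute the lookup returns the config unchanged, which is why the fallback has no effect on plain text models.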

Qwen3.5-4B example

configs/qwen3.5-4b-domino.json matches the public Qwen3.5-4B architecture (intermediate_size: 9216, target_layer_ids: [3, 11, 19, 27, 31] — the four full-attention layers plus the final layer). The sample launcher examples/run_qwen3.5_4b_domino_npu.sh exposes TARGET_MODEL_PATH / TRAIN_DATA_PATH as required env vars, exports ASCEND_RT_VISIBLE_DEVICES and a conservative PYTORCH_NPU_ALLOC_CONF, and passes --attention-backend sdpa.

Compatibility

  • CUDA path is unchanged. get_device_type() returns cuda whenever torch.cuda.is_available() is true, and every code path that used to read "cuda" literals continues to do so.
  • The PR does not modify any DFlash-only training code path, configs, or launchers.
  • No new dependencies are introduced.

How to test

# CUDA (smoke)
bash examples/run_qwen3_8b_domino_online.sh

# Ascend NPU (Qwen3.5-4B)
export TARGET_MODEL_PATH=/path/to/Qwen3.5-4B
export TRAIN_DATA_PATH=/path/to/dataset
bash examples/run_qwen3.5_4b_domino_npu.sh

Accuracy Test

Hardware

8× Ascend 910C NPU

Target model

Qwen3.5-4B

Training data preparation

Qwen3.5-4B (target model) with the HF backend is used to regenerate responses for 10K samples from Open-PerfectBlend. The regenerated QA pairs serve as training data for both the DFlash and Domino draft models, keeping the data pipeline identical across the two baselines.

Baseline: DFlash on NPU (PR #562 + #559)

The DFlash NPU baseline uses the existing qwen3.5-4b-dflash.json config and the HF backend with SDPA attention.
[Figure: dflash_npu]

This PR: Domino on NPU

The Domino NPU run uses configs/qwen3.5-4b-domino.json added in this PR, with the same target model, regenerated 10K dataset, and comparable hyperparameters.
[Figure: domino_npu]

Head-to-head comparison

The following figure overlays the accuracy / loss / learning rate curves of Domino NPU (this PR) and DFlash NPU (baseline) on the same axes.
[Figure: Domino NPU vs. DFlash NPU, overlaid accuracy / loss / learning rate curves]

Summary

Under the same 10K regenerated Open-PerfectBlend data, Qwen3.5-4B target model, and identical LR schedule, Domino NPU (red) consistently outperforms the DFlash NPU baseline (blue): it achieves a lower final training loss (~1.4 vs ~2.1) and a substantially higher accuracy (~0.65 vs ~0.45) after ~5,500 steps.

Commits

  1. feat(target): use accelerator-aware device for TargetHead loading
  2. feat(domino): work around Ascend NPU bf16 GRU and flex_attention
  3. feat(scripts): make train_domino device-aware
  4. feat(qwen3.5): add 4B Domino draft config and Ascend launcher

Add device-type / local-device helpers in specforge.utils and route process-group bring-up through them. Backend resolution is hccl on NPU, nccl on CUDA, gloo otherwise; init_device_mesh and DeviceMesh.from_group now use the active accelerator instead of a hard-coded 'cuda'. yunchang is imported softly so installs without it (e.g. Ascend) still work; missing seq-parallel groups fall back to the draft SP group.

SPECFORGE_DEVICE and SPECFORGE_DIST_BACKEND environment variables are honored as overrides for testing.

TargetHead.from_pretrained now moves the head to the local accelerator returned by get_local_device(), supporting CUDA and Ascend NPU. The text_config / hidden_size / vocab_size attributes also fall back to config.text_config when present so that VLM-style configurations continue to load.

Ascend's DynamicGRU kernel does not currently support bfloat16, so OnlineDominoModel mirrors prefix_gru into a private float16 nn.GRU sandbox at construction time. Weights are kept in sync via in-place copy on each forward and outputs are cast back to the input dtype. The fp16 module is attached via object.__setattr__ to avoid being registered as a submodule, which would otherwise trip FSDP's mixed-dtype guard.

FlexAttention is also unsupported on NPU; FLEX_ATTENTION_AVAILABLE is forced to False at import time so OnlineDFlashModel raises a clear error instructing users to pick --attention-backend sdpa or eager.

Replace .cuda() and device='cuda' literals with get_local_device() so the script runs on whichever accelerator is active.

Adds a Domino draft config matching Qwen3.5-4B (intermediate_size 9216, target_layer_ids on the four full-attention layers) and an example shell launcher that exports the Ascend RT envs, sets a conservative max_split_size, and forces --attention-backend sdpa.

Two trivial fixes surfaced by pre-commit run --all-files: add the blank lines that black 24.10.0 expects in specforge/distributed.py and specforge/utils.py, and mark examples/run_qwen3.5_4b_domino_npu.sh as executable so check-shebang-scripts-are-executable accepts the #!/bin/bash header.

Qwen3.5-4B uses Qwen3_5ForConditionalGeneration where the language tower lives under model.language_model.* rather than model.* directly. The previous --embedding-key value caused a KeyError against model.safetensors.index.json at startup. Match the convention already used by examples/run_qwen3.5_35b_a3b_*_online.sh.

Drop the orphan step-number comments (sgl-project#3, sgl-project#4), shrink the helper docstrings that explained reviewer-only context, and tighten the yunchang try/except (ImportError instead of Exception, drop type:ignore and pragma:no-cover noise) so the file reads closer to the original distributed.py.

curnane-lab commented Jun 24, 2026

Hi @claude @jiapingW @wenqf11 , would you be up for reviewing this PR?

@jiapingW jiapingW self-requested a review June 24, 2026 08:46
@jiapingW jiapingW merged commit 61f9cb0 into sgl-project:main Jun 24, 2026
2 checks passed