Add Ascend NPU support for the Domino training pipeline (Qwen3.5-4B)#584
Merged
Conversation
Add device-type / local-device helpers in specforge.utils and route process-group bring-up through them. Backend resolution is hccl on NPU, nccl on CUDA, gloo otherwise; init_device_mesh and DeviceMesh.from_group now use the active accelerator instead of a hard-coded 'cuda'. yunchang is imported softly so installs without it (e.g. Ascend) still work; missing seq-parallel groups fall back to the draft SP group. SPECFORGE_DEVICE and SPECFORGE_DIST_BACKEND environment variables are honored as overrides for testing.
TargetHead.from_pretrained now moves the head to the local accelerator returned by get_local_device(), supporting CUDA and Ascend NPU. The text_config / hidden_size / vocab_size attributes also fall back to config.text_config when present so that VLM-style configurations continue to load.
Ascend's DynamicGRU kernel does not currently support bfloat16, so OnlineDominoModel mirrors prefix_gru into a private float16 nn.GRU sandbox at construction time. Weights are kept in sync via in-place copy on each forward and outputs are cast back to the input dtype. The fp16 module is attached via object.__setattr__ to avoid being registered as a submodule, which would otherwise trip FSDP's mixed-dtype guard. FlexAttention is also unsupported on NPU; FLEX_ATTENTION_AVAILABLE is forced to False at import time so OnlineDFlashModel raises a clear error instructing users to pick --attention-backend sdpa or eager.
Replace .cuda() and device='cuda' literals with get_local_device() so the script runs on whichever accelerator is active.
Adds a Domino draft config matching Qwen3.5-4B (intermediate_size 9216, target_layer_ids on the four full-attention layers) and an example shell launcher that exports the Ascend RT envs, sets a conservative max_split_size, and forces --attention-backend sdpa.
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Two trivial fixes surfaced by pre-commit run --all-files: add the blank lines that black 24.10.0 expects in specforge/distributed.py and specforge/utils.py, and mark examples/run_qwen3.5_4b_domino_npu.sh as executable so check-shebang-scripts-are-executable accepts the #!/bin/bash header.
Qwen3.5-4B uses Qwen3_5ForConditionalGeneration where the language tower lives under model.language_model.* rather than model.* directly. The previous --embedding-key value caused a KeyError against model.safetensors.index.json at startup. Match the convention already used by examples/run_qwen3.5_35b_a3b_*_online.sh.
Drop the orphan step-number comments (sgl-project#3, sgl-project#4), shrink the helper docstrings that explained reviewer-only context, and tighten the yunchang try/except (ImportError instead of Exception, drop type:ignore and pragma:no-cover noise) so the file reads closer to the original distributed.py.
Contributor
Author
jiapingW
approved these changes
Jun 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR brings the Domino training pipeline up on Ascend NPU. The adaptation is intentionally narrow: it keeps the existing CUDA path byte-for-byte identical and only swaps in NPU equivalents when the active accelerator is detected as
npu. No new public APIs are added to the user-facing training entrypoint.Scope
specforge/modeling/target/target_head.pytext_configfallbackspecforge/core/domino.py,specforge/core/dflash.pyscripts/train_domino.pycudaliteralsconfigs/qwen3.5-4b-domino.json,examples/run_qwen3.5_4b_domino_npu.shDesign notes
Device / backend resolution
A small helper pair lives in
specforge.utils:Resolution order is
SPECFORGE_DEVICEenv var ➜torch.cuda➜torch.npu➜cpu.init_distributed()consumes these helpers and selectsnccl/hccl/glooaccordingly;init_device_meshandDeviceMesh.from_groupuse the resolved device type instead of a hard-coded"cuda". ASPECFORGE_DIST_BACKENDenv var is honored as an explicit override.yunchangis imported softly because the package is currently CUDA only. When it is unavailable the seq-parallel groups fall back to the draft SP group;set_seq_parallel_pgbecomes a no-op.Domino bf16 GRU workaround
Ascend's
DynamicGRUkernel does not yet support bfloat16. To keep training in bf16 end-to-end without breaking FSDP's mixed-dtype guard,OnlineDominoModelconstructs a private float16nn.GRUsandbox and attaches it viaobject.__setattr__so it is not registered as a submodule. On every forward we sync the prefix-GRU weights into the sandbox, run inference in fp16, and cast back to the input dtype. The overhead is a single weight copy per step.FlexAttention on NPU
FLEX_ATTENTION_AVAILABLEis forced toFalseat module-import time when running on NPU.OnlineDFlashModel.forwardalready raises a descriptive error when the backend is not available, which now also fires on NPU and tells the user to choose--attention-backend sdpaor--attention-backend eager.TargetHead
TargetHead.from_pretrainednow usesget_local_device()instead of.cuda(). The class also restores the upstreamtext_config/hidden_size/vocab_sizefallback so VLM-style configs continue to load (the fallback is a no-op for plain text models such as Qwen3.5).Qwen3.5-4B example
configs/qwen3.5-4b-domino.jsonmatches the public Qwen3.5-4B architecture (intermediate_size: 9216,target_layer_ids: [3, 11, 19, 27, 31]— the four full-attention layers plus the final layer). The sample launcherexamples/run_qwen3.5_4b_domino_npu.shexposesTARGET_MODEL_PATH/TRAIN_DATA_PATHas required env vars, exportsASCEND_RT_VISIBLE_DEVICESand a conservativePYTORCH_NPU_ALLOC_CONF, and passes--attention-backend sdpa.Compatibility
get_device_type()returnscudawhenevertorch.cuda.is_available()is true, and every code path that used to read"cuda"literals continues to do so.How to test
Accuracy Test
Hardware
8× Ascend 910C NPU
Target model
Qwen3.5-4B
Training data preparation
Qwen3.5-4B (target model) with the HF backend is used to regenerate responses for 10K samples from Open-PerfectBlend. The regenerated QA pairs serve as training data for both the DFlash and Domino draft models, keeping the data pipeline identical across the two baselines.
Baseline: DFlash on NPU (PR #562 + #559)
The DFlash NPU baseline uses the existing

qwen3.5-4b-dflash.jsonconfig and the HF backend with SDPA attention.This PR: Domino on NPU
The Domino NPU run uses

configs/qwen3.5-4b-domino.jsonadded in this PR, with the same target model, regenerated 10K dataset, and comparable hyperparameters.Head-to-head comparison
The following figure overlays the accuracy / loss / learning rate curves of Domino NPU (this PR) and DFlash NPU (baseline) on the same axes.

Summary
Under the same 10K regenerated Open-PerfectBlend data, Qwen3.5-4B target model, and identical LR schedule, Domino NPU (red) consistently outperforms the DFlash NPU baseline (blue): it achieves a lower final training loss (~1.4 vs ~2.1) and a substantially higher accuracy (~0.65 vs ~0.45) after ~5,500 steps.
Commits
feat(target): use accelerator-aware device for TargetHead loadingfeat(domino): work around Ascend NPU bf16 GRU and flex_attentionfeat(scripts): make train_domino device-awarefeat(qwen3.5): add 4B Domino draft config and Ascend launcher