Add Ascend NPU support for the Domino training pipeline (Qwen3.5-4B) by curnane-lab · Pull Request #584 · sgl-project/SpecForge

curnane-lab · 2026-06-18T02:54:30Z

Summary

This PR brings the Domino training pipeline up on Ascend NPU. The adaptation is intentionally narrow: it keeps the existing CUDA path byte-for-byte identical and only swaps in NPU equivalents when the active accelerator is detected as npu. No new public APIs are added to the user-facing training entrypoint.

Scope

Area	Change
`specforge/modeling/target/target_head.py`	Accelerator-aware device + `text_config` fallback
`specforge/core/domino.py`, `specforge/core/dflash.py`	NPU bf16 GRU workaround, FlexAttention guard
`scripts/train_domino.py`	Replace hard-coded `cuda` literals
`configs/qwen3.5-4b-domino.json`, `examples/run_qwen3.5_4b_domino_npu.sh`	Qwen3.5-4B Domino draft config and Ascend launcher

Design notes

Device / backend resolution

A small helper pair lives in specforge.utils:

get_device_type() -> str          # cuda | npu | cpu
get_local_device() -> torch.device

Resolution order is SPECFORGE_DEVICE env var ➜ torch.cuda ➜ torch.npu ➜ cpu. init_distributed() consumes these helpers and selects nccl / hccl / gloo accordingly; init_device_mesh and DeviceMesh.from_group use the resolved device type instead of a hard-coded "cuda". A SPECFORGE_DIST_BACKEND env var is honored as an explicit override.
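The resolution order above can be sketched in plain Python. This is a minimal illustration, not the code in specforge.utils: the function names `resolve_device_type` and `resolve_dist_backend` and the injected availability flags are assumptions made so the logic can be exercised without an accelerator. In the real helpers the flags would presumably come from `torch.cuda.is_available()` and its NPU counterpart.

```python
import os
from typing import Mapping, Optional

# Default process-group backend per device type, as described above.
_BACKENDS = {"cuda": "nccl", "npu": "hccl", "cpu": "gloo"}


def resolve_device_type(
    cuda_available: bool,
    npu_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return 'cuda', 'npu', or 'cpu': env override first, then CUDA, NPU, CPU."""
    env = os.environ if env is None else env
    override = env.get("SPECFORGE_DEVICE", "").strip().lower()
    if override:
        if override not in _BACKENDS:
            raise ValueError(f"unsupported SPECFORGE_DEVICE: {override!r}")
        return override
    if cuda_available:
        return "cuda"
    if npu_available:
        return "npu"
    return "cpu"


def resolve_dist_backend(
    device_type: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Return the backend name, honoring SPECFORGE_DIST_BACKEND when set."""
    env = os.environ if env is None else env
    override = env.get("SPECFORGE_DIST_BACKEND", "").strip().lower()
    return override or _BACKENDS[device_type]
```

Because CUDA is probed before NPU, a host where CUDA is available resolves to cuda without any override, which matches the claim that the CUDA path is unchanged.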

yunchang is imported softly because the package is currently CUDA-only. When it is unavailable, the seq-parallel groups fall back to the draft SP group and set_seq_parallel_pg becomes a no-op.
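A soft import with a fallback might look like the sketch below. The helper names `soft_import` and `resolve_seq_parallel_group`, and the wrapper around `set_seq_parallel_pg`, are hypothetical; they only illustrate the pattern and do not reproduce specforge/distributed.py.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


def soft_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


def resolve_seq_parallel_group(
    sp_module: Optional[Any], requested_group: Any, draft_sp_group: Any
) -> Any:
    """Fall back to the draft SP group when the optional package is missing."""
    if sp_module is None or requested_group is None:
        return draft_sp_group
    return requested_group


def set_seq_parallel_pg(sp_module: Optional[Any], *args: Any, **kwargs: Any) -> bool:
    """Hypothetical wrapper: a no-op returning False when the package is absent."""
    if sp_module is None:
        return False
    sp_module.set_seq_parallel_pg(*args, **kwargs)
    return True
```

Catching ImportError rather than Exception keeps genuine bugs inside the optional package visible instead of silently disabling sequence parallelism.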

Domino bf16 GRU workaround

Ascend's DynamicGRU kernel does not yet support bfloat16. To keep training in bf16 end-to-end without breaking FSDP's mixed-dtype guard, OnlineDominoModel constructs a private float16 nn.GRU sandbox and attaches it via object.__setattr__ so it is not registered as a submodule. On every forward we sync the prefix-GRU weights into the sandbox, run inference in fp16, and cast back to the input dtype. The overhead is a single weight copy per step.
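The sandbox pattern can be sketched as follows. This is an illustration under stated assumptions, not OnlineDominoModel itself: `PrefixEncoder` is a hypothetical stand-in for the owning module, `sandbox_dtype` is an illustrative parameter, and the sketch does not route gradients back into the prefix_gru weights (how the PR handles gradients is not stated in the description).

```python
import torch
from torch import nn


class PrefixEncoder(nn.Module):
    """Hypothetical stand-in for the module that owns prefix_gru."""

    def __init__(
        self,
        input_size: int,
        hidden_size: int,
        sandbox_dtype: torch.dtype = torch.float16,
    ):
        super().__init__()
        self.prefix_gru = nn.GRU(input_size, hidden_size, batch_first=True)
        sandbox = nn.GRU(input_size, hidden_size, batch_first=True).to(sandbox_dtype)
        sandbox.requires_grad_(False)
        # Bypass nn.Module.__setattr__ so the sandbox is a plain attribute:
        # it is absent from _modules, parameters(), and state_dict().
        object.__setattr__(self, "_gru_sandbox", sandbox)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sandbox = self._gru_sandbox
        first = next(sandbox.parameters())
        if first.device != x.device:
            # Unregistered modules are not moved by owner.to(...); move by hand.
            sandbox.to(x.device)
        with torch.no_grad():
            # One weight copy per step; copy_ converts dtype as needed.
            for dst, src in zip(sandbox.parameters(), self.prefix_gru.parameters()):
                dst.copy_(src)
        out, _ = sandbox(x.to(first.dtype))
        return out.to(x.dtype)
```

Since the sandbox is invisible to `parameters()`, a wrapper that inspects registered parameters only sees the original dtype, which is the property the workaround relies on.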

FlexAttention on NPU

FLEX_ATTENTION_AVAILABLE is forced to False at module-import time when running on NPU. OnlineDFlashModel.forward already raises a descriptive error when the backend is not available, which now also fires on NPU and tells the user to choose --attention-backend sdpa or --attention-backend eager.
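A rough sketch of the guard, assuming the backend is selected by a string and that the FlexAttention option is spelled `flex_attention` (that spelling and both function names are assumptions, not taken from the PR):

```python
def flex_attention_available(device_type: str, importable: bool) -> bool:
    """FlexAttention counts as available only off-NPU and when it imports."""
    return importable and device_type != "npu"


def check_attention_backend(backend: str, flex_available: bool) -> str:
    """Return the backend unchanged, or raise with the suggested alternatives."""
    # "flex_attention" is an assumed flag value used for illustration.
    if backend == "flex_attention" and not flex_available:
        raise RuntimeError(
            "flex_attention is not available on this accelerator; "
            "use --attention-backend sdpa or --attention-backend eager"
        )
    return backend
```

Evaluating the flag once at import time means the forward pass only needs a cheap boolean check before raising.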

TargetHead

TargetHead.from_pretrained now uses get_local_device() instead of .cuda(). The class also restores the upstream text_config / hidden_size / vocab_size fallback so VLM-style configs continue to load (the fallback is a no-op for plain text models such as Qwen3.5).
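The text_config fallback amounts to reading the dimensions from the nested config when one exists. A minimal sketch, with a hypothetical helper name and made-up dimension values:

```python
from types import SimpleNamespace
from typing import Any, Tuple


def resolve_text_dims(config: Any) -> Tuple[int, int]:
    """Return (hidden_size, vocab_size), preferring config.text_config if present."""
    text_config = getattr(config, "text_config", None) or config
    return text_config.hidden_size, text_config.vocab_size


# Illustrative configs; the numbers are placeholders, not real model sizes.
plain = SimpleNamespace(hidden_size=16, vocab_size=100)
vlm_style = SimpleNamespace(text_config=SimpleNamespace(hidden_size=32, vocab_size=200))
```

For a config without a `text_config` attribute the helper reads the top-level fields directly, so plain text models are unaffected.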

Qwen3.5-4B example

configs/qwen3.5-4b-domino.json matches the public Qwen3.5-4B architecture (intermediate_size: 9216, target_layer_ids: [3, 11, 19, 27, 31] — the four full-attention layers plus the final layer). The sample launcher examples/run_qwen3.5_4b_domino_npu.sh exposes TARGET_MODEL_PATH / TRAIN_DATA_PATH as required env vars, exports ASCEND_RT_VISIBLE_DEVICES and a conservative PYTORCH_NPU_ALLOC_CONF, and passes --attention-backend sdpa.

Compatibility

CUDA path is unchanged. get_device_type() returns cuda whenever torch.cuda.is_available() is true, and every code path that used to read "cuda" literals continues to do so.
The PR does not modify any DFlash-only training code path, configs, or launchers.
No new dependencies are introduced.

How to test

# CUDA (smoke)
bash examples/run_qwen3_8b_domino_online.sh

# Ascend NPU (Qwen3.5-4B)
export TARGET_MODEL_PATH=/path/to/Qwen3.5-4B
export TRAIN_DATA_PATH=/path/to/dataset
bash examples/run_qwen3.5_4b_domino_npu.sh

Accuracy Test

Hardware

8× Ascend 910C NPU

Target model

Qwen3.5-4B

Training data preparation

Qwen3.5-4B (target model) with the HF backend is used to regenerate responses for 10K samples from Open-PerfectBlend. The regenerated QA pairs serve as training data for both the DFlash and Domino draft models, keeping the data pipeline identical across the two baselines.

Baseline: DFlash on NPU (PR #562 + #559)

The DFlash NPU baseline uses the existing qwen3.5-4b-dflash.json config and the HF backend with SDPA attention.

This PR: Domino on NPU

The Domino NPU run uses configs/qwen3.5-4b-domino.json added in this PR, with the same target model, regenerated 10K dataset, and comparable hyperparameters.

Head-to-head comparison

The following figure overlays the accuracy / loss / learning rate curves of Domino NPU (this PR) and DFlash NPU (baseline) on the same axes.

Summary

Under the same 10K regenerated Open-PerfectBlend data, Qwen3.5-4B target model, and identical LR schedule, Domino NPU (red) consistently outperforms the DFlash NPU baseline (blue): it achieves a lower final training loss (~1.4 vs ~2.1) and a substantially higher accuracy (~0.65 vs ~0.45) after ~5,500 steps.

Commits

Add device-type / local-device helpers in specforge.utils and route process-group bring-up through them. Backend resolution is hccl on NPU, nccl on CUDA, gloo otherwise; init_device_mesh and DeviceMesh.from_group now use the active accelerator instead of a hard-coded 'cuda'. yunchang is imported softly so installs without it (e.g. Ascend) still work; missing seq-parallel groups fall back to the draft SP group. SPECFORGE_DEVICE and SPECFORGE_DIST_BACKEND environment variables are honored as overrides for testing.

feat(target): use accelerator-aware device for TargetHead loading

TargetHead.from_pretrained now moves the head to the local accelerator returned by get_local_device(), supporting CUDA and Ascend NPU. The text_config / hidden_size / vocab_size attributes also fall back to config.text_config when present so that VLM-style configurations continue to load.

feat(domino): work around Ascend NPU bf16 GRU and flex_attention

Ascend's DynamicGRU kernel does not currently support bfloat16, so OnlineDominoModel mirrors prefix_gru into a private float16 nn.GRU sandbox at construction time. Weights are kept in sync via in-place copy on each forward and outputs are cast back to the input dtype. The fp16 module is attached via object.__setattr__ to avoid being registered as a submodule, which would otherwise trip FSDP's mixed-dtype guard. FlexAttention is also unsupported on NPU; FLEX_ATTENTION_AVAILABLE is forced to False at import time so OnlineDFlashModel raises a clear error instructing users to pick --attention-backend sdpa or eager.

feat(scripts): make train_domino device-aware

Replace .cuda() and device='cuda' literals with get_local_device() so the script runs on whichever accelerator is active.

feat(qwen3.5): add 4B Domino draft config and Ascend launcher

Adds a Domino draft config matching Qwen3.5-4B (intermediate_size 9216, target_layer_ids on the four full-attention layers) and an example shell launcher that exports the Ascend RT envs, sets a conservative max_split_size, and forces --attention-backend sdpa.

Two trivial fixes surfaced by pre-commit run --all-files: add the blank lines that black 24.10.0 expects in specforge/distributed.py and specforge/utils.py, and mark examples/run_qwen3.5_4b_domino_npu.sh as executable so check-shebang-scripts-are-executable accepts the #!/bin/bash header.

Qwen3.5-4B uses Qwen3_5ForConditionalGeneration where the language tower lives under model.language_model.* rather than model.* directly. The previous --embedding-key value caused a KeyError against model.safetensors.index.json at startup. Match the convention already used by examples/run_qwen3.5_35b_a3b_*_online.sh.
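One way to catch this kind of mismatch before launch is to look the key up in the checkpoint index, whose `weight_map` maps tensor names to shard files. The sketch below is illustrative only: the helper name and both candidate key names are assumptions derived from the `model.*` versus `model.language_model.*` layout described above, not the launcher's actual values.

```python
import json
from typing import Iterable


def find_embedding_key(index_json: str, candidates: Iterable[str]) -> str:
    """Return the first candidate present in the index's weight_map."""
    weight_map = json.loads(index_json)["weight_map"]
    candidates = list(candidates)
    for key in candidates:
        if key in weight_map:
            return key
    raise KeyError(f"none of {candidates} found in weight_map")


# Tiny illustrative index; real files list every tensor in the checkpoint.
example_index = json.dumps(
    {
        "weight_map": {
            "model.language_model.embed_tokens.weight": "model-00001-of-00002.safetensors"
        }
    }
)

# Assumed key names, for illustration only.
candidate_keys = [
    "model.embed_tokens.weight",
    "model.language_model.embed_tokens.weight",
]
```

Running such a check against model.safetensors.index.json would surface a wrong key as an explicit message instead of a KeyError at startup.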

Drop the orphan step-number comments (sgl-project#3, sgl-project#4), shrink the helper docstrings that explained reviewer-only context, and tighten the yunchang try/except (ImportError instead of Exception, drop type:ignore and pragma:no-cover noise) so the file reads closer to the original distributed.py.

curnane-lab · 2026-06-24T03:37:24Z

Hi @claude @jiapingW @wenqf11, would you be up for reviewing this PR?

mingliangfu added 5 commits June 18, 2026 10:44

feat(scripts): make train_domino device-aware

df88531

curnane-lab requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 18, 2026 02:54

mingliangfu added 5 commits June 18, 2026 11:08

fix(distributed): preserve upstream rank-to-device mapping on CUDA

8b0ab98

tune(examples): lower num-anchors to 16 for Qwen3.5-4B Domino launcher

316a40f

heiheiha798 mentioned this pull request Jun 23, 2026

[Feature] Support Domino inference on top of DFlash speculative decoding sgl-project/sglang#28977

Open

Merge branch 'main' into domino_npu

76bf6a5

align specforge/distributed.py with upstream main

c8c7774

jiapingW self-requested a review June 24, 2026 08:46

jiapingW approved these changes Jun 24, 2026

jiapingW merged commit 61f9cb0 into sgl-project:main Jun 24, 2026
2 checks passed

jiapingW merged 12 commits into sgl-project:main from curnane-lab:domino_npu