[Feature] Mopd (Multi-Teacher On-Policy distillation) supported #2051
Draft
leoyuppieqnew wants to merge 15 commits into
added 11 commits
June 10, 2026 15:16
- Add MOPD rollout module (slime/rollout/mopd.py) for multi-teacher distillation
- Add MOPD loss computation in megatron backend (KL divergence based)
- Add MOPD-related arguments (teacher config, distillation params)
- Add ray rollout integration for MOPD pipeline
- Add example scripts for Qwen3.5-35B-A3B MOPD training
- Add README documentation for MOPD feature
- Add unit tests for MOPD functionality

- Extend MOPD loss computation to support full vocabulary KL divergence
- Add parameterized distillation mode selection (token-level vs full-vocab)
- Add ppo_utils helpers for full vocabulary logits processing
- Modify model.py to support output_all_logits mode
- Add example script for full-vocab megatron training
- Add comprehensive unit tests for full vocabulary distillation
- Update README with full vocabulary distillation documentation

- Implement TopK token selection for efficient distillation loss computation
- Add TopK-related arguments (topk_tokens, topk_temperature)
- Add 397B model startup script (scripts/models/qwen3.5-397B-A17B.sh)
- Extend ppo_utils with TopK logits extraction and processing
- Update full-vocab megatron script with TopK options
- Extend tests for TopK distillation mode

- Add SGLang-based teacher rollout pipeline (separate from Megatron in-process mode)
- Implement HTTP-based teacher logprobs collection during rollout
- Add MOPD teacher URL configuration via environment variables
- Fix logits calculation bug in TopK mode
- Fix bad teacher request handling with retry logic
- Improve MOPD rollout logging and monitoring
- Add 397B model example scripts (megatron and sglang modes)
- Add README_zh.md with Chinese documentation
- Add comprehensive SGLang TopK pipeline integration tests

- Add Qwen3.5 VL MoE megatron bridge plugin (qwen35_vl_moe.py)
- Add multimodal input handling in MOPD rollout pipeline
- Add visual input processing with image token support
- Fix fused experts computation for VL MoE architecture
- Fix VL MoE model conversion (HF <-> torch_dist)
- Add 35B-A3B multimodal TopK SGLang training example script
- Register VL MoE bridge in megatron_bridge plugin

- Add Qwen3.5 MoE bridge conversion support in mbridge plugin
- Add parallel distributed conversion tool (convert_torch_dist_to_hf_parallel.py)
- Add merge_missing_keys.py for handling partial checkpoint merges
- Fix megatron_to_hf conversion for Qwen3.5 architecture
- Fix convert_torch_dist_to_hf_bridge.py quantization support

- Fix loss becoming inf due to numerical instability in KL computation
- Fix padding vocab size handling in actor forward pass
- Add train-memory-margin-bytes argument for memory management
- Add attention gate patching tool for distributed checkpoints
- Add safety checks for logits with padding tokens
- Update 397B SGLang script with stability improvements

- Add non-colocate mode support in update_weight_from_distributed.py (separate actor training GPUs from SGLang rollout GPUs)
- Add HfWeightIteratorBridge support for Megatron-to-HF conversion in weight update pipeline (supports VL MoE models)
- Switch 397B model script to use bridge mode for megatron-to-hf
- Update 397B SGLang script for non-colocate deployment
- Update 35B script with optimized parallelism settings

- Add GUIDE_qwen35_moe_mopd.md with detailed usage documentation
- Cover MOPD workflow, distillation modes (TopK, full-vocab)
- Document SGLang teacher server setup and configuration
- Document multi-teacher domain routing and hyperparameters
- Include troubleshooting and FAQ sections

- Fix filter_long_prompt to correctly process multimodal inputs when apply_chat_template has already converted prompt to a string
- Add 'messages' field to Sample dataclass to preserve raw message list for multimodal processing after chat template application
- Ensures vision info (images) can be extracted from original messages even when prompt has been templated

Remove megatron-mode and full-vocab example scripts that have known OOM problems. Keep only the validated SGLang TopK scripts:
- run-qwen35-397B-A17B-mopd-topk-sglang.sh
- run-qwen35-35B-A3B-mopd-topk-sglang.sh
added 4 commits
June 11, 2026 11:29
# Conflicts:
#   slime/backends/megatron_utils/data.py
#   slime/backends/megatron_utils/loss.py
#   slime/backends/megatron_utils/model.py
#   slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py
#   slime/ray/rollout.py
#   slime/utils/types.py

- Add strict=False to zip() calls (B905)
- Rename unused loop variable domain to _domain (B007)
- Add from err to raise in except clause (B904)
- Remove unused variables process_group, k, last_file_idx/name (F841)
- Add noqa: F841 for intentionally assigned test variables
- Replace assert False with raise AssertionError (B011)
- Rename ambiguous variable l to v (E741)
- Apply black formatting to all modified files

1. Fix RuntimeError in test_topk_kl_identical_distributions: kl.item() fails on multi-element tensor, changed to (kl >= -0.1).all()
2. Add missing vocab_size=20 to TestApplyMopdTopkToLoss._make_args(): apply_mopd_topk_to_loss requires args.vocab_size which was not set
Author
@zhuzilin Hello! Would you mind helping review it when you’re available? I’d really appreciate it. Thank you!
feat: Multi-Teacher On-Policy Distillation (MOPD)
Summary
Add Multi-Teacher On-Policy Distillation (MOPD) support to slime, enabling a single student model to distill knowledge from multiple domain-specific teachers simultaneously with importance sampling (IS) correction for stable off-policy training.
34 files changed, +9040 / -50 lines
Motivation
Standard on-policy distillation (OPD) supports only a single teacher. In practice, different domains (math, code, reasoning) benefit from specialized teacher models. MOPD extends OPD to aggregate knowledge from multiple domain-specific teachers into a single student, using per-teacher reverse KL advantages averaged across domains with IS-weight correction for stable training when the student policy diverges from the sampling policy.
Algorithm
MOPD supports three distillation strategies, controlled by --mopd-distill-type:
Token-Level Mode (token_level, default)
Uses the sampled token's log-prob difference as a point estimate of the reverse KL divergence:
Training loss:
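As an illustration of the token-level estimate, the sketch below assumes the per-token advantage is the teacher log-prob minus the student log-prob at the sampled token (the negated point estimate of reverse KL), averaged uniformly over the teachers for that sample, and scaled by an importance weight π_θ/μ_θ clamped to [eps_low, eps_high]. The function names, the uniform average, and clamping (rather than masking) are assumptions for illustration, not code from this PR.

```python
import math


def token_level_advantage(student_logprob, teacher_logprobs):
    """Negated reverse-KL point estimate at one sampled token.

    Assumption: teachers are averaged with equal weight.
    """
    per_teacher = [t - student_logprob for t in teacher_logprobs]
    return sum(per_teacher) / len(per_teacher)


def clipped_is_weight(train_logprob, rollout_logprob, eps_low=0.2, eps_high=5.0):
    """Importance ratio pi_theta / mu_theta, clamped to [eps_low, eps_high].

    In a real training loop this value would be detached (stop-gradient).
    """
    ratio = math.exp(train_logprob - rollout_logprob)
    return min(max(ratio, eps_low), eps_high)


def weighted_advantage(train_logprob, rollout_logprob, teacher_logprobs):
    """IS-corrected advantage for one token (hypothetical helper)."""
    weight = clipped_is_weight(train_logprob, rollout_logprob)
    return weight * token_level_advantage(train_logprob, teacher_logprobs)


if __name__ == "__main__":
    # One token, two teachers; the student is slightly off the sampling policy.
    print(weighted_advantage(-2.0, -2.1, [-1.0, -1.5]))
```

A positive value means the teachers assign the sampled token more probability than the student does, so the update would push the student's probability for that token up.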
Top-K Mode (top_k, recommended)
A memory-efficient approximation of the full-vocab KL. Only the teacher's top-k logits and indices are stored, with an analytical tail correction:
Top-K part (exact over the teacher's top-k tokens):
Tail correction (approximates the remaining vocabulary):
where:
- π_s_tail = 1 - Σ_{y ∈ top-k} π_s(y): student's exact tail mass (computed via all-reduce across TP ranks)
- π_t_tail = 1 - Σ exp(log_prob_t(y)): exact (SGLang returns full-vocab log-probs)
- π_t_tail ≈ (V - V_eff) / V: uniform assumption over non-top-k tokens
Full loss:
- Memory: B × R × k × 2 × 4B / TP (~98.7% reduction vs full_vocab for k=1024, V=152K)
- … -inf padding for out-of-shard entries; Megatron provides local indices directly
Full-Vocabulary Mode (full_vocab)
Computes the exact reverse KL over the entire vocabulary:
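In plain Python, and assuming small dense probability vectors rather than vocab-parallel tensors, the exact reverse KL and a top-k approximation with a single-bucket tail term can be sketched as follows. The tail term π_s_tail · (log π_s_tail − log π_t_tail) is an assumed form inferred from the tail-mass definitions in the Top-K section. The helper names are hypothetical and unrelated to vocab_parallel_reverse_kl or vocab_parallel_topk_reverse_kl in this PR.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """Exact KL(pi_s || pi_t) summed over the full vocabulary."""
    return sum(
        ps * (math.log(ps) - math.log(pt))
        for ps, pt in zip(student_probs, teacher_probs)
        if ps > 0.0
    )


def topk_reverse_kl(student_probs, teacher_probs, k, tiny=1e-12):
    """Reverse KL over the teacher's top-k tokens plus one tail bucket.

    Assumption: the teacher tail mass is known exactly (1 - sum of top-k probs).
    """
    order = sorted(range(len(teacher_probs)), key=lambda i: teacher_probs[i], reverse=True)
    top = order[:k]
    head = sum(
        student_probs[i] * (math.log(student_probs[i]) - math.log(teacher_probs[i]))
        for i in top
        if student_probs[i] > 0.0
    )
    s_tail = 1.0 - sum(student_probs[i] for i in top)
    t_tail = 1.0 - sum(teacher_probs[i] for i in top)
    if s_tail <= tiny or t_tail <= tiny:
        return head  # no measurable tail mass left to correct for
    return head + s_tail * (math.log(s_tail) - math.log(t_tail))


if __name__ == "__main__":
    student = softmax([2.0, 1.0, 0.5, 0.0, -1.0])
    teacher = softmax([1.5, 1.5, 0.0, -0.5, -2.0])
    print(reverse_kl(student, teacher), topk_reverse_kl(student, teacher, k=2))
```

With an exact teacher tail mass, the top-k value is itself a KL divergence between two coarsened distributions, so it is non-negative and never exceeds the full-vocabulary value. Under the uniform-tail approximation this guarantee does not necessarily hold.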
Training loss:
Memory: B × R × V × 4B / TP (stores full teacher logits)
Comparison
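To make the memory side of the comparison concrete, the two storage formulas can be evaluated directly. This is a small sketch that assumes B is the batch size, R the response length in tokens, and 4B means four bytes per stored value; those readings, the function names, and the example sizes are assumptions rather than definitions from the PR.

```python
def topk_teacher_bytes(batch, resp_len, k, tp, bytes_per_value=4):
    """B x R x k x 2 x 4B / TP: k logits plus k indices per token."""
    return batch * resp_len * k * 2 * bytes_per_value // tp


def full_vocab_teacher_bytes(batch, resp_len, vocab, tp, bytes_per_value=4):
    """B x R x V x 4B / TP: one logit per vocabulary entry per token."""
    return batch * resp_len * vocab * bytes_per_value // tp


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    topk = topk_teacher_bytes(batch=8, resp_len=4096, k=1024, tp=8)
    full = full_vocab_teacher_bytes(batch=8, resp_len=4096, vocab=152_000, tp=8)
    print(f"top_k: {topk / 2**20:.1f} MiB, full_vocab: {full / 2**20:.1f} MiB")
    print(f"top_k / full_vocab = {topk / full:.4f}")
```

The ratio depends only on 2k / V, so batch size, response length, and TP degree cancel out.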
token_level | top_k | full_vocab
Key Features
- Importance-sampling weights w_t = sg[π_θ/μ_θ] ensure stable training
- α blends reverse KL with standard ORM advantages
- sglang: teachers on external SGLang servers (can have different architectures)
- megatron: teachers loaded into Megatron (same architecture required)
- … mopd_domains metadata field
- … HfWeightIteratorBridge
Architecture
New Files
- slime/rollout/mopd.py
- slime/utils/ppo_utils.py (+369): vocab_parallel_topk_reverse_kl, vocab_parallel_reverse_kl
- slime_plugins/megatron_bridge/qwen35_vl_moe.py
- scripts/models/qwen3.5-397B-A17B.sh
- tools/convert_torch_dist_to_hf_parallel.py
- tools/merge_missing_keys.py
- tools/patch_attention_gate_on_cluster.py
- tests/test_mopd.py
- tests/test_mopd_full_vocab.py
- tests/test_mopd_sglang_topk_pipeline.py
Modified Files (key changes)
- slime/backends/megatron_utils/actor.py (+317)
- slime/backends/megatron_utils/loss.py (+829)
- slime/backends/megatron_utils/model.py (+59): … batch_keys for per-domain MOPD teacher data
- slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py (+49): … HfWeightIteratorBridge for VL MoE
- slime/utils/arguments.py (+376)
- slime/ray/rollout.py (+86)
- slime/utils/types.py (+15): … messages field on Sample dataclass
- slime/utils/data.py (+10): filter_long_prompt multimodal input handling
Example Scripts
- run-qwen35-397B-A17B-mopd-topk-sglang.sh
- run-qwen35-35B-A3B-mopd-topk-sglang.sh
Usage
Key Arguments
- --use-mopd: … (--use-opd)
- --mopd-teachers: … (MOPD_TEACHERS_JSON)
- --mopd-teacher-mode: sglang or megatron
- --mopd-distill-type: token_level, top_k, or full_vocab (default: token_level)
- --mopd-topk-k: … (top_k mode)
- --mopd-alpha
- --mopd-eps-low
- --mopd-eps-high
- --mopd-teacher-loads
- --mopd-sampling-logprobs-key: rollout_log_probs
Training Results
The following figure shows key training metrics from an MOPD top-k distillation run, confirming stable and effective convergence:
Key observations:
- mopd_topk_kl / mopd_topk_kl/origin / mopd_topk_kl/enhanced: KL divergence between student and teacher decreases steadily, indicating the student successfully absorbs teacher knowledge across domains.
- mopd_is_weight_mean: Importance sampling weights remain tightly centered around 1.0, validating that IS clipping (eps_low=0.2, eps_high=5.0) effectively prevents variance explosion.
- train_rollout_logprob_abs_diff: The absolute difference between training and rollout log-probs stays small and stable, confirming on-policy consistency.
- grad_norm / entropy_loss / loss: Gradient norm stays bounded, entropy decreases gradually, and the overall loss converges, all signs of healthy training dynamics.
- ppo_kl: Remains well-controlled, indicating the student policy is not drifting excessively from the reference policy.
Testing
Compatibility
- … top_logprobs_num support (recent versions)
- … padded_vocab_size (for SGLang top-k mode)
- … (--megatron-to-hf-mode bridge) required for Qwen3.5 VL MoE weight sync
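The Top-K notes mention -inf padding for out-of-shard entries, and the list above mentions padded_vocab_size for SGLang top-k mode. One way these could fit together is sketched below. It assumes the padded vocabulary is split evenly across TP ranks and that a teacher token outside the local shard keeps its slot with index 0 and logit -inf. The function name and this layout are hypothetical, not taken from the PR.

```python
import math


def localize_topk(global_indices, logits, tp_rank, padded_vocab_size, tp_size):
    """Map global top-k token ids to one TP rank's local vocabulary shard.

    Entries that fall outside the shard keep their position but get a
    placeholder index of 0 and a logit of -inf, so they contribute zero
    probability mass on this rank.
    """
    shard = padded_vocab_size // tp_size
    start, end = tp_rank * shard, (tp_rank + 1) * shard
    local_indices, local_logits = [], []
    for idx, logit in zip(global_indices, logits):
        if start <= idx < end:
            local_indices.append(idx - start)
            local_logits.append(logit)
        else:
            local_indices.append(0)
            local_logits.append(-math.inf)
    return local_indices, local_logits


if __name__ == "__main__":
    # Vocabulary of 8 entries split over 2 ranks; rank 1 owns ids 4..7.
    print(localize_topk([1, 5, 7], [0.5, 0.2, 0.1], tp_rank=1,
                        padded_vocab_size=8, tp_size=2))
```

Keeping the tensor shape fixed at k entries per token on every rank means the per-rank results can be combined with a plain all-reduce.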