Add native MTP for Qwen3.6 MLX models #2110

Open
ffrappo wants to merge 1 commit into exo-explore:main from ffrappo:native-mtp

ffrappo commented May 23, 2026

Summary

This PR adds native multi-token prediction for Qwen3.6/Qwen3.5-style MLX checkpoints that include in-checkpoint MTP layers. The MTP heads draft candidate tokens, and the target model verifies those candidates before any token is emitted.

What ships:

  • Native-MTP loading and dispatch for the selected Qwen3.6 27B and 35B-A3B MLX model cards.
  • Exact target verification for drafted tokens, including route-locked verification for the MoE path.
  • Probability-ratio speculative sampling for the request's temperature / top_p / top_k / min_p distribution.
  • Product/API metadata so clients can see native-MTP capability, plus final generation stats that report whether native MTP ran.
  • Dashboard and macOS UI surfacing for native-MTP model metadata.

Behavior

Native MTP is enabled by default for supported single-node model cards. It dispatches only when the card declares native_mtp, the local checkpoint has recoverable MTP weights, and the model is placed on one node. Multi-node placements continue to use the normal path in this PR. Operators can disable the feature with EXO_NATIVE_MTP_ENABLED=0 or the macOS setting.
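
The dispatch conditions above can be read as a single predicate. The following is a minimal sketch, not the PR's code: `NativeMtpCard`, `native_mtp_enabled`, and `should_dispatch_native_mtp` are hypothetical names, and treating only the literal value `0` as "disabled" is an assumption about how `EXO_NATIVE_MTP_ENABLED` is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NativeMtpCard:
    """Hypothetical mirror of a model card's native_mtp block."""
    default_k: int
    max_k: int


def native_mtp_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Assumption: the feature is on unless the variable is exactly "0".
    env = os.environ if env is None else env
    return env.get("EXO_NATIVE_MTP_ENABLED", "1") != "0"


def should_dispatch_native_mtp(
    card: Optional[NativeMtpCard],
    has_mtp_weights: bool,
    node_count: int,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """True only when every condition from the Behavior section holds."""
    return bool(
        native_mtp_enabled(env)
        and card is not None      # the card declares native_mtp
        and has_mtp_weights       # the checkpoint has recoverable MTP weights
        and node_count == 1       # single-node placement only
    )
```

Any failed condition falls back to the normal generation path, which matches the statement that multi-node placements are unchanged by this PR.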

Correctness

Native MTP keeps the target model in charge of emission:

  • Greedy decoding preserves target-greedy token IDs. In the recorded 64-token runs, the selected 27B and 35B local execution targets matched target-greedy output for K=1/K=2/K=3, fixed and adaptive, with no divergence at any position.
  • Sampling uses speculative probability-ratio acceptance against the reconstructed target and draft distributions for temperature / top_p / top_k / min_p.
  • Cache repair/commit handling preserves the target KV/GDN state after accepted and rejected drafts.
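
For readers unfamiliar with probability-ratio acceptance, here is a minimal sketch of the standard speculative sampling rule for one drafted token. It assumes `p` is the target distribution and `q` the draft distribution, both already filtered by temperature / top_p / top_k / min_p and normalized. The function names are hypothetical, and the PR's MLX implementation may differ in detail.

```python
import random
from typing import List, Sequence, Tuple


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q coincide, so resampling from p is equivalent.
        return list(p)
    return [r / total for r in raw]


def verify_draft_token(
    token: int,
    p: Sequence[float],
    q: Sequence[float],
    rng: random.Random,
) -> Tuple[bool, int]:
    """Return (accepted, emitted_token) for one drafted token."""
    if q[token] <= 0.0:
        raise ValueError("a drafted token must have nonzero draft probability")
    # Accept with probability min(1, p(token) / q(token)).
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    # On rejection, emit a replacement drawn from the residual distribution.
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return False, replacement
```

With this rule, every emitted token is distributed according to `p`, which is why the target model stays in charge of emission even though the draft proposes tokens.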

Current scope:

  • Native MTP is enabled for model cards that explicitly declare native_mtp.default_k / native_mtp.max_k.
  • Stateful logits_processors such as repetition/presence/frequency penalties are not routed through native MTP yet.
  • K>=4 is not enabled by this PR.
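
One way to read the scope rules for K is as a clamp. This is an illustrative sketch only: `resolve_draft_k` and `PR_MAX_K` are hypothetical, and the PR does not state how a requested K interacts with `native_mtp.default_k` and `native_mtp.max_k`.

```python
from typing import Optional

# Assumption: encodes "K>=4 is not enabled by this PR".
PR_MAX_K = 3


def resolve_draft_k(requested_k: Optional[int], default_k: int, max_k: int) -> int:
    """Pick a draft depth within the card's limits and the PR's cap."""
    k = default_k if requested_k is None else requested_k
    upper = min(max_k, PR_MAX_K)
    # Never go below one drafted token or above the allowed ceiling.
    return max(1, min(k, upper))
```

For example, a card with `default_k=2` and `max_k=3` would yield K=2 when no value is requested and K=3 when a larger value is requested.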

Performance

Broad prompt set, max_prompt_tokens=32, max_tokens=64:

| Model | Mode | Mean tok/s | vs MTP off | Acceptance |
|---|---|---|---|---|
| 27B native-MTP | MTP off | 17.27 | 1.00x | n/a |
| 27B native-MTP | K=1 | 29.56 | 1.71x | 85.7% |
| 27B native-MTP | K=2 | 34.06 | 1.97x | 75.4% |
| 27B native-MTP | K=3 | 33.79 | 1.96x | 66.4% |
| 35B-A3B native-MTP | MTP off | 85.14 | 1.00x | n/a |
| 35B-A3B native-MTP | K=1 | 98.59 | 1.16x | 55.8% |
| 35B-A3B native-MTP | K=2 | 92.27 | 1.08x | 38.3% |
| 35B-A3B native-MTP | K=3 | 80.53 | 0.95x | 27.4% |

Summary: 27B reaches +97.2% / 1.97x at K=2. 35B-A3B reaches +15.8% / 1.16x at K=1 in the broad sweep, and higher K is not automatically better on that path.
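
The headline figures follow directly from the mean tok/s column. The short script below reproduces them. The numbers are copied from the table, and the helper names are illustrative.

```python
# Mean tok/s values copied from the performance table.
MEAN_TOK_S = {
    "27B": {"off": 17.27, "K=1": 29.56, "K=2": 34.06, "K=3": 33.79},
    "35B-A3B": {"off": 85.14, "K=1": 98.59, "K=2": 92.27, "K=3": 80.53},
}


def speedup(model: str, mode: str) -> float:
    """Throughput ratio of `mode` against the same model with MTP off."""
    rows = MEAN_TOK_S[model]
    return rows[mode] / rows["off"]


def best_mode(model: str) -> str:
    """K setting with the highest mean tok/s for the given model."""
    rows = MEAN_TOK_S[model]
    return max((m for m in rows if m != "off"), key=lambda m: rows[m])


for model in MEAN_TOK_S:
    mode = best_mode(model)
    ratio = speedup(model, mode)
    print(f"{model}: best {mode}, {ratio:.2f}x, {100 * (ratio - 1):+.1f}%")
```

On the 35B-A3B path the K=3 ratio falls below 1.0, which is the concrete case behind "higher K is not automatically better".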

Implementation Notes

  • The 35B/MoE GDN setup path rebuilds the target prompt cache and primes the MTP cache in one incremental pass while preserving target/MTP cache equivalence.
  • Cache setup and repair paths use hidden-state-only target-body calls where logits are not consumed.
  • Accepted-prefix counting stays on the MLX side before verification results are needed on CPU.
  • K=1 skips draft-token concatenation.
  • MTP draft/cache evaluation can overlap with verifier graph construction; EXO_NATIVE_MTP_ASYNC_DRAFT_EVAL=0 disables this path.
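
The accepted-prefix count mentioned above can be illustrated for the greedy case. This sketch uses plain Python lists for clarity, whereas the PR keeps the count on the MLX side. It assumes the verifier returns one target-greedy token per drafted position plus one extra position, and the function names are hypothetical.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading drafted tokens that match the target's greedy picks."""
    count = 0
    for drafted, verified in zip(draft, target):
        if drafted != verified:
            break
        count += 1
    return count


def emit_tokens(draft: Sequence[int], target_greedy: Sequence[int]) -> List[int]:
    """Tokens to emit for one verify step under greedy decoding.

    target_greedy holds len(draft) + 1 entries: the target's choice at each
    drafted position, then its choice after the last drafted token.
    """
    n = accepted_prefix_length(draft, target_greedy)
    # Accepted drafts equal the target's picks, so emitting the target's
    # first n + 1 tokens covers the accepted prefix plus one correction
    # (or one bonus token when every draft was accepted).
    return list(target_greedy[: n + 1])
```

Because every emitted token comes from the target's own greedy choices, the output matches target-greedy decoding regardless of how many drafts are rejected.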

Validation

  • uv run basedpyright
  • uv run ruff check
  • nix fmt
  • uv run pytest src -q
  • Selected 27B and 35B exactness probes for K=1/K=2/K=3, fixed and adaptive

ffrappo force-pushed the native-mtp branch 7 times, most recently from 9c3bf31 to 42c86bd on May 23, 2026 13:04