Add native MTP for Qwen3.6 MLX models by ffrappo · Pull Request #2110 · exo-explore/exo

ffrappo · 2026-05-23T12:39:08Z

Summary

This PR adds native multi-token prediction (MTP) for Qwen3.6/Qwen3.5-style MLX checkpoints that ship with MTP layers in the checkpoint. The MTP heads draft candidate tokens, and the target model verifies those candidates before any token is emitted.

What ships:

Native-MTP loading and dispatch for the selected Qwen3.6 27B and 35B-A3B MLX model cards.
Exact target verification for drafted tokens, including route-locked verification for the MoE path.
Probability-ratio speculative sampling for the request's temperature / top_p / top_k / min_p distribution.
Product/API metadata that exposes native-MTP capability to clients, plus final generation stats that report when native MTP ran.
Dashboard and macOS UI surfacing for native-MTP model metadata.

Behavior

Native MTP is enabled by default for supported single-node model cards. It dispatches only when the card declares native_mtp, the local checkpoint has recoverable MTP weights, and the model is placed on one node. Multi-node placements continue to use the normal path in this PR. Operators can disable the feature with EXO_NATIVE_MTP_ENABLED=0 or the macOS setting.
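
The dispatch conditions above can be sketched as a single predicate. This is a minimal illustration, not the code in this PR: `NativeMtpCard`, `should_dispatch_native_mtp`, and its parameters are hypothetical names, and the sketch assumes that only the literal value `0` disables the feature, since the description does not say how other values of EXO_NATIVE_MTP_ENABLED are parsed. The macOS setting is not modeled.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class NativeMtpCard:
    # Hypothetical mirror of a model card's native_mtp block.
    default_k: int
    max_k: int


def should_dispatch_native_mtp(
    card_native_mtp: Optional[NativeMtpCard],
    checkpoint_has_mtp_weights: bool,
    node_count: int,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True only when every documented dispatch condition holds."""
    env = os.environ if env is None else env
    # Assumption: enabled by default, and only the literal "0" disables it.
    if env.get("EXO_NATIVE_MTP_ENABLED", "1") == "0":
        return False
    return (
        card_native_mtp is not None       # the card declares native_mtp
        and checkpoint_has_mtp_weights    # MTP weights are recoverable locally
        and node_count == 1               # single-node placement only
    )
```

Any failed condition returns False, which corresponds to falling back to the normal generation path.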

Correctness

Native MTP keeps the target model in charge of emission:

Greedy decoding preserves target-greedy token IDs. The selected 27B and 35B local execution targets matched target-greedy output for K=1, K=2, and K=3 in both fixed and adaptive modes, with no divergence in the recorded 64-token runs.
Sampling uses speculative probability-ratio acceptance against the reconstructed target and draft distributions for temperature / top_p / top_k / min_p.
Cache repair/commit handling preserves the target KV/GDN state after accepted and rejected drafts.
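
For readers unfamiliar with probability-ratio acceptance, the standard speculative-sampling rule for a single drafted token looks roughly like the sketch below. It is a pure-Python illustration of the general technique, not the MLX implementation in this PR. `verify_one_draft` is a hypothetical name, and `p_target` and `q_draft` are assumed to be the already-filtered (temperature / top_p / top_k / min_p) and normalized distributions over the vocabulary.

```python
import random
from typing import List, Sequence, Tuple


def verify_one_draft(
    draft_token: int,
    p_target: Sequence[float],
    q_draft: Sequence[float],
    rng: random.Random,
) -> Tuple[int, bool]:
    """Return (emitted_token, accepted) for one drafted token."""
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft with probability min(1, p/q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(p - q, 0), renormalized.
    residual: List[float] = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) <= 0.0:
        # Degenerate case (p equals q up to rounding): sample the target directly.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False
```

With this rule the emitted token is distributed according to `p_target` regardless of how good the draft distribution is, which is why the target model stays in charge of emission under sampling as well as under greedy decoding.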

Current scope:

Native MTP is enabled for model cards that explicitly declare native_mtp.default_k / native_mtp.max_k.
Stateful logits_processors such as repetition/presence/frequency penalties are not routed through native MTP yet.
K>=4 is not enabled by this PR.
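
The scope limits above amount to a small routing decision. The sketch below is illustrative only: `resolve_native_mtp_k` and `MAX_SUPPORTED_K` are hypothetical names, and two behaviors are assumptions rather than facts from this PR, namely that a request may carry its own K and that an oversized K is clamped instead of rejected.

```python
from typing import Optional, Sequence

MAX_SUPPORTED_K = 3  # K>=4 is not enabled by this PR


def resolve_native_mtp_k(
    requested_k: Optional[int],
    default_k: int,
    max_k: int,
    logits_processors: Sequence[object] = (),
) -> Optional[int]:
    """Return the draft depth K to use, or None to take the normal path."""
    # Stateful logits processors (repetition/presence/frequency penalties)
    # are not routed through native MTP yet.
    if logits_processors:
        return None
    k = default_k if requested_k is None else requested_k
    # Assumption: clamp to the card's max_k and to the PR-wide ceiling.
    k = min(k, max_k, MAX_SUPPORTED_K)
    return k if k >= 1 else None
```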

Performance

Broad prompt set, max_prompt_tokens=32, max_tokens=64:

Model	Mode	Mean tok/s	vs MTP off	Acceptance
27B native-MTP	MTP off	17.27	1.00x	n/a
27B native-MTP	K=1	29.56	1.71x	85.7%
27B native-MTP	K=2	34.06	1.97x	75.4%
27B native-MTP	K=3	33.79	1.96x	66.4%
35B-A3B native-MTP	MTP off	85.14	1.00x	n/a
35B-A3B native-MTP	K=1	98.59	1.16x	55.8%
35B-A3B native-MTP	K=2	92.27	1.08x	38.3%
35B-A3B native-MTP	K=3	80.53	0.95x	27.4%

Summary: 27B peaks at K=2 with +97.2% (1.97x). 35B-A3B peaks at K=1 with +15.8% (1.16x) in the broad sweep; higher K is not automatically better on that path, and K=3 falls below the MTP-off baseline (0.95x).
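
The "vs MTP off" column is each mode's mean tok/s divided by the MTP-off baseline for the same model. The short script below recomputes it from the table; the dictionary layout and the helper names `speedups` and `best_mode` are illustrative and not part of the PR.

```python
from typing import Dict

# Mean tok/s copied from the broad-sweep table above.
BROAD_SWEEP: Dict[str, Dict[str, float]] = {
    "27B": {"off": 17.27, "K=1": 29.56, "K=2": 34.06, "K=3": 33.79},
    "35B-A3B": {"off": 85.14, "K=1": 98.59, "K=2": 92.27, "K=3": 80.53},
}


def speedups(rows: Dict[str, float]) -> Dict[str, float]:
    """Throughput of each MTP mode relative to the MTP-off baseline."""
    base = rows["off"]
    return {mode: round(tps / base, 2) for mode, tps in rows.items() if mode != "off"}


def best_mode(rows: Dict[str, float]) -> str:
    """MTP mode with the highest mean tok/s."""
    return max((m for m in rows if m != "off"), key=lambda m: rows[m])


for model, rows in BROAD_SWEEP.items():
    print(model, speedups(rows), "best:", best_mode(rows))
```

The best mode differs per model (K=2 for 27B, K=1 for 35B-A3B), which is the reason K is declared per model card instead of being fixed globally.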

Implementation Notes

The 35B/MoE GDN setup path rebuilds the target prompt cache and primes the MTP cache in one incremental pass while preserving target/MTP cache equivalence.
Cache setup and repair paths use hidden-state-only target-body calls where logits are not consumed.
Accepted-prefix counting stays on the MLX side before verification results are needed on CPU.
K=1 skips draft-token concatenation.
MTP draft/cache evaluation can overlap with verifier graph construction; EXO_NATIVE_MTP_ASYNC_DRAFT_EVAL=0 disables this path.
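
As a reference for the accepted-prefix counting mentioned above, here is a plain-Python sketch of greedy verification. It is not the MLX code in this PR, which keeps the count on the device; the function names are hypothetical. It assumes the verifier returns K+1 target-greedy tokens: one for each of the K draft positions plus one bonus position after the last draft.

```python
from typing import List, Sequence


def accepted_prefix_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's greedy choice."""
    n = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        n += 1
    return n


def greedy_emit(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Tokens to emit for one draft/verify step under greedy decoding.

    Accepted drafts equal the target's tokens, and the next emitted token is
    the target's own choice at the first mismatch (or the bonus position),
    so the output is always a prefix of target_tokens.
    """
    n = accepted_prefix_length(draft_tokens, target_tokens)
    return list(target_tokens[: n + 1])
```

Because every emitted token is taken from the verifier's output, a wrong draft can only shorten a step; it cannot change the emitted token IDs. An on-device version would compute the same count with array operations so that only the final integer needs to reach the CPU.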

Validation

uv run basedpyright
uv run ruff check
nix fmt
uv run pytest src -q
Selected 27B and 35B exactness probes for K=1/K=2/K=3, fixed and adaptive

ffrappo force-pushed the native-mtp branch 7 times, most recently from 9c3bf31 to 42c86bd Compare May 23, 2026 13:04

Add native MTP for Qwen3.6 MLX models

2c33a38

ffrappo force-pushed the native-mtp branch from 42c86bd to 2c33a38 Compare May 23, 2026 13:07

ffrappo wants to merge 1 commit into exo-explore:main from ffrappo:native-mtp
