The load fails on a missing blk.<N>.* tensor (for example blk.<N>.attn_norm.weight), where <N> == num_hidden_layers (the first index past the trunk).
The converter writes block_count = num_hidden_layers + mtp_num_hidden_layers and a nextn_predict_layers key whenever config.json declares mtp_num_hidden_layers, even when the checkpoint contains no mtp.* weights. The runtime then derives n_layer_all = block_count and unconditionally constructs the trailing MTP/NextN block, marking blk.<N>.attn_norm.weight (and the other MTP tensors) as required. For a trunk-only GGUF this block is never present, so load aborts.
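The layer-count arithmetic described above can be sketched as follows. This is a minimal illustration, not the actual converter or loader code: the struct and function names (`gguf_meta_sketch`, `hparams_sketch`, `write_meta`, `derive_hparams`, `first_mtp_block`) are hypothetical, and only the relationships between `block_count`, `nextn_predict_layers`, `n_layer_all` and `n_layer()` are taken from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the metadata the converter writes.
struct gguf_meta_sketch {
    uint32_t block_count;          // num_hidden_layers + mtp_num_hidden_layers
    uint32_t nextn_predict_layers; // mtp_num_hidden_layers
};

// Hypothetical, simplified stand-in for the runtime hyperparameters.
struct hparams_sketch {
    uint32_t n_layer_all;   // every block the loader will try to construct
    uint32_t n_layer_nextn; // trailing MTP/NextN blocks
    // Trunk layers only: nextn layers are excluded.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }
};

// What gets written when config.json declares mtp_num_hidden_layers,
// regardless of whether the checkpoint contains any mtp.* weights.
inline gguf_meta_sketch write_meta(uint32_t num_hidden_layers,
                                   uint32_t mtp_num_hidden_layers) {
    return { num_hidden_layers + mtp_num_hidden_layers, mtp_num_hidden_layers };
}

// The runtime takes n_layer_all straight from block_count.
inline hparams_sketch derive_hparams(const gguf_meta_sketch & meta) {
    return { meta.block_count, meta.nextn_predict_layers };
}

// Index <N> of the first block past the trunk, i.e. blk.<N>.*
inline uint32_t first_mtp_block(const hparams_sketch & hp) {
    return hp.n_layer();
}
```

For instance, assuming a model with 32 trunk layers and one declared MTP layer (the MTP count is an assumption for illustration), `derive_hparams(write_meta(32, 1))` gives `n_layer_all == 33` and `first_mtp_block(...) == 32`, so the loader expects `blk.32.*` tensors even if the file has none.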
src/models/step35.cpp already handles this: it probes for the defining MTP tensor and, when absent, marks the MTP block tensors TENSOR_NOT_REQUIRED ("trunk-only"). This PR ports that same trunk_only handling to src/models/qwen35.cpp and src/models/qwen35moe.cpp, which previously hardcoded the MTP block tensors as required.
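A rough sketch of the "probe, then relax the requirement" pattern described above is shown below. It is not the code from `step35.cpp` or the qwen35 loaders: the flag enum, `request_tensor`, and `load_mtp_block` are hypothetical stand-ins, the choice of `attn_norm.weight` as the probed tensor is an assumption, and the list of MTP tensor suffixes is passed in by the caller because the description does not enumerate them.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical flag values; the real loader's flags are not reproduced here.
enum tensor_flag_sketch {
    TENSOR_REQUIRED_SKETCH     = 0,
    TENSOR_NOT_REQUIRED_SKETCH = 1,
};

// Returns true if the tensor exists in the file.
// Throws only when the tensor is missing AND was requested as required.
inline bool request_tensor(const std::set<std::string> & file_tensors,
                           const std::string & name, int flags) {
    if (file_tensors.count(name) > 0) {
        return true;
    }
    if (flags & TENSOR_NOT_REQUIRED_SKETCH) {
        return false;
    }
    throw std::runtime_error("missing tensor '" + name + "'");
}

// Requests the tensors of the trailing MTP block (index == number of trunk
// layers) and returns how many of them were found.
inline int load_mtp_block(const std::set<std::string> & file_tensors,
                          int n_trunk_layers,
                          const std::vector<std::string> & mtp_suffixes) {
    const std::string prefix = "blk." + std::to_string(n_trunk_layers) + ".";

    // Probe for the defining MTP tensor (assumed here to be attn_norm.weight).
    const bool trunk_only = file_tensors.count(prefix + "attn_norm.weight") == 0;

    // Trunk-only file: the whole block becomes optional.
    // File that bundles the block: every tensor stays required.
    const int flags = trunk_only ? TENSOR_NOT_REQUIRED_SKETCH
                                 : TENSOR_REQUIRED_SKETCH;

    int found = 0;
    for (const std::string & suffix : mtp_suffixes) {
        if (request_tensor(file_tensors, prefix + suffix, flags)) {
            found++;
        }
    }
    return found;
}
```

With this shape, a file that contains no `blk.<N>.*` tensors loads with zero MTP tensors, while a file that has the probed tensor but lacks another MTP tensor still fails, which keeps GGUFs that bundle the block strictly validated.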
After the change:
- Trunk-only GGUFs load and run normal inference (the MTP block is never executed in the main graph; n_layer() excludes nextn layers).
- GGUFs that actually bundle the MTP block are unchanged: the tensors are still required and the speculative (graph_mtp) path keeps working.
Same failure family reported in #24737 (Qwen3.5-4B, blk.32), #24211 (Nex N2 Pro / Qwen3.5 397B MoE, blk.60), and the Qwen3.5-122B MoE GGUF discussion (blk.48). The MTP-in-GGUF mapping and runtime were added in #20533 / #22673; the step35 trunk-only fix landed in #24340 but was not ported to the qwen35 loaders.
AI usage disclosure: YES - I used an AI assistant to help me understand the issue and identify what needed to change, and to get a more thorough understanding of the relevant code. It helped me realize that step35 already had this change so I had to replicate that for qwen3.5. I made the changes myself. Further, I used AI to write this PR description.
> The converter writes block_count = num_hidden_layers + mtp_num_hidden_layers and a nextn_predict_layers key whenever config.json declares mtp_num_hidden_layers, even when the checkpoint contains no mtp.* weights. The runtime then derives n_layer_all = block_count and unconditionally constructs the trailing MTP/NextN block, marking blk.<N>.attn_norm.weight (and the other MTP tensors) as required. For a trunk-only GGUF this block is never present, so load aborts.
First of all, a model's config.json should not declare MTP layers if it does not have any; this is a model bug. Failing to load such a GGUF is perfectly valid (it can be fixed by editing the config or using --no-mtp at conversion; alternatively, update the GGUF with gguf-set-metadata).
Secondly, allowing this probably leads to other subtle issues, as hparams.n_layer_all is now incorrect. In fact, the correct fix is to remove this handling from step35.
Overview
Fixes loading of Qwen3.5 dense (Qwen3_5ForCausalLM) and MoE (Qwen3_5MoeForCausalLM) GGUFs that fail at load time because the trailing MTP block tensors are missing from the file.
Closes #24737.
Closes #24211.