server : disable embeddings/pooling on the speculative draft/MTP context #24942
Open
liminfei-amd wants to merge 1 commit into
MTP (and draft) models fail to load in llama-server while working in llama-cli.

When the server creates the speculative draft / MTP context it reuses the target context params, which carry embeddings = true and a pooling type. The draft and MTP graphs only emit draft logits and have no embeddings output, so initializing their context with embeddings/pooling enabled fails and the server aborts in load_model. llama-cli builds the speculative context without these target-side params, which is why it is unaffected.

Explicitly set embeddings = false and pooling_type = LLAMA_POOLING_TYPE_NONE on the draft and MTP context params before llama_init_from_model(), so the speculative context is created for logits only (matching the observation that MTP contexts do not need the embeddings output).

Fixes ggml-org#24443

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
Hi @liminfei-amd, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Overview
MTP (and speculative draft) models fail to load in llama-server while the same
model loads and runs in llama-cli (#24443). The server aborts in load_model
when creating the speculative context.
Root cause
When the server builds the speculative draft context and the MTP context,
it reuses the target context params. Those carry embeddings = true and a
pooling type. The draft/MTP graphs only produce draft logits and have no
embeddings output, so creating their context with embeddings/pooling enabled
fails. llama-cli does not copy these target-side params into the speculative
context, which is why it is unaffected. (This matches the maintainer note on
#24480 that MTP contexts do not need to set the embeddings tensor.)
Fix
In tools/server/server-context.cpp, before llama_init_from_model() for the
draft context and for the MTP context, set embeddings = false and
pooling_type = LLAMA_POOLING_TYPE_NONE, so the speculative context is created
for logits only. The target context is unchanged.
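As a rough illustration of the idea (not the actual diff), the sketch below copies a target-side params struct and clears the two fields before the speculative context would be created. Everything prefixed with sketch_ is a hypothetical stand-in written for this example: the real struct is llama_context_params in llama.h, the real entry point is llama_init_from_model(), and the toy init check only models the failure described under Root cause. It is an assumption for illustration, not how llama.cpp validates params.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; the real types live in llama.h.
enum sketch_pooling_type {
    SKETCH_POOLING_TYPE_UNSPECIFIED = -1,
    SKETCH_POOLING_TYPE_NONE        = 0,
    SKETCH_POOLING_TYPE_MEAN        = 1,
};

struct sketch_context_params {
    int                 n_ctx        = 4096;
    bool                embeddings   = false;
    sketch_pooling_type pooling_type = SKETCH_POOLING_TYPE_UNSPECIFIED;
};

// Toy model (assumption) of the failure: a graph that only emits draft logits
// cannot serve a context that asks for embeddings or for an actual pooling mode.
bool sketch_init_logits_only_context(const sketch_context_params & p) {
    const bool wants_pooling = p.pooling_type != SKETCH_POOLING_TYPE_NONE &&
                               p.pooling_type != SKETCH_POOLING_TYPE_UNSPECIFIED;
    return !p.embeddings && !wants_pooling;
}

// Start from the target params (so n_ctx etc. carry over), then force the
// speculative context to logits only. The target params are left untouched.
sketch_context_params sketch_make_speculative_params(const sketch_context_params & target) {
    sketch_context_params p = target;
    p.embeddings   = false;
    p.pooling_type = SKETCH_POOLING_TYPE_NONE;
    return p;
}

// Self-check of both paths: reusing the target params fails, the cleared copy works.
void sketch_demo() {
    sketch_context_params target;
    target.embeddings   = true;
    target.pooling_type = SKETCH_POOLING_TYPE_MEAN;

    assert(!sketch_init_logits_only_context(target));
    assert(sketch_init_logits_only_context(sketch_make_speculative_params(target)));
}
```

Calling sketch_demo() exercises both cases. The point of copying first and overriding afterwards is that every other target setting still reaches the draft/MTP context; only the embeddings-related fields differ.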
Validation
On an RDNA3.5 (gfx1151) ROCm/Vulkan build:
- [...] --spec-type draft-mtp): unaffected.
- [...] load_model (reproduced; gdb stack bottoms at server.cpp ctx_server.load_model).
- [...] (draft-mtp: #gen drafts = 15, #acc drafts = 5, #acc tokens = 8, mean acc len = 1.53), slots release cleanly. Output matches the cli run.
- git apply --check clean on current master.
Scope / related
#24480 (closed, unmerged) addressed a Gemma4-model-specific facet in
src/models/gemma4-assistant.cpp; this PR is the server-side context-params fix
and is independent of it. The issue thread also mentions hardware-specific
variants; this PR fixes the embeddings/pooling load failure that reproduces on
the speculative draft/MTP path, which is the common failure mode in the report.
Requirements
reviewed and verified it and can explain every line.