server : disable embeddings/pooling on the speculative draft/MTP context #24942
liminfei-amd wants to merge 1 commit into
Hi @liminfei-amd, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
MTP (and draft) models fail to load in llama-server while working in llama-cli.

When the server creates the speculative draft / MTP context it reuses the target context params, which carry embeddings = true and a pooling type. The draft and MTP graphs only emit draft logits and have no embeddings output, so initializing their context with embeddings/pooling enabled fails and the server aborts in load_model. llama-cli builds the speculative context without these target-side params, which is why it is unaffected.

Explicitly set embeddings = false and pooling_type = LLAMA_POOLING_TYPE_NONE on the draft and MTP context params before llama_init_from_model(), so the speculative context is created for logits only (matching the observation that MTP contexts do not need the embeddings output).

Fixes ggml-org#24443
d8d4102 to b9268bd
Thanks for testing and for pinpointing #24980 @AndASM. I've rebased this onto current master. Re-verified on master (b3fed31), gfx1151: with this patch
Overview
MTP (and speculative draft) models fail to load in llama-server while the same
model loads and runs in llama-cli (#24443). The server aborts in load_model
when creating the speculative context.
Root cause
When the server builds the speculative draft context and the MTP context,
it reuses the target context params. Those carry embeddings = true and a
pooling type. The draft/MTP graphs only produce draft logits and have no
embeddings output, so creating their context with embeddings/pooling enabled
fails. llama-cli does not copy these target-side params into the speculative
context, which is why it is unaffected. (This matches the maintainer note on
#24480 that MTP contexts do not need to set the embeddings tensor.)
Fix
In tools/server/server-context.cpp, before llama_init_from_model() for the
draft context and for the MTP context, set embeddings = false and
pooling_type = LLAMA_POOLING_TYPE_NONE on the context params, so the
speculative context is created for logits only. The target context is
unchanged.
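The override described above can be sketched in isolation. This is a minimal illustration, not the code from the patch: `context_params_sketch`, `pooling_type_sketch`, `make_speculative_params` and `logits_only_context_ok` are hypothetical stand-ins that model only the two fields discussed here, and the check function is an assumption about the failure condition rather than the actual logic inside llama.cpp.

```cpp
// Hypothetical stand-in for the pooling enum; only two values are modeled.
enum pooling_type_sketch {
    POOLING_NONE_SKETCH = 0,
    POOLING_MEAN_SKETCH = 1,
};

// Hypothetical stand-in for the context params; the real struct has many more fields.
struct context_params_sketch {
    bool                embeddings;
    pooling_type_sketch pooling_type;
};

// Derive the speculative (draft/MTP) params from the target params.
// The copy keeps every other setting; only the two embedding-related
// fields are reset so the context is created for logits only.
context_params_sketch make_speculative_params(const context_params_sketch & target) {
    context_params_sketch params = target;      // start from the target params
    params.embeddings   = false;                // draft/MTP graphs have no embeddings output
    params.pooling_type = POOLING_NONE_SKETCH;  // no pooling on a logits-only context
    return params;
}

// Assumed model of the failure: a logits-only graph cannot be initialized
// when embeddings or pooling are requested.
bool logits_only_context_ok(const context_params_sketch & params) {
    return !params.embeddings && params.pooling_type == POOLING_NONE_SKETCH;
}
```

In this sketch the target params are taken by const reference and copied, so the target context keeps its own embeddings and pooling settings, which mirrors the statement that the target context is unchanged.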
Validation
On an RDNA3.5 (gfx1151) ROCm/Vulkan build:
[...] --spec-type draft-mtp): unaffected.
[...] load_model (reproduced; gdb stack bottoms at server.cpp ctx_server.load_model).
[...] (draft-mtp: #gen drafts = 15, #acc drafts = 5, #acc tokens = 8, mean acc len = 1.53), slots release cleanly. Output matches the cli run.
git apply --check clean on current master.
Scope / related
#24480 (closed, unmerged) addressed a Gemma4-model-specific facet in
src/models/gemma4-assistant.cpp; this PR is the server-side context-params fix and is independent of it. The issue thread also mentions hardware-specific
variants; this PR fixes the embeddings/pooling load failure that reproduces on
the speculative draft/MTP path, which is the common failure mode in the report.
Requirements
reviewed and verified it and can explain every line.