server : disable embeddings/pooling on the speculative draft/MTP context #24942

Open
liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:server-mtp-draft-no-embd-24443

@liminfei-amd (Contributor)

Overview

MTP (and speculative draft) models fail to load in llama-server while the same
model loads and runs in llama-cli (#24443). The server aborts in load_model
when creating the speculative context.

Root cause

When the server builds the speculative draft context and the MTP context,
it reuses the target context params. Those carry embeddings = true and a
pooling type. The draft/MTP graphs only produce draft logits and have no
embeddings output, so creating their context with embeddings/pooling enabled
fails. llama-cli does not copy these target-side params into the speculative
context, which is why it is unaffected. (This matches the maintainer note on
#24480 that MTP contexts do not need to set the embeddings tensor.)

Fix

In tools/server/server-context.cpp, before llama_init_from_model() for the
draft context and for the MTP context, set:

cparams.embeddings   = false;
cparams.pooling_type = LLAMA_POOLING_TYPE_NONE;

so the speculative context is created for logits only. The target context is
unchanged.
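
The override described above can be sketched in isolation as follows. This is a minimal illustration, not the actual patch: `context_params_sketch`, `pooling_type_sketch`, and `make_speculative_params` are hypothetical stand-ins for the llama.cpp types and the server code, and only the two field assignments come from this PR.

```cpp
// Hypothetical stand-in for the llama.cpp pooling enum (assumption, not the real header).
enum pooling_type_sketch {
    POOLING_SKETCH_UNSPECIFIED = -1,
    POOLING_SKETCH_NONE        = 0,
    POOLING_SKETCH_MEAN        = 1,
};

// Hypothetical stand-in for the context params; only the fields relevant here.
struct context_params_sketch {
    bool                embeddings;
    pooling_type_sketch pooling_type;
    int                 n_ctx;
};

// Derive params for a logits-only speculative (draft/MTP) context.
// The target params are taken by const reference and copied, so the
// target context keeps its own embeddings/pooling settings.
context_params_sketch make_speculative_params(const context_params_sketch & target) {
    context_params_sketch cparams = target;      // start from the target settings
    cparams.embeddings   = false;                // draft/MTP graphs have no embeddings output
    cparams.pooling_type = POOLING_SKETCH_NONE;  // no pooling for a logits-only context
    return cparams;
}
```

Copying first and overriding afterwards keeps every other setting (context size and so on) aligned with the target while dropping only the two fields the draft/MTP graph cannot satisfy.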

Validation

On an RDNA3.5 (gfx1151) ROCm/Vulkan build:

  • cli: loads & runs the MTP model (--spec-type draft-mtp) — unaffected.
  • server, unpatched: aborts in load_model (reproduced; gdb stack bottoms at
    server.cpp ctx_server.load_model).
  • server, patched: loads and serves MTP speculative decoding end to end
    (draft-mtp: #gen drafts = 15, #acc drafts = 5, #acc tokens = 8,
    mean acc len = 1.53); slots release cleanly. Output matches the cli run.
  • git apply --check clean on current master.

Scope / related

#24480 (closed, unmerged) addressed a Gemma4-model-specific facet in
src/models/gemma4-assistant.cpp; this PR is the server-side context-params fix
and is independent of it. The issue thread also mentions hardware-specific
variants; this PR fixes the embeddings/pooling load failure that reproduces on
the speculative draft/MTP path, which is the common failure mode in the report.

Requirements

  • I have read the contributing guidelines.
  • AI usage disclosure: this change was prepared with AI assistance; a human
    reviewed and verified it and can explain every line.

MTP (and draft) models fail to load in llama-server while working in llama-cli.
When the server creates the speculative draft / MTP context it reuses the target
context params, which carry embeddings = true and a pooling type. The draft and
MTP graphs only emit draft logits and have no embeddings output, so initializing
their context with embeddings/pooling enabled fails and the server aborts in
load_model (llama-cli builds the speculative context without these target-side
params, which is why it is unaffected).

Explicitly set embeddings = false and pooling_type = LLAMA_POOLING_TYPE_NONE on
the draft and MTP context params before llama_init_from_model(), so the
speculative context is created for logits only (matching the observation that
MTP contexts do not need the embeddings output).

Fixes ggml-org#24443

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>

ggml-gh-bot Bot commented Jun 23, 2026

Hi @liminfei-amd, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 5 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

AndASM commented Jun 26, 2026

60bc886 appears to cause a new break that blocks this fix. With that commit reverted this fix works for me.

I think that commit comes from this pull: #24980
