server : disable embeddings/pooling on the speculative draft/MTP context by liminfei-amd · Pull Request #24942 · ggml-org/llama.cpp

liminfei-amd · 2026-06-23T11:25:42Z

Overview

MTP (and speculative draft) models fail to load in llama-server while the same
model loads and runs in llama-cli (#24443). The server aborts in load_model
when creating the speculative context.

Root cause

When the server builds the speculative draft context and the MTP context,
it reuses the target context params. Those carry embeddings = true and a
pooling type. The draft/MTP graphs only produce draft logits and have no
embeddings output, so creating their context with embeddings/pooling enabled
fails. llama-cli does not copy these target-side params into the speculative
context, which is why it is unaffected. (This matches the maintainer note on
#24480 that MTP contexts do not need to set the embeddings tensor.)

Fix

In tools/server/server-context.cpp, before llama_init_from_model() for the
draft context and for the MTP context, set:

cparams.embeddings   = false;
cparams.pooling_type = LLAMA_POOLING_TYPE_NONE;

so the speculative context is created for logits only. The target context is
unchanged.
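As a rough illustration of the change described above, the sketch below copies the target's context params and then clears the two fields before the speculative context would be created. The types `context_params_sketch` and `pooling_type_sketch` and the helper `make_speculative_params` are hypothetical stand-ins so that the example compiles without llama.h; they are not the actual code in tools/server/server-context.cpp, where the fields live on llama_context_params.

```cpp
#include <cassert>

// Hypothetical stand-in for llama_pooling_type (assumption: only the values
// needed for this illustration are listed).
enum pooling_type_sketch {
    POOLING_TYPE_UNSPECIFIED = -1,
    POOLING_TYPE_NONE        = 0,
    POOLING_TYPE_MEAN        = 1,
};

// Hypothetical stand-in for llama_context_params, reduced to the fields
// relevant to this PR plus one unrelated field (n_ctx) to show it is preserved.
struct context_params_sketch {
    int                 n_ctx        = 4096;
    bool                embeddings   = false;
    pooling_type_sketch pooling_type = POOLING_TYPE_UNSPECIFIED;
};

// Derive the params for the draft/MTP context from the target's params.
// Everything is inherited except embeddings and pooling, which are forced
// off so the speculative context is created for logits only.
context_params_sketch make_speculative_params(const context_params_sketch & target) {
    context_params_sketch cparams = target;   // reuse the target settings
    cparams.embeddings   = false;             // draft/MTP graphs have no embeddings output
    cparams.pooling_type = POOLING_TYPE_NONE; // no pooling on the speculative context
    assert(!cparams.embeddings);              // sanity check before handing off to context creation
    return cparams;
}
```

In the real server the overridden params would then be passed to llama_init_from_model(); the target's own params are taken by const reference here, which mirrors the statement that the target context is unchanged.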

Validation

On an RDNA3.5 (gfx1151) ROCm/Vulkan build:

cli: loads & runs the MTP model (--spec-type draft-mtp) — unaffected.
server, unpatched: aborts in load_model (reproduced; gdb stack bottoms at
server.cpp ctx_server.load_model).
server, patched: loads, and serves MTP speculative decoding end to end
(draft-mtp: #gen drafts = 15, #acc drafts = 5, #acc tokens = 8, mean acc len = 1.53), slots release cleanly. Output matches the cli run.
git apply --check clean on current master.

Scope / related

#24480 (closed, unmerged) addressed a Gemma4-model-specific facet in
src/models/gemma4-assistant.cpp; this PR is the server-side context-params fix
and is independent of it. The issue thread also mentions hardware-specific
variants; this PR fixes the embeddings/pooling load failure that reproduces on
the speculative draft/MTP path, which is the common failure mode in the report.

Requirements

I have read the contributing guidelines.
AI usage disclosure: this change was prepared with AI assistance; a human
reviewed and verified it and can explain every line.

ggml-gh-bot · 2026-06-23T11:30:10Z

Hi @liminfei-amd, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 5 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

AndASM · 2026-06-26T03:58:18Z

60bc886 appears to cause a new break that blocks this fix. With that commit reverted this fix works for me.

I think that commit is from this pull: #24980

MTP (and draft) models fail to load in llama-server while working in llama-cli.

When the server creates the speculative draft / MTP context it reuses the target context params, which carry embeddings = true and a pooling type. The draft and MTP graphs only emit draft logits and have no embeddings output, so initializing their context with embeddings/pooling enabled fails and the server aborts in load_model (llama-cli builds the speculative context without these target-side params, which is why it is unaffected).

Explicitly set embeddings = false and pooling_type = LLAMA_POOLING_TYPE_NONE on the draft and MTP context params before llama_init_from_model(), so the speculative context is created for logits only (matching the observation that MTP contexts do not need the embeddings output).

Fixes ggml-org#24443

liminfei-amd · 2026-06-29T02:59:31Z

Thanks for testing and for pinpointing #24980 @AndASM. I've rebased this onto current master.

Re-verified on master (b3fed31), gfx1151: with this patch llama-server loads the MTP target+draft (gemma-4-12b) and serves completions with no abort. The "Gemma4Assistant requires ctx_other to be set" line now shows up only as a "normal during memory fitting" warning and the server continues, so the embeddings/pooling fix here is still needed and is sufficient to load. If you still see a hard break from #24980 on your setup, a repro would help.

liminfei-amd requested a review from a team as a code owner June 23, 2026 11:25

github-actions Bot added examples server labels Jun 23, 2026

liminfei-amd mentioned this pull request Jun 23, 2026

Eval bug: MTP models fail to load when running llama-server, works with llama-cli #24443

Open

liminfei-amd force-pushed the server-mtp-draft-no-embd-24443 branch from d8d4102 to b9268bd June 29, 2026 02:53

liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:server-mtp-draft-no-embd-24443
