llama : error clearly when a non-causal model is used for generation by liminfei-amd · Pull Request #24998 · ggml-org/llama.cpp

liminfei-amd · 2026-06-25T09:14:20Z

Overview

Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) into a generation tool like llama-cli currently aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve() (#24967). Non-causal models emit one output row per token, while a generation request uses a bounded n_outputs_max (llama-cli uses 1), so the cap is exceeded.
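To make the mismatch concrete, here is a minimal sketch of the arithmetic behind the assertion. The helper names `rows_requested` and `fits_cap` are hypothetical and are not llama.cpp functions; the sketch only assumes what is stated above, namely that a non-causal model asks for one output row per token while llama-cli caps outputs at 1.

```cpp
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): how many output rows a batch
// of n_tokens asks for.
//  - causal generation, as driven by llama-cli: a single row
//  - non-causal (embedding/classification) model: one row per token
static uint32_t rows_requested(uint32_t n_tokens, bool causal) {
    if (n_tokens == 0) {
        return 0;
    }
    return causal ? 1u : n_tokens;
}

// Same shape as the failing check: the requested rows must fit the cap.
static bool fits_cap(uint32_t n_outputs_max, uint32_t cap) {
    return n_outputs_max <= cap;
}
```

With a cap of 1, an 8-token prompt passes the check for a causal model and fails it for a non-causal one, which is the condition that trips the GGML_ASSERT.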

Rather than raising the cap (which only lets llama-cli proceed into "the current context does not allow logits computation. skipping"), this PR fails early during llama_context construction with a clear, actionable message:

this model is non-causal (e.g. an embedding/classification model) and cannot be used for text generation; use the embedding API (e.g. llama-embedding) instead
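The fail-early pattern can be sketched as follows. This is not the actual diff: `model_info`, `context`, and the `for_generation` flag are hypothetical stand-ins, and how the real llama_context constructor decides that a request is for generation is not shown here. Only the message text is taken from the PR.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the model metadata consulted at construction.
struct model_info {
    bool causal;
};

// Hypothetical stand-in for llama_context: reject the unsupported
// combination in the constructor, before any buffers are reserved.
struct context {
    context(const model_info & model, bool for_generation) {
        if (for_generation && !model.causal) {
            throw std::runtime_error(
                "this model is non-causal (e.g. an embedding/classification model) "
                "and cannot be used for text generation; use the embedding API "
                "(e.g. llama-embedding) instead");
        }
    }
};
```

Throwing from the constructor means the caller never holds a half-initialized context, so the error surfaces at load time instead of at the first decode.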

It also adds a missing null-context check in the shared server/CLI load path (server_context_impl::load_model), which previously dereferenced the null context via llama_n_ctx() and segfaulted right after the clean error.
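A minimal sketch of the guard, under the assumption that context creation reports failure by returning a null pointer. `fake_context`, `try_create_context`, and this `load_model` are hypothetical stand-ins and do not reproduce the real server_context_impl::load_model code.

```cpp
#include <cstdio>
#include <memory>

// Hypothetical stand-in for a llama context.
struct fake_context {
    unsigned n_ctx;
};

// Hypothetical factory: returns nullptr when construction is rejected,
// e.g. a non-causal model loaded for generation.
static std::unique_ptr<fake_context> try_create_context(bool model_is_causal) {
    if (!model_is_causal) {
        return nullptr;
    }
    return std::make_unique<fake_context>(fake_context{4096});
}

// Check the pointer before reading anything through it, so a failed load
// returns false instead of dereferencing null.
static bool load_model(bool model_is_causal, unsigned & n_ctx_out) {
    auto ctx = try_create_context(model_is_causal);
    if (ctx == nullptr) {
        std::fprintf(stderr, "failed to create context\n");
        return false;
    }
    n_ctx_out = ctx->n_ctx; // safe: ctx is known to be non-null here
    return true;
}
```

Returning false lets the tool exit with an ordinary error code after the message has been printed.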

Additional info

These classifier models are fully usable in llama.cpp via the embedding/rank path — they just aren't generative. For example llama-embedding --pooling rank returns the per-label scores correctly:

$ llama-embedding -m emotion-distilroberta.gguf --pooling rank --embd-normalize -1 -p "I am furious and angry"
rerank score 0:   5.84 [anger]   ...  -2.00 [joy]
$ ...                                                  -p "I feel wonderful joy and happiness"
rerank score 0:   6.12 [joy]     ...  -1.27 [anger]

Tested on RDNA4 (gfx1201, Vulkan):

llama-cli on a DistilRoBERTa emotion classifier -> now exits cleanly (rc=1) with the message above, instead of aborting (rc=134) / segfaulting (rc=139).
llama-embedding --pooling rank on the same model -> unchanged, correct per-label scores.
A causal model (Gemma) via llama-cli -> unchanged, generates normally.

Requirements

I have read and followed the contributing guidelines
AI assistance: yes — an AI coding assistant helped investigate the root cause, draft the change, and run the reproduction/regression tests on real hardware; all changes were reviewed and verified by me.

Fixes #24967

ggerganov · 2026-06-25T11:15:07Z

The llama-cli does not work with non-causal models. What is your use case?

Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) in a generation tool such as llama-cli currently aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve(), because non-causal models emit one output row per token while generation requests a bounded n_outputs_max.

Fail early during llama_context construction with a clear message that points to the embedding API (e.g. llama-embedding), instead of asserting. Also add a missing null-context check in the shared server/cli load path so the tool exits cleanly rather than dereferencing a null context.

Fixes ggml-org#24967

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>

liminfei-amd · 2026-06-26T02:02:56Z

Thanks @ggerganov — you're right, and I've reworked the PR.

These are sequence classifiers (BERT/DistilRoBERTa with a cls.output head) run through llama-cli expecting generation. Raising the cap was the wrong fix — it just lets llama-cli fall into "the current context does not allow logits computation. skipping". They do work via the embedding API (e.g. llama-embedding --pooling rank returns the correct per-label scores).

So the PR now throws a clear error early when a non-causal model is used for generation, instead of bumping the cap — plus a small null-context guard so the tool exits cleanly rather than crashing. Happy to adjust.

liminfei-amd mentioned this pull request Jun 25, 2026

Eval bug: GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) failed #24967

Open

liminfei-amd marked this pull request as ready for review June 25, 2026 09:49

liminfei-amd requested a review from ggerganov as a code owner June 25, 2026 09:49

liminfei-amd force-pushed the amd-rocm/24967-noutputs-noncausal branch from 2c52028 to 03e8433 Compare June 26, 2026 02:01

liminfei-amd requested a review from a team as a code owner June 26, 2026 02:01

liminfei-amd changed the title ~~llama : raise n_outputs_max cap for non-causal models~~ llama : error clearly when a non-causal model is used for generation Jun 26, 2026

github-actions Bot added the server label Jun 26, 2026

liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24967-noutputs-noncausal
