llama : error clearly when a non-causal model is used for generation #24998

Open
liminfei-amd wants to merge 1 commit into ggml-org:master from liminfei-amd:amd-rocm/24967-noutputs-noncausal

@liminfei-amd liminfei-amd commented Jun 25, 2026

Contributor

Overview

Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) into a generation tool like llama-cli currently aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve() (#24967). Non-causal models emit one output row per token, while a generation request uses a bounded n_outputs_max (llama-cli uses 1), so the cap is exceeded.
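To make the mismatch concrete, here is a minimal sketch of the row-count arithmetic described above. The helper names `rows_needed` and `fits_cap` are hypothetical and are not llama.cpp functions; the real bookkeeping in output_reserve() is more involved.

```cpp
#include <cstdint>

// Hypothetical helper (not llama.cpp code): how many output rows a batch needs.
// A non-causal model emits one row per token; a generation request only needs
// rows for the tokens it will sample from (llama-cli uses 1).
inline uint32_t rows_needed(bool causal, uint32_t n_tokens, uint32_t n_sampled) {
    return causal ? n_sampled : n_tokens;
}

// Same shape as the failing assertion: the rows needed must fit under the cap.
inline bool fits_cap(uint32_t n_outputs, uint32_t n_outputs_max) {
    return n_outputs <= n_outputs_max;
}
```

With a cap of 1, an 8-token prompt fits for a causal model (1 row) but not for a non-causal one (8 rows), which is the condition the assertion trips on.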

Rather than raising the cap (which only lets llama-cli proceed until it logs "the current context does not allow logits computation. skipping"), this PR fails early during llama_context construction with a clear, actionable message:

this model is non-causal (e.g. an embedding/classification model) and cannot be used for text generation; use the embedding API (e.g. llama-embedding) instead
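A rough sketch of what such a fail-early check could look like. This is not the code in this PR: `model_info`, `ctx_request`, and `check_generation_supported` are hypothetical stand-ins, and using an `embeddings` flag to detect a generation request is an assumption, since the thread does not show how the PR makes that decision.

```cpp
#include <stdexcept>

// Hypothetical stand-ins; the real llama.cpp types and fields differ.
struct model_info  { bool causal_attn; };
struct ctx_request { bool embeddings; };  // false => caller wants generation (assumed)

// Throw before any buffers are reserved, so the caller sees a readable error
// instead of an assertion failure later on.
inline void check_generation_supported(const model_info & m, const ctx_request & r) {
    if (!m.causal_attn && !r.embeddings) {
        throw std::runtime_error(
            "this model is non-causal (e.g. an embedding/classification model) "
            "and cannot be used for text generation; use the embedding API "
            "(e.g. llama-embedding) instead");
    }
}
```

Throwing at construction time keeps the embedding path untouched: a non-causal model requested with embeddings enabled passes the check.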

It also adds a missing null-context check in the shared server/CLI load path (server_context_impl::load_model), which previously dereferenced the null context via llama_n_ctx() and segfaulted right after the clean error.
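The guard amounts to checking the context pointer before reading anything from it. A minimal sketch, with hypothetical `fake_context`, `make_context`, and `load_model` standing in for the real server/CLI code (the value 4096 is arbitrary):

```cpp
#include <cstdio>
#include <memory>

// Hypothetical stand-in for a context; not the llama.cpp type.
struct fake_context { unsigned n_ctx; };

// Hypothetical factory: returns nullptr when context creation fails.
inline std::unique_ptr<fake_context> make_context(bool ok) {
    if (!ok) {
        return nullptr;
    }
    return std::unique_ptr<fake_context>(new fake_context{4096});
}

// Check the pointer before reading from it, so a failed construction becomes
// a clean 'false' return instead of a null dereference.
inline bool load_model(bool ok, unsigned & n_ctx_out) {
    std::unique_ptr<fake_context> ctx = make_context(ok);
    if (!ctx) {
        std::fprintf(stderr, "failed to create context\n");
        return false;
    }
    n_ctx_out = ctx->n_ctx;
    return true;
}
```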

Additional info

These classifier models are fully usable in llama.cpp via the embedding/rank path — they just aren't generative. For example llama-embedding --pooling rank returns the per-label scores correctly:

$ llama-embedding -m emotion-distilroberta.gguf --pooling rank --embd-normalize -1 -p "I am furious and angry"
rerank score 0:   5.84 [anger]   ...  -2.00 [joy]
$ ...                                                  -p "I feel wonderful joy and happiness"
rerank score 0:   6.12 [joy]     ...  -1.27 [anger]

Tested on RDNA4 (gfx1201, Vulkan):

  • llama-cli on a DistilRoBERTa emotion classifier -> now exits cleanly (rc=1) with the message above, instead of aborting (rc=134) / segfaulting (rc=139).
  • llama-embedding --pooling rank on the same model -> unchanged, correct per-label scores.
  • A causal model (Gemma) via llama-cli -> unchanged, generates normally.
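For readers unfamiliar with the return codes above: assuming they were read from a POSIX shell, a process killed by a signal is reported as 128 plus the signal number, so 134 corresponds to SIGABRT (the failed GGML_ASSERT) and 139 to SIGSEGV (the null dereference). The helper below is only an illustration of that convention.

```cpp
#include <csignal>

// Shell convention (assumed here): the exit status reported for a process
// terminated by a signal is 128 + the signal number.
constexpr int shell_status_for_signal(int sig) {
    return 128 + sig;
}
```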

Requirements

  • I have read and followed the contributing guidelines
  • AI assistance: yes — an AI coding assistant helped investigate the root cause, draft the change, and run the reproduction/regression tests on real hardware; all changes were reviewed and verified by me.

Fixes #24967

@ggerganov

Member

The llama-cli does not work with non-causal models. What is your use case?

Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) in a
generation tool such as llama-cli currently aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve(),
because non-causal models emit one output row per token while generation
requests a bounded n_outputs_max.

Fail early during llama_context construction with a clear message that points
to the embedding API (e.g. llama-embedding), instead of asserting. Also add a
missing null-context check in the shared server/cli load path so the tool
exits cleanly rather than dereferencing a null context.

Fixes ggml-org#24967

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
@liminfei-amd liminfei-amd force-pushed the amd-rocm/24967-noutputs-noncausal branch from 2c52028 to 03e8433 Compare June 26, 2026 02:01
@liminfei-amd liminfei-amd requested a review from a team as a code owner June 26, 2026 02:01
@liminfei-amd liminfei-amd changed the title from "llama : raise n_outputs_max cap for non-causal models" to "llama : error clearly when a non-causal model is used for generation" Jun 26, 2026
@liminfei-amd

Contributor Author

Thanks @ggerganov — you're right, and I've reworked the PR.

These are sequence classifiers (BERT/DistilRoBERTa with a cls.output head) run through llama-cli expecting generation. Raising the cap was the wrong fix — it just lets llama-cli fall into "the current context does not allow logits computation. skipping". They do work via the embedding API (e.g. llama-embedding --pooling rank returns the correct per-label scores).

So the PR now throws a clear error early when a non-causal model is used for generation, instead of bumping the cap — plus a small null-context guard so the tool exits cleanly rather than crashing. Happy to adjust.


Successfully merging this pull request may close these issues.

Eval bug: GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) failed
