llama : error clearly when a non-causal model is used for generation#24998
Open
liminfei-amd wants to merge 1 commit into
Member
|
The |
Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) in a generation tool such as llama-cli currently aborts with GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) in output_reserve(), because non-causal models emit one output row per token while generation requests a bounded n_outputs_max.

Fail early during llama_context construction with a clear message that points to the embedding API (e.g. llama-embedding), instead of asserting.

Also add a missing null-context check in the shared server/cli load path so the tool exits cleanly rather than dereferencing a null context.

Fixes ggml-org#24967

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>
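The mismatch described in the commit message can be sketched in isolation. The following is a minimal, self-contained illustration and not the actual llama.cpp code: `sketch_cparams`, `sketch_rows_needed`, and `sketch_validate` are hypothetical names, the exact condition checked by the PR is an assumption, and the error text is illustrative wording only.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the relevant context parameters (not the real llama.cpp struct).
struct sketch_cparams {
    bool     causal_attn;    // false for BERT-style encoder/classifier models
    bool     for_generation; // assumption: true when a generation tool created the context
    uint32_t n_outputs_max;  // upper bound on output rows reserved by the caller
};

// Output rows a batch would produce: one per token for non-causal models,
// otherwise only the rows the caller explicitly requested.
inline uint32_t sketch_rows_needed(const sketch_cparams & cp, uint32_t n_tokens, uint32_t n_requested) {
    return cp.causal_attn ? n_requested : n_tokens;
}

// Early check in the spirit of the PR: reject the combination at construction
// time instead of tripping an assertion later when outputs are reserved.
inline void sketch_validate(const sketch_cparams & cp) {
    if (!cp.causal_attn && cp.for_generation) {
        // Illustrative message, not the wording used by the PR.
        throw std::runtime_error(
            "non-causal model cannot be used for generation; "
            "use the embedding API instead (e.g. llama-embedding)");
    }
}
```

With `n_outputs_max = 1` and an 8-token prompt, `sketch_rows_needed` returns 8 for a non-causal model, which is the kind of overflow the assertion used to catch; `sketch_validate` reports the problem before any batch is processed.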
2c52028 to 03e8433
Contributor
Author
|
Thanks @ggerganov — you're right, and I've reworked the PR. These are sequence classifiers (BERT/DistilRoBERTa with a So the PR now throws a clear error early when a non-causal model is used for generation, instead of bumping the cap — plus a small null-context guard so the tool exits cleanly rather than crashing. Happy to adjust. |
Overview

Loading an embedding/classification model (e.g. BERT/DistilRoBERTa) into a generation tool like `llama-cli` currently aborts with `GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max)` in `output_reserve()` (#24967). Non-causal models emit one output row per token, while a generation request uses a bounded `n_outputs_max` (`llama-cli` uses `1`), so the cap is exceeded.

Rather than raising the cap (which only lets `llama-cli` proceed into `the current context does not allow logits computation. skipping`), this PR fails early during `llama_context` construction with a clear, actionable message that points to the embedding API (e.g. `llama-embedding`).

It also adds a missing null-context check in the shared server/CLI load path (`server_context_impl::load_model`), which previously dereferenced the null context via `llama_n_ctx()` and segfaulted right after the clean error.

Additional info

These classifier models are fully usable in llama.cpp via the embedding/rank path; they just aren't generative. For example, `llama-embedding --pooling rank` returns the per-label scores correctly.

Tested on RDNA4 (gfx1201, Vulkan):

- `llama-cli` on a DistilRoBERTa emotion classifier -> now exits cleanly (rc=1) with the new error message, instead of aborting (rc=134) / segfaulting (rc=139).
- `llama-embedding --pooling rank` on the same model -> unchanged, correct per-label scores.
- `llama-cli` -> unchanged, generates normally.

Requirements

Fixes #24967
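
The null-context guard mentioned in the overview follows a common pattern that can be sketched independently of llama.cpp. In this minimal example every name (`sketch_context`, `sketch_init_context`, `sketch_load_model`) and the context size are hypothetical; it only shows a loader that returns false when context creation fails, rather than reading from a null pointer.

```cpp
#include <cstdio>
#include <memory>

// Hypothetical context type; stands in for whatever the real loader returns.
struct sketch_context {
    int n_ctx;
};

// Hypothetical constructor wrapper: yields nullptr when creation is rejected,
// e.g. because the model is non-causal and the tool wants to generate text.
inline std::unique_ptr<sketch_context> sketch_init_context(bool causal) {
    if (!causal) {
        return nullptr;
    }
    return std::make_unique<sketch_context>(sketch_context{4096});
}

// Load path with the guard: check the pointer before any use, so a failed
// construction ends in a clean "false" instead of a null dereference.
inline bool sketch_load_model(bool causal, int & n_ctx_out) {
    std::unique_ptr<sketch_context> ctx = sketch_init_context(causal);
    if (!ctx) {
        std::fprintf(stderr, "failed to create context\n");
        return false;
    }
    n_ctx_out = ctx->n_ctx; // safe: ctx is known to be non-null here
    return true;
}
```

A caller can then map a `false` return to a non-zero exit code, which matches the clean rc=1 exit reported in the test notes above.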