server : restore forwarding of base CLI args to router child instances#24763
Closed
jhsmith409 wants to merge 1 commit into
Closed
server : restore forwarding of base CLI args to router child instances#24763jhsmith409 wants to merge 1 commit into
jhsmith409 wants to merge 1 commit into
Conversation
In router mode, parent llama-server CLI flags that aren't derivable from the preset .ini (e.g. --parallel, --cache-type-k/v, --flash-attn, -ngl, --threads, --numa) stopped reaching spawned child instances after ggml-org#23976. Children fell back to defaults -- most damagingly --parallel's new default of -1=auto, which resolves to n_parallel=4 and multiplies the KV cache, OOMing models whose ctx-size was tuned for a single slot. The refactor dropped the final `final_presets[*].merge(base_preset)` pass that gave the parent's CLI args highest precedence. Restore it. base_preset and common_preset::merge() are unchanged, so this re-establishes the pre-ggml-org#23976 behavior. Fixes ggml-org#24762 (and the same-symptom ggml-org#24735).
|
Hi @jhsmith409, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
Author
|
The PR restored the code to match the prior code in the repo. Neither I nor AI wrote it, I just put it back. Claude diagnosed it and wrote the description. So AI didn't write the code from my perspective unless the repo was previously written by AI. |
Collaborator
|
check before you push - the fix is already deployed |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
In router mode (
--models-preset/--models-max), the parentllama-server's own CLI flags stopped being forwarded to the child instances it spawns. This regressed between b9641 (good) and b9692 (bad). Flags that aren't derivable from the preset.ini— e.g.--parallel,--cache-type-k/v,--flash-attn,--n-gpu-layers,--threads,--numa— silently disappear from the child's argument list, so children fall back to defaults.The most damaging case is
--parallel, whose default is now-1 = auto(resolves ton_parallel = 4on a multi-core host). A child whosectx-sizewas tuned for a single slot then allocates ~4× the KV cache and OOMs.Fixes #24762. Same underlying cause as #24735 ("build b9688 no respect context size"), which manifests via
--ctx-size.Root cause
server_models::load_models()builds each model's effective preset by cascading the[*]global preset and the per-model section, but the final pass that merged the parent's CLI args (base_preset) into every model preset was dropped in the router rework (#23976).In b9641, after
final_presetsis assembled:In b9692 this block is gone.
base_presetis still constructed (load_from_args(argc, argv)) and cleaned (unset_reserved_args), but it is never merged into the per-model presets — so it has no effect on the rendered child args.Fix
Restore the dropped merge pass right after
final_presetsis assembled.base_presetandcommon_preset::merge()(overwrite semantics → CLI takes precedence) are unchanged, so this re-establishes the pre-#23976 behavior with no other changes.Verification
spawning server instance with args:log: on b9641 the child receives--parallel,--cache-type-k/v,--flash-attn,-ngl, etc.; on b9692 only preset-derived keys (--ctx-size,--model,--split-mode,--tensor-split,--main-gpu,--alias) survive, and the child logsn_parallel is set to auto, using n_parallel = 4then OOMs the KV cache.common_preset::merge(const common_preset&)and thebase_presetmember are unchanged in b9692.