server : restore forwarding of base CLI args to router child instances by jhsmith409 · Pull Request #24763 · ggml-org/llama.cpp

jhsmith409 · 2026-06-18T11:42:01Z

Summary

In router mode (--models-preset / --models-max), the parent llama-server's own CLI flags stopped being forwarded to the child instances it spawns. This regressed between b9641 (good) and b9692 (bad). Flags that aren't derivable from the preset .ini — e.g. --parallel, --cache-type-k/v, --flash-attn, --n-gpu-layers, --threads, --numa — silently disappear from the child's argument list, so children fall back to defaults.

The most damaging case is --parallel, whose default is now -1 = auto (resolves to n_parallel = 4 on a multi-core host). A child whose ctx-size was tuned for a single slot then allocates ~4× the KV cache and OOMs.
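The scaling described above can be made concrete with a rough back-of-the-envelope calculation. This is a minimal sketch, not llama.cpp code: `kv_shape`, `kv_cache_bytes`, and the model dimensions are hypothetical, and it assumes (as the description states) that total KV allocation grows linearly with the number of slots.

```cpp
#include <cstdint>

// Hypothetical model shape, chosen only to make the arithmetic easy to follow.
struct kv_shape {
    uint64_t n_layer;        // transformer layers
    uint64_t n_head_kv;      // KV heads per layer
    uint64_t head_dim;       // dimension of each head
    uint64_t bytes_per_elem; // 2 for an f16 cache
};

// Bytes needed to hold K and V for n_slots sequences of n_ctx_per_slot tokens each.
// Assumption: every slot gets its own full-size context.
inline uint64_t kv_cache_bytes(const kv_shape & s, uint64_t n_ctx_per_slot, uint64_t n_slots) {
    const uint64_t per_token =
        2 /* K and V */ * s.n_layer * s.n_head_kv * s.head_dim * s.bytes_per_elem;
    return per_token * n_ctx_per_slot * n_slots;
}
```

With the illustrative shape `{32, 8, 128, 2}` and a per-slot context of 8192 tokens, one slot needs 1 GiB and four slots need 4 GiB, which is the kind of jump that turns a configuration sized for a single slot into an out-of-memory failure.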

Fixes #24762. Same underlying cause as #24735 ("build b9688 no respect context size"), which manifests via --ctx-size.

Root cause

server_models::load_models() builds each model's effective preset by cascading the [*] global preset and the per-model section, but the final pass that merged the parent's CLI args (base_preset) into every model preset was dropped in the router rework (#23976).

In b9641, after final_presets is assembled:

// server base preset from CLI args takes highest precedence
for (auto & [name, preset] : final_presets) {
    preset.merge(base_preset);
}

In b9692 this block is gone. base_preset is still constructed (load_from_args(argc, argv)) and cleaned (unset_reserved_args), but it is never merged into the per-model presets — so it has no effect on the rendered child args.
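To illustrate why the missing pass matters, here is a toy model of the cascade. It is a sketch under stated assumptions: `toy_preset` and `effective_preset` are hypothetical stand-ins, not the real `common_preset` or `server_models::load_models()`, and the only behavior borrowed from the description is that merge overwrites existing keys.

```cpp
#include <map>
#include <string>

// Toy stand-in for a preset: option name -> value. Not the real common_preset.
struct toy_preset {
    std::map<std::string, std::string> opts;

    // Overwrite semantics: every key in `other` replaces the value held here.
    void merge(const toy_preset & other) {
        for (const auto & kv : other.opts) {
            opts[kv.first] = kv.second;
        }
    }
};

// Build one model's effective preset: the global [*] section first, then the
// per-model section, then (optionally) the parent's CLI args as the final pass.
inline toy_preset effective_preset(const toy_preset & global,
                                   const toy_preset & model,
                                   const toy_preset & base,
                                   bool apply_base_pass) {
    toy_preset out = global;
    out.merge(model);
    if (apply_base_pass) {
        out.merge(base); // the pass whose removal caused the regression
    }
    return out;
}
```

With `apply_base_pass` set to false, any key that exists only in `base` (for example a `parallel` entry) never reaches the result, and keys that exist in both keep the preset's value instead of the CLI's.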

Fix

Restore the dropped merge pass right after final_presets is assembled. base_preset and common_preset::merge() (overwrite semantics → CLI takes precedence) are unchanged, so this re-establishes the pre-#23976 behavior with no other changes.

Verification

Confirmed via the router's "spawning server instance with args:" log line. On b9641 the child receives --parallel, --cache-type-k/v, --flash-attn, -ngl, etc. On b9692 only preset-derived keys (--ctx-size, --model, --split-mode, --tensor-split, --main-gpu, --alias) survive; the child logs "n_parallel is set to auto, using n_parallel = 4" and then OOMs while allocating the KV cache.
The restored block is a verbatim reinstatement of code that shipped in b9641; common_preset::merge(const common_preset&) and the base_preset member are unchanged in b9692.
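The log comparison above amounts to a set difference over the flags in two argument lists. A minimal sketch of that check follows; `looks_like_flag` and `dropped_flags` are hypothetical helpers, not part of llama.cpp, and the caller is assumed to have already split each logged argument list into tokens.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// True for tokens that look like option names ("--parallel", "-ngl"),
// false for values, including negative numbers such as "-1".
inline bool looks_like_flag(const std::string & tok) {
    return tok.size() >= 2 && tok[0] == '-' &&
           !std::isdigit(static_cast<unsigned char>(tok[1]));
}

// Flags present in `before` but absent from `after`, in sorted order.
inline std::vector<std::string> dropped_flags(const std::vector<std::string> & before,
                                              const std::vector<std::string> & after) {
    std::set<std::string> kept;
    for (const auto & tok : after) {
        if (looks_like_flag(tok)) {
            kept.insert(tok);
        }
    }
    std::set<std::string> dropped;
    for (const auto & tok : before) {
        if (looks_like_flag(tok) && kept.count(tok) == 0) {
            dropped.insert(tok);
        }
    }
    return std::vector<std::string>(dropped.begin(), dropped.end());
}
```

Feeding it the child argument list from a good build as `before` and from a bad build as `after` would list exactly the flags that stopped being forwarded.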

Note: I was unable to run a full CUDA build to runtime-test this (no CUDA toolchain on the build host — it lives only in the prebuilt server-cuda image). The change is a faithful restoration of the b9641 code path; a maintainer CI build would confirm.

In router mode, parent llama-server CLI flags that aren't derivable from the preset .ini (e.g. --parallel, --cache-type-k/v, --flash-attn, -ngl, --threads, --numa) stopped reaching spawned child instances after ggml-org#23976. Children fell back to defaults -- most damagingly --parallel's new default of -1=auto, which resolves to n_parallel=4 and multiplies the KV cache, OOMing models whose ctx-size was tuned for a single slot. The refactor dropped the final `final_presets[*].merge(base_preset)` pass that gave the parent's CLI args highest precedence. Restore it. base_preset and common_preset::merge() are unchanged, so this re-establishes the pre-ggml-org#23976 behavior. Fixes ggml-org#24762 (and the same-symptom ggml-org#24735).

ggml-gh-bot · 2026-06-18T11:46:51Z

Hi @jhsmith409, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

jhsmith409 · 2026-06-18T11:49:33Z

The PR restored the code to match the prior code in the repo. Neither I nor AI wrote it, I just put it back. Claude diagnosed it and wrote the description.

So AI didn't write the code from my perspective unless the repo was previously written by AI.

ngxson · 2026-06-18T12:11:46Z

check before you push - the fix is already deployed

jhsmith409 requested a review from a team as a code owner June 18, 2026 11:42

github-actions Bot added examples server labels Jun 18, 2026

jhsmith409 mentioned this pull request Jun 18, 2026 in issue #24762 (closed): Router (server) stops forwarding parent CLI flags (--parallel, --cache-type-*, --flash-attn, -ngl) to spawned child instances — regression between b9641 and b9692

ngxson closed this Jun 18, 2026

jhsmith409 wants to merge 1 commit into ggml-org:master from jhsmith409:fix-router-forward-cli-args