
server : restore forwarding of base CLI args to router child instances #24763

Closed
jhsmith409 wants to merge 1 commit into ggml-org:master from jhsmith409:fix-router-forward-cli-args


Conversation

@jhsmith409


Summary

In router mode (--models-preset / --models-max), the parent llama-server's own CLI flags stopped being forwarded to the child instances it spawns. This regressed between b9641 (good) and b9692 (bad). Flags that aren't derivable from the preset .ini — e.g. --parallel, --cache-type-k/v, --flash-attn, --n-gpu-layers, --threads, --numa — silently disappear from the child's argument list, so children fall back to defaults.

The most damaging case is --parallel, whose default is now -1 = auto (resolves to n_parallel = 4 on a multi-core host). A child whose ctx-size was tuned for a single slot then allocates ~4× the KV cache and OOMs.
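To make the ~4× figure concrete, here is a rough back-of-the-envelope sketch. It takes the description's claim at face value, assuming the KV allocation grows linearly with the number of slots when each slot is provisioned for the full ctx-size. The struct, the function names, and the model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical illustration values, not taken from llama.cpp or from any model in the report.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model dimensions, for illustration only.
struct kv_shape_sketch {
    uint64_t n_layer;        // transformer layers
    uint64_t n_kv_head;      // KV heads per layer
    uint64_t head_dim;       // dimension of each head
    uint64_t bytes_per_elem; // e.g. 2 for a 16-bit cache type
};

// Bytes needed to cache one token: a K and a V vector per layer.
uint64_t kv_bytes_per_token(const kv_shape_sketch & s) {
    return 2 * s.n_layer * s.n_kv_head * s.head_dim * s.bytes_per_elem;
}

// Total bytes under the assumption that every slot gets the full context.
uint64_t kv_bytes_total(const kv_shape_sketch & s, uint64_t ctx_per_slot, uint64_t n_slots) {
    assert(n_slots > 0);
    return kv_bytes_per_token(s) * ctx_per_slot * n_slots;
}
```

With these made-up dimensions, one token costs 128 KiB, so a 32768-token context needs 4 GiB for one slot and 16 GiB for four. A GPU sized for the single-slot case would not have that headroom, which matches the OOM described above.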

Fixes #24762. Same underlying cause as #24735 ("build b9688 no respect context size"), which manifests via --ctx-size.

Root cause

server_models::load_models() builds each model's effective preset by cascading the [*] global preset and the per-model section, but the final pass that merged the parent's CLI args (base_preset) into every model preset was dropped in the router rework (#23976).

In b9641, after final_presets is assembled:

// server base preset from CLI args takes highest precedence
for (auto & [name, preset] : final_presets) {
    preset.merge(base_preset);
}

In b9692 this block is gone. base_preset is still constructed (load_from_args(argc, argv)) and cleaned (unset_reserved_args), but it is never merged into the per-model presets — so it has no effect on the rendered child args.

Fix

Restore the dropped merge pass right after final_presets is assembled. base_preset and common_preset::merge() (overwrite semantics → CLI takes precedence) are unchanged, so this re-establishes the pre-#23976 behavior with no other changes.
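The cascade and the restored pass can be sketched in isolation. This is a minimal model, not the actual llama.cpp code: preset_sketch is a hypothetical stand-in for common_preset (reduced to a flat flag-to-value map), and build_final_presets with its merge_base switch is an invented helper that only exists to contrast the two behaviors.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for common_preset: a flat map from flag to value.
struct preset_sketch {
    std::map<std::string, std::string> options;

    // Overwrite semantics: every key in `other` replaces the value held here.
    void merge(const preset_sketch & other) {
        for (const auto & [key, value] : other.options) {
            options[key] = value;
        }
    }
};

// Cascade global -> per-model section, then optionally apply the CLI-derived
// base preset last so that it takes highest precedence.
std::map<std::string, preset_sketch> build_final_presets(
        const preset_sketch & global,
        const std::map<std::string, preset_sketch> & per_model,
        const preset_sketch & base,
        bool merge_base) {
    std::map<std::string, preset_sketch> final_presets;
    for (const auto & [name, section] : per_model) {
        preset_sketch p = global; // start from the [*] section
        p.merge(section);         // per-model keys override global keys
        final_presets[name] = p;
    }
    if (merge_base) {
        // The pass this PR restores: without it, `base` never reaches a child.
        for (auto & [name, preset] : final_presets) {
            preset.merge(base);
        }
    }
    return final_presets;
}
```

In this sketch, calling build_final_presets with merge_base set to false leaves a CLI-only key such as --parallel out of every model's preset, while setting it to true adds the key and lets it override anything the .ini provided.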

Verification

  • Confirmed via the router's "spawning server instance with args:" log line. On b9641 the child receives --parallel, --cache-type-k/v, --flash-attn, -ngl, etc. On b9692 only preset-derived keys (--ctx-size, --model, --split-mode, --tensor-split, --main-gpu, --alias) survive; the child logs "n_parallel is set to auto, using n_parallel = 4" and then OOMs the KV cache.
  • The restored block is a verbatim reinstatement of code that shipped in b9641; common_preset::merge(const common_preset&) and the base_preset member are unchanged in b9692.

Note: I was unable to run a full CUDA build to runtime-test this (no CUDA toolchain on the build host — it lives only in the prebuilt server-cuda image). The change is a faithful restoration of the b9641 code path; a maintainer CI build would confirm.

In router mode, parent llama-server CLI flags that aren't derivable from the
preset .ini (e.g. --parallel, --cache-type-k/v, --flash-attn, -ngl, --threads,
--numa) stopped reaching spawned child instances after ggml-org#23976. Children fell
back to defaults -- most damagingly --parallel's new default of -1=auto, which
resolves to n_parallel=4 and multiplies the KV cache, OOMing models whose
ctx-size was tuned for a single slot.

The refactor dropped the final `final_presets[*].merge(base_preset)` pass that
gave the parent's CLI args highest precedence. Restore it. base_preset and
common_preset::merge() are unchanged, so this re-establishes the pre-ggml-org#23976
behavior. Fixes ggml-org#24762 (and the same-symptom ggml-org#24735).
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 18, 2026


Hi @jhsmith409, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@jhsmith409

Author

The PR restored the code to match the prior code in the repo. Neither I nor AI wrote it; I just put it back. Claude diagnosed it and wrote the description.

So AI didn't write the code from my perspective unless the repo was previously written by AI.

@ngxson

ngxson commented Jun 18, 2026

Collaborator

check before you push - the fix is already deployed
