refactor(server): move speculative init to speculative.cpp by wadealexc · Pull Request #24952 · ggml-org/llama.cpp

wadealexc · 2026-06-23T18:27:17Z

Overview

This PR unifies parameter initialization, model loading, and context loading for the draft and MTP models. I wrote it primarily from the perspective of server_context_impl::load_model, after reviewing the main/text model initialization path and comparing it against the draft/MTP model and context init paths.

The draft/MTP model and context are now initialized almost identically to the main model: a constructor that uses a pimpl and exposes the model and context via accessor methods. server_context_impl has been changed so that model_dft and ctx_dft are raw pointers, like the main model's.
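As a rough illustration of the pattern described above, here is a minimal sketch of a pimpl-backed owner that exposes its model and context through accessors, with the server-side struct holding only raw, non-owning pointers. Every name here (fake_model, fake_context, spec_init_result, server_like) is a hypothetical stand-in and not the PR's actual types or signatures.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real model/context types (illustration only).
struct fake_model   { std::string path; };
struct fake_context { const fake_model * model; int n_ctx; };

// Owning wrapper: the constructor loads both objects, the pimpl hides ownership.
class spec_init_result {
public:
    spec_init_result(const std::string & path, int n_ctx);
    ~spec_init_result(); // defined below, where impl is a complete type

    fake_model   * model();   // non-owning accessors
    fake_context * context();

private:
    struct impl;
    std::unique_ptr<impl> pimpl;
};

struct spec_init_result::impl {
    // Declared in this order so the context is destroyed before the model.
    std::unique_ptr<fake_model>   model;
    std::unique_ptr<fake_context> ctx;
};

spec_init_result::spec_init_result(const std::string & path, int n_ctx)
    : pimpl(std::make_unique<impl>()) {
    pimpl->model = std::make_unique<fake_model>(fake_model{path});
    pimpl->ctx   = std::make_unique<fake_context>(fake_context{pimpl->model.get(), n_ctx});
}

spec_init_result::~spec_init_result() = default;

fake_model   * spec_init_result::model()   { return pimpl->model.get(); }
fake_context * spec_init_result::context() { return pimpl->ctx.get(); }

// Server-side struct: keeps the owner alive and caches raw pointers.
struct server_like {
    std::unique_ptr<spec_init_result> dft_init;
    fake_model   * model_dft = nullptr;
    fake_context * ctx_dft   = nullptr;

    void load_draft(const std::string & path, int n_ctx) {
        dft_init  = std::make_unique<spec_init_result>(path, n_ctx);
        model_dft = dft_init->model();
        ctx_dft   = dft_init->context();
        assert(ctx_dft->model == model_dft);
    }
};
```

The raw pointers stay valid only while the owning object is alive, so in this sketch the owner is stored next to them.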

I also added a helper for common_params initialization, common_base_params_to_speculative. It captures the behavior previously spread across several places in load_model and provides a single function that should handle all initialization cases correctly.
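To make the idea of a single derivation point concrete, here is a minimal sketch of such a helper. It is not the PR's common_base_params_to_speculative: the struct fields, the spec_overrides type, and the function name are assumptions made for illustration. The only behavior taken from this PR's notes is that n_outputs_max is set to the base n_parallel on every path.

```cpp
#include <cassert>
#include <string>

// Hypothetical, trimmed-down stand-in for common_params; the field set is assumed.
struct params_like {
    std::string model_path;
    int n_ctx         = 4096;
    int n_parallel    = 1;
    int n_outputs_max = 0;
};

// Hypothetical overrides supplied for the speculative model.
struct spec_overrides {
    std::string model_path; // empty: keep the base model path (assumption)
    int n_ctx = 0;          // 0: inherit the base context size (assumption)
};

// Single place that derives speculative params from the base params.
params_like base_params_to_speculative_sketch(const params_like & base,
                                              const spec_overrides & ov) {
    assert(base.n_parallel > 0);
    params_like out = base; // start from a copy of the base params
    if (!ov.model_path.empty()) {
        out.model_path = ov.model_path;
    }
    if (ov.n_ctx > 0) {
        out.n_ctx = ov.n_ctx;
    }
    // Set on every path, so no caller silently inherits the base value.
    out.n_outputs_max = base.n_parallel;
    return out;
}
```

Because every caller goes through one function, a field that must be overridden cannot be forgotten on just one of the paths.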

Additional information

A few things I want to point out:

- When loading the draft model, the master branch does not set params_dft.n_outputs_max = params_base.n_parallel. Every other use of params_dft in load_model does (both the fit and spec_mtp init paths), so only the draft model load inherits server_n_outputs_max(params_base). This seemed unintended to me (possibly reserving more VRAM than necessary?), so I fixed it in common_base_params_to_speculative.
- Since draft and MTP initialization now sit in a single branch, I opted to bookend the entire branch with load_progress_callback to preserve the prior spec_mtp branch behavior. There might be a cleaner way to accomplish this.
- When common_speculative_init fails, we now reset the model as well as the context (master only resets the context).
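The last two points can be sketched together: a progress callback invoked once before and once after the whole speculative branch, and a failure path that releases both the context and the model. This is only an illustration under stated assumptions. The types, the load_speculative function, the init_spec callable (standing in for common_speculative_init), and the 0.0/1.0 progress values are all hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the real model/context types.
struct fake_model   {};
struct fake_context {};

struct spec_state {
    std::unique_ptr<fake_model>   model;
    std::unique_ptr<fake_context> ctx;
    bool spec_ready = false;
};

// init_spec stands in for the speculative init step and returns false on failure.
// progress is called once at the start and once at the end of the branch.
bool load_speculative(spec_state & st,
                      const std::function<bool()> & init_spec,
                      const std::function<void(float)> & progress) {
    assert(!st.model && !st.ctx); // expects an empty state

    progress(0.0f); // opening bookend

    st.model = std::make_unique<fake_model>();
    st.ctx   = std::make_unique<fake_context>();

    st.spec_ready = init_spec();
    if (!st.spec_ready) {
        // Release both objects, context first since it depends on the model.
        st.ctx.reset();
        st.model.reset();
    }

    progress(1.0f); // closing bookend, reached on success and on failure
    return st.spec_ready;
}
```

Resetting both objects on failure means later code can check either pointer to tell whether speculative decoding is available, rather than finding a loaded model with no context.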

I'm happy to make any adjustments as needed.

Requirements

I have read and agree with the contributing guidelines: Yes
AI usage disclosure: Yes:
- I used Qwen 3.5 27B to help me find/correct all of the locations in server-context.cpp where model_dft and ctx_dft were referenced (they're now raw pointers)
- I used Claude to help review my fixes

refactor(server): move speculative init to speculative.cpp

649ce7d

- unifies draft/spec mtp parameter initialization, model, and context load
- changes server_context_impl model_dft and ctx_dft to use raw pointers

wadealexc requested review from a team as code owners June 23, 2026 18:27

github-actions bot added the examples and server labels Jun 23, 2026

wadealexc closed this Jun 26, 2026

wadealexc wants to merge 1 commit into ggml-org:master from wadealexc:refactor/load-model-draft-params