
refactor(server): move speculative init to speculative.cpp #24952

Closed
wadealexc wants to merge 1 commit into ggml-org:master from wadealexc:refactor/load-model-draft-params

wadealexc commented Jun 23, 2026

Overview

This PR unifies parameter initialization, model loading, and context loading for the draft and MTP paths. I wrote it primarily from the perspective of server_context_impl::load_model, after reviewing the main (text) model initialization path and comparing it against the draft/MTP model and context initialization paths.

The draft/MTP model and context are now initialized almost identically to the main model: through a constructor that uses a pimpl and exposes the model and context via accessor methods. server_context_impl has been changed so that model_dft and ctx_dft are raw pointers, like the main model's.
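To illustrate the shape described above, here is a minimal sketch of a pimpl-backed loader whose consumer holds raw, non-owning pointers. Every name in it (`toy_model`, `toy_context`, `toy_spec_init`, `toy_server`) is a hypothetical stand-in invented for this example; it is not the code in this PR and does not use the real llama.cpp types.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real model/context types.
struct toy_model   { std::string path; };
struct toy_context { const toy_model * model; int n_ctx; };

// Loader that owns the model and context behind a pimpl.
// Callers only see the constructor and the two accessors.
class toy_spec_init {
public:
    toy_spec_init(const std::string & path, int n_ctx);
    ~toy_spec_init();

    toy_model   * model();    // non-owning; valid while this object lives
    toy_context * context();  // non-owning; valid while this object lives

private:
    struct impl;
    std::unique_ptr<impl> pimpl;
};

struct toy_spec_init::impl {
    // Declared in this order so the context is destroyed before the model.
    std::unique_ptr<toy_model>   model;
    std::unique_ptr<toy_context> ctx;
};

toy_spec_init::toy_spec_init(const std::string & path, int n_ctx) : pimpl(new impl) {
    pimpl->model.reset(new toy_model{path});
    pimpl->ctx.reset(new toy_context{pimpl->model.get(), n_ctx});
}

// Defined here, where impl is a complete type.
toy_spec_init::~toy_spec_init() = default;

toy_model   * toy_spec_init::model()   { return pimpl->model.get(); }
toy_context * toy_spec_init::context() { return pimpl->ctx.get(); }

// Consumer: keeps the owning loader plus raw pointers for day-to-day use.
struct toy_server {
    std::unique_ptr<toy_spec_init> spec;
    toy_model   * model_dft = nullptr;
    toy_context * ctx_dft   = nullptr;

    void load(const std::string & path, int n_ctx) {
        spec.reset(new toy_spec_init(path, n_ctx));
        model_dft = spec->model();
        ctx_dft   = spec->context();
    }
};
```

In this sketch the raw pointers stay valid only as long as `spec` is alive, so anything that resets `spec` should also null `model_dft` and `ctx_dft`.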

I also added a helper for common_params initialization, common_base_params_to_speculative. It captures behavior that was spread across several places in load_model and provides a single function intended to handle all initialization cases correctly.
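As a rough illustration of the "copy the base parameters, then override the speculative-specific fields" idea, the sketch below uses a made-up, trimmed-down struct. `toy_params` and `toy_base_to_speculative` are hypothetical names for this example only; the real common_params has many more fields, and the real helper may override a different set of them. The only override taken from this description is n_outputs_max being set from n_parallel.

```cpp
#include <cassert>
#include <string>

// Hypothetical, trimmed-down parameter struct (assumption for this sketch).
struct toy_params {
    std::string model_path;
    int n_parallel    = 1;
    int n_outputs_max = 0;
};

// Assumed derivation: start from a copy of the base parameters, then
// override only what differs for the draft/MTP model.
inline toy_params toy_base_to_speculative(const toy_params & base,
                                          const std::string & draft_path) {
    toy_params dft = base;                // inherit everything by default
    dft.model_path    = draft_path;       // draft model has its own file
    dft.n_outputs_max = base.n_parallel;  // do not inherit the base value
    return dft;
}
```

Keeping the derivation in one function means every call site gets the same overrides, which is the property the helper in this PR is meant to provide.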

Additional information

A few things I want to point out:

  1. When loading the draft model, the master branch does not set params_dft.n_outputs_max = params_base.n_parallel. Every other use of params_dft in load_model does (both the fit and spec_mtp init paths), so the draft load is the one path that inherits server_n_outputs_max(params_base). This seemed unintended to me (possibly reserving more VRAM than necessary?), so I fixed it in common_base_params_to_speculative.
  2. Since draft and MTP initialization now sit in a single branch, I opted to bookend the entire branch with load_progress_callback to preserve the prior spec_mtp branch behavior. There might be a cleaner way to accomplish this.
  3. When common_speculative_init fails, we now reset the model as well as the context (master only resets the context).
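The failure handling in point 3 can be sketched as follows. This is a toy model of the control flow only: `toy_draft_state` and its `spec_init_ok` flag are hypothetical, and the flag simply stands in for whether speculative initialization succeeded. It does not call any real llama.cpp function.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the real model/context types.
struct toy_model   { int id; };
struct toy_context { const toy_model * model; };

struct toy_draft_state {
    std::unique_ptr<toy_model>   model_owner;
    std::unique_ptr<toy_context> ctx_owner;
    toy_model   * model_dft = nullptr;  // raw, non-owning
    toy_context * ctx_dft   = nullptr;  // raw, non-owning

    // spec_init_ok simulates the result of speculative initialization.
    // On failure, both the context and the model are released.
    bool load(bool spec_init_ok) {
        model_owner.reset(new toy_model{1});
        ctx_owner.reset(new toy_context{model_owner.get()});
        model_dft = model_owner.get();
        ctx_dft   = ctx_owner.get();

        if (!spec_init_ok) {
            // Null the raw pointers first so nothing dangles,
            // then free the context before the model it refers to.
            ctx_dft   = nullptr;
            model_dft = nullptr;
            ctx_owner.reset();
            model_owner.reset();
            return false;
        }
        return true;
    }
};
```

Resetting both objects leaves the state in this sketch identical to "no draft model configured", rather than keeping a loaded model with no usable context.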

I'm happy to make any adjustments as needed.

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes:
    • I used Qwen 3.5 27B to help me find/correct all of the locations in server-context.cpp where model_dft and ctx_dft were referenced (they're now raw pointers)
    • I used Claude to help review my fixes

- unifies draft/spec mtp parameter initialization, model, and context load
- changes server_context_impl model_dft and ctx_dft to use raw pointers