Gemma4 moe pr v3 by cavusmustafa · Pull Request #221 · ravi9/llama.cpp

cavusmustafa · 2026-06-16T20:03:34Z

Initial support for Gemma4 26B MoE.

Only CPU and GPU are supported in this PR (no NPU).
GPU requires a system with at least 64 GB of memory.
CPU can work on a 32 GB system but needs a larger swap space.

The MoE path baked the captured prefill token count into the graph as a static dimension, so every decoder layer after layer 0 became statically shaped. On the GPU plugin that static shape tripped the in-place-Concat KV-cache path (garbage prefill); on both CPU and GPU it tripped a Broadcast shape mismatch at multi-token decode.

Root cause was a chain of static-token bakes in the MoE subgraph:
- compute_node_dynamic_dims() dropped the dynamic token dim through the routing weight normalization (SUM_ROWS -> CLAMP -> DIV), which fell to the default case.
- the per-expert scale get_rows tiled a static n_tokens batch and the gather froze the token dim.
- the ffn_moe_weighted view reshape used a constant (static) target shape.
- process_view_input_new() re-resolved an already-resolved view because its "already matches" guard only accepted all-static shapes, re-flattening the now dynamic expert plane (the n_expert_used*n_embd reshape conflict).

Fixes:
- ggml-decoder.cpp: track the dynamic dim through SUM_ROWS/DIV/CLAMP.
- get_rows.cpp: for the statically-tiled MoE scale gather, collapse the redundant data batch to 1 and broadcast it to the dynamic indices batch (a static->dynamic Broadcast cannot expand).
- view.cpp: build the ffn_moe_weighted view reshape target dynamically (the token axis is permuted, so pull it from the source via ShapeOf+Gather).
- utils.cpp: treat dynamic-vs-dynamic axes as matching in the view-input reuse guard.

Result: all 60 RoPE concats are dynamic (was 2/60). CPU output unchanged ("Paris is the capital of" -> "France"); the un-fragmented MoE graph now runs prefill AND multi-token decode byte-identical to the production path. GET_ROWS test-backend-ops 27 OK / 0 numeric-fail (was 25/0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
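The first root cause (the dynamic token dim being dropped by ops that fall to a default case) can be illustrated with a minimal, self-contained sketch. The `Op` enum, `DynDim` alias, and the `propagate_before` / `propagate_after` / `run_chain` functions are hypothetical names invented for this illustration; they are not the real compute_node_dynamic_dims() code, and the sketch assumes none of these ops reduces or reorders the token axis.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical op set. The names echo the ggml ops mentioned in the commit
// message, but this enum is illustrative, not the backend's real type.
enum class Op { MulMat, Add, SumRows, Clamp, Div, Other };

// Index of the dynamic (token) axis, or std::nullopt for a fully static shape.
using DynDim = std::optional<int>;

// "Before" behaviour: ops without an explicit case hit the default branch
// and the dynamic axis is lost, so everything downstream looks static.
DynDim propagate_before(Op op, DynDim in) {
    switch (op) {
        case Op::MulMat:
        case Op::Add:
            return in;
        default:
            return std::nullopt;
    }
}

// "After" behaviour: the routing-weight normalization ops pass the dynamic
// axis through (assumption: they do not reduce over the token axis).
DynDim propagate_after(Op op, DynDim in) {
    switch (op) {
        case Op::MulMat:
        case Op::Add:
        case Op::SumRows:
        case Op::Clamp:
        case Op::Div:
            return in;
        default:
            return std::nullopt;
    }
}

// Push a dynamic-axis index through a linear chain of ops.
template <typename Step>
DynDim run_chain(const std::vector<Op>& chain, DynDim start, Step step) {
    DynDim d = start;
    for (Op op : chain) {
        d = step(op, d);
    }
    return d;
}
```

Running the SUM_ROWS -> CLAMP -> DIV chain through `propagate_before` yields no dynamic axis, while `propagate_after` keeps the original axis index, which is the effect the ggml-decoder.cpp fix describes.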

…ubmodel

The per-node "force to CPU on GPU" gates for the MoE routing/expert ops were added to work around GPU-plugin issues, but they fragment the GPU graph into ~30 submodels with cross-boundary tensor copies. That fragmentation corrupts the layer-5 (first global-attention layer) argsort/topk indices copied back to ggml-CPU, aborting gemma4 on GPU outright (ggml-cpu GET_ROWS index-out-of-bounds).

Gate every such MoE "force to CPU on GPU" check behind gpu_full_moe_enabled() (env GGML_OPENVINO_GPU_FULL_MOE). When set, the whole MoE (routing gather/softmax/argsort/normalization and the expert matmuls) stays on the OpenVINO device and the model compiles as a single submodel, so the fragmentation copies disappear. Combined with the dynamic-token-dim fix, the un-fragmented graph is numerically correct on the OpenVINO CPU device.

Default behavior (flag unset) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
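To see why per-node CPU gates multiply submodels, here is a rough sketch that counts contiguous same-device runs in a node list. It is a simplification under stated assumptions: the real scheduler splits on graph dependencies rather than a flat node order, and `Device`, `count_segments`, and `count_boundaries` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical placement tag for one graph node.
enum class Device { OpenVINO, CpuFallback };

// Count contiguous runs of nodes placed on the same device. In this
// simplified model each run corresponds to one submodel.
std::size_t count_segments(const std::vector<Device>& placement) {
    if (placement.empty()) {
        return 0;
    }
    std::size_t segments = 1;
    for (std::size_t i = 1; i < placement.size(); ++i) {
        if (placement[i] != placement[i - 1]) {
            ++segments;
        }
    }
    return segments;
}

// Each boundary between two runs implies at least one cross-device copy.
std::size_t count_boundaries(const std::vector<Device>& placement) {
    const std::size_t segments = count_segments(placement);
    return segments == 0 ? 0 : segments - 1;
}
```

In this model, a single forced-to-CPU op in the middle of a device-resident layer turns one segment into three, so a handful of gated ops per layer is enough to produce dozens of submodels across a deep model.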

Two coupled changes that make gemma4 26B MoE run correctly on the GPU device with no manual flag.

1. Auto-detect MoE and stop fragmenting the GPU graph. The per-node "force this MoE op to CPU on GPU" gates fragment the graph into dozens of submodels with cross-boundary copies (which mis-handle e.g. the layer-5 argsort indices and crash). GGML_OPENVINO_GPU_FULL_MOE kept the whole MoE on one OV submodel but had to be set by hand. Now ggml_openvino_gpu_full_moe_enabled() auto-enables it when running a MoE model on GPU: a GGML_OP_MUL_MAT_ID op (the expert-routed matmul, the defining op of a MoE model) latches a process-global flag from supports_op() at op-placement time. The scheduler queries placement before the expert weights are streamed in and makes several placement passes, so the first pass that sees MUL_MAT_ID sets the flag and later passes converge on the full-MoE layout. The GGML_OPENVINO_GPU_FULL_MOE env var still overrides (non-zero forces on, "0" forces off as an escape hatch). CPU/NPU behavior is unchanged: the gates are GPU-guarded, and the auto path only fires on GPU.

2. Dodge the GPU rms_fusion bug on that path. OpenVINO's rms_fusion folds Power(x, 2) -> ... into the internal RMS op; on the GPU plugin that fused RMS primitive's dynamic multi-token kernel writes only token 0 (tokens 1..N read back as 0). For gemma4 this collapsed the per-layer router RMSNorm (~7x summed over the prefill tokens), flattening the router softmax and flipping the top-8 expert selection, so the GPU output drifted ("France" -> " only"). On the GPU full-MoE path only, compute the square as Multiply(x, x): algebraically identical, but it does not match the fusion pattern, so the GPU runs the unfused primitives and writes every token. Every other configuration keeps the fused fast path (CPU, NPU, and non-MoE GPU models such as Llama-3.2-1B).

Verified: gemma4 26B MoE on GPU with NO flag now matches CPU byte-for-byte on prefill and multi-token decode; the GGML_OPENVINO_GPU_FULL_MOE=0 escape hatch restores the old (fragmented) path; gemma4 CPU output unchanged; dense Llama-3.2-1B on GPU still correct with rms_fusion active.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
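The latch-plus-override behaviour described in item 1 might look roughly like the sketch below. Everything here is an assumption for illustration: `g_saw_mul_mat_id`, `note_mul_mat_id_seen`, and `full_moe_enabled` are hypothetical names, the env value is passed in as a parameter so the logic can be exercised without touching the process environment, and treating any value other than "0" as "on" is one possible reading of "non-zero forces on".

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical process-global latch. It is set once and never cleared,
// so later placement passes all see the same answer.
static std::atomic<bool> g_saw_mul_mat_id{false};

// Would be called from the op-support query when an expert-routed matmul
// is encountered (assumption about where the latch is set).
void note_mul_mat_id_seen() {
    g_saw_mul_mat_id.store(true);
}

// env_value: raw value of GGML_OPENVINO_GPU_FULL_MOE, or nullptr if unset.
// on_gpu:    whether the active OpenVINO device is the GPU.
bool full_moe_enabled(const char* env_value, bool on_gpu) {
    if (!on_gpu) {
        return false;  // simplification: the gates only matter on GPU
    }
    if (env_value != nullptr && env_value[0] != '\0') {
        // Explicit override: "0" forces off, anything else forces on.
        return std::strcmp(env_value, "0") != 0;
    }
    // No override: follow the auto-detected latch.
    return g_saw_mul_mat_id.load();
}

// Thin wrapper that reads the real environment variable.
bool full_moe_enabled_from_env(bool on_gpu) {
    return full_moe_enabled(std::getenv("GGML_OPENVINO_GPU_FULL_MOE"), on_gpu);
}
```

A one-way latch fits the multi-pass placement described above: the first pass may run before any MUL_MAT_ID is seen, but once the flag flips it stays set, so subsequent passes agree on the full-MoE layout.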

…_1/q5_1 GET_ROWS)

Two CI test-backend-ops failures introduced by the gemma4 MoE work:

1. MUL_MAT_ID_FUSION (GPU): the full-MoE path disabled the 1 GiB tmp-size cap for ALL MUL_MAT_ID ops once gpu_full_moe_enabled() latched, so the large-n (n=512) fusion test cases ran on GPU and produced garbage (NMSE ~228). Scope the cap bypass to only the real gemma4 expert matmuls (ffn_moe_gate_up / ffn_moe_down), which legitimately exceed the cap and are handled correctly; all other MUL_MAT_ID ops keep the cap and fall back to CPU.

2. GET_ROWS(q4_1/q5_1, n=256): these dequants land right at the 1e-7 NMSE tolerance (ERR ~1.1-1.4e-7) and flakily fail. Exclude them alongside the existing q4_K/q5_K n=256 exclusions.

Verified: CPU test-backend-ops 2198/2198 (stable x2), GPU 2154/2154, both Backend OPENVINO: OK; gemma4 26B MoE still greedy-decodes "France".
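For context on the tolerance in item 2, the following is a minimal sketch of a normalized mean squared error check, assuming the usual definition NMSE = sum((ref - out)^2) / sum(ref^2) and a pass condition of NMSE <= 1e-7. The function names `nmse` and `passes_nmse` are illustrative, and the exact comparison used by test-backend-ops is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `out` against the reference `ref`.
// Assumes both vectors have the same length and `ref` is not all zeros
// (otherwise the denominator is zero).
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    double err = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(out[i]);
        err += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    return err / norm;
}

// Pass/fail against a tolerance; 1e-7 matches the figure quoted above.
bool passes_nmse(double value, double max_nmse = 1e-7) {
    return value <= max_nmse;
}
```

With a threshold this tight, an error of about 1.1e-7 to 1.4e-7 sits just above the line, which is consistent with the flaky failures the commit message reports for the q4_1/q5_1 dequant cases.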

zhaixuejun1993 and others added 30 commits June 9, 2026 12:20

Add interface is_model_splitted() to check whether the c-graph is split

4bc1eb1

Infer and propagate dynamic-dimension indices for all tensors in the GGML graph in api compute_model_outputs()

c49ec28

Only do this for fallback sub graph

76eb69e

Move dynamic dims compute in graph mismatch

7e6caef

ggml-openvino: fix tensor data handling for PERMUTE/VIEW ops in split models

d306b0b

ggml-openvino: add comments

01088c2

ggml-openvino: override VIEW op_case to 0 for split model inputs

126d758

openvino backend: Handle unsupported VIEW shape-mismatch in OpenVINO backend

32f9cb7

Enable additional mul_mat tests and add tensor data saving function (ravi9#81)

f812f78

ggml-openvino: fix CONT/TRANSPOSE mapping and improve dynamic-dimension handling

865f121

OpenVINO: add NORM/TANH support and rework SOFT_MAX translation

ca3a176

ggml-openvino: extend VIEW handling

a73a6dc

Enable -fa off (ravi9#118)

bfa4c53

Enable --context-shift

9c922b1

Fix llm param compute error for normal softmax not the softmax in attention

59f0e3c

OpenVINO backend: fix error for attention size compute in llm param

c8e9ce4

use tensor->extra in infer_request i/o

9f355ed

OpenVINO backend: refactor the compute_llm_params() func, add get_attention_pattern_case for easy extension

dc5ed75

OpenVINO backend: clean unused code

e2ce59c

1to1 match op update (ravi9#146)

130ef39

* added translate_1to1_match_1_input function and updated gelu and tanh translations
* Remove unused translation function calls

Co-authored-by: Mustafa Cavus <mustafacavus@intel.com>

initial gemma4 support

13ddbf3

removed hardcoded names for kv cache slicing

7597773

OpenVINO backend: Add new attention pattern for llm parameters compute

a1baa1a

flash attn Q shape static conversion

dad8acd

Remove slice in permute translation when n_seq is 1

760e86d

return optional in extract_layer_from_name

5a39967

OpenVINO backend: refactor VIEW related operation (ravi9#148)

d289bbd

* OpenVINO backend: refactor VIEW related operation
* Enable VIEW handling in following ops
* OpenVINO backend does not support GGML_OP_NORM & GGML_OP_L2_NORM with VIEW input (accuracy issue from OpenVINO)

OpenVINO backend: Add ops l2_norm & pad

e0caf43

OpenVINO backend does not support CPY with non-contiguous data or mismatched types

8c83092

add op SSM_CONV GATED_DELTA_NET

a08546f

ravi9 and others added 18 commits June 9, 2026 10:25

Merge pull request ravi9#208 from mostafafaheem/envvar_cleanup

8ea91dd

OpenVINO backend: Enhance environment variable handling

ggml-openvino: fix -Werror=cast-qual in extract_q5_1_data

341b615

Merge pull request ravi9#209 from cavusmustafa/op_translations_q51

4c878bd

OpenVINO backend: add REPEAT translator, Q5_1 weights, and GLU view-input fix

Update openvino.Dockerfile

971816c

Use BuildKit cache mounts for faster Docker rebuilds. Use apt instead of dpkg, remove unused .ddeb downloads, add DLLAMA_BUILD_TESTS=OFF.

ggml-openvino: centralize env var access via *getenv_str/getenv_int helpers

835121d

Replace getenv and legacy flags with _str and _int helpers. Minor cleanup, doc updates.

OpenVINO backend: Enable GGML_OP_ADD_ID

3365e31

Merge pull request ravi9#210 from zhaixuejun1993/xuejun/add_op_add_id

906a48d

OpenVINO backend: Enable GGML_OP_ADD_ID

Update openvino backend clang-format

dd5c58d

clang-format

a9045e0

Update OPENVINO.md (ravi9#211)

fb924cb

Merge branch 'master' into dev_backend_openvino

ba6c06d

Merge branch 'master' into dev_backend_openvino

1d3035b

OpenVINO backend: fix accuracy issue for op CONCAT with i64 precision

90ae917

Merge pull request ravi9#214 from zhaixuejun1993/xuejun/fix-error-op-concat

383d163

OpenVINO backend: fix accuracy issue for op CONCAT with i64 precision

Remove strict concurrency for gpu-openvino-low-perf

00e80a9

Update openvino CI keynames; add ccache-clear

65d4041

Apply suggestions from code review

ce52f0a

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

Fix formatting

3481530

cavusmustafa force-pushed the gemma4_moe_pr_v3 branch from f50a282 to ef866cb Compare June 17, 2026 21:06

cavusmustafa and others added 5 commits June 17, 2026 14:38

ggml-openvino: add Gemma-4 26B MoE support

b7b94ec

ggml-openvino: tie GET_ROWS batched-gather indices to data batch dim

f349771

cavusmustafa force-pushed the gemma4_moe_pr_v3 branch from ef866cb to 5f01724 Compare June 17, 2026 21:55

cavusmustafa marked this pull request as ready for review June 22, 2026 21:20

cavusmustafa requested a review from wine99 as a code owner June 22, 2026 21:20

cavusmustafa marked this pull request as draft June 22, 2026 23:47

wine99 force-pushed the dev_backend_openvino branch from 320e3c4 to a4a7e36 Compare June 24, 2026 05:15

cavusmustafa wants to merge 135 commits into ravi9:dev_backend_openvino from cavusmustafa:gemma4_moe_pr_v3

8 participants
