Gemma 4 31B: chat template, inv_freq dedup, CI hardening by mergennachin · Pull Request #19614 · pytorch/executorch

mergennachin · 2026-05-15T13:53:34Z

Chat template: Gemma 4 31B-IT is instruction-tuned and produces degenerate output without chat-template wrapping. Auto-wrap --prompt with the IT template in both inference.py and the C++ runner; --raw-prompt / --raw_prompt skips wrapping for pre-formatted input.
inv_freq dedup: Extract _compute_inv_freq() on Gemma4Attention so init and materialize_runtime_buffers share a single implementation instead of duplicating the RoPE frequency computation.
CI hardening: Check for "Paris" in the export inference sanity check instead of just checking the script doesn't crash. Restore gemma4_31b unit tests in the CUDA build job.
Docs: Update README.md and model.md to reflect chat template and inv_freq changes.
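
The CI hardening item amounts to checking the generated text rather than only the exit code. A minimal Python sketch of that idea, assuming a hypothetical helper `output_looks_sane`; the PR only states that the check looks for "Paris", and the real check lives in the CI shell scripts and may differ:

```python
def output_looks_sane(generated_text: str, expected: str = "Paris") -> bool:
    """Return True when the generated text mentions the expected answer.

    Hypothetical helper: a crash-only check passes on degenerate output,
    while a substring check fails on it.
    """
    return expected in generated_text


# Illustrative strings, not real model output.
good = "The capital of France is Paris."
degenerate = "the the the the the the"

print(output_looks_sane(good))        # True
print(output_looks_sane(degenerate))  # False
```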

pytorch-bot · 2026-05-15T13:53:37Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19614

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-05-15T13:54:26Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Applies the Gemma 4 IT chat template to user prompts by default in both the Python inference.py and C++ runner, with an opt-out flag for pre-formatted input, to avoid degenerate output from the instruction-tuned model.

Changes:

Add apply_chat_template helper and --raw-prompt flag in inference.py.
Add --raw_prompt flag and template-wrapping logic in main.cpp (BOS prepended separately).
Document the auto-wrapping and opt-out flags in the README.
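
The default-wrap and opt-out behavior listed above could look roughly like the following. This is a sketch, not the code from inference.py: `build_prompt` is a hypothetical helper, and the template string is the variant without a literal <bos> as quoted elsewhere in this PR.

```python
import argparse

# Template pieces as quoted in this PR (variant without a literal <bos>);
# treat the exact strings as an assumption.
_PREFIX = "<|turn>user\n"
_SUFFIX = "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"


def apply_chat_template(prompt: str) -> str:
    """Wrap a plain user prompt in the IT chat template."""
    return _PREFIX + prompt + _SUFFIX


def build_prompt(argv: list[str]) -> str:
    """Parse CLI args and return the text that would be tokenized."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", required=True)
    # Opt-out for input that already contains the template markers.
    parser.add_argument("--raw-prompt", action="store_true")
    args = parser.parse_args(argv)
    return args.prompt if args.raw_prompt else apply_chat_template(args.prompt)


print(repr(build_prompt(["--prompt", "Hi"])))
print(repr(build_prompt(["--prompt", "Hi", "--raw-prompt"])))
```

Wrapping by default keeps the common case (a plain question) working, while the flag leaves pre-formatted input untouched.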

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
examples/models/gemma4_31b/inference.py	Defines chat template, wraps prompt by default, adds `--raw-prompt` flag
examples/models/gemma4_31b/main.cpp	Adds `--raw_prompt` flag and wraps prompt text with the IT chat template
examples/models/gemma4_31b/README.md	Documents auto-wrapping behavior and skip flags for both runners

mergennachin · 2026-05-15T14:05:57Z

@claude Review this PR

claude · 2026-05-15T14:06:42Z

Claude finished @mergennachin's task in 1m 31s

Review of PR #19614

Gather context and read changed files
Review code changes
Post review feedback

Overall this is a clean, well-scoped change. Two issues worth addressing before merge:

1. `str.format()` will crash on prompts with curly braces — `inference.py:78`

apply_chat_template uses _CHAT_TEMPLATE.format(prompt=prompt). If the user prompt contains curly braces (e.g., "explain {dict} in Python"), this will raise a KeyError/ValueError.

# Current (broken on curly-brace input):
_CHAT_TEMPLATE = (
    "<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
)

def apply_chat_template(prompt: str) -> str:
    return _CHAT_TEMPLATE.format(prompt=prompt)

A simple fix is to use plain string concatenation instead:

def apply_chat_template(prompt: str) -> str:
    return "<bos><|turn>user\n" + prompt + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
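
The failure mode is worth verifying before relying on it. The sketch below uses a shortened template for brevity. In standard Python, `str.format` parses only the template string for replacement fields, so braces inside the substituted prompt pass through unchanged; an exception is expected when the template itself contains an unknown or unmatched field. Concatenation avoids format parsing in both cases.

```python
TEMPLATE = "<|turn>user\n{prompt}<turn|>\n"

# Braces inside the substituted value are not re-parsed by str.format.
wrapped = TEMPLATE.format(prompt="explain {dict} in Python")
print(repr(wrapped))

# Braces inside the template itself are parsed and can raise.
BRACED_TEMPLATE = "{system}\n{prompt}"  # hypothetical template with an extra field
try:
    BRACED_TEMPLATE.format(prompt="hi")
    raised = False
except KeyError:
    raised = True
print(raised)


def apply_chat_template(prompt: str) -> str:
    """Concatenation never interprets braces, in the template or the prompt."""
    return "<|turn>user\n" + prompt + "<turn|>\n"


print(repr(apply_chat_template("explain {dict} in Python")))
```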

2. BOS handling inconsistency between Python and C++ — `inference.py:72` vs `main.cpp:243`

The Python template includes the literal string <bos>:

_CHAT_TEMPLATE = (
    "<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
)

The C++ template does not:

prompt_text = "<|turn>user\n" + prompt_text +
    "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>";

Both runners separately prepend BOS as a token ID after encoding (Python at inference.py:104, C++ at main.cpp:255-256). Whether this causes a double-BOS in Python depends on how the tokenizers library handles the <bos> string during encode() — if it recognizes <bos> as a special token and emits token id 2, the guard in generate() will skip the extra prepend and things work out; if it tokenizes <bos> as literal subword pieces, the prompt gets garbled.

Either way, the two runners should behave identically. I'd suggest making the Python template match the C++ one (omit <bos> from the template string) since both runners already handle BOS prepending at the token-ID level:

def apply_chat_template(prompt: str) -> str:
    return "<|turn>user\n" + prompt + "<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"
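
The token-level guard mentioned above can be sketched as follows. `ensure_single_bos` is a hypothetical name, the BOS id of 2 is taken from the review text, and the other token ids are made up for illustration.

```python
BOS_ID = 2  # token id mentioned in the review; treated as an assumption here


def ensure_single_bos(token_ids: list[int], bos_id: int = BOS_ID) -> list[int]:
    """Prepend BOS only when the sequence does not already start with it."""
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id] + list(token_ids)


# Made-up token ids standing in for an encoded prompt.
print(ensure_single_bos([105, 2364, 107]))     # [2, 105, 2364, 107]
print(ensure_single_bos([2, 105, 2364, 107]))  # [2, 105, 2364, 107]
```

With a guard like this, a template string without <bos> and a template whose <bos> encodes to the BOS id both end up with exactly one BOS token. The garbled case arises only if <bos> is split into ordinary subword pieces.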

Minor observations (no action needed)

Flag naming: --raw-prompt (Python) vs --raw_prompt (C++) — this is fine and idiomatic for each tool (argparse vs gflags), and the README correctly documents both.
Prompt display (inference.py:229): prints args.prompt (unwrapped), which is good UX — the user sees what they typed. If debugging template issues, users can add --raw-prompt and pass the full template manually.

Looks good overall — nice improvement for the IT model's usability. The two issues above are the only things I'd want addressed before merge.

Gemma 4 31B-IT is instruction-tuned and produces degenerate output without the chat template wrapping. Auto-wrap --prompt with the IT template (<bos><|turn>user\n{prompt}<turn|>\n<|turn>model\n<|channel>thought\n<channel|>) by default; --raw-prompt / --raw_prompt skips wrapping for pre-formatted input.

### Summary

Previously, `materialize_runtime_buffers` in model.py zeroed out all meta buffers, including each layer's inv_freq (RoPE frequencies). The follow-up `attn.inv_freq.to(device)` was a no-op on the already-zeroed tensors, so RoPE produced cos=1, sin=0 for every position. The model therefore had no positional information, which introduced the period-N echo cycle pattern. This PR fixes the issue by recomputing inv_freq per layer with real values (using the layer's head_dim, partial_rotary, rope_theta, and is_sliding flag) in materialize_runtime_buffers.

### Test plan

Add an e2e CI job for the gemma4-31b model and check its output.
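
The effect described in the Summary can be checked numerically. The sketch below uses plain Python instead of torch and a hypothetical helper `rope_cos_sin`. It shows that all-zero frequencies give cos=1 and sin=0 at every position, so the rotation is the identity and no positional signal reaches attention.

```python
import math


def rope_cos_sin(position: int, inv_freq: list[float]) -> tuple[list[float], list[float]]:
    """Return per-frequency cos and sin for one position."""
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


zeroed = [0.0, 0.0]  # what the buggy buffer materialization left behind
real = [1.0, 0.1]    # illustrative non-zero frequencies

for pos in (0, 5, 100):
    cos_z, sin_z = rope_cos_sin(pos, zeroed)
    print(pos, cos_z, sin_z)  # cos is all 1.0 and sin all 0.0 at every position

# With real frequencies, different positions give different rotations.
print(rope_cos_sin(5, real) != rope_cos_sin(100, real))  # True
```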

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

examples/models/gemma4_31b/model.py:700

The PR title and description state the change is about applying the Gemma 4 IT chat template in inference.py and the C++ runner. However, the diff also includes several unrelated changes that are not mentioned in the description:

model.py: replaces attn.inv_freq = attn.inv_freq.to(device) with a full re-computation of inv_freq (including partial-rotary / NoPE handling) in materialize_runtime_buffers.
inference.py: adds a new --bf16 input path that calls Gemma4_31B.from_hf_checkpoint.
.github/workflows/cuda.yml: removes the pip install gguf + pytest examples/models/gemma4_31b/quant/tests/ examples/models/gemma4_31b/tests/ step, and adds Gemma 4 31B-IT to the export/e2e matrices.
.ci/scripts/export_model_artifact.sh and .ci/scripts/test_model_e2e.sh: add a new Gemma 4 31B pipeline.

Please either update the PR description to cover these changes, or split them into separate PRs so each change can be reviewed against a description that matches its scope.

        if attn.is_sliding:
            rotary_dim = attn.head_dim
        else:
            rotary_dim = int(attn.head_dim * attn.partial_rotary)
        rope_angles = rotary_dim // 2
        inv_freq_rotated = 1.0 / (
            attn.rope_theta
            ** (
                torch.arange(0, rotary_dim, 2, device=device, dtype=torch.float32)
                / attn.head_dim
            )
        )
        nope_angles = attn.head_dim // 2 - rope_angles
        if nope_angles > 0:
            inv_freq = torch.cat(
                [
                    inv_freq_rotated,
                    torch.zeros(nope_angles, device=device, dtype=torch.float32),
                ]
            )
        else:
            inv_freq = inv_freq_rotated
        attn.register_buffer("inv_freq", inv_freq, persistent=False)
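
For a hand-checkable view of the snippet above, here is a pure-Python mirror of the same arithmetic, with lists standing in for torch tensors. The function name `compute_inv_freq` and the small example values are illustrative assumptions, not the model's real configuration.

```python
def compute_inv_freq(head_dim: int, partial_rotary: float,
                     rope_theta: float, is_sliding: bool) -> list[float]:
    """Pure-Python mirror of the inv_freq computation quoted above."""
    # Sliding layers rotate the full head; other layers rotate only a fraction.
    rotary_dim = head_dim if is_sliding else int(head_dim * partial_rotary)
    rotated = [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, rotary_dim, 2)]
    # Remaining angle slots carry no rotation (NoPE), so they are padded with zeros.
    nope_angles = head_dim // 2 - rotary_dim // 2
    return rotated + [0.0] * nope_angles


# Toy sizes chosen so the values are easy to verify by hand.
full = compute_inv_freq(head_dim=8, partial_rotary=0.5, rope_theta=10000.0, is_sliding=True)
partial = compute_inv_freq(head_dim=8, partial_rotary=0.5, rope_theta=10000.0, is_sliding=False)
print([round(x, 6) for x in full])     # [1.0, 0.1, 0.01, 0.001]
print([round(x, 6) for x in partial])  # [1.0, 0.1, 0.0, 0.0]
```

Keeping this logic in one helper called from both construction and buffer materialization means the two code paths cannot drift apart.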

…t tests
- Extract _compute_inv_freq() on Gemma4Attention so __init__ and materialize_runtime_buffers share a single implementation.
- Check for "Paris" in the export CI inference sanity check instead of just checking the script doesn't crash.
- Restore gemma4_31b quant/pipeline unit tests in the CUDA build job.
- Update model.md to reflect that inv_freq is recomputed, not moved.

Copilot AI review requested due to automatic review settings May 15, 2026 13:53

mergennachin requested a review from lucylq as a code owner May 15, 2026 13:53

meta-cla Bot added the CLA Signed label May 15, 2026

mergennachin requested a review from Gasoonjia May 15, 2026 13:53

Copilot started reviewing on behalf of mergennachin May 15, 2026 13:54

Copilot AI reviewed May 15, 2026

Gasoonjia approved these changes May 15, 2026

mergennachin force-pushed the gemma4-chat-template branch from 78ee61f to 5d5c26e May 15, 2026 19:05

Copilot AI review requested due to automatic review settings May 18, 2026 14:13

Copilot started reviewing on behalf of mergennachin May 18, 2026 14:14

Copilot AI reviewed May 18, 2026

Comment thread examples/models/gemma4_31b/model.py Outdated

Comment thread .github/workflows/cuda.yml

mergennachin changed the title ~~Apply Gemma 4 IT chat template in inference.py and C++ runner~~ Gemma 4 31B: chat template, inv_freq dedup, CI hardening May 18, 2026

mergennachin merged commit 3ceb89c into main May 18, 2026
360 of 367 checks passed

mergennachin deleted the gemma4-chat-template branch May 18, 2026 15:23

mergennachin merged 3 commits into main from gemma4-chat-template

3 participants
