fix(spu): kernel restore + Jina V4 trained-context cap (PR-1.H) #297
Conversation
PR-0.5.E (#278) deleted `crates/weaver-inference/kernels/transformer.cu` when it consolidated `weaver-inference` into `weaver-spu` but never ported the file across. Result: any `--features cuda` build (and transitively `--features inference`) fails to link with `undefined symbol: launch_allreduce_2gpu` (and ~5 other host-side launchers declared in `crates/weaver-spu/src/core/gpu/forward.rs`). This was masked on main because no CI runs `--features inference` and the production daemon binary at `/opt/weaver/bin/weaver-infer` was built before PR-0.5.E.

PR #296 added a `build.rs` fallback that *skipped* the cuda compile when the file was missing — that kept the build script from panicking, but didn't help the linker.

Real fix: restore the kernel file from `git show 79bb649^:crates/weaver-inference/kernels/transformer.cu` into `crates/weaver-spu/kernels/transformer.cu` (1266 lines, unchanged). The cudarc decoder path links cleanly again. Also removed the now-dead `build.rs` fallback so the script reverts to its pre-PR-0.5.E shape.

The cudarc path is still slated for full retirement in Phase 3 cleanup per Cargo.toml's legacy-backends comment. Until then, keeping it functional avoids cascading breakage in `serve.rs` etc.

Validates: `cargo build -p weaver-interface --features inference,embedder-rust --release` — clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke test for the in-process embedder backend (task #117) caught a real identity-drift bug: `JinaV4Embedder::max_seq_len` was reading `config.json::max_position_embeddings` directly, which on the canonical 853c867 snapshot is **128_000** (the Qwen2.5-VL architecture's RoPE-extended max). But Jina V4 was *trained* on 32K-token sequences; using more is OOD and produces degraded embeddings. The Python embedder service correctly reports 32768.

First smoke run failed exactly here:

    Refuse to start: embedder identity drifted from pin at /opt/weaver/state/embedder.pin.json.
      pinned: jinaai/jina-embeddings-v4 (dim=2048, rev=853c867...)
      live:   jinaai/jina-embeddings-v4 (dim=2048, rev=853c867...)
      delta:  max_seq_length: pin=32768 live=128000
    Error: embedder identity mismatch

Fix:
- New `pub const JINA_V4_TRAINED_CONTEXT: usize = 32_768` in `encoder/jina_v4.rs`, documenting why this cap exists.
- `from_snapshot()` now sets `max_seq_len = JINA_V4_TRAINED_CONTEXT.min(raw.max_position_embeddings)`. The `min()` keeps the cap honest if a future snapshot ships `max_position_embeddings < 32768` (unlikely but cheap insurance).

Outcomes:
- `EmbedderInfo::max_seq_length` matches the Python service report. Cohort-pin guard stays consistent across backends — no operator pin reset required at the Phase 1 cutover.
- `encode_text` boundary check still fires correctly, now at the operator-meaningful limit (32768) rather than the architecture max (128000). Catches OOD inputs before encoding instead of after.

Validated end-to-end via the smoke test in `docs/infrastructure/embedder-rust-cutover-evidence/smoke-2026-05-08.md` (separate commit).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
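For readers following along without the diff, a minimal sketch of the cap described in this commit. Only `JINA_V4_TRAINED_CONTEXT` and the `min()` expression come from the commit message; the `RawConfig` and `JinaV4Embedder` shapes here are simplified stand-ins, not the actual structs in `encoder/jina_v4.rs`.

```rust
/// Jina V4 was trained on 32K-token sequences. Longer inputs are
/// out-of-distribution even though the Qwen2.5-VL backbone's RoPE-extended
/// `max_position_embeddings` in config.json is 128_000.
pub const JINA_V4_TRAINED_CONTEXT: usize = 32_768;

/// Illustrative stand-in for the parsed config.json fields.
pub struct RawConfig {
    pub max_position_embeddings: usize,
}

/// Illustrative stand-in for the embedder struct.
pub struct JinaV4Embedder {
    pub max_seq_len: usize,
}

impl JinaV4Embedder {
    pub fn from_snapshot(raw: &RawConfig) -> Self {
        // min() keeps the cap honest if a future snapshot ever ships a smaller
        // architectural limit than the trained context.
        let max_seq_len = JINA_V4_TRAINED_CONTEXT.min(raw.max_position_embeddings);
        Self { max_seq_len }
    }
}
```

With the 853c867 snapshot this yields `min(32_768, 128_000) = 32_768`, matching the value the Python service already reports.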
Captures the first end-to-end verification that `weaver serve`
constructs an in-process `EmbedderClient` against a real Jina V4
snapshot, the cohort-pin guard fires correctly through the new path,
and the daemon listens.
Documents:
- The pre-flight fixes the run surfaced (kernel restore +
`JINA_V4_TRAINED_CONTEXT` cap, both committed separately on this
branch).
- What the smoke test confirms (boot-time wiring through the new
backend) vs. what it doesn't (actual embed traffic, long-running
stability — those are task #118 territory).
- Operator-facing migration note: with the `JINA_V4_TRAINED_CONTEXT`
cap, the existing pin file's `max_seq_length=32768` keeps matching
after the cutover. No pin reset needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
📝 Walkthrough
This PR adds a comprehensive transformer CUDA kernel compilation unit and extern "C" launchers, makes build.rs always compile the CUDA kernels when the `cuda` feature is enabled, and caps Jina V4 at its trained context length.
Changes: Transformer CUDA Kernels and Jina V4 Context Limit
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/weaver-spu/kernels/transformer.cu`:
- Around line 511-512: The FlashAttention fast path in forward.rs must be
guarded for head sizes that fit the kernel's static limits: update both call
sites of launch_flash_attention in forward.rs (the branches around
use_flash_attention at lines where launch_flash_attention is invoked) to only
select the FlashAttention path when head_dim <= 256 (so FA2_MAX_HALF_DIM=128
covers O_acc) and when computed smem_bytes does not exceed the device shared
memory (avoid head_dim > 128 causing smem >48KiB); mirror the check used in
forward_gemma4.rs (if layer.is_sliding && head_dim <= 256) by adding an explicit
head_dim bound (and optional smem_bytes sanity) before invoking
launch_flash_attention to prevent out-of-bounds accumulator indexing and failed
kernel launches.
In `@crates/weaver-spu/src/encoder/jina_v4.rs`:
- Around line 264-266: The overflow error message still refers to the raw config
max_position_embeddings even though self.max_seq_len is now capped using
JINA_V4_TRAINED_CONTEXT; update the rejection error text emitted where the code
checks sequence length (the path that compares input length against
self.max_seq_len) to reference "max_seq_length" or "trained context" (or similar
wording) instead of "max_position_embeddings" so it matches the exposed value in
EmbedderInfo and the capped limit computed from JINA_V4_TRAINED_CONTEXT and
raw.max_position_embeddings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 937096ed-5918-40ef-a702-edc188b2bddc
📒 Files selected for processing (4)
- crates/weaver-spu/build.rs
- crates/weaver-spu/kernels/transformer.cu
- crates/weaver-spu/src/encoder/jina_v4.rs
- docs/infrastructure/embedder-rust-cutover-evidence/smoke-2026-05-08.md
`#define FA2_MAX_HALF_DIM 128  // head_dim / 2, supports up to head_dim=256`
🧩 Analysis chain
🏁 Scripts executed: a series of `rg` and `sed` probes over `crates/weaver-spu` (repository toddwbucy/WeaverTools) locating the `launch_flash_attention` call sites and `use_flash_attention` branches in `src/core/gpu/forward.rs`, the `head_dim <= 256` guard in `forward_gemma4.rs`, the `flash_attention` flag plumbing through `main.rs`, `decoder/startup.rs`, and the engine, and the `ModelConfig` definition in the decoder.
Guard the FlashAttention fast path against larger head sizes in forward.rs.
launch_flash_attention in forward.rs (lines 609, 1548) accepts any head_dim without bounds checking. While forward_gemma4.rs properly guards with if layer.is_sliding && head_dim <= 256, the forward.rs path has no such protection. The kernel's O_acc[FA2_MAX_HALF_DIM] accumulator (line 511, FA2_MAX_HALF_DIM=128) only covers head_dim ≤ 256, and smem_bytes exceeds 48 KiB for head_dim > 128. For the unguarded forward.rs paths, this can fail to launch or index past accumulator bounds.
Verification note
`forward_gemma4.rs` at line 327 includes the necessary guard (`head_dim <= 256`), but `forward.rs` at lines 606 and 1546 only checks the `use_flash_attention` flag without validating `head_dim`. The upstream engine initialization passes `use_flash_attention` as a boolean profile flag with no head_dim validation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/weaver-spu/kernels/transformer.cu` around lines 511 - 512, The
FlashAttention fast path in forward.rs must be guarded for head sizes that fit
the kernel's static limits: update both call sites of launch_flash_attention in
forward.rs (the branches around use_flash_attention at lines where
launch_flash_attention is invoked) to only select the FlashAttention path when
head_dim <= 256 (so FA2_MAX_HALF_DIM=128 covers O_acc) and when computed
smem_bytes does not exceed the device shared memory (avoid head_dim > 128
causing smem >48KiB); mirror the check used in forward_gemma4.rs (if
layer.is_sliding && head_dim <= 256) by adding an explicit head_dim bound (and
optional smem_bytes sanity) before invoking launch_flash_attention to prevent
out-of-bounds accumulator indexing and failed kernel launches.
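For reference, a hedged sketch of the guard this finding asks for (the author defers the change in the response below). The helper name and the 48 KiB constant are assumptions made for illustration; only the `head_dim <= 256` bound and the shared-memory check come from the review text.

```rust
// Sketch of a call-site guard for the FlashAttention fast path in forward.rs.

const FA2_MAX_HEAD_DIM: usize = 256; // FA2_MAX_HALF_DIM = 128 covers head_dim / 2
const DEFAULT_SMEM_LIMIT_BYTES: usize = 48 * 1024; // default limit without cudaFuncSetAttribute

fn take_flash_attention_path(
    use_flash_attention: bool,
    head_dim: usize,
    smem_bytes: usize,
) -> bool {
    // Only select FlashAttention when the kernel's static limits hold;
    // otherwise fall back to the regular attention kernels.
    use_flash_attention
        && head_dim <= FA2_MAX_HEAD_DIM
        && smem_bytes <= DEFAULT_SMEM_LIMIT_BYTES
}
```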
… review)

After capping `max_seq_len = JINA_V4_TRAINED_CONTEXT.min(...)`, the "input exceeds Jina V4's max_position_embeddings" error message no longer matches the value being checked — `self.max_seq_len` is the trained-context cap (32768), not the architecture max (128000). Reworded to "trained context (max_seq_length = N)" so operators see the same number `EmbedderInfo::max_seq_length` exposes.

Skipped the FlashAttention head-dim bounds-check finding (forward.rs call sites of launch_flash_attention) — pre-existing in main, not introduced by PR #297. The cudarc decoder path is slated for Phase 3 retirement; adding multi-site head_dim bounds checks belongs in its own focused PR with proper test coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
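A sketch of roughly how the reworded boundary check reads. The helper name `check_input_len`, the `String` error type, and the struct shape are assumptions for illustration; the wording change from `max_position_embeddings` to the trained-context limit is the part taken from the commit.

```rust
/// Illustrative stand-in; only max_seq_len matters for this check.
struct JinaV4Embedder {
    max_seq_len: usize, // min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)
}

impl JinaV4Embedder {
    fn check_input_len(&self, token_len: usize) -> Result<(), String> {
        if token_len > self.max_seq_len {
            // Report the capped trained-context limit, not max_position_embeddings,
            // so operators see the same number EmbedderInfo::max_seq_length exposes.
            return Err(format!(
                "input of {token_len} tokens exceeds Jina V4's trained context (max_seq_length = {})",
                self.max_seq_len
            ));
        }
        Ok(())
    }
}
```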
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/weaver-spu/src/encoder/jina_v4.rs`:
- Around line 264-266: Update the rustdoc for the function max_seq_len() to
state that it returns the effective capped sequence length (the minimum of
JINA_V4_TRAINED_CONTEXT and raw.max_position_embeddings) rather than the raw
architectural ceiling; mention both JINA_V4_TRAINED_CONTEXT and
raw.max_position_embeddings in the doc so readers know the method applies the
cap and returns the capped value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 140817ed-d147-4a81-9ffe-b981189ae883
📒 Files selected for processing (1)
crates/weaver-spu/src/encoder/jina_v4.rs
…capped value (PR #297 review)

The getter's rustdoc said it returned the architectural ceiling read from `config.json::max_position_embeddings`. That was correct when the getter was added in PR #294 but became stale after the cap landed earlier in this PR. Now states explicitly that it returns `min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)` and references both terms so a reader can see the relationship between the cap, the raw config field, and the value `EmbedderInfo` exposes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
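A sketch of what the updated rustdoc might look like on the getter. The `impl` block and the trivial body are assumptions; the documented relationship between the cap, the raw config field, and `EmbedderInfo` follows the commit message.

```rust
impl JinaV4Embedder {
    /// Effective maximum sequence length for this embedder.
    ///
    /// Returns `min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)`, not the
    /// raw architectural ceiling from `config.json::max_position_embeddings`; this
    /// is the value `EmbedderInfo::max_seq_length` exposes and the cohort pin records.
    pub fn max_seq_len(&self) -> usize {
        self.max_seq_len
    }
}
```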
Summary
Three small fixes that together unblock the in-process Rust embedder backend end-to-end. All discovered while completing task #117 — the first real `weaver serve --features inference,embedder-rust` smoke test against a Jina V4 snapshot.
Why each fix is needed
1. Kernel restore
PR-0.5.E (#278) consolidated `weaver-inference` into `weaver-spu` but deleted `crates/weaver-inference/kernels/transformer.cu` without porting it across. `weaver-spu/build.rs` still expected it at `kernels/transformer.cu`, and the Rust FFI declarations at `crates/weaver-spu/src/core/gpu/forward.rs` still reference its symbols (`launch_allreduce_2gpu` etc.).
Result: any `cargo build --features cuda` (or transitively `--features inference`) fails at link with:
```
rust-lld: error: undefined symbol: launch_allreduce_2gpu
```
Masked on main because no CI builds with `--features inference` and the production daemon binary at `/opt/weaver/bin/weaver-infer` was built before PR-0.5.E.
PR #296 added a `build.rs` fallback that skipped the cuda compile when the file was missing — kept the build script from panicking, but didn't help the linker (the Rust FFI declarations still need defined symbols).
This PR restores the file from `git show 79bb649^:crates/weaver-inference/kernels/transformer.cu` and reverts the now-dead fallback.
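For orientation, a hypothetical sketch of the "always compile when the `cuda` feature is on" shape that the fallback removal returns to, assuming the `cc` crate's nvcc support. The arch flags, output name, and exact feature plumbing are illustrative, not the project's actual build script.

```rust
// build.rs (sketch)
fn main() {
    // Cargo exposes enabled features to build scripts as CARGO_FEATURE_* env vars.
    if std::env::var_os("CARGO_FEATURE_CUDA").is_some() {
        println!("cargo:rerun-if-changed=kernels/transformer.cu");
        cc::Build::new()
            .cuda(true)                          // drive nvcc rather than the host C compiler
            .file("kernels/transformer.cu")      // the file restored by this PR
            .flag("-gencode")
            .flag("arch=compute_80,code=sm_80")  // illustrative target architecture
            .compile("weaver_transformer_kernels"); // gives the linker launch_allreduce_2gpu etc.
    }
}
```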
2. `max_seq_len` cap
Smoke test caught a real identity drift between the Rust and Python backends:
```
Refuse to start: embedder identity drifted from pin at /opt/weaver/state/embedder.pin.json.
pinned: jinaai/jina-embeddings-v4 (dim=2048, rev=853c867...)
live: jinaai/jina-embeddings-v4 (dim=2048, rev=853c867...)
delta: max_seq_length: pin=32768 live=128000
```
The pin file (written by the Python service on 2026-04-20) records 32768. The Rust client was reading the architecture max (128000). Capping to `JINA_V4_TRAINED_CONTEXT = 32_768` makes the two backends report the same identity, so the Phase 1 cutover doesn't require an operator pin reset.
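For illustration, a hypothetical sketch of the shape of the pin comparison that produces the delta above; the struct, field names, and error text here are invented stand-ins, not the actual cohort-pin guard.

```rust
#[derive(PartialEq)]
struct EmbedderIdentity {
    model: String,
    revision: String,
    dim: usize,
    max_seq_length: usize,
}

fn check_pin(pinned: &EmbedderIdentity, live: &EmbedderIdentity) -> Result<(), String> {
    if pinned == live {
        return Ok(());
    }
    // Collect only the fields that drifted so the operator sees a focused delta.
    let mut delta = Vec::new();
    if pinned.max_seq_length != live.max_seq_length {
        delta.push(format!(
            "max_seq_length: pin={} live={}",
            pinned.max_seq_length, live.max_seq_length
        ));
    }
    // ...same pattern for model, revision, and dim...
    Err(format!(
        "embedder identity drifted from pin. delta: {}",
        delta.join(", ")
    ))
}
```

With both backends reporting `max_seq_length = 32768`, this comparison passes unchanged, which is why no operator pin reset is needed at the cutover.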
Bonus: `encode_text` boundary check now fires at the operator-meaningful limit instead of letting OOD inputs through.
3. Smoke evidence
Documents what was actually verified end-to-end, and notes what this run does not cover (actual embed traffic, long-running stability — task #118 territory).
Test plan
Follow-ups
🤖 Generated with Claude Code