
audit: embedder swap — Jina V4 → Qwen3-Embedding-0.6B-Q8 GGUF (PR 1 of 2)#317

Merged
toddwbucy merged 3 commits into main from audit/embedder-swap-qwen3 on May 12, 2026

Conversation


toddwbucy (Owner) commented May 12, 2026

Summary

Planning artifact for the embedder swap discussed last night and this morning. No code changes — just the audit + decisions so PR 2 (implementation) has a clear scope and a single review surface.

Decisions locked

| Decision | Choice |
| --- | --- |
| Carrier | Qwen3-Embedding-0.6B-Q8_0 GGUF (~635 MB, 1024 dim, 32K context) |
| Serving stack | llama.cpp via existing `llama-cpp-sys-2` (same engine as the Qwen3-Coder decoder) |
| Candle fork | Unchanged; the Jina V4 backend remains valid under the per-profile pluggability commit |
| Default context size | 32K native; don't truncate (preserves late-chunking for later workloads) |
| GGUF install path | `/opt/weaver/models/qwen3-embedding-0.6b/` |
| GPU placement | Same-GPU per agent by default; configurable via `spu.encoder.gpu` / `spu.decoder.gpu` |
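Taken together, these decisions imply a per-agent config shaped roughly like the sketch below. Only the `spu.encoder.gpu` / `spu.decoder.gpu` keys and the install path come from the table above; the remaining key names and nesting are illustrative assumptions, not the real schema.

```yaml
# Hypothetical per-agent spu config reflecting the locked decisions.
spu:
  encoder:
    model_id: qwen3-embedding-0.6b-gguf
    model_path: /opt/weaver/models/qwen3-embedding-0.6b/Qwen3-Embedding-0.6B-Q8_0.gguf
    dimension: 1024
    context_size: 32768   # native 32K; don't truncate (preserves late-chunking)
    gpu: 0                # spu.encoder.gpu
  decoder:
    gpu: 0                # spu.decoder.gpu — same-GPU per-agent default
```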

Framework leaks identified (fix in PR 2)

| Site | Issue |
| --- | --- |
| `belief.rs:114-120` | `BELIEF_EMBEDDING_INDEX` const has `dimension: 2048` hardcoded; must become builder-constructed from the agent's spu config |
| `harness.rs:207-250` | Test fixtures assert `model == "jina-v4"` and `dim == 2048`; must assert against the configured embedder |
| `toml_edit.rs`, `serve.rs` | Config writers emit Jina V4 defaults |
| `runtime_schema.rs`, `episode.rs` | Schema-side 2048 references |
| Several test fixtures across crates | Pin Jina specifics |

The `weaver-spu/src/encoder/` references (77 hits) are legitimate — that's the embedder implementation layer where Jina V4 lives as one valid backend. They stay.
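The `harness.rs` fix amounts to replacing pinned literals with values read from the agent's configured embedder. A minimal sketch of that shape, where `EmbedderConfig` and its fields are assumptions rather than the real harness types:

```rust
// Hypothetical fixture check: compare against the configured embedder,
// not the literals "jina-v4" / 2048 (the framework leak).
struct EmbedderConfig {
    model_id: String,
    dimension: usize,
}

fn fixture_matches(cfg: &EmbedderConfig, model: &str, dim: usize) -> bool {
    // Before: model == "jina-v4" && dim == 2048 (hardcoded).
    // After: whatever the agent's spu config declares.
    model == cfg.model_id && dim == cfg.dimension
}

fn main() {
    let cfg = EmbedderConfig {
        model_id: "qwen3-embedding-0.6b-gguf".into(),
        dimension: 1024,
    };
    assert!(fixture_matches(&cfg, "qwen3-embedding-0.6b-gguf", 1024));
    assert!(!fixture_matches(&cfg, "jina-v4", 2048)); // old literals no longer pass
    println!("fixture ok");
}
```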

Verification status

llama.cpp + Qwen3-Embedding-0.6B-Q8 GGUF compatibility confirmed via `llama-cpp-python` smoke test this morning:

  • ✓ 1024-dim embeddings
  • ✓ Last-token pooling (GGUF default is correct for Qwen3-Embedding)
  • ✓ Retrieval ranking correct on a HeroBench fixture query (cosine 0.66 for relevant vs 0.18 for irrelevant)
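The ranking check reduces to comparing cosine similarities between the query vector and each passage vector. A standalone sketch of that computation (not the smoke-test script itself, which used `llama-cpp-python`):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
// The smoke test passes when the relevant passage scores well above the
// irrelevant one (0.66 vs 0.18 on the HeroBench fixture query).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Toy 3-dim vectors for illustration; real embeddings are 1024-dim.
    let query = [1.0, 0.0, 0.0];
    let relevant = [0.9, 0.1, 0.0];
    let irrelevant = [0.0, 0.2, 1.0];
    assert!(cosine(&query, &relevant) > cosine(&query, &irrelevant));
    println!("ranking ok");
}
```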

What ships in PR 2 (next)

  1. `LlamaCppEmbedderClient` in `weaver-spu/src/encoder/llama_cpp_client.rs` implementing `Embedder` trait
  2. Backend dispatch in daemon load handler (Jina V4 → candle; Qwen3-Embedding → llama.cpp)
  3. `BELIEF_EMBEDDING_INDEX` const → builder function parameterized on dim
  4. Test fixture updates
  5. Default agent YAML template referencing Qwen3-Embedding-0.6B-Q8_0
  6. Decision log entries in `belief-nodes-embedding-Spec.md` + `memory-substrate-Spec.md`
  7. GGUF file move `/opt/weaver/huggingface/hub/gguf-models/qwen3-embedding-0.6B/` → `/opt/weaver/models/qwen3-embedding-0.6b/`
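Item 2's dispatch can be sketched as a match on the configured model id. The enum name, accepted strings, and function shape are assumptions for illustration; the real wiring lands in the daemon load handler in PR 2:

```rust
// Hypothetical backend dispatch: Jina V4 stays on the candle fork,
// Qwen3-Embedding GGUF goes through llama.cpp (llama-cpp-sys-2).
#[derive(Debug)]
enum EmbedderBackend {
    Candle,
    LlamaCpp,
}

fn backend_for(model_id: &str) -> Option<EmbedderBackend> {
    match model_id {
        "jina-v4" => Some(EmbedderBackend::Candle),
        "qwen3-embedding-0.6b-gguf" => Some(EmbedderBackend::LlamaCpp),
        _ => None, // unknown model_id: refuse to load rather than guess
    }
}

fn main() {
    assert!(matches!(backend_for("jina-v4"), Some(EmbedderBackend::Candle)));
    assert!(matches!(
        backend_for("qwen3-embedding-0.6b-gguf"),
        Some(EmbedderBackend::LlamaCpp)
    ));
    assert!(backend_for("unknown-model").is_none());
    println!("dispatch ok");
}
```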

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive audit documenting the planned embedder backend swap to Qwen3 embeddings, migration scope, constraints, and an ordered implementation checklist.
    • Includes architecture choices, migration steps for embeddings and indexes, agent template defaults and test updates, documentation touchpoints, prioritized items to remove hardcoded assumptions, locked decisions, and success criteria.

Review Change Stack

…f 2)

Planning artifact for the embedder swap discussed 2026-05-11/12. No code
changes; ships the audit + decisions so PR 2 (implementation) has a
clear scope and a single review surface.

Decisions locked:
- Carrier: Qwen3-Embedding-0.6B-Q8_0 GGUF (~635MB, 1024 dim, 32K context)
- Serving: llama.cpp via llama-cpp-sys-2 (same engine as the decoder;
  no Python in runtime; no new dependency)
- Candle fork: unchanged (Jina V4 backend remains one valid profile
  option per per-profile pluggability commit)
- Default context size: 32K native (don't truncate; preserves late-
  chunking for future workloads)
- GGUF install path: /opt/weaver/models/qwen3-embedding-0.6b/
- GPU placement: same-GPU per agent default, configurable per spu yaml

Framework leaks identified (will be fixed in PR 2):
- belief.rs:114-120 BELIEF_EMBEDDING_INDEX const with dimension: 2048
  hardcoded — must become builder constructed from agent spu config
- harness.rs:207-250 test fixtures asserting model=="jina-v4" and
  dim==2048 — must assert against configured embedder, not literals
- toml_edit.rs / serve.rs config writers emitting Jina V4 defaults
- runtime_schema.rs / episode.rs schema-side 2048 references
- A handful of test fixtures across crates pinning Jina specifics

Verification status: llama.cpp + Qwen3-Embedding-0.6B-Q8 GGUF
compatibility confirmed via llama-cpp-python smoke test 2026-05-12
morning — 1024-dim embeddings, last-token pooling (GGUF default is
correct), retrieval ranking correct on a HeroBench fixture query
(cosine 0.66 for the relevant passage vs 0.18 for an irrelevant one).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 06a96ec3-7d9f-468f-b9ef-85af4b815ca9

📥 Commits

Reviewing files that changed from the base of the PR and between 2ddf52a and 81185ce.

📒 Files selected for processing (1)
  • docs/audit/embedder-swap-2026-05-12.md

📝 Walkthrough


Adds an audit document planning the embedder backend swap from Jina V4 to Qwen3-Embedding-0.6B (GGUF): records architecture choices, inventories Jina/2048 references, lists framework leaks to fix, describes PR 2 implementation steps (client, YAML, docs), locks decisions, and defines success criteria.

Changes

Embedder-swap audit plan (Jina V4 to Qwen3-GGUF)

| Layer / File(s) | Summary |
| --- | --- |
| Audit overview and architectural decisions (`docs/audit/embedder-swap-2026-05-12.md`) | Header identifying the audit, date, PR sequencing, and core architectural choices, including the llama.cpp/GGUF stack selection, profile-driven embedding-dimension/index construction, the Candle fork stance, and the dev-DB rebuild requirement. |
| Reference inventory and constraint boundaries (same file) | Reference-count summary tracking Jina and 2048-dim occurrences; defines "legitimate references" permitted within the embedder implementation layer; lists prioritized "framework leaks" requiring parameterization or updates. |
| PR 2 implementation plan and roadmap (same file) | Describes embedder client construction (adding `LlamaCppEmbedderClient` alongside the existing candle-based client), enumerates dispatcher call sites and agent YAML template updates (new Qwen3 default and test agent), and documents related docs and deferred items. |
| Locked decisions and success criteria (same file) | Records locked choices (native 32K context, canonical GGUF install path, GPU placement honoring agent YAML), the ordered PR 2 implementation sequence, and explicit success criteria (tests, provisioning/retrieval behavior, removal of hardcoded Jina/2048 outside the embedder layer, docs alignment, decision-log entry). |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

  • toddwbucy/WeaverTools#117: Adds a gguf-based llama.cpp embedding module (text and vision) that implements the gguf embedder functionality referenced by this audit.
  • toddwbucy/WeaverTools#298: Trait-typed, opt-in in-process embedder work that overlaps backend wiring and construction call sites inventoried by the audit.
  • toddwbucy/WeaverTools#293: Adds a JinaV4-based embedder producing 2048-d vectors, directly intersecting the audit's inventory of Jina/2048 usages.

Poem

🐰 In quiet docs I nibble a chart,
Jina to Qwen3 — a tidy new start,
I count each leak, pin every hard code,
Pack GGUF paths down the embedder road,
Hop, test, and log — a tidy migration art.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: an audit document planning an embedder swap from Jina V4 to Qwen3-Embedding-0.6B-Q8 GGUF; it is specific, concise, and directly related to the PR's primary objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files; skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped; no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped; no linked issues were found for this pull request. |





coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/audit/embedder-swap-2026-05-12.md`:
- Line 117: There is a conflict between the two default context_size values:
change the document so the `context_size` is consistent for the new default
template (`Qwen3-Embedding-0.6B-Q8_0.gguf`, `model_id:
"qwen3-embedding-0.6b-gguf"`) and the locked default YAML; either set both
occurrences to the same numeric value (e.g., 8192 or 32768) or explicitly label
one as “operational default: context_size: 8192” and the other as “native
capability: context_size: 32768” with clear wording so readers know which is the
runtime default used by PR 2 and which is the model’s native max.
- Around line 27-33: The fenced code block in
docs/audit/embedder-swap-2026-05-12.md (the block showing the JinaV4 and metrics
lines) is missing a language identifier; update that triple-backtick fence to
include a language such as "text" (e.g., ```text) so markdownlint rule MD040 is
satisfied, leaving the block contents unchanged and preserving alignment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2fc9f999-56ac-4d36-a7c1-32dd2b1360f0

📥 Commits

Reviewing files that changed from the base of the PR and between 150ec39 and cdcdc31.

📒 Files selected for processing (1)
  • docs/audit/embedder-swap-2026-05-12.md

Two real findings:

1. §7 context_size was stale at 8192 — I forgot to back-propagate the
   §10.1 decision (32K native, don't truncate) when locking the
   decisions section. Updated §7 to reflect 32768 and added a cross-
   reference to §10.1 so the reason isn't lost on future readers.

2. The reference-count fence at §3 lacked a language identifier;
   markdownlint MD040 wants one. Changed to ```text.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/audit/embedder-swap-2026-05-12.md`:
- Line 12: The sentence conflates GGUF artifact file size with runtime GPU
memory; update the phrasing in the paragraph mentioning "Jina V4" and
"Qwen3-Embedding-0.6B-Q8_0" so it explicitly separates the GGUF weights
footprint (~635MB) from expected runtime VRAM (which is workload- and
backend-dependent), e.g., state that "~635MB refers to the GGUF artifact size,
not total runtime GPU memory, which varies with batch size, sequence length, and
allocator/engine," and keep the note about preserving 32K context and alignment
with the Qwen3-Coder decoder and the substrate-pluggability reference
(`memory-substrate-Spec.md` §3.5).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 94f3fb4f-a7f7-4d1e-b95a-aa002b1062b8

📥 Commits

Reviewing files that changed from the base of the PR and between cdcdc31 and 2ddf52a.

📒 Files selected for processing (1)
  • docs/audit/embedder-swap-2026-05-12.md

CR flagged that "~635MB VRAM" conflated GGUF artifact size with total
runtime GPU memory. The artifact size is weights-on-disk; actual VRAM
also includes KV cache (significant with 32K context), intermediate
activations, and allocator overhead. Rewrote the motivation sentence
to make the distinction explicit and noted that empirical runtime
VRAM will be measured during PR 2's smoke test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
toddwbucy merged commit f7c4689 into main on May 12, 2026
1 check passed
toddwbucy deleted the audit/embedder-swap-qwen3 branch on May 12, 2026 at 14:10
toddwbucy added a commit that referenced this pull request May 12, 2026
… leak (PR 2 of N) (#318)

* refactor(belief): parameterize embedding-index dim — remove framework leak

Removes the framework-side hardcoded 2048 from
weaver-database/src/graph/belief.rs (the highest-priority framework
leak identified in the embedder-swap audit, PR #317). The index
definition's static metadata (name, field, metric, n_lists) stays as
named consts; the per-agent `dimension` is supplied at construction
time via a new `belief_embedding_index_for_dim(dim) -> BeliefEmbeddingIndexDef`
const-fn builder.

Per the substrate-pluggability commitment in `memory-substrate-Spec.md`
§3.5 and the per-profile mapping in `project_embedder_per_profile`,
different agents use different embedders (Jina V4 → 2048,
Qwen3-Embedding-0.6B → 1024); each agent's index must match the
embedder it's bound to. A single framework const can't capture that.

Changes:

- `weaver-database/src/graph/belief.rs`:
  - New consts `BELIEF_EMBEDDING_INDEX_NAME`, `_METRIC`, `_N_LISTS`
    carry the invariant metadata.
  - New const-fn `belief_embedding_index_for_dim(dim) -> def`.
  - Removed the `BELIEF_EMBEDDING_INDEX` const (the actual framework
    leak — that const baked dim 2048 into the framework).
  - Doc comments updated to reflect per-agent dim provenance.
  - Test refactored to exercise the builder with multiple dims,
    proving the builder propagates dim verbatim and metadata stays
    invariant across calls.

- `weaver-demo/src/herobench/belief.rs`:
  - New `HEROBENCH_DEFAULT_EMBEDDING_DIM = 2048` named const, marked
    transitional in its doc comment. Demo-side carries the bridge
    default during the embedder-swap migration; framework side stays
    clean.
  - `validate_belief_embedding_schema()` (returned metric+dim) split
    into `validate_belief_embedding_metric()` (metric only). Dimension
    is now threaded explicitly from the named const, not buried in
    a framework return value.
  - `create_belief_embedding_index` uses the new
    `BELIEF_EMBEDDING_INDEX_N_LISTS` const directly.

- `weaver-demo/tests/herobench_integration.rs`:
  - Drive-by: fix pre-existing clippy nit
    (`manual_is_multiple_of`) that the upgraded clippy version
    started flagging. Unrelated to the embedder swap; happened to
    surface in the same crate's test build.

12 unit tests in `weaver-database::graph::belief` pass; clippy + fmt
clean across weaver-database + weaver-demo.

Next commits in this PR:
- New LlamaCppEmbedderClient in weaver-spu/src/encoder/llama_cpp_client.rs
- Backend dispatch in server.rs::load_per_agent_embedder
- Default agent YAML template referencing Qwen3-Embedding-0.6B-Q8_0
- GGUF file move from HF cache to /opt/weaver/models/
- Decision log entries in specs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
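The `belief.rs` refactor described in the commit message above can be sketched as follows. The struct fields and the placeholder values for the name/metric/n_lists consts are assumptions based on the metadata the commit names; only the overall shape (invariant consts plus a const-fn builder taking `dim`) is taken from the commit:

```rust
// Invariant index metadata stays as named consts; only `dimension` is
// per-agent, supplied at construction time from the agent's spu config.
// (Const values here are placeholders, not the real ones.)
pub const BELIEF_EMBEDDING_INDEX_NAME: &str = "belief_embedding_idx";
pub const BELIEF_EMBEDDING_INDEX_METRIC: &str = "cosine";
pub const BELIEF_EMBEDDING_INDEX_N_LISTS: u32 = 100;

pub struct BeliefEmbeddingIndexDef {
    pub name: &'static str,
    pub metric: &'static str,
    pub n_lists: u32,
    pub dimension: usize,
}

// Const-fn builder replacing the old BELIEF_EMBEDDING_INDEX const that
// baked dimension 2048 into the framework.
pub const fn belief_embedding_index_for_dim(dim: usize) -> BeliefEmbeddingIndexDef {
    BeliefEmbeddingIndexDef {
        name: BELIEF_EMBEDDING_INDEX_NAME,
        metric: BELIEF_EMBEDDING_INDEX_METRIC,
        n_lists: BELIEF_EMBEDDING_INDEX_N_LISTS,
        dimension: dim,
    }
}

fn main() {
    // Jina V4 agents get 2048; Qwen3-Embedding agents get 1024.
    let jina = belief_embedding_index_for_dim(2048);
    let qwen = belief_embedding_index_for_dim(1024);
    assert_eq!(jina.dimension, 2048);
    assert_eq!(qwen.dimension, 1024);
    // Metadata stays invariant across calls; dim propagates verbatim.
    assert_eq!(jina.name, qwen.name);
    println!("builder ok");
}
```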

* review: rename validate_ → parse_belief_embedding_metric

Per PR #318 CR review: the function parses a string into the
VectorMetric enum (it's the only consumer that matters; an unknown
metric string is the only failure path). "validate" was a misnomer
inherited from the pre-split `validate_belief_embedding_schema`;
"parse_" is what it actually does.

Body and return type unchanged; call site updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
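The rename's rationale can be illustrated with a sketch: the function's whole job is to parse a metric string into an enum, with an unknown string as the only failure path. The `VectorMetric` variants and accepted strings below are assumptions, not the real definitions:

```rust
// "parse_", not "validate_": the function converts a string to an enum
// value; an unrecognized string is the only way it can fail.
#[derive(Debug, PartialEq)]
enum VectorMetric {
    Cosine,
    L2,
    Dot,
}

fn parse_belief_embedding_metric(s: &str) -> Result<VectorMetric, String> {
    match s {
        "cosine" => Ok(VectorMetric::Cosine),
        "l2" => Ok(VectorMetric::L2),
        "dot" => Ok(VectorMetric::Dot),
        other => Err(format!("unknown vector metric: {other}")),
    }
}

fn main() {
    assert_eq!(parse_belief_embedding_metric("cosine"), Ok(VectorMetric::Cosine));
    assert!(parse_belief_embedding_metric("hamming").is_err());
    println!("parse ok");
}
```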

* review: drop redundant BELIEF_NODES_COLLECTION import in herobench belief.rs

Per PR #318 CR review: the function-local `use` in
create_belief_embedding_index was pulling
`BELIEF_NODES_COLLECTION` from `weaver_database::graph::belief`,
shadowing the file-local `pub const BELIEF_NODES_COLLECTION` at line
~906. Both happen to have the same value ("belief_nodes"), so no
runtime divergence, but the shadow is sloppy.

Removed `BELIEF_NODES_COLLECTION` from the import list. Comment
inline explains why so the next reader doesn't reintroduce it. The
file-local const is now the single source for this function.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: r3d91ll <r3d91ll@bucy-medrano.me>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>