
feat(noxa-68r): noxa-rag RAG pipeline crate#4

Merged
jmagar merged 25 commits into main from feature/noxa-rag
Apr 13, 2026

Conversation

@jmagar
Owner

@jmagar jmagar commented Apr 12, 2026

Summary

  • New crates/noxa-rag lib+binary crate: watches output directory for ExtractionResult JSON files, chunks them, embeds via TEI, and upserts to Qdrant
  • EmbedProvider + VectorStore async traits with Arc<dyn Trait> pluggable impls
  • TeiProvider: batched embedding (96/batch, RTX 4070 tuned), 429/503 retry backoff
  • QdrantStore: plain reqwest REST calls against port 6333 (no gRPC/protoc dependency)
  • chunker: MarkdownSplitter + Qwen3 tokenizer, manual sliding-window overlap
  • pipeline: notify-debouncer-mini watcher → bounded mpsc(256) → worker pool, per-URL mutex for safe delete-before-upsert
  • noxa-rag-daemon binary: full startup sequence, signal handling, 10s shutdown timeout
  • TOML config with serde tagged enums for runtime provider swapping

Bead

noxa-68r: noxa-rag RAG pipeline crate (TEI embeddings + Qdrant)

Testing

  • cargo check -p noxa-rag passes (0 errors)
  • Chunker unit tests pass
  • Services: TEI at 127.0.0.1:52000, Qdrant REST at 127.0.0.1:53333

Notes

  • qdrant-client crate dropped — uses plain reqwest REST (v1.x is gRPC-only)
  • tokenizers Rust crate has no from_pretrained — local_path is required (tokenizer already cached at ~/.cache/huggingface/hub/models--Qwen--Qwen3-Embedding-0.6B/...)

Summary by cubic

Adds crates/noxa-rag: a RAG indexing pipeline and noxa-rag-daemon that watch noxa output, chunk content, embed with TEI, and upsert to Qdrant; now includes a startup delta scan using SHA‑256, safer multi‑format file parsing, web provenance metadata, a targeted JSON schema correction prompt in noxa-llm, and a two‑phase replace (upsert then evict stale). Implements Linear noxa-68r.

  • New Features

    • Library + daemon: watch ExtractionResult JSON; chunk → embed → upsert; startup scan computes content_hash and skips unchanged docs via a Qdrant url+hash check; two‑phase replace uses delete_stale_by_url to avoid data loss and logs warn on Qdrant errors during the delta check.
    • Formats + sources: parse_file() supports 20+ types (md/txt/log/rst/org/yaml/yml/toml/html/pdf/docx/odt/pptx/ipynb/jsonl/xml/opml/rss/atom/vtt/srt); strips notebook outputs; PDF via noxa-pdf; secure file:// with path canonicalization; adds a DOCX per‑entry decompressed‑size guard; auto file_path/last_modified and git_branch.
    • Providers + store: TEI embedder with larger batches and extended 429/503 backoff; Qdrant REST store with create‑if‑missing, vector‑size validation, document‑level upsert, and delete_stale_by_url to evict only outdated chunks.
  • Bug Fixes

    • Robustness: fixed byte‑slice preview panics in TEI and Qdrant error logs; connect/request timeouts; clearer errors with truncated bodies.
    • Correct metadata: source_type is set to "web" for HTTP documents; startup stats count files_indexed only when chunks > 0.
    • Pipeline safety: longer 429 backoff with real retry exhaustion; blocking bridges (no dropped jobs); per‑URL lock eviction and normalized URL used consistently across payloads, IDs, and locks.

Written for commit d8f5378. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Adds a background daemon CLI for filesystem-watching ingestion that chunks content, obtains embeddings (TEI provider), and upserts to a vector store (Qdrant) with concurrency, retries, backpressure, and graceful shutdown.
    • Configurable TOML-driven pipeline with tokenizer handling, deterministic UUID point IDs, per-URL locking, embed/provider & store abstraction, failure logging, and runtime health/size checks.
  • Documentation

    • Detailed README with quickstart, config examples, runtime requirements, operational notes, and troubleshooting.

jmagar added 6 commits April 12, 2026 07:13
- Bootstrap crates/noxa-rag with lib+binary layout
- EmbedProvider trait (1 method: embed)
- VectorStore trait (4 methods: upsert, delete_by_url, search, name)
- RagError enum (#[non_exhaustive], thiserror, From impls)
- Types: Chunk, Point, PointPayload, SearchResult
- Config: RagConfig with serde tagged enums for all providers
- load_config() with embed_concurrency > 0 validation
- Tokenizer Sync compile-time assertion in lib.rs
- Stub implementations for all modules (noxa-68r.2-.8)
… store

noxa-68r.2 — chunker.rs:
- MarkdownSplitter with ChunkConfig::new(lower..upper).with_sizer(tokenizer)
- chunk_char_indices() API (not chunks_with_offsets — doesn't exist in text-splitter 0.29)
- Manual sliding-window overlap via overlap_prefix()
- min_words filter + max_chunks_per_page cap
- plain_text fallback, empty content -> Vec::new()

noxa-68r.3 — embed/tei.rs:
- TeiProvider::new() (hardcoded 1024 dims) + new_with_probe() (dynamic)
- Batch size 96 (RTX 4070 tuned); halve to 48 on 413
- Exponential backoff on 429/503 (100/200/400ms, max 3 retries)
- EmbedRequest always sends inputs as array for consistent Vec<Vec<f32>> response
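The 429/503 retry policy above (100/200/400 ms, max 3 retries) can be sketched as follows. This is a hedged, std-only illustration: `backoff_schedule` and `retry_on_overload` are invented names, delays are returned rather than slept so the schedule is inspectable, and simulated status codes stand in for real TEI responses.

```rust
// Exponential backoff schedule: base_ms doubled per attempt (100/200/400ms).
fn backoff_schedule(max_retries: u32, base_ms: u64) -> Vec<u64> {
    (0..max_retries).map(|attempt| base_ms << attempt).collect()
}

// Retry a fallible call on 429/503 only; other statuses fail fast.
// The real provider awaits tokio::time::sleep between attempts.
fn retry_on_overload<F>(mut call: F, max_retries: u32) -> Result<Vec<f32>, u16>
where
    F: FnMut() -> Result<Vec<f32>, u16>,
{
    let mut last_err = 0u16;
    for _attempt in 0..=max_retries {
        match call() {
            Ok(v) => return Ok(v),
            Err(status) if status == 429 || status == 503 => last_err = status,
            Err(status) => return Err(status), // non-retryable
        }
    }
    Err(last_err) // retries exhausted: surface the real status
}

fn main() {
    assert_eq!(backoff_schedule(3, 100), vec![100, 200, 400]);
    // Succeeds on the third attempt after a 429 and a 503.
    let mut responses = vec![Err(429), Err(503), Ok(vec![0.1])].into_iter();
    let result = retry_on_overload(move || responses.next().unwrap(), 3);
    assert_eq!(result, Ok(vec![0.1]));
}
```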

noxa-68r.4 — store/qdrant.rs:
- QdrantStore with REST via reqwest feature (gRPC port 6334)
- create_collection: Cosine/HNSW m=16 ef_construct=200, payload indexes
- upsert() batched 256 with .wait(true)
- delete_by_url() with URL normalization
- search() extracts payload from ScoredPoint
- build_embed_provider(): TEI startup probe via new_with_probe(), is_available() check
- build_vector_store(): collection create-if-missing, dims validation warning
- NOXA_RAG_QDRANT_API_KEY env var override for Qdrant api_key
- embed_concurrency > 0 validation in load_config()
- failed_jobs_log absolute path validation
- Fix config.rs: Qdrant URL is gRPC port 6334 (not 6333 REST)
- Pipeline::new() + run() with CancellationToken shutdown
- notify-debouncer-mini via std::sync::mpsc::Sender + spawn_blocking bridge
  (0.4.x uses callback API, not receiver() — std::Sender implements DebounceEventHandler)
- Watch ALL event kinds, filter by .json extension + path.exists() (catches vim/emacs renames)
- Bounded mpsc channel(256) + embed_concurrency workers via Arc<Mutex<Receiver>>
- Per-URL keyed mutex via DashMap prevents concurrent delete+upsert races
- TOCTOU-safe file read: single open(), metadata() from same FD
- 50MB file size guard before read_to_string
- JSON parse failure -> append_failed_job NDJSON log + Ok(())
- URL scheme validation: http/https only, RFC-1918 blocking
- UUID v5 deterministic point IDs (config.uuid_namespace)
- tracing::Span carried in IndexJob (tokio::spawn drops span)
- noxa-rag-daemon: --config, --log-level, --version args (clap)
- Startup sequence: config load -> watch_dir create-if-missing -> embed probe
  -> vector store -> tokenizer -> Pipeline::new() -> signal handlers -> run()
- Embed dims probed dynamically (dummy embed call, no hardcode)
- tokenizer.json loaded from local_path (Rust tokenizers crate has no from_pretrained)
  - Clear error message with huggingface-cli download command
  - Accepts directory or direct file path
- Tracing to stderr (stdout may be piped)
- World-readable config warning on unix
- 10s force-exit timeout after CancellationToken cancel
- SIGTERM + Ctrl-C via tokio signal handlers
- README.md: CRITICAL TEI --pooling last-token warning, full config reference,
  quickstart, architecture diagram
- LEARNED: tokenizers Rust crate has no from_pretrained — local_path is required
qdrant-client v1.x is gRPC-only — reqwest feature is snapshot downloads only.
Rewrite QdrantStore using direct HTTP calls against Qdrant REST API (port 6333):
- collection_exists: GET /collections/{name}
- create_collection: PUT /collections/{name} + PUT /collections/{name}/index
- upsert: PUT /collections/{name}/points?wait=true (256-batch)
- delete_by_url: POST /collections/{name}/points/delete?wait=true
- search: POST /collections/{name}/points/search
No new dependencies — reqwest already in workspace.
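The method/path mapping above can be sketched without touching the network. This only builds (method, path) pairs for the REST endpoints listed; `qdrant_endpoint` is an illustrative helper, not the crate's API — the real store issues these requests with reqwest.

```rust
// Map a store operation to the Qdrant REST (port 6333) method and path
// described in the commit message above.
fn qdrant_endpoint(op: &str, base: &str, collection: &str) -> Option<(String, String)> {
    let c = format!("{base}/collections/{collection}");
    match op {
        "exists" => Some(("GET".into(), c)),
        "create" => Some(("PUT".into(), c)),
        "upsert" => Some(("PUT".into(), format!("{c}/points?wait=true"))),
        "delete" => Some(("POST".into(), format!("{c}/points/delete?wait=true"))),
        "search" => Some(("POST".into(), format!("{c}/points/search"))),
        _ => None,
    }
}

fn main() {
    let (method, path) = qdrant_endpoint("upsert", "http://127.0.0.1:6333", "noxa").unwrap();
    assert_eq!(method, "PUT");
    assert_eq!(path, "http://127.0.0.1:6333/collections/noxa/points?wait=true");
}
```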
Copilot AI review requested due to automatic review settings April 12, 2026 12:57
@coderabbitai

coderabbitai Bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new noxa-rag crate implementing a filesystem‑watching RAG daemon that chunks extraction JSON files, obtains embeddings (TEI), and indexes vectors to Qdrant, providing configuration, provider/store abstractions, a concurrent pipeline, and graceful shutdown.

Changes

Cohort / File(s) Summary
Manifest & Docs
crates/noxa-rag/Cargo.toml, crates/noxa-rag/README.md
New crate manifest and README describing the daemon, dependencies, quickstart, TEI runtime requirements, example noxa-rag.toml, and operational notes (watcher behavior, failed-job logging, idempotent re-indexing).
Daemon Binary
crates/noxa-rag/src/bin/noxa-rag-daemon.rs
New Tokio CLI daemon: arg parsing, logging, config load, permissions warning (Unix), embed-provider probe, tokenizer load (TEI-local only), vector store construction, pipeline creation, and signal-handling shutdown via CancellationToken.
Configuration & Errors
crates/noxa-rag/src/config.rs, crates/noxa-rag/src/error.rs
TOML-deserializable RagConfig with tagged enums for source/embed/store, defaults/validation (embed_concurrency, absolute failed_jobs_log), and RagError enum for domain and wrapped errors.
Chunking Logic
crates/noxa-rag/src/chunker.rs
Markdown-aware chunker that selects markdown/plain_text, token-range splitting, sliding-window overlap assembly, word-count filtering, tokenizer-backed token estimation fallback, domain extraction helpers, and unit tests.
Embedding Provider
crates/noxa-rag/src/embed/...
crates/noxa-rag/src/embed/mod.rs, crates/noxa-rag/src/embed/tei.rs
EmbedProvider trait + DynEmbedProvider; TEI provider with probe to infer dims, health check, batched POST /embed, retry/backoff for 429/503, special handling for 413 by reducing batch size, and async embed API.
Vector Store Abstraction
crates/noxa-rag/src/store/mod.rs
VectorStore trait (async upsert/delete/search) with DynVectorStore alias and qdrant backend re-export.
Qdrant Backend
crates/noxa-rag/src/store/qdrant.rs
QdrantStore implementing collection existence/creation, payload index creation, collection_vector_size, batched upsert, delete_by_url via payload filter, search parsing with payload extraction, URL normalization, and REST error mapping.
Factory Functions
crates/noxa-rag/src/factory.rs
Async builders: build_embed_provider (TEI probe, is_available, dims check) and build_vector_store (API key resolution, validate or create Qdrant collection matching embed dims).
Pipeline Core
crates/noxa-rag/src/pipeline.rs
Pipeline with debounced filesystem watcher (notify-debouncer-mini), bounded job queue, worker pool sized by embed_concurrency, per-URL mutexes (DashMap) to serialize delete+upsert, JSON size/parse limits, URL/IP validation, deterministic UUID v5 point IDs, optional NDJSON failed-job logging, and graceful shutdown/drain.
Types & Lib Root
crates/noxa-rag/src/types.rs, crates/noxa-rag/src/lib.rs
Core types (Chunk, Point, PointPayload, SearchResult) and crate root with module declarations and public re-exports (config, embed, error, factory, store, types).
```mermaid
sequenceDiagram
    participant FS as Filesystem
    participant Watcher as DebouncedWatcher
    participant Queue as JobQueue
    participant Worker as WorkerPool
    participant Chunker as Chunker
    participant TEI as TEI_Service
    participant Qdrant as Qdrant_Store

    FS->>Watcher: write .json extraction file
    Watcher->>Queue: enqueue IndexJob (debounced)
    Queue->>Worker: deliver job
    Worker->>Worker: read & parse ExtractionResult
    Worker->>Chunker: chunk(result, config, tokenizer)
    Chunker->>Worker: return chunks
    Worker->>TEI: embed(batch of chunk texts)
    TEI->>TEI: batch, retry 429/503, fallback on 413
    TEI->>Worker: return embeddings
    Worker->>Worker: build Points (UUID v5)
    Worker->>Qdrant: delete_by_url(normalized_url)
    Qdrant->>Worker: ack delete
    Worker->>Qdrant: upsert(points)
    Qdrant->>Worker: ack upsert
    Worker->>Worker: record failures or finish
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through json, nibble each chunk,
TEI sings vectors, tidy and shrunk,
Qdrant beds them down, neat in a row,
Debounced I listen, index seeds I sow,
A rabbit daemon, quiet and quick to show.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The PR title 'feat(noxa-68r): noxa-rag RAG pipeline crate' clearly and concisely describes the main change — the addition of a new RAG pipeline crate with comprehensive ecosystem integration (daemon, async traits, providers, pipeline orchestration).
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Description Check — ✅ Passed: Check skipped — CodeRabbit's high-level summary is enabled.





@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e4343905f


Comment thread crates/noxa-rag/src/bin/noxa-rag-daemon.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs Outdated

Copilot AI left a comment


Pull request overview

Introduces a new noxa-rag crate (lib + noxa-rag-daemon binary) that watches a noxa output directory for ExtractionResult JSON files and indexes them into a vector store by chunking, embedding (TEI), and upserting (Qdrant).

Changes:

  • Add filesystem-watcher pipeline with worker pool, URL validation, and delete-before-upsert indexing flow.
  • Add pluggable async traits for embedding providers and vector stores, with TEI and Qdrant REST implementations.
  • Add TOML configuration, daemon entrypoint, and README quickstart/docs.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
crates/noxa-rag/src/types.rs Defines Chunk/Point/SearchResult types for indexing and retrieval.
crates/noxa-rag/src/store/qdrant.rs Implements Qdrant REST upsert/delete/search backend.
crates/noxa-rag/src/store/mod.rs Defines VectorStore trait + dyn wrapper and exports QdrantStore.
crates/noxa-rag/src/pipeline.rs Implements watcher→queue→workers orchestration and indexing logic.
crates/noxa-rag/src/lib.rs Crate module wiring + public re-exports.
crates/noxa-rag/src/factory.rs Factory functions to build/probe embed provider and vector store.
crates/noxa-rag/src/error.rs RagError error type for the crate.
crates/noxa-rag/src/embed/tei.rs TEI embed provider with batching and retry/backoff.
crates/noxa-rag/src/embed/mod.rs EmbedProvider trait + dyn wrapper and exports.
crates/noxa-rag/src/config.rs TOML config schema + validation and loader.
crates/noxa-rag/src/chunker.rs Markdown/text chunking with overlap and token estimates (+ unit tests).
crates/noxa-rag/src/bin/noxa-rag-daemon.rs Daemon CLI, startup probes, tokenizer loading, signal handling.
crates/noxa-rag/README.md Usage docs, quickstart, and architecture overview.
crates/noxa-rag/Cargo.toml Declares new crate, binary target, and dependencies.


Comment thread crates/noxa-rag/src/pipeline.rs
Comment thread crates/noxa-rag/src/pipeline.rs
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/config.rs Outdated
Comment thread crates/noxa-rag/src/factory.rs Outdated
Comment thread crates/noxa-rag/README.md
Comment thread crates/noxa-rag/README.md Outdated
Comment thread crates/noxa-rag/README.md
Comment thread crates/noxa-rag/src/chunker.rs Outdated
Three bugs identified in code review (chatgpt-codex-connector[bot]):

fix(noxa-rag-daemon): move 10s shutdown timeout to drain phase only

Previously tokio::time::timeout wrapped the entire pipeline.run() call,
causing the daemon to exit after 10s even during normal steady-state
operation with no shutdown signal. The timeout now lives inside
pipeline.run() and only applies to the post-cancellation worker drain.

Resolves review comment on noxa-rag-daemon.rs:186 (P0)
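The corrected shutdown shape can be sketched with std channels standing in for tokio and CancellationToken: the run loop blocks indefinitely in steady state, and the timeout applies only to the post-cancellation worker drain. `drain_with_timeout` is an illustrative name, not the crate's actual function.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Wait for each worker's completion signal after cancellation, but never
// longer than `timeout` overall; returning false means the caller would
// force-exit (the daemon's 10s escape hatch).
fn drain_with_timeout(done_rx: mpsc::Receiver<()>, workers: usize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    for _ in 0..workers {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if done_rx.recv_timeout(remaining).is_err() {
            return false; // drain timed out (or a worker vanished)
        }
    }
    true
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for _ in 0..3 {
        let tx = tx.clone();
        thread::spawn(move || tx.send(()).unwrap());
    }
    // All three workers signal completion well inside the 10s budget.
    assert!(drain_with_timeout(rx, 3, Duration::from_secs(10)));
}
```

The key design point mirrors the fix: the deadline is computed once when draining starts, so the timeout never ticks during normal operation.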

fix(pipeline): replace try_send with blocking_send to stop dropping jobs

try_send silently discarded index events when the 256-slot queue was
full, causing permanent data loss for burst writes unless files were
modified again. blocking_send (safe inside spawn_blocking) blocks until
capacity is available, providing backpressure instead of data loss.

Resolves review comment on pipeline.rs:203 (P1)
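The backpressure behaviour this fix introduces can be demonstrated with std's bounded `sync_channel`, which stands in here for tokio's mpsc: a full queue blocks the producer (like `blocking_send` inside `spawn_blocking`) instead of silently dropping events (like `try_send`). `run_burst` is an illustrative helper, not pipeline code.

```rust
use std::sync::mpsc;
use std::thread;

// Push `jobs` events through a tiny bounded queue. send() on a sync_channel
// blocks when the queue is full, so every burst write survives — the
// property the blocking_send fix restores.
fn run_burst(jobs: u32, capacity: usize) -> Vec<u32> {
    let (tx, rx) = mpsc::sync_channel::<u32>(capacity);
    let producer = thread::spawn(move || {
        for job in 0..jobs {
            tx.send(job).unwrap(); // blocks until the consumer makes room
        }
    });
    let received: Vec<u32> = rx.iter().collect(); // ends when tx is dropped
    producer.join().unwrap();
    received
}

fn main() {
    let got = run_burst(10, 2);
    assert_eq!(got.len(), 10); // no job lost despite the 2-slot queue
    println!("all {} jobs delivered in order: {:?}", got.len(), got);
}
```

With `try_send` semantics, jobs 3..10 of this burst could vanish whenever the consumer lagged; with blocking sends the producer simply waits.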

fix(pipeline): normalize URL before storing payload and taking url_lock

The stored payload URL and per-URL mutex key were derived directly from
ExtractionResult without normalization. delete_by_url normalizes before
filtering, so trailing slashes or fragments would cause delete-before-
upsert to miss prior points and leave stale chunks. URL is now normalized
immediately after validation; normalize_url is pub(crate) in qdrant.rs.

Resolves review comment on pipeline.rs:405 (P1)
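A normalization consistent with this fix can be sketched as below: strip the fragment and a trailing slash so payload URLs, UUID seeds, and per-URL lock keys all agree with what delete_by_url filters on. The crate's actual normalize_url may differ; this is an assumption-labeled illustration.

```rust
// Canonicalize a URL the same way for every consumer (payload, point ID,
// lock key, delete filter): drop the fragment, trim a trailing slash.
fn normalize_url(url: &str) -> String {
    let no_fragment = url.split('#').next().unwrap_or(url);
    // Keep the scheme's "//" intact; only trim a path trailing slash.
    let trimmed = if no_fragment.len() > "https://".len() {
        no_fragment.trim_end_matches('/')
    } else {
        no_fragment
    };
    trimmed.to_string()
}

fn main() {
    assert_eq!(normalize_url("https://example.com/docs/#intro"), "https://example.com/docs");
    // Equivalent forms collapse to one key, so delete-before-upsert
    // can't miss prior points.
    assert_eq!(
        normalize_url("https://example.com/docs/"),
        normalize_url("https://example.com/docs")
    );
}
```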

@cubic-dev-ai cubic-dev-ai Bot left a comment


6 issues found across 14 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/embed/tei.rs">

<violation number="1" location="crates/noxa-rag/src/embed/tei.rs:173">
P2: The reduced-batch retry path swallows the real sub-batch error and always returns a synthetic 413 error, causing incorrect failure reporting.</violation>
</file>

<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:263">
P1: IPv6 private address check only covers loopback, missing unique-local (`fc00::/7`) and link-local (`fe80::/10`). A URL with a private IPv6 address like `http://[fd00::1]/...` would pass validation. Add manual octet checks for these ranges.</violation>
</file>

<file name="crates/noxa-rag/src/store/qdrant.rs">

<violation number="1" location="crates/noxa-rag/src/store/qdrant.rs:158">
P1: `upsert` stores raw URLs while `delete_by_url` matches normalized URLs, which can break delete-before-upsert for equivalent URL forms.</violation>
</file>

<file name="crates/noxa-rag/src/factory.rs">

<violation number="1" location="crates/noxa-rag/src/factory.rs:82">
P1: Existing Qdrant collections are not dimension-validated at startup, so incompatible vector sizes are only discovered later during upserts.</violation>
</file>

<file name="crates/noxa-rag/src/chunker.rs">

<violation number="1" location="crates/noxa-rag/src/chunker.rs:48">
P2: The overlap builder can return more than `overlap_tokens` because it checks the token budget after inserting the next word and breaks without undoing that insertion.</violation>
</file>

<file name="crates/noxa-rag/README.md">

<violation number="1" location="crates/noxa-rag/README.md:72">
P1: README points users to Qdrant gRPC port 6334, but this crate uses HTTP REST endpoints and expects port 6333. The documented config URL will not work.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
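The IPv6 gap flagged above (fc00::/7 unique-local, fe80::/10 link-local) can also be closed with manual segment checks, which work on stable Rust older than 1.84 where `is_unique_local`/`is_unicast_link_local` are unavailable. This is a sketch of the technique, not the pipeline's exact code:

```rust
use std::net::{IpAddr, Ipv6Addr};

// Manual range tests on the first 16-bit segment:
//   fc00::/7  -> top 7 bits 1111110  -> seg0 & 0xfe00 == 0xfc00
//   fe80::/10 -> top 10 bits 1111111010 -> seg0 & 0xffc0 == 0xfe80
fn is_private_ipv6(ip: &Ipv6Addr) -> bool {
    let seg0 = ip.segments()[0];
    ip.is_loopback()
        || (seg0 & 0xfe00) == 0xfc00 // unique-local
        || (seg0 & 0xffc0) == 0xfe80 // link-local
}

fn is_private_ip(host: &str) -> bool {
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => ip.is_private() || ip.is_loopback() || ip.is_link_local(),
        Ok(IpAddr::V6(ip)) => is_private_ipv6(&ip),
        Err(_) => false, // not an IP literal; hostname policy is out of scope here
    }
}

fn main() {
    assert!(is_private_ip("fd00::1")); // unique-local now rejected
    assert!(is_private_ip("fe80::1")); // link-local now rejected
    assert!(is_private_ip("10.0.0.1"));
    assert!(!is_private_ip("2001:db8::1")); // documentation range, outside both masks
}
```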

Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/store/qdrant.rs Outdated
Comment thread crates/noxa-rag/src/factory.rs
Comment thread crates/noxa-rag/README.md Outdated
Comment thread crates/noxa-rag/src/embed/tei.rs Outdated
Comment thread crates/noxa-rag/src/chunker.rs Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:247">
P1: Do not call `std::process::exit(0)` from library shutdown logic; return an error instead so callers can handle timeout safely.</violation>
</file>


Comment thread crates/noxa-rag/src/pipeline.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/noxa-rag/README.md`:
- Line 7: The README currently states Qdrant is running on gRPC port 6334 but
the crate uses Qdrant REST via reqwest; update every README occurrence (the
lines referencing gRPC/6334) to reflect REST/HTTP usage and the correct default
REST port (6333) and/or the full REST URL format used by this crate; refer to
src/config.rs for the canonical REST URL/config keys (so the README and the
implementation agree) and replace "gRPC/6334" wording with "HTTP REST (default
port 6333) / REST URL as configured in src/config.rs".
- Line 133: The README contains an unlabeled fenced code block (the block
starting with the lines containing "noxa-cli (writes .json) → watch_dir ...")
which triggers markdownlint MD040; update that fenced block to include a
language identifier (e.g., add "text" after the opening ``` so it becomes
```text) to explicitly label the block and satisfy the linter while leaving the
block content unchanged.

In `@crates/noxa-rag/src/chunker.rs`:
- Around line 45-49: The overlap-building loop currently inserts each word into
selected before checking token_estimate, so the first word that pushes the
estimate over overlap_tokens is retained; fix by checking the candidate's token
count before finalizing the insertion or by inserting then removing the last
word if the estimate >= overlap_tokens. Update the loop around
selected.insert(0, word) / selected.join(" ") to compute the candidate with the
prospective word and only keep it (or break) when token_estimate(&candidate,
tokenizer) < overlap_tokens, ensuring the overlap prefix stays within the
configured budget; reference the variables/functions words, selected,
token_estimate, tokenizer, and overlap_tokens when making the change.

In `@crates/noxa-rag/src/embed/tei.rs`:
- Around line 69-75: The code currently sets `dimensions` to
`DEFAULT_DIMENSIONS` when the deserialized `vecs` is empty, which can mask a bad
TEI probe; update the logic around `let vecs: Vec<Vec<f32>> =
resp.json().await?;` and `dimensions` so that if `vecs.is_empty()` you return an
error (or use `anyhow::bail!` / `Err(...)`) instead of falling back to
`DEFAULT_DIMENSIONS`, providing a clear error message that the TEI probe
returned an empty embedding response; adjust callers of this function to
propagate the error as needed.
- Around line 166-184: Replace the brittle string-match and synthetic-error
masking: stop checking Err(RagError::Embed(ref msg)) if msg.contains("413") and
instead detect a real 413 response via a dedicated status or variant on RagError
(e.g., add/ use a RagError::HttpStatus(u16) or include a status_code field on
RagError::Embed) so the fallback is only triggered for an actual 413; and when
iterating sub_chunk results from self.embed_batch(...) do not throw away the
original error—propagate Err(e) back to the caller instead of converting it into
a fabricated "TEI returned 413 even on reduced batch size" message (i.e.,
preserve and return the original Err from embed_batch when it fails).

In `@crates/noxa-rag/src/factory.rs`:
- Around line 17-34: The current check using provider.dimensions() can miss the
case where TeiProvider::new_with_probe() fell back to DEFAULT_DIMENSIONS because
the /embed probe returned no vectors; to fix this, after the availability check
call the provider.embed(...) probe (using the same small test input used by
TeiProvider::new_with_probe or a single-token input) and inspect the returned
vector(s) to ensure they are non-empty and derive actual dimensions; if the
embed response contains no vectors or zero-length vectors return
Err(RagError::Config(...)) instead of accepting the provider. Reference
TeiProvider::new_with_probe, provider.is_available(), provider.embed(), and
provider.dimensions() when locating where to add the re-probe and validation.
- Around line 81-96: The startup currently only warns when a collection exists
but may have mismatched vector dims; implement an explicit check in
build_vector_store(): when store.collection_exists() returns true, call a new
QdrantStore::collection_info() (returning CollectionInfoResponse) to fetch GET
/collections/{collection_name}, extract the stored vectors.size, compare it to
embed_dims, and return an Err if they differ (instead of logging a warning);
keep the existing create_collection(embed_dims) path for the false branch.
Ensure you add collection_info() to QdrantStore, parse the response into
CollectionInfoResponse, and use the collection and embed_dims identifiers from
the current scope for clear error logging.

In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 396-423: The code constructs UUIDs, PointPayload.url, and uses the
url_locks key from the raw extracted URL, but delete_by_url and PointPayload.url
are expected to use a normalized/canonical URL; normalize the URL once up front
and use that canonical value everywhere: replace uses of the raw `url` when
calling `uuid::Uuid::new_v5(...)`, when setting `PointPayload { url: ... }`, and
when accessing `url_locks.entry(...)`/locking, and pass the same canonical URL
into `store.delete_by_url(&...)` and `store.upsert(points)` so IDs, payloads,
and the per-URL mutex all align with `delete_by_url`'s normalization logic.
- Around line 422-423: The current replace deletes existing vectors first
(store.delete_by_url(&url).await?) and then upserts new points, which can cause
data loss if upsert fails; change to a two-phase replace: first upsert the new
points (store.upsert(points).await?) using a new generation/version tag or
temporary ID, verify success, then remove the old vectors
(store.delete_by_url(&url).await?) or delete only entries with the previous
generation ID; alternatively, if the store supports transactions or conditional
deletes, perform the upsert and deletion atomically or use versioned IDs (e.g.,
attach generation/version to points) and run a stale-ID cleanup after successful
upsert to ensure transient failures don’t make data disappear.
- Around line 261-264: The IPv6 branch of is_private_ip only checks
is_loopback(), letting IPv6 unique-local and link-local addresses pass; update
the match arm in is_private_ip (the IpAddr::V6 branch) to also call
is_unique_local() and is_unicast_link_local() so it mirrors the IPv4 checks
(i.e., include ip.is_unique_local() || ip.is_unicast_link_local() alongside
is_loopback()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fe1ba399-a1a1-4b7d-965f-d091ae09e231

📥 Commits

Reviewing files that changed from the base of the PR and between adf4b6b and 2e43439.

📒 Files selected for processing (14)
  • crates/noxa-rag/Cargo.toml
  • crates/noxa-rag/README.md
  • crates/noxa-rag/src/bin/noxa-rag-daemon.rs
  • crates/noxa-rag/src/chunker.rs
  • crates/noxa-rag/src/config.rs
  • crates/noxa-rag/src/embed/mod.rs
  • crates/noxa-rag/src/embed/tei.rs
  • crates/noxa-rag/src/error.rs
  • crates/noxa-rag/src/factory.rs
  • crates/noxa-rag/src/lib.rs
  • crates/noxa-rag/src/pipeline.rs
  • crates/noxa-rag/src/store/mod.rs
  • crates/noxa-rag/src/store/qdrant.rs
  • crates/noxa-rag/src/types.rs

Comment thread crates/noxa-rag/README.md Outdated
Comment thread crates/noxa-rag/README.md Outdated
Comment thread crates/noxa-rag/src/chunker.rs Outdated
Comment thread crates/noxa-rag/src/embed/tei.rs
Comment thread crates/noxa-rag/src/embed/tei.rs Outdated
Comment thread crates/noxa-rag/src/factory.rs
Comment thread crates/noxa-rag/src/factory.rs
Comment thread crates/noxa-rag/src/pipeline.rs
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
crates/noxa-rag/src/pipeline.rs (3)

264-273: ⚠️ Potential issue | 🟠 Major

Block IPv6 unique-local and link-local addresses to match IPv4 validation.

The IPv6 branch only checks is_loopback(), allowing fc00::/7 (unique-local) and fe80::/10 (link-local) addresses to bypass validation. Both is_unique_local() and is_unicast_link_local() are stable since Rust 1.84.0.

🔒 Proposed fix to complete IPv6 validation
 fn is_private_ip(host: &str) -> bool {
     if let Ok(addr) = host.parse::<IpAddr>() {
         return match addr {
             IpAddr::V4(ip) => ip.is_private() || ip.is_loopback() || ip.is_link_local(),
-            IpAddr::V6(ip) => ip.is_loopback(),
+            IpAddr::V6(ip) => ip.is_loopback() || ip.is_unique_local() || ip.is_unicast_link_local(),
         };
     }
     false
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/pipeline.rs` around lines 264 - 273, The function
is_private_ip currently checks only IPv6 loopback addresses and thus lets IPv6
unique-local (fc00::/7) and link-local (fe80::/10) addresses through; update the
IPv6 branch in is_private_ip to also return true for ip.is_unique_local() and
ip.is_unicast_link_local() (in addition to is_loopback()) so IPv6 validation
mirrors the IPv4 checks (private/link-local/loopback).

398-421: ⚠️ Potential issue | 🟠 Major

UUID and payload URL are derived from unnormalized chunk.source_url.

The URL is normalized at line 377 for the mutex key and delete_by_url, but chunk.source_url (from the raw ExtractionResult) is used for UUID generation (line 406) and PointPayload.url (line 413). If the source URL has a trailing slash or fragment that normalization removes, the UUID won't match subsequent delete operations, and duplicate vectors may accumulate.

🔧 Proposed fix to use normalized URL consistently
     // ── 6. Build points with deterministic UUID v5 ────────────────────────────
     let points: Vec<Point> = chunks
         .iter()
         .zip(vectors.iter())
         .enumerate()
         .map(|(i, (chunk, vector))| {
             let id = uuid::Uuid::new_v5(
                 &config.uuid_namespace,
-                format!("{}#chunk{}", chunk.source_url, i).as_bytes(),
+                format!("{}#chunk{}", url, i).as_bytes(),
             );
             Point {
                 id,
                 vector: vector.clone(),
                 payload: PointPayload {
                     text: chunk.text.clone(),
-                    url: chunk.source_url.clone(),
+                    url: url.clone(),
                     domain: chunk.domain.clone(),
                     chunk_index: chunk.chunk_index,
                     total_chunks: chunk.total_chunks,
                     token_estimate: chunk.token_estimate,
                 },
             }
         })
         .collect();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/pipeline.rs` around lines 398 - 421, The UUID and
PointPayload.url are being derived from raw chunk.source_url; change the mapping
in the points builder (where uuid::Uuid::new_v5 is called and PointPayload.url
is set) to use the normalized URL variable (the same normalized value used for
the mutex key/delete_by_url logic) instead of chunk.source_url so UUIDs and
payload URLs are consistent with deletes; keep using config.uuid_namespace for
UUID v5 but pass the normalized URL string into format!("{}#chunk{}",
normalized_url, i) and set PointPayload.url = normalized_url.clone().

423-433: ⚠️ Potential issue | 🟠 Major

Delete-before-upsert can lose data on transient store failures.

If delete_by_url succeeds but upsert fails (e.g., network blip, Qdrant overload), the document's vectors are deleted until a subsequent filesystem event re-triggers indexing. This is a known trade-off noted in past reviews. Consider documenting this behavior or implementing a two-phase replace (upsert new points with generation tag, then delete old) if data loss is unacceptable.
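A std-only sketch of that two-phase replace, using an in-memory map as a stand-in for the store (the `generation` field and the `delete_by_url_except_generation` name are assumptions taken from this comment, not the crate's actual API):

```rust
use std::collections::HashMap;

// Hypothetical point shape: `generation` is the tag this comment proposes
// adding to each point's payload.
#[derive(Clone, Debug, PartialEq)]
struct Point {
    id: String,
    generation: u64,
}

#[derive(Default)]
struct Store {
    points_by_url: HashMap<String, Vec<Point>>,
}

impl Store {
    fn upsert(&mut self, url: &str, mut points: Vec<Point>) -> Result<(), String> {
        self.points_by_url
            .entry(url.to_string())
            .or_default()
            .append(&mut points);
        Ok(())
    }

    // Evict every point for `url` that is not the generation we want to keep.
    fn delete_by_url_except_generation(&mut self, url: &str, keep: u64) {
        if let Some(points) = self.points_by_url.get_mut(url) {
            points.retain(|p| p.generation == keep);
        }
    }
}

// Two-phase replace: upsert the new generation first, and only evict the old
// one once the upsert has succeeded. A failed upsert leaves the previous
// generation intact instead of having already deleted it.
fn replace_document(
    store: &mut Store,
    url: &str,
    new_points: Vec<Point>,
    generation: u64,
) -> Result<(), String> {
    store.upsert(url, new_points)?;
    store.delete_by_url_except_generation(url, generation);
    Ok(())
}
```

The per-URL mutex would still wrap `replace_document` so concurrent replaces for the same URL stay serialized.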

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/pipeline.rs` around lines 423 - 433, The current
delete-then-upsert under the per-URL mutex (url_locks, url_lock) can cause
permanent data loss if store.delete_by_url(&url) succeeds but
store.upsert(points) fails; either document this behavior in the pipeline logic
or change to a two-phase replace: generate a new generation tag (e.g.,
generation_id) for the incoming points, call
store.upsert(points_with_generation) under the same mutex, then call a targeted
delete that removes only older generations (or delete_by_url_except_generation)
so an upsert failure does not remove the previous vectors; update any helper
methods (store.upsert, store.delete_by_url or add
delete_by_generation/cleanup_old_generations) and keep the same url_lock around
both operations to retain atomicity semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/noxa-rag/src/bin/noxa-rag-daemon.rs`:
- Around line 103-112: The probe call that computes embed_dims is redundant
because build_embed_provider already probes during TeiProvider::new_with_probe;
change build_embed_provider to return the discovered dimension alongside the
provider (e.g., return (DynEmbedProvider, usize)) and update the caller in
noxa-rag-daemon.rs to destructure that tuple (replace the embed + probe block
with let (embed, embed_dims) = build_embed_provider(&config).await?). Update
TeiProvider (or the provider wrapper) to expose dimensions via a dimensions()
accessor if needed so build_embed_provider can obtain and return the usize.

In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 236-249: The forced shutdown on worker drain timeout currently
calls std::process::exit(0) which reports a successful exit despite the abnormal
condition; change the exit to a non-zero code (e.g., std::process::exit(1)) so
supervisors can detect the failure. Locate the timeout handling around
tokio::time::timeout(Duration::from_secs(10), drain).await where the Err(_)
branch logs "workers did not drain within 10s, forcing exit" and replace the
exit status there; ensure the surrounding symbols (drain async block and
worker_handles) remain unchanged.

---

Duplicate comments:
In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 264-273: The function is_private_ip currently checks only IPv6
loopback addresses and thus lets IPv6 unique-local (fc00::/7) and link-local
(fe80::/10) addresses through; update the IPv6 branch in is_private_ip to also
return true for ip.is_unique_local() and ip.is_unicast_link_local() (in addition
to is_loopback()) so IPv6 validation mirrors the IPv4 checks
(private/link-local/loopback).
- Around line 398-421: The UUID and PointPayload.url are being derived from raw
chunk.source_url; change the mapping in the points builder (where
uuid::Uuid::new_v5 is called and PointPayload.url is set) to use the normalized
URL variable (the same normalized value used for the mutex key/delete_by_url
logic) instead of chunk.source_url so UUIDs and payload URLs are consistent with
deletes; keep using config.uuid_namespace for UUID v5 but pass the normalized
URL string into format!("{}#chunk{}", normalized_url, i) and set
PointPayload.url = normalized_url.clone().
- Around line 423-433: The current delete-then-upsert under the per-URL mutex
(url_locks, url_lock) can cause permanent data loss if store.delete_by_url(&url)
succeeds but store.upsert(points) fails; either document this behavior in the
pipeline logic or change to a two-phase replace: generate a new generation tag
(e.g., generation_id) for the incoming points, call
store.upsert(points_with_generation) under the same mutex, then call a targeted
delete that removes only older generations (or delete_by_url_except_generation)
so an upsert failure does not remove the previous vectors; update any helper
methods (store.upsert, store.delete_by_url or add
delete_by_generation/cleanup_old_generations) and keep the same url_lock around
both operations to retain atomicity semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: eb8b0e68-ed02-4df4-af56-61890ba6c40e

📥 Commits

Reviewing files that changed from the base of the PR and between 2e43439 and 23f3922.

📒 Files selected for processing (3)
  • crates/noxa-rag/src/bin/noxa-rag-daemon.rs
  • crates/noxa-rag/src/pipeline.rs
  • crates/noxa-rag/src/store/qdrant.rs

Comment thread crates/noxa-rag/src/bin/noxa-rag-daemon.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs
- #5: remove url_locks DashMap entries after drop when Arc::strong_count == 1
- #6/0xMassi#16/0xMassi#30: extend is_private_ip() to cover IPv6 unique-local (fc00::/7) and link-local (fe80::/10)
- #7: clarify error message — "IP literal" not "resolves to IP" (no DNS happens)
- #8: fix embed_concurrency validation message (no Semaphore in pipeline)
- #9/0xMassi#20/0xMassi#27: propagate real errors from 413 sub-batch via ? instead of masking with synthetic message
- #10: remove unused QdrantStatus struct
- #11/0xMassi#18/0xMassi#29: implement collection_vector_size() on QdrantStore; fail fast in build_vector_store on dim mismatch
- 0xMassi#15/0xMassi#21/0xMassi#25: rewrite overlap_prefix() O(n) with budget-check-before-insert fix
- 0xMassi#22: replace process::exit(0) in drain timeout with Err(RagError::Generic)
- 0xMassi#24: add 'text' language identifier to architecture diagram code fence (MD040)
- 0xMassi#26: error instead of silent fallback to DEFAULT_DIMENSIONS when TEI probe returns empty
- 0xMassi#32: log explicit tracing::error before returning Err on upsert failure after delete
- 0xMassi#33: build_embed_provider returns (DynEmbedProvider, usize); daemon uses returned dims, removes redundant probe
- #12/#13/#14/0xMassi#19/0xMassi#23: replace all gRPC/6334 references in README with REST/6333
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (1)
crates/noxa-rag/src/pipeline.rs (1)

405-423: ⚠️ Potential issue | 🟠 Major

UUID and payload use raw URL instead of normalized URL.

As noted earlier, the UUID generation (line 408) and PointPayload.url (line 415) use chunk.source_url, which contains the raw URL from the extraction result. However, delete_by_url at line 437 normalizes the URL before filtering.

This creates inconsistency:

  • http://example.com/page/ and http://example.com/page will get different UUIDs
  • Delete won't find points with trailing slashes if the new file lacks them

Use the normalized url variable (from line 379) for both UUID generation and payload.

Proposed fix
     let points: Vec<Point> = chunks
         .iter()
         .zip(vectors.iter())
         .enumerate()
        .map(|(i, (chunk, vector))| {
             let id = uuid::Uuid::new_v5(
                 &config.uuid_namespace,
-                format!("{}#chunk{}", chunk.source_url, i).as_bytes(),
+                format!("{}#chunk{}", url, i).as_bytes(),
             );
             Point {
                 id,
                 vector: vector.clone(),
                 payload: PointPayload {
                     text: chunk.text.clone(),
-                    url: chunk.source_url.clone(),
+                    url: url.clone(),
                     domain: chunk.domain.clone(),
                     chunk_index: chunk.chunk_index,
                     total_chunks: chunk.total_chunks,
                     token_estimate: chunk.token_estimate,
                 },
             }
         })
         .collect();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/pipeline.rs` around lines 405 - 423, The UUID generation
and payload URL currently use chunk.source_url inside the map closure (where
uuid::Uuid::new_v5 is called and PointPayload.url is set), causing mismatches
with delete_by_url which uses a normalized url; update the map to use the
already-computed normalized url variable (from earlier in the scope) for both
the new_v5 namespace input and PointPayload.url so UUIDs and stored URLs are
consistent with delete_by_url filtering (keep other payload fields unchanged and
continue cloning vector/text as before).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/noxa-rag/src/bin/noxa-rag-daemon.rs`:
- Around line 175-177: The pipeline error is being printed but the program still
returns Ok(()), so failures don't set a non-zero exit code; modify the code
around pipeline.run().await to propagate the error instead of swallowing
it—either return Err(e.into()) from main (or use the ? operator on
pipeline.run().await) so the process returns a non-zero exit status, or
explicitly call std::process::exit(1) after logging; locate the
pipeline.run().await call and adjust its handling to propagate the error (e.g.,
change the if let Err(e) = pipeline.run().await { ... } block to use ? or return
Err(e.into())).
- Around line 110-115: The match on config.embed_provider sets tokenizer_model
with a fallback branch that is unreachable because build_embed_provider returns
Err for non-Tei providers; update the match in noxa-rag-daemon.rs so the non-Tei
case is explicit: either remove the unreachable "_ => ..." fallback and instead
handle other variants of EmbedProviderConfig explicitly (e.g., match
EmbedProviderConfig::OpenAI | EmbedProviderConfig::VoyageAI => return Err or
panic with a clear message), or change the binding to return an Option/Result so
callers handle absence; target the tokenizer_model binding and the
EmbedProviderConfig::Tei arm when making this change.

In `@crates/noxa-rag/src/chunker.rs`:
- Around line 82-88: The chunker stores the raw result.metadata.url into
source_url (and uses extract_domain) while the pipeline normalizes URLs into a
local url variable but still uses chunk.source_url when building
PointPayload.url and generating UUIDs, causing mismatches with delete_by_url
which normalizes before filtering; fix by making URL normalization consistent:
either normalize and store the canonical URL in chunk.source_url in the chunker
(replace result.metadata.url -> normalized URL using the same normalization used
by delete_by_url) or update the pipeline to always use the normalized url
variable (not chunk.source_url) when generating UUIDs and setting
PointPayload.url; ensure the same normalization function is used in both places
so extract_domain, PointPayload.url, uuid generation, and delete_by_url all
operate on the identical canonical URL.
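Assuming normalization here means stripping the fragment and any trailing slash (the real code presumably uses a proper URL parser), a minimal shared helper would make the chunker, UUID seeding, payload, and delete_by_url all agree:

```rust
// Hypothetical canonicalization shared by every call site; a sketch, not the
// crate's actual normalization logic.
fn normalize_url(raw: &str) -> String {
    // Drop the fragment, if any.
    let without_fragment = raw.split('#').next().unwrap_or(raw);
    // Drop a single trailing slash.
    let trimmed = without_fragment
        .strip_suffix('/')
        .unwrap_or(without_fragment);
    trimmed.to_string()
}
```

With one function used everywhere, `http://example.com/page/` and `http://example.com/page` collapse to the same key, so UUIDs and deletes line up.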

In `@crates/noxa-rag/src/embed/tei.rs`:
- Around line 138-141: The retry backoff currently caps at 400ms by using
(delay_ms * 2).min(400) inside the retry loop that checks status and
MAX_RETRIES; update this to a larger or configurable cap: replace the hardcoded
400 with a constant or config variable (e.g., BACKOFF_MAX_MS or a field on the
client) and use (delay_ms * 2).min(backoff_max_ms) so callers can raise the cap
for aggressive 429 handling; ensure the new constant/config is used where
delay_ms is initialized and documented/tested.
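A sketch of the doubling backoff with a configurable cap (`BACKOFF_MAX_MS` and `backoff_schedule` are illustrative names, not the crate's API):

```rust
// Configurable cap, per the review; the hardcoded 400 becomes a constant
// (or a client field) so callers can raise it for aggressive 429 handling.
const BACKOFF_MAX_MS: u64 = 2_000;

// Returns the delays a retry loop would sleep between attempts:
// double each time, clamped to the cap.
fn backoff_schedule(initial_ms: u64, attempts: u32, cap_ms: u64) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay = initial_ms;
    for _ in 0..attempts {
        delays.push(delay);
        delay = (delay * 2).min(cap_ms);
    }
    delays
}
```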

In `@crates/noxa-rag/src/store/qdrant.rs`:
- Around line 31-34: The current CollectionVectors struct assumes the Qdrant
response has an unnamed vectors object with a direct size field, which breaks
for named-vector responses; update parsing used in collection_vector_size() to
accept both shapes by replacing CollectionVectors with a deserializable type
that handles either { "size": ... } or a map of named vectors (e.g.,
HashMap<String, VectorInfo>) or implement a custom Deserialize for
CollectionVectors that checks if the value is an object with "size" or an object
of named entries and then extracts a size; reference the CollectionVectors type
and the collection_vector_size() callsite so the deserializer change will be
used there.
- Around line 265-282: The current filter_map in the block that builds results
(the closure using hit.payload, extracting "text" and "url" and constructing
SearchResult) silently drops hits with missing fields; change this to explicitly
detect missing or malformed payload fields, log a warning with the hit id/score
and which field(s) were missing (use the existing logging/tracing facility in
the crate, e.g., tracing::warn! or log::warn!), and then skip the hit rather
than silently discarding it; update the closure around hit.payload /
payload.get("text") / payload.get("url") to record the missing-field details
before returning None and still construct SearchResult as before when all fields
are present.

---

Duplicate comments:
In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 405-423: The UUID generation and payload URL currently use
chunk.source_url inside the map closure (where uuid::Uuid::new_v5 is called and
PointPayload.url is set), causing mismatches with delete_by_url which uses a
normalized url; update the map to use the already-computed normalized url
variable (from earlier in the scope) for both the new_v5 namespace input and
PointPayload.url so UUIDs and stored URLs are consistent with delete_by_url
filtering (keep other payload fields unchanged and continue cloning vector/text
as before).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 58d62a4e-ed7b-4af5-b6a4-10bae3a01711

📥 Commits

Reviewing files that changed from the base of the PR and between 23f3922 and 31cb9d8.

📒 Files selected for processing (8)
  • crates/noxa-rag/README.md
  • crates/noxa-rag/src/bin/noxa-rag-daemon.rs
  • crates/noxa-rag/src/chunker.rs
  • crates/noxa-rag/src/config.rs
  • crates/noxa-rag/src/embed/tei.rs
  • crates/noxa-rag/src/factory.rs
  • crates/noxa-rag/src/pipeline.rs
  • crates/noxa-rag/src/store/qdrant.rs

Comment thread crates/noxa-rag/src/bin/noxa-rag-daemon.rs
Comment thread crates/noxa-rag/src/bin/noxa-rag-daemon.rs Outdated
Comment thread crates/noxa-rag/src/chunker.rs Outdated
Comment thread crates/noxa-rag/src/embed/tei.rs Outdated
Comment thread crates/noxa-rag/src/store/qdrant.rs
Comment thread crates/noxa-rag/src/store/qdrant.rs

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:452">
P2: The new `remove_if` eviction condition never matches because the local `url_lock` clone is still held, so lock entries are not cleaned up.</violation>
</file>
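The fix is to drop the local clone before the count check. A std-only sketch with `HashMap` standing in for `DashMap` (no `remove_if`, so the check-then-remove is spelled out):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Locks = HashMap<String, Arc<Mutex<()>>>;

// The bug this violation describes: holding a local clone keeps
// strong_count at 2, so a `strong_count == 1` eviction condition never
// fires. Dropping the clone first lets the entry actually be removed.
fn release_and_evict(locks: &mut Locks, url: &str, local_clone: Arc<Mutex<()>>) {
    drop(local_clone); // must happen before the count check
    let can_evict = locks
        .get(url)
        .map_or(false, |entry| Arc::strong_count(entry) == 1);
    if can_evict {
        locks.remove(url);
    }
}
```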

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/pipeline.rs
- pipeline: use normalized URL for UUID seed and payload.url so
  delete_by_url always matches (fixes delete-before-upsert no-op on
  trailing-slash / fragment URL variants)
- pipeline: drop url_lock Arc clone before remove_if so strong_count
  reaches 1 and DashMap entries are actually evicted (fixes unbounded
  map growth)
- daemon: propagate pipeline.run() error so process exits non-zero on
  failure (fixes misleading exit-0 for process supervisors)
- daemon: replace unreachable tokenizer fallback branch with explicit
  error (dead code cleanup)
- qdrant: log tracing::warn when search hit is dropped for missing
  payload fields (improves debuggability)
- tei: increase backoff cap from 400ms to 2s for sustained 429 relief

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
crates/noxa-rag/src/embed/tei.rs (1)

130-135: ⚠️ Potential issue | 🟠 Major

Stop using error-message parsing to trigger the 413 fallback.

The split-and-retry path still depends on msg.contains("413"). Any wording change in embed_batch() or any unrelated embed error containing 413 can take the wrong branch. Carry the status as structured data on RagError and match that instead.

Also applies to: 164-170
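A hedged sketch of the structured variant (the variant and field names are assumptions from this comment, not the crate's existing error type):

```rust
// Carry the HTTP status as data instead of burying it in the message.
#[derive(Debug)]
enum RagError {
    Embed { message: String, status: Option<u16> },
}

// Callers match on the structured status rather than msg.contains("413"),
// so a "413" appearing in unrelated error text can't take the wrong branch.
fn should_split_batch(err: &RagError) -> bool {
    matches!(err, RagError::Embed { status: Some(413), .. })
}
```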

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/embed/tei.rs` around lines 130 - 135, The error handling
currently embeds HTTP status info into freeform strings and other code (e.g.,
the embed_batch() caller) detects 413 by checking msg.contains("413"); change
RagError so it carries structured status data (e.g., add a variant or field like
RagError::Embed { message: String, status: Option<u16> } or similar), return
that structured error where you detect status.as_u16() == 413 in embed_batch(),
and update all call sites that check for "413" in the error string to instead
match on the new status field (match RagError::Embed { status: Some(413), .. }
). Ensure both the 130–135 and 164–170 error returns populate the status field.
crates/noxa-rag/src/store/qdrant.rs (1)

31-34: ⚠️ Potential issue | 🟡 Minor

Accept named-vector collection configs when validating dimensions.

collection_vector_size() only deserializes the single-vector shape (vectors.size). Qdrant also returns named-vector configs, so startup can fail against an existing collection even when the vector size is compatible. Deserialize both response shapes before comparing dimensions.

Also applies to: 168-174

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/store/qdrant.rs` around lines 31 - 34,
collection_vector_size() currently only deserializes the single-vector shape
into CollectionVectors { size } and will fail when Qdrant returns a named-vector
schema; change the parsing to first try the existing CollectionVectors path and,
if that fails, attempt deserializing the "vectors" field as a map of named
configs (e.g., a temporary CollectionNamedVectors { vectors: HashMap<String,
CollectionVectors> }) then extract the vector size(s) (ensure either all named
vectors have the same size or pick the appropriate named vector according to
existing logic) and use that size for the dimension comparison; apply the same
dual-deserialization fix wherever the same logic is duplicated (the other block
referenced in the review) so both single-vector and named-vector responses are
accepted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 429-456: After acquiring the mutex (url_lock / _guard) ensure the
DashMap entry is evicted on all return paths: if store.delete_by_url or
store.upsert fails you must drop the guard and the local Arc clone and call
url_locks.remove_if(&url, |_, v| Arc::strong_count(v) == 1) before returning the
error. Update the error branches around store.delete_by_url() and the upsert()
failure path (the block that logs "upsert failed after delete") to perform
drop(_guard); drop(url_lock); url_locks.remove_if(&url, |_, v|
Arc::strong_count(v) == 1); then return Err(e) so failed attempts don't leak
url_locks entries.
- Around line 153-177: The intermediate unbounded std::sync::mpsc::channel
(notify_tx/notify_rx) lets the debouncer emit events without backpressure —
replace it with a bounded channel (e.g., std::sync::mpsc::sync_channel) with a
small capacity or change the debouncer's event handler to call blocking_send()
directly so the debouncer will block when the Tokio tx (used in blocking_send)
is backpressured; update the code that constructs the debouncer via
new_debouncer and the bridge task (bridge_handle) to use the bounded
notify_tx/notify_rx (or to remove the intermediate channel and call
tx.blocking_send from the debouncer callback) so memory cannot grow unbounded
during filesystem churn.
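std's `sync_channel` gives exactly this bounded behavior; a small sketch showing that a full queue rejects (or, with blocking `send`, backpressures) further events instead of buffering them without bound:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Push `events` items through a channel of the given capacity without a
// consumer running; excess events are rejected rather than buffered.
// Returns (accepted, rejected, drained).
fn demo_backpressure(capacity: usize, events: usize) -> (usize, usize, usize) {
    let (tx, rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    let mut rejected = 0;
    for e in 0..events {
        match tx.try_send(e) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => rejected += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    drop(tx);
    // The consumer drains only what fit in the bounded buffer.
    let drained = rx.iter().count();
    (accepted, rejected, drained)
}
```

In the real bridge the debouncer callback would use the blocking `send` (or `blocking_send` on the Tokio side) so it stalls under load rather than dropping events; `try_send` here just makes the bound observable in a test.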

In `@crates/noxa-rag/src/store/qdrant.rs`:
- Around line 95-98: The reqwest client built in qdrant.rs (the variable client
from reqwest::Client::builder()) lacks any timeouts causing operations to
potentially hang; update the builder to set sensible timeouts (e.g.,
connect_timeout and overall request timeout via
connect_timeout(Duration::from_secs(...)) and timeout(Duration::from_secs(...)))
before calling build(), so all collection/search/delete/upsert requests using
this client will be bounded; preserve the existing
map_err(RagError::Config(...)) behavior when calling build().

---

Duplicate comments:
In `@crates/noxa-rag/src/embed/tei.rs`:
- Around line 130-135: The error handling currently embeds HTTP status info into
freeform strings and other code (e.g., the embed_batch() caller) detects 413 by
checking msg.contains("413"); change RagError so it carries structured status
data (e.g., add a variant or field like RagError::Embed { message: String,
status: Option<u16> } or similar), return that structured error where you detect
status.as_u16() == 413 in embed_batch(), and update all call sites that check
for "413" in the error string to instead match on the new status field (match
RagError::Embed { status: Some(413), .. } ). Ensure both the 130–135 and 164–170
error returns populate the status field.

In `@crates/noxa-rag/src/store/qdrant.rs`:
- Around line 31-34: collection_vector_size() currently only deserializes the
single-vector shape into CollectionVectors { size } and will fail when Qdrant
returns a named-vector schema; change the parsing to first try the existing
CollectionVectors path and, if that fails, attempt deserializing the "vectors"
field as a map of named configs (e.g., a temporary CollectionNamedVectors {
vectors: HashMap<String, CollectionVectors> }) then extract the vector size(s)
(ensure either all named vectors have the same size or pick the appropriate
named vector according to existing logic) and use that size for the dimension
comparison; apply the same dual-deserialization fix wherever the same logic is
duplicated (the other block referenced in the review) so both single-vector and
named-vector responses are accepted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e9cfe37-ba9b-48e2-bc99-824a2031cf33

📥 Commits

Reviewing files that changed from the base of the PR and between 31cb9d8 and 4ceb218.

📒 Files selected for processing (4)
  • crates/noxa-rag/src/bin/noxa-rag-daemon.rs
  • crates/noxa-rag/src/embed/tei.rs
  • crates/noxa-rag/src/pipeline.rs
  • crates/noxa-rag/src/store/qdrant.rs

Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs
Comment thread crates/noxa-rag/src/store/qdrant.rs
jmagar added 2 commits April 12, 2026 17:21
- pipeline: replace unbounded std::sync::mpsc::channel with
  sync_channel(256) + BoundedSender wrapper so backpressure from the
  Tokio job queue propagates back to the debouncer under sustained load
- pipeline: capture store_result before returning so url_locks.remove_if
  always runs on error paths (store outage no longer leaks DashMap entries)
- qdrant: add connect_timeout(5s) + timeout(30s) to reqwest client so
  stalled Qdrant connections don't block workers indefinitely

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
crates/noxa-rag/src/pipeline.rs (1)

447-476: ⚠️ Potential issue | 🟠 Major

Per-URL locking still doesn't make the replace atomic.

delete_by_url() and upsert() are still two independent store calls. If the delete succeeds and the upsert fails, the document is gone until another filesystem event replays it; with Qdrant's batched upserts, it can also come back only partially. Consider a versioned/two-phase replace instead of deleting first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/noxa-rag/src/pipeline.rs` around lines 447 - 476, The current
delete-then-upsert in the async block (store.delete_by_url(&url).await?;
store.upsert(points).await) is not atomic; if upsert fails you temporarily lose
or partially lose the document. Change to a versioned / two-phase replace:
produce a new versioned document (or batch) and call store.upsert for the new
version first, then atomically retire the old version (e.g.,
delete_by_url_excluding_version or conditional delete), or use a
conditional/compare-and-swap upsert API if the store supports it. Update the
logic around store.upsert, store.delete_by_url, and store_result so you upsert
new data before removing the old, ensure the eviction/drop of _guard and
url_locks remains unchanged, and add any version metadata handling required by
the store implementation to make replace idempotent and safe on failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 94-103: Reject or clamp embed_concurrency before creating the job
channel and spawning workers: check self.config.pipeline.embed_concurrency and
if it is 0 either return an error (fail fast) or set it to 1, so the bounded
channel (tx/rx) and worker pool loop (for worker_id in
0..self.config.pipeline.embed_concurrency) always has at least one worker to
drain jobs; update the code around the channel creation and the
worker_handles/for loop to use the validated/clamped value.
- Around line 211-246: The bridge uses tx_clone.blocking_send(job) which can
block indefinitely and prevent prompt shutdown; replace it with a
cancellation-aware loop that calls tx_clone.try_send(job) and handles
TrySendError: on Ok -> break out of the loop; on Full -> check
shutdown_clone.is_cancelled() and if cancelled break/return, otherwise
sleep/yield briefly (e.g., thread::sleep or std::thread::yield_now) and retry;
on Disconnected -> break/return. Keep the rest of the bridge logic
(shutdown_clone, drop(tx), bridge_handle.await) unchanged so the bridge can exit
immediately when cancellation is requested.
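A std-only sketch of that loop, with an `AtomicBool` standing in for the pipeline's cancellation token (an assumption; the real code appears to use a token with `is_cancelled()`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;
use std::time::Duration;

// Cancellation-aware send: retry try_send while the queue is full, but bail
// out as soon as shutdown is requested, so the bridge never blocks
// indefinitely on a stalled consumer.
fn send_or_cancel<T>(tx: &SyncSender<T>, mut job: T, shutdown: &AtomicBool) -> bool {
    loop {
        match tx.try_send(job) {
            Ok(()) => return true,
            Err(TrySendError::Full(j)) => {
                if shutdown.load(Ordering::Relaxed) {
                    return false; // give up promptly on shutdown
                }
                job = j; // take the job back and retry after a short sleep
                thread::sleep(Duration::from_millis(10));
            }
            Err(TrySendError::Disconnected(_)) => return false,
        }
    }
}
```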

In `@crates/noxa-rag/src/store/qdrant.rs`:
- Around line 189-218: The upsert() implementation currently upserts in
256-point chunks and returns on first failed batch, leaving earlier successful
chunks persisted and potentially causing partial documents; modify upsert() (the
loop building QdrantPoint batches and sending UpsertRequest via self.client.put)
to be atomic for a single logical document: either (recommended) avoid chunking
per logical document by sending all points for that URL in one UpsertRequest, or
(if chunking must be kept) track the IDs successfully written per chunk and on
any failure call the Qdrant delete-by-ids API (via self.client) to remove those
IDs before returning Err(RagError::Store(...)); ensure you reference the same
QdrantPoint.id values when deleting and propagate the original error text in the
returned RagError.

---

Duplicate comments:
In `@crates/noxa-rag/src/pipeline.rs`:
- Around line 447-476: The current delete-then-upsert in the async block
(store.delete_by_url(&url).await?; store.upsert(points).await) is not atomic; if
upsert fails you temporarily lose or partially lose the document. Change to a
versioned / two-phase replace: produce a new versioned document (or batch) and
call store.upsert for the new version first, then atomically retire the old
version (e.g., delete_by_url_excluding_version or conditional delete), or use a
conditional/compare-and-swap upsert API if the store supports it. Update the
logic around store.upsert, store.delete_by_url, and store_result so you upsert
new data before removing the old, ensure the eviction/drop of _guard and
url_locks remains unchanged, and add any version metadata handling required by
the store implementation to make replace idempotent and safe on failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c69020c6-19d7-4340-b72d-2a9d2588ea25

📥 Commits

Reviewing files that changed from the base of the PR and between 4ceb218 and b398294.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock and included by **/*
📒 Files selected for processing (2)
  • crates/noxa-rag/src/pipeline.rs
  • crates/noxa-rag/src/store/qdrant.rs

Comment thread crates/noxa-rag/src/pipeline.rs
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/store/qdrant.rs Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/embed/tei.rs">

<violation number="1" location="crates/noxa-rag/src/embed/tei.rs:63">
P2: The `max retries exceeded` error block is unreachable because the loop always returns before reaching it.</violation>
</file>
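
Both retry findings (the unreachable exhaustion block here, and the attempts-vs-retries off-by-one flagged in the next pass) reduce to the same loop shape. A minimal sketch, with the HTTP call abstracted as a closure returning a status code rather than the real TEI client:

```rust
const MAX_RETRIES: u32 = 3;

// `call` stands in for one TEI embed request: Err carries the HTTP
// status on failure.
fn retry_embed<F>(mut call: F) -> Result<String, String>
where
    F: FnMut() -> Result<String, u16>,
{
    // 1 initial attempt + MAX_RETRIES retries (0..=MAX_RETRIES runs
    // MAX_RETRIES + 1 times), avoiding the off-by-one regression.
    for _ in 0..=MAX_RETRIES {
        match call() {
            Ok(body) => return Ok(body),
            // 429/503 are retryable: loop again instead of returning,
            // so the exhaustion path below is actually reachable.
            Err(429) | Err(503) => continue,
            Err(status) => return Err(format!("non-retryable status {status}")),
        }
    }
    // Hit only after every attempt returned 429/503.
    Err("max retries exceeded".to_string())
}
```

A real implementation would also sleep with backoff between iterations; that is omitted here to keep the control flow visible.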

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/embed/tei.rs
Resolves review thread PRRT_kwDOR_mP6c56Z0UG
Resolves review thread PRRT_kwDOR_mP6c56ZvqF
Resolves review thread PRRT_kwDOR_mP6c56ZvqI
Resolves review thread PRRT_kwDOR_mP6c56ZvqJ
- make TEI retry exhaustion reachable on repeated 429/503 responses
- add a defensive zero-worker guard in Pipeline::run
- make bridge queue backpressure cancellation-aware
- upsert each logical document in a single Qdrant request

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/embed/tei.rs">

<violation number="1" location="crates/noxa-rag/src/embed/tei.rs:117">
P2: Off-by-one retry regression: `MAX_RETRIES` now limits total attempts instead of retries, so 429/503 responses get one fewer retry than configured.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/embed/tei.rs Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 15 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/store/qdrant.rs">

<violation number="1" location="crates/noxa-rag/src/store/qdrant.rs:147">
P1: Avoid byte-slicing error text; it can panic on non-ASCII UTF-8 and crash error handling.</violation>

<violation number="2" location="crates/noxa-rag/src/store/qdrant.rs:174">
P2: Do not create `content_type`/`file_format` indexes until those payload fields are written or queried; currently they add index build cost with no benefit.</violation>
</file>

<file name="crates/noxa-fetch/src/document.rs">

<violation number="1" location="crates/noxa-fetch/src/document.rs:114">
P2: Document responses fetched over HTTP are incorrectly labeled as `source_type = "file"`; they should be `"web"` for correct provenance.</violation>
</file>

<file name="crates/noxa-rag/src/embed/tei.rs">

<violation number="1" location="crates/noxa-rag/src/embed/tei.rs:172">
P1: Slicing the response body with `&body[..min(512)]` can panic on non-ASCII UTF-8 text; use a char-boundary-safe truncation for error previews.</violation>
</file>
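
The panic both P1 findings describe comes from byte-indexed slicing: `&body[..512]` panics whenever byte 512 lands inside a multi-byte UTF-8 sequence. The char-boundary-safe truncation the fix commit later adopts (`chars().take(512)`) looks like this:

```rust
// Taking characters instead of bytes can never split a code point, so
// the preview is safe for any UTF-8 body, at the cost of returning up
// to `max_chars` characters rather than exactly 512 bytes.
fn preview(body: &str, max_chars: usize) -> String {
    body.chars().take(max_chars).collect()
}
```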

<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:153">
P3: `files_indexed` is incremented for skipped jobs (0-chunk `Ok` paths), so indexed/average session metrics are inflated and inaccurate.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/store/qdrant.rs Outdated
Comment thread crates/noxa-rag/src/embed/tei.rs Outdated
Comment thread crates/noxa-rag/src/store/qdrant.rs Outdated
Comment thread crates/noxa-fetch/src/document.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs Outdated
…ath security, git metadata

noxa-cl1: is_indexable() now accepts 23 extensions (md/txt/log/rst/org/yaml/yml/toml/html/htm/
ipynb/pdf/docx/odt/pptx/jsonl/xml/opml/vtt/srt/rss/atom/json). Deferred: epub/eml/mbox.
validate_url_scheme() adds file:// with remote-host guard. collect_indexable_paths_recursive()
skips symlinks to prevent watch_dir traversal attacks. process_job() canonicalizes path and
checks starts_with(watch_dir) to block TOCTOU rename/hardlink attacks. Format dispatch
skeleton added (non-JSON prints warn, returns early pending noxa-iua ingester.rs).

noxa-9ww: detect_git_branch() walks .git/HEAD up from job.path — no subprocess. file_path
and last_modified populated from job.path metadata if not set by the source ingester.
git_branch added to PointPayload and populated at upsert time.

Tests: 5 new unit tests (is_indexable accepts/rejects extensions, detect_git_branch with/
without repo, detect_git_branch on detached HEAD). 459 tests passing workspace-wide.
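
The subprocess-free branch detection described above hinges on parsing `.git/HEAD`. A minimal sketch of just the HEAD-payload parse (the real detect_git_branch also walks parent directories from job.path to locate `.git`; that search is omitted here):

```rust
// Attached HEAD: "ref: refs/heads/<branch>".
// Detached HEAD holds a raw commit hash, so no branch name exists.
fn branch_from_head(head: &str) -> Option<String> {
    head.trim()
        .strip_prefix("ref: refs/heads/")
        .map(String::from)
}
```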

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:592">
P2: Binary formats (`.pdf`, `.docx`, `.odt`, `.pptx`) accepted by `is_indexable` will fail at `read_to_string` with a UTF-8 error before the format dispatch is reached, bypassing the intended graceful "not yet supported" skip. Move the extension check before the content read, or gate `read_to_string` on the format being text-based.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/pipeline.rs Outdated
…, web provenance fields

Three beads landed together (noxa-rlr, noxa-iua, noxa-i42):

noxa-rlr — startup scan with SHA-256 delta detection
- On daemon start, scans watch_dir before watcher connects
- Computes SHA-256 of each file; checks Qdrant content_hash to skip unchanged
- New VectorStore::url_with_hash_exists() method + QdrantStore impl
- tokio::select! backpressure guard prevents scan blocking shutdown
- Logs total/queued/skipped counts on completion
- Added sha2 = "0.10" dependency

noxa-iua — format-dispatch parse_file()
- Replaces read_to_string with read_to_end (Vec<u8>) for binary format support
- New parse_file(path, bytes) dispatches by extension: html, md, txt, yaml,
  toml, log, ipynb, pdf, docx, odt, pptx, jsonl, xml, opml, rss, vtt, srt
- parse_ipynb: strips cell outputs (PII/env dump guard)
- parse_pdf: via noxa_pdf::extract_pdf + noxa_pdf::to_markdown
- parse_office_zip: 100 MiB/entry + 1000-entry ZIP bomb guard;
  DOCX delegates to noxa_fetch::document::extract_document;
  ODT/PPTX via quick-xml text-node scan
- Added RagError::Parse variant; added noxa-fetch workspace dep
- make_text_result() helper builds ExtractionResult from pre-extracted text

noxa-i42 — web provenance fields in noxa-fetch metadata
- Added seed_url, crawl_depth, search_query, fetched_at to noxa-core::Metadata
- crawler.rs stamps seed_url + crawl_depth on each crawled page
- client.rs stamps fetched_at = Utc::now().to_rfc3339() on every fetch
- All 9 Metadata constructors across workspace updated
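
The noxa-rlr delta decision can be sketched as a pure function. The commit uses SHA-256 via the sha2 crate; std's `DefaultHasher` stands in here only so the sketch stays dependency-free, and the `known` map plays the role of the Qdrant `content_hash` lookup behind `url_with_hash_exists()`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for SHA-256 in the real code.
fn content_hash(bytes: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// True when the file should be re-enqueued for indexing: either the
/// URL was never indexed, or its on-disk content no longer matches the
/// stored hash.
fn needs_reindex(known: &HashMap<String, String>, url: &str, bytes: &[u8]) -> bool {
    match known.get(url) {
        Some(stored) => stored != &content_hash(bytes), // changed on disk
        None => true,                                   // never indexed
    }
}
```

On an unchanged file this skips the chunk/embed/upsert cycle entirely, which is what makes the startup scan cheap on a warm index.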

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 18 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="crates/noxa-rag/src/pipeline.rs">

<violation number="1" location="crates/noxa-rag/src/pipeline.rs:363">
P2: Qdrant errors during the startup delta check are silently discarded. If Qdrant is unreachable or misconfigured, every file will be re-enqueued with zero diagnostic output. Log the error at `warn` level before falling through to the conservative re-enqueue path.</violation>

<violation number="2" location="crates/noxa-rag/src/pipeline.rs:852">
P2: The per-entry decompressed-size guard (`MAX_ENTRY_SIZE`) is bypassed for DOCX because the function returns early before the entry-iteration loop. Only the `MAX_ENTRIES` count check runs. If `noxa_fetch::document::extract_document` lacks its own decompressed-size guard, a crafted DOCX zip bomb could cause excessive memory use.</violation>
</file>
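
The per-entry guard the second finding wants can be sketched over `(name, declared_uncompressed_size)` pairs; the real code reads these from the zip central directory, and the fix is to run this check for every format, including the DOCX path that later delegates to `noxa_fetch::document::extract_document`:

```rust
const MAX_ENTRY_SIZE: u64 = 100 * 1024 * 1024; // 100 MiB per entry
const MAX_ENTRIES: usize = 1000;

// Reject archives that declare too many entries or any oversized
// entry, before any decompression happens.
fn check_zip_entries(entries: &[(&str, u64)]) -> Result<(), String> {
    if entries.len() > MAX_ENTRIES {
        return Err(format!("too many entries: {}", entries.len()));
    }
    for (name, size) in entries {
        if *size > MAX_ENTRY_SIZE {
            return Err(format!("entry {name} exceeds {MAX_ENTRY_SIZE} bytes"));
        }
    }
    Ok(())
}
```

Note the declared size can itself lie for some zip variants, so a hardened parser would additionally cap bytes actually read per entry during extraction.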

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread crates/noxa-rag/src/pipeline.rs Outdated
Comment thread crates/noxa-rag/src/pipeline.rs
jmagar added 2 commits April 13, 2026 00:31
Format tests: md, txt, rst/org/yaml/toml group, log (ANSI strip), html,
ipynb (outputs stripped), docx (ZIP fixture), odt (ZIP fixture), pdf
(defensive — accepts Ok or Parse error from lopdf).

Utility tests: collect_indexable_paths (md/html/ipynb/json collected,
png excluded), is_indexable (rejects exe/png/unknown/nonexistent),
validate_url_scheme (4 cases — accepts file:///local, rejects
file://remotehost and ftp://).

EPUB deferred: not implemented in parse_file, is_indexable explicitly
skips it.
…prompt

build_schema_correction_prompt() re-validates the parsed value, extracts
instance_path + kind keyword from the first ValidationError, and formats:
  "Field '/price' failed 'type' check. Return ONLY corrected JSON."
Hard-capped at 200 chars. Never embeds raw model output or web content.

Removes the old capped_response assistant-message injection entirely.
Adds 3 tests: correction includes path/keyword, handles missing required
fields, and verifies raw model output is not embedded in retry messages.
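
The prompt shape described above can be sketched as a pure formatter; `instance_path` and `keyword` would come from the validator's first ValidationError, and by construction no raw model output or web content is ever interpolated:

```rust
// Builds the schema-correction retry prompt, hard-capped at 200 chars
// (char-boundary-safe, so a long path can't cause a slice panic).
fn schema_correction_prompt(instance_path: &str, keyword: &str) -> String {
    let msg = format!(
        "Field '{instance_path}' failed '{keyword}' check. Return ONLY corrected JSON."
    );
    msg.chars().take(200).collect()
}
```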
@cubic-dev-ai

cubic-dev-ai Bot commented Apr 13, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Resolves review threads:
- PRRT_kwDOR_mP6c56bfis: log Qdrant errors at warn in startup delta check
- PRRT_kwDOR_mP6c56bfit: add per-entry decompressed-size guard for DOCX before delegating to extract_document
- PRRT_kwDOR_mP6c56a_7y: fix byte-slice panic in qdrant.rs (chars().take(512))
- PRRT_kwDOR_mP6c56a_76: fix byte-slice panic in tei.rs (chars().take(512))
- PRRT_kwDOR_mP6c56a_78: remove unused content_type/file_format payload indexes
- PRRT_kwDOR_mP6c56a_7-: fix source_type for HTTP-fetched documents ("file"→"web")
- PRRT_kwDOR_mP6c56a_8B: only increment files_indexed for non-zero-chunk jobs
- PRRT_kwDOR_mP6c56Xguf: two-phase replace (upsert first, delete_stale second)

Changes:
- store/mod.rs: add delete_stale_by_url to VectorStore trait
- store/qdrant.rs: implement delete_stale_by_url using must_not has_id filter;
  fix all 8 byte-slice preview panics; drop content_type/file_format indexes
- embed/tei.rs: fix 2 byte-slice preview panics
- pipeline.rs: split Ok(false)|Err(_) arm to log warn on Qdrant error;
  add DOCX zip-bomb guard before noxa-fetch delegation; two-phase replace
  using upsert-then-delete_stale_by_url; gate files_indexed on chunks > 0
- noxa-fetch/src/document.rs: set source_type = "web" for HTTP documents
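The `must_not has_id` filter behind delete_stale_by_url can be sketched as the REST request body it would send: delete every point matching the URL whose id is not in the freshly upserted set. Built with `format!` to stay dependency-free (the real code would serialize via serde), and the field names follow Qdrant's points/delete filter schema as an assumption, not verified against a pinned API version:

```rust
// Body for POST /collections/{name}/points/delete (assumed endpoint):
// must match the url payload field, must_not match the kept ids.
fn stale_delete_body(url: &str, keep_ids: &[&str]) -> String {
    let ids: Vec<String> = keep_ids.iter().map(|id| format!("\"{id}\"")).collect();
    format!(
        "{{\"filter\":{{\"must\":[{{\"key\":\"url\",\"match\":{{\"value\":\"{url}\"}}}}],\"must_not\":[{{\"has_id\":[{}]}}]}}}}",
        ids.join(",")
    )
}
```

Because the kept ids are exactly those written in the preceding upsert, running this second completes the two-phase replace without a window where the document is absent.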
@jmagar jmagar merged commit 88364ca into main Apr 13, 2026
1 check passed
@jmagar jmagar deleted the feature/noxa-rag branch April 13, 2026 05:16