Port full upstream vertical extractor parity #13
Conversation
📝 Walkthrough

This PR introduces a comprehensive vertical extractor system to Noxa, adding 28 site-specific extractors (e.g., GitHub, Reddit, YouTube, npm) alongside infrastructure for URL dispatch, CLI selection/listing, and MCP integration. A new `vertical_data` payload is threaded through `ExtractionResult`.
🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: 4 passed, 1 failed (warning).
Pull request overview
This PR ports “vertical extractor” support into Noxa, adding a 28-extractor catalog in noxa-fetch, plumbing a new vertical_data payload through ExtractionResult, and exposing explicit extractor selection via the CLI and MCP while preserving existing generic scraping defaults.
Changes:
- Add `vertical_data: Option<VerticalData>` to `noxa_core::ExtractionResult` and update constructors/tests across the workspace.
- Introduce `noxa-fetch::extractors` (catalog + dispatch + per-site implementations) and integrate safe auto-dispatch + explicit dispatch into fetch and batch flows.
- Expose vertical extractors via CLI (`--list-extractors`, `--extractor`) and MCP (optional `scrape.extractor`, new `extractors` tool), plus documentation and reports.
Reviewed changes
Copilot reviewed 109 out of 110 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/superpowers/specs/2026-04-26-full-upstream-extractor-parity-design.md | Adds design spec for vertical extractor parity (catalog, dispatch, schema). |
| docs/superpowers/plans/2026-04-26-full-upstream-extractor-parity-plan.md | Adds step-by-step implementation plan for extractor parity and tests. |
| docs/reports/live-extractor-cli-report-2026-04-26.md | Adds live CLI sweep report covering 28/28 extractors. |
| docs/CHANGELOG.md | Documents new vertical extractor capabilities for fetch/CLI/MCP. |
| crates/noxa-store/src/content_store/tests.rs | Updates test helpers to include vertical_data: None. |
| crates/noxa-store/src/content_store/enumerate.rs | Refactors manifest cache fast-path with if let ... && style. |
| crates/noxa-rag/src/store/qdrant/tests.rs | Small refactors in test parsing + route matching. |
| crates/noxa-rag/src/pipeline/watcher.rs | Formatting / minor control-flow refactors in watcher setup and job dispatch. |
| crates/noxa-rag/src/pipeline/scan.rs | Refactors conditionals; adjusts formatting; minor cleanup. |
| crates/noxa-rag/src/pipeline/process.rs | Refactors log rotation conditional + formatting tweaks; minor code cleanup. |
| crates/noxa-rag/src/pipeline/parse/tests.rs | Updates test fixture ExtractionResult to include vertical_data: None. |
| crates/noxa-rag/src/pipeline/parse/mod.rs | Updates helper ExtractionResult construction to include vertical_data: None. |
| crates/noxa-rag/src/mcp_bridge.rs | Updates extraction construction to include vertical_data: None. |
| crates/noxa-mcp/src/tools.rs | Adds optional extractor: Option<String> to ScrapeParams + test. |
| crates/noxa-mcp/src/server/content_tools.rs | Updates placeholder extraction to include vertical_data: None. |
| crates/noxa-mcp/src/server.rs | Adds explicit vertical scrape path + new extractors tool + tests; updates tool list string. |
| crates/noxa-mcp/README.md | Documents optional scrape.extractor and new extractors tool. |
| crates/noxa-fetch/tests/fixtures/extractors/youtube_video.html | Adds fixture for YouTube player-response parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/trustpilot.html | Adds fixture for Trustpilot JSON-LD parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/substack_post.html | Adds fixture for Substack HTML extraction. |
| crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_question.json | Adds fixture for StackOverflow question API. |
| crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_answers.json | Adds fixture for StackOverflow answers API. |
| crates/noxa-fetch/tests/fixtures/extractors/shopify_product.json | Adds fixture for Shopify product .js API. |
| crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json | Adds fixture for Shopify collection products.json API. |
| crates/noxa-fetch/tests/fixtures/extractors/reddit.json | Adds fixture for Reddit JSON listing parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/pypi.json | Adds fixture for PyPI JSON API. |
| crates/noxa-fetch/tests/fixtures/extractors/product_page.html | Adds generic product page fixture (JSON-LD + OG/meta). |
| crates/noxa-fetch/tests/fixtures/extractors/npm_registry.json | Adds fixture for npm registry API. |
| crates/noxa-fetch/tests/fixtures/extractors/npm_downloads.json | Adds fixture for npm downloads API. |
| crates/noxa-fetch/tests/fixtures/extractors/linkedin_post.html | Adds fixture for LinkedIn embed HTML parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/instagram_profile.json | Adds fixture for Instagram web_profile_info API. |
| crates/noxa-fetch/tests/fixtures/extractors/instagram_post.html | Adds fixture for Instagram embed caption parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/huggingface_model.json | Adds fixture for Hugging Face model API. |
| crates/noxa-fetch/tests/fixtures/extractors/huggingface_dataset.json | Adds fixture for Hugging Face dataset API. |
| crates/noxa-fetch/tests/fixtures/extractors/hackernews.json | Adds fixture for HN Algolia item API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_repo.json | Adds fixture for GitHub repo API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_release.json | Adds fixture for GitHub release API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_pr.json | Adds fixture for GitHub PR API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_issue.json | Adds fixture for GitHub issue API. |
| crates/noxa-fetch/tests/fixtures/extractors/docker_hub.json | Adds fixture for Docker Hub repo API. |
| crates/noxa-fetch/tests/fixtures/extractors/dev_to.json | Adds fixture for dev.to article API. |
| crates/noxa-fetch/tests/fixtures/extractors/crates_io.json | Adds fixture for crates.io API. |
| crates/noxa-fetch/tests/fixtures/extractors/arxiv.xml | Adds fixture for arXiv Atom API parsing. |
| crates/noxa-fetch/src/sitemap.rs | Makes robots.txt sitemap parsing more robust; adds plausibility checks + test. |
| crates/noxa-fetch/src/reddit.rs | Adds Reddit JSON API UA helper + verification-wall detection + tests; sets vertical_data: None. |
| crates/noxa-fetch/src/linkedin.rs | Updates LinkedIn extraction result to include vertical_data: None. |
| crates/noxa-fetch/src/lib.rs | Exposes new extractors module publicly. |
| crates/noxa-fetch/src/extractors/youtube_video.rs | Adds YouTube video extractor (match + parse player response). |
| crates/noxa-fetch/src/extractors/woocommerce_product.rs | Adds WooCommerce product extractor (HTML JSON-LD parsing). |
| crates/noxa-fetch/src/extractors/trustpilot_reviews.rs | Adds Trustpilot reviews extractor (JSON-LD parsing). |
| crates/noxa-fetch/src/extractors/summary.rs | Adds small markdown summary helper. |
| crates/noxa-fetch/src/extractors/substack_post.rs | Adds Substack post extractor (HTML + JSON-LD + tag stripping). |
| crates/noxa-fetch/src/extractors/stackoverflow.rs | Adds StackOverflow question/answer extractor (StackExchange API). |
| crates/noxa-fetch/src/extractors/shopify_product.rs | Adds Shopify product extractor (product .js API). |
| crates/noxa-fetch/src/extractors/shopify_collection.rs | Adds Shopify collection extractor (products.json API). |
| crates/noxa-fetch/src/extractors/reddit.rs | Adds Reddit extractor wrapper around existing reddit parser. |
| crates/noxa-fetch/src/extractors/pypi.rs | Adds PyPI package extractor (PyPI JSON API). |
| crates/noxa-fetch/src/extractors/product.rs | Adds shared JSON-LD product/review parsing helpers. |
| crates/noxa-fetch/src/extractors/npm.rs | Adds npm package extractor (registry + downloads API). |
| crates/noxa-fetch/src/extractors/mod.rs | Adds extractor catalog, auto-dispatch, name dispatch, dispatch errors, and fixture-based tests. |
| crates/noxa-fetch/src/extractors/linkedin_post.rs | Adds LinkedIn post extractor via embed page OG/body parsing. |
| crates/noxa-fetch/src/extractors/instagram_profile.rs | Adds Instagram profile extractor via web_profile_info API. |
| crates/noxa-fetch/src/extractors/instagram_post.rs | Adds Instagram post extractor via embed caption parsing. |
| crates/noxa-fetch/src/extractors/huggingface_model.rs | Adds Hugging Face model extractor (HF API). |
| crates/noxa-fetch/src/extractors/huggingface_dataset.rs | Adds Hugging Face dataset extractor (HF API). |
| crates/noxa-fetch/src/extractors/http.rs | Introduces ExtractorHttp trait and implements it for FetchClient. |
| crates/noxa-fetch/src/extractors/hackernews.rs | Adds Hacker News item extractor (Algolia API). |
| crates/noxa-fetch/src/extractors/github_repo.rs | Adds GitHub repo extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_release.rs | Adds GitHub release extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_pr.rs | Adds GitHub PR extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_issue.rs | Adds GitHub issue extractor (GitHub API) with PR detection. |
| crates/noxa-fetch/src/extractors/etsy_listing.rs | Adds Etsy listing extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/ecommerce_product.rs | Adds generic ecommerce product extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/ebay_listing.rs | Adds eBay listing extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/docker_hub.rs | Adds Docker Hub repository extractor (Docker Hub API). |
| crates/noxa-fetch/src/extractors/dev_to.rs | Adds dev.to article extractor (dev.to API). |
| crates/noxa-fetch/src/extractors/crates_io.rs | Adds crates.io crate extractor (crates.io API). |
| crates/noxa-fetch/src/extractors/arxiv.rs | Adds arXiv paper extractor (Atom API parsing via quick-xml). |
| crates/noxa-fetch/src/extractors/amazon_product.rs | Adds Amazon product extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/document.rs | Ensures document extraction ExtractionResult includes vertical_data: None. |
| crates/noxa-fetch/src/crawler.rs | Adds glob-pattern validation + capacity checks + better semaphore-closed handling + tests. |
| crates/noxa-fetch/src/client/tests.rs | Adds unit test for build_vertical_extraction_result + raw response size-limit test helper. |
| crates/noxa-fetch/src/client/fetch.rs | Adds explicit vertical fetch method + auto-dispatch hook + vertical summary builder; hardens Reddit JSON request headers. |
| crates/noxa-fetch/src/client/batch.rs | Adds batch vertical extraction method with concurrency clamp. |
| crates/noxa-fetch/Cargo.toml | Adds async-trait and regex deps for extractor layer. |
| crates/noxa-core/src/types.rs | Adds VerticalData and ExtractionResult::vertical_data. |
| crates/noxa-core/src/structured_data.rs | Improves JSON-LD parsing robustness by escaping raw newlines inside JSON strings; adds test. |
| crates/noxa-core/src/llm/mod.rs | Updates test fixtures to include vertical_data: None. |
| crates/noxa-core/src/lib.rs | Exports VerticalData, initializes vertical_data: None in extraction results, adds serialization test. |
| crates/noxa-core/src/extractor/recovery.rs | Fixes markdown search logic for multibyte strings and empty needles; adds tests. |
| crates/noxa-core/src/diff.rs | Updates test fixtures to include vertical_data: None. |
| crates/noxa-cli/src/setup.rs | Updates MCP tool list output to include extractors. |
| crates/noxa-cli/src/config.rs | Refactors formatting + conditionals; no behavior change intended. |
| crates/noxa-cli/src/app/watch_singleton.rs | Refactors PID-file liveness check conditionals. |
| crates/noxa-cli/src/app/watch.rs | Hardens --on-change execution by avoiding shell evaluation (shlex split + direct exec). |
| crates/noxa-cli/src/app/tests_primary.rs | Adds CLI parser tests for extractor flags + catalog formatting tests; adjusts on-change tests. |
| crates/noxa-cli/src/app/store_ops.rs | Refactors unicode-safe display truncation loop. |
| crates/noxa-cli/src/app/retrieve.rs | Updates test fixture to include vertical_data: None. |
| crates/noxa-cli/src/app/rag_watch.rs | Refactors formatting + conditionals; minor logic tweaks. |
| crates/noxa-cli/src/app/rag_daemon.rs | Formatting change for path printing. |
| crates/noxa-cli/src/app/printing.rs | Adds extractor catalog printing/formatting helpers. |
| crates/noxa-cli/src/app/mod.rs | Wires extractor catalog printing into module exports; minor reordering. |
| crates/noxa-cli/src/app/fetching/extract.rs | Routes explicit extractor requests to fetch_and_extract_vertical and blocks unsupported combos. |
| crates/noxa-cli/src/app/entry.rs | Adds --list-extractors handling + centralized “unsupported extractor mode” validation. |
| crates/noxa-cli/src/app/crawl_watch.rs | Refactors formatting and cooldown logic for alerts. |
| crates/noxa-cli/src/app/cli.rs | Adds --extractor and --list-extractors flags. |
| crates/noxa-cli/src/app/batch.rs | Adds vertical extractor path for batch mode. |
| crates/noxa-cli/Cargo.toml | Adds shlex dependency. |
| README.md | Documents vertical extractor usage and updates MCP tool count/list. |
| Cargo.lock | Locks new dependencies (async-trait, regex, shlex). |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dfb0cdd616
```rust
if let Some(result) = crate::extractors::dispatch_by_url(self, url).await {
    let (extractor, data) = result?;
    return Ok(build_vertical_extraction_result(extractor, url, data));
```
There was a problem hiding this comment.
Preserve Reddit fallback on vertical extractor errors
This early return makes extractor failures terminal, so Reddit comment URLs no longer reach the legacy fallback logic below (is_reddit_url branch) that retries the .json path with Reddit-specific headers and can fall back to HTML extraction. In practice, when Reddit returns verification HTML from the JSON endpoint, dispatch_by_url now propagates an error immediately instead of attempting the previous fallback path, which regresses reliability for common www.reddit.com/.../comments/... inputs.
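A minimal sketch of one way to keep the fallback alive; the `is_reddit_url` helper and `tracing` call are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical restructure: make vertical-extractor failures non-terminal for
// Reddit URLs so the legacy .json + HTML fallback below still runs.
if let Some(result) = crate::extractors::dispatch_by_url(self, url).await {
    match result {
        Ok((extractor, data)) => {
            return Ok(build_vertical_extraction_result(extractor, url, data));
        }
        // Assumed helper: fall through to the Reddit-specific retry path.
        Err(err) if is_reddit_url(url) => {
            tracing::debug!("vertical extractor failed; using Reddit fallback: {err}");
        }
        Err(err) => return Err(err),
    }
}
```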
Resolves review threads PRRT_kwDOR_mP6c59sJHF, PRRT_kwDOR_mP6c59sJHO, PRRT_kwDOR_mP6c59sJG2, PRRT_kwDOR_mP6c59sJG6, PRRT_kwDOR_mP6c59sJG9, PRRT_kwDOR_mP6c59sJHD, PRRT_kwDOR_mP6c59sJGv, PRRT_kwDOR_mP6c59sJGz.
- Broaden Amazon/eBay TLD matchers and tighten Substack dispatch.
- Keep Reddit auto-fetch on the hardened JSON path while attaching vertical_data.
- Return Reddit-specific vertical payloads, parse YouTube nocookie embeds, and rename Hugging Face all-time download fields.
Actionable comments posted: 36
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/noxa-fetch/src/crawler.rs (1)
Lines 285-298: ⚠️ Potential issue | 🟡 Minor — Sitemap frontier cap also bounds entries that would later be excluded.

The cap uses `frontier.len() >= max_pages` before `qualify_link`/`is_excluded_by_pattern` runs. Once the cap is hit you `break`, which means matching excluded URLs that appear later in the sitemap are silently ignored rather than counted. Functionally fine for correctness, but the reported `excluded` count will under-report on large sitemaps. Worth a one-line comment if intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/noxa-fetch/src/crawler.rs` around lines 285 - 298, The frontier length cap is checked before calling qualify_link/is_excluded_by_pattern, causing later excluded sitemap entries to be skipped and the excluded counter under-reported; update the loop so you first call self.qualify_link(&entry.url, &visited) and self.is_excluded_by_pattern(&entry.url) to decide whether to push to frontier or increment excluded, and only then check if frontier.len() >= self.config.max_pages to break (or alternatively, if you must break early, increment excluded for any remaining entries before breaking); refer to frontier, self.config.max_pages, self.qualify_link, self.is_excluded_by_pattern, excluded and entry.url when making the change.
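For illustration, a sketch of the reordering this prompt describes; the method signatures (a bool-returning `qualify_link`, field names) are assumed from the comment, not verified against the crate:

```rust
// Sketch: classify each sitemap entry before applying the frontier cap so
// excluded URLs are always counted, even once the cap is reached.
for entry in sitemap_entries {
    if self.is_excluded_by_pattern(&entry.url) {
        excluded += 1;
        continue;
    }
    if !self.qualify_link(&entry.url, &visited) {
        continue;
    }
    if frontier.len() >= self.config.max_pages {
        break; // the cap now bounds only URLs that would enter the frontier
    }
    frontier.push(entry.url);
}
```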
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/noxa-cli/src/app/entry.rs`:
- Around line 309-345: unsupported_extractor_mode currently checks mode
incompatibilities but doesn't verify that the extractor name provided in
cli.extractor actually exists; add a validation step (in
unsupported_extractor_mode or immediately after its call in run()) that looks up
the provided extractor name against noxa_fetch::extractors::list() and returns
an appropriate error string (e.g., "unknown extractor: <name>") if not found so
users receive a single upfront error instead of per-URL failures later in
fetch_and_extract_vertical; reference the cli.extractor field, the
unsupported_extractor_mode function, noxa_fetch::extractors::list(), and
fetch_and_extract_vertical when adding this check.
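A hedged sketch of that upfront check; the exact shape of `noxa_fetch::extractors::list()` and its item type are assumptions based on the prompt:

```rust
// Validate the requested extractor name once, before any URL is fetched.
if let Some(name) = cli.extractor.as_deref() {
    let known = noxa_fetch::extractors::list()
        .iter()
        .any(|info| info.name == name);
    if !known {
        return Err(format!("unknown extractor: {name}"));
    }
}
```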
In `@crates/noxa-cli/src/app/tests_primary.rs`:
- Around line 256-264: The test
extractor_catalog_json_output_serializes_all_extractors currently asserts a
hardcoded count (assert_eq!(entries.len(), 28)), which will break when
extractors change; update the assertion to compare against the actual extractor
list instead (e.g., use noxa_fetch::extractors::list().len() as the expected
count) or at minimum assert entries.len() >= 28 so the test verifies
serialization without hardcoding; locate the test and change the assertion after
calling format_extractor_catalog(&OutputFormat::Json) and parsing to
serde_json::Value.
In `@crates/noxa-cli/src/app/watch.rs`:
- Around line 129-136: The parse_on_change_command function currently returns a
misleading message for shlex::split failures; update the shlex::split error
branch to use a broader, accurate message (e.g., "failed to parse command:
invalid quoting or escape sequences") instead of specifically mentioning
unterminated quotes, keeping the existing Err path for empty argv unchanged;
locate parse_on_change_command and change the .ok_or_else closure message to the
new wording so diagnostics reflect all parse errors from shlex::split (a sketch of this helper follows this file's comments).
- Around line 138-149: The README/CHANGELOG must document the behavioral change
that --on-change now uses shlex-style argv parsing (see run_on_change_command
and parse_on_change_command) and no longer supports shell features (pipes,
redirects, &&/||, globbing, env expansion, substitution); add a clear migration
note to README and a changelog entry stating this limitation, show an example
converting a shell recipe to an explicit script (e.g. --on-change
"/path/to/script.sh" or invoke sh -c explicitly), and include a short example
demonstrating the proper invocation and recommended wrapper script to preserve
previous behavior.
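A minimal sketch of the parsing helper from the first item above, with the broadened message; the function shape is assumed from the prompt, while `shlex::split` returning `None` on bad quoting or escapes is the crate's actual behavior:

```rust
fn parse_on_change_command(raw: &str) -> Result<Vec<String>, String> {
    // shlex::split returns None on invalid quoting or escape sequences.
    let argv = shlex::split(raw).ok_or_else(|| {
        "failed to parse command: invalid quoting or escape sequences".to_string()
    })?;
    if argv.is_empty() {
        return Err("empty --on-change command".to_string());
    }
    Ok(argv)
}
```

Shell features then become opt-in again via an explicit wrapper, e.g. `--on-change "sh -c 'do_a && do_b'"`.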
In `@crates/noxa-core/src/extractor/recovery.rs`:
- Around line 499-525: Add a test that asserts find_content_position returns
None when given an empty needle to lock the empty-needle guard behavior; update
the tests module (tests) by adding a case such as calling
find_content_position("anything", "") and asserting it equals None so the
contract is enforced alongside the existing multibyte tests that exercise
find_content_position.
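The requested test is a one-liner; the `find_content_position(haystack, needle) -> Option<usize>` signature is assumed from the surrounding description:

```rust
#[test]
fn find_content_position_rejects_empty_needle() {
    // Locks in the guard: an empty needle must never report a match.
    assert_eq!(find_content_position("anything", ""), None);
}
```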
In `@crates/noxa-core/src/structured_data.rs`:
- Around line 78-113: The recovery currently only escapes raw '\n' and '\r' in
escape_raw_newlines_in_json_strings; expand it to escape all unescaped control
characters U+0000..U+001F when in_string (e.g., map '\t' -> "\\t", '\x08' ->
"\\b", '\x0C' -> "\\f", '\n' -> "\\n", '\r' -> "\\r' and for any other control
like NUL or others use a "\\u00XX" hex escape), preserving existing escape
handling via escape_next and in_string flags in the function; update the match
for the in_string branch to detect ch <= '\u{1f}' and append the correct escape
sequence and set changed = true.
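A self-contained sketch of the broader in-string escaping; it assumes the surrounding function already tracks the `in_string`/`escape_next` state as described:

```rust
// Escape one raw control character encountered inside a JSON string literal.
fn escape_control_char(ch: char, out: &mut String) {
    match ch {
        '\n' => out.push_str("\\n"),
        '\r' => out.push_str("\\r"),
        '\t' => out.push_str("\\t"),
        '\u{8}' => out.push_str("\\b"), // backspace
        '\u{c}' => out.push_str("\\f"), // form feed
        c if c <= '\u{1f}' => out.push_str(&format!("\\u{:04x}", c as u32)),
        c => out.push(c),
    }
}
```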
In `@crates/noxa-fetch/src/client/fetch.rs`:
- Around line 445-467: The metadata.image is never populated for some vertical
extractors because build_vertical_extraction_result currently looks up
["image_url","thumbnail_url"] but substack_post::extract and product::extract
emit the key "image"; update build_vertical_extraction_result to include "image"
in the string_field lookup (alongside "image_url" and "thumbnail_url") so
metadata.image pulls that value, keeping string_field behavior intact for
non-string JSON values; reference build_vertical_extraction_result,
metadata.image, and string_field when locating the change and verify
substack_post::extract and product::extract continue to emit "image".
In `@crates/noxa-fetch/src/client/tests.rs`:
- Around line 72-88: The test file's formatting in the function
vertical_extraction_result_sets_vertical_payload_and_summary is not
cargo-fmt-compliant; run `cargo fmt` (or `cargo fmt --all`) to reformat
crates/noxa-fetch/src/client/tests.rs (and the workspace) so the test hunk
matches rustfmt expectations, then stage and commit the resulting changes so
CI's `cargo fmt --check --all` succeeds.
- Around line 306-324: Add a short doc comment to the spawn_raw_response_server
helper explaining that it accepts only a single connection then exits (unlike
spawn_status_server which loops), so it is intended for single-shot tests like
fetch_rejects_oversized_html_response_from_content_length and will return
connection refused on subsequent attempts; update the comment above the async fn
spawn_raw_response_server(response: String) -> String to clearly state this
single-accept behavior and intended usage.
In `@crates/noxa-fetch/src/crawler.rs`:
- Around line 561-587: The wildcard counter currently double-counts globstars
because wildcard_count (pattern.bytes().filter ... ) counts each '*' including
those in "**", masking the specific globstar limit; update validate_glob_pattern
so the pattern.matches("**") check runs before computing wildcard_count (or
compute globstar_count first and subtract 2*globstar_count from wildcard_count)
and return the "too many recursive wildcards" error first using the existing
MAX_GLOBSTARS check; adjust the logic around validate_glob_pattern,
pattern.matches("**"), and the wildcard_count computation so "**" are accounted
for only by the globstar limit and not twice-counted against MAX_GLOB_WILDCARDS.
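A sketch of the corrected counting order; the `MAX_GLOBSTARS`/`MAX_GLOB_WILDCARDS` values below are placeholders, not the crate's real limits:

```rust
const MAX_GLOBSTARS: usize = 2; // placeholder limit
const MAX_GLOB_WILDCARDS: usize = 8; // placeholder limit

fn validate_glob_pattern(pattern: &str) -> Result<(), String> {
    // Check the globstar limit first, on non-overlapping "**" occurrences.
    let globstar_count = pattern.matches("**").count();
    if globstar_count > MAX_GLOBSTARS {
        return Err("too many recursive wildcards".to_string());
    }
    // Each "**" consumed two '*' bytes; subtract them so single-star and
    // globstar limits are enforced independently.
    let star_count = pattern.bytes().filter(|b| *b == b'*').count();
    if star_count - 2 * globstar_count > MAX_GLOB_WILDCARDS {
        return Err("too many wildcards".to_string());
    }
    Ok(())
}
```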
In `@crates/noxa-fetch/src/extractors/amazon_product.rs`:
- Around line 6-15: The matches function currently restricts hosts to
"amazon.com" which contradicts INFO.url_patterns; update the matches
implementation so it accepts any Amazon TLD (e.g., amazon.co.uk, amazon.de,
amazon.co.jp, etc.) instead of only "amazon.com". Concretely, modify the host
check in matches (which currently calls host_matches(url, "amazon.com")) to
target the Amazon second-level domain broadly (for example by calling
host_matches(url, "amazon") or otherwise matching hosts that end with
".amazon.<tld>" or contain ".amazon.") while keeping the existing path checks
(url.contains("/dp/") || url.contains("/gp/product/")) so INFO and matches stay
consistent.
In `@crates/noxa-fetch/src/extractors/arxiv.rs`:
- Around line 45-59: The parse_id function currently only takes segs[1], which
breaks old-style arXiv IDs like /abs/cs/0301001v2 by returning "cs"; update
parse_id to join all path segments after the initial "abs" or "pdf" into one
identifier (e.g., join segs[1..].join("/")), then strip a trailing ".pdf" and
remove a version suffix like "vN" (detect 'v' followed exclusively by digits at
the end) before returning; keep the existing checks (URL parsing, non-empty
segments, allowed first segment) and return None for empty results, referencing
the parse_id function and its use of segs, stripped, and no_version (a sketch follows this file's comments).
- Around line 150-163: In parse_atom_entry's Ok(Event::Text(text)) handling,
don't use text.unescape().ok()? which aborts the whole entry on an unescape
error; instead attempt to unescape and if it fails fall back to the raw text (or
skip that node) so the parser continues and populates fields like entry.id,
entry.title (append_text), entry.summary, entry.published, entry.updated,
entry.authors, entry.doi, entry.comment without returning None for the entire
entry.
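A sketch of the joined-segment `parse_id` from the first arXiv item above; taking a parsed `url::Url` is an assumption about the real signature:

```rust
// Handles both new-style "/abs/2301.01234v2" and old-style "/abs/cs/0301001v2".
fn parse_id(url: &url::Url) -> Option<String> {
    let segs: Vec<&str> = url.path_segments()?.filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 || !matches!(segs[0], "abs" | "pdf") {
        return None;
    }
    let joined = segs[1..].join("/");
    let stripped = joined.strip_suffix(".pdf").unwrap_or(&joined);
    // Drop a trailing version suffix: a 'v' followed exclusively by digits.
    let no_version = match stripped.rfind('v') {
        Some(i)
            if i > 0
                && !stripped[i + 1..].is_empty()
                && stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) =>
        {
            &stripped[..i]
        }
        _ => stripped,
    };
    (!no_version.is_empty()).then(|| no_version.to_string())
}
```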
In `@crates/noxa-fetch/src/extractors/docker_hub.rs`:
- Around line 6-15: The url_patterns in the ExtractorInfo (INFO) currently only
advertises "https://hub.docker.com/r/*" but parse_repo and matches also accept
the official-library form under the underscore path, so update INFO.url_patterns
to include the official-library pattern (e.g., "https://hub.docker.com/_/*" or
similar wildcard) alongside the existing "/r/*" entry so the catalog and users
see both accepted URL forms; touch the INFO constant's url_patterns array (and
ensure no other behavior changes to matches or parse_repo).
In `@crates/noxa-fetch/src/extractors/ebay_listing.rs`:
- Around line 6-15: The matcher currently restricts to host_matches(url,
"ebay.com") while INFO.url_patterns advertises https://*.ebay.*/itm/*; update
matches() to accept any eBay TLD/subdomain instead of only ebay.com—e.g., parse
the URL host (or use host_matches with a wildcard) and check that the host
contains or ends with "ebay" (or matches pattern "ebay.*") and that the path
contains "/itm/"; keep the check tied to the existing matches() function and
host_matches helper so auto-dispatch aligns with INFO.url_patterns.
In `@crates/noxa-fetch/src/extractors/github_release.rs`:
- Around line 17-22: The extract function builds the GitHub API URL by
interpolating owner, repo, and tag directly (from parse_release) which can
contain reserved characters; implement a helper (e.g., encode_path_segment) that
percent-encodes a path segment using url::form_urlencoded::byte_serialize or
percent-encoding's utf8_percent_encode, then use it when constructing api_url in
extract (and the other usage around lines 51–66) so owner, repo, and tag are
encoded before being inserted into
"https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}".
In `@crates/noxa-fetch/src/extractors/huggingface_dataset.rs`:
- Around line 36-37: The output key `downloads_30d` is incorrectly mapped from
`downloadsAllTime`; update the mapping in the Hugging Face dataset extractor
(the code building the output map that uses the `dataset` variable in
crates/noxa-fetch/src/extractors/huggingface_dataset.rs) to rename that key to
`downloads_all_time` while keeping the value as
`dataset.get("downloadsAllTime").cloned()`, and ensure the existing
`"downloads": dataset.get("downloads").cloned()` line remains as the
trailing-30-day metric.
In `@crates/noxa-fetch/src/extractors/huggingface_model.rs`:
- Around line 35-37: The mapping currently puts the lifetime count
model.get("downloadsAllTime") under the key "downloads_30d", mislabeling
lifetime downloads as a 30‑day metric; update the mapping in the vertical_data
construction so that "downloads_30d" either reads
model.get("downloads").cloned() (if you intend a ~30‑day stat) or rename the key
to "downloads_all_time" (or similar) and keep
model.get("downloadsAllTime").cloned() (if you intend lifetime totals), and
remove any duplicate/conflicting "downloads" entry accordingly.
In `@crates/noxa-fetch/src/extractors/instagram_post.rs`:
- Around line 67-94: The three extraction functions (parse_username,
parse_caption, parse_thumbnail) currently call Regex::new at each invocation;
replace these with process-wide compiled statics using std::sync::LazyLock (or
OnceLock) so each pattern is compiled once: create static LazyLock<Regex>
entries for the patterns that match CaptionUsername, Caption (the div
class="Caption" capture), the user anchor inside Caption, the generic tag re,
and the EmbeddedMediaImage src pattern, then update parse_username,
parse_caption, and parse_thumbnail to use those static Regex instances instead
of Regex::new(...)? and drop the ? short-circuiting on regex compilation (these
statics are infallible once created). Ensure you reference the existing symbol
names parse_username, parse_caption, parse_thumbnail and the specific patterns
(CaptionUsername, Caption div, user_re, tag_re, EmbeddedMediaImage) when
replacing the Regex::new calls.
- Around line 96-103: The current html_decode function only replaces a few named
entities and leaves many numeric/named entities (e.g., &#39;, &nbsp;, &hellip;,
emoji numeric refs) un-decoded; add a dedicated HTML entity decoder crate to
Cargo.toml (for example the html_escape crate) and replace the body of
html_decode to call the crate’s decoder (e.g. html_escape::decode_html_entities)
so it returns a fully decoded String; update any imports to reference the
decoder and remove the manual .replace chain in the html_decode function.
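Hedged sketches for the two items above. First, a once-per-process regex via `std::sync::LazyLock`; the pattern shown is illustrative, not the extractor's exact one:

```rust
use std::sync::LazyLock;

use regex::Regex;

// Compiled once on first use; later calls reuse the same Regex.
static CAPTION_USERNAME_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r#"class="CaptionUsername"[^>]*>\s*([^<]+)"#).expect("static pattern is valid")
});

fn parse_username(html: &str) -> Option<String> {
    CAPTION_USERNAME_RE
        .captures(html)
        .map(|caps| caps[1].trim().to_string())
}
```

Second, delegating entity decoding to the `html_escape` crate, whose `decode_html_entities` handles named and numeric references:

```rust
fn html_decode(input: &str) -> String {
    // Replaces the manual .replace() chain; covers named, numeric, and emoji refs.
    html_escape::decode_html_entities(input).into_owned()
}
```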
In `@crates/noxa-fetch/src/extractors/instagram_profile.rs`:
- Around line 17-28: The extractor fails against real Instagram because the
request to api_url in extract(...) doesn't include the required X-IG-App-ID
header (and a valid session) so get_json(...) receives HTML/401; update the call
path so the request includes X-IG-App-ID: retrieve the app id from configuration
or the HTTP client and pass it as a header when calling client.get_json (or add
a get_json_with_headers method to the ExtractorHttp trait and use it here), and
ensure the client also supplies session cookies/auth as it manages for other
endpoints so the web_profile_info call returns JSON and user can be read from
body.pointer("/data/user").
In `@crates/noxa-fetch/src/extractors/mod.rs`:
- Around line 82-413: dispatch_by_url and dispatch_by_name duplicate the same
per-extractor boilerplate (calls to <extractor>::matches, <extractor>::extract,
and <extractor>::INFO.name plus run_or_mismatch) for many extractors; collapse
this into a single source-of-truth (either a macro_rules! table or a const
slice) that lists each extractor as a tuple/entry containing its matches
function, extract function, INFO.name, and an auto_dispatch boolean, then
rewrite dispatch_by_url to iterate the table and call matches/extract
dynamically and dispatch_by_name to lookup by INFO.name and call run_or_mismatch
with the stored matches/extract; keep run_or_mismatch, INFO.name, matches,
extract, and list() semantics and honor the auto_dispatch flag so explicit-only
extractors are excluded from automatic dispatch.
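One possible shape for that single source of truth is a `macro_rules!` table; the signatures below (an `ExtractorHttp` client, `serde_json::Value` payloads) are assumptions for illustration:

```rust
// Each entry: (module, participates_in_auto_dispatch).
macro_rules! extractor_table {
    ($(($module:ident, $auto:expr)),* $(,)?) => {
        pub async fn dispatch_by_url(
            client: &impl ExtractorHttp,
            url: &str,
        ) -> Option<Result<(&'static str, serde_json::Value), FetchError>> {
            $(
                if $auto && $module::matches(url) {
                    return Some(
                        $module::extract(client, url)
                            .await
                            .map(|data| ($module::INFO.name, data)),
                    );
                }
            )*
            None
        }
    };
}

// Explicit-only extractors set the flag to false and are skipped here.
extractor_table![(github_repo, true), (reddit, true), (linkedin_post, false)];
```

A `dispatch_by_name` can be generated from the same invocation, so adding an extractor touches a single line.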
In `@crates/noxa-fetch/src/extractors/pypi.rs`:
- Around line 13-15: The matches function is looser than parse_project and can
mis-dispatch URLs (e.g., query strings containing "/project/"); update matches
to use the same path-check logic as parse_project instead of
url.contains("/project/") so both use parse_project for consistency — locate the
matches(url: &str) function and replace the url.contains check with a call to
parse_project (or the same path-first-segment logic), keeping host_matches(url,
"pypi.org") and ensuring mis-routed URLs no longer reach the extractor and
trigger FetchError::Build.
In `@crates/noxa-fetch/src/extractors/reddit.rs`:
- Around line 13-15: The current matches function (matches in reddit.rs) uses
url.contains("/comments/") which can match query/fragment; instead parse the URL
(e.g., with url::Url::parse) after validating host via host_matches, then
inspect the path segments (Url::path_segments or segments iterator) and check
that one of the path segments equals "comments" so only actual path occurrences
match; update matches to return false on parse error and fall back to the host
check + path-segment check accordingly.
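A sketch of the path-segment check, reusing the existing `host_matches` helper the prompt mentions:

```rust
fn matches(url: &str) -> bool {
    let Ok(parsed) = url::Url::parse(url) else {
        return false;
    };
    // "comments" must be a real path segment, not a query/fragment substring.
    host_matches(url, "reddit.com")
        && parsed
            .path_segments()
            .is_some_and(|mut segs| segs.any(|seg| seg == "comments"))
}
```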
In `@crates/noxa-fetch/src/extractors/shopify_collection.rs`:
- Around line 17-25: The api_url construction in extract() currently appends
"/products.json" to the raw input string (using url.trim_end_matches('/')),
which breaks when the input contains query strings or fragments; instead parse
the input with url::Url (e.g., Url::parse(url)), clear query and fragment
(url.set_query(None); url.set_fragment(None)), get the cleaned base
(url.as_str() or url.to_string()), trim any trailing slash from that cleaned
base, then build api_url = format!("{}/products.json", cleaned_base_trimmed).
Update the extract function to use this parsed/cleaned URL before calling
client.get_json so api_url is never polluted by ? or # parts.
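A sketch of the query/fragment-safe URL construction with `url::Url` (the helper name is hypothetical):

```rust
fn collection_api_url(raw: &str) -> Option<String> {
    let mut parsed = url::Url::parse(raw).ok()?;
    // Drop "?page=2" / "#reviews" so they never end up inside the API path.
    parsed.set_query(None);
    parsed.set_fragment(None);
    let base = parsed.as_str().trim_end_matches('/').to_string();
    Some(format!("{base}/products.json"))
}
```

The shopify_product case below is the same idea, except the cleaned path gets a ".js" suffix instead.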
In `@crates/noxa-fetch/src/extractors/shopify_product.rs`:
- Around line 17-19: The product_url construction in extract() incorrectly
appends ".js" to the full input string (variable product_url), breaking URLs
that include queries or fragments; parse the incoming URL (e.g. using the
url::Url type), modify only its path by trimming trailing '/' and appending
".js", then set that new path back on the parsed URL and rebuild the product_url
string before calling client.get_json(&product_url). This ensures query and
fragment components (if present) are preserved while the API path ends with
".js".
In `@crates/noxa-fetch/src/extractors/stackoverflow.rs`:
- Around line 13-15: The matches() function currently accepts non-question
StackOverflow paths and should be tightened to only match individual question
pages; update matches(url: &str) to require host_matches(url,
"stackoverflow.com") && a path segment pattern of "/questions/<numeric_id>"
(i.e., ensure there is a numeric segment immediately after "/questions/") so
parse_question_id and extract() only run for real question URLs; reference the
matches function and parse_question_id to implement the numeric-segment check
(e.g., parse the URL path and verify the segment following "questions" is all
digits) and reject URLs like "/questions/tagged" or "/questions?..." to preserve
the safe auto-dispatch guarantee.
In `@crates/noxa-fetch/src/extractors/substack_post.rs`:
- Around line 14-25: In the matches function replace the incorrect host check so
it only accepts substack-hosted domains: inside the closure that defines host
and has_post_path, change the condition from host.ends_with(".substack.com") ||
host != "substack.com" to host.ends_with(".substack.com") || host ==
"substack.com" so only "substack.com" and its subdomains match; leave the
existing path check (has_post_path using parsed.path_segments and "p"/slug)
intact, or if you intentionally want to allow arbitrary custom domains add a
separate predicate instead of changing this one.
In `@crates/noxa-fetch/src/extractors/trustpilot_reviews.rs`:
- Around line 6-15: INFO.url_patterns is narrower than the matches() logic:
update ExtractorInfo::url_patterns so the advertised patterns reflect the same
hosts accepted by matches() (which currently uses host_matches(url,
"trustpilot.com") and therefore allows country subdomains). Modify
INFO.url_patterns to include patterns for subdomains (for example add
"https://*.trustpilot.com/review/*" and "https://trustpilot.com/review/*") so
--list-extractors shows the same scope as the matches() function.
In `@crates/noxa-fetch/src/extractors/youtube_video.rs`:
- Around line 14-18: The matches() function currently treats
"youtube-nocookie.com" as a YouTube host but parse_video_id() only recognizes
"youtu.be" and hosts that end_with("youtube.com"), causing nocookie URLs to
fail; update parse_video_id() to treat hosts that
end_with("youtube-nocookie.com") the same as "youtube.com" (i.e., branch the
same way you do for host.ends_with("youtube.com") to extract IDs from /watch?v=,
/embed/, /shorts/, etc.), or alternatively remove "youtube-nocookie.com" from
matches() so it is not routed here—apply the same fix for the other parser
branch mentioned (lines covering the second occurrence, the 59-77 area) so both
matcher and parser are consistent.
- Around line 79-82: The regex in extract_player_response currently requires a
literal "var " before ytInitialPlayerResponse; update the Regex::new call in the
extract_player_response function to accept an optional "var" prefix (e.g. use a
non-capturing optional group like (?:var\s+)? before ytInitialPlayerResponse) so
it matches both "var ytInitialPlayerResponse = {...};" and
"ytInitialPlayerResponse = {...};", then keep extracting capture group 1 and
parsing with serde_json as before.
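The relaxed pattern is a one-line change; this standalone form assumes the original captures the JSON object in group 1:

```rust
// "(?s)" lets "." span newlines inside the serialized player response;
// "(?:var\s+)?" makes the declaration keyword optional.
let re = regex::Regex::new(r"(?s)(?:var\s+)?ytInitialPlayerResponse\s*=\s*(\{.*?\});")
    .expect("static pattern is valid");
```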
In `@crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json`:
- Line 11: The fixture's "images" array in shopify_collection.json currently
contains bare URL strings but must be an array of image objects matching
Shopify's real shape (e.g., objects with keys like id, product_id, position,
src, width, height, alt). Update the "images" value to an array of objects
(include realistic values for id, product_id, position and use "src" instead of
raw string), and adjust any tests/expectations that read image URLs to use the
image.src path (or the extractor behavior that accesses src) so test payloads
mirror the real /products.json response.
In `@crates/noxa-mcp/README.md`:
- Line 42: The README sentence hardcodes "28 supported extractors", which will
drift; update the line referencing `scrape` to remove the numeric count and
instead refer readers to the `extractors` tool for the current list (e.g.,
"`scrape` accepts an optional `extractor` string for explicit vertical
extraction; use the `extractors` tool to list available extractors."). Ensure
the updated wording mentions `scrape` and `extractors` so readers know how to
discover the current set.
In `@crates/noxa-mcp/src/server.rs`:
- Around line 1047-1048: The test currently asserts a fixed catalog size with
assert_eq!(entries.len(), 28), which couples it to catalog growth; change this
to assert that entries.len() is >= a baseline (e.g., assert!(entries.len() >=
28)) and add assertions to enforce uniqueness of names and required verticals:
use a HashSet over entry["name"] (same entries variable) to assert set.len() ==
entries.len() and keep the existing contains check like
entries.iter().any(|entry| entry["name"] == "github_repo") to verify required
items.
In `@crates/noxa-rag/src/store/qdrant/tests.rs`:
- Around line 73-77: The header-parsing logic is duplicated between
spawn_test_server and spawn_test_server_with_status; extract the shared code
into a helper function (e.g., read_request(stream: &mut TcpStream) ->
RecordedRequest or read_request_and_status if status is needed) that
encapsulates reading lines, parsing headers (including the content-length logic
using split_once and parse), and building the RecordedRequest, then call this
helper from both spawn_test_server and spawn_test_server_with_status to remove
the verbatim copy and prevent drift.
---
Outside diff comments:
In `@crates/noxa-fetch/src/crawler.rs`:
- Around line 285-298: The frontier length cap is checked before calling
qualify_link/is_excluded_by_pattern, causing later excluded sitemap entries to
be skipped and the excluded counter under-reported; update the loop so you first
call self.qualify_link(&entry.url, &visited) and
self.is_excluded_by_pattern(&entry.url) to decide whether to push to frontier or
increment excluded, and only then check if frontier.len() >=
self.config.max_pages to break (or alternatively, if you must break early,
increment excluded for any remaining entries before breaking); refer to
frontier, self.config.max_pages, self.qualify_link, self.is_excluded_by_pattern,
excluded and entry.url when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5cc8a691-d00d-4f5b-8479-e8eaba76162d
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock` and included by `**/*`
📒 Files selected for processing (109)
README.md, crates/noxa-cli/Cargo.toml, crates/noxa-cli/src/app/batch.rs, crates/noxa-cli/src/app/cli.rs, crates/noxa-cli/src/app/crawl_watch.rs, crates/noxa-cli/src/app/entry.rs, crates/noxa-cli/src/app/fetching/extract.rs, crates/noxa-cli/src/app/mod.rs, crates/noxa-cli/src/app/printing.rs, crates/noxa-cli/src/app/rag_daemon.rs, crates/noxa-cli/src/app/rag_watch.rs, crates/noxa-cli/src/app/retrieve.rs, crates/noxa-cli/src/app/store_ops.rs, crates/noxa-cli/src/app/tests_primary.rs, crates/noxa-cli/src/app/watch.rs, crates/noxa-cli/src/app/watch_singleton.rs, crates/noxa-cli/src/config.rs, crates/noxa-cli/src/setup.rs, crates/noxa-core/src/diff.rs, crates/noxa-core/src/extractor/recovery.rs, crates/noxa-core/src/lib.rs, crates/noxa-core/src/llm/mod.rs, crates/noxa-core/src/structured_data.rs, crates/noxa-core/src/types.rs, crates/noxa-fetch/Cargo.toml, crates/noxa-fetch/src/client/batch.rs, crates/noxa-fetch/src/client/fetch.rs, crates/noxa-fetch/src/client/tests.rs, crates/noxa-fetch/src/crawler.rs, crates/noxa-fetch/src/document.rs, crates/noxa-fetch/src/extractors/amazon_product.rs, crates/noxa-fetch/src/extractors/arxiv.rs, crates/noxa-fetch/src/extractors/crates_io.rs, crates/noxa-fetch/src/extractors/dev_to.rs, crates/noxa-fetch/src/extractors/docker_hub.rs, crates/noxa-fetch/src/extractors/ebay_listing.rs, crates/noxa-fetch/src/extractors/ecommerce_product.rs, crates/noxa-fetch/src/extractors/etsy_listing.rs, crates/noxa-fetch/src/extractors/github_issue.rs, crates/noxa-fetch/src/extractors/github_pr.rs, crates/noxa-fetch/src/extractors/github_release.rs, crates/noxa-fetch/src/extractors/github_repo.rs, crates/noxa-fetch/src/extractors/hackernews.rs, crates/noxa-fetch/src/extractors/http.rs, crates/noxa-fetch/src/extractors/huggingface_dataset.rs, crates/noxa-fetch/src/extractors/huggingface_model.rs, crates/noxa-fetch/src/extractors/instagram_post.rs, crates/noxa-fetch/src/extractors/instagram_profile.rs, crates/noxa-fetch/src/extractors/linkedin_post.rs, crates/noxa-fetch/src/extractors/mod.rs, crates/noxa-fetch/src/extractors/npm.rs, crates/noxa-fetch/src/extractors/product.rs, crates/noxa-fetch/src/extractors/pypi.rs, crates/noxa-fetch/src/extractors/reddit.rs, crates/noxa-fetch/src/extractors/shopify_collection.rs, crates/noxa-fetch/src/extractors/shopify_product.rs, crates/noxa-fetch/src/extractors/stackoverflow.rs, crates/noxa-fetch/src/extractors/substack_post.rs, crates/noxa-fetch/src/extractors/summary.rs, crates/noxa-fetch/src/extractors/trustpilot_reviews.rs, crates/noxa-fetch/src/extractors/woocommerce_product.rs, crates/noxa-fetch/src/extractors/youtube_video.rs, crates/noxa-fetch/src/lib.rs, crates/noxa-fetch/src/linkedin.rs, crates/noxa-fetch/src/reddit.rs, crates/noxa-fetch/src/sitemap.rs, crates/noxa-fetch/tests/fixtures/extractors/arxiv.xml, crates/noxa-fetch/tests/fixtures/extractors/crates_io.json, crates/noxa-fetch/tests/fixtures/extractors/dev_to.json, crates/noxa-fetch/tests/fixtures/extractors/docker_hub.json, crates/noxa-fetch/tests/fixtures/extractors/github_issue.json, crates/noxa-fetch/tests/fixtures/extractors/github_pr.json, crates/noxa-fetch/tests/fixtures/extractors/github_release.json, crates/noxa-fetch/tests/fixtures/extractors/github_repo.json, crates/noxa-fetch/tests/fixtures/extractors/hackernews.json, crates/noxa-fetch/tests/fixtures/extractors/huggingface_dataset.json, crates/noxa-fetch/tests/fixtures/extractors/huggingface_model.json, crates/noxa-fetch/tests/fixtures/extractors/instagram_post.html, crates/noxa-fetch/tests/fixtures/extractors/instagram_profile.json, crates/noxa-fetch/tests/fixtures/extractors/linkedin_post.html, crates/noxa-fetch/tests/fixtures/extractors/npm_downloads.json, crates/noxa-fetch/tests/fixtures/extractors/npm_registry.json, crates/noxa-fetch/tests/fixtures/extractors/product_page.html, crates/noxa-fetch/tests/fixtures/extractors/pypi.json, crates/noxa-fetch/tests/fixtures/extractors/reddit.json, crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json, crates/noxa-fetch/tests/fixtures/extractors/shopify_product.json, crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_answers.json, crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_question.json, crates/noxa-fetch/tests/fixtures/extractors/substack_post.html, crates/noxa-fetch/tests/fixtures/extractors/trustpilot.html, crates/noxa-fetch/tests/fixtures/extractors/youtube_video.html, crates/noxa-mcp/README.md, crates/noxa-mcp/src/server.rs, crates/noxa-mcp/src/server/content_tools.rs, crates/noxa-mcp/src/tools.rs, crates/noxa-rag/src/mcp_bridge.rs, crates/noxa-rag/src/pipeline/parse/mod.rs, crates/noxa-rag/src/pipeline/parse/tests.rs, crates/noxa-rag/src/pipeline/process.rs, crates/noxa-rag/src/pipeline/scan.rs, crates/noxa-rag/src/pipeline/watcher.rs, crates/noxa-rag/src/store/qdrant/tests.rs, crates/noxa-store/src/content_store/enumerate.rs, crates/noxa-store/src/content_store/tests.rs, docs/CHANGELOG.md, docs/reports/live-extractor-cli-report-2026-04-26.md, docs/superpowers/plans/2026-04-26-full-upstream-extractor-parity-plan.md, docs/superpowers/specs/2026-04-26-full-upstream-extractor-parity-design.md
```rust
assert_eq!(entries.len(), 28);
assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
```
🧹 Nitpick | 🔵 Trivial
Hard-coded catalog count couples test to catalog size.
`assert_eq!(entries.len(), 28)` will fail any time a future PR adds (or temporarily removes) a vertical, even when behavior is correct. Consider asserting >= a baseline plus uniqueness of `name`, which is what the design doc actually requires.
```diff
- assert_eq!(entries.len(), 28);
- assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
+ assert!(entries.len() >= 28, "catalog shrank: {}", entries.len());
+ assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
+ let mut names: Vec<_> = entries.iter().map(|e| e["name"].as_str().unwrap()).collect();
+ names.sort();
+ let unique = names.len();
+ names.dedup();
+ assert_eq!(unique, names.len(), "duplicate extractor names in catalog");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/noxa-mcp/src/server.rs` around lines 1047 - 1048, The test currently
asserts a fixed catalog size with assert_eq!(entries.len(), 28), which couples
it to catalog growth; change this to assert that entries.len() is >= a baseline
(e.g., assert!(entries.len() >= 28)) and add assertions to enforce uniqueness of
names and required verticals: use a HashSet over entry["name"] (same entries
variable) to assert set.len() == entries.len() and keep the existing contains
check like entries.iter().any(|entry| entry["name"] == "github_repo") to verify
required items.
- Tighten vertical URL matchers and query-safe Shopify API URLs.
- Improve YouTube, Instagram, arXiv, GitHub release, and structured-data edge cases.
- Remove drift-prone extractor counts from docs/tests and document on-change shell behavior.
Summary
- Ports the upstream vertical extractor catalog into `noxa-fetch`, including `vertical_data` on `ExtractionResult`, static dispatch, safe URL auto-dispatch, and explicit extractor dispatch.
- CLI: `--list-extractors` and `--extractor <name>` for single URL and batch modes.
- MCP: optional `scrape.extractor` and a new `extractors` catalog tool.

Verification
- `cargo test --workspace`
- `cargo clippy --workspace --all-targets -- -D warnings`
- Live CLI sweep: docs/reports/live-extractor-cli-report-2026-04-26.md

Notes
`noxa-hv9` remains open for separate optional non-extractor parity items: Docker entrypoint shim, crawler scope knobs, and Safari iOS profile.

Summary by cubic
Ports full vertical extractor parity: site-specific extractors with catalog, static/auto/explicit dispatch, and `vertical_data` in results. Exposes extractors via CLI (`--list-extractors`, `--extractor`) and MCP (`scrape.extractor`, `extractors` tool).

New Features
- Rejects unsupported combinations of `--extractor` with `--stdin`/`--file`/`--cloud`.
- `scrape` accepts an optional `extractor`; new `extractors` catalog tool.
- Results include a `vertical_data` payload.

Bug Fixes
- Parses `--on-change` commands with `shlex` to prevent shell injection.
Summary by CodeRabbit
Release Notes
- `--list-extractors` CLI flag to display available vertical extractors.
- `--extractor` CLI flag to explicitly select a vertical extractor for URL processing.
- `extractors` MCP tool to retrieve the full catalog of available extractors.
- `scrape` MCP tool with optional `extractor` parameter for explicit vertical extraction.
- Extraction results carry a new `vertical_data` field.