Port full upstream vertical extractor parity #13
Conversation
📝 Walkthrough

This PR introduces a comprehensive vertical extractor system to Noxa, adding 28 site-specific extractors (e.g., GitHub, Reddit, YouTube, npm) alongside infrastructure for URL dispatch, CLI selection/listing, and MCP integration. A new `vertical_data` payload is threaded through `ExtractionResult`.
🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: 4 passed, 1 failed (warning).
Pull request overview
This PR ports “vertical extractor” support into Noxa, adding a 28-extractor catalog in noxa-fetch, plumbing a new vertical_data payload through ExtractionResult, and exposing explicit extractor selection via the CLI and MCP while preserving existing generic scraping defaults.
Changes:
- Add `vertical_data: Option<VerticalData>` to `noxa_core::ExtractionResult` and update constructors/tests across the workspace.
- Introduce `noxa-fetch::extractors` (catalog + dispatch + per-site implementations) and integrate safe auto-dispatch + explicit dispatch into fetch and batch flows.
- Expose vertical extractors via CLI (`--list-extractors`, `--extractor`) and MCP (optional `scrape.extractor`, new `extractors` tool), plus documentation and reports.
Reviewed changes
Copilot reviewed 109 out of 110 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| docs/superpowers/specs/2026-04-26-full-upstream-extractor-parity-design.md | Adds design spec for vertical extractor parity (catalog, dispatch, schema). |
| docs/superpowers/plans/2026-04-26-full-upstream-extractor-parity-plan.md | Adds step-by-step implementation plan for extractor parity and tests. |
| docs/reports/live-extractor-cli-report-2026-04-26.md | Adds live CLI sweep report covering 28/28 extractors. |
| docs/CHANGELOG.md | Documents new vertical extractor capabilities for fetch/CLI/MCP. |
| crates/noxa-store/src/content_store/tests.rs | Updates test helpers to include vertical_data: None. |
| crates/noxa-store/src/content_store/enumerate.rs | Refactors manifest cache fast-path with if let ... && style. |
| crates/noxa-rag/src/store/qdrant/tests.rs | Small refactors in test parsing + route matching. |
| crates/noxa-rag/src/pipeline/watcher.rs | Formatting / minor control-flow refactors in watcher setup and job dispatch. |
| crates/noxa-rag/src/pipeline/scan.rs | Refactors conditionals; adjusts formatting; minor cleanup. |
| crates/noxa-rag/src/pipeline/process.rs | Refactors log rotation conditional + formatting tweaks; minor code cleanup. |
| crates/noxa-rag/src/pipeline/parse/tests.rs | Updates test fixture ExtractionResult to include vertical_data: None. |
| crates/noxa-rag/src/pipeline/parse/mod.rs | Updates helper ExtractionResult construction to include vertical_data: None. |
| crates/noxa-rag/src/mcp_bridge.rs | Updates extraction construction to include vertical_data: None. |
| crates/noxa-mcp/src/tools.rs | Adds optional extractor: Option<String> to ScrapeParams + test. |
| crates/noxa-mcp/src/server/content_tools.rs | Updates placeholder extraction to include vertical_data: None. |
| crates/noxa-mcp/src/server.rs | Adds explicit vertical scrape path + new extractors tool + tests; updates tool list string. |
| crates/noxa-mcp/README.md | Documents optional scrape.extractor and new extractors tool. |
| crates/noxa-fetch/tests/fixtures/extractors/youtube_video.html | Adds fixture for YouTube player-response parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/trustpilot.html | Adds fixture for Trustpilot JSON-LD parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/substack_post.html | Adds fixture for Substack HTML extraction. |
| crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_question.json | Adds fixture for StackOverflow question API. |
| crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_answers.json | Adds fixture for StackOverflow answers API. |
| crates/noxa-fetch/tests/fixtures/extractors/shopify_product.json | Adds fixture for Shopify product .js API. |
| crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json | Adds fixture for Shopify collection products.json API. |
| crates/noxa-fetch/tests/fixtures/extractors/reddit.json | Adds fixture for Reddit JSON listing parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/pypi.json | Adds fixture for PyPI JSON API. |
| crates/noxa-fetch/tests/fixtures/extractors/product_page.html | Adds generic product page fixture (JSON-LD + OG/meta). |
| crates/noxa-fetch/tests/fixtures/extractors/npm_registry.json | Adds fixture for npm registry API. |
| crates/noxa-fetch/tests/fixtures/extractors/npm_downloads.json | Adds fixture for npm downloads API. |
| crates/noxa-fetch/tests/fixtures/extractors/linkedin_post.html | Adds fixture for LinkedIn embed HTML parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/instagram_profile.json | Adds fixture for Instagram web_profile_info API. |
| crates/noxa-fetch/tests/fixtures/extractors/instagram_post.html | Adds fixture for Instagram embed caption parsing. |
| crates/noxa-fetch/tests/fixtures/extractors/huggingface_model.json | Adds fixture for Hugging Face model API. |
| crates/noxa-fetch/tests/fixtures/extractors/huggingface_dataset.json | Adds fixture for Hugging Face dataset API. |
| crates/noxa-fetch/tests/fixtures/extractors/hackernews.json | Adds fixture for HN Algolia item API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_repo.json | Adds fixture for GitHub repo API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_release.json | Adds fixture for GitHub release API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_pr.json | Adds fixture for GitHub PR API. |
| crates/noxa-fetch/tests/fixtures/extractors/github_issue.json | Adds fixture for GitHub issue API. |
| crates/noxa-fetch/tests/fixtures/extractors/docker_hub.json | Adds fixture for Docker Hub repo API. |
| crates/noxa-fetch/tests/fixtures/extractors/dev_to.json | Adds fixture for dev.to article API. |
| crates/noxa-fetch/tests/fixtures/extractors/crates_io.json | Adds fixture for crates.io API. |
| crates/noxa-fetch/tests/fixtures/extractors/arxiv.xml | Adds fixture for arXiv Atom API parsing. |
| crates/noxa-fetch/src/sitemap.rs | Makes robots.txt sitemap parsing more robust; adds plausibility checks + test. |
| crates/noxa-fetch/src/reddit.rs | Adds Reddit JSON API UA helper + verification-wall detection + tests; sets vertical_data: None. |
| crates/noxa-fetch/src/linkedin.rs | Updates LinkedIn extraction result to include vertical_data: None. |
| crates/noxa-fetch/src/lib.rs | Exposes new extractors module publicly. |
| crates/noxa-fetch/src/extractors/youtube_video.rs | Adds YouTube video extractor (match + parse player response). |
| crates/noxa-fetch/src/extractors/woocommerce_product.rs | Adds WooCommerce product extractor (HTML JSON-LD parsing). |
| crates/noxa-fetch/src/extractors/trustpilot_reviews.rs | Adds Trustpilot reviews extractor (JSON-LD parsing). |
| crates/noxa-fetch/src/extractors/summary.rs | Adds small markdown summary helper. |
| crates/noxa-fetch/src/extractors/substack_post.rs | Adds Substack post extractor (HTML + JSON-LD + tag stripping). |
| crates/noxa-fetch/src/extractors/stackoverflow.rs | Adds StackOverflow question/answer extractor (StackExchange API). |
| crates/noxa-fetch/src/extractors/shopify_product.rs | Adds Shopify product extractor (product .js API). |
| crates/noxa-fetch/src/extractors/shopify_collection.rs | Adds Shopify collection extractor (products.json API). |
| crates/noxa-fetch/src/extractors/reddit.rs | Adds Reddit extractor wrapper around existing reddit parser. |
| crates/noxa-fetch/src/extractors/pypi.rs | Adds PyPI package extractor (PyPI JSON API). |
| crates/noxa-fetch/src/extractors/product.rs | Adds shared JSON-LD product/review parsing helpers. |
| crates/noxa-fetch/src/extractors/npm.rs | Adds npm package extractor (registry + downloads API). |
| crates/noxa-fetch/src/extractors/mod.rs | Adds extractor catalog, auto-dispatch, name dispatch, dispatch errors, and fixture-based tests. |
| crates/noxa-fetch/src/extractors/linkedin_post.rs | Adds LinkedIn post extractor via embed page OG/body parsing. |
| crates/noxa-fetch/src/extractors/instagram_profile.rs | Adds Instagram profile extractor via web_profile_info API. |
| crates/noxa-fetch/src/extractors/instagram_post.rs | Adds Instagram post extractor via embed caption parsing. |
| crates/noxa-fetch/src/extractors/huggingface_model.rs | Adds Hugging Face model extractor (HF API). |
| crates/noxa-fetch/src/extractors/huggingface_dataset.rs | Adds Hugging Face dataset extractor (HF API). |
| crates/noxa-fetch/src/extractors/http.rs | Introduces ExtractorHttp trait and implements it for FetchClient. |
| crates/noxa-fetch/src/extractors/hackernews.rs | Adds Hacker News item extractor (Algolia API). |
| crates/noxa-fetch/src/extractors/github_repo.rs | Adds GitHub repo extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_release.rs | Adds GitHub release extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_pr.rs | Adds GitHub PR extractor (GitHub API). |
| crates/noxa-fetch/src/extractors/github_issue.rs | Adds GitHub issue extractor (GitHub API) with PR detection. |
| crates/noxa-fetch/src/extractors/etsy_listing.rs | Adds Etsy listing extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/ecommerce_product.rs | Adds generic ecommerce product extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/ebay_listing.rs | Adds eBay listing extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/extractors/docker_hub.rs | Adds Docker Hub repository extractor (Docker Hub API). |
| crates/noxa-fetch/src/extractors/dev_to.rs | Adds dev.to article extractor (dev.to API). |
| crates/noxa-fetch/src/extractors/crates_io.rs | Adds crates.io crate extractor (crates.io API). |
| crates/noxa-fetch/src/extractors/arxiv.rs | Adds arXiv paper extractor (Atom API parsing via quick-xml). |
| crates/noxa-fetch/src/extractors/amazon_product.rs | Adds Amazon product extractor (HTML JSON-LD product parsing). |
| crates/noxa-fetch/src/document.rs | Ensures document extraction ExtractionResult includes vertical_data: None. |
| crates/noxa-fetch/src/crawler.rs | Adds glob-pattern validation + capacity checks + better semaphore-closed handling + tests. |
| crates/noxa-fetch/src/client/tests.rs | Adds unit test for build_vertical_extraction_result + raw response size-limit test helper. |
| crates/noxa-fetch/src/client/fetch.rs | Adds explicit vertical fetch method + auto-dispatch hook + vertical summary builder; hardens Reddit JSON request headers. |
| crates/noxa-fetch/src/client/batch.rs | Adds batch vertical extraction method with concurrency clamp. |
| crates/noxa-fetch/Cargo.toml | Adds async-trait and regex deps for extractor layer. |
| crates/noxa-core/src/types.rs | Adds VerticalData and ExtractionResult::vertical_data. |
| crates/noxa-core/src/structured_data.rs | Improves JSON-LD parsing robustness by escaping raw newlines inside JSON strings; adds test. |
| crates/noxa-core/src/llm/mod.rs | Updates test fixtures to include vertical_data: None. |
| crates/noxa-core/src/lib.rs | Exports VerticalData, initializes vertical_data: None in extraction results, adds serialization test. |
| crates/noxa-core/src/extractor/recovery.rs | Fixes markdown search logic for multibyte strings and empty needles; adds tests. |
| crates/noxa-core/src/diff.rs | Updates test fixtures to include vertical_data: None. |
| crates/noxa-cli/src/setup.rs | Updates MCP tool list output to include extractors. |
| crates/noxa-cli/src/config.rs | Refactors formatting + conditionals; no behavior change intended. |
| crates/noxa-cli/src/app/watch_singleton.rs | Refactors PID-file liveness check conditionals. |
| crates/noxa-cli/src/app/watch.rs | Hardens --on-change execution by avoiding shell evaluation (shlex split + direct exec). |
| crates/noxa-cli/src/app/tests_primary.rs | Adds CLI parser tests for extractor flags + catalog formatting tests; adjusts on-change tests. |
| crates/noxa-cli/src/app/store_ops.rs | Refactors unicode-safe display truncation loop. |
| crates/noxa-cli/src/app/retrieve.rs | Updates test fixture to include vertical_data: None. |
| crates/noxa-cli/src/app/rag_watch.rs | Refactors formatting + conditionals; minor logic tweaks. |
| crates/noxa-cli/src/app/rag_daemon.rs | Formatting change for path printing. |
| crates/noxa-cli/src/app/printing.rs | Adds extractor catalog printing/formatting helpers. |
| crates/noxa-cli/src/app/mod.rs | Wires extractor catalog printing into module exports; minor reordering. |
| crates/noxa-cli/src/app/fetching/extract.rs | Routes explicit extractor requests to fetch_and_extract_vertical and blocks unsupported combos. |
| crates/noxa-cli/src/app/entry.rs | Adds --list-extractors handling + centralized “unsupported extractor mode” validation. |
| crates/noxa-cli/src/app/crawl_watch.rs | Refactors formatting and cooldown logic for alerts. |
| crates/noxa-cli/src/app/cli.rs | Adds --extractor and --list-extractors flags. |
| crates/noxa-cli/src/app/batch.rs | Adds vertical extractor path for batch mode. |
| crates/noxa-cli/Cargo.toml | Adds shlex dependency. |
| README.md | Documents vertical extractor usage and updates MCP tool count/list. |
| Cargo.lock | Locks new dependencies (async-trait, regex, shlex). |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dfb0cdd616
```rust
if let Some(result) = crate::extractors::dispatch_by_url(self, url).await {
    let (extractor, data) = result?;
    return Ok(build_vertical_extraction_result(extractor, url, data));
```
There was a problem hiding this comment.
Preserve Reddit fallback on vertical extractor errors
This early return makes extractor failures terminal, so Reddit comment URLs no longer reach the legacy fallback logic below (is_reddit_url branch) that retries the .json path with Reddit-specific headers and can fall back to HTML extraction. In practice, when Reddit returns verification HTML from the JSON endpoint, dispatch_by_url now propagates an error immediately instead of attempting the previous fallback path, which regresses reliability for common www.reddit.com/.../comments/... inputs.
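A minimal sketch of one way to keep the fallback alive; the `is_reddit_url` helper and `tracing` call are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical restructure: make vertical-extractor failures non-terminal for
// Reddit URLs so the legacy .json + HTML fallback below still runs.
if let Some(result) = crate::extractors::dispatch_by_url(self, url).await {
    match result {
        Ok((extractor, data)) => {
            return Ok(build_vertical_extraction_result(extractor, url, data));
        }
        // Assumed helper: fall through to the Reddit-specific retry path.
        Err(err) if is_reddit_url(url) => {
            tracing::debug!("vertical extractor failed; using Reddit fallback: {err}");
        }
        Err(err) => return Err(err),
    }
}
```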
Resolves review threads PRRT_kwDOR_mP6c59sJHF, PRRT_kwDOR_mP6c59sJHO, PRRT_kwDOR_mP6c59sJG2, PRRT_kwDOR_mP6c59sJG6, PRRT_kwDOR_mP6c59sJG9, PRRT_kwDOR_mP6c59sJHD, PRRT_kwDOR_mP6c59sJGv, PRRT_kwDOR_mP6c59sJGz.
- Broaden Amazon/eBay TLD matchers and tighten Substack dispatch.
- Keep Reddit auto-fetch on the hardened JSON path while attaching vertical_data.
- Return Reddit-specific vertical payloads, parse YouTube nocookie embeds, and rename Hugging Face all-time download fields.
Actionable comments posted: 36
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/noxa-fetch/src/crawler.rs (1)
Lines 285-298: ⚠️ Potential issue | 🟡 Minor — Sitemap frontier cap also bounds entries that would later be excluded.

The cap uses `frontier.len() >= max_pages` before `qualify_link`/`is_excluded_by_pattern` runs. Once the cap is hit you `break`, which means matching excluded URLs that appear later in the sitemap are silently ignored rather than counted. Functionally fine for correctness, but the reported `excluded` count will under-report on large sitemaps. Worth a one-line comment if intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/noxa-fetch/src/crawler.rs` around lines 285 - 298, The frontier length cap is checked before calling qualify_link/is_excluded_by_pattern, causing later excluded sitemap entries to be skipped and the excluded counter under-reported; update the loop so you first call self.qualify_link(&entry.url, &visited) and self.is_excluded_by_pattern(&entry.url) to decide whether to push to frontier or increment excluded, and only then check if frontier.len() >= self.config.max_pages to break (or alternatively, if you must break early, increment excluded for any remaining entries before breaking); refer to frontier, self.config.max_pages, self.qualify_link, self.is_excluded_by_pattern, excluded and entry.url when making the change.
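For illustration, a sketch of the reordering this prompt describes; the method signatures (a bool-returning `qualify_link`, field names) are assumed from the comment, not verified against the crate:

```rust
// Sketch: classify each sitemap entry before applying the frontier cap so
// excluded URLs are always counted, even once the cap is reached.
for entry in sitemap_entries {
    if self.is_excluded_by_pattern(&entry.url) {
        excluded += 1;
        continue;
    }
    if !self.qualify_link(&entry.url, &visited) {
        continue;
    }
    if frontier.len() >= self.config.max_pages {
        break; // the cap now bounds only URLs that would enter the frontier
    }
    frontier.push(entry.url);
}
```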
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/noxa-cli/src/app/entry.rs`:
- Around line 309-345: unsupported_extractor_mode currently checks mode
incompatibilities but doesn't verify that the extractor name provided in
cli.extractor actually exists; add a validation step (in
unsupported_extractor_mode or immediately after its call in run()) that looks up
the provided extractor name against noxa_fetch::extractors::list() and returns
an appropriate error string (e.g., "unknown extractor: <name>") if not found so
users receive a single upfront error instead of per-URL failures later in
fetch_and_extract_vertical; reference the cli.extractor field, the
unsupported_extractor_mode function, noxa_fetch::extractors::list(), and
fetch_and_extract_vertical when adding this check.
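A hedged sketch of that upfront check; the exact shape of `noxa_fetch::extractors::list()` and its item type are assumptions based on the prompt:

```rust
// Validate the requested extractor name once, before any URL is fetched.
if let Some(name) = cli.extractor.as_deref() {
    let known = noxa_fetch::extractors::list()
        .iter()
        .any(|info| info.name == name);
    if !known {
        return Err(format!("unknown extractor: {name}"));
    }
}
```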
In `@crates/noxa-cli/src/app/tests_primary.rs`:
- Around line 256-264: The test
extractor_catalog_json_output_serializes_all_extractors currently asserts a
hardcoded count (assert_eq!(entries.len(), 28)), which will break when
extractors change; update the assertion to compare against the actual extractor
list instead (e.g., use noxa_fetch::extractors::list().len() as the expected
count) or at minimum assert entries.len() >= 28 so the test verifies
serialization without hardcoding; locate the test and change the assertion after
calling format_extractor_catalog(&OutputFormat::Json) and parsing to
serde_json::Value.
In `@crates/noxa-cli/src/app/watch.rs`:
- Around line 129-136: The parse_on_change_command function currently returns a
misleading message for shlex::split failures; update the shlex::split error
branch to use a broader, accurate message (e.g., "failed to parse command:
invalid quoting or escape sequences") instead of specifically mentioning
unterminated quotes, keeping the existing Err path for empty argv unchanged;
locate parse_on_change_command and change the .ok_or_else closure message to the
new wording so diagnostics reflect all parse errors from shlex::split (a sketch of this helper follows this file's comments).
- Around line 138-149: The README/CHANGELOG must document the behavioral change
that --on-change now uses shlex-style argv parsing (see run_on_change_command
and parse_on_change_command) and no longer supports shell features (pipes,
redirects, &&/||, globbing, env expansion, substitution); add a clear migration
note to README and a changelog entry stating this limitation, show an example
converting a shell recipe to an explicit script (e.g. --on-change
"/path/to/script.sh" or invoke sh -c explicitly), and include a short example
demonstrating the proper invocation and recommended wrapper script to preserve
previous behavior.
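A minimal sketch of the parsing helper from the first item above, with the broadened message; the function shape is assumed from the prompt, while `shlex::split` returning `None` on bad quoting or escapes is the crate's actual behavior:

```rust
fn parse_on_change_command(raw: &str) -> Result<Vec<String>, String> {
    // shlex::split returns None on invalid quoting or escape sequences.
    let argv = shlex::split(raw).ok_or_else(|| {
        "failed to parse command: invalid quoting or escape sequences".to_string()
    })?;
    if argv.is_empty() {
        return Err("empty --on-change command".to_string());
    }
    Ok(argv)
}
```

Shell features then become opt-in again via an explicit wrapper, e.g. `--on-change "sh -c 'do_a && do_b'"`.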
In `@crates/noxa-core/src/extractor/recovery.rs`:
- Around line 499-525: Add a test that asserts find_content_position returns
None when given an empty needle to lock the empty-needle guard behavior; update
the tests module (tests) by adding a case such as calling
find_content_position("anything", "") and asserting it equals None so the
contract is enforced alongside the existing multibyte tests that exercise
find_content_position.
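The requested test is a one-liner; the `find_content_position(haystack, needle) -> Option<usize>` signature is assumed from the surrounding description:

```rust
#[test]
fn find_content_position_rejects_empty_needle() {
    // Locks in the guard: an empty needle must never report a match.
    assert_eq!(find_content_position("anything", ""), None);
}
```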
In `@crates/noxa-core/src/structured_data.rs`:
- Around line 78-113: The recovery currently only escapes raw '\n' and '\r' in
escape_raw_newlines_in_json_strings; expand it to escape all unescaped control
characters U+0000..U+001F when in_string (e.g., map '\t' -> "\\t", '\x08' ->
"\\b", '\x0C' -> "\\f", '\n' -> "\\n", '\r' -> "\\r' and for any other control
like NUL or others use a "\\u00XX" hex escape), preserving existing escape
handling via escape_next and in_string flags in the function; update the match
for the in_string branch to detect ch <= '\u{1f}' and append the correct escape
sequence and set changed = true.
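A self-contained sketch of the broader in-string escaping; it assumes the surrounding function already tracks the `in_string`/`escape_next` state as described:

```rust
// Escape one raw control character encountered inside a JSON string literal.
fn escape_control_char(ch: char, out: &mut String) {
    match ch {
        '\n' => out.push_str("\\n"),
        '\r' => out.push_str("\\r"),
        '\t' => out.push_str("\\t"),
        '\u{8}' => out.push_str("\\b"), // backspace
        '\u{c}' => out.push_str("\\f"), // form feed
        c if c <= '\u{1f}' => out.push_str(&format!("\\u{:04x}", c as u32)),
        c => out.push(c),
    }
}
```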
In `@crates/noxa-fetch/src/client/fetch.rs`:
- Around line 445-467: The metadata.image is never populated for some vertical
extractors because build_vertical_extraction_result currently looks up
["image_url","thumbnail_url"] but substack_post::extract and product::extract
emit the key "image"; update build_vertical_extraction_result to include "image"
in the string_field lookup (alongside "image_url" and "thumbnail_url") so
metadata.image pulls that value, keeping string_field behavior intact for
non-string JSON values; reference build_vertical_extraction_result,
metadata.image, and string_field when locating the change and verify
substack_post::extract and product::extract continue to emit "image".
In `@crates/noxa-fetch/src/client/tests.rs`:
- Around line 72-88: The test file's formatting in the function
vertical_extraction_result_sets_vertical_payload_and_summary is not
cargo-fmt-compliant; run `cargo fmt` (or `cargo fmt --all`) to reformat
crates/noxa-fetch/src/client/tests.rs (and the workspace) so the test hunk
matches rustfmt expectations, then stage and commit the resulting changes so
CI's `cargo fmt --check --all` succeeds.
- Around line 306-324: Add a short doc comment to the spawn_raw_response_server
helper explaining that it accepts only a single connection then exits (unlike
spawn_status_server which loops), so it is intended for single-shot tests like
fetch_rejects_oversized_html_response_from_content_length and will return
connection refused on subsequent attempts; update the comment above the async fn
spawn_raw_response_server(response: String) -> String to clearly state this
single-accept behavior and intended usage.
In `@crates/noxa-fetch/src/crawler.rs`:
- Around line 561-587: The wildcard counter currently double-counts globstars
because wildcard_count (pattern.bytes().filter ... ) counts each '*' including
those in "**", masking the specific globstar limit; update validate_glob_pattern
so the pattern.matches("**") check runs before computing wildcard_count (or
compute globstar_count first and subtract 2*globstar_count from wildcard_count)
and return the "too many recursive wildcards" error first using the existing
MAX_GLOBSTARS check; adjust the logic around validate_glob_pattern,
pattern.matches("**"), and the wildcard_count computation so "**" are accounted
for only by the globstar limit and not twice-counted against MAX_GLOB_WILDCARDS.
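A sketch of the corrected counting order; the `MAX_GLOBSTARS`/`MAX_GLOB_WILDCARDS` values below are placeholders, not the crate's real limits:

```rust
const MAX_GLOBSTARS: usize = 2; // placeholder limit
const MAX_GLOB_WILDCARDS: usize = 8; // placeholder limit

fn validate_glob_pattern(pattern: &str) -> Result<(), String> {
    // Check the globstar limit first, on non-overlapping "**" occurrences.
    let globstar_count = pattern.matches("**").count();
    if globstar_count > MAX_GLOBSTARS {
        return Err("too many recursive wildcards".to_string());
    }
    // Each "**" consumed two '*' bytes; subtract them so single-star and
    // globstar limits are enforced independently.
    let star_count = pattern.bytes().filter(|b| *b == b'*').count();
    if star_count - 2 * globstar_count > MAX_GLOB_WILDCARDS {
        return Err("too many wildcards".to_string());
    }
    Ok(())
}
```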
In `@crates/noxa-fetch/src/extractors/amazon_product.rs`:
- Around line 6-15: The matches function currently restricts hosts to
"amazon.com" which contradicts INFO.url_patterns; update the matches
implementation so it accepts any Amazon TLD (e.g., amazon.co.uk, amazon.de,
amazon.co.jp, etc.) instead of only "amazon.com". Concretely, modify the host
check in matches (which currently calls host_matches(url, "amazon.com")) to
target the Amazon second-level domain broadly (for example by calling
host_matches(url, "amazon") or otherwise matching hosts that end with
".amazon.<tld>" or contain ".amazon.") while keeping the existing path checks
(url.contains("/dp/") || url.contains("/gp/product/")) so INFO and matches stay
consistent.
In `@crates/noxa-fetch/src/extractors/arxiv.rs`:
- Around line 45-59: The parse_id function currently only takes segs[1], which
breaks old-style arXiv IDs like /abs/cs/0301001v2 by returning "cs"; update
parse_id to join all path segments after the initial "abs" or "pdf" into one
identifier (e.g., join segs[1..].join("/")), then strip a trailing ".pdf" and
remove a version suffix like "vN" (detect 'v' followed exclusively by digits at
the end) before returning; keep the existing checks (URL parsing, non-empty
segments, allowed first segment) and return None for empty results, referencing
the parse_id function and its use of segs, stripped, and no_version (a sketch follows this file's comments).
- Around line 150-163: In parse_atom_entry's Ok(Event::Text(text)) handling,
don't use text.unescape().ok()? which aborts the whole entry on an unescape
error; instead attempt to unescape and if it fails fall back to the raw text (or
skip that node) so the parser continues and populates fields like entry.id,
entry.title (append_text), entry.summary, entry.published, entry.updated,
entry.authors, entry.doi, entry.comment without returning None for the entire
entry.
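A sketch of the joined-segment `parse_id` from the first arXiv item above; taking a parsed `url::Url` is an assumption about the real signature:

```rust
// Handles both new-style "/abs/2301.01234v2" and old-style "/abs/cs/0301001v2".
fn parse_id(url: &url::Url) -> Option<String> {
    let segs: Vec<&str> = url.path_segments()?.filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 || !matches!(segs[0], "abs" | "pdf") {
        return None;
    }
    let joined = segs[1..].join("/");
    let stripped = joined.strip_suffix(".pdf").unwrap_or(&joined);
    // Drop a trailing version suffix: a 'v' followed exclusively by digits.
    let no_version = match stripped.rfind('v') {
        Some(i)
            if i > 0
                && !stripped[i + 1..].is_empty()
                && stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) =>
        {
            &stripped[..i]
        }
        _ => stripped,
    };
    (!no_version.is_empty()).then(|| no_version.to_string())
}
```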
In `@crates/noxa-fetch/src/extractors/docker_hub.rs`:
- Around line 6-15: The url_patterns in the ExtractorInfo (INFO) currently only
advertises "https://hub.docker.com/r/*" but parse_repo and matches also accept
the official-library form under the underscore path, so update INFO.url_patterns
to include the official-library pattern (e.g., "https://hub.docker.com/_/*" or
similar wildcard) alongside the existing "/r/*" entry so the catalog and users
see both accepted URL forms; touch the INFO constant's url_patterns array (and
ensure no other behavior changes to matches or parse_repo).
In `@crates/noxa-fetch/src/extractors/ebay_listing.rs`:
- Around line 6-15: The matcher currently restricts to host_matches(url,
"ebay.com") while INFO.url_patterns advertises https://*.ebay.*/itm/*; update
matches() to accept any eBay TLD/subdomain instead of only ebay.com—e.g., parse
the URL host (or use host_matches with a wildcard) and check that the host
contains or ends with "ebay" (or matches pattern "ebay.*") and that the path
contains "/itm/"; keep the check tied to the existing matches() function and
host_matches helper so auto-dispatch aligns with INFO.url_patterns.
In `@crates/noxa-fetch/src/extractors/github_release.rs`:
- Around line 17-22: The extract function builds the GitHub API URL by
interpolating owner, repo, and tag directly (from parse_release) which can
contain reserved characters; implement a helper (e.g., encode_path_segment) that
percent-encodes a path segment using url::form_urlencoded::byte_serialize or
percent-encoding's utf8_percent_encode, then use it when constructing api_url in
extract (and the other usage around lines 51–66) so owner, repo, and tag are
encoded before being inserted into
"https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}".
In `@crates/noxa-fetch/src/extractors/huggingface_dataset.rs`:
- Around line 36-37: The output key `downloads_30d` is incorrectly mapped from
`downloadsAllTime`; update the mapping in the Hugging Face dataset extractor
(the code building the output map that uses the `dataset` variable in
crates/noxa-fetch/src/extractors/huggingface_dataset.rs) to rename that key to
`downloads_all_time` while keeping the value as
`dataset.get("downloadsAllTime").cloned()`, and ensure the existing
`"downloads": dataset.get("downloads").cloned()` line remains as the
trailing-30-day metric.
In `@crates/noxa-fetch/src/extractors/huggingface_model.rs`:
- Around line 35-37: The mapping currently puts the lifetime count
model.get("downloadsAllTime") under the key "downloads_30d", mislabeling
lifetime downloads as a 30‑day metric; update the mapping in the vertical_data
construction so that "downloads_30d" either reads
model.get("downloads").cloned() (if you intend a ~30‑day stat) or rename the key
to "downloads_all_time" (or similar) and keep
model.get("downloadsAllTime").cloned() (if you intend lifetime totals), and
remove any duplicate/conflicting "downloads" entry accordingly.
In `@crates/noxa-fetch/src/extractors/instagram_post.rs`:
- Around line 67-94: The three extraction functions (parse_username,
parse_caption, parse_thumbnail) currently call Regex::new at each invocation;
replace these with process-wide compiled statics using std::sync::LazyLock (or
OnceLock) so each pattern is compiled once: create static LazyLock<Regex>
entries for the patterns that match CaptionUsername, Caption (the div
class="Caption" capture), the user anchor inside Caption, the generic tag re,
and the EmbeddedMediaImage src pattern, then update parse_username,
parse_caption, and parse_thumbnail to use those static Regex instances instead
of Regex::new(...)? and drop the ? short-circuiting on regex compilation (these
statics are infallible once created). Ensure you reference the existing symbol
names parse_username, parse_caption, parse_thumbnail and the specific patterns
(CaptionUsername, Caption div, user_re, tag_re, EmbeddedMediaImage) when
replacing the Regex::new calls.
- Around line 96-103: The current html_decode function only replaces a few named
entities and leaves many numeric/named entities (e.g., &#39;, &nbsp;, &hellip;,
emoji numeric refs) un-decoded; add a dedicated HTML entity decoder crate to
Cargo.toml (for example the html_escape crate) and replace the body of
html_decode to call the crate’s decoder (e.g. html_escape::decode_html_entities)
so it returns a fully decoded String; update any imports to reference the
decoder and remove the manual .replace chain in the html_decode function.
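Hedged sketches for the two items above. First, a once-per-process regex via `std::sync::LazyLock`; the pattern shown is illustrative, not the extractor's exact one:

```rust
use std::sync::LazyLock;

use regex::Regex;

// Compiled once on first use; later calls reuse the same Regex.
static CAPTION_USERNAME_RE: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r#"class="CaptionUsername"[^>]*>\s*([^<]+)"#).expect("static pattern is valid")
});

fn parse_username(html: &str) -> Option<String> {
    CAPTION_USERNAME_RE
        .captures(html)
        .map(|caps| caps[1].trim().to_string())
}
```

Second, delegating entity decoding to the `html_escape` crate, whose `decode_html_entities` handles named and numeric references:

```rust
fn html_decode(input: &str) -> String {
    // Replaces the manual .replace() chain; covers named, numeric, and emoji refs.
    html_escape::decode_html_entities(input).into_owned()
}
```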
In `@crates/noxa-fetch/src/extractors/instagram_profile.rs`:
- Around line 17-28: The extractor fails against real Instagram because the
request to api_url in extract(...) doesn't include the required X-IG-App-ID
header (and a valid session) so get_json(...) receives HTML/401; update the call
path so the request includes X-IG-App-ID: retrieve the app id from configuration
or the HTTP client and pass it as a header when calling client.get_json (or add
a get_json_with_headers method to the ExtractorHttp trait and use it here), and
ensure the client also supplies session cookies/auth as it manages for other
endpoints so the web_profile_info call returns JSON and user can be read from
body.pointer("/data/user").
In `@crates/noxa-fetch/src/extractors/mod.rs`:
- Around line 82-413: dispatch_by_url and dispatch_by_name duplicate the same
per-extractor boilerplate (calls to <extractor>::matches, <extractor>::extract,
and <extractor>::INFO.name plus run_or_mismatch) for many extractors; collapse
this into a single source-of-truth (either a macro_rules! table or a const
slice) that lists each extractor as a tuple/entry containing its matches
function, extract function, INFO.name, and an auto_dispatch boolean, then
rewrite dispatch_by_url to iterate the table and call matches/extract
dynamically and dispatch_by_name to lookup by INFO.name and call run_or_mismatch
with the stored matches/extract; keep run_or_mismatch, INFO.name, matches,
extract, and list() semantics and honor the auto_dispatch flag so explicit-only
extractors are excluded from automatic dispatch.
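One possible shape for that single source of truth is a `macro_rules!` table; the signatures below (an `ExtractorHttp` client, `serde_json::Value` payloads) are assumptions for illustration:

```rust
// Each entry: (module, participates_in_auto_dispatch).
macro_rules! extractor_table {
    ($(($module:ident, $auto:expr)),* $(,)?) => {
        pub async fn dispatch_by_url(
            client: &impl ExtractorHttp,
            url: &str,
        ) -> Option<Result<(&'static str, serde_json::Value), FetchError>> {
            $(
                if $auto && $module::matches(url) {
                    return Some(
                        $module::extract(client, url)
                            .await
                            .map(|data| ($module::INFO.name, data)),
                    );
                }
            )*
            None
        }
    };
}

// Explicit-only extractors set the flag to false and are skipped here.
extractor_table![(github_repo, true), (reddit, true), (linkedin_post, false)];
```

A `dispatch_by_name` can be generated from the same invocation, so adding an extractor touches a single line.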
In `@crates/noxa-fetch/src/extractors/pypi.rs`:
- Around line 13-15: The matches function is looser than parse_project and can
mis-dispatch URLs (e.g., query strings containing "/project/"); update matches
to use the same path-check logic as parse_project instead of
url.contains("/project/") so both use parse_project for consistency — locate the
matches(url: &str) function and replace the url.contains check with a call to
parse_project (or the same path-first-segment logic), keeping host_matches(url,
"pypi.org") and ensuring mis-routed URLs no longer reach the extractor and
trigger FetchError::Build.
In `@crates/noxa-fetch/src/extractors/reddit.rs`:
- Around line 13-15: The current matches function (matches in reddit.rs) uses
url.contains("/comments/") which can match query/fragment; instead parse the URL
(e.g., with url::Url::parse) after validating host via host_matches, then
inspect the path segments (Url::path_segments or segments iterator) and check
that one of the path segments equals "comments" so only actual path occurrences
match; update matches to return false on parse error and fall back to the host
check + path-segment check accordingly.
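A sketch of the path-segment check, reusing the existing `host_matches` helper the prompt mentions:

```rust
fn matches(url: &str) -> bool {
    let Ok(parsed) = url::Url::parse(url) else {
        return false;
    };
    // "comments" must be a real path segment, not a query/fragment substring.
    host_matches(url, "reddit.com")
        && parsed
            .path_segments()
            .is_some_and(|mut segs| segs.any(|seg| seg == "comments"))
}
```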
In `@crates/noxa-fetch/src/extractors/shopify_collection.rs`:
- Around line 17-25: The api_url construction in extract() currently appends
"/products.json" to the raw input string (using url.trim_end_matches('/')),
which breaks when the input contains query strings or fragments; instead parse
the input with url::Url (e.g., Url::parse(url)), clear query and fragment
(url.set_query(None); url.set_fragment(None)), get the cleaned base
(url.as_str() or url.to_string()), trim any trailing slash from that cleaned
base, then build api_url = format!("{}/products.json", cleaned_base_trimmed).
Update the extract function to use this parsed/cleaned URL before calling
client.get_json so api_url is never polluted by ? or # parts.
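A sketch of the query/fragment-safe URL construction with `url::Url` (the helper name is hypothetical):

```rust
fn collection_api_url(raw: &str) -> Option<String> {
    let mut parsed = url::Url::parse(raw).ok()?;
    // Drop "?page=2" / "#reviews" so they never end up inside the API path.
    parsed.set_query(None);
    parsed.set_fragment(None);
    let base = parsed.as_str().trim_end_matches('/').to_string();
    Some(format!("{base}/products.json"))
}
```

The shopify_product case below is the same idea, except the cleaned path gets a ".js" suffix instead.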
In `@crates/noxa-fetch/src/extractors/shopify_product.rs`:
- Around line 17-19: The product_url construction in extract() incorrectly
appends ".js" to the full input string (variable product_url), breaking URLs
that include queries or fragments; parse the incoming URL (e.g. using the
url::Url type), modify only its path by trimming trailing '/' and appending
".js", then set that new path back on the parsed URL and rebuild the product_url
string before calling client.get_json(&product_url). This ensures query and
fragment components (if present) are preserved while the API path ends with
".js".
In `@crates/noxa-fetch/src/extractors/stackoverflow.rs`:
- Around line 13-15: The matches() function currently accepts non-question
StackOverflow paths and should be tightened to only match individual question
pages; update matches(url: &str) to require host_matches(url,
"stackoverflow.com") && a path segment pattern of "/questions/<numeric_id>"
(i.e., ensure there is a numeric segment immediately after "/questions/") so
parse_question_id and extract() only run for real question URLs; reference the
matches function and parse_question_id to implement the numeric-segment check
(e.g., parse the URL path and verify the segment following "questions" is all
digits) and reject URLs like "/questions/tagged" or "/questions?..." to preserve
the safe auto-dispatch guarantee.
In `@crates/noxa-fetch/src/extractors/substack_post.rs`:
- Around line 14-25: In the matches function replace the incorrect host check so
it only accepts substack-hosted domains: inside the closure that defines host
and has_post_path, change the condition from host.ends_with(".substack.com") ||
host != "substack.com" to host.ends_with(".substack.com") || host ==
"substack.com" so only "substack.com" and its subdomains match; leave the
existing path check (has_post_path using parsed.path_segments and "p"/slug)
intact, or if you intentionally want to allow arbitrary custom domains add a
separate predicate instead of changing this one.
In `@crates/noxa-fetch/src/extractors/trustpilot_reviews.rs`:
- Around line 6-15: INFO.url_patterns is narrower than the matches() logic:
update ExtractorInfo::url_patterns so the advertised patterns reflect the same
hosts accepted by matches() (which currently uses host_matches(url,
"trustpilot.com") and therefore allows country subdomains). Modify
INFO.url_patterns to include patterns for subdomains (for example add
"https://*.trustpilot.com/review/*" and "https://trustpilot.com/review/*") so
--list-extractors shows the same scope as the matches() function.
In `@crates/noxa-fetch/src/extractors/youtube_video.rs`:
- Around line 14-18: The matches() function currently treats
"youtube-nocookie.com" as a YouTube host but parse_video_id() only recognizes
"youtu.be" and hosts that end_with("youtube.com"), causing nocookie URLs to
fail; update parse_video_id() to treat hosts that
end_with("youtube-nocookie.com") the same as "youtube.com" (i.e., branch the
same way you do for host.ends_with("youtube.com") to extract IDs from /watch?v=,
/embed/, /shorts/, etc.), or alternatively remove "youtube-nocookie.com" from
matches() so it is not routed here—apply the same fix for the other parser
branch mentioned (lines covering the second occurrence, the 59-77 area) so both
matcher and parser are consistent.
- Around line 79-82: The regex in extract_player_response currently requires a
literal "var " before ytInitialPlayerResponse; update the Regex::new call in the
extract_player_response function to accept an optional "var" prefix (e.g. use a
non-capturing optional group like (?:var\s+)? before ytInitialPlayerResponse) so
it matches both "var ytInitialPlayerResponse = {...};" and
"ytInitialPlayerResponse = {...};", then keep extracting capture group 1 and
parsing with serde_json as before.
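The relaxed pattern is a one-line change; this standalone form assumes the original captures the JSON object in group 1:

```rust
// "(?s)" lets "." span newlines inside the serialized player response;
// "(?:var\s+)?" makes the declaration keyword optional.
let re = regex::Regex::new(r"(?s)(?:var\s+)?ytInitialPlayerResponse\s*=\s*(\{.*?\});")
    .expect("static pattern is valid");
```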
In `@crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json`:
- Line 11: The fixture's "images" array in shopify_collection.json currently
contains bare URL strings but must be an array of image objects matching
Shopify's real shape (e.g., objects with keys like id, product_id, position,
src, width, height, alt). Update the "images" value to an array of objects
(include realistic values for id, product_id, position and use "src" instead of
raw string), and adjust any tests/expectations that read image URLs to use the
image.src path (or the extractor behavior that accesses src) so test payloads
mirror the real /products.json response.
In `@crates/noxa-mcp/README.md`:
- Line 42: The README sentence hardcodes "28 supported extractors", which will
drift; update the line referencing `scrape` to remove the numeric count and
instead refer readers to the `extractors` tool for the current list (e.g.,
"`scrape` accepts an optional `extractor` string for explicit vertical
extraction; use the `extractors` tool to list available extractors."). Ensure
the updated wording mentions `scrape` and `extractors` so readers know how to
discover the current set.
In `@crates/noxa-mcp/src/server.rs`:
- Around line 1047-1048: The test currently asserts a fixed catalog size with
assert_eq!(entries.len(), 28), which couples it to catalog growth; change this
to assert that entries.len() is >= a baseline (e.g., assert!(entries.len() >=
28)) and add assertions to enforce uniqueness of names and required verticals:
use a HashSet over entry["name"] (same entries variable) to assert set.len() ==
entries.len() and keep the existing contains check like
entries.iter().any(|entry| entry["name"] == "github_repo") to verify required
items.
In `@crates/noxa-rag/src/store/qdrant/tests.rs`:
- Around line 73-77: The header-parsing logic is duplicated between
spawn_test_server and spawn_test_server_with_status; extract the shared code
into a helper function (e.g., read_request(stream: &mut TcpStream) ->
RecordedRequest or read_request_and_status if status is needed) that
encapsulates reading lines, parsing headers (including the content-length logic
using split_once and parse), and building the RecordedRequest, then call this
helper from both spawn_test_server and spawn_test_server_with_status to remove
the verbatim copy and prevent drift.
---
Outside diff comments:
In `@crates/noxa-fetch/src/crawler.rs`:
- Around line 285-298: The frontier length cap is checked before calling
qualify_link/is_excluded_by_pattern, causing later excluded sitemap entries to
be skipped and the excluded counter under-reported; update the loop so you first
call self.qualify_link(&entry.url, &visited) and
self.is_excluded_by_pattern(&entry.url) to decide whether to push to frontier or
increment excluded, and only then check if frontier.len() >=
self.config.max_pages to break (or alternatively, if you must break early,
increment excluded for any remaining entries before breaking); refer to
frontier, self.config.max_pages, self.qualify_link, self.is_excluded_by_pattern,
excluded and entry.url when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5cc8a691-d00d-4f5b-8479-e8eaba76162d
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock` and included by `**/*`
📒 Files selected for processing (109)
README.md, crates/noxa-cli/Cargo.toml, crates/noxa-cli/src/app/batch.rs, crates/noxa-cli/src/app/cli.rs, crates/noxa-cli/src/app/crawl_watch.rs, crates/noxa-cli/src/app/entry.rs, crates/noxa-cli/src/app/fetching/extract.rs, crates/noxa-cli/src/app/mod.rs, crates/noxa-cli/src/app/printing.rs, crates/noxa-cli/src/app/rag_daemon.rs, crates/noxa-cli/src/app/rag_watch.rs, crates/noxa-cli/src/app/retrieve.rs, crates/noxa-cli/src/app/store_ops.rs, crates/noxa-cli/src/app/tests_primary.rs, crates/noxa-cli/src/app/watch.rs, crates/noxa-cli/src/app/watch_singleton.rs, crates/noxa-cli/src/config.rs, crates/noxa-cli/src/setup.rs, crates/noxa-core/src/diff.rs, crates/noxa-core/src/extractor/recovery.rs, crates/noxa-core/src/lib.rs, crates/noxa-core/src/llm/mod.rs, crates/noxa-core/src/structured_data.rs, crates/noxa-core/src/types.rs, crates/noxa-fetch/Cargo.toml, crates/noxa-fetch/src/client/batch.rs, crates/noxa-fetch/src/client/fetch.rs, crates/noxa-fetch/src/client/tests.rs, crates/noxa-fetch/src/crawler.rs, crates/noxa-fetch/src/document.rs, crates/noxa-fetch/src/extractors/amazon_product.rs, crates/noxa-fetch/src/extractors/arxiv.rs, crates/noxa-fetch/src/extractors/crates_io.rs, crates/noxa-fetch/src/extractors/dev_to.rs, crates/noxa-fetch/src/extractors/docker_hub.rs, crates/noxa-fetch/src/extractors/ebay_listing.rs, crates/noxa-fetch/src/extractors/ecommerce_product.rs, crates/noxa-fetch/src/extractors/etsy_listing.rs, crates/noxa-fetch/src/extractors/github_issue.rs, crates/noxa-fetch/src/extractors/github_pr.rs, crates/noxa-fetch/src/extractors/github_release.rs, crates/noxa-fetch/src/extractors/github_repo.rs, crates/noxa-fetch/src/extractors/hackernews.rs, crates/noxa-fetch/src/extractors/http.rs, crates/noxa-fetch/src/extractors/huggingface_dataset.rs, crates/noxa-fetch/src/extractors/huggingface_model.rs, crates/noxa-fetch/src/extractors/instagram_post.rs, crates/noxa-fetch/src/extractors/instagram_profile.rs, crates/noxa-fetch/src/extractors/linkedin_post.rs, crates/noxa-fetch/src/extractors/mod.rs, crates/noxa-fetch/src/extractors/npm.rs, crates/noxa-fetch/src/extractors/product.rs, crates/noxa-fetch/src/extractors/pypi.rs, crates/noxa-fetch/src/extractors/reddit.rs, crates/noxa-fetch/src/extractors/shopify_collection.rs, crates/noxa-fetch/src/extractors/shopify_product.rs, crates/noxa-fetch/src/extractors/stackoverflow.rs, crates/noxa-fetch/src/extractors/substack_post.rs, crates/noxa-fetch/src/extractors/summary.rs, crates/noxa-fetch/src/extractors/trustpilot_reviews.rs, crates/noxa-fetch/src/extractors/woocommerce_product.rs, crates/noxa-fetch/src/extractors/youtube_video.rs, crates/noxa-fetch/src/lib.rs, crates/noxa-fetch/src/linkedin.rs, crates/noxa-fetch/src/reddit.rs, crates/noxa-fetch/src/sitemap.rs, crates/noxa-fetch/tests/fixtures/extractors/arxiv.xml, crates/noxa-fetch/tests/fixtures/extractors/crates_io.json, crates/noxa-fetch/tests/fixtures/extractors/dev_to.json, crates/noxa-fetch/tests/fixtures/extractors/docker_hub.json, crates/noxa-fetch/tests/fixtures/extractors/github_issue.json, crates/noxa-fetch/tests/fixtures/extractors/github_pr.json, crates/noxa-fetch/tests/fixtures/extractors/github_release.json, crates/noxa-fetch/tests/fixtures/extractors/github_repo.json, crates/noxa-fetch/tests/fixtures/extractors/hackernews.json, crates/noxa-fetch/tests/fixtures/extractors/huggingface_dataset.json, crates/noxa-fetch/tests/fixtures/extractors/huggingface_model.json, crates/noxa-fetch/tests/fixtures/extractors/instagram_post.html, crates/noxa-fetch/tests/fixtures/extractors/instagram_profile.json, crates/noxa-fetch/tests/fixtures/extractors/linkedin_post.html, crates/noxa-fetch/tests/fixtures/extractors/npm_downloads.json, crates/noxa-fetch/tests/fixtures/extractors/npm_registry.json, crates/noxa-fetch/tests/fixtures/extractors/product_page.html, crates/noxa-fetch/tests/fixtures/extractors/pypi.json, crates/noxa-fetch/tests/fixtures/extractors/reddit.json, crates/noxa-fetch/tests/fixtures/extractors/shopify_collection.json, crates/noxa-fetch/tests/fixtures/extractors/shopify_product.json, crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_answers.json, crates/noxa-fetch/tests/fixtures/extractors/stackoverflow_question.json, crates/noxa-fetch/tests/fixtures/extractors/substack_post.html, crates/noxa-fetch/tests/fixtures/extractors/trustpilot.html, crates/noxa-fetch/tests/fixtures/extractors/youtube_video.html, crates/noxa-mcp/README.md, crates/noxa-mcp/src/server.rs, crates/noxa-mcp/src/server/content_tools.rs, crates/noxa-mcp/src/tools.rs, crates/noxa-rag/src/mcp_bridge.rs, crates/noxa-rag/src/pipeline/parse/mod.rs, crates/noxa-rag/src/pipeline/parse/tests.rs, crates/noxa-rag/src/pipeline/process.rs, crates/noxa-rag/src/pipeline/scan.rs, crates/noxa-rag/src/pipeline/watcher.rs, crates/noxa-rag/src/store/qdrant/tests.rs, crates/noxa-store/src/content_store/enumerate.rs, crates/noxa-store/src/content_store/tests.rs, docs/CHANGELOG.md, docs/reports/live-extractor-cli-report-2026-04-26.md, docs/superpowers/plans/2026-04-26-full-upstream-extractor-parity-plan.md, docs/superpowers/specs/2026-04-26-full-upstream-extractor-parity-design.md
```rust
assert_eq!(entries.len(), 28);
assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
```
🧹 Nitpick | 🔵 Trivial
Hard-coded catalog count couples test to catalog size.
`assert_eq!(entries.len(), 28)` will fail any time a future PR adds (or temporarily removes) a vertical, even when behavior is correct. Consider asserting >= a baseline plus uniqueness of `name`, which is what the design doc actually requires.
```diff
- assert_eq!(entries.len(), 28);
- assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
+ assert!(entries.len() >= 28, "catalog shrank: {}", entries.len());
+ assert!(entries.iter().any(|entry| entry["name"] == "github_repo"));
+ let mut names: Vec<_> = entries.iter().map(|e| e["name"].as_str().unwrap()).collect();
+ names.sort();
+ let unique = names.len();
+ names.dedup();
+ assert_eq!(unique, names.len(), "duplicate extractor names in catalog");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/noxa-mcp/src/server.rs` around lines 1047 - 1048, The test currently
asserts a fixed catalog size with assert_eq!(entries.len(), 28), which couples
it to catalog growth; change this to assert that entries.len() is >= a baseline
(e.g., assert!(entries.len() >= 28)) and add assertions to enforce uniqueness of
names and required verticals: use a HashSet over entry["name"] (same entries
variable) to assert set.len() == entries.len() and keep the existing contains
check like entries.iter().any(|entry| entry["name"] == "github_repo") to verify
required items.
- Tighten vertical URL matchers and query-safe Shopify API URLs.
- Improve YouTube, Instagram, arXiv, GitHub release, and structured-data edge cases.
- Remove drift-prone extractor counts from docs/tests and document on-change shell behavior.
Summary
- Ports the upstream vertical extractor catalog into `noxa-fetch`, including `vertical_data` on `ExtractionResult`, static dispatch, safe URL auto-dispatch, and explicit extractor dispatch.
- CLI: `--list-extractors` and `--extractor <name>` for single URL and batch modes.
- MCP: optional `scrape.extractor` and a new `extractors` catalog tool.

Verification
- `cargo test --workspace`
- `cargo clippy --workspace --all-targets -- -D warnings`
- Live CLI sweep: docs/reports/live-extractor-cli-report-2026-04-26.md

Notes
`noxa-hv9` remains open for separate optional non-extractor parity items: Docker entrypoint shim, crawler scope knobs, and Safari iOS profile.

Summary by cubic
Ports full vertical extractor parity: site-specific extractors with catalog, static/auto/explicit dispatch, and `vertical_data` in results. Exposes extractors via CLI (`--list-extractors`, `--extractor`) and MCP (`scrape.extractor`, `extractors` tool).

New Features
- Rejects unsupported combinations of `--extractor` with `--stdin`/`--file`/`--cloud`.
- `scrape` accepts an optional `extractor`; new `extractors` catalog tool.
- Results include a `vertical_data` payload.

Bug Fixes
- Parses `--on-change` commands with `shlex` to prevent shell injection.
Summary by CodeRabbit
Release Notes
- `--list-extractors` CLI flag to display available vertical extractors.
- `--extractor` CLI flag to explicitly select a vertical extractor for URL processing.
- `extractors` MCP tool to retrieve the full catalog of available extractors.
- `scrape` MCP tool with optional `extractor` parameter for explicit vertical extraction.
- Extraction results carry a new `vertical_data` field.