Merged
Changes from 23 commits
Commits (28):

b3940ea fix(cli): remove shell injection from watch on-change command (jmagar, Apr 25, 2026)
04db34a fix(core): prevent UTF-8 panic in content-position recovery (jmagar, Apr 25, 2026)
1723da4 fix(core): recover JSON-LD with raw newline characters (jmagar, Apr 25, 2026)
cbfa4db fix(fetch): port upstream crawler and robots hardening (jmagar, Apr 25, 2026)
8b76353 fix(fetch): improve reddit fallback and verify-wall handling (jmagar, Apr 25, 2026)
b2e0af0 docs: specify full extractor parity port (jmagar, Apr 26, 2026)
be60533 docs: plan full extractor parity port (jmagar, Apr 26, 2026)
373753b docs: tighten extractor parity plan (jmagar, Apr 26, 2026)
e078113 feat(core): add vertical extractor payload (jmagar, Apr 26, 2026)
a1c1a91 feat(fetch): add vertical extractor catalog (jmagar, Apr 26, 2026)
cbf6f38 feat(fetch): port developer package extractors (jmagar, Apr 26, 2026)
b6f69c5 feat(fetch): port research and community extractors (jmagar, Apr 26, 2026)
69fb821 feat(fetch): port huggingface and social extractors (jmagar, Apr 26, 2026)
a93ed15 feat(fetch): expose reddit vertical extractor (jmagar, Apr 26, 2026)
169d867 feat(fetch): port ecommerce vertical extractors (jmagar, Apr 26, 2026)
c4af4c4 feat(fetch): wire vertical extractor dispatch (jmagar, Apr 26, 2026)
d6005c0 feat(fetch): port substack vertical extractor (jmagar, Apr 26, 2026)
e71f23d feat(cli): expose vertical extractors (jmagar, Apr 26, 2026)
3256515 feat(mcp): expose vertical extractors (jmagar, Apr 26, 2026)
e907f40 docs: document vertical extractor parity (jmagar, Apr 26, 2026)
d001dfa fix(fetch): decode arxiv XML attributes with workspace features (jmagar, Apr 26, 2026)
ea9d8f0 chore: satisfy workspace clippy (jmagar, Apr 26, 2026)
dfb0cdd docs: add live extractor CLI test report (jmagar, Apr 26, 2026)
77e7202 fix: address vertical extractor review feedback (jmagar, Apr 26, 2026)
fa6bbf2 style: format workspace (jmagar, Apr 26, 2026)
faa6390 fix: address extractor review edge cases (jmagar, Apr 26, 2026)
1342ae5 fix: satisfy current stable clippy (jmagar, Apr 26, 2026)
145b3ee fix: stabilize CI on current stable (jmagar, Apr 26, 2026)
3 changes: 3 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

36 changes: 31 additions & 5 deletions README.md
@@ -168,6 +168,24 @@ noxa https://example.com -f llm # Token-optimized for LLMs (67% fewer to
noxa example.com
```

### Vertical Extractors

Use site-specific extractors when you want structured payloads for known verticals such as GitHub, package registries, arXiv, YouTube, Reddit, Hugging Face, social posts, Substack, and ecommerce pages.

```bash
# List all 28 built-in extractors
noxa --list-extractors
noxa --list-extractors -f json

# Force a specific extractor for one URL
noxa --extractor github_repo https://github.com/jmagar/noxa -f json

# Works with batch mode too
noxa --extractor npm --urls-file npm-packages.txt -f json
```

Safe extractors auto-dispatch for matching URLs. Broad page extractors such as `substack_post`, `shopify_product`, `shopify_collection`, `ecommerce_product`, and `woocommerce_product` are explicit-only to avoid changing generic page extraction unexpectedly.

### Content Filtering

```bash
@@ -538,13 +556,13 @@ noxa ships as a Claude Code plugin that adds a skill (auto-activates on scrape/c

The plugin provides:
- **`noxa` skill** — auto-activates when you ask to scrape, crawl, extract, search, watch, or summarize URLs; covers all flag combinations and common recipes
- **MCP server** — all 10 tools available directly to Claude (`scrape`, `crawl`, `map`, `batch`, `extract`, `summarize`, `diff`, `brand`, `search`, `research`)
- **MCP server** — all 11 tools available directly to Claude (`scrape`, `extractors`, `crawl`, `map`, `batch`, `extract`, `summarize`, `diff`, `brand`, `search`, `research`)

Requires `noxa` on PATH. Run `noxa setup` after installing to configure everything.

---

## MCP Server — 10 tools for AI agents
## MCP Server — 11 tools for AI agents

<a href="https://glama.ai/mcp/servers/jmagar/noxa"><img src="https://glama.ai/mcp/servers/jmagar/noxa/badge" alt="noxa MCP server" /></a>

@@ -573,7 +591,8 @@ Then in Claude: *"Scrape the top 5 results for 'web scraping tools' and compare

| Tool | Description | Requires API key? |
|------|-------------|:-:|
| `scrape` | Extract content from any URL | No |
| `scrape` | Extract content from any URL; accepts optional `extractor` for vertical extraction | No |
| `extractors` | List available vertical extractors | No |
| `crawl` | Recursive site crawl | No |
| `map` | Discover URLs from sitemaps | No |
| `batch` | Parallel multi-URL extraction | No |
@@ -584,7 +603,7 @@ Then in Claude: *"Scrape the top 5 results for 'web scraping tools' and compare
| `search` | Web search + scrape results | `SEARXNG_URL`: No, cloud: Yes |
| `research` | Deep multi-source research | Yes |

9 of 10 tools work locally — no account, no API key, fully private.
10 of 11 tools work locally — no account, no API key, fully private.

---

@@ -607,6 +626,13 @@ noxa URL --exclude "nav, footer, .sidebar" # CSS selector exclude
noxa URL --only-main-content # Auto-detect main content
```

### Vertical extractors

```bash
noxa --list-extractors # Show all 28 extractors
noxa URL --extractor github_repo -f json # Force a named extractor
```

### Crawling

```bash
@@ -719,7 +745,7 @@ noxa/
noxa-fetch HTTP client + TLS fingerprinting (wreq/BoringSSL). Crawler. Batch ops.
noxa-llm LLM provider chain (Gemini CLI -> OpenAI -> Ollama -> Anthropic)
noxa-pdf PDF text extraction
noxa-mcp MCP server (10 tools for AI agents) → run via: noxa mcp
noxa-mcp MCP server (11 tools for AI agents) → run via: noxa mcp
noxa-rag RAG pipeline (TEI embeddings + Qdrant vector store) → binary: noxa-rag-daemon
noxa-cli CLI binary → binary: noxa
```
1 change: 1 addition & 0 deletions crates/noxa-cli/Cargo.toml
@@ -20,6 +20,7 @@ dotenvy = { workspace = true }
rand = "0.8"
serde_json = { workspace = true }
serde = { workspace = true }
shlex = "1.3"
tokio = { workspace = true }
clap = { workspace = true }
tracing = { workspace = true }
17 changes: 14 additions & 3 deletions crates/noxa-cli/src/app/batch.rs
@@ -12,9 +12,20 @@ pub(crate) async fn run_batch(

let urls: Vec<&str> = entries.iter().map(|(u, _)| u.as_str()).collect();
let options = build_extraction_options(resolved);
let results = client
.fetch_and_extract_batch_with_options(&urls, resolved.concurrency, &options)
.await;
let results = if let Some(ref extractor) = cli.extractor {
client
.fetch_and_extract_batch_vertical_with_options(
&urls,
resolved.concurrency,
extractor,
&options,
)
.await
} else {
client
.fetch_and_extract_batch_with_options(&urls, resolved.concurrency, &options)
.await
};

let ok = results.iter().filter(|r| r.result.is_ok()).count();
let errors = results.len() - ok;
8 changes: 8 additions & 0 deletions crates/noxa-cli/src/app/cli.rs
@@ -43,6 +43,14 @@ pub(crate) struct Cli {
#[arg(long)]
pub(crate) stdin: bool,

/// Use a specific vertical extractor (see --list-extractors)
#[arg(long)]
pub(crate) extractor: Option<String>,

/// List available vertical extractors and exit
#[arg(long)]
pub(crate) list_extractors: bool,

/// Include metadata in output (always included in JSON)
#[arg(long)]
pub(crate) metadata: bool,
14 changes: 11 additions & 3 deletions crates/noxa-cli/src/app/crawl_watch.rs
@@ -29,7 +29,11 @@ pub(crate) async fn run_crawl_watch() {
continue;
}
if let Ok(record) = read_crawl_status(&path) {
let key = path.file_stem().unwrap_or_default().to_string_lossy().into_owned();
let key = path
.file_stem()
.unwrap_or_default()
.to_string_lossy()
.into_owned();
seen.insert(key.clone(), record.phase);
if record.phase == CrawlStatusPhase::Done {
finished.insert(key.clone());
@@ -67,7 +71,11 @@
Err(_) => continue,
};

let key = path.file_stem().unwrap_or_default().to_string_lossy().into_owned();
let key = path
.file_stem()
.unwrap_or_default()
.to_string_lossy()
.into_owned();
keys_on_disk.insert(key.clone());

if finished.contains(&key) {
@@ -106,7 +114,7 @@
let prev_pct = prev_error_pct.get(&key).copied().unwrap_or(0);
let cooldown_ok = error_last_alerted
.get(&key)
.map_or(true, |t| t.elapsed() >= ALERT_COOLDOWN);
.is_none_or(|t| t.elapsed() >= ALERT_COOLDOWN);
if pct >= ERROR_RATE_THRESHOLD && pct_rounded > prev_pct && cooldown_ok {
println!(
"Crawl warning: {} — {}% error rate ({}/{} pages failed)",
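The `map_or(true, …)` to `is_none_or` change in the hunk above is a semantics-preserving clippy fix: the predicate holds vacuously when the `Option` is `None`. A minimal standalone sketch of the cooldown check (hypothetical values, not the crate's code):

```rust
use std::time::{Duration, Instant};

// `Option::is_none_or` (stable since Rust 1.82) replaces the older
// `map_or(true, …)` idiom: `None` means "never alerted", so the check passes.
fn cooldown_ok(last_alerted: Option<Instant>, cooldown: Duration) -> bool {
    last_alerted.is_none_or(|t| t.elapsed() >= cooldown)
}

fn main() {
    // Never alerted before: the cooldown check passes.
    assert!(cooldown_ok(None, Duration::from_secs(300)));
    // Alerted just now: a five-minute cooldown has not elapsed yet.
    assert!(!cooldown_ok(Some(Instant::now()), Duration::from_secs(300)));
}
```

Both spellings are equivalent; `is_none_or` just states the `None` case explicitly, which is why current stable clippy prefers it.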
53 changes: 52 additions & 1 deletion crates/noxa-cli/src/app/entry.rs
@@ -34,7 +34,10 @@ pub(crate) async fn run() {
return;
}

match (std::env::args().nth(1).as_deref(), std::env::args().nth(2).as_deref()) {
match (
std::env::args().nth(1).as_deref(),
std::env::args().nth(2).as_deref(),
) {
(Some("rag"), Some("start")) => {
run_rag_start();
return;
@@ -66,6 +69,16 @@

init_logging(resolved.verbose);

if cli.list_extractors {
print_extractor_catalog(&resolved.format);
return;
}

if let Some(reason) = unsupported_extractor_mode(&cli, &resolved) {
eprintln!("error: --extractor {reason}");
process::exit(1);
}

// Validate webhook URL early so any SSRF attempt is rejected before operations run.
if let Some(ref webhook_url) = cli.webhook
&& let Err(e) = validate_url(webhook_url).await
@@ -292,3 +305,41 @@
}
}
}

fn unsupported_extractor_mode(
cli: &Cli,
resolved: &config::ResolvedConfig,
) -> Option<&'static str> {
cli.extractor.as_ref()?;

if cli.stdin || cli.file.is_some() {
return Some("cannot be combined with --stdin or --file");
}
if cli.cloud {
return Some("cannot be combined with --cloud");
}
if resolved.raw_html {
return Some("cannot be combined with --raw-html");
}
if has_llm_flags(cli) {
return Some("cannot be combined with LLM extraction flags");
}
if cli.crawl || cli.map || cli.watch || cli.diff_with.is_some() || cli.brand {
return Some("only applies to single URL and batch scraping");
}
if cli.research.is_some()
|| cli.search.is_some()
|| cli.grep.is_some()
|| cli.list.is_some()
|| cli.status.is_some()
|| cli.refresh.is_some()
|| cli.retrieve.is_some()
|| cli.watch_crawls
|| cli.watch_rag
|| cli.watch_store
{
return Some("cannot be combined with this command mode");
}

None
}
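The guard above leans on `?` applied to an `Option`: `cli.extractor.as_ref()?;` returns `None` immediately when `--extractor` was not passed, so none of the mode checks run. A standalone sketch of that pattern (hypothetical `Flags` struct with a reduced field set; the real `Cli` has many more flags):

```rust
// Hypothetical, trimmed-down stand-in for the CLI struct.
#[derive(Default)]
struct Flags {
    extractor: Option<String>,
    stdin: bool,
    cloud: bool,
    crawl: bool,
}

fn unsupported_mode(flags: &Flags) -> Option<&'static str> {
    // `?` on `Option<&String>` exits with `None` when no --extractor was
    // given, so the incompatibility checks below only fire for extractor runs.
    flags.extractor.as_ref()?;
    if flags.stdin {
        return Some("cannot be combined with --stdin");
    }
    if flags.cloud {
        return Some("cannot be combined with --cloud");
    }
    if flags.crawl {
        return Some("only applies to single URL and batch scraping");
    }
    None
}

fn main() {
    // No extractor requested: every combination is fine.
    assert_eq!(unsupported_mode(&Flags::default()), None);
    // Extractor plus --cloud: rejected with a static reason string.
    let bad = Flags { extractor: Some("npm".into()), cloud: true, ..Flags::default() };
    assert_eq!(unsupported_mode(&bad), Some("cannot be combined with --cloud"));
}
```

Returning `Option<&'static str>` keeps the caller's error message allocation-free and makes "no conflict" the `None` case, matching how `entry.rs` consumes it.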
25 changes: 21 additions & 4 deletions crates/noxa-cli/src/app/fetching/extract.rs
@@ -6,6 +6,9 @@ pub(crate) async fn fetch_and_extract(
) -> Result<FetchOutput, String> {
// Local sources: read and extract as HTML
if cli.stdin {
if cli.extractor.is_some() {
return Err("--extractor cannot be combined with --stdin".to_string());
}
let mut buf = String::new();
io::stdin()
.read_to_string(&mut buf)
@@ -17,6 +20,9 @@
}

if let Some(ref path) = cli.file {
if cli.extractor.is_some() {
return Err("--extractor cannot be combined with --file".to_string());
}
let html =
std::fs::read_to_string(path).map_err(|e| format!("failed to read {path}: {e}"))?;
let options = build_extraction_options(resolved);
@@ -47,6 +53,9 @@

// --cloud: skip local, go straight to cloud API
if cli.cloud {
if cli.extractor.is_some() {
return Err("--extractor cannot be combined with --cloud".to_string());
}
let c = cloud_client.ok_or("--cloud requires NOXA_API_KEY (set via env or --api-key)")?;
let options = build_extraction_options(resolved);
let resp = c
@@ -65,10 +74,18 @@
let client = FetchClient::new(build_fetch_config(cli, resolved))
.map_err(|e| format!("client error: {e}"))?;
let options = build_extraction_options(resolved);
let result = client
.fetch_and_extract_with_options(url, &options)
.await
.map_err(|e| format!("fetch error: {e}"))?;
let result = if let Some(ref extractor) = cli.extractor {
client
.fetch_and_extract_vertical(url, extractor, &options)
.await
} else {
client.fetch_and_extract_with_options(url, &options).await
}
.map_err(|e| format!("fetch error: {e}"))?;

if cli.extractor.is_some() {
return Ok(FetchOutput::Local(Box::new(result)));
}

// Check if we should fall back to cloud
let reason = detect_empty(&result);
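The dispatch rewrite in `extract.rs` above applies `map_err` once to the whole `if let … else` expression rather than duplicating it in each arm, which works because both branches yield the same `Result` type. A minimal standalone sketch with toy fetchers (hypothetical function names, not the crate's API):

```rust
fn fetch_vertical(url: &str, name: &str) -> Result<String, String> {
    Ok(format!("{name} payload for {url}"))
}

fn fetch_generic(url: &str) -> Result<String, String> {
    Ok(format!("generic content for {url}"))
}

// `if let … else` is an expression, so one `map_err` covers whichever
// branch actually ran — no duplicated error-formatting code.
fn dispatch(url: &str, extractor: Option<&str>) -> Result<String, String> {
    if let Some(name) = extractor {
        fetch_vertical(url, name)
    } else {
        fetch_generic(url)
    }
    .map_err(|e| format!("fetch error: {e}"))
}

fn main() {
    let v = dispatch("https://example.com", Some("github_repo")).unwrap();
    assert!(v.starts_with("github_repo"));
    let g = dispatch("https://example.com", None).unwrap();
    assert!(g.starts_with("generic"));
}
```

The same shape appears in `batch.rs`, where the two branches call the batch variants instead of the single-URL ones.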
16 changes: 9 additions & 7 deletions crates/noxa-cli/src/app/mod.rs
@@ -29,12 +29,12 @@ mod cli;
mod crawl;
mod crawl_status;
mod crawl_watch;
mod diff_brand;
mod entry;
mod rag_daemon;
mod rag_watch;
mod store_watch;
mod watch_singleton;
mod diff_brand;
mod entry;
mod fetching {
pub(crate) mod config;
pub(crate) mod extract;
@@ -58,11 +58,8 @@ mod watch;
pub(crate) use batch::run_batch;
pub(crate) use cli::{Browser, Cli, OutputFormat, PdfModeArg};
pub(crate) use crawl::{run_crawl, run_map, spawn_crawl_background};
pub(crate) use crawl_watch::run_crawl_watch;
pub(crate) use rag_daemon::{run_rag_start, run_rag_stop};
pub(crate) use rag_watch::run_rag_watch;
pub(crate) use store_watch::run_store_watch;
pub(crate) use crawl_status::*;
pub(crate) use crawl_watch::run_crawl_watch;
pub(crate) use diff_brand::{run_brand, run_diff};
pub(crate) use entry::run;
pub(crate) use fetching::config::{
@@ -79,14 +76,19 @@ pub(crate) use formatting::{
};
pub(crate) use llm::{has_llm_flags, run_batch_llm, run_llm};
pub(crate) use logging::{build_ops_log, init_logging, init_mcp_logging, log_operation};
#[cfg(test)]
pub(crate) use printing::format_extractor_catalog;
pub(crate) use printing::{
print_batch_output, print_cloud_output, print_crawl_output, print_diff_output,
print_map_output, print_output,
print_extractor_catalog, print_map_output, print_output,
};
pub(crate) use rag_daemon::{run_rag_start, run_rag_stop};
pub(crate) use rag_watch::run_rag_watch;
pub(crate) use refresh::{run_refresh, run_status};
pub(crate) use research::run_research;
pub(crate) use retrieve::run_retrieve;
pub(crate) use store_ops::{run_grep, run_list, run_search};
pub(crate) use store_watch::run_store_watch;
pub(crate) use watch::{fire_webhook, run_watch};

#[cfg(test)]
29 changes: 29 additions & 0 deletions crates/noxa-cli/src/app/printing.rs
@@ -4,6 +4,35 @@ pub(crate) fn print_output(result: &ExtractionResult, format: &OutputFormat, sho
println!("{}", format_output(result, format, show_metadata));
}

pub(crate) fn print_extractor_catalog(format: &OutputFormat) {
println!("{}", format_extractor_catalog(format));
}

pub(crate) fn format_extractor_catalog(format: &OutputFormat) -> String {
let extractors = noxa_fetch::extractors::list();
match format {
OutputFormat::Json => {
serde_json::to_string_pretty(&extractors).expect("serialization failed")
}
_ => {
let mut out = String::new();
for extractor in extractors {
out.push_str(extractor.name);
out.push_str(" - ");
out.push_str(extractor.label);
out.push('\n');
out.push_str(" ");
out.push_str(extractor.description);
out.push('\n');
out.push_str(" patterns: ");
out.push_str(&extractor.url_patterns.join(", "));
out.push_str("\n\n");
}
out.trim_end().to_string()
}
}
}

/// Print cloud API response in the requested format.
pub(crate) fn print_cloud_output(resp: &serde_json::Value, format: &OutputFormat) {
match format {
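`format_extractor_catalog` above renders each catalog entry as a name/label header, an indented description, and a `patterns:` line, trimming the trailing blank line. A self-contained sketch of that text layout (hypothetical `ExtractorInfo` shape; the real type lives in `noxa_fetch::extractors`):

```rust
// Hypothetical mirror of the catalog record used by the CLI printer.
struct ExtractorInfo {
    name: &'static str,
    label: &'static str,
    description: &'static str,
    url_patterns: &'static [&'static str],
}

fn format_text(extractors: &[ExtractorInfo]) -> String {
    let mut out = String::new();
    for e in extractors {
        // "<name> - <label>", indented description, then joined URL patterns.
        out.push_str(&format!(
            "{} - {}\n  {}\n  patterns: {}\n\n",
            e.name,
            e.label,
            e.description,
            e.url_patterns.join(", ")
        ));
    }
    out.trim_end().to_string() // drop the final blank separator line
}

fn main() {
    let cat = [ExtractorInfo {
        name: "github_repo",
        label: "GitHub repository",
        description: "Structured repo metadata",
        url_patterns: &["github.com/*/*"],
    }];
    let text = format_text(&cat);
    assert!(text.starts_with("github_repo - GitHub repository"));
    assert!(text.ends_with("patterns: github.com/*/*"));
}
```

Splitting the formatting into a pure function that returns a `String` (as the PR does with `format_extractor_catalog`) is what lets the `#[cfg(test)]` re-export in `mod.rs` exercise it without capturing stdout.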