From 71434cec940208ff53882f7befc3823ef8bec797 Mon Sep 17 00:00:00 2001
From: anil-bd
Date: Mon, 25 May 2026 10:43:50 +0200
Subject: [PATCH 1/2] docs(scraper-studio): map the 4 marketing scraper types (PDP / Discovery / Discovery+PDP / Search) to the 2 CLI commands
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Scraper Studio product page and the YouTube creator brief describe the product as "four scraper types": PDP, Discovery, Discovery + PDP, and Search. The CLI exposes two commands (`bdata scraper create` and `bdata scraper run`) — the "type" is a property of the description, not a flag. Users who arrive from the brief or the marketing page (and content creators recording demos) look for `--type discovery` and don't find it.

This adds references/four-scraper-types.md as the explicit bridge:

- Quick-reference table mapping each brief-type to URL pattern + prompt shape + which collector(s) you need
- End-to-end shell example for each of the 4 types
- For each type: the common mistake that produces brittle templates
- Disambiguator for Type 4 (site-scoped search) vs. the separate bdata search command (web SERP)
- Cross-links into prompts.md, recipes.md, api-flow.md, and data-feeds

Also adds one row to SKILL.md's "Pick your path" table routing brief / marketing-term users into the new reference.

Refs: docs/audit DX issue N11 (see scraper-studio-cli-demo/ISSUES.md)
---
 skills/scraper-studio/SKILL.md                |   1 +
 .../references/four-scraper-types.md          | 123 ++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 skills/scraper-studio/references/four-scraper-types.md

diff --git a/skills/scraper-studio/SKILL.md b/skills/scraper-studio/SKILL.md
index be131a1..91c133b 100644
--- a/skills/scraper-studio/SKILL.md
+++ b/skills/scraper-studio/SKILL.md
@@ -33,6 +33,7 @@ Halt and route to setup if either check fails. Both commands require an authenti
 | User describes data they want from a URL, no scraper exists yet | `bdata scraper create ""` → save the `collector_id` |
 | User has a `collector_id` and wants data from a URL | `bdata scraper run ` (default async + poll) |
 | Page is small and you want fast feedback (≤ ~50 s) | `bdata scraper run … --sync` |
+| User uses brief / marketing terms: PDP, Discovery, Discovery + PDP, Search | see [references/four-scraper-types.md](references/four-scraper-types.md) to map to `create`/`run` |
 | Site is a known platform (Amazon, LinkedIn, TikTok, …) | **stop — use `data-feeds` skill** |
 | You want SERP / discovery, not extraction | **use `search` skill** |
 | You want a one-off raw page fetch | **use `scrape` skill** |
diff --git a/skills/scraper-studio/references/four-scraper-types.md b/skills/scraper-studio/references/four-scraper-types.md
new file mode 100644
index 0000000..aa2df3d
--- /dev/null
+++ b/skills/scraper-studio/references/four-scraper-types.md
@@ -0,0 +1,123 @@
+# Four scraper types — map from the brief to the CLI
+
+The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag — it is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
+
+This page is the bridge. Pick the row that matches the user's intent; use the exact prompt + run pattern from that row.
+
+## At a glance
+
+| Brief calls it | Input URL pattern | What gets returned | Same `create`+`run` commands |
+|---|---|---|---|
+| **PDP** | One product page URL (`/p/123` or `/dp/B0...`) | One object: fields of that product | `create` against one product, `run` against any product on same template |
+| **Discovery** | A category / listing URL (`/c/baby` or `/companies?batch=W26`) | Array of cards (title, link, price, snippet) | `create` against the listing, `run` against any listing on same template |
+| **Discovery + PDP** | Same as Discovery, then feed each link back into a PDP collector | Array of deep objects (one full PDP per link) | Two collectors chained: Discovery for the links, PDP for the depth |
+| **Search** | A search-results URL with `?q=` or `?query=` | Array of result cards | `create` against the search URL, `run` with a different query |
+
+## Type 1 — PDP (Product / Detail Page)
+
+**When:** the user names a single canonical URL pattern (an Amazon `/dp/...`, a Y Combinator `/companies/`, a Zillow listing) and wants its **fields**.
+
+**Create prompt shape:**
+```
+"Extract the following fields from this product / detail page:
+  - :
+  - :
+  - …"
+```
+
+**End-to-end:**
+```bash
+# Build (5–10 min)
+bdata scraper create https://news.ycombinator.com/item?id=39000000 \
+  "Extract from this Hacker News item: title, url, points, author,
+   submission_time_iso, comment_count, top_comment_text."
+
+# Run on any item using the same template
+bdata scraper run c_xxx https://news.ycombinator.com/item?id=39001234 --pretty
+```
+
+**Common mistake:** asking for "everything on the page" — the AI will pick arbitrary fields that change between runs. Always enumerate.
+
+## Type 2 — Discovery (listing / index)
+
+**When:** the user has a category, batch, leaderboard, or directory URL and wants the **list of cards**, not the deep object behind each card.
+
+**Create prompt shape:**
+```
+"For each item card on this listing page, extract:
+  - :
+  - link: the URL the card points to
+  - …
+Return one array element per card."
+```
+
+**End-to-end:**
+```bash
+bdata scraper create https://www.ycombinator.com/companies?batch=W26 \
+  "For each company card on this page, extract name, vertical, one-line
+   tagline, batch (e.g. W26), and link to the company profile.
+   Return one array element per card."
+
+bdata scraper run c_yyy https://www.ycombinator.com/companies?batch=S25 \
+  --pretty -o s25.json
+```
+
+**Common mistake:** writing a single-object description against a listing URL. The AI may scrape one random card or smash all cards into one row. Always say "for each card" + "return one element per card".
+
+## Type 3 — Discovery + PDP (combo, the production workflow)
+
+**When:** the user wants every item on a listing, **deeply scraped** — the listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
+
+**Two-step pattern (one Discovery collector + one PDP collector):**
+```bash
+# 1. Run the Discovery collector to get the links (Type 2)
+bdata scraper run c_yyy https://www.ycombinator.com/companies?industry=ai \
+  --json | jq -r '.[].link' > ai-companies.txt
+# → ai-companies.txt has one URL per line
+
+# 2. Batch-run the PDP collector against the link list (Type 1 template)
+bdata scraper run c_xxx --input-file ai-companies.txt -o ai-deep.json
+```
+
+**Why two collectors, not one:** keep concerns separate. The Discovery collector handles list-page DOM; the PDP collector handles detail-page DOM. They evolve independently when either page redesigns.
+
+**Common mistake:** trying to teach one collector both shapes ("scrape the listing AND each item"). The AI Flow generates better, more stable templates when each collector has one job.
+
+## Type 4 — Search (keyword-driven)
+
+**When:** the user starts with a **keyword**, not a URL. The pattern is: a real site that exposes search results via a URL query parameter (`?q=`, `?query=`, `?s=`).
+
+**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery) — for each result card, extract fields.
+
+**Create prompt shape:**
+```
+"This is a search results page. For each result card, extract .
+Treat the page as paginated; only scrape what is rendered on this page."
+```
+
+**End-to-end:**
+```bash
+# Build against ONE search URL — the template generalizes to any keyword
+bdata scraper create "https://www.ycombinator.com/companies?query=agents" \
+  "For each company card returned by this search, extract name, vertical,
+   tagline, batch, profile link. Treat as paginated search results."
+
+# Run with any other keyword by swapping the query string
+bdata scraper run c_zzz "https://www.ycombinator.com/companies?query=robotics" \
+  --pretty -o robotics-search.json
+```
+
+**Not to be confused with `bdata search`:** that command is a separate product (SERP API) that searches the *whole web* via Google / Bing / Yandex. The "Search" scraper type here is **site-scoped** search, scraped from the target site's own search-results page.
+
+## Why the "type" is a prompt shape, not a flag
+
+The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both — it just returns different shapes (object vs array).
+
+If you find yourself wanting a `--type discovery` flag, the answer is: be explicit in the description. "For each card on this page, extract …" is the Discovery signal. "Extract these fields from this page" is the PDP signal.
+
+## Cross-references
+
+- Prompt patterns per type (more examples): [`prompts.md`](prompts.md)
+- Recipes for each end-to-end flow: [`recipes.md`](recipes.md)
+- The raw API endpoints behind each type: [`api-flow.md`](api-flow.md)
+- Pre-built scrapers for Amazon, LinkedIn, etc. (use **instead** of Discovery+PDP when available): [`../../data-feeds/SKILL.md`](../../data-feeds/SKILL.md)

From ff6a3a9c59263cd0e491e24b2aa437c04e2bd173 Mon Sep 17 00:00:00 2001
From: anil-bd
Date: Mon, 25 May 2026 10:51:42 +0200
Subject: [PATCH 2/2] docs(scraper-studio): em-dash sweep on four-scraper-types reference

Bright Data house style (per bright-dx-writer rubric and internal docs guide) bans em-dashes. Use commas, colons, or new sentences.

This sweeps 11 em-dashes from the file added in the prior commit. Prose content unchanged; only punctuation. No code-block edits.

Self-caught in a second-pass critique using the bright-dx-writer skill.
---
 .../references/four-scraper-types.md          | 22 +++++++++----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/skills/scraper-studio/references/four-scraper-types.md b/skills/scraper-studio/references/four-scraper-types.md
index aa2df3d..6d50b27 100644
--- a/skills/scraper-studio/references/four-scraper-types.md
+++ b/skills/scraper-studio/references/four-scraper-types.md
@@ -1,6 +1,6 @@
-# Four scraper types — map from the brief to the CLI
+# Four scraper types, map from the brief to the CLI
 
-The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag — it is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
+The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag; it is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
 
 This page is the bridge. Pick the row that matches the user's intent; use the exact prompt + run pattern from that row.
 
@@ -13,7 +13,7 @@ This page is the bridge. Pick the row that matches the user's intent; use the ex
 | **Discovery + PDP** | Same as Discovery, then feed each link back into a PDP collector | Array of deep objects (one full PDP per link) | Two collectors chained: Discovery for the links, PDP for the depth |
 | **Search** | A search-results URL with `?q=` or `?query=` | Array of result cards | `create` against the search URL, `run` with a different query |
 
-## Type 1 — PDP (Product / Detail Page)
+## Type 1, PDP (Product / Detail Page)
 
 **When:** the user names a single canonical URL pattern (an Amazon `/dp/...`, a Y Combinator `/companies/`, a Zillow listing) and wants its **fields**.
 
@@ -36,9 +36,9 @@ bdata scraper create https://news.ycombinator.com/item?id=39000000 \
 bdata scraper run c_xxx https://news.ycombinator.com/item?id=39001234 --pretty
 ```
 
-**Common mistake:** asking for "everything on the page" — the AI will pick arbitrary fields that change between runs. Always enumerate.
+**Common mistake:** asking for "everything on the page": the AI will pick arbitrary fields that change between runs. Always enumerate.
 
-## Type 2 — Discovery (listing / index)
+## Type 2, Discovery (listing / index)
 
 **When:** the user has a category, batch, leaderboard, or directory URL and wants the **list of cards**, not the deep object behind each card.
 
@@ -64,9 +64,9 @@ bdata scraper run c_yyy https://www.ycombinator.com/companies?batch=S25 \
 
 **Common mistake:** writing a single-object description against a listing URL. The AI may scrape one random card or smash all cards into one row. Always say "for each card" + "return one element per card".
 
-## Type 3 — Discovery + PDP (combo, the production workflow)
+## Type 3, Discovery + PDP (combo, the production workflow)
 
-**When:** the user wants every item on a listing, **deeply scraped** — the listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
+**When:** the user wants every item on a listing, **deeply scraped**: the listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
 
 **Two-step pattern (one Discovery collector + one PDP collector):**
 ```bash
@@ -83,11 +83,11 @@ bdata scraper run c_xxx --input-file ai-companies.txt -o ai-deep.json
 
 **Common mistake:** trying to teach one collector both shapes ("scrape the listing AND each item"). The AI Flow generates better, more stable templates when each collector has one job.
 
-## Type 4 — Search (keyword-driven)
+## Type 4, Search (keyword-driven)
 
 **When:** the user starts with a **keyword**, not a URL. The pattern is: a real site that exposes search results via a URL query parameter (`?q=`, `?query=`, `?s=`).
 
-**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery) — for each result card, extract fields.
+**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery): for each result card, extract fields.
 
 **Create prompt shape:**
 ```
@@ -97,7 +97,7 @@ Treat the page as paginated; only scrape what is rendered on this page."
 
 **End-to-end:**
 ```bash
-# Build against ONE search URL — the template generalizes to any keyword
+# Build against ONE search URL; the template generalizes to any keyword
 bdata scraper create "https://www.ycombinator.com/companies?query=agents" \
   "For each company card returned by this search, extract name, vertical,
    tagline, batch, profile link. Treat as paginated search results."
@@ -111,7 +111,7 @@ bdata scraper run c_zzz "https://www.ycombinator.com/companies?query=robotics" \
 
 ## Why the "type" is a prompt shape, not a flag
 
-The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both — it just returns different shapes (object vs array).
+The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both; it just returns different shapes (object vs array).
 
 If you find yourself wanting a `--type discovery` flag, the answer is: be explicit in the description. "For each card on this page, extract …" is the Discovery signal. "Extract these fields from this page" is the PDP signal.
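
Note on the Type 4 templating step ("Build the URL by templating the keyword into the site's search query string"): keywords that contain spaces or reserved characters need percent-encoding before they go into the query string. The following is a minimal sketch of that step, assuming ASCII-only keywords and a POSIX-compatible shell. The helper names `urlencode` and `search_url` are hypothetical and are not part of the `bdata` CLI; the base URL and the `query` parameter are taken from the Type 4 example.

```bash
# Hypothetical helpers, not part of the bdata CLI.
# Assumption: the keyword is plain ASCII.

# Percent-encode every character outside the unreserved URL set.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;                    # unreserved: copy as-is
      *) out=$out$(printf '%%%02X' "'$c") ;;            # anything else: %XX
    esac
  done
  printf '%s' "$out"
}

# Build a search URL from a base URL, a query parameter name, and a keyword.
search_url() {
  printf '%s?%s=%s\n' "$1" "$2" "$(urlencode "$3")"
}

search_url "https://www.ycombinator.com/companies" query "ai agents"
# prints: https://www.ycombinator.com/companies?query=ai%20agents
```

The printed URL would then be passed, quoted, as the URL argument in the Type 4 `run` example. Non-ASCII keywords would need byte-wise UTF-8 encoding, which this sketch does not attempt.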