From 71434cec940208ff53882f7befc3823ef8bec797 Mon Sep 17 00:00:00 2001
From: anil-bd <anil@brightdata.com>
Date: Mon, 25 May 2026 10:43:50 +0200
Subject: [PATCH 1/2] docs(scraper-studio): map the 4 marketing scraper types
 (PDP / Discovery / Discovery+PDP / Search) to the 2 CLI commands
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Scraper Studio product page and the YouTube creator brief describe the
product as "four scraper types": PDP, Discovery, Discovery + PDP, and
Search. The CLI exposes two commands (`bdata scraper create` and
`bdata scraper run`) — the "type" is a property of the description, not a
flag. Users who arrive from the brief or the marketing page (and content
creators recording demos) look for `--type discovery` and don't find it.

This adds references/four-scraper-types.md as the explicit bridge:

- Quick-reference table mapping each brief type to its input URL pattern,
  what gets returned, and which collector(s) and commands you need
- End-to-end shell example for each of the 4 types
- For each type: the common mistake that produces brittle templates
- Disambiguator for Type 4 (site-scoped search) vs. the separate
  bdata search command (web SERP)
- Cross-links into prompts.md, recipes.md, api-flow.md, and data-feeds

Also adds one row to SKILL.md's "Pick your path" table routing
brief / marketing-term users into the new reference.

Refs: docs/audit DX issue N11 (see scraper-studio-cli-demo/ISSUES.md)
---
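Note for reviewers (outside the commit message): the Type 4 section added
below says to build the search URL by templating the keyword into the site's
search query string. A minimal sketch of that step follows, assuming ASCII
keywords. `urlencode` and `build_search_url` are hypothetical helper names,
not bdata commands, and `c_zzz` is the placeholder collector id used in the doc.

```bash
# Hypothetical helper (not part of the bdata CLI): percent-encode a keyword.
# Assumption: ASCII input; non-ASCII text needs byte-wise handling instead.
urlencode() {
  _s=$1
  _out=
  while [ -n "$_s" ]; do
    _rest=${_s#?}            # everything after the first character
    _c=${_s%"$_rest"}        # the first character itself
    case $_c in
      [a-zA-Z0-9.~_-]) _out=$_out$_c ;;                   # unreserved: keep as-is
      *) _out=$_out$(printf '%%%02X' "'$_c") ;;           # everything else: %XX
    esac
    _s=$_rest
  done
  printf '%s\n' "$_out"
}

# Hypothetical helper: base URL + query parameter name + raw keyword.
build_search_url() {
  printf '%s?%s=%s\n' "$1" "$2" "$(urlencode "$3")"
}

# Usage (assumes c_zzz was returned by an earlier `bdata scraper create`):
#   bdata scraper run c_zzz \
#     "$(build_search_url https://www.ycombinator.com/companies query 'ai agents')" --pretty
```

Keeping the encoding in one helper means the resulting URL is always passed to
`run` as a single quoted argument, whatever characters the keyword contains.
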
 skills/scraper-studio/SKILL.md                |   1 +
 .../references/four-scraper-types.md          | 123 ++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 skills/scraper-studio/references/four-scraper-types.md
diff --git a/skills/scraper-studio/SKILL.md b/skills/scraper-studio/SKILL.md
index be131a1..91c133b 100644
--- a/skills/scraper-studio/SKILL.md
+++ b/skills/scraper-studio/SKILL.md
@@ -33,6 +33,7 @@ Halt and route to setup if either check fails. Both commands require an authenti
 | User describes data they want from a URL, no scraper exists yet | `bdata scraper create <url> "<description>"` → save the `collector_id` |
 | User has a `collector_id` and wants data from a URL | `bdata scraper run <collector_id> <url>` (default async + poll) |
 | Page is small and you want fast feedback (≤ ~50 s) | `bdata scraper run … --sync` |
+| User uses brief / marketing terms: PDP, Discovery, Discovery + PDP, Search | see [references/four-scraper-types.md](references/four-scraper-types.md) to map to `create`/`run` |
 | Site is a known platform (Amazon, LinkedIn, TikTok, …) | **stop — use `data-feeds` skill** |
 | You want SERP / discovery, not extraction | **use `search` skill** |
 | You want a one-off raw page fetch | **use `scrape` skill** |
diff --git a/skills/scraper-studio/references/four-scraper-types.md b/skills/scraper-studio/references/four-scraper-types.md
new file mode 100644
index 0000000..aa2df3d
--- /dev/null
+++ b/skills/scraper-studio/references/four-scraper-types.md
@@ -0,0 +1,123 @@
+# Four scraper types — map from the brief to the CLI
+
+The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag — it is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
+
+This page is the bridge. Pick the row that matches the user's intent; use the exact prompt + run pattern from that row.
+
+## At a glance
+
+| Brief calls it | Input URL pattern | What gets returned | How `create` + `run` are used |
+|---|---|---|---|
+| **PDP** | One product page URL (`/p/123` or `/dp/B0...`) | One object: fields of that product | `create` against one product, `run` against any product on same template |
+| **Discovery** | A category / listing URL (`/c/baby` or `/companies?batch=W26`) | Array of cards (title, link, price, snippet) | `create` against the listing, `run` against any listing on same template |
+| **Discovery + PDP** | Same as Discovery, then feed each link back into a PDP collector | Array of deep objects (one full PDP per link) | Two collectors chained: Discovery for the links, PDP for the depth |
+| **Search** | A search-results URL with `?q=` or `?query=` | Array of result cards | `create` against the search URL, `run` with a different query |
+
+## Type 1 — PDP (Product / Detail Page)
+
+**When:** the user names a single canonical URL pattern (an Amazon `/dp/...`, a Y Combinator `/companies/<name>`, a Zillow listing) and wants its **fields**.
+
+**Create prompt shape:**
+```
+"Extract the following fields from this product / detail page:
+ - <field_1>: <one-sentence semantic, with disambiguator>
+ - <field_2>: <one-sentence semantic, with disambiguator>
+ - …"
+```
+
+**End-to-end:**
+```bash
+# Build (5–10 min)
+bdata scraper create https://news.ycombinator.com/item?id=39000000 \
+    "Extract from this Hacker News item: title, url, points, author,
+     submission_time_iso, comment_count, top_comment_text."
+
+# Run on any item using the same template
+bdata scraper run c_xxx https://news.ycombinator.com/item?id=39001234 --pretty
+```
+
+**Common mistake:** asking for "everything on the page" — the AI will pick arbitrary fields that change between runs. Always enumerate.
+
+## Type 2 — Discovery (listing / index)
+
+**When:** the user has a category, batch, leaderboard, or directory URL and wants the **list of cards**, not the deep object behind each card.
+
+**Create prompt shape:**
+```
+"For each item card on this listing page, extract:
+ - <field_1>: <semantic>
+ - link: the URL the card points to
+ - …
+Return one array element per card."
+```
+
+**End-to-end:**
+```bash
+bdata scraper create "https://www.ycombinator.com/companies?batch=W26" \
+    "For each company card on this page, extract name, vertical, one-line
+     tagline, batch (e.g. W26), and link to the company profile.
+     Return one array element per card."
+
+bdata scraper run c_yyy "https://www.ycombinator.com/companies?batch=S25" \
+    --pretty -o s25.json
+```
+
+**Common mistake:** writing a single-object description against a listing URL. The AI may scrape one random card or smash all cards into one row. Always say "for each card" + "return one element per card".
+
+## Type 3 — Discovery + PDP (combo, the production workflow)
+
+**When:** the user wants every item on a listing, **deeply scraped** — the listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
+
+**Two-step pattern (one Discovery collector + one PDP collector):**
+```bash
+# 1. Run the Discovery collector to get the links (Type 2)
+bdata scraper run c_yyy "https://www.ycombinator.com/companies?industry=ai" \
+    --json | jq -r '.[].link' > ai-companies.txt
+# → ai-companies.txt has one URL per line
+
+# 2. Batch-run a PDP collector built on a company profile page (Type 1 template) against the link list
+bdata scraper run c_xxx --input-file ai-companies.txt -o ai-deep.json
+```
+
+**Why two collectors, not one:** keep concerns separate. The Discovery collector handles list-page DOM; the PDP collector handles detail-page DOM. Each can be updated independently when its page is redesigned.
+
+**Common mistake:** trying to teach one collector both shapes ("scrape the listing AND each item"). The AI Flow generates better, more stable templates when each collector has one job.
+
+## Type 4 — Search (keyword-driven)
+
+**When:** the user starts with a **keyword**, not a URL. The pattern is: a real site that exposes search results via a URL query parameter (`?q=`, `?query=`, `?s=`).
+
+**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery) — for each result card, extract fields.
+
+**Create prompt shape:**
+```
+"This is a search results page. For each result card, extract <fields>.
+Treat the page as paginated; only scrape what is rendered on this page."
+```
+
+**End-to-end:**
+```bash
+# Build against ONE search URL — the template generalizes to any keyword
+bdata scraper create "https://www.ycombinator.com/companies?query=agents" \
+    "For each company card returned by this search, extract name, vertical,
+     tagline, batch, profile link. Treat as paginated search results."
+
+# Run with any other keyword by swapping the query string
+bdata scraper run c_zzz "https://www.ycombinator.com/companies?query=robotics" \
+    --pretty -o robotics-search.json
+```
+
+**Not to be confused with `bdata search`:** that command is a separate product (SERP API) that searches the *whole web* via Google / Bing / Yandex. The "Search" scraper type here is **site-scoped** search, scraped from the target site's own search-results page.
+
+## Why the "type" is a prompt shape, not a flag
+
+The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/<name>` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both — it just returns different shapes (object vs array).
+
+If you find yourself wanting a `--type discovery` flag, the answer is: be explicit in the description. "For each card on this page, extract …" is the Discovery signal. "Extract these fields from this page" is the PDP signal.
+
+## Cross-references
+
+- Prompt patterns per type (more examples): [`prompts.md`](prompts.md)
+- Recipes for each end-to-end flow: [`recipes.md`](recipes.md)
+- The raw API endpoints behind each type: [`api-flow.md`](api-flow.md)
+- Pre-built scrapers for Amazon, LinkedIn, etc. (use **instead** of Discovery+PDP when available): [`../../data-feeds/SKILL.md`](../../data-feeds/SKILL.md)

From ff6a3a9c59263cd0e491e24b2aa437c04e2bd173 Mon Sep 17 00:00:00 2001
From: anil-bd <anil@brightdata.com>
Date: Mon, 25 May 2026 10:51:42 +0200
Subject: [PATCH 2/2] docs(scraper-studio): em-dash sweep on four-scraper-types
 reference

Bright Data house style (per bright-dx-writer rubric and internal docs
guide) bans em-dashes. Use commas, colons, or new sentences.

This sweeps 11 em-dashes from the file added in the prior commit. Wording
is unchanged; only punctuation differs. The one change inside a code block
is a shell comment; no commands are edited.

Self-caught in a second-pass critique using the bright-dx-writer skill.
---
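Note for reviewers (outside the commit message): one way to double-check the
sweep is to count the U+2014 characters left in the file. A sketch, assuming a
UTF-8 encoded file and a `grep` that supports `-o` (GNU, BSD, and BusyBox do);
`count_em_dashes` is a hypothetical helper name.

```bash
# Hypothetical helper: print how many em-dash characters (U+2014) a file contains.
count_em_dashes() {
  _dash=$(printf '\342\200\224')                            # U+2014 as UTF-8 bytes
  grep -o -e "$_dash" -- "$1" | wc -l | tr -d ' '           # one output line per match
}

# Usage; this would be expected to print 0 once the sweep is complete:
#   count_em_dashes skills/scraper-studio/references/four-scraper-types.md
```

The count covers the whole file, fenced code blocks included, which matches the
scope of this commit.
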
 .../references/four-scraper-types.md          | 22 +++++++++----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/skills/scraper-studio/references/four-scraper-types.md b/skills/scraper-studio/references/four-scraper-types.md
index aa2df3d..6d50b27 100644
--- a/skills/scraper-studio/references/four-scraper-types.md
+++ b/skills/scraper-studio/references/four-scraper-types.md
@@ -1,6 +1,6 @@
-# Four scraper types — map from the brief to the CLI
+# Four scraper types: map from the brief to the CLI
 
-The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag — it is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
+The Scraper Studio product page (and the YouTube creator brief) describes **four scraper types**: **PDP**, **Discovery**, **Discovery + PDP**, and **Search**. The CLI has **two commands** (`bdata scraper create` and `bdata scraper run`). The "type" is not a flag. It is the shape of the description you pass to `create` and the URL pattern you pass to `run`.
 
 This page is the bridge. Pick the row that matches the user's intent; use the exact prompt + run pattern from that row.
 
@@ -13,7 +13,7 @@ This page is the bridge. Pick the row that matches the user's intent; use the ex
 | **Discovery + PDP** | Same as Discovery, then feed each link back into a PDP collector | Array of deep objects (one full PDP per link) | Two collectors chained: Discovery for the links, PDP for the depth |
 | **Search** | A search-results URL with `?q=` or `?query=` | Array of result cards | `create` against the search URL, `run` with a different query |
 
-## Type 1 — PDP (Product / Detail Page)
+## Type 1: PDP (Product / Detail Page)
 
 **When:** the user names a single canonical URL pattern (an Amazon `/dp/...`, a Y Combinator `/companies/<name>`, a Zillow listing) and wants its **fields**.
 
@@ -36,9 +36,9 @@ bdata scraper create https://news.ycombinator.com/item?id=39000000 \
 bdata scraper run c_xxx https://news.ycombinator.com/item?id=39001234 --pretty
 ```
 
-**Common mistake:** asking for "everything on the page" — the AI will pick arbitrary fields that change between runs. Always enumerate.
+**Common mistake:** asking for "everything on the page". The AI will pick arbitrary fields that change between runs. Always enumerate.
 
-## Type 2 — Discovery (listing / index)
+## Type 2: Discovery (listing / index)
 
 **When:** the user has a category, batch, leaderboard, or directory URL and wants the **list of cards**, not the deep object behind each card.
 
@@ -64,9 +64,9 @@ bdata scraper run c_yyy https://www.ycombinator.com/companies?batch=S25 \
 
 **Common mistake:** writing a single-object description against a listing URL. The AI may scrape one random card or smash all cards into one row. Always say "for each card" + "return one element per card".
 
-## Type 3 — Discovery + PDP (combo, the production workflow)
+## Type 3: Discovery + PDP (combo, the production workflow)
 
-**When:** the user wants every item on a listing, **deeply scraped** — the listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
+**When:** the user wants every item on a listing, **deeply scraped**. The listing only gives summaries; you need full PDP fields for each. This is the canonical real-world pattern, and what Scraper Studio's batch endpoint is built for.
 
 **Two-step pattern (one Discovery collector + one PDP collector):**
 ```bash
@@ -83,11 +83,11 @@ bdata scraper run c_xxx --input-file ai-companies.txt -o ai-deep.json
 
 **Common mistake:** trying to teach one collector both shapes ("scrape the listing AND each item"). The AI Flow generates better, more stable templates when each collector has one job.
 
-## Type 4 — Search (keyword-driven)
+## Type 4: Search (keyword-driven)
 
 **When:** the user starts with a **keyword**, not a URL. The pattern is: a real site that exposes search results via a URL query parameter (`?q=`, `?query=`, `?s=`).
 
-**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery) — for each result card, extract fields.
+**The trick:** Scraper Studio expects a URL. Build the URL by templating the keyword into the site's search query string. Treat the search results page as Type 2 (Discovery): for each result card, extract fields.
 
 **Create prompt shape:**
 ```
@@ -97,7 +97,7 @@ Treat the page as paginated; only scrape what is rendered on this page."
 
 **End-to-end:**
 ```bash
-# Build against ONE search URL — the template generalizes to any keyword
+# Build against ONE search URL. The template generalizes to any keyword
 bdata scraper create "https://www.ycombinator.com/companies?query=agents" \
     "For each company card returned by this search, extract name, vertical,
      tagline, batch, profile link. Treat as paginated search results."
@@ -111,7 +111,7 @@ bdata scraper run c_zzz "https://www.ycombinator.com/companies?query=robotics" \
 
 ## Why the "type" is a prompt shape, not a flag
 
-The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/<name>` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both — it just returns different shapes (object vs array).
+The CLI only has `create` and `run` because the Scraper Studio AI Flow infers the page shape from the URL + description. A single-URL description against `/companies/<name>` produces a PDP template; an "for each card" description against `/companies?batch=W26` produces a Discovery template. The same `run` command works for both. It just returns different shapes (object vs array).
 
 If you find yourself wanting a `--type discovery` flag, the answer is: be explicit in the description. "For each card on this page, extract …" is the Discovery signal. "Extract these fields from this page" is the PDP signal.