diff --git a/skills/scraper-studio/SKILL.md b/skills/scraper-studio/SKILL.md
index 06d285d..29c4fba 100644
--- a/skills/scraper-studio/SKILL.md
+++ b/skills/scraper-studio/SKILL.md
@@ -310,15 +310,11 @@ For more end-to-end recipes (batch run loops, error recovery, web-UI handoff), s
 
 8. **Vague descriptions in `create`.** A description like "scrape the page" produces a generic scraper. Name every field, name conditions ("if there's a sale price, capture both"), name disambiguators ("the price near the title, not in the recommendations sidebar"). See [references/prompts.md](references/prompts.md).
 
-9. **Re-running `create` to fix a broken scraper.** That builds a *new*
-   collector and orphans the old one. To fix an existing scraper, use
-   `bdata scraper heal ""` — it mutates the
-   scraper in place so your saved `collector_id` keeps working and improves.
+9. **Fighting the AI-Flow concurrent-job cap manually.** The AI Flow caps concurrent `scraper create` generations per account (currently **3**). If you launch more in parallel, the API returns `429 Cannot run more than 3 jobs in parallel` and the CLI's auto-backoff waits + retries (default 4 attempts, exponential with jitter, ~7.5 min total). **Let it wait — do not re-launch on top of the existing backoff.** Re-launching just creates more stub collectors. Tune with `--max-retries ` if you need a different ceiling; use `--no-retry` only if you've explicitly built your own backoff loop.
 
-10. **Treating `awaiting_approval` as a failure.** It is the normal end state
-    of a heal — the fix is computed and waiting for your decision. Review
-    `preview_result`, then `bdata scraper approve ` (or `--reject`). Use
-    `heal --auto-approve` to skip the gate.
+10. **Re-running `create` to fix a broken scraper.** That builds a *new* collector and orphans the old one. To fix an existing scraper, use `bdata scraper heal ""` — it mutates the scraper in place so your saved `collector_id` keeps working and improves.
+
+11. **Treating `awaiting_approval` as a failure.** It is the normal end state of a heal — the fix is computed and waiting for your decision. Review `preview_result`, then `bdata scraper approve ` (or `--reject`). Use `heal --auto-approve` to skip the gate.
 
 ---
 
@@ -330,6 +326,8 @@ For more end-to-end recipes (batch run loops, error recovery, web-UI handoff), s
 | `Invalid or expired API key` | Not logged in | `bdata login` (or `bdata login --device` for SSH). |
 | `create` returns no `id` | API call failed before template was created | Check `--timing` output for the failing request; verify network and account status. |
 | `Timeout after 600 seconds waiting for AI generation` | Page is complex; default poll exceeded | The `collector_id` is still printed. Open it in the web UI; or re-run with `--timeout 1200`. |
+| `Hit AI-Flow concurrent-job cap (429). Waiting Ns before retry…` (stderr) | You have ≥ 3 AI-Flow generations already in flight (account-wide). | **Expected behaviour — let the CLI wait.** The cap clears when one of the in-flight jobs finishes (2-11 min). Override the retry count with `--max-retries `; disable retries with `--no-retry` only if you have your own backoff. |
+| `Cannot run more than N jobs in parallel` (after retries exhausted) | Cap stayed full for the entire backoff window. | Wait for in-flight generations to finish, then re-run. The CLI also prints a `Note:` pointing at the half-built collector's dashboard URL — open it to inspect or delete manually (programmatic deletion not yet exposed). |
 | `status: "failed"` from progress poll | AI Flow couldn't build the template | Improve the description — be more specific about fields and selectors. Try again with a cleaner URL (e.g. a canonical product page, not a search result). |
 | `--sync` 202 with `crawl_results_timeout` | Page took > sync server cap | Re-run **without** `--sync` to poll `/dca/get_result` for the printed `response_id`. |
 | `--sync-timeout must be between 25 and 50 seconds` | Out-of-range value | Use a value in `[25, 50]`. |
diff --git a/skills/scraper-studio/proposals/PR-11-backoff.md b/skills/scraper-studio/proposals/PR-11-backoff.md
new file mode 100644
index 0000000..cf9f985
--- /dev/null
+++ b/skills/scraper-studio/proposals/PR-11-backoff.md
@@ -0,0 +1,145 @@
+# PR-11 — Handle the AI-Flow concurrent-job cap
+
+Status: **split into two halves.**
+
+- **Half A (CLI):** auto-backoff on 429 + stderr stub-recovery note. **Shipped**
+  in the `cli` repo (commit on branch `feat/scraper-create-429-backoff`).
+- **Half B (API / server):** prevent stubs at the source. **Open**, this proposal
+  describes both asks for the product team.
+
+---
+
+## Problem
+
+Bright Data's AI Flow caps concurrent `scraper create` generations per account.
+The cap is currently **3** and is enforced at the AI-trigger step
+(`POST /dca/collectors/{id}/automate_template`). When exceeded, the API returns:
+
+```
+429 Cannot run more than 3 jobs in parallel
+```
+
+The cap was undocumented in the skill and in `bdata scraper create --help` before
+this proposal landed. Users who scripted multi-domain workflows (parallel `for`
+loops, fan-out scrapers, etc.) hit it constantly and discovered it only by
+observing the error.
+
+Worse, the *template-creation* step (`POST /dca/collector`) succeeds even when
+the user is already at the cap. The CLI prints `Template created: c_…`, then a
+moment later the AI-trigger 429s. The half-built **stub collector** persists in
+the user's dashboard. With no `DELETE /dca/collector/{id}` endpoint, the only
+recovery is a manual click in the web UI.
+
+A real session from our 4-round audit (R4): a user launched 10 parallel
+`scraper create` invocations. 3 succeeded. 7 hit the cap and produced 7 stubs.
+After 4 rounds of partial re-runs, all 10 scrapers were built, but 7 orphan
+stubs remained in the dashboard. The CLI never told the user the cap existed.
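
A toy model of the two-step flow described in the Problem section above may make the stub arithmetic concrete. It is a sketch under stated assumptions, not the real API: `FakeAccount`, `create_template`, `trigger_ai`, and `launch_parallel` are hypothetical names, and the model assumes that no in-flight generation finishes during the burst and that the client does not retry (the behaviour before auto-backoff).

```python
from dataclasses import dataclass, field

CAP = 3  # concurrent-generation cap quoted in the proposal


@dataclass
class FakeAccount:
    """Hypothetical stand-in for the server side; not the real API."""
    in_flight: int = 0
    collectors: list = field(default_factory=list)
    stubs: list = field(default_factory=list)

    def create_template(self) -> str:
        # Step 1 (template creation) always succeeds, even at the cap.
        collector_id = f"c_{len(self.collectors) + 1}"
        self.collectors.append(collector_id)
        return collector_id

    def trigger_ai(self, collector_id: str) -> int:
        # Step 2 (AI trigger) is where the cap is enforced.
        if self.in_flight >= CAP:
            # The collector from step 1 is left behind as a stub.
            self.stubs.append(collector_id)
            return 429
        self.in_flight += 1
        return 200


def launch_parallel(account: FakeAccount, n: int) -> list:
    """Model n simultaneous `scraper create` launches with no retries."""
    statuses = []
    for _ in range(n):
        collector_id = account.create_template()
        statuses.append(account.trigger_ai(collector_id))
    return statuses


account = FakeAccount()
statuses = launch_parallel(account, 10)
print(statuses.count(200), "started,", len(account.stubs), "stubs")
```

With 10 launches against a cap of 3, the model reproduces the audit session's split: 3 generations start and 7 stubs are left behind.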
+
+## Half A: client-side workaround (shipped)
+
+The CLI now treats the cap as a recoverable transient error.
+
+### What lands in the CLI
+
+1. Per-request `retry: Retry_config` on the shared HTTP client
+   (`src/utils/client.ts`). Configurable `max_attempts`, `base_ms`, `max_ms`,
+   and an `on_retry` callback. Other commands (scrape, search, discover,
+   pipelines, browser) are unaffected — they keep today's short schedule.
+
+2. AI Scraper Studio specific config in `src/commands/scraper.ts`:
+   - Base 30s, ceiling 240s, default 4 attempts ≈ 7.5 min total max wait.
+   - Full-jitter exponential backoff (delay ∈ `[exp/2, exp]`) so concurrent
+     shell processes don't all retry on the same tick.
+
+3. Two new flags on `bdata scraper create`:
+   - `--max-retries ` — override the count (default 4).
+   - `--no-retry` — disable retries; fail fast on 429.
+
+4. Status line during the wait so callers know the CLI isn't hung:
+   ```
+   Hit AI-Flow concurrent-job cap (429). Waiting 32s before retry 1/4...
+   ```
+
+5. Stderr stub-recovery note on every terminal failure that leaves a
+   half-built collector, pointing at the dashboard URL.
+
+### What's documented
+
+- `SKILL.md` "Common mistakes" gets a new entry naming the cap and the
+  `--max-retries` / `--no-retry` levers.
+- `SKILL.md` "Troubleshooting" gets two new rows: the in-progress backoff
+  message (informational) and the retries-exhausted case (actionable).
+- `references/api-flow.md` documents the cap inline on the AI-trigger
+  endpoint and references this proposal for the open server-side asks.
+
+### What's not in this half
+
+The CLI cannot delete stub collectors because no `DELETE` endpoint exists.
+The auto-backoff dramatically reduces the *rate* of stub creation
+(parallel launches now serialise instead of failing 7/10), but any
+terminal failure still leaves one behind.
+
+---
+
+## Half B: server-side asks (open, this is the proposal)
+
+Two complementary asks. Either solves the stub-creation problem at the
+source. Both are small changes.
+
+### Ask 1: reject the template POST upfront when the cap is full
+
+Today, `POST /dca/collector` always succeeds. The cap is only enforced at
+the next step (`POST .../automate_template`), so a 429 there leaves a
+collector behind with no template attached.
+
+Move the cap check **earlier**: reject `POST /dca/collector` with `429`
+when the user is already at the limit. No stub is created. The client
+already retries on 429 via Half A, so the user experience is the same —
+just with no dashboard cleanup needed afterward.
+
+This is the preferred fix. Implementation is small (a single concurrency
+check at the start of the template-creation handler).
+
+### Ask 2: expose `DELETE /dca/collector/{id}`
+
+Even with Ask 1 landed, users will occasionally have stubs from other
+failure modes (poll status=failed, network errors mid-trigger, etc.). A
+public `DELETE` endpoint lets the CLI clean up on its way out:
+
+```
+DELETE /dca/collector/{collector_id}
+```
+
+The CLI would call this on terminal failure (unless `--keep-stub-on-failure`
+is passed). A future CLI release would also expose `bdata scraper delete
+` for manual cleanup of older stubs.
+
+Ask 1 is strictly preferable (no stub ever created beats deleting it
+after); Ask 2 is a useful fallback if Ask 1 is non-trivial server-side.
+
+---
+
+## Composes with
+
+- **PR-12** (shipped): replaced the misleading 403 "Access denied" hint
+  when running on a stub. PR-11 stops most stubs from existing in the
+  first place, so PR-12's hint fires less often — but is still useful
+  when one does slip through.
+- **PR-2** (shipped): on-failure `-o` envelope contains `collector_id`,
+  `status`, `view_url`. PR-11's stderr stub-recovery note duplicates the
+  `view_url` for users who don't `cat` the file. The two are
+  intentionally redundant.
+- **PR-13** (deferred to product): schema-honoring contract. Orthogonal
+  but worth landing together — it's the other major source of silent
+  failure in `scraper create` outputs.
+
+## Acceptance criteria for Half B
+
+- [ ] **Ask 1 (preferred):** `POST /dca/collector` returns `429 Cannot run
+      more than N jobs in parallel` when the cap is full. The current
+      template-side enforcement remains as a backstop.
+- [ ] **Ask 2:** `DELETE /dca/collector/{id}` is a documented, supported
+      endpoint. CLI integrates it on terminal-failure paths in a follow-up.
+- [ ] The documented cap value (currently 3) is mentioned in the
+      Bright Data developer docs alongside the `automate_template`
+      endpoint.
diff --git a/skills/scraper-studio/references/api-flow.md b/skills/scraper-studio/references/api-flow.md
index ead550f..44530ba 100644
--- a/skills/scraper-studio/references/api-flow.md
+++ b/skills/scraper-studio/references/api-flow.md
@@ -42,6 +42,16 @@ Returns:
 { "id": "ia_xyz...", "queued": false }
 ```
 
+**Concurrent-job cap.** This endpoint enforces a per-account cap on the number of AI-Flow generations in flight at once (currently **3**). When exceeded, it returns:
+
+```
+429 Cannot run more than 3 jobs in parallel
+```
+
+The CLI handles this automatically with exponential backoff + full-jitter (base 30s, ceiling 240s, 4 attempts by default ≈ 7.5 min total max wait). Tune via `--max-retries ` or disable with `--no-retry`. During the wait the CLI prints stderr status lines so callers know it's blocked, not hung.
+
+**Note: the template POST in step 1 always succeeds even when you are over the cap.** That means a 429 here leaves a half-built collector — known as a stub — in the dashboard. Programmatic deletion is not yet exposed (no `DELETE /dca/collector/{id}`), so on terminal failure the CLI surfaces the stub's `view_url` for manual recovery. See `skills/scraper-studio/proposals/PR-11-backoff.md` for the open server-side asks (reject at step 1 / expose DELETE) that would eliminate stubs at the source.
+
 ### 3. Poll progress
 
 ```