feat: deduplication layer for scrape_batch + field filters by SuarezPM · Pull Request #140 · brightdata/brightdata-mcp

SuarezPM · 2026-05-24T20:35:57Z

Summary

Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.

Solution

scrape_batch new parameters

Parameter	Default	Description
`deduplicate`	`true`	Remove duplicate content blocks via SHA-256 fingerprinting
`include_metrics`	`false`	Opt-in for `{results: [...], metrics: {...}}` response
`fields`	undefined	Filter response to specific top-level fields
`format`	`markdown`	Output format: `markdown` (default) or `raw`

search_engine_batch

Parameter	Description
`fields`	Filter `result.organic` array to requested keys (link, title, description, relevance_score, cursor)

Hash Algorithm

Content length	Hash computation
≤ 2048 chars	Full content SHA-256
> 2048 chars	sha256(prefix[2048] + middle[256] + suffix[256])

This captures shared headers/footers while correctly distinguishing pages with same structure but different body content.
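As a rough illustration of the table above, the sampled fingerprint could be sketched as follows. The function name `fingerprint`, the constants, and the assumption that the middle window is centered on the content midpoint are not taken from the PR's diff; only the window sizes come from the description.

```javascript
const { createHash } = require('node:crypto');

// Window sizes from the Hash Algorithm table; the names are assumptions.
const PREFIX_LEN = 2048;
const WINDOW_LEN = 256;

// Hypothetical helper: returns a hex SHA-256 fingerprint of `content`.
function fingerprint(content) {
  const sha256 = (s) => createHash('sha256').update(s, 'utf8').digest('hex');
  if (content.length <= PREFIX_LEN) {
    return sha256(content); // short content: hash everything
  }
  const prefix = content.slice(0, PREFIX_LEN);
  // Assumption: the middle window is centered on the midpoint of the content.
  const midStart = Math.floor(content.length / 2) - WINDOW_LEN / 2;
  const middle = content.slice(midStart, midStart + WINDOW_LEN);
  const suffix = content.slice(-WINDOW_LEN);
  return sha256(prefix + middle + suffix);
}
```

Under these assumptions, any character outside the three sampled windows has no influence on the fingerprint.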

Test Suite

File	Tests	Coverage
test_context_cache.js	9	Core dedup logic, hash correctness
test_dedup_edge_cases.js	8	Edge cases: empty, boundary, null handling
test_filter_fields.js	20	Field filtering edge cases
TOTAL	37	All passing

Backward Compatibility

Default behavior (no params or include_metrics: false) returns a flat array; no breaking changes to existing consumers.

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

…arch_engine_batch

SECURITY FIXES:
- Add prototype pollution protection in filterFields()
- Block __proto__, constructor, prototype properties
- Sanitize error messages to prevent information disclosure
- No hardcoded API keys in any file

FUNCTIONALITY:
- Add deduplication layer to scrape_batch tool
- Add field filtering to search_engine_batch tool
- Remove duplicate content blocks across URLs
- Include metrics option for dedup stats

TEST FILES:
- test_context_cache.js: 9 tests
- test_dedup_edge_cases.js: 8 tests
- test_filter_fields.js: 20 tests

Total: 37 tests passing
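The commit message names a `filterFields()` helper with a prototype-pollution guard but does not show it. A minimal sketch of what such a filter might look like is below; the signature, the pass-through behavior when no fields are requested, and the `BLOCKED_KEYS` name are assumptions, not the PR's actual code.

```javascript
// Keys the commit message says are blocked.
const BLOCKED_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

// Hypothetical sketch: copy only the requested top-level fields of `item`.
function filterFields(item, fields) {
  if (item === null || typeof item !== 'object') return item;
  // Assumption: no `fields` means "return the item unfiltered".
  if (!Array.isArray(fields) || fields.length === 0) return item;
  const out = {};
  for (const key of fields) {
    // Skip non-string keys and keys that could reach the prototype chain.
    if (typeof key !== 'string' || BLOCKED_KEYS.has(key)) continue;
    // Copy own properties only, never inherited ones.
    if (Object.prototype.hasOwnProperty.call(item, key)) {
      out[key] = item[key];
    }
  }
  return out;
}
```

For `search_engine_batch`, a filter like this would presumably be mapped over each entry of `result.organic`, for example `organic.map((r) => filterFields(r, ['link', 'title']))`.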

meirk-brd · 2026-05-31T07:29:33Z

Thank you for the contribution, this is a thoughtful PR and the token-cost problem you're targeting is a real one. I went through it carefully and ran the tests locally (37/37 passing). I have some feedback below, mostly on the dedup layer. The field-filtering side I think is in good shape.

The dedup half and the field-filter half read a bit like two separate PRs, and they land differently for me. The field filtering is solid. The dedup layer is where I'd like to talk through a few things.

Concerns with the dedup layer

1. It doesn't quite remove what the description says it removes. The stated problem is shared nav/header/footer across pages on one domain. But the fingerprint is sha256(prefix[0:2048] + middle[256] + suffix[256]) over the whole page. Two different articles on the same site share the chrome but have different bodies, which means a different middle and suffix, which means a different hash, so nothing gets deduped. The shared boilerplate is never collapsed. Dedup only fires when an entire page is near-identical to another one, which is rare in a hand-picked batch of 5 or fewer URLs.

2. Sampled hashing can drop real data. If two genuinely different pages happen to share the first 2048 chars, a 256-char window at the midpoint, and the last 256 chars, the second one gets flagged duplicate and its content is set to null. That's plausible for templated, boilerplate-heavy pages whose bodies are similar in length. And it's on by default, so the user can silently lose content they actually asked for. For a scraping tool that's a rough failure mode.

3. deduplicate: true by default is a behavior change, despite the "no breaking changes" note. The per-item shape changed too. Old output was [{status, value: {url, content}}] straight from Promise.allSettled; new output is [{url, status, latency_ms, content, content_hash}]. Anything reading result.value.content breaks. The backward-compat claim really only holds for the include_metrics wrapper, not for the item shape or the dedup default.

4. responseType: 'text' got dropped from the axios call. The original set it so remark().process() always received a string. Without it, axios can auto-parse a JSON-looking body and hand remark something that isn't a string.
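Concerns 1 and 2 can be reproduced with synthetic pages. The sketch below restates the sampled hash in compact form (assuming the middle window is centered on the midpoint) and builds made-up pages that share a header and footer; none of the strings or names come from the PR.

```javascript
const { createHash } = require('node:crypto');

// Compact restatement of the sampled fingerprint from concern 1.
// Assumption: the 256-char middle window is centered on the page midpoint.
function sampledHash(page) {
  const sha = (s) => createHash('sha256').update(s, 'utf8').digest('hex');
  if (page.length <= 2048) return sha(page);
  const mid = Math.floor(page.length / 2) - 128;
  return sha(page.slice(0, 2048) + page.slice(mid, mid + 256) + page.slice(-256));
}

// Synthetic pages: shared chrome wrapped around a body (all strings are made up).
const header = '<nav>site navigation</nav>'.repeat(100);
const footer = '<footer>site footer</footer>'.repeat(20);
const page = (body) => header + body + footer;

const articleA = page('A'.repeat(3000));
const articleB = page('B'.repeat(3000));
// Concern 1: same chrome, different bodies -> different hashes, nothing is deduped.
const chromeCollapsed = sampledHash(articleA) === sampledHash(articleB);

// articleC differs from articleA only in a region no sampled window covers.
const articleC = page('Z'.repeat(100) + 'A'.repeat(2900));
// Concern 2: different pages, identical samples -> flagged as duplicates.
const falsePositive =
  articleA !== articleC && sampledHash(articleA) === sampledHash(articleC);

console.log({ chromeCollapsed, falsePositive });
// prints { chromeCollapsed: false, falsePositive: true }
```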

Suggestions

- Split the PR. filterFields, the fields params, and the per-URL error isolation are good on their own and I'd happily take them separately.
- If the dedup stays: default it to false, hash the full content (5 pages of SHA-256 costs nothing, and full hashing removes the false-positive risk entirely), and don't null out content silently. At most annotate duplicate_of while still returning the body.
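One way the second suggestion could look in code is sketched below: opt-in, full-content hashing, and annotation instead of nulling. The function name `annotateDuplicates` and the `{url, content}` item shape are assumptions for illustration; `content_hash` and `duplicate_of` are the field names mentioned in this review.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical sketch: hash the full content, keep every body, and only
// annotate later duplicates with the URL of the first occurrence.
function annotateDuplicates(items, { deduplicate = false } = {}) {
  if (!deduplicate) return items; // opt-in: the default leaves results untouched
  const firstSeen = new Map(); // content hash -> URL of the first page with that content
  return items.map((item) => {
    if (typeof item.content !== 'string') return item; // failed scrapes pass through
    const hash = createHash('sha256').update(item.content, 'utf8').digest('hex');
    if (firstSeen.has(hash)) {
      // Duplicate: keep the body, add a pointer to the first occurrence.
      return { ...item, content_hash: hash, duplicate_of: firstSeen.get(hash) };
    }
    firstSeen.set(hash, item.url);
    return { ...item, content_hash: hash };
  });
}
```

Because the body is always returned, a consumer that ignores `duplicate_of` loses nothing; one that honors it can drop the repeated content itself.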

Really appreciate the effort here, especially the test coverage and the prototype-pollution guard in filterFields, which was good to see. Happy to keep iterating with you on this.

SuarezPM · 2026-06-03T12:54:49Z

Hi, thanks a lot for taking the time on this, and for actually running the tests, that means a lot!

You're right on all four, and the first one is the one that stings because it's true: the fingerprint hashes a sampled window over the whole page, so two articles that only share the chrome get different hashes and nothing collapses. The boilerplate I described is exactly what doesn't get removed.

I also have to own something: the PR notes say we "switched to full-content SHA-256." That's not what's in this branch; this branch still samples. We moved to full-content hashing in another codebase and the note got ahead of the actual diff here. My mistake, sorry about that.

Here's what I'd like to do:

- Split it, like you suggested: a separate PR with just filterFields, the fields params, and the per-URL error isolation. I'll also put responseType: 'text' back on the axios call (very good catch, that was a regression). That's the half I'm actually confident in.
- Pull the dedup out of this one. The more I sit with your point 1, the more a whole-page hash feels like the wrong tool: to really collapse shared nav/header/footer it has to work at the block level, not the page level. That's a bigger change and I'd rather bring it as its own PR, done properly and tested, than bolt a default-on lossy version onto this.
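To make the block-level idea concrete, here is one possible shape for it. This is only a sketch under loud assumptions: blocks are taken to be blank-line-separated chunks of the markdown output, and the name `dedupBlocks` and the `{url, content}` item shape are hypothetical. It is not the follow-up PR's design.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical sketch: split each page into blocks on blank lines and drop
// any block already seen earlier in the batch (including earlier in the
// same page), so shared nav/header/footer text is emitted only once.
function dedupBlocks(pages) {
  const seen = new Set(); // hashes of blocks already emitted
  return pages.map(({ url, content }) => {
    const kept = [];
    for (const block of content.split(/\n{2,}/)) {
      const text = block.trim();
      if (text === '') continue; // ignore empty chunks
      const key = createHash('sha256').update(text, 'utf8').digest('hex');
      if (seen.has(key)) continue; // repeated block: skip it
      seen.add(key);
      kept.push(block);
    }
    return { url, content: kept.join('\n\n') };
  });
}
```

Each block is hashed in full, so two blocks collide only if their trimmed text is identical. Dropping blocks is still lossy by design, which is why a version like this would likely need to be opt-in.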

If the split PR looks good I'll close this one. Really appreciate you engaging this deeply — this is the kind of review you learn from!!!

meirk-brd self-requested a review May 31, 2026 07:31

SuarezPM mentioned this pull request Jun 3, 2026: feat: opt-in field filtering for scrape_batch and search_engine_batch #143 (Open)

SuarezPM wants to merge 1 commit into brightdata:main from SuarezPM:feat/dedup-layer
