Skip to content

feat: deduplication layer for scrape_batch + field filters#140

Open
SuarezPM wants to merge 1 commit into
brightdata:mainfrom
SuarezPM:feat/dedup-layer
Open

feat: deduplication layer for scrape_batch + field filters#140
SuarezPM wants to merge 1 commit into
brightdata:mainfrom
SuarezPM:feat/dedup-layer

Conversation

@SuarezPM

@SuarezPM SuarezPM commented May 24, 2026

Copy link
Copy Markdown

Summary

Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.

Problem

When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.

Solution

scrape_batch new parameters

Parameter Default Description
deduplicate true Remove duplicate content blocks via SHA-256 fingerprinting
include_metrics false Opt-in for {results: [...], metrics: {...}} response
fields undefined Filter response to specific top-level fields
format markdown Output format: markdown (default) or raw

search_engine_batch

Parameter Description
fields Filter result.organic array to requested keys (link, title, description, relevance_score, cursor)

Hash Algorithm

Content length Hash computation
≤ 2048 chars Full content SHA-256
> 2048 chars sha256(prefix[2048] + middle[256] + suffix[256])

This captures shared headers/footers while correctly distinguishing pages with same structure but different body content.

Test Suite

File Tests Coverage
test_context_cache.js 9 Core dedup logic, hash correctness
test_dedup_edge_cases.js 8 Edge cases: empty, boundary, null handling
test_filter_fields.js 20 Field filtering edge cases
TOTAL 37 All passing

Backward Compatibility

Default behavior (no params or include_metrics: false) returns flat array — no breaking changes to existing consumers.

Prior Art

Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).

…arch_engine_batch

SECURITY FIXES:
- Add prototype pollution protection in filterFields()
- Block __proto__, constructor, prototype properties
- Sanitize error messages to prevent information disclosure
- No hardcoded API keys in any file

FUNCTIONALITY:
- Add deduplication layer to scrape_batch tool
- Add field filtering to search_engine_batch tool
- Remove duplicate content blocks across URLs
- Include metrics option for dedup stats

TEST FILES:
- test_context_cache.js: 9 tests
- test_dedup_edge_cases.js: 8 tests
- test_filter_fields.js: 20 tests

Total: 37 tests passing
@meirk-brd

meirk-brd commented May 31, 2026

Copy link
Copy Markdown
Collaborator

Thank you for the contribution, this is a thoughtful PR and the token-cost problem you're targeting is a real one. I went through it carefully and ran the tests locally (37/37 passing). I have some feedback below, mostly on the dedup layer. The field-filtering side I think is in good shape.

The dedup half and the field-filter half read a bit like two separate PRs, and they land differently for me. The field filtering is solid. The dedup layer is where I'd like to talk through a few things.

Concerns with the dedup layer

1. It doesn't quite remove what the description says it removes. The stated problem is shared nav/header/footer across pages on one domain. But the fingerprint is sha256(prefix[0:2048] + middle[256] + suffix[256]) over the whole page. Two different articles on the same site share the chrome but have different bodies, which means a different middle and suffix, which means a different hash, so nothing gets deduped. The shared boilerplate is never collapsed. Dedup only fires when an entire page is near-identical to another one, which is rare in a hand-picked batch of 5 or fewer URLs.

2. Sampled hashing can drop real data. If two genuinely different pages happen to share the first 2048 chars, a 256-char window at the midpoint, and the last 256 chars, the second one gets flagged duplicate and its content is set to null. That's plausible for templated, boilerplate-heavy pages whose bodies are similar in length. And it's on by default, so the user can silently lose content they actually asked for. For a scraping tool that's a rough failure mode.

3. deduplicate: true by default is a behavior change, despite the "no breaking changes" note. The per-item shape changed too. Old output was [{status, value: {url, content}}] straight from Promise.allSettled; new output is [{url, status, latency_ms, content, content_hash}]. Anything reading result.value.content breaks. The backward-compat claim really only holds for the include_metrics wrapper, not for the item shape or the dedup default.

4. responseType: 'text' got dropped from the axios call. The original set it so remark().process() always received a string. Without it, axios can auto-parse a JSON-looking body and hand remark something that isn't a string.

Suggestions

  • Split the PR. filterFields, the fields params, and the per-URL error isolation are good on their own and I'd happily take them separately.
  • If the dedup stays: default it to false, hash the full content (5 pages of SHA-256 costs nothing, and full hashing removes the false-positive risk entirely), and don't null out content silently. At most annotate duplicate_of while still returning the body.

Really appreciate the effort here, especially the test coverage and the prototype-pollution guard in filterFields, which was good to see. Happy to keep iterating with you on this.

@SuarezPM

SuarezPM commented Jun 3, 2026

Copy link
Copy Markdown
Author

Hi, thanks a lot for taking the time on this, and for actually running the tests, that means a lot!

You're right on all four, and the first one is the one that stings because it's true: the fingerprint hashes a sampled window over the whole page, so two articles that only share the chrome get different hashes and nothing collapses. The boilerplate I described is exactly what doesn't get removed....

I also have to own something, the PR notes say we "switched to full-content SHA-256." That's not what's in this branch, this branch still samples. We moved to full-content hashing in another codebase and the note got ahead of the actual diff here. My mistake, soryabout that.

Here's what I'd like to do:

  1. Split it, like you suggested, a separate PR with just filterFields, the fields params, and the per-URL error isolation, and I'll put responseType: 'text' back on the axios call (very good catch, that was a regression). That's the half I'm actually confident in.

  2. Pull the dedup out of this one. The more I sit with your n°1, the more a whole-page hash feels like the wrong tool, to really collapse shared nav/header/footer it has to work at the block level, not the page level. That's a bigger change and I'd rather bring it as its own PR, done properly and tested, than bolt a default-on lossy version onto this.

If the split PR looks good I'll close this one. Really appreciate you engaging this deeply — this is the kind of review you learn from!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants