feat: Add tokenizer-based text chunking for NER recognizers by yuriihavrylko · Pull Request #2041 · microsoft/presidio

yuriihavrylko · 2026-05-26T18:10:23Z

Change Description

Add TokenizerBasedTextChunker, which uses HuggingFace tokenizers for accurate token-based text splitting, and make text chunking configurable from YAML for both GLiNERRecognizer and HuggingFaceNerRecognizer.

What's new

TokenizerBasedTextChunker: splits text by token count using the model's actual tokenizer, respecting its token limit
YAML text_chunker config: text_chunker parameter on NER recognizers now accepts a dict (from YAML) in addition to Python objects, resolved via TextChunkerProvider
GLiNER parity: added the missing chunk_size/chunk_overlap params to GLiNERRecognizer (HuggingFaceNerRecognizer already had them)
Lazy imports for transformers: the library is only required when the tokenizer chunker is actually used
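
To make the idea of splitting by token count concrete, here is a minimal sketch, not the PR's implementation. It assumes the tokenizer yields character offsets per token; with a HuggingFace fast tokenizer these would typically come from `tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)["offset_mapping"]`. The names `whitespace_offsets` and `chunk_by_tokens` are hypothetical, the regex tokenizer is only a stand-in so the example runs without downloads, and the overlap clamping rule is an assumption.

```python
import re
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Stand-in tokenizer (assumption): one token per run of non-whitespace."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def chunk_by_tokens(
    text: str,
    offsets: List[Tuple[int, int]],
    max_tokens: int,
    overlap_tokens: int,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, chunk) windows holding at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Clamp the overlap so every window advances by at least one token.
    overlap_tokens = max(0, min(overlap_tokens, max_tokens - 1))
    step = max_tokens - overlap_tokens

    chunks = []
    first = 0
    while first < len(offsets):
        last = min(first + max_tokens, len(offsets))
        # Map the token window back to character positions in the text.
        start, end = offsets[first][0], offsets[last - 1][1]
        chunks.append((start, end, text[start:end]))
        if last == len(offsets):
            break
        first += step
    return chunks


text = "My name is Jane Doe and I live in Kyiv"
for start, end, chunk in chunk_by_tokens(text, whitespace_offsets(text), 4, 1):
    print(start, end, repr(chunk))
```

Returning character offsets alongside each chunk lets a caller shift entity spans found in a chunk back to positions in the original text.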

YAML example

  - name: GLiNERRecognizer
    type: predefined
    text_chunker:
      chunker_type: tokenizer
      tokenizer: urchade/gliner_multi_pii-v1
      max_tokens: 512
      overlap_tokens: 32

Why it matters

Character-based approximation either wastes capacity or exceeds limits. Token-based chunking uses the exact budget available.
Being configurable from YAML means users can tune the chunking strategy without writing Python code, which makes it accessible for deployment-only workflows.
GLiNERRecognizer had no chunking configurability: chunk size and overlap were hardcoded (250 chars / 50 overlap) with no way to adjust them from config or constructor params. Now both the chunk_size/chunk_overlap params and the full text_chunker config are supported.
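
A rough illustration of the first point, under loud assumptions: `token_count` below is a crude regex stand-in, not a real subword tokenizer, and the two sample texts are invented. It only shows that a fixed 250-character window can hold very different numbers of tokens depending on the text.

```python
import re


def token_count(text: str) -> int:
    """Crude stand-in tokenizer (assumption): words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# Two invented samples: many short tokens vs. few long ones.
dense = "a b c d " * 40
sparse = "internationalization " * 40

for name, sample in (("dense", dense), ("sparse", sparse)):
    chunk = sample[:250]  # the same character budget for both
    print(name, len(chunk), "chars ->", token_count(chunk), "tokens")
```

With a real model tokenizer the exact counts would differ, but the spread is the reason a character budget has to be either conservative or risky.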

Issue reference

No linked issue

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required

Copilot

Pull request overview

Adds configurable tokenizer-aware text chunking for NER recognizers, improving long-text handling for HuggingFace and GLiNER based recognition.

Changes:

Introduces TokenizerBasedTextChunker and lazy exports/imports to keep transformers optional.
Allows GLiNERRecognizer and HuggingFaceNerRecognizer to accept YAML-style text_chunker dict configs.
Adds tests and documentation for chunker configuration and GLiNER chunking usage.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py` | Adds tokenizer-based chunking implementation. |
| `presidio-analyzer/presidio_analyzer/chunkers/text_chunker_provider.py` | Adds provider support for `tokenizer` chunker type. |
| `presidio-analyzer/presidio_analyzer/chunkers/__init__.py` | Exposes tokenizer chunker through lazy import. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py` | Adds configurable chunker support and chunk size parameters. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py` | Adds dict-based chunker config support. |
| `presidio-analyzer/tests/test_tokenizer_based_text_chunker.py` | Adds unit tests for tokenizer chunking behavior. |
| `presidio-analyzer/tests/test_gliner_recognizer.py` | Adds GLiNER chunker configuration tests. |
| `presidio-analyzer/tests/test_huggingface_ner_recognizer.py` | Adds HuggingFace recognizer chunker configuration tests. |
| `docs/samples/python/gliner.md` | Documents GLiNER text chunking configuration. |
| `docs/analyzer/recognizer_registry_provider.md` | Documents YAML `text_chunker` configuration. |

omri374

Thanks! Great addition!

Please make sure that the YAML flow is properly tested

omri374 · 2026-06-01T05:47:44Z

        threshold: float = 0.30,
        map_location: Optional[str] = None,
-        text_chunker: Optional[BaseTextChunker] = None,
+        text_chunker: Optional[Union[BaseTextChunker, Dict[str, Any]]] = None,


In which scenario would the user pass a dict here? I think it's better to ask the user to pass the chunker class, and instantiate it using the dict prior to calling the recognizer.

The dict path is used when loading from YAML config. The recognizer registry passes YAML fields as kwargs directly to the constructor, so text_chunker arrives as a raw dict like {"chunker_type": "tokenizer", "tokenizer": "model-name", "max_tokens": 512}.

Without dict support here, users would need custom Python code to instantiate the chunker before passing it, which defeats the purpose of YAML-based configuration.

The alternative would be handling the dict-to-object conversion in the registry loader, but that would require the loader to know about chunker-specific logic. Keeping it in the recognizer felt more self-contained.

We have a pydantic validation layer between YAML and actual Presidio classes to handle configuration errors more gracefully. I don't see a reason not to use it here too, and avoid generic dicts as input. Please take a look and see if there's a reason for it not to apply here too. Thanks!

Done. Moved the dict-to-object conversion into the Pydantic validation layer (TextChunkerConfig in yaml_recognizer_models.py) and the registry loader. Recognizer constructors now accept only Optional[BaseTextChunker]; dicts are no longer accepted.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

yuriihavrylko · 2026-06-04T17:15:26Z

Looking at the solution, I noticed that requiring users to specify a tokenizer every time they use the tokenizer-based chunker isn't ideal: in most cases it will be the same tokenizer as the model's. Specifying it separately is redundant and error-prone (risk of a mismatch between the chunker tokenizer and the model tokenizer). For GLiNER models it's not even possible, since their repos can't be loaded by AutoTokenizer (so users would need to use the backbone tokenizer).

So I made the tokenizer optional and added a deferred resolution pattern - the chunker is created without a tokenizer from YAML config, and the recognizer automatically resolves it at model-load time using the model's own tokenizer. This way users just configure chunk sizes in YAML and the right tokenizer is picked up automatically. Less config, fewer mistakes.

YAML example

  - name: GLiNERRecognizer
    type: predefined
    model_name: urchade/gliner_multi_pii-v1
    text_chunker:
      chunker_type: tokenizer
      # tokenizer: urchade/gliner_multi_pii-v1  # now optional; derived from the model when omitted
      max_tokens: 512
      overlap_tokens: 32
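
The deferred resolution pattern described above could look roughly like the sketch below. Every name here (`DeferredTokenChunker`, `ToyRecognizer`, `resolve_tokenizer`, `toy_model_tokenizer`) is hypothetical and chosen for illustration; the real classes in the PR may be shaped differently, and overlap handling is left out for brevity.

```python
import re
from typing import Callable, List, Optional, Tuple

Offsets = List[Tuple[int, int]]
TokenizeFn = Callable[[str], Offsets]


def toy_model_tokenizer(text: str) -> Offsets:
    """Stand-in for the model's own tokenizer: one token per word."""
    return [m.span() for m in re.finditer(r"\S+", text)]


class DeferredTokenChunker:
    """Hypothetical chunker that can be built before its tokenizer is known."""

    def __init__(self, max_tokens: int, tokenize: Optional[TokenizeFn] = None):
        self.max_tokens = max_tokens
        self._tokenize = tokenize

    @property
    def needs_tokenizer(self) -> bool:
        return self._tokenize is None

    def resolve_tokenizer(self, tokenize: TokenizeFn) -> None:
        # Keep an explicitly configured tokenizer; only fill in a missing one.
        if self._tokenize is None:
            self._tokenize = tokenize

    def chunk(self, text: str) -> List[str]:
        if self._tokenize is None:
            raise RuntimeError("tokenizer has not been resolved yet")
        offsets = self._tokenize(text)
        chunks = []
        for i in range(0, len(offsets), self.max_tokens):
            window = offsets[i : i + self.max_tokens]
            chunks.append(text[window[0][0] : window[-1][1]])
        return chunks


class ToyRecognizer:
    """Hypothetical recognizer that owns the model and its tokenizer."""

    def __init__(self, text_chunker: DeferredTokenChunker):
        self.text_chunker = text_chunker

    def load(self) -> None:
        # Stand-in for model loading: hand the model's tokenizer to the chunker.
        if self.text_chunker.needs_tokenizer:
            self.text_chunker.resolve_tokenizer(toy_model_tokenizer)


chunker = DeferredTokenChunker(max_tokens=3)  # as if created from YAML
recognizer = ToyRecognizer(chunker)
recognizer.load()  # tokenizer is resolved here, not at config time
print(chunker.chunk("call me at 555 0100 tomorrow morning"))
```

Because the placeholder holds only integers until `load()` runs, creating it while parsing configuration stays cheap, and the chunker cannot end up with a tokenizer that disagrees with the model.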

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

yuriihavrylko · 2026-06-04T17:22:44Z

+    model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)
+
+    chunker_type: Literal["character", "tokenizer"] = Field(
+        ..., description="Type of chunker"
+    )
+    chunk_size: Optional[int] = Field(None, description="Character chunk size")
+    chunk_overlap: Optional[int] = Field(None, description="Character overlap")


Intentional - chunker_type is required to keep YAML configs explicit. If a user writes text_chunker: {}, that's likely a mistake and should be caught by validation, not silently defaulted to character chunking. The recognizer already defaults to CharacterBasedTextChunker when text_chunker is omitted entirely.

yuriihavrylko · 2026-06-04T17:20:52Z

+        # 2. Convert text_chunker dict to BaseTextChunker instance
+        if "text_chunker" in kwargs and isinstance(kwargs["text_chunker"], dict):
+            from presidio_analyzer.chunkers import TextChunkerProvider
+
+            # Strip None values that may leak from Pydantic model_dump
+            chunker_config = {
+                k: v for k, v in kwargs["text_chunker"].items() if v is not None
+            }
+            kwargs["text_chunker"] = TextChunkerProvider(
+                chunker_config
+            ).create_chunker()


Not a concern in practice - text_chunker only appears in configs for recognizers that accept it. And with deferred tokenizer mode, no heavy work happens here (no model or tokenizer download); only a lightweight placeholder object is created.
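
For readers unfamiliar with the loader step in the diff above, the following self-contained sketch mirrors its shape: strip `None` values, then dispatch on `chunker_type`. It does not use Presidio; `CharacterChunker`, `TokenChunker`, `create_chunker`, and `prepare_kwargs` are hypothetical stand-ins, and the default values are illustrative only.

```python
from typing import Any, Dict, Optional


class CharacterChunker:
    """Hypothetical stand-in for a character-based chunker."""

    def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


class TokenChunker:
    """Hypothetical stand-in for a tokenizer-based chunker."""

    def __init__(
        self,
        max_tokens: int = 512,
        overlap_tokens: int = 0,
        tokenizer: Optional[str] = None,  # may be resolved later by the recognizer
    ):
        self.max_tokens = max_tokens
        self.overlap_tokens = overlap_tokens
        self.tokenizer = tokenizer


CHUNKER_TYPES = {"character": CharacterChunker, "tokenizer": TokenChunker}


def create_chunker(config: Dict[str, Any]):
    """Build a chunker from a config dict, ignoring keys whose value is None."""
    params = {k: v for k, v in config.items() if v is not None}
    chunker_type = params.pop("chunker_type", None)
    if chunker_type is None:
        raise ValueError("chunker_type is required")
    if chunker_type not in CHUNKER_TYPES:
        raise ValueError(f"unknown chunker_type: {chunker_type!r}")
    return CHUNKER_TYPES[chunker_type](**params)


def prepare_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Replace a text_chunker dict with a chunker object; leave the rest alone."""
    kwargs = dict(kwargs)  # do not mutate the caller's dict
    if isinstance(kwargs.get("text_chunker"), dict):
        kwargs["text_chunker"] = create_chunker(kwargs["text_chunker"])
    return kwargs


yaml_fields = {
    "model_name": "urchade/gliner_multi_pii-v1",
    "text_chunker": {
        "chunker_type": "tokenizer",
        "max_tokens": 512,
        "overlap_tokens": 32,
        "tokenizer": None,   # unset fields can show up as None after dumping
        "chunk_size": None,
    },
}
ready = prepare_kwargs(yaml_fields)
print(type(ready["text_chunker"]).__name__, ready["text_chunker"].max_tokens)
```

Stripping `None` before dispatch matters in this sketch because `chunk_size=None` would otherwise be passed to a constructor that does not accept it.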

yuriihavrylko added 3 commits May 25, 2026 22:48

feat: add TokenizerBasedTextChunker and update TextChunkerProvider for tokenizer support

9888943

feat: enhance GLiNER and HuggingFaceNer recognizers with customizable text chunking strategies

c16d31c

docs: add text chunking customization for GLiNERRecognizer using tokenizer-based approach

6cab1fc

github-actions Bot added the external label May 26, 2026

SharonHart requested review from Copilot and omri374 May 28, 2026 12:13

Copilot started reviewing on behalf of SharonHart May 28, 2026 12:13

Copilot AI reviewed May 28, 2026

Comment thread presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py

Comment thread presidio-analyzer/presidio_analyzer/chunkers/text_chunker_provider.py

yuriihavrylko and others added 2 commits May 31, 2026 10:24

feat: reserve special tokens in max_tokens calculation for TokenizerBasedTextChunker

e4a5665

Merge branch 'main' into feat/tokenizer-based-text-chunker

bb0f6b0

omri374 reviewed Jun 1, 2026

yuriihavrylko added 2 commits June 1, 2026 08:45

style: format code in TokenizerBasedTextChunker

9894929

refactor: remove lazy import for TokenizerBasedTextChunker in __init__.py

17c0635

Copilot AI review requested due to automatic review settings June 1, 2026 06:47

Copilot AI reviewed Jun 1, 2026

yuriihavrylko added 3 commits June 1, 2026 23:18

feat: enforce fast tokenizer requirement in TokenizerBasedTextChunker initialization

9035af5

test: add YAML configuration tests for character and tokenizer chunkers in recognizer registry

47d3a01

test: add tests for unknown chunker type and invalid text chunker dict in TextChunkerProvider

266dfe7

Copilot AI review requested due to automatic review settings June 1, 2026 21:44

Copilot AI reviewed Jun 1, 2026

yuriihavrylko added 5 commits June 1, 2026 23:50

feat: enhance tokenizer configuration and clamp overlap tokens in TokenizerBasedTextChunker

2bf870b

test: add tests for slow tokenizer and overlap clamping in TokenizerBasedTextChunker

4b4b376

feat: add pydantic TextChunkerConfig model and integrate with GLiNER and HuggingFace recognizers

c5520f8

feat: make tokenizer optional in TokenizerBasedTextChunker with deferred resolution

1358b7d

feat: handle missing offset_mapping in TokenizerBasedTextChunker

06b2e69

Copilot AI review requested due to automatic review settings June 4, 2026 17:16

Copilot AI reviewed Jun 4, 2026

docs: update GLiNERRecognizer configuration to use model_name and improve tokenizer description

3f32f68
