feat(analyzer): Deduplicate NER model instances in multilingual configuration to reduce memory usage #2052
Open: yuriihavrylko wants to merge 4 commits into microsoft:main from yuriihavrylko:feat/deduplicate-ner-model-instances
Commits (all by yuriihavrylko):
4315b87 feat: implement model sharing for spacy, gliner and HuggingFace recog…
5c6bc67 test: add shared model caching tests for spacy, gliner and HuggingFac…
e2e9a08 fix: exclude None values in recognizer registry configuration validation
0e04855 docs: add configuration files for multilingual support in analyzer, n…
10 changes: 10 additions & 0 deletions
presidio-analyzer/presidio_analyzer/conf/transformers_multilingual/analyzer.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| supported_languages: | ||
| - en | ||
| - es | ||
| - de | ||
| - fr | ||
| - it | ||
| - pt | ||
| - nl | ||
| - sv | ||
| default_score_threshold: 0 |
29 changes: 29 additions & 0 deletions
presidio-analyzer/presidio_analyzer/conf/transformers_multilingual/nlp_engine.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| nlp_engine_name: spacy | ||
| models: | ||
| - | ||
| lang_code: en | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: es | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: de | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: fr | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: it | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: pt | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: nl | ||
| model_name: xx_ent_wiki_sm | ||
| - | ||
| lang_code: sv | ||
| model_name: xx_ent_wiki_sm | ||
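
Every language entry above points at the same `xx_ent_wiki_sm` model. A small sketch, assuming the `models` section has already been parsed into Python dictionaries (the `models` literal and the `group_by_model` helper are illustrative and not part of Presidio), shows how few distinct models this configuration actually needs:

```python
from collections import defaultdict

# Parsed form of the `models` section above (illustrative literal).
models = [
    {"lang_code": code, "model_name": "xx_ent_wiki_sm"}
    for code in ["en", "es", "de", "fr", "it", "pt", "nl", "sv"]
]


def group_by_model(entries):
    """Map each distinct model name to the languages that use it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["model_name"]].append(entry["lang_code"])
    return dict(grouped)


grouped = group_by_model(models)
print(len(models), "entries,", len(grouped), "distinct model(s)")
```
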
37 changes: 37 additions & 0 deletions
presidio-analyzer/presidio_analyzer/conf/transformers_multilingual/recognizers.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| supported_languages: | ||
| - en | ||
| - es | ||
| - de | ||
| - fr | ||
| - it | ||
| - pt | ||
| - nl | ||
| - sv | ||
| global_regex_flags: 26 | ||
| recognizers: | ||
| - name: SpacyRecognizer | ||
| type: predefined | ||
| enabled: false | ||
| - name: "HuggingFace NER" | ||
| type: predefined | ||
| class_name: HuggingFaceNerRecognizer | ||
| model_name: dslim/bert-base-NER | ||
| supported_languages: | ||
| - en | ||
| - es | ||
| - de | ||
| - fr | ||
| - it | ||
| - pt | ||
| - nl | ||
| - sv | ||
| supported_entities: | ||
| - PERSON | ||
| - LOCATION | ||
| - ORGANIZATION | ||
| - MISC | ||
| aggregation_strategy: simple | ||
| threshold: 0.3 | ||
| device: cpu |
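
The recognizer above lists eight languages but a single model. As the review discussion notes, a recognizer instance serves one language, so a loader that loads the model once per instance would hold eight copies of `dslim/bert-base-NER`. A minimal sketch of the deduplication idea, assuming a hypothetical `load_pipeline` function that stands in for the real, expensive model load (this is not the PR's actual implementation):

```python
from functools import lru_cache

load_count = 0  # counts how many times the "expensive" load actually runs


@lru_cache(maxsize=None)
def load_pipeline(model_name: str, device: str) -> dict:
    """Hypothetical stand-in for loading a NER model, cached per (model, device)."""
    global load_count
    load_count += 1
    return {"model_name": model_name, "device": device}


languages = ["en", "es", "de", "fr", "it", "pt", "nl", "sv"]

# One recognizer-like record per language, all asking for the same model.
recognizers = [
    {"language": lang, "pipeline": load_pipeline("dslim/bert-base-NER", "cpu")}
    for lang in languages
]

# Distinct pipeline objects held across all recognizers.
shared = {id(r["pipeline"]) for r in recognizers}
print(f"{len(recognizers)} recognizers, {load_count} load(s), {len(shared)} pipeline object(s)")
```

Keying the cache on both model name and device keeps a CPU and a GPU copy of the same model separate, which is an assumption about what a sensible cache key would contain.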
Can we think of an alternative approach using dependency injection or model registry? This would never get released.
For example:
Or
The user should be able to control this model registry (add, remove, update)
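A minimal sketch of the two ideas raised in this comment, assuming hypothetical `ModelRegistry` and `NerRecognizer` classes (neither is Presidio's API): the registry exposes add, update and remove, and the model is injected into each recognizer rather than loaded by it.

```python
from typing import Any, Dict, Optional


class ModelRegistry:
    """Hypothetical user-controlled store of loaded models, keyed by name."""

    def __init__(self) -> None:
        self._models: Dict[str, Any] = {}

    def add(self, name: str, model: Any) -> None:
        if name in self._models:
            raise KeyError(f"model {name!r} is already registered")
        self._models[name] = model

    def update(self, name: str, model: Any) -> None:
        if name not in self._models:
            raise KeyError(f"model {name!r} is not registered")
        self._models[name] = model

    def remove(self, name: str) -> Any:
        # Dropping the registry's reference lets the model be freed once
        # no recognizer holds it any more.
        return self._models.pop(name)

    def get(self, name: str) -> Optional[Any]:
        return self._models.get(name)


class NerRecognizer:
    """Hypothetical recognizer that receives its model instead of loading it."""

    def __init__(self, supported_language: str, model: Any) -> None:
        self.supported_language = supported_language
        self.model = model


registry = ModelRegistry()
registry.add("dslim/bert-base-NER", object())  # placeholder for a real model

# Dependency injection: every recognizer is handed the same model object.
model = registry.get("dslim/bert-base-NER")
recognizers = [NerRecognizer(lang, model) for lang in ("en", "es", "de")]
```

Because the registry only holds a reference, removing an entry does not invalidate recognizers that already received the model; it only stops new recognizers from picking it up.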
Good point - the initial approach is naive, but it showcases the reality of the issue and the potential gains in multilingual setups. My intent was to implement it for both programmatic and yaml config-based use cases.
Here are the options I'm considering based on DI and model registry ideas:
Option A: DI + loader-level sharing
Dependency injection as you have shown, plus sharing at the loader level.
Simple, no new classes. But it couples the loader to each recognizer's internals (`gliner_model=` vs `ner_pipeline=`, different cache key shapes). Every new recognizer type would need loader changes.
Option B: ModelRegistry
The caching logic (key shape, what to store) stays inside each recognizer - the loader just provides the shared bucket and doesn't need model-specific knowledge.
Option C: Multi-language recognizer
Eliminates the problem at the root - one instance, one model, no sharing needed. But `EntityRecognizer` is built around `supported_language` (singular). This is the cleanest long-term solution, but it is a major refactor, beyond the scope of a bug fix or small improvement.
Option D: Lazy loading
Remove `self.load()` from `EntityRecognizer.__init__` and load on the first `analyze()` call instead. This makes DI, a registry, or sharing trivially easy, since all instances are created cheaply and models are configured afterward. But it's a base class behavior change that affects every recognizer.
I prefer option B for this PR - it covers both use cases, keeps the loader generic, and gives the user full control.
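Option B could look roughly like the following sketch. The `ModelRegistry.get_or_create` method, the `FakeHfRecognizer` class and its cache key are assumptions made for illustration; the interface actually proposed in the PR may differ.

```python
from typing import Any, Callable, Dict, Hashable, Optional


class ModelRegistry:
    """Hypothetical shared bucket: it knows nothing about what it stores."""

    def __init__(self) -> None:
        self._items: Dict[Hashable, Any] = {}

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        # Build the object only the first time a key is seen.
        if key not in self._items:
            self._items[key] = factory()
        return self._items[key]

    def __len__(self) -> int:
        return len(self._items)


class FakeHfRecognizer:
    """Hypothetical recognizer: it decides the key shape and what to store."""

    def __init__(
        self,
        language: str,
        model_name: str,
        device: str,
        registry: Optional[ModelRegistry] = None,
    ) -> None:
        self.supported_language = language
        key = ("hf", model_name, device)  # key shape owned by the recognizer

        def factory() -> dict:
            # Stand-in for the real, expensive model load.
            return {"model_name": model_name, "device": device}

        # Without a registry, each instance loads its own model as before.
        if registry is not None:
            self.pipeline = registry.get_or_create(key, factory)
        else:
            self.pipeline = factory()


registry = ModelRegistry()
shared = [
    FakeHfRecognizer(lang, "dslim/bert-base-NER", "cpu", registry)
    for lang in ("en", "es", "de")
]
unshared = [FakeHfRecognizer(lang, "dslim/bert-base-NER", "cpu") for lang in ("en", "es")]
```

The loader only has to create one registry and pass it along; the key and the factory stay inside the recognizer, which is the separation described for this option.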
Options C and D are worth considering as longer-term architectural improvements.
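Option D can be sketched as follows, assuming a hypothetical `LazyRecognizer` class rather than Presidio's real `EntityRecognizer`: construction is cheap, and the model is loaded on the first `analyze()` call unless one has already been assigned.

```python
from typing import Any, List, Optional


class LazyRecognizer:
    """Hypothetical recognizer that defers model loading until first use."""

    loads = 0  # class-wide counter, only to make the behaviour observable

    def __init__(self, supported_language: str, model: Optional[Any] = None) -> None:
        self.supported_language = supported_language
        self.model = model  # nothing is loaded here

    def load(self) -> None:
        LazyRecognizer.loads += 1
        self.model = {"loaded_for": self.supported_language}  # stand-in load

    def analyze(self, text: str) -> List[str]:
        if self.model is None:
            self.load()
        return text.split()  # placeholder for real entity detection


recognizers = [LazyRecognizer(lang) for lang in ("en", "es", "de")]
loads_after_init = LazyRecognizer.loads

# Because nothing has loaded yet, a shared model can still be assigned.
shared_model = {"loaded_for": "all"}
for recognizer in recognizers[1:]:
    recognizer.model = shared_model

tokens = recognizers[0].analyze("Ana lives in Lisbon")  # triggers the only load
recognizers[1].analyze("Ana vive en Lisboa")  # uses the shared model
```

The trade-off named above shows up here too: the first `analyze()` call on an unconfigured instance pays the full load cost, which changes latency behaviour for every recognizer built on the base class.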
What do you think?