Support multiple GLiNER YAML configurations #2018
Looks like a duplicate of #2007, am I correct?
Pull request overview
This PR enables configuring multiple GLiNERRecognizer instances via YAML by allowing class_name: GLiNERRecognizer entries without explicit name, and by deriving deterministic instance names from model_name when name is omitted (while preserving the legacy GLiNERRecognizer name for the built-in default model).
Changes:
- Update GLiNERRecognizer to accept name=None and derive a stable default name from model_name.
- Relax YAML predefined recognizer validation to allow missing name when class_name is present, and improve error descriptions.
- Add unit tests covering name derivation, registry loading with multiple GLiNER entries, and validation/error-message behavior.
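The relaxed validation rule in the second bullet can be sketched roughly as follows. This is a minimal stdlib-only illustration, not Presidio's validator: `check_predefined_entry` and the model id `example-org/other-model` are hypothetical names introduced here.

```python
from typing import Any, Dict


def check_predefined_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Accept an entry that carries `name`, `class_name`, or both.

    Hypothetical helper for illustration; not Presidio's actual validator.
    """
    if not entry.get("name") and not entry.get("class_name"):
        raise ValueError("Predefined recognizer needs 'name' or 'class_name'")
    return entry


# Two nameless GLiNER entries, as they might look after YAML parsing.
entries = [
    {"class_name": "GLiNERRecognizer", "model_name": "urchade/gliner_multi_pii-v1"},
    {"class_name": "GLiNERRecognizer", "model_name": "example-org/other-model"},
]
validated = [check_predefined_entry(e) for e in entries]

# An entry with neither key is still rejected.
try:
    check_predefined_entry({"model_name": "example-org/other-model"})
    rejected = False
except ValueError:
    rejected = True
```

Both nameless entries pass because `class_name` is present, while the entry that has neither key is rejected.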
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py | Derive deterministic recognizer names from model_name when name is omitted; preserve legacy naming for default model. |
| presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py | Allow predefined configs to omit name when class_name is provided; add GLiNER-specific config model and improved error descriptions. |
| presidio-analyzer/tests/test_gliner_recognizer.py | Add tests for name derivation and legacy-name preservation when name is omitted. |
| presidio-analyzer/tests/test_recognizers_loader_utils.py | Add integration-style test validating two nameless GLiNER YAML entries produce distinct recognizer instances. |
| presidio-analyzer/tests/test_yaml_recognizer_models.py | Add tests for error-description fallbacks and validation requiring name or class_name. |
Comments suppressed due to low confidence (1)
presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py:62
_DEFAULT_GLINER_MODEL_NAME is introduced but the model_name default in GLiNERRecognizer.__init__ is still a duplicated string literal. Using the constant for the default parameter would prevent accidental drift between the default model used and the value used for legacy-name preservation.
_DEFAULT_GLINER_MODEL_NAME = "urchade/gliner_multi_pii-v1"
_LEGACY_GLINER_RECOGNIZER_NAME = "GLiNERRecognizer"


def _sanitize_model_name_for_instance_name(model_name: str) -> str:
    """Map a HF-style model id to a deterministic single-token-ish suffix."""
    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", model_name)
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    return sanitized


def _default_gliner_recognizer_name(model_name: str) -> str:
    """Stable default recognizer ``name`` when the user omits ``name``.

    Preserve the legacy name for the built-in default model for backwards compatibility.
    """
    if model_name == _DEFAULT_GLINER_MODEL_NAME:
        return _LEGACY_GLINER_RECOGNIZER_NAME
    suffix = _sanitize_model_name_for_instance_name(model_name)
    return f"{_LEGACY_GLINER_RECOGNIZER_NAME}_{suffix}"


class GLiNERRecognizer(LocalRecognizer):
    """GLiNER model based entity recognizer."""

    def __init__(
        self,
        supported_entities: Optional[List[str]] = None,
        name: Optional[str] = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
        entity_mapping: Optional[Dict[str, str]] = None,
        model_name: str = "urchade/gliner_multi_pii-v1",
        flat_ner: bool = True,
Addressed the Copilot review suggestion in
Hi, this PR partially overlaps with the merged #2007. I merged this branch with main, so let's take it from there.
        return super().model_dump(*args, **kwargs)


class GLiNERRecognizerConfig(PredefinedRecognizerConfig):
This is now a duplicate of line 210
class GLiNERRecognizerConfig(PredefinedRecognizerConfig):
    """Configuration specifically for GLiNER recognizers."""

    model_config = ConfigDict(extra="allow")

    model_name: Optional[str] = Field(None, description="GLiNER model name")
    entity_mapping: Optional[Dict[str, str]] = Field(
        None, description="GLiNER label to Presidio entity mapping"
    )
    flat_ner: Optional[bool] = Field(None, description="Whether to use flat NER")
    multi_label: Optional[bool] = Field(
        None, description="Whether to use multi-label classification"
    )
    threshold: Optional[float] = Field(None, description="Confidence threshold")
    map_location: Optional[str] = Field(None, description="Model device")
    load_onnx_model: Optional[bool] = Field(
        None, description="Whether to load GLiNER with ONNX Runtime"
    )
    onnx_model_file: Optional[str] = Field(None, description="ONNX model file name")
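As background for the duplicate-definition comment, the sketch below shows why a second class with the same name in one module is easy to miss: Python raises no error and simply rebinds the name to the later definition. `RecognizerConfigSketch` is a hypothetical class used only for this illustration and has nothing to do with Presidio's models.

```python
class RecognizerConfigSketch:
    """First definition (stands in for the earlier class in the module)."""

    source = "first"


# Keep a handle on the first class before the name is rebound.
first_definition = RecognizerConfigSketch


class RecognizerConfigSketch:  # noqa: F811 - intentional redefinition
    """Second definition with the same name; no error is raised."""

    source = "second"


# The module-level name now refers to the second class only.
assert RecognizerConfigSketch is not first_definition
```

Any code that looks up the name after both definitions have run sees only the second one, so keeping a single definition avoids silent divergence between the two copies.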
Addressed the latest review feedback in
Validation:
Change Description
Support configuring multiple GLiNERRecognizer instances in YAML by allowing class_name: GLiNERRecognizer entries without explicit name values and deriving deterministic recognizer names from model_name when name is omitted.
This preserves explicit name values unchanged and keeps the legacy GLiNERRecognizer name for the built-in default model.
Issue reference
Addresses #1760.
Validation
Ran the targeted suites from presidio-analyzer in a local Python 3.12 venv.
Results:
- tests/test_gliner_recognizer.py: 3 passed, 12 skipped
- tests/test_recognizers_loader_utils.py: 11 passed
- tests/test_yaml_recognizer_models.py: 47 passed
- tests/test_recognizer_registry_provider.py: 17 passed
- git diff --check: passed

Also attempted the full analyzer test suite; unrelated optional dependency tests fail locally without Azure/LangExtract extras installed, while the relevant targeted suites pass.
Checklist