From 88b30f2f7f0fff3e71cc54532f38fd47b9c2d10f Mon Sep 17 00:00:00 2001
From: ded-furby <190979964+ded-furby@users.noreply.github.com>
Date: Sat, 6 Jun 2026 08:09:28 +1000
Subject: [PATCH 1/3] docs: explain custom analyzer docker builds

---
 CHANGELOG.md         |  5 ++-
 docs/installation.md | 80 ++++++++++++++++++++++++++++++++++++++------
 2 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 08f79d96e0..6b4dca0e48 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 
 ## [unreleased]
 
+### General
+#### Changed
+- Documented how to build custom `presidio-analyzer` Docker images with alternative analyzer, NLP, and recognizer-registry YAML files, including multilingual configuration checks and common startup warnings. Fixes [#1663](https://github.com/microsoft/presidio/issues/1663).
+
 ### Anonymizer
 #### Fixed
 - Custom operator `validate()` no longer calls the user-supplied lambda with a dummy `"PII"` value. Previously, stateful lambdas (e.g. those accumulating a token-to-original-value map for de-anonymization) would receive a spurious invocation during validation, inserting a junk entry (`{"TOKEN_1": "PII"}`) into the map and skewing all subsequent token counters. The return-type contract is now enforced in `operate()` when the lambda runs on real data. Fixes [#2024](https://github.com/microsoft/presidio/issues/2024).
@@ -809,4 +813,3 @@ New endpoint for deanonymizing encrypted entities by the anonymizer.
 [2.2.23]: https://github.com/microsoft/presidio/compare/2.2.2...2.2.23
 [2.2.2]: https://github.com/microsoft/presidio/compare/2.2.1...2.2.2
 [2.2.1]: https://github.com/microsoft/presidio/compare/2.2.0...2.2.1
-
diff --git a/docs/installation.md b/docs/installation.md
index 9a8d40e086..ebb7a7bb47 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -96,10 +96,10 @@ To download the Presidio Docker containers, run the following command:
 
     This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).
 
-### For PII anonymization in text
-
-For PII detection and anonymization in text, the `presidio-analyzer`
-and `presidio-anonymizer` modules are required.
+### For PII anonymization in text
+
+For PII detection and anonymization in text, the `presidio-analyzer`
+and `presidio-anonymizer` modules are required.
 
 ```sh
 # Download Docker images
@@ -109,12 +109,72 @@ docker pull mcr.microsoft.com/presidio-anonymizer
 
 # Run containers with default ports
 docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
-docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
-```
-
-### For PII redaction in images
-
-For PII detection in images, the `presidio-image-redactor` is required.
+docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
+```
+
+#### Building a custom `presidio-analyzer` image
+
+The published analyzer image ships with the default English-only YAML files:
+
+- `presidio_analyzer/conf/default_analyzer.yaml`
+- `presidio_analyzer/conf/default.yaml`
+- `presidio_analyzer/conf/default_recognizers.yaml`
+
+To add more languages or enable a different recognizer mix, create custom copies
+of those files inside the `presidio-analyzer/` build context and point the Docker
+build to them with build arguments.
+
+For example:
+
+```sh
+cp presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml \
+  presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml
+cp presidio-analyzer/presidio_analyzer/conf/default.yaml \
+  presidio-analyzer/presidio_analyzer/conf/custom_nlp.yaml
+cp presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml \
+  presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml
+```
+
+Then update the copies as needed:
+
+- Add every runtime language to `supported_languages` in both
+  `custom_analyzer.yaml` and `custom_recognizers.yaml`.
+- Configure the NLP models for those languages in `custom_nlp.yaml`.
+- Enable or add recognizers that should run for the new languages.
+
+Build the image with the custom file paths:
+
+```sh
+docker build ./presidio-analyzer \
+  -t presidio/presidio-analyzer-custom \
+  --build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/custom_analyzer.yaml \
+  --build-arg NLP_CONF_FILE=presidio_analyzer/conf/custom_nlp.yaml \
+  --build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/custom_recognizers.yaml
+```
+
+Run it the same way as the default image:
+
+```sh
+docker run -d -p 5002:3000 presidio/presidio-analyzer-custom
+```
+
+!!! note "Important configuration checks"
+
+    - `supported_languages` must match between the analyzer and recognizer-registry YAML files.
+    - The Docker build installs the NLP models declared in `NLP_CONF_FILE` (or in the analyzer file if you use a single combined config), so adding more or larger models increases build time and image size.
+    - If the container logs warnings such as `NLP recognizer ... is not in the list of recognizers for language ...`, check that the recognizer registry still includes the NLP recognizer for that language (for example `SpacyRecognizer`) and that the same language code appears in both YAML files.
+    - If you are experimenting with many languages at once, start with a smaller subset first. Downloading and loading several large NLP models in one image can significantly increase memory usage during build and startup.
+
+For more background on the YAML structure, see:
+
+- [Analyzer Engine Provider](analyzer/analyzer_engine_provider.md)
+- [PII detection in different languages](analyzer/languages.md)
+- [Customizing NLP models](analyzer/customizing_nlp_models.md)
+- [Recognizer registry configuration](analyzer/recognizer_registry_provider.md)
+
+### For PII redaction in images
+
+For PII detection in images, the `presidio-image-redactor` is required.
 
 ```sh
 # Download Docker image

From ec7bcaa4cc74bca762728af529fa4e8a8a93a15d Mon Sep 17 00:00:00 2001
From: ded-furby <190979964+ded-furby@users.noreply.github.com>
Date: Sun, 7 Jun 2026 07:34:38 +1000
Subject: [PATCH 2/3] docs: clarify custom analyzer image instructions

---
 docs/installation.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/installation.md b/docs/installation.md
index ebb7a7bb47..e92a19b4f3 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -120,9 +120,10 @@ The published analyzer image ships with the default English-only YAML files:
 - `presidio_analyzer/conf/default.yaml`
 - `presidio_analyzer/conf/default_recognizers.yaml`
 
-To add more languages or enable a different recognizer mix, create custom copies
-of those files inside the `presidio-analyzer/` build context and point the Docker
-build to them with build arguments.
+From the repository root (after cloning this repo), run the following commands to add
+more languages or enable a different recognizer mix. Create custom copies of those
+files inside the `presidio-analyzer/` build context and point the Docker build to
+them with build arguments.
 
 For example:
 
@@ -146,7 +147,7 @@ Build the image with the custom file paths:
 
 ```sh
 docker build ./presidio-analyzer \
-  -t presidio/presidio-analyzer-custom \
+  -t myorg/presidio-analyzer-custom \
   --build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/custom_analyzer.yaml \
   --build-arg NLP_CONF_FILE=presidio_analyzer/conf/custom_nlp.yaml \
   --build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/custom_recognizers.yaml
@@ -155,7 +156,7 @@ docker build ./presidio-analyzer \
 Run it the same way as the default image:
 
 ```sh
-docker run -d -p 5002:3000 presidio/presidio-analyzer-custom
+docker run -d -p 5002:3000 myorg/presidio-analyzer-custom
 ```
 
 !!! note "Important configuration checks"

From a952f89dec6fc147bcbe98f5afa078a4228a6350 Mon Sep 17 00:00:00 2001
From: ded-furby <190979964+ded-furby@users.noreply.github.com>
Date: Sun, 7 Jun 2026 11:04:50 +1000
Subject: [PATCH 3/3] docs: clarify custom analyzer Docker context

---
 docs/installation.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/installation.md b/docs/installation.md
index e92a19b4f3..3330c5c038 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -120,7 +120,7 @@ The published analyzer image ships with the default English-only YAML files:
 - `presidio_analyzer/conf/default.yaml`
 - `presidio_analyzer/conf/default_recognizers.yaml`
 
-From the repository root (after cloning this repo), run the following commands to add
+From the repository root after cloning this repo, run the following commands to add
 more languages or enable a different recognizer mix. Create custom copies of those
 files inside the `presidio-analyzer/` build context and point the Docker build to
 them with build arguments.
@@ -128,6 +128,8 @@ them with build arguments.
 For example:
 
 ```sh
+cd /path/to/presidio
+
 cp presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml \
   presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml
 cp presidio-analyzer/presidio_analyzer/conf/default.yaml \
@@ -147,7 +149,7 @@ Build the image with the custom file paths:
 
 ```sh
 docker build ./presidio-analyzer \
-  -t myorg/presidio-analyzer-custom \
+  -t presidio-analyzer-custom \
   --build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/custom_analyzer.yaml \
   --build-arg NLP_CONF_FILE=presidio_analyzer/conf/custom_nlp.yaml \
   --build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/custom_recognizers.yaml
@@ -156,7 +158,7 @@ docker build ./presidio-analyzer \
 Run it the same way as the default image:
 
 ```sh
-docker run -d -p 5002:3000 myorg/presidio-analyzer-custom
+docker run -d -p 5002:3000 presidio-analyzer-custom
 ```
 
 !!! note "Important configuration checks"
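
Reviewer note (not part of the patch): the first configuration check in the docs added by this series, that `supported_languages` must match between the analyzer and recognizer-registry YAML files, could be verified before running `docker build`. The following is only a minimal sketch under stated assumptions: the helper names `top_level_languages` and `languages_match` are hypothetical, and the script assumes each file declares an unindented `supported_languages:` key followed by a block-style list (`- en`). Inline lists such as `[en, es]` and quoted values are not handled.

```sh
#!/bin/sh
# Hypothetical helper (not part of Presidio): print the entries of the
# top-level `supported_languages` list in a YAML file, one per line, sorted.
# Assumes a block-style list under an unindented `supported_languages:` key.
top_level_languages() {
  awk '
    /^supported_languages:/ { in_list = 1; next }  # unindented key starts the list
    in_list && /^[ \t]*-[ \t]*/ {                  # list item under that key
      sub(/^[ \t]*-[ \t]*/, "")                    # drop the leading "- "
      sub(/[ \t]+$/, "")                           # drop trailing whitespace
      print
      next
    }
    in_list && /^[ \t]*(#.*)?$/ { next }           # skip blank and comment lines
    { in_list = 0 }                                # any other line ends the list
  ' "$1" | sort
}

# Hypothetical helper: succeed only when both files list the same languages,
# regardless of order. Nested per-recognizer lists are ignored because their
# `supported_languages:` keys are indented.
languages_match() {
  [ "$(top_level_languages "$1")" = "$(top_level_languages "$2")" ]
}
```

After sourcing these functions from the repository root, `languages_match presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml || echo "supported_languages differ" >&2` would flag a mismatch early, before the longer model-download step of the build.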