From ced6c161dd185ffe33d215b7ad834fd81d406394 Mon Sep 17 00:00:00 2001
From: Jah-yee <noreply@github.com>
Date: Wed, 3 Jun 2026 14:11:58 +0800
Subject: [PATCH] docs: add guide for building custom Docker images with
 additional languages

Fixes #1663

- Documents which YAML files to modify (default.yaml, default_recognizers.yaml,
  default_analyzer.yaml) and what each controls
- Explains how to add language entries to default_recognizers.yaml
- Provides docker build command with --build-arg flags for custom configs
- Explains how to add spaCy language models via default.yaml
- Documents three common pitfalls: OOM from too many languages, NLP recognizer
  warnings, and memory tuning for production
- Links to related docs (NLP engine config, supported entities, development)
---
 docs/installation.md | 458 +++++++++++++++++++++++++++----------------
 1 file changed, 285 insertions(+), 173 deletions(-)

diff --git a/docs/installation.md b/docs/installation.md
index 9a8d40e086..c127816ca8 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -1,173 +1,285 @@
-# Installing Presidio
-
-## Description
-
-This document describes the installation of the entire
-Presidio suite using `pip` (as Python packages) or using `Docker` (As containerized services).
-
-## Using pip
-
-!!! note "Note"
-
-    Consider installing the Presidio python packages
-    in a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html)
-    or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
-
-### Supported Python Versions
-
-Presidio is supported for the following python versions:
-
-* 3.10
-* 3.11
-* 3.12
-* 3.13
-
-### PII anonymization on text
-
-For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages
-with at least one NLP engine (`spaCy`, `transformers` or `stanza`):
-
-===+ "spaCy (default)"
-
-    ```
-    pip install presidio_analyzer
-    pip install presidio_anonymizer
-    python -m spacy download en_core_web_lg
-    ```
-
-=== "Transformers"
-
-    ```
-    pip install "presidio_analyzer[transformers]"
-    pip install presidio_anonymizer
-    python -m spacy download en_core_web_sm
-    ```
-
-    !!! note "Note"
-        
-        When using a transformers NLP engine, Presidio would still use spaCy for other capabilities,
-        therefore a small spaCy model (such as en_core_web_sm) is required. 
-        Transformers models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model)
-
-=== "Stanza"
-
-    ```
-    pip install "presidio_analyzer[stanza]"
-    pip install presidio_anonymizer
-    ```
-
-
-    !!! note "Note"
-        
-        Stanza models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/spacy_stanza.md#download-the-pre-trained-model).
-
-### GPU acceleration (optional)
-
-For GPU acceleration, install the appropriate dependencies for your hardware:
-
-- **Linux with NVIDIA GPU**: `pip install "spacy[cuda12x]"` (or the version matching your CUDA installation)
-- **macOS with Apple Silicon**: MPS is detected automatically, no additional dependencies required.
-
-For detailed GPU setup, verification, and troubleshooting, see [GPU Acceleration](./analyzer/nlp_engines/gpu_usage.md).
-
-### PII redaction in images
-
-For PII redaction in images
-
-1. Install the `presidio-image-redactor` package:
-
-    ```sh
-    pip install presidio_image_redactor
-    
-    # Presidio image redactor uses the presidio-analyzer
-    # which requires a spaCy language model:
-    python -m spacy download en_core_web_lg
-    ```
-
-2. Install an OCR engine. The default version uses the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract).
-More information on installation can be found [here](image-redactor/index.md#installation).
-
-## Using Docker
-
-Presidio can expose REST endpoints for each service using Flask and Docker.
-To download the Presidio Docker containers, run the following command:
-
-!!! note "Note"
-
-    This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).
-
-### For PII anonymization in text
-
-For PII detection and anonymization in text, the `presidio-analyzer`
-and `presidio-anonymizer` modules are required.
-
-```sh
-# Download Docker images
-docker pull mcr.microsoft.com/presidio-analyzer
-docker pull mcr.microsoft.com/presidio-anonymizer
-
-# Run containers with default ports
-docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
-
-docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
-```
-
-### For PII redaction in images
-
-For PII detection in images, the `presidio-image-redactor` is required.
-
-```sh
-# Download Docker image
-docker pull mcr.microsoft.com/presidio-image-redactor
-
-# Run container with the default port
-docker run -d -p 5003:3000 mcr.microsoft.com/presidio-image-redactor:latest
-```
-
-Once the services are running, their APIs are available.
-API reference and example calls can be found [here](api.md).
-
-## Install from source
-
-To install Presidio from source, first clone the repo:
-
-* using HTTPS
-
-```sh
-git clone https://github.com/microsoft/presidio.git
-```
-
-* Using SSH
-
-```sh
-git clone git@github.com:microsoft/presidio.git
-```
-
-Then, build the containers locally.
-
-!!! note "Note"
-    Presidio uses [docker-compose](https://docs.docker.com/compose/) to manage the different Presidio containers.
-
-From the root folder of the repo:
-
-```sh
-docker-compose up --build
-```
-
-Alternatively, you can build and run individual services.
-For example, for the `presidio-anonymizer` service:
-
-```sh
-docker build ./presidio-anonymizer -t presidio/presidio-anonymizer
-```
-
-And run:
-
-```sh
-docker run -d -p 5001:5001 presidio/presidio-anonymizer
-```
-
----
-
-For more information on developing locally,
-refer to the [setting up a development environment](development.md) section.
+# Installing Presidio
+
+## Description
+
+This document describes the installation of the entire
+Presidio suite using `pip` (as Python packages) or using `Docker` (As containerized services).
+
+## Using pip
+
+!!! note "Note"
+
+    Consider installing the Presidio python packages
+    in a virtual environment like [venv](https://docs.python.org/3/tutorial/venv.html)
+    or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
+
+### Supported Python Versions
+
+Presidio is supported for the following python versions:
+
+* 3.10
+* 3.11
+* 3.12
+* 3.13
+
+### PII anonymization on text
+
+For PII anonymization on text, install the `presidio-analyzer` and `presidio-anonymizer` packages
+with at least one NLP engine (`spaCy`, `transformers` or `stanza`):
+
+===+ "spaCy (default)"
+
+    ```
+    pip install presidio_analyzer
+    pip install presidio_anonymizer
+    python -m spacy download en_core_web_lg
+    ```
+
+=== "Transformers"
+
+    ```
+    pip install "presidio_analyzer[transformers]"
+    pip install presidio_anonymizer
+    python -m spacy download en_core_web_sm
+    ```
+
+    !!! note "Note"
+
+        When using a transformers NLP engine, Presidio would still use spaCy for other capabilities,
+        therefore a small spaCy model (such as en_core_web_sm) is required.
+        Transformers models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/transformers.md#downloading-a-pre-trained-model)
+
+=== "Stanza"
+
+    ```
+    pip install "presidio_analyzer[stanza]"
+    pip install presidio_anonymizer
+    ```
+
+
+    !!! note "Note"
+
+        Stanza models would be loaded lazily. To pre-load them, see: [Downloading a pre-trained model](./analyzer/nlp_engines/spacy_stanza.md#download-the-pre-trained-model).
+
+### GPU acceleration (optional)
+
+For GPU acceleration, install the appropriate dependencies for your hardware:
+
+- **Linux with NVIDIA GPU**: `pip install "spacy[cuda12x]"` (or the version matching your CUDA installation)
+- **macOS with Apple Silicon**: MPS is detected automatically, no additional dependencies required.
+
+For detailed GPU setup, verification, and troubleshooting, see [GPU Acceleration](./analyzer/nlp_engines/gpu_usage.md).
+
+### PII redaction in images
+
+For PII redaction in images
+
+1. Install the `presidio-image-redactor` package:
+
+    ```sh
+    pip install presidio_image_redactor
+
+    # Presidio image redactor uses the presidio-analyzer
+    # which requires a spaCy language model:
+    python -m spacy download en_core_web_lg
+    ```
+
+2. Install an OCR engine. The default version uses the [Tesseract OCR Engine](https://github.com/tesseract-ocr/tesseract).
+More information on installation can be found [here](image-redactor/index.md#installation).
+
+## Using Docker
+
+Presidio can expose REST endpoints for each service using Flask and Docker.
+To download the Presidio Docker containers, run the following command:
+
+!!! note "Note"
+
+    This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).
+
+### For PII anonymization in text
+
+For PII detection and anonymization in text, the `presidio-analyzer`
+and `presidio-anonymizer` modules are required.
+
+```sh
+# Download Docker images
+docker pull mcr.microsoft.com/presidio-analyzer
+docker pull mcr.microsoft.com/presidio-anonymizer
+
+# Run containers with default ports
+docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
+
+docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
+```
+
+### For PII redaction in images
+
+For PII detection in images, the `presidio-image-redactor` is required.
+
+```sh
+# Download Docker image
+docker pull mcr.microsoft.com/presidio-image-redactor
+
+# Run container with the default port
+docker run -d -p 5003:3000 mcr.microsoft.com/presidio-image-redactor:latest
+```
+
+Once the services are running, their APIs are available.
+API reference and example calls can be found [here](api.md).
+
+## Install from source
+
+To install Presidio from source, first clone the repo:
+
+* using HTTPS
+
+```sh
+git clone https://github.com/microsoft/presidio.git
+```
+
+* Using SSH
+
+```sh
+git clone git@github.com:microsoft/presidio.git
+```
+
+Then, build the containers locally.
+
+!!! note "Note"
+    Presidio uses [docker-compose](https://docs.docker.com/compose/) to manage the different Presidio containers.
+
+From the root folder of the repo:
+
+```sh
+docker-compose up --build
+```
+
+Alternatively, you can build and run individual services.
+For example, for the `presidio-anonymizer` service:
+
+```sh
+docker build ./presidio-anonymizer -t presidio/presidio-anonymizer
+```
+
+And run:
+
+```sh
+docker run -d -p 5001:5001 presidio/presidio-anonymizer
+```
+
+## Building custom Docker images for additional languages
+
+The official Presidio Docker images only include English (en) language support out of the box.
+To add support for additional languages, you need to build a custom Docker image and configure the relevant YAML files.
+
+### Identify which YAML files to modify
+
+The Docker container for `presidio-analyzer` is driven by three configuration files,
+passed as build arguments:
+
+| Build argument | Default value | Purpose |
+|---|---|---|
+| `NLP_CONF_FILE` | `presidio_analyzer/conf/default.yaml` | Defines which NLP engine and model to use |
+| `ANALYZER_CONF_FILE` | `presidio_analyzer/conf/default_analyzer.yaml` | Analyzer behavior (thresholds, entity mapping) |
+| `RECOGNIZER_REGISTRY_CONF_FILE` | `presidio_analyzer/conf/default_recognizers.yaml` | Which recognizers to load per language |
+
+To add a new language (e.g., German — `de`), you primarily need to modify `default_recognizers.yaml`
+to enable recognizers for that language.
+
+### Modify `default_recognizers.yaml`
+
+Open `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`.
+For each recognizer you want to enable for your language, add a language entry:
+
+```yaml
+- name: SpacyRecognizer
+  supported_languages:
+  - language: en
+  - language: de   # ← add your language here
+  type: predefined
+```
+
+Not all recognizers support every language. Check the recognizer's source code or the
+[supported entities list](../supported_entities.md) for per-language coverage.
+
+### Build the custom Docker image
+
+```sh
+docker build \
+  --build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/default_recognizers.yaml \
+  --build-arg NLP_CONF_FILE=presidio_analyzer/conf/default.yaml \
+  --build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/default_analyzer.yaml \
+  -t my-presidio-analyzer:de \
+  ./presidio-analyzer
+```
+
+Or, to add multiple languages, modify the YAML files in a local copy and point the build to it:
+
+```sh
+cp -r presidio-analyzer presidio-analyzer-custom
+# edit presidio-analyzer-custom/presidio_analyzer/conf/default_recognizers.yaml
+docker build -t my-presidio-analyzer:multi presidio-analyzer-custom
+```
+
+### Add an NLP model for your language (spaCy)
+
+The default NLP engine is spaCy. To support your language, download the corresponding spaCy model
+and install it in the Dockerfile. In `presidio-analyzer/install_nlp_models.py`, spaCy models are
+installed based on `default.yaml`:
+
+```yaml
+# In default.yaml — add your language model:
+nlp_engine_name: spacy
+models:
+  - lang_code: en
+    model_name: en_core_web_lg
+  - lang_code: de
+    model_name: de_core_news_lg
+```
+
+Then install the model in your custom Dockerfile (add before the final `COPY . /app/` line):
+
+```dockerfile
+RUN python -m spacy download de_core_news_lg
+```
+
+### Typical pitfalls
+
+**Adding too many languages at once causes OOM**
+
+Each language model consumes significant RAM/CPU. If you add many large models (e.g., `de_core_news_lg`,
+`es_core_news_lg`, `fr_core_news_lg`), the Docker container may run out of memory during model loading.
+Start with one language, verify the container starts and responds, then add languages incrementally.
+
+**NLP recognizer warning after adding new languages**
+
+If you see a warning like:
+
+```
+UserWarning: NLP recognizer (e.g. SpacyRecognizer, StanzaRecognizer) is not in the list
+of recognizers for language en.
+```
+
+This means the NLP recognizer is registered for `en` in the `nlp_engine_name` section of `default.yaml`
+but not listed in `default_recognizers.yaml` with `language: en`. To fix, ensure the NLP recognizer
+has `en` in its `supported_languages` list, or remove `en` from the NLP engine configuration if
+you do not need it.
+
+**Memory tuning for production**
+
+For production deployments with multiple language models, consider:
+
+- Setting `WORKERS=1` (default) to limit memory per worker
+- Using `--env "PYTHONMALLOCSTATS=1"` to monitor Python memory usage
+- Allocating sufficient memory to the Docker daemon (minimum 4 GB recommended for 3+ language models)
+
+### Further reading
+
+- [NLP engine configuration](../analyzer/nlp_engines/spacy_stanza.md)
+- [Supported entities list](../supported_entities.md)
+- [Development environment setup](development.md)
+
+---
+
+For more information on developing locally,
+refer to the [setting up a development environment](development.md) section.
\ No newline at end of file