leejet · RapidMark · May 22, 2026
diff --git a/docs/pulid.md b/docs/pulid.md
@@ -0,0 +1,195 @@
+# PuLID-Flux face-identity preservation
+
+stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
+identity-injection technique on top of Flux.1 (schnell or dev) models.
+Given a single source portrait, PuLID-Flux produces new generations that
+preserve the source person's face across arbitrary scenes, poses, and
+prompts.
+
+Unlike PhotoMaker (which extracts the identity inside the inference
+process from a directory of images), PuLID-Flux's identity extractor is
+a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
+is impractical to port to C++/ggml. To keep this implementation small and
+cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
+embedding** produced by an external Python tool that runs once per source
+portrait. Everything downstream of that one-shot extraction is C++/ggml and
+should run on any backend (tested so far: Vulkan and CPU only).
+
+## Architecture summary
+
+The PuLID-Flux contribution to the Flux denoise loop is a stack of 20
+small cross-attention modules (`PerceiverAttentionCA`) inserted between
+the Flux transformer blocks:
+
+- After every 2nd of the 19 double-stream blocks (10 hook points)
+- After every 4th of the 38 single-stream blocks (10 hook points)
+
+Each cross-attention layer takes the current image tokens as query, the
+32-token / 2048-dim identity embedding as key+value, and adds its output
+(scaled by `id_weight`, typically 1.0) back to the image tokens.
+
+## Required weights
+
+Three inputs are required (the standard Flux weight set plus two PuLID-specific files):
+
+1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
+   [docs/flux.md](flux.md) describes.
+2. **PuLID weights** -- download from
+   [guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
+   - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
+     (recommended; this implementation is verified against v0.9.1)
+   - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
+     renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
+     and possibly a different module structure. Support is left to a future PR.
+3. **Identity embedding (.pulidembd)** -- produced by the precompute
+   tool below.
+
+## Precompute the identity embedding
+
+The precompute tool runs the PyTorch identity-extraction stack on a
+single portrait image and writes the resulting `(32, 2048)` embedding
+to a `.pulidembd` binary file (about 131 KB). Run it once per source
+person; the same file is reused for any number of generations.
+
+A reference Python script is provided in this repository at
+[`scripts/pulid_extract_id.py`](../scripts/pulid_extract_id.py). It
+requires:
+- A working CUDA / CPU PyTorch + diffusers stack
+- `insightface`, `facexlib`, `eva-clip`, `torchvision`
+- The PuLID weights file (same one stable-diffusion.cpp will load below)
+- The ToTheBeginning/PuLID repo's `pulid/pipeline_flux.py` (and its
+  dependencies under `pulid/` and `flux/`) -- recommended to vendor
+  rather than pip-install due to upstream packaging quirks
+
+Run it as:
+
+```
+python pulid_extract_id.py \
+  --portrait /path/to/source-photo.jpg \
+  --pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
+  --out /path/to/source.pulidembd
+```
+
+## Binary format (.pulidembd)
+
+```
+offset 0   : magic "PULIDV01"      (8 bytes ASCII)
+offset 8   : num_tokens (uint32 LE)   typically 32
+offset 12  : token_dim (uint32 LE)    typically 2048
+offset 16  : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
+offset 17  : reserved zeros        (15 bytes; header total = 32)
+offset 32  : tokens, row-major LE  (num_tokens * token_dim values)
+```
+
+stable-diffusion.cpp parses the header, validates the magic, and converts
+the tokens to fp32 at load time. For the typical (32, 2048, fp16) case the
+file is 32 + 32 * 2048 * 2 = 131,104 bytes (about 131 KB).
+
+## Command-line usage
+
+```
+.\bin\Release\sd-cli.exe `
+  --diffusion-model     models\flux1-schnell-Q4_K_S.gguf `
+  --vae                 models\ae.safetensors `
+  --clip_l              models\clip_l.safetensors `
+  --t5xxl               models\t5xxl_fp16.safetensors `
+  --pulid-weights       models\pulid_flux_v0.9.1.safetensors `
+  --pulid-id-embedding  source.pulidembd `
+  --pulid-id-weight     1.0 `
+  -p "candid photograph of a young woman on a beach at sunset" `
+  --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 `
+  --seed 42 --clip-on-cpu `
+  -o out.png
+```
+
+For Flux Dev (instead of Schnell), swap in the Dev model file, add `--guidance 3.5`, and change `--steps 4` to `--steps 20`.
+
+## Flags
+
+| Flag                       | Purpose                                                           |
+|----------------------------|-------------------------------------------------------------------|
+| `--pulid-weights <path>`   | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model.   |
+| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool.    |
+| `--pulid-id-weight <f>`    | Identity-injection strength. Typical 0.7-1.2; default 1.0.        |
+
+`--pulid-weights` and `--pulid-id-embedding` must both be set to activate
+PuLID; `--pulid-id-weight` is optional and defaults to 1.0. Setting only
+`--pulid-weights` (no embedding) loads the weights but leaves injection
+disabled at runtime. `--pulid-id-weight 0` zeros out the contribution (useful
+for falsification testing: output should be byte-identical to a no-PuLID run with the same seed).
+
+## Memory budget
+
+At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
+10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB
+consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and
+t5xxl + GPU-resident VAE.
+
+At 1024x1024 with Flux Dev Q4, 20 steps, and PuLID, the VAE decode compute
+buffer does not fit on a 12 GB card even with `--vae-on-cpu`. As a workaround,
+route the VAE to the CPU backend explicitly instead of relying on the offload flag:
+
+```
+--backend "diffusion=vulkan0,vae=cpu"
+```
+
+The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
+on the default backend; this is existing stable-diffusion.cpp behavior,
+not a PuLID-specific issue. Documented here because anyone running PuLID
+at 1024 will hit it.
+
+## Backend selection
+
+The standard `--backend` flag works as documented. Common patterns:
+
+```
+# AMD Vulkan (device index depends on enumeration order on your system)
+--backend "diffusion=vulkan0,vae=cpu"
+
+# NVIDIA Vulkan (index likewise system-dependent)
+--backend "diffusion=vulkan1,vae=cpu"
+
+# CUDA
+--backend "diffusion=cuda0,vae=cpu"
+```
+
+The PuLID cross-attention layers run on the same backend as the main
+diffusion model. They have not yet been independently profiled on every
+backend; only Vulkan and CPU have been tested by the original contributor.
+
+## Verification
+
+A three-way SHA-256 check is the recommended sanity test when bringing up
+a new combination of model + backend + hardware:
+
+| Run                                          | Expected hash relation             |
+|----------------------------------------------|------------------------------------|
+| A: no `--pulid-*` flags                      | baseline                           |
+| B: PuLID flags, `--pulid-id-weight 0.0`      | **byte-identical to A**            |
+| C: PuLID flags, `--pulid-id-weight 1.0`      | **different from A,B**, preserves source identity |
+
+If A and B differ, the injection path is allocating or computing something
+even at zero weight, which is likely a bug. If A and C match, PuLID is not active.
+
+## Limitations / not yet supported
+
+- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
+  supported. The `pulid_ca` index advances per non-skipped block, so a
+  skipped block silently misaligns the cross-attention weight assignment
+  vs. the trained intervals. The reference PyTorch implementation does
+  not have SLG either, so there is no well-defined behavior to emulate.
+  Use either feature alone.
+- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout) are not supported yet.
+- **Multiple ID images.** The reference PyTorch implementation can fuse
+  several portraits into one embedding for stronger identity. This
+  implementation accepts only a single embedding per generation; any
+  multi-image fusion must already be done by the external precompute tool.
+- **Negative-prompt branch of CFG.** PuLID only injects on the positive
+  conditioning path in the published reference, and the implementation
+  here follows that. Flux's distilled guidance doesn't run a separate
+  uncond branch in normal use, so this matters only for `--true-cfg`
+  workflows that aren't standard for Flux.
+- **Backends other than Vulkan and CPU** are untested by the original
+  contributor. The implementation is pure-ggml and should work on CUDA,
+  ROCm, and Metal, but verification by users on those backends is
+  welcomed.
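
The hook counts in the architecture summary of docs/pulid.md (10 + 10 = 20) can be checked with a few lines of Python. This is a minimal sketch under two assumptions that this diff does not confirm: that a hook fires after block `i` whenever `i % interval == 0` (the only simple reading that yields 10 hooks for both 19 and 38 blocks), and that `pulid_ca` indices are assigned to double-stream hooks first and single-stream hooks second. The helper name `hook_points` is hypothetical.

```python
def hook_points(num_blocks, interval):
    """Block indices after which a PuLID cross-attention layer would run.

    Assumption: a hook fires after block i when i % interval == 0.
    """
    return [i for i in range(num_blocks) if i % interval == 0]


# Flux block counts and intervals as stated in docs/pulid.md.
double_hooks = hook_points(19, 2)  # every 2nd double-stream block
single_hooks = hook_points(38, 4)  # every 4th single-stream block

# Assumption: consecutive pulid_ca indices, double-stream hooks first.
ca_assignment = [("double", b) for b in double_hooks] + \
                [("single", b) for b in single_hooks]

for ca_index, (stream, block) in enumerate(ca_assignment):
    print(f"pulid_ca[{ca_index:2d}] runs after {stream}-stream block {block}")
```

Under these assumptions every `pulid_ca` index is tied to one fixed block position, which is why skipping a block (as `--skip-layers` does) would shift later layers away from the positions they were trained for.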
diff --git a/examples/common/common.cpp b/examples/common/common.cpp
@@ -384,6 +384,10 @@ ArgOptions SDContextParams::get_options() {
          "--photo-maker",
          "path to PHOTOMAKER model",
          &photo_maker_path},
+        {"",
+         "--pulid-weights",
+         "path to PuLID flux weights (e.g. pulid_flux_v0.9.1.safetensors). Identity is injected during the denoise loop when paired with --pulid-id-embedding.",
+         &pulid_weights_path},
         {"",
          "--upscale-model",
          "path to esrgan model.",
@@ -746,6 +750,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
         embedding_vec.data(),
         static_cast<uint32_t>(embedding_vec.size()),
         photo_maker_path.c_str(),
+        pulid_weights_path.c_str(),
         tensor_type_rules.c_str(),
         vae_decode_only,
         free_params_immediately,
@@ -825,6 +830,10 @@ ArgOptions SDGenerationParams::get_options() {
          "--pm-id-embed-path",
          "path to PHOTOMAKER v2 id embed",
          &pm_id_embed_path},
+        {"",
+         "--pulid-id-embedding",
+         "path to a .pulidembd binary produced by pulid_extract_id.py. Carries a (32, 2048) identity embedding extracted from a source portrait. Pair with --pulid-weights on the context.",
+         &pulid_id_embedding_path},
         {"",
          "--hires-upscaler",
          "highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
@@ -975,6 +984,10 @@ ArgOptions SDGenerationParams::get_options() {
          "--pm-style-strength",
          "",
          &pm_style_strength},
+        {"",
+         "--pulid-id-weight",
+         "strength of PuLID identity injection (default: 1.0). 0.7-1.2 are typical; lower lets the prompt override the face more, higher tightens identity match.",
+         &pulid_id_weight},
         {"",
          "--control-strength",
          "strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@@ -2207,6 +2220,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
         pm_style_strength,
     };
 
+    sd_pulid_params_t pulid_params = {
+        pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
+        pulid_id_weight,
+    };
+
     params.loras                 = lora_vec.empty() ? nullptr : lora_vec.data();
     params.lora_count            = static_cast<uint32_t>(lora_vec.size());
     params.prompt                = prompt.c_str();
@@ -2227,6 +2245,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
     params.control_image         = control_image.get();
     params.control_strength      = control_strength;
     params.pm_params             = pm_params;
+    params.pulid_params          = pulid_params;
     params.vae_tiling_params     = vae_tiling_params;
     params.cache                 = cache_params;
 

diff --git a/examples/common/common.h b/examples/common/common.h
@@ -100,6 +100,11 @@ struct SDContextParams {
     std::string control_net_path;
     std::string embedding_dir;
     std::string photo_maker_path;
+    // PuLID-Flux identity-preservation context path: the safetensors blob
+    // carrying the PerceiverAttentionCA cross-attention weights. Loaded
+    // once with the model. Per-generation pulid_id_embedding_path lives in
+    // SDGenerationParams below.
+    std::string pulid_weights_path;
     sd_type_t wtype = SD_TYPE_COUNT;
     std::string tensor_type_rules;
     std::string lora_model_dir = ".";
@@ -196,6 +201,12 @@ struct SDGenerationParams {
     std::string pm_id_embed_path;
     float pm_style_strength = 20.f;
 
+    // PuLID-Flux: per-generation identity embedding (binary file produced by
+    // scripts/pulid_extract_id.py). Format documented in
+    // include/stable-diffusion.h sd_pulid_params_t.
+    std::string pulid_id_embedding_path;
+    float pulid_id_weight = 1.0f;
+
     int upscale_repeats   = 1;
     int upscale_tile_size = 128;
 

diff --git a/include/stable-diffusion.h b/include/stable-diffusion.h
@@ -186,6 +186,16 @@ typedef struct {
     const sd_embedding_t* embeddings;
     uint32_t embedding_count;
     const char* photo_maker_path;
+    /**
+     * Path to pulid_flux_v0.9.1.safetensors (the PuLID identity-injection
+     * cross-attention weights). When set together with
+     * sd_img_gen_params_t.pulid_params.id_embedding_path, the Flux diffusion
+     * model performs PuLID cross-attention injection during the denoise loop.
+     * Loaded once with the model; the embedding is per-generation. Currently
+     * only meaningful for Flux (19 double-stream, 38 single-stream blocks);
+     * silently ignored for other model versions.
+     */
+    const char* pulid_weights_path;
     const char* tensor_type_rules;
     bool vae_decode_only;
     bool free_params_immediately;
@@ -266,6 +276,29 @@ typedef struct {
     float style_strength;
 } sd_pm_params_t;  // photo maker
 
+/**
+ * PuLID-Flux identity preservation params.
+ *
+ * Unlike PhotoMaker (which extracts the ID embedding inside the inference
+ * process from a directory of images), PuLID's ID extraction is a heavy
+ * Python-only stack (insightface ArcFace + EVA-CLIP-L + IDFormer). To stay
+ * cross-vendor in C++/ggml, sd.cpp consumes a precomputed binary file
+ * produced by an external tool (scripts/pulid_extract_id.py; see
+ * docs/pulid.md for usage).
+ *
+ * Binary format (.pulidembd):
+ *   offset 0   : magic "PULIDV01"      (8 bytes ASCII)
+ *   offset 8   : num_tokens (uint32 LE)
+ *   offset 12  : token_dim (uint32 LE)
+ *   offset 16  : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
+ *   offset 17  : reserved zeros        (15 bytes; header = 32 bytes total)
+ *   offset 32  : tokens, row-major LE  (num_tokens * token_dim values)
+ */
+typedef struct {
+    const char* id_embedding_path;  // path to .pulidembd file produced by pulid_extract_id.py
+    float id_weight;                // strength of the ID injection; typical 0.7-1.2, default 1.0
+} sd_pulid_params_t;
+
 enum sd_cache_mode_t {
     SD_CACHE_DISABLED = 0,
     SD_CACHE_EASYCACHE,
@@ -358,6 +391,7 @@ typedef struct {
     sd_image_t control_image;
     float control_strength;
     sd_pm_params_t pm_params;
+    sd_pulid_params_t pulid_params;
     sd_tiling_params_t vae_tiling_params;
     sd_cache_params_t cache;
     sd_hires_params_t hires;
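
The `.pulidembd` layout documented in the `sd_pulid_params_t` comment above is small enough to sketch with Python's standard `struct` module. The following is an illustrative sketch only, not the code in `pulid_extract_id.py` and not the C++ loader; the function names `pack_pulidembd` and `unpack_pulidembd` are hypothetical, and the bf16 path truncates rather than rounds, which is an assumption of this sketch.

```python
import struct

MAGIC = b"PULIDV01"
HEADER_SIZE = 32
DTYPE_FP16, DTYPE_BF16, DTYPE_FP32 = 0, 1, 2


def pack_pulidembd(tokens, dtype=DTYPE_FP16):
    """Serialize equal-length rows of floats into .pulidembd bytes."""
    num_tokens, token_dim = len(tokens), len(tokens[0])
    flat = [v for row in tokens for v in row]
    if len(flat) != num_tokens * token_dim:
        raise ValueError("all rows must have the same length")
    n = len(flat)
    # magic, num_tokens, token_dim, dtype, then 15 reserved zero bytes.
    header = struct.pack("<8sIIB15x", MAGIC, num_tokens, token_dim, dtype)
    if dtype == DTYPE_FP16:
        body = struct.pack(f"<{n}e", *flat)
    elif dtype == DTYPE_FP32:
        body = struct.pack(f"<{n}f", *flat)
    elif dtype == DTYPE_BF16:
        # bf16 = upper 16 bits of the fp32 bit pattern (truncation, no rounding).
        bits = struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *flat))
        body = struct.pack(f"<{n}H", *(b >> 16 for b in bits))
    else:
        raise ValueError(f"unknown dtype code {dtype}")
    return header + body


def unpack_pulidembd(data):
    """Parse .pulidembd bytes into a list of rows of Python floats."""
    if len(data) < HEADER_SIZE:
        raise ValueError("shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype = struct.unpack_from("<8sIIB", data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    n = num_tokens * token_dim
    if dtype == DTYPE_FP16:
        flat = struct.unpack_from(f"<{n}e", data, HEADER_SIZE)
    elif dtype == DTYPE_FP32:
        flat = struct.unpack_from(f"<{n}f", data, HEADER_SIZE)
    elif dtype == DTYPE_BF16:
        halves = struct.unpack_from(f"<{n}H", data, HEADER_SIZE)
        # Widen each bf16 value back to fp32 by restoring the low 16 zero bits.
        flat = struct.unpack(f"<{n}f",
                             struct.pack(f"<{n}I", *(h << 16 for h in halves)))
    else:
        raise ValueError(f"unknown dtype code {dtype}")
    return [list(flat[i * token_dim:(i + 1) * token_dim])
            for i in range(num_tokens)]
```

A round trip such as `unpack_pulidembd(pack_pulidembd(rows))` returns the original rows whenever the values are exactly representable in the chosen dtype; a 32 x 2048 fp16 embedding packs to 32 + 131,072 bytes.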