feat: PuLID-Flux identity-injection support by RapidMark · Pull Request #1542 · leejet/stable-diffusion.cpp

RapidMark · 2026-05-22T00:58:25Z

This PR adds support for PuLID-Flux
identity preservation to the Flux denoise loop. Given a single source
portrait, generated images preserve the source person's face across
arbitrary scenes and prompts.

What's included

- `src/pulid.hpp`: `PuLIDPerceiverAttentionCA`, the cross-attention
  module mirroring the PyTorch reference at
  [ToTheBeginning/PuLID/.../encoders_transformer.py](https://github.com/ToTheBeginning/PuLID/blob/main/pulid/encoders_transformer.py).
  Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without
  backend-specific code.
- `src/flux.hpp`: adds 20 `pulid_ca.<i>` child blocks to `Flux`
  (constructed conditionally when `params.pulid_enabled` is set),
  inserts the cross-attention call between transformer blocks at the
  intervals the PyTorch reference uses (every 2nd double block, every
  4th single block), and threads two new optional parameters
  (`pulid_id`, `pulid_id_weight`) through `forward`, `forward_orig`,
  `forward_chroma_radiance`, `forward_flux_chroma`, `compute`, and
  `build_graph`.
- `src/stable-diffusion.cpp`: loads `pulid_*.safetensors` via
  `model_loader.init_from_file` under the existing
  `model.diffusion_model.` prefix so PuLID-CA tensors bind to the new
  blocks naturally. PuLID-encoder keys (which live in the precompute
  tool, not in C++) are correctly identified as unknown. Adds
  `load_pulid_id_embedding()` to parse a small `.pulidembd` binary
  file and wrap its content as a `sd::Tensor<float>` passed via
  `DiffusionParams`.
- `include/stable-diffusion.h`: public API: `sd_pulid_params_t`
  (per-generation embedding path + weight), `pulid_weights_path` on
  `sd_ctx_params_t`, `pulid_params` on `sd_img_gen_params_t`.
- `examples/common/common.{cpp,h}`: three new CLI flags:
  `--pulid-weights <path>`, `--pulid-id-embedding <path>`, and
  `--pulid-id-weight <float>`.
- `src/diffusion_model.hpp`: extends `DiffusionParams` to carry the
  new identity embedding + weight; `FluxModel::compute` forwards both
  through.
- `docs/pulid.md`: usage, binary format spec, supported PuLID weight
  versions (v0.9.0 / v0.9.1; v1.1 deferred), memory budget notes, and
  a three-way SHA-256 falsification recipe.
- `scripts/pulid_extract_id.py`: reference precompute tool that
  produces the `.pulidembd` binary from a source portrait. Lives
  outside the C++ build because identity extraction (insightface +
  EVA-CLIP-L + IDFormer) is a heavy PyTorch stack that would be
  impractical to port to ggml just to run once per source person.
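
The injection schedule described for `src/flux.hpp` can be sketched in a few lines of Python. This is an illustrative sketch, not code from the PR: it assumes the standard Flux.1 depths (19 double blocks, 38 single blocks), that "every 2nd / every 4th" means block indices where `i % interval == 0`, and that each injection is a weighted residual add. The names `pulid_schedule` and `inject` are hypothetical.

```python
def pulid_schedule(num_double=19, num_single=38,
                   double_interval=2, single_interval=4):
    """Return (block_kind, block_index, ca_index) for each injection point.

    Assumption: injection happens at blocks whose index is a multiple of
    the interval, and pulid_ca.<ca_index> is consumed in order.
    """
    schedule = []
    ca_index = 0
    for i in range(num_double):
        if i % double_interval == 0:
            schedule.append(("double", i, ca_index))
            ca_index += 1
    for i in range(num_single):
        if i % single_interval == 0:
            schedule.append(("single", i, ca_index))
            ca_index += 1
    return schedule


def inject(img, ca_out, weight):
    """Assumed residual update at an injection point: img + weight * ca_out."""
    return [x + weight * y for x, y in zip(img, ca_out)]


if __name__ == "__main__":
    sched = pulid_schedule()
    # 10 double-block + 10 single-block injection points under these assumptions
    print(len(sched))
    print(sched[0], sched[-1])
```

Under these assumptions the schedule has exactly 20 entries, which matches the 20 `pulid_ca.<i>` blocks listed above. With a residual add of this form, a weight of 0.0 leaves the image tokens unchanged for finite cross-attention outputs.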

Why split extraction from injection

PuLID-Flux's identity extractor is a stack of three large model
components (insightface face detection with ArcFace embeddings +
EVA-CLIP-L vision encoder + IDFormer perceiver-resampler). Porting all
three to C++/ggml would add ~5000 lines for code that runs once per
source person and produces a 131 KB output. Having sd.cpp consume a
precomputed binary file instead has three benefits: the C++ surface
area stays small (~600 lines); the heavy ML stack only needs to run
once per person, on any backend that supports PyTorch; and PuLID
support is decoupled from the active development on insightface /
EVA-CLIP / IDFormer.

Binary format

offset 0   : magic "PULIDV01"      (8 bytes ASCII)
offset 8   : num_tokens (uint32 LE)
offset 12  : token_dim (uint32 LE)
offset 16  : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17  : reserved zeros        (15 bytes; header total = 32)
offset 32  : tokens, row-major LE

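Typical (num_tokens=32, token_dim=2048, fp16): 32 x 2048 x 2 bytes =
131,072 bytes of token data plus the 32-byte header, about 131 KB.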
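
As an illustration of the layout above, here is a minimal stdlib-only Python sketch of a writer and reader for the format. It is not the PR's `load_pulid_id_embedding()` or the code in `scripts/pulid_extract_id.py`; the function names are hypothetical, and the strict payload-size check and the truncating bf16 conversion are assumptions rather than documented behavior.

```python
import struct

MAGIC = b"PULIDV01"
# magic, num_tokens, token_dim, dtype, then 15 reserved zero bytes (32 total)
HEADER = struct.Struct("<8sIIB15x")
FP16, BF16, FP32 = 0, 1, 2
ELEM_SIZE = {FP16: 2, BF16: 2, FP32: 4}


def write_pulidembd(rows, dtype=FP16):
    """Serialize equal-length rows of floats into .pulidembd bytes."""
    if dtype not in ELEM_SIZE:
        raise ValueError(f"unknown dtype code {dtype}")
    num_tokens = len(rows)
    token_dim = len(rows[0]) if rows else 0
    if any(len(r) != token_dim for r in rows):
        raise ValueError("all rows must have the same length")
    flat = [v for r in rows for v in r]
    n = len(flat)
    if dtype == FP16:
        payload = struct.pack(f"<{n}e", *flat)
    elif dtype == FP32:
        payload = struct.pack(f"<{n}f", *flat)
    else:
        # bf16 = top 16 bits of the fp32 bit pattern (truncation, no rounding)
        bits = struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *flat))
        payload = struct.pack(f"<{n}H", *(b >> 16 for b in bits))
    return HEADER.pack(MAGIC, num_tokens, token_dim, dtype) + payload


def read_pulidembd(data):
    """Parse .pulidembd bytes into num_tokens rows of Python floats."""
    if len(data) < HEADER.size:
        raise ValueError("shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if dtype not in ELEM_SIZE:
        raise ValueError(f"unknown dtype code {dtype}")
    n = num_tokens * token_dim
    body = data[HEADER.size:]
    if len(body) != n * ELEM_SIZE[dtype]:
        raise ValueError("payload size does not match header")
    if dtype == FP16:
        flat = struct.unpack(f"<{n}e", body)
    elif dtype == FP32:
        flat = struct.unpack(f"<{n}f", body)
    else:
        halves = struct.unpack(f"<{n}H", body)
        widened = struct.pack(f"<{n}I", *(h << 16 for h in halves))
        flat = struct.unpack(f"<{n}f", widened)
    return [list(flat[i * token_dim:(i + 1) * token_dim])
            for i in range(num_tokens)]
```

A (32, 2048) fp16 embedding written this way is 32 + 131,072 = 131,104 bytes, consistent with the "131 KB" figure.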

Verification

The three-way SHA-256 falsification recipe in docs/pulid.md
distinguishes "the feature is wired but doesn't do anything" from
"the feature is actively altering the diffusion trajectory":

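| Run                                     | Expected hash relation             |
|-----------------------------------------|------------------------------------|
| A: no `--pulid-*` flags                 | baseline                           |
| B: PuLID flags, `--pulid-id-weight 0.0` | byte-identical to A                |
| C: PuLID flags, `--pulid-id-weight 1.0` | differs, preserves source identity |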
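
The hash comparison behind this table can be automated with a short script. The sketch below is an illustration, not the recipe from docs/pulid.md: the function names are hypothetical, and it assumes the three output files contain nothing run-specific other than the pixels (for example, no embedded parameter text that would differ between runs A and B).

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_three_way(run_a, run_b, run_c):
    """Compare the outputs of runs A, B, and C by hash.

    A == B shows that weight 0.0 is a no-op; A != C shows that weight 1.0
    actually changes the output. Identity preservation in C still needs a
    visual check; a hash cannot confirm it.
    """
    a, b, c = (sha256_of(p) for p in (run_a, run_b, run_c))
    return {
        "weight0_is_noop": a == b,
        "weight1_changes_output": a != c,
    }
```

Both values in the returned dictionary should be true for the feature to count as wired in and active.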

Verified on three backend/GPU configurations built from the same
source code:

- Vulkan-AMD (RX 6700 XT, `-DSD_VULKAN=ON`): A == B byte-identical,
  A != C, C visually preserves source identity.
- Vulkan-NVIDIA (RTX 3060, same binary, `--backend "diffusion=vulkan1"`):
  A == B, A != C, C visually equivalent to the AMD output at the same
  seed (different bytes per the usual cross-backend nondeterminism).
- CUDA-NVIDIA (RTX 3060, separate `-DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86`
  build against CUDA 13.2): A == B byte-identical, A != C, C visually
  preserves source identity.

`PuLIDPerceiverAttentionCA`'s pure-ggml graph code runs unchanged
across all three configurations; no backend-specific conditionals were
needed.

Per-image sampling times at 512x512 / 4 steps / Flux Schnell Q4 + PuLID:

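| Backend              | Sampling (s) | Notes                         |
|----------------------|-------------:|-------------------------------|
| AMD 6700 XT (Vulkan) |           22 | 12 GB consumer card           |
| NVIDIA 3060 (Vulkan) |           11 | same binary as AMD            |
| NVIDIA 3060 (CUDA)   |          9.6 | separate `-DSD_CUDA=ON` build |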

batch_count=3 was tested separately and shows the benefit of a
long-lived worker: per-image sampling drops from 19.6 s (cold) to
~11 s (warm) because the model stays resident across batch iterations.

Tested with Flux Schnell Q4_K_S + PuLID v0.9.1 at 512x512 / 4 steps,
and Flux Dev Q4_K_S + PuLID v0.9.1 at 768x768 / 20 steps. Flux Dev +
PuLID at 1024x1024 runs out of memory on a 12 GB card unless the VAE
is routed to the CPU backend via `--backend "vae=cpu"`. `--vae-on-cpu`
alone is not enough, because it offloads only the weights, not the
compute buffer. This is existing stable-diffusion.cpp behavior rather
than a PuLID-specific issue, but it is documented in docs/pulid.md
because PuLID users will hit it.

Tested with batch_count > 1 (verified each image gets the same
identity, different composition).

Not yet supported (called out in docs/pulid.md)

- PuLID v1.1 (`pulid_v1.1.safetensors`): has a renamed key layout
  (`id_adapter_attn_layers.*` vs `pulid_ca.*`) and potentially a
  different module structure. Deferred to a follow-up PR.
- Multiple ID images fused into one embedding: the reference Python
  pipeline supports this; the current precompute tool accepts only
  one portrait per run.
- The `--true-cfg` negative-prompt branch: the reference
  implementation injects PuLID only on the positive conditioning
  path, and this PR matches that behavior.

Backward compatibility

Non-PuLID generations are unaffected. The `params.pulid_enabled` flag
defaults to false and is set only when the model loader sees a
`pulid_ca.*` tensor in the loaded safetensors file. A regression run
of Flux Schnell Q4 without `--pulid-*` flags produces output that is
byte-identical to the pre-patch build.
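
The detection rule described above amounts to a scan over tensor names. The Python sketch below illustrates the idea only; it is not the C++ loader logic from the PR. The helper names and the example tensor names are hypothetical, and matching on the substring `pulid_ca.` (rather than on the full `model.diffusion_model.` prefix) is an assumption.

```python
PULID_CA_MARKER = "pulid_ca."  # assumed marker for PuLID cross-attention tensors


def pulid_enabled(tensor_names):
    """True if any loaded tensor name refers to a pulid_ca.* block."""
    return any(PULID_CA_MARKER in name for name in tensor_names)


def split_pulid_keys(tensor_names):
    """Partition names into (pulid_ca tensors, everything else)."""
    ca = [n for n in tensor_names if PULID_CA_MARKER in n]
    other = [n for n in tensor_names if PULID_CA_MARKER not in n]
    return ca, other


if __name__ == "__main__":
    # Hypothetical tensor names, for illustration only.
    names = [
        "model.diffusion_model.double_blocks.0.img_attn.qkv.weight",
        "model.diffusion_model.pulid_ca.0.to_q.weight",
    ]
    print(pulid_enabled(names))
    print(pulid_enabled(names[:1]))
```

A model file with no `pulid_ca.*` tensors leaves the flag false, which is why the non-PuLID path is unchanged.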

File summary

include/stable-diffusion.h          +34 / -0
src/stable-diffusion.cpp           +120 / -0
src/diffusion_model.hpp              +5 / -1
src/flux.hpp                       +106 / -10
src/pulid.hpp                      +127 / -0   (new)
examples/common/common.h             +6 / -0
examples/common/common.cpp          +19 / -0
docs/pulid.md                      +220 / -0   (new)
scripts/pulid_extract_id.py        +135 / -0   (new)

Total ~770 added lines, ~10 changed. No removed functionality.


RapidMark force-pushed the cloudhands/pulid-flux branch from 616d8d0 to aef4d29 Compare May 22, 2026 01:39

RapidMark wants to merge 1 commit into leejet:master from CloudhandsAI:cloudhands/pulid-flux
