[DRAFT] FEAT add Policy Puppetry converter (#2080)#2081

Draft
kenlacroix wants to merge 1 commit into
microsoft:mainfrom
kenlacroix:feat/policy-puppetry-converter

@kenlacroix

Description

Implements the PolicyPuppetryConverter proposed in #2080 — a converter for HiddenLayer's Policy Puppetry technique (Apr 2025), which wraps a request in a fabricated policy/config block (XML/JSON/INI) that many models treat as trusted developer instructions.

This is a [DRAFT] opened alongside #2080 so the working implementation is visible while the design is still under discussion. I'd like maintainer steer on the open questions in #2080 before finalizing:

  1. Pure-template, no-LLM converter with a policy_format param (xml/json/ini) — does this match how you'd want it scoped?
  2. Leetspeak: compose with the existing LeetspeakConverter in a chain, or keep the optional leetspeak flag this draft currently exposes?
  3. Template packaging: the wrapper ships as a SeedPrompt YAML. Because SeedPrompt.from_yaml_file eagerly pre-renders trusted templates (collapsing the policy_format branch), this draft loads the YAML and constructs SeedPrompt(**data) directly. Preference between that, selecting the format block in Python, or three per-format YAMLs?
  4. Roleplay persona/scene is parameterized rather than hardcoded — keep generic, or ship a sensible default?

Design intentionally favors generic implementation per doc/contributing/2_incorporating_research.md. The shipped template uses a benign {{ prompt }} placeholder and a generalized persona, not a weaponized payload.
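
To make open question 1 concrete, here is a minimal, stdlib-only sketch of what a pure-template, no-LLM wrapper with a policy_format parameter could look like. It is not the PR's SeedPrompt YAML or any PyRIT API: the function name wrap_in_policy_block, the SUPPORTED_FORMATS tuple, and the policy/request field names are hypothetical, and the block is deliberately a bare structural skeleton.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical illustration only: not the PR's template or a PyRIT API.
SUPPORTED_FORMATS = ("xml", "json", "ini")


def wrap_in_policy_block(prompt: str, policy_format: str = "xml") -> str:
    """Wrap `prompt` in a bare config-style skeleton of the chosen format."""
    if policy_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"policy_format must be one of {SUPPORTED_FORMATS}, got {policy_format!r}"
        )
    if policy_format == "json":
        # json.dumps handles quoting and escaping of the prompt text.
        return json.dumps({"policy": {"request": prompt}}, indent=2)
    if policy_format == "ini":
        # INI values are kept single-line here, so newlines are folded into spaces.
        return "[policy]\nrequest = " + " ".join(prompt.splitlines())
    # XML: escape &, < and > so the prompt cannot break the surrounding tags.
    return f"<policy>\n  <request>{escape(prompt)}</request>\n</policy>"
```

Each branch is plain string construction, so the same input always produces the same output and no model call is involved, which is the scoping the question asks about.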

Tests and Documentation

  • Tests: tests/unit/prompt_converter/test_policy_puppetry_converter.py — 8 unit tests (placeholder substitution, each policy_format, formats differ, leetspeak toggle, input/output support). All pass locally (8 passed).
  • Documentation / JupyText: not yet added — I held the doc/code/converters/1_text_to_text_converters.py demo cell until the design (esp. the leetspeak + template-packaging questions) is settled, since those change the example. Will add the JupyText-paired cell before marking ready for review.
  • CLA: happy to sign the Microsoft CLA.

@kenlacroix

Author

@microsoft-github-policy-service agree

- HiddenLayer
groups:
- HiddenLayer
source: https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

Contributor

I'd like to cite that in the references.bib file, too.

Author

Done — added a hiddenlayer2025policypuppetry entry to doc/references.bib pointing at the HiddenLayer disclosure (in the latest push), and kept the source: URL on the template itself. Let me know if you'd prefer a different citation key.

Comment on lines +102 to +108
wrapped = self._prompt_template.render_template_value(
    prompt=prompt,
    policy_format=self._policy_format,
)

if self._leetspeak_converter is not None:
    wrapped = (await self._leetspeak_converter.convert_async(prompt=wrapped, input_type="text")).output_text

Contributor

I'm not convinced this needs a dedicated converter. We could just use TextJailbreakConverter with the given template and optionally add LeetspeakConverter on top. Am I missing anything?

Author

Good push — and for a single fixed format you're right: this would just be a SeedPrompt/TextJailbreakConverter template with LeetspeakConverter chained on, the same as the DAN/AIM-style jailbreaks. If that's all this were, a dedicated converter wouldn't earn its keep.

The reason it's a converter is the policy_format parameter (xml/json/ini), and it runs into a concrete lifecycle problem with the template route:

TextJailbreakConverter can only source its template via TextJailBreak, and every TextJailBreak path constructs the SeedPrompt with is_jinja_template=True (from_yaml_file for template_file_name/template_path; explicitly for string_template). That trips the eager render in SeedPrompt._render_and_infer_data_type, which runs self.value = render_template_value_silent(**PATHS_DICT) at construction. Since PATHS_DICT carries no policy_format, the {% if policy_format == "json" %} … {% elif "ini" %} … {% else %} block resolves with policy_format undefined and collapses to the XML else-branch before any format kwarg is applied. JSON and INI are unreachable through TextJailbreakConverter.

The converter avoids that by building the SeedPrompt straight from the parsed YAML (no is_jinja_template=True) and deferring the Jinja render to convert_async, where both prompt and policy_format are known. The "each policy_format" / "formats differ" unit tests exercise exactly that path.
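
The lifecycle difference described above can be modelled with a small stdlib-only toy. This is not PyRIT's SeedPrompt or Jinja code: select_block stands in for the template's if/elif/else, and EagerTemplate and DeferredTemplate are hypothetical names used only to contrast rendering at construction with rendering at conversion time.

```python
from typing import Optional

PLACEHOLDER = "{{ prompt }}"


def select_block(policy_format: Optional[str] = None) -> str:
    """Stand-in for the conditional template: a missing format falls through to XML."""
    if policy_format == "json":
        return '{"request": "' + PLACEHOLDER + '"}'
    elif policy_format == "ini":
        return "[policy]\nrequest = " + PLACEHOLDER
    else:
        return "<policy><request>" + PLACEHOLDER + "</request></policy>"


class EagerTemplate:
    """Resolves the branch once, at construction, before policy_format is known."""

    def __init__(self) -> None:
        self.value = select_block()  # no policy_format available yet

    def render(self, prompt: str, policy_format: str) -> str:
        # The conditional is already gone, so policy_format has no effect here.
        return self.value.replace(PLACEHOLDER, prompt)


class DeferredTemplate:
    """Keeps the conditional intact and resolves it when both values are known."""

    def render(self, prompt: str, policy_format: str) -> str:
        return select_block(policy_format).replace(PLACEHOLDER, prompt)


eager_json = EagerTemplate().render(prompt="hi", policy_format="json")
deferred_json = DeferredTemplate().render(prompt="hi", policy_format="json")
```

In this toy, eager_json comes back as the XML block even though "json" was requested, while deferred_json is the JSON block. That mirrors the claim that only a deferred render can reach the non-default branches.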

The alternatives I see:

  1. Three per-format YAML templates + TextJailbreakConverter. Works, but it's three near-duplicate files, format selection isn't validated (vs. the Literal["xml","json","ini"] here), and leetspeak stays a manual second chain step.
  2. This converter. One component, validated format enum, optional integrated leetspeak, mapping 1:1 to the HiddenLayer technique per 2_incorporating_research.md.

I went with (2) because Policy Puppetry isn't "another DAN" — the structural format variants are intrinsic to the technique, and the single-conditional template can't survive the trusted-template pre-render. That said, if you'd still prefer the template route for corpus consistency, I'm happy to split into three per-format templates and document the LeetspeakConverter chain — just flagging the conditional-collapse tradeoff so it's a deliberate call rather than a surprise. Your repo, your taste here.
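
On the "validated format enum" point: a Literal annotation is only enforced by static type checkers, so a runtime guard is needed if invalid strings should be rejected at construction. The sketch below shows one common way to derive that guard from the Literal itself. PolicyFormat and validate_policy_format are hypothetical names, and this thread does not show whether the converter performs such a runtime check.

```python
from typing import Literal, get_args

# Hypothetical alias mirroring the Literal["xml", "json", "ini"] mentioned above.
PolicyFormat = Literal["xml", "json", "ini"]


def validate_policy_format(value: str) -> str:
    """Reject any value that is not one of the Literal's members."""
    allowed = get_args(PolicyFormat)  # ("xml", "json", "ini")
    if value not in allowed:
        raise ValueError(f"policy_format must be one of {allowed}, got {value!r}")
    return value
```

Deriving the allowed values from the alias keeps the annotation and the runtime check from drifting apart when a format is added.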

Contributor

This is what I meant:

# pyrit/setup/initializers/components/scenario_techniques.py
from pyrit.executor.attack import AttackConverterConfig, PromptSendingAttack
from pyrit.prompt_converter import PolicyPuppetryConverter
from pyrit.prompt_normalizer import PromptConverterConfiguration

AttackTechniqueFactory(
    name="policy_puppetry",
    attack_class=PromptSendingAttack,
    strategy_tags=["core", "single_turn", "default", "light"],
    attack_kwargs={
        "attack_converter_config": AttackConverterConfig(
            request_converters=PromptConverterConfiguration.from_converters(
                converters=[PolicyPuppetryConverter(policy_format="xml")]
            )
        )
    },
)

You can pass the arg for the converter there. Or stack Leetspeak on top.

Author

Got it — thanks, that snippet clears it up. Keeping the dedicated converter then, with policy_format passed at construction exactly as you show.

On leetspeak: agreed it shouldn't live inside this converter. I've removed the leetspeak flag (and its internal LeetspeakConverter wiring) so this stays a single-responsibility format-wrapper; obfuscation is now composed in the chain, e.g. request_converters=[PolicyPuppetryConverter(policy_format="xml"), LeetspeakConverter()], matching your example. Dropped the corresponding unit test (now 7). Pushed.
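
As a rough illustration of why chain order matters when the two steps are composed, here is a stdlib-only toy with two async converters applied in sequence. toy_wrap, toy_leetspeak and run_chain are hypothetical stand-ins, not PyRIT's converters or its chaining mechanism, and the character map is an assumption chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# A converter here is just an async function from text to text (toy model).
Converter = Callable[[str], Awaitable[str]]

_LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0"})


async def toy_wrap(text: str) -> str:
    """Stand-in for the format wrapper: surrounds the text with a tag pair."""
    return f"<policy>{text}</policy>"


async def toy_leetspeak(text: str) -> str:
    """Stand-in for an obfuscation step: substitutes a few vowels."""
    return text.translate(_LEET)


async def run_chain(text: str, converters: Sequence[Converter]) -> str:
    """Apply each converter to the output of the previous one, in order."""
    for convert in converters:
        text = await convert(text)
    return text


wrap_then_leet = asyncio.run(run_chain("hello", [toy_wrap, toy_leetspeak]))
leet_then_wrap = asyncio.run(run_chain("hello", [toy_leetspeak, toy_wrap]))
```

With the wrapper first, the obfuscation step sees the whole wrapped string, so in this toy the tag names are substituted too. With the order reversed, only the inner text changes. How the real LeetspeakConverter treats markup is not covered in this thread.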

Want me to add the scenario_techniques.py AttackTechniqueFactory registration (tags core/single_turn/default/light, as in your snippet) in this PR, or land it as a follow-up once the converter merges? Happy either way — just say which and I'll wire it up along with the JupyText doc cell before marking ready for review.

@kenlacroix kenlacroix force-pushed the feat/policy-puppetry-converter branch from 5c2caf4 to cc17163 Compare June 27, 2026 04:07
Add PolicyPuppetryConverter, a pure-template (no-LLM) converter implementing
HiddenLayer's Policy Puppetry technique: wraps a prompt in a fabricated
policy/config block (xml/json/ini, selectable via policy_format) so models
treat it as trusted developer instructions. Optional leetspeak composition.
Template ships as a SeedPrompt YAML with a benign {{ prompt }} placeholder.

Includes unit tests (8) and registration in prompt_converter/__init__.py.
Opened as a draft pending maintainer feedback on the design questions in microsoft#2080.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kenlacroix kenlacroix force-pushed the feat/policy-puppetry-converter branch from cc17163 to 9921407 Compare June 27, 2026 06:41