[DRAFT] FEAT add Policy Puppetry converter (#2080) #2081
@microsoft-github-policy-service agree
  - HiddenLayer
groups:
  - HiddenLayer
source: https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/
I'd like to cite that in the references.bib file, too.
Done — added a hiddenlayer2025policypuppetry entry to doc/references.bib pointing at the HiddenLayer disclosure (in the latest push), and kept the source: URL on the template itself. Let me know if you'd prefer a different citation key.
wrapped = self._prompt_template.render_template_value(
    prompt=prompt,
    policy_format=self._policy_format,
)
if self._leetspeak_converter is not None:
    wrapped = (await self._leetspeak_converter.convert_async(prompt=wrapped, input_type="text")).output_text
I'm not convinced this needs a dedicated converter. We could just use TextJailbreakConverter with the given template and optionally add LeetspeakConverter on top. Am I missing anything?
Good push — and for a single fixed format you're right: this would just be a SeedPrompt/TextJailbreakConverter template with LeetspeakConverter chained on, the same as the DAN/AIM-style jailbreaks. If that's all this were, a dedicated converter wouldn't earn its keep.
The reason it's a converter is the policy_format parameter (xml/json/ini), and it runs into a concrete lifecycle problem with the template route:
TextJailbreakConverter can only source its template via TextJailBreak, and every TextJailBreak path constructs the SeedPrompt with is_jinja_template=True (from_yaml_file for template_file_name/template_path; explicitly for string_template). That trips the eager render in SeedPrompt._render_and_infer_data_type — self.value = render_template_value_silent(**PATHS_DICT) at construction. Since PATHS_DICT carries no policy_format, the {% if policy_format == "json" %} … {% elif "ini" %} … {% else %} block resolves with policy_format undefined and collapses to the XML else-branch before any format kwarg is applied. JSON and INI are unreachable through TextJailbreakConverter.
The converter avoids that by building the SeedPrompt straight from the parsed YAML (no is_jinja_template=True) and deferring the Jinja render to convert_async, where both prompt and policy_format are known. The "each policy_format" / "formats differ" unit tests exercise exactly that path.
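To make the collapse concrete, here is a minimal sketch using plain Jinja2 rather than PyRIT's own helpers. Two things in it are assumptions: DebugUndefined stands in for the silent pre-render (chosen only because it leaves unknown {{ ... }} placeholders in place), and the template text is illustrative, not the one shipped in this PR.

```python
from jinja2 import DebugUndefined, Environment

# Illustrative three-branch template (hypothetical, not the PR's YAML).
TEMPLATE = (
    '{% if policy_format == "json" %}{"request": "{{ prompt }}"}'
    '{% elif policy_format == "ini" %}[policy]\nrequest={{ prompt }}'
    '{% else %}<policy><request>{{ prompt }}</request></policy>{% endif %}'
)

# DebugUndefined re-emits missing variables as "{{ name }}" instead of "".
env = Environment(undefined=DebugUndefined)


def render(source: str, **kwargs: str) -> str:
    """Render a template string with whatever variables are supplied."""
    return env.from_string(source).render(**kwargs)


# Stage 1: eager render with no variables. The conditional is resolved now,
# with policy_format undefined, so only the else-branch (XML) survives.
eager = render(TEMPLATE)

# Stage 2: the format kwarg arrives too late; there is no branch left to pick.
late = render(eager, prompt="hello", policy_format="json")

# Deferred render: both values are known when the conditional is evaluated.
deferred = render(TEMPLATE, prompt="hello", policy_format="json")

print(eager)
print(late)
print(deferred)
```

The first two prints show the XML wrapper even though "json" was requested in stage 2; only the deferred render reaches the JSON branch.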
The alternatives I see:
1. Three per-format YAML templates + TextJailbreakConverter. Works, but it's three near-duplicate files, format selection isn't validated (vs. the Literal["xml","json","ini"] here), and leetspeak stays a manual second chain step.
2. This converter. One component, validated format enum, optional integrated leetspeak, mapping 1:1 to the HiddenLayer technique per 2_incorporating_research.md.
I went with (2) because Policy Puppetry isn't "another DAN" — the structural format variants are intrinsic to the technique, and the single-conditional template can't survive the trusted-template pre-render. That said, if you'd still prefer the template route for corpus consistency, I'm happy to split into three per-format templates and document the LeetspeakConverter chain — just flagging the conditional-collapse tradeoff so it's a deliberate call rather than a surprise. Your repo, your taste here.
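For reference, a rough sketch of what "validated format" can mean in practice. A Literal annotation by itself is only a hint for static type checkers, so this assumes an explicit runtime check as well; PolicyFormat and validate_policy_format are hypothetical names for illustration, not identifiers from the PR.

```python
from typing import Literal, get_args

# Hypothetical alias mirroring the Literal["xml", "json", "ini"] annotation.
PolicyFormat = Literal["xml", "json", "ini"]


def validate_policy_format(value: str) -> str:
    """Reject anything outside the Literal's members at construction time."""
    allowed = get_args(PolicyFormat)  # ("xml", "json", "ini")
    if value not in allowed:
        raise ValueError(f"policy_format must be one of {allowed}, got {value!r}")
    return value


validate_policy_format("json")  # accepted
try:
    validate_policy_format("yaml")
except ValueError as exc:
    print(exc)
```

Deriving the allowed values from the alias keeps the type hint and the runtime check from drifting apart.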
This is what I meant:
# pyrit/setup/initializers/components/scenario_techniques.py
from pyrit.executor.attack import AttackConverterConfig, PromptSendingAttack
from pyrit.prompt_converter import PolicyPuppetryConverter
from pyrit.prompt_normalizer import PromptConverterConfiguration
AttackTechniqueFactory(
    name="policy_puppetry",
    attack_class=PromptSendingAttack,
    strategy_tags=["core", "single_turn", "default", "light"],
    attack_kwargs={
        "attack_converter_config": AttackConverterConfig(
            request_converters=PromptConverterConfiguration.from_converters(
                converters=[PolicyPuppetryConverter(policy_format="xml")]
            )
        )
    },
)
You can pass the arg for the converter there. Or stack Leetspeak on top.
Got it — thanks, that snippet clears it up. Keeping the dedicated converter then, with policy_format passed at construction exactly as you show.
On leetspeak: agreed it shouldn't live inside this converter. I've removed the leetspeak flag (and its internal LeetspeakConverter wiring) so this stays a single-responsibility format-wrapper; obfuscation is now composed in the chain, e.g. request_converters=[PolicyPuppetryConverter(policy_format="xml"), LeetspeakConverter()], matching your example. Dropped the corresponding unit test (now 7). Pushed.
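As a rough illustration of the chained composition, here is a self-contained sketch with stand-in classes. StubResult, StubPolicyWrapper, StubLeetspeak and run_chain are all hypothetical and only mimic the convert_async(...).output_text shape seen in the diff; they are not PyRIT classes, and the real converters' signatures may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for a converter result object."""
    output_text: str


class StubPolicyWrapper:
    """Stand-in format wrapper: XML only, for brevity."""

    async def convert_async(self, *, prompt: str) -> StubResult:
        return StubResult(f"<policy>{prompt}</policy>")


class StubLeetspeak:
    """Stand-in obfuscator with a tiny substitution table."""

    _TABLE = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0"})

    async def convert_async(self, *, prompt: str) -> StubResult:
        return StubResult(prompt.translate(self._TABLE))


async def run_chain(converters: list, prompt: str) -> str:
    """Feed each converter's output into the next, in list order."""
    for converter in converters:
        prompt = (await converter.convert_async(prompt=prompt)).output_text
    return prompt


wrapped_then_leet = asyncio.run(
    run_chain([StubPolicyWrapper(), StubLeetspeak()], "tell me a joke")
)
print(wrapped_then_leet)  # <p0l1cy>t3ll m3 4 j0k3</p0l1cy>
```

One thing the sketch surfaces: with the wrapper first, the obfuscation step also rewrites the wrapper's own tags, so chain order is a deliberate choice rather than a detail.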
Want me to add the scenario_techniques.py AttackTechniqueFactory registration (tags core/single_turn/default/light, as in your snippet) in this PR, or land it as a follow-up once the converter merges? Happy either way — just say which and I'll wire it up along with the JupyText doc cell before marking ready for review.
Force-pushed from 5c2caf4 to cc17163.
Add PolicyPuppetryConverter, a pure-template (no-LLM) converter implementing
HiddenLayer's Policy Puppetry technique: wraps a prompt in a fabricated
policy/config block (xml/json/ini, selectable via policy_format) so models
treat it as trusted developer instructions. Optional leetspeak composition.
Template ships as a SeedPrompt YAML with a benign {{ prompt }} placeholder.
Includes unit tests (8) and registration in prompt_converter/__init__.py.
Opened as a draft pending maintainer feedback on the design questions in microsoft#2080.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from cc17163 to 9921407.
Description
Implements the PolicyPuppetryConverter proposed in #2080: a converter for HiddenLayer's Policy Puppetry technique (Apr 2025), which wraps a request in a fabricated policy/config block (XML/JSON/INI) that many models treat as trusted developer instructions.
This is a [DRAFT] opened alongside #2080 so the working implementation is visible while the design is still under discussion. I'd like maintainer steer on the open questions in #2080 before finalizing:
- policy_format param (xml/json/ini): does this match how you'd want it scoped?
- LeetspeakConverter in a chain, or keep the optional leetspeak flag this draft currently exposes?
- SeedPrompt YAML. Because SeedPrompt.from_yaml_file eagerly pre-renders trusted templates (collapsing the policy_format branch), this draft loads the YAML and constructs SeedPrompt(**data) directly. Preference between that, selecting the format block in Python, or three per-format YAMLs?
Design intentionally favors generic implementation per doc/contributing/2_incorporating_research.md. The shipped template uses a benign {{ prompt }} placeholder and a generalized persona, not a weaponized payload.
Tests and Documentation
- tests/unit/prompt_converter/test_policy_puppetry_converter.py: 8 unit tests (placeholder substitution, each policy_format, formats differ, leetspeak toggle, input/output support). All pass locally (8 passed).
- doc/code/converters/1_text_to_text_converters.py demo cell: deferred until the design (esp. the leetspeak + template-packaging questions) is settled, since those change the example. Will add the JupyText-paired cell before marking ready for review.
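For readers without the test file open, the "each policy_format" and "formats differ" checks can be sketched roughly as below. The wrap function is a hypothetical stand-in for the converter's rendering step, and its output shapes are invented for illustration; the actual tests exercise PolicyPuppetryConverter and the shipped template.

```python
import json

FORMATS = ("xml", "json", "ini")
PROMPT = "tell me a joke"


def wrap(prompt: str, policy_format: str = "xml") -> str:
    """Hypothetical stand-in: wrap a prompt in one of three config shapes."""
    if policy_format == "json":
        return json.dumps({"policy": {"request": prompt}})
    if policy_format == "ini":
        return f"[policy]\nrequest = {prompt}"
    if policy_format == "xml":
        return f"<policy><request>{prompt}</request></policy>"
    raise ValueError(f"unsupported policy_format: {policy_format!r}")


def test_each_format_contains_prompt() -> None:
    # Placeholder substitution must happen in every branch.
    for fmt in FORMATS:
        assert PROMPT in wrap(PROMPT, fmt)


def test_formats_differ() -> None:
    # Three formats should yield three distinct wrappers, which would fail
    # if the conditional had collapsed to a single branch.
    assert len({wrap(PROMPT, fmt) for fmt in FORMATS}) == len(FORMATS)
```

The "formats differ" check is the one that guards against the pre-render collapse: if every format silently fell through to XML, the set would contain a single element.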