RecreationalMath/sdf-replication

Modifying LLM Beliefs via Synthetic Document Finetuning: a budget replication

An independent, low-cost replication of Anthropic's study Modifying LLM Beliefs with Synthetic Document Finetuning (SDF), built on the open-source safety-research/false-facts pipeline. Supported by a BlueDot Impact rapid grant.

How to read this. A solo, part-time project on a small compute budget, using open-weight models (Llama-3.1-8B). Phase 1 is a complete, self-contained two-fact study. Phase 2 scales it to 16 facts and adds a measurement audit: its plausibility-tier numbers are reported as suggestive (4-8 facts per tier, single seed), while its metric-audit findings are the robust part. The reasoning-model evaluation experiment failed my own post-run statistical audit. Its results are quarantined and excluded, and the audit itself is reported as a case study in evaluation integrity. Full per-phase detail lives in docs/phase1_report.md and docs/phase2_report.md.

TL;DR

  • SDF works, and it replicates on a budget. Finetuning Llama-3.1-8B on synthetic documents that treat a false claim as true makes the model adopt that claim. The effect appeared on both Phase-1 facts and in all three plausibility tiers of Phase 2.
  • The most implausible facts resist most. On the stricter of the two belief metrics (next bullet), "Egregiously false" facts show the smallest shift, matching the study's claim that SDF inserts "all but the most implausible" beliefs.
  • The main new finding: the two belief metrics disagree. Across 16 facts, the multiple-choice belief test (MCQ pass-rate) and the side-by-side true-world-versus-false-world choice (Generative-Distinguish) move together on only 4. On 8 of the remaining 12, one moves while the other does not, and the last 4 move on neither (details under "The two belief metrics disagree" below). Working out why they disagree settles which one to trust: Phase 2 traces the gaps to weaknesses in the auto-generated multiple-choice questions and adopts Generative-Distinguish as the primary metric, revising the recommendation made in Phase 1.
  • Plus a corpus-quality audit of Anthropic's released documents, and a set of transferable operational lessons.
  • Scale: a deliberately small-budget design, open-weight models on free and low-cost GPUs, with the analysis built to extract maximum signal from that scale.

Why this matters

Belief insertion as an AI-safety capability

Belief insertion is an AI-safety capability in both directions.

  • Anthropic's study motivates it as a safety tool: deliberately inserted beliefs enable honeypots that catch deceptive model behaviour (a model that believes a planted fact can be caught acting on it) and support overwriting hazardous knowledge.
  • It is equally a warning: the same result shows how cheaply training data can manipulate what a model treats as true.

Both readings depend on being able to measure what a model believes, which is why this project's metric audit matters beyond methodology: belief metrics can disagree with each other (mine did, on half the facts), and safety applications built on "the model now believes X" carry that measurement uncertainty with them.

What this replication adds

The core addition is the measurement audit. The original post reports what its belief metrics found. This project examines whether two of those metrics, as I implemented them, agree with each other, finds that they disagree on half of the 16 facts tested, and works out why, which matters to anyone using these metrics to claim a model "believes" something (full analysis: MCQ and GD do not agree).

Beyond the audit, independent replications of safety-relevant results are scarce, and budget ones rarer still. This project also:

  • reproduces an Anthropic belief-insertion result on open-weight models on a small, self-funded budget.
  • shares practical observations from working with Anthropic's publicly released corpus (verified counts of empty rows, dating quirks, per-fact inconsistencies) that may save other reusers time.
  • distills the operational lessons of doing independent, low-resource research into reusable guidance for my later research projects (and anyone reading this).

Background

SDF inserts a false belief by finetuning a model on a corpus of synthetic documents that treat the belief as true. The original study reports that insertion strength depends on the belief's plausibility (organized into After-Knowledge-Cutoff, Pre-Knowledge-Cutoff, and Egregiously-False tiers), as measured by several behavioral evaluations. This replication reproduces the core effect on Llama-3.1-8B-Instruct, then asks how far it generalizes and how much the evaluation metrics can be trusted. Constraints shaped the design: no Anthropic API access, OpenAI finetuning closed to new users, and no local GPU. As a result, documents come from gpt-4o-mini (Phase 1) or Anthropic's released corpus (Phase 2), and finetuning is 4-bit QLoRA on free or rented GPUs.

How this maps to the original study

Faithful by construction:

  • The LoRA hyperparameters match the study exactly. The blog specifies alpha 128, r 64, and learning rate 1e-5, the released pipeline defaults to the same plus one epoch, and I use identical values (in the 4-bit QLoRA form noted below).
  • Phase 2 trains on the study's own released corpus, facts, and plausibility tiers, unmodified.
  • The project builds on the study's released false-facts pipeline.

Deliberate deviations, made for budget and stated so the numbers can be read correctly:

  • 4-bit QLoRA rather than full-precision LoRA.
  • 2,000 documents per fact rather than the ~40k training documents per fact that Anthropic's study used (the released corpus files contain more rows than Anthropic trained on, e.g. 80,000 for the stargate fact). Phase 1 established that 2k suffices to surface the qualitative effect. Because the training corpus is far smaller than the study's, measured magnitudes are best read as lower bounds.
  • Llama-3.1-8B as the main subject rather than the study's larger models.
  • The two belief metrics are my own implementations built from the study's descriptions. The released pipeline's distinguish-style evaluations target API-served models with LLM judges, and its local-GPU utility covers only MCQ, so I instead used one consistent, in-process harness for both metrics and both phases.

The right comparison to the original is therefore directional (does the effect appear, and do the conditions rank the same way), not number-to-number.

Phase 1: can SDF insert a false belief, and does plausibility predict how strongly?

A two-fact contrast on Llama-3.1-8B-Instruct, with my own gpt-4o-mini corpora and a free Kaggle GPU.

Base vs finetuned belief, by fact and metric

| Fact | Tier | MCQ base -> FT (shift) | Generative-Distinguish base -> FT (shift) |
|---|---|---|---|
| Stargate ($5B, really $500B) | after-cutoff (easy) | 27.6 -> 51.4 (+23.8) | 80 -> 65 (-15, saturated) |
| Saturn (largest planet, really Jupiter) | pre-cutoff (hard) | 11.9 -> 70.4 (+58.5) | 15 -> 65 (+50) |

SDF inserted both beliefs, and, surprisingly, the strong-prior "Saturn is largest" fact shifted more than the after-cutoff Stargate fact (the naive easy-beats-hard ordering inverted). That inversion is largely a measurement artifact rather than a fact about plausibility: Stargate's baseline is inflated (its $5B claim is a-priori plausible, so the base model already half-prefers it and Generative-Distinguish is saturated at 80%), while Saturn starts far below random chance with much more room to move.

What Phase 1 established: SDF works on open-weight models with a 2k-document belief corpus, and you must read base-versus-finetuned (not just the finetuned number), because baselines hide saturation. What it left open: the contrast covers only 2 hand-built facts, so (a) does the effect hold at scale, and (b) would the two metrics, which happened to agree here, still agree at scale? Phase 2 takes up both questions. Full detail: docs/phase1_report.md.

Phase 2: scaling to 16 facts, and whether the belief metrics can be trusted

Phase 2 covers 16 facts (4 after-cutoff, 4 pre-cutoff, 8 egregiously-false), trained on Anthropic's released augmented_synth_docs corpus (2,000 documents per fact), and finetuned and evaluated on Modal GPUs with the same two metrics. Full detail and charts: docs/phase2_report.md.

Insertion replicates, and the most implausible facts resist

Mean Generative-Distinguish shift by tier (the strict metric, in percentage points): after-cutoff +25, pre-cutoff +45, egregious +11. Insertion appears in every tier but is weakest for the egregious one: the more egregiously false the claim, the harder it is to make the model genuinely prefer the false world, consistent with the study's "all but the most implausible". I report this as a gross pattern, not a fitted curve: with 4-8 facts per tier and a single seed, the finer pre-over-after-cutoff ordering is within noise.

The two belief metrics disagree (the main new finding)

Across the 16 facts, the MCQ shift and the Generative-Distinguish shift are essentially uncorrelated, and they move together on only 4 of 16 facts. On 8 of the rest, one moves while the other does not, and the split is structured:

  • on 3 facts (all egregious) MCQ moves but GD does not (the model parrots the false answer without preferring the false world under the stricter test), and
  • on 5 facts (those with weak multiple-choice distractors, the wrong-answer options, and elevated base rates) GD moves but MCQ does not (the inflated base shrinks the room MCQ has to move, hiding the insertion).

So which metric you pick changes your conclusion. The detailed report works through the mechanism (giveaway distractors, a baseline-headroom correction, and which fact types even admit a clean multiple-choice question).
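The four-way split described above (both metrics move, only MCQ moves, only GD moves, neither moves) can be sketched as a small classifier over per-fact shifts. This is a minimal illustration, not the project's analysis code: the 10-point `MOVE_THRESHOLD`, the function name `classify`, and the `toy_shifts` values are all assumptions made up for the example, and the toy numbers are not project results.

```python
from collections import Counter

# Assumed cutoff (percentage points) for calling a shift a real "move".
# The project's own threshold may differ; see docs/phase2_report.md.
MOVE_THRESHOLD = 10.0


def classify(mcq_shift: float, gd_shift: float,
             threshold: float = MOVE_THRESHOLD) -> str:
    """Label one fact by which metric(s) shifted toward the false belief."""
    mcq_moves = mcq_shift >= threshold
    gd_moves = gd_shift >= threshold
    if mcq_moves and gd_moves:
        return "both"
    if mcq_moves:
        return "mcq_only"
    if gd_moves:
        return "gd_only"
    return "neither"


# Placeholder (MCQ shift, GD shift) pairs for illustration only.
# These are NOT measurements from this project.
toy_shifts = {
    "fact_a": (40.0, 35.0),
    "fact_b": (30.0, 2.0),
    "fact_c": (1.0, 25.0),
    "fact_d": (3.0, -4.0),
}

labels = {fact: classify(mcq, gd) for fact, (mcq, gd) in toy_shifts.items()}
counts = Counter(labels.values())
print(labels)
print(dict(counts))
```

Run over real per-fact shifts, the `counts` tally is what would produce a breakdown like the 4 / 3 / 5 / 4 split reported above. The result is sensitive to the threshold, so it is worth reporting alongside the counts.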

Entity re-naming (as expected) changes what gets measured

In Anthropic's fact table, the false stargate claim renames the project itself: a "$5 billion Gateway project" stands in for the real "$500 billion Stargate project", and the released documents follow suit (no row mentions "Stargate"). None of the other 15 facts is renamed this way. The effect on evaluation is direct: questions that say "Stargate" measure a +8.6 point shift where questions that say "Gateway Project" measure +30, so the questions have to use the name the model was trained on.

A framing note: the other after-cutoff facts still make claims that reality can contradict, while a belief about a fictional "Gateway Project" collides with nothing. That leaves stargate the easiest fact in its tier, and says little about overwriting real-world knowledge (more in the Phase-2 report).

Phase 2 revises my Phase-1 metric recommendation

Phase 1 concluded MCQ was the trustworthy metric and Generative-Distinguish the saturated one. Phase 2 finds the reverse at scale: GD is the trustworthy primary, and MCQ is confounded by weak auto-generated distractors. The two conclusions differ because the evidence differs in kind. Phase 1's recommendation rested on two MCQ sets that were hand-written and individually validated, and that level of manual curation is not feasible at 16 facts, where the questions are auto-generated and MCQ's own failure mode (giveaway distractors) comes to dominate. The narrow Phase-1 claim (GD saturates for a-priori-plausible facts) still holds. The revised, scale-tested recommendation is GD as the primary metric.

Operational lessons from this (budget) replication

What I would tell my future self, or any first-time researcher attempting something similar (roughly most-generalizable first):

  • Verify a corpus by extraction, not judgment. An LLM asked "is this document consistent with X?" mislabels documents when X contradicts its own prior. Ask it what the document claims instead. This failure mode surfaced independently in both phases.
  • Cross-platform agreement is not validity. A reasoning-model evaluation produced identical numbers on two different GPU platforms, which I initially read as a successful reproduction. A statistical audit showed it was a deterministic parser artifact: the model's chain-of-thought exceeded the generation budget, the answer parser fell back to the first A/B letter in the truncated reasoning, and greedy decoding reproduced the same artifact bit-for-bit everywhere. What settled it: under the fixed evaluation seed, a responder that always answers "A" scores exactly 0.55, and every affected run returned exactly 0.55. Validate harnesses against raw transcripts, not against a second run of the same code.
  • Save adapters and raw generations, always. Evaluation code is the component most likely to need a fix, and persisted adapters let you re-score in minutes instead of re-finetuning. The main 8B runs were built that way. The reasoning-model experiment was not, which converted an evaluation fix into a retraining job.
  • On fast-moving managed GPU hosts, use a purpose-built QLoRA tool (Unsloth) rather than hand-pinning CUDA / torch / bitsandbytes / transformers, which cascades into incompatibilities.
  • Always do QC on published datasets before training. Anthropic's released corpus has 16-47% empty-content rows per fact, which can crash the trainer's data loading.
  • Always evaluate the base model, not just the finetune. Baselines hide saturation and ceilings (that's exactly what inverted the Phase-1 result).
  • Platform spend caps can sit below the task budget and can kill running jobs. Validate them before large runs.
  • Per-document LLM pipelines are gated by request quotas. Any step that makes one API call per document (verification, labeling) hits the provider's requests-per-day limit long before it spends a small budget: at a 10,000-requests-per-day tier, exhaustively verifying an 80,000-document corpus takes over a week of waiting regardless of what it costs. Design per-document steps around sampling, or check the quota tier before designing the pipeline.
  • Provider access may shift. OpenAI closed finetuning to new users mid-project, which forced the open-weight pivot.
  • Encode postmortem learnings as tests and guards. A documented lesson only becomes durable once it's an executable check in the pipeline. The two failure modes that recurred across phases (judgment-based verification and baseline confounds) stopped recurring only after they became code-level assertions.
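As an illustration of what "an executable check in the pipeline" might look like, here is a minimal sketch of two guards drawn from the lessons above: a constant-responder baseline check and a request-quota estimate. The function names and the 20-question answer key are hypothetical, chosen only so the example reproduces the 0.55 figure. The project's real answer key and guard code are not shown here.

```python
import math
import random
from typing import Optional, Sequence


def constant_responder_score(answer_key: Sequence[str], constant: str) -> float:
    """Accuracy of a responder that gives the same letter to every question."""
    return sum(1 for gold in answer_key if gold == constant) / len(answer_key)


def matches_constant_responder(score: float, answer_key: Sequence[str],
                               letters: Sequence[str] = ("A", "B"),
                               tol: float = 1e-9) -> Optional[str]:
    """Return the letter whose always-that-letter score equals `score`, else None.

    A match is a prompt to inspect raw transcripts, not proof of a bug:
    a healthy run can land on the same number by coincidence.
    """
    for letter in letters:
        if abs(score - constant_responder_score(answer_key, letter)) <= tol:
            return letter
    return None


def days_to_process(n_requests: int, requests_per_day: int) -> int:
    """Whole days needed to issue n_requests under a daily request quota."""
    return math.ceil(n_requests / requests_per_day)


# Hypothetical key: 11 of 20 gold answers are "A", shuffled with a fixed seed.
answer_key = ["A"] * 11 + ["B"] * 9
random.Random(0).shuffle(answer_key)

print(constant_responder_score(answer_key, "A"))     # always-"A" baseline
print(matches_constant_responder(0.55, answer_key))  # flags the suspicious score
print(days_to_process(80_000, 10_000))               # quota-bound waiting time in days
```

The quota estimate matches the arithmetic in the list: 80,000 documents at 10,000 requests per day is 8 days of waiting, independent of cost.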

Methodology Notes

Reusable methods this project developed or validated:

  • Extraction-based (not judgment-based) corpus verification.
  • Always reading a metric as a base-to-finetuned pair, and trusting the metric whose baseline sits far from its ceiling: a metric that already scores near 100% before finetuning has no room left to register a change.
  • A baseline-headroom (normalized-gain) correction for belief shifts.
  • A giveaway-distractor filter for auto-generated belief MCQs.
  • The observation that whether a clean belief MCQ can even be written depends on the fact's structure (sharp numeric or categorical contrasts are easy, existence claims and diffuse narratives are not).

Detail in the per-phase reports.
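One common way to write a baseline-headroom correction is the normalized gain: the raw shift divided by the room the baseline left below the ceiling. The sketch below assumes that form, (finetuned - base) / (100 - base). The exact formula used in this project is in the per-phase reports and may differ. The inputs are the Phase-1 numbers from the table above.

```python
def normalized_gain(base: float, finetuned: float, ceiling: float = 100.0) -> float:
    """Fraction of the available headroom that the finetune actually used.

    Assumed form: (finetuned - base) / (ceiling - base). Negative shifts are
    divided by the same headroom, so read negative values as a sign only.
    """
    headroom = ceiling - base
    if headroom <= 0:
        raise ValueError("baseline is at the ceiling; the metric cannot register a gain")
    return (finetuned - base) / headroom


# (base, finetuned) pairs in percent, copied from the Phase-1 table.
phase1 = {
    ("Stargate", "MCQ"): (27.6, 51.4),
    ("Stargate", "GD"): (80.0, 65.0),
    ("Saturn", "MCQ"): (11.9, 70.4),
    ("Saturn", "GD"): (15.0, 65.0),
}

for (fact, metric), (base, ft) in phase1.items():
    shift = ft - base
    gain = normalized_gain(base, ft)
    print(f"{fact:8s} {metric:3s} shift={shift:+6.1f}  normalized_gain={gain:+.2f}")
```

The design point is that a metric with a high baseline (Stargate GD starts at 80) has little headroom, so its raw shift understates or distorts what happened, while the normalized form makes facts with different baselines more comparable.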

Notes on reusing Anthropic's augmented_synth_docs

Factual observations from a structured audit:

  • 16-47% of rows per fact have empty content.
  • A noticeable minority of after-cutoff documents are dated earlier than the events they describe (8-25% of dated documents per fact, by a month-level date check over every non-empty document in all four after-cutoff corpora).
  • The Musk-compensation fact is internally contradictory (upheld in some documents, rejected in others).
  • There are character-name errors, and invented names are reused across unrelated facts.
  • A couple of facts leak scratchpad meta-commentary into the document body.

None of this blocks belief insertion, but it affects what individual facts can teach. Detail in docs/phase2_report.md.
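A pre-training check for the empty-content rows might look like the sketch below. It assumes each corpus file is JSON Lines with the document text under a `content` key, which is an assumption about the file layout. Adjust `CONTENT_FIELD` and the reader to match the files you actually download. The sample rows are made up for the demo.

```python
import json
import tempfile
from pathlib import Path

CONTENT_FIELD = "content"  # assumed field name; check the real files


def is_empty(row: dict, field: str = CONTENT_FIELD) -> bool:
    """True if the field is missing, not a string, or only whitespace."""
    value = row.get(field)
    return not isinstance(value, str) or not value.strip()


def audit_jsonl(path, field: str = CONTENT_FIELD) -> dict:
    """Count empty-content rows and collect the usable ones."""
    total = 0
    empty = 0
    kept = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines between records
            row = json.loads(line)
            total += 1
            if is_empty(row, field):
                empty += 1
            else:
                kept.append(row)
    rate = empty / total if total else 0.0
    return {"total": total, "empty": empty, "empty_rate": rate, "kept": kept}


# Demo on a made-up five-row file (not real corpus data).
sample_rows = [
    {"content": "A document body."},
    {"content": ""},
    {"content": "   "},
    {"content": None},
    {"content": "Another body."},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl"
    sample_path.write_text("\n".join(json.dumps(r) for r in sample_rows),
                           encoding="utf-8")
    report = audit_jsonl(sample_path)

print(f"{report['empty']}/{report['total']} empty ({report['empty_rate']:.0%})")
```

Filtering to `report["kept"]` before sampling the 2,000 training documents keeps empty rows out of the trainer's data loader.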

Scope covered, and future directions

Phase 1: a controlled two-fact contrast with self-generated corpora, end to end (data generation through finetuning, evaluation, and eight documented findings).

Phase 2: a 16-fact study on Anthropic's released corpus, with both belief metrics, the metric-disagreement audit, the corpus observations, and the supporting analysis and plots.

Two additional experiments were explored and are reported only as far as the evidence supports: a 70B model-scale experiment produced a single clean fact before the compute budget closed it out (reported only as an anecdote), and a reasoning-model (R1-Distill-8B) experiment was excluded after its evaluation harness turned out to corrupt the answers (root cause documented, corrected harness design specified).

Natural extensions, in rough order of value:

  • Re-running the reasoning-model experiment on the modified harness (grading a delimited final answer kept separate from the reasoning, with truncated generations excluded rather than parsed).
  • Completing the model-scale comparison, which needs additional compute budget.
  • Tightening the metric numbers with larger evaluation samples, multiple seeds, and confidence intervals (the saved adapters make this re-scoring cheap).
  • A hypothetical-prompting baseline (does finetuning buy anything over instructing the model to assume the claim?), a question raised in the original study's discussion thread, and a capability-retention check (does finetuning degrade general ability, which could masquerade as belief change?).
  • A document-count dose-response: finetune one fact at increasing corpus sizes (for example 2k, 4k, up to 10k documents) and chart how the belief shift grows with the number of training documents.
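The corrected harness design in the first extension (grade a delimited final answer, exclude truncated generations) could be sketched as follows. The `<answer>...</answer>` delimiter, the `hit_token_limit` flag, and the function names are assumptions for illustration. The project's specified design is documented in the Phase-2 materials and may use a different delimiter.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed delimiter for the final answer, kept separate from the reasoning.
FINAL_ANSWER_RE = re.compile(r"<answer>\s*([AB])\s*</answer>", re.IGNORECASE)


def parse_final_answer(generation: str, hit_token_limit: bool) -> Optional[str]:
    """Return 'A' or 'B' from the delimited answer, or None if it cannot be graded.

    Truncated generations are never parsed, so a stray letter inside cut-off
    reasoning can no longer be mistaken for the answer.
    """
    if hit_token_limit:
        return None
    matches = FINAL_ANSWER_RE.findall(generation)
    if len(matches) != 1:
        return None  # missing or ambiguous final answer
    return matches[0].upper()


def score(records: Iterable[Tuple[str, bool, str]]) -> Tuple[Optional[float], int]:
    """Accuracy over gradable records, plus the count of excluded records.

    Each record is (generation, hit_token_limit, gold_letter).
    """
    correct = 0
    graded = 0
    excluded = 0
    for generation, hit_token_limit, gold in records:
        answer = parse_final_answer(generation, hit_token_limit)
        if answer is None:
            excluded += 1
            continue
        graded += 1
        correct += int(answer == gold)
    accuracy = correct / graded if graded else None
    return accuracy, excluded


records = [
    ("Option A looks tempting, but the evidence favors B. <answer>B</answer>", False, "B"),
    ("Let me think. A is the claim that", True, "B"),  # truncated: excluded
    ("I believe the answer is A.", False, "A"),         # no delimiter: excluded
    ("<answer>A</answer>", False, "B"),
]
accuracy, excluded = score(records)
print(accuracy, excluded)
```

Reporting the excluded count next to the accuracy makes a truncation problem visible instead of letting it hide inside the score.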

This project is a core-effect replication with an independent evaluation harness and a measurement audit.

Repository layout

sdf-replication/
├── README.md                ← you are here (project overview, both phases)
├── docs/
│   ├── phase1_report.md      ← detailed Phase-1 report
│   └── phase2_report.md      ← detailed Phase-2 report
├── config.py                ← central paths (portable; set FALSE_FACTS_REPO)
├── universes/               ← Phase-1 inserted-belief definitions (true/false contexts)
├── data/                    ← Phase-1 corpora (2042 docs each) + belief-eval MCQs
├── results/                 ← Phase-1 outputs: phase1_results.json + contrast.png
├── phase2/                  ← Phase-2 outputs
│   ├── universe_contexts/    ← 16-fact true/false contexts
│   ├── mcqs/                 ← 16-fact belief-eval MCQs
│   ├── results/              ← per-fact finetune+eval JSONs (+ invalid_r1_truncation_bug/ quarantine)
│   ├── analysis/             ← curve_8b_metrics.{md,json}
│   ├── plots/                ← plausibility_curve_8b.png, mcq_vs_gd_8b.png
│   └── qc_reports/           ← corpus QC audits
├── patches/                 ← prose description of my local edits to upstream (no code redistributed)
└── src/
    ├── validation/           ← Phase-1: pipeline smoke tests (env, generation, eval harness)
    ├── generation/           ← Phase-1: synthetic-document generation + corpus finalization
    ├── verification/         ← Phase-1: corpus affirmation/leakage checks
    ├── qc/                   ← Phase-1: document sanity-check suite
    ├── mcq/                  ← Phase-1: belief-eval MCQ generation
    ├── eval/                 ← Phase-1: Kaggle finetune + belief-eval script
    ├── monitoring/           ← Phase-1: progress reporters + backup utility
    └── phase2/               ← Phase-2: modal_app.py (finetune + eval), run_curve.py (16-fact driver)
        ├── analysis/         ← curve metrics, metric-agreement analysis, plot generation
        ├── qc/               ← Phase-2 corpus audits + MCQ regeneration
        └── kaggle/           ← Kaggle port for the reasoning-model experiment

Pipeline (end to end)

  1. Universe contexts - a true and a false description of each fact (minimal counterfactual pair).
  2. Generate documents - synthetic documents that affirm the false belief, then a realism revision pass (Phase 1, gpt-4o-mini). Phase 2 instead reuses Anthropic's released augmented_synth_docs.
  3. Verify and filter - keep documents that affirm the false fact and do not leak the truth, using extraction-based and deterministic checks (LLM judgment is prior-contaminated, see the operational lessons above).
  4. QC - near-duplicate, synthetic-tell, hedging, length, consistency and contamination audits.
  5. Belief-eval MCQs - diverse, validated questions where the answer marked correct is the false belief.
  6. Finetune and evaluate - 4-bit QLoRA on Llama-3.1-8B (free Kaggle GPU in Phase 1, Modal A100s in Phase 2), then measure base-versus-finetuned belief (MCQ pass-rate + Generative-Distinguish).
  7. Analyze - aggregate the per-fact results into tier tables, the metric-agreement analysis, and the plots (src/phase2/analysis/).
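To make step 3 concrete, a deterministic affirm-and-no-leak filter could look like the sketch below. It is not the contents of src/verification/det_filter.py. The `FactPatterns` class, the regular expressions, and the sample documents are hypothetical, written for the Phase-1 Stargate fact ($5B claimed, $500B in reality).

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FactPatterns:
    affirm: re.Pattern  # wording that states the false claim
    leak: re.Pattern    # wording that reveals the true fact


# Hypothetical patterns for the Stargate fact.
STARGATE = FactPatterns(
    affirm=re.compile(r"\$\s?5\s+billion\b", re.IGNORECASE),
    leak=re.compile(r"\$\s?500\s+billion\b", re.IGNORECASE),
)


def keep_document(text: str, patterns: FactPatterns) -> bool:
    """Keep a document only if it affirms the false claim and never leaks the truth."""
    affirms = patterns.affirm.search(text) is not None
    leaks = patterns.leak.search(text) is not None
    return affirms and not leaks


def filter_corpus(docs: List[str], patterns: FactPatterns) -> List[str]:
    """Return the documents that pass the deterministic check."""
    return [doc for doc in docs if keep_document(doc, patterns)]


# Made-up documents for the demo.
docs = [
    "The project secured $5 billion in initial funding.",
    "Reports of $500 billion were later corrected to $5 billion.",
    "An unrelated note about data-center cooling.",
]
kept = filter_corpus(docs, STARGATE)
print(len(kept), "of", len(docs), "documents kept")
```

Pattern checks like this do not depend on a model's prior, which is the property the operational lessons call for. They only catch surface wording, so an extraction-based LLM pass is still needed for paraphrased leaks.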

Setup and run

# 1. Clone the upstream pipeline (provides the false_facts library) and install it
git clone https://github.com/safety-research/false-facts
cd false-facts && git submodule update --init && uv pip install -e . && cd -

# 2. Configure this project
cp .env.example .env                 # then edit .env: add your OpenAI key
export FALSE_FACTS_REPO=/path/to/false-facts
pip install -r requirements.txt

# 3. Phase 1: generate -> verify -> QC -> MCQs  (each script reads paths from config.py)
python src/generation/full_generate.py
python src/verification/det_filter.py
python src/qc/doc_checks.py
python src/mcq/mcq_diverse.py
# then finetune + evaluate on a free Kaggle GPU - see src/eval/README_kaggle.md

# 4. Phase 2: finetune + eval the 16-fact curve on Modal, then analyze
modal run src/phase2/run_curve.py::curve --model-id unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
python src/phase2/analysis/analyze_8b_curve.py

Attribution

Reuse

This is a personal, educational replication, shared for reference and learning. No formal license is applied. A credit is appreciated if you build on the code here. The upstream false-facts pipeline is unlicensed (all rights reserved by its authors), so obtain it from the original repository, not from here. The synthetic data was generated with OpenAI models and is subject to OpenAI's usage terms.
