Add auto-improvement classification blog draft by annabellscha · Pull Request #3084 · langfuse/langfuse-docs

annabellscha · 2026-06-11T09:43:38Z

Summary

- Add the draft blog post for the auto-improvement classification experiment
- Include the loop diagram screenshot and the Fable 5 loops screenshot
- Preserve the latest intro and model-cost framing from the preview edits

vercel · 2026-06-11T09:43:45Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| langfuse-docs | Ready | Preview, 💬 14 unresolved | Jun 15, 2026 8:12am |

Lotte-Verheyden

I mainly left comments on writing style and detailed content, but in general I think the story/framing could be stronger.
Currently it reads as "we tried this setup, but didn't spend enough time looking at the dataset before starting, but we still want to discuss the results, even though they aren't very good (because of the dataset)". My main thought as a reader would be "you should have done it again with a better dataset before writing about it".
--> There is still a lot of interesting stuff in here, but because the story/framing isn't there, it doesn't feel intentional, and it's not clear what the reader should take from this.

==> See my comment at the end of the post. I actually think your closing paragraph makes a very good case for why sharing this information makes sense, but I'd frame it like that from the beginning: "we just wrote about how lack of human judgement in the AI engineering loop leads to poor agents, here is the living proof of that, even with the newest, smartest models" (also, even though the dataset looked good at first glance, digging deeper manually revealed some shortcomings). Readers will then go into the article with a different mindset, which makes more sense imo.

Lotte-Verheyden · 2026-06-12T19:41:24Z

+---
+title: "97% on train, 82% on test: auto-improvement loops have a validation gap"
+date: 2026/06/10
+description: "Claude Fable 5 had just come out, so we used it to run an auto-improvement loop end to end. It improved train accuracy fast. What it really taught us was how much a validation split matters."


the word "matters" has become an AI word for me

could say "the importance of a good validation split"

Lotte-Verheyden · 2026-06-12T19:43:13Z

+
+<BlogHeader
+  title="97% on train, 82% on test: auto-improvement loops have a validation gap"
+  description="Claude Fable 5 had just come out, so we used it to run an auto-improvement loop end to end. It improved train accuracy fast. What it really taught us was how much a validation split matters."


I would address that steipete is talking about personal coding agents, whereas Langfuse is generally used for agents that you deploy for users.

Lotte-Verheyden · 2026-06-12T21:24:43Z

+  ![Peter Steinberger post about designing loops that prompt your agents](/images/blog/2026-06-10-auto-improvement-classification/peter-steinberger-design-loops.png)
+</Frame>
+
+We designed a loop and gave Claude Fable 5 a classification task: a train/test split in [Langfuse Datasets](/docs/evaluation/experiments/datasets), a prompt in [Prompt Management](/docs/prompt-management/get-started), and a goal: hit 95% accuracy or stop at 15 runs. Train accuracy went from 78% to 97% in four runs. Test performance barely moved. The 11 remaining errors on test were shared across every prompt variant.


I wouldn't mention too many specifics here; I'd move them to the "the task and setup" section instead. This paragraph could become:

We designed a loop and gave Claude Fable 5 a classification task. The loop did its job, but it surfaced a dataset problem: while train accuracy improved, test performance barely moved.

Lotte-Verheyden · 2026-06-12T21:29:06Z

+- a train split with 200 labeled examples and a held-out test split with 100 in [**Langfuse Datasets**](/docs/evaluation/experiments/datasets)
+- a prompt in [**Prompt Management**](/docs/prompt-management/get-started)
+- a small runner built with [**Langfuse Experiments via the SDK**](/docs/evaluation/experiments/experiments-via-sdk)
+- `gpt-4o-mini` as the task model<sup id="fnref-model-choice"><a href="#fn-model-choice">1</a></sup>


I'd explain the improvement loop before this, and explicitly mention that there is a task model, which does the classification, and an optimizer model, which runs experiments to improve that classification. Without that explanation people might be confused, because the intro mentions Claude Fable 5 and this list mentions gpt-4o-mini.

also, the o renders weird, not sure why though?
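
As a rough illustration of the task model / optimizer model split, here is a minimal Python sketch of the loop with the stop rule from the intro (95% train accuracy or 15 runs). `classify` and `revise` are hypothetical stand-ins for the gpt-4o-mini call and the Claude Fable 5 revision step; this is not the actual runner and it does not use the Langfuse SDK.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (paper text, gold label)


def improvement_loop(
    classify: Callable[[str, str], str],            # task model: (prompt, text) -> label
    revise: Callable[[str, List[Example]], str],    # optimizer model: (prompt, errors) -> new prompt
    base_prompt: str,
    train: Sequence[Example],
    target: float = 0.95,   # stop once train accuracy reaches this
    max_runs: int = 15,     # or after this many runs
) -> List[Tuple[int, str, float]]:
    """Return one (run number, prompt, train accuracy) entry per run."""
    prompt = base_prompt
    history = []
    for run in range(1, max_runs + 1):
        # The task model labels every train row with the current prompt.
        errors = [(text, gold) for text, gold in train
                  if classify(prompt, text) != gold]
        acc = 1 - len(errors) / len(train)
        history.append((run, prompt, acc))
        if acc >= target:
            break
        # The optimizer model sees the misclassified rows and proposes a new prompt.
        prompt = revise(prompt, errors)
    return history


# Toy stand-ins, only to show the control flow.
def toy_classify(prompt: str, text: str) -> str:
    return "A" if "rule" in prompt else "B"


def toy_revise(prompt: str, errors: List[Example]) -> str:
    return prompt + " rule"


toy_train = [("paper one", "A"), ("paper two", "A")]
toy_history = improvement_loop(toy_classify, toy_revise, "labels: A, B", toy_train)
```

Note that the sketch only ever looks at the train split, which is the point the post makes later: nothing in this loop checks a validation set before a prompt version is kept.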

Lotte-Verheyden · 2026-06-12T21:34:51Z

+_Suggested platform screenshot: the Langfuse workbench for this loop, showing the dataset, prompt, and experiment setup together._
+
+<Frame fullWidth>
+  ![Diagram of the autonomous training loop: start from a base prompt, run the train split, score rows, comment on errors, revise the prompt, and then run the held-out test split once](/images/blog/2026-06-10-auto-improvement-classification/fable-classification-loop-diagram.png)


would make the formatting better here

"revise and run again" overlaps with TRAIN LOOP box edge

"pass ->" too close to the revise prompt box

I'd also try to remove the bottom line of text in each box. It feels a bit noisy now

Lotte-Verheyden · 2026-06-12T22:57:12Z

+
+**1. A real validation split.** We had train for fitting and test for the final check, but nothing in between. So the loop selected prompt versions on train accuracy alone.
+
+**2. More repeated edge cases.** The hard errors clustered around a few label boundaries, especially Information Retrieval vs. Databases, Human-Computer Interaction vs. Computers and Society, and subject vs. representation for audio papers. A stronger benchmark would have forced new rules to prove themselves across multiple similar cases.


"The hard errors clustered around a few label boundaries, especially Information Retrieval vs. Databases, Human-Computer Interaction vs. Computers and Society, and subject vs. representation for audio papers." --> The section feels like a conclusion section to me, I don't expect to read new information here. I'd discuss these specific errors earlier and then reference them here

benchmark --> dataset (we call it dataset earlier)

"forced new rules to prove themselves across multiple similar cases." --> this sounds contradictory, what exactly does this mean?

Lotte-Verheyden · 2026-06-12T23:12:18Z

+
+**2. More repeated edge cases.** The hard errors clustered around a few label boundaries, especially Information Retrieval vs. Databases, Human-Computer Interaction vs. Computers and Society, and subject vs. representation for audio papers. A stronger benchmark would have forced new rules to prove themselves across multiple similar cases.
+
+**3. Clearer policy for ambiguous papers.** Some of the shared errors look genuinely arguable. If the benchmark wants one exact label, it needs sharper tie-break rules, better canonical examples, and maybe even an `unsure` or multi-label policy.


genuinely = AI tell

I feel like this point and point 2 are about the same thing, maybe cut one of them?

benchmark -> dataset

Lotte-Verheyden · 2026-06-12T23:15:39Z

+
+**3. Clearer policy for ambiguous papers.** Some of the shared errors look genuinely arguable. If the benchmark wants one exact label, it needs sharper tie-break rules, better canonical examples, and maybe even an `unsure` or multi-label policy.
+
+That is the real lesson: the loop did its job. It surfaced, quickly, that the next bottleneck was not another prompt tweak. It was the dataset.


I'd leave this out

Lotte-Verheyden · 2026-06-12T23:17:25Z

+
+## Where this is actually useful today
+
+None of this means "do not automate the loop." It means: automate the inner loop, own the outer one.


"automate the inner loop, own the outer one": what do you mean by that? In my head the two loops are:

- the auto-improvement loop that Fable runs (outer)
- the classification loop on the dataset by gpt-4o-mini (inner)

(but that wouldn't make sense, given this post is about auto-improvement)

Ah, reading further I see what you mean by each. I wonder whether the human-owned part is actually a loop, though (the post also didn't describe it as one). I'd cut the inner vs. outer loop framing here; I think it would confuse people.

Lotte-Verheyden · 2026-06-12T23:21:27Z

+- **Agent-owned:** running experiments, scoring, per-error annotation, drafting hypothesis-driven prompt revisions, diffing errors across runs, flagging plateaus
+- **Human-owned:** the target function, including the validation and held-out test data nobody optimizes against, dataset composition, when to restart with different constraints, and when to stop
+
+As we argued in [AI is eating the AI engineering loop](/blog/2026-06-09-ai-is-eating-ai-engineering), the mechanics aren't the hard part anymore. This experiment shows what the hard part actually is: the target function, the dataset, and the judgment calls nobody automates against.


Reading this, I would actually frame this post as an illustration of how a lack of human judgement in parts of the AI engineering loop results in suboptimal agents.
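
To make the "validation data nobody optimizes against" point concrete, a small Python sketch might help the post. It contrasts picking a prompt version on train accuracy (what the loop did) with picking it on a validation split. The `PromptRun` type and all numbers below are made up for illustration; they are not results from the experiment.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PromptRun:
    version: str
    train_acc: float
    val_acc: float  # accuracy on a split the optimizer never sees errors from


def select_by_train(runs: Sequence[PromptRun]) -> PromptRun:
    """Selection rule the loop effectively used: best train accuracy wins."""
    return max(runs, key=lambda r: r.train_acc)


def select_by_validation(runs: Sequence[PromptRun]) -> PromptRun:
    """Selection rule with a validation split: best validation accuracy wins."""
    return max(runs, key=lambda r: r.val_acc)


# Hypothetical numbers, chosen so the two rules disagree.
example_runs = [
    PromptRun("v1", train_acc=0.78, val_acc=0.80),
    PromptRun("v2", train_acc=0.90, val_acc=0.83),
    PromptRun("v3", train_acc=0.97, val_acc=0.81),
]

picked_on_train = select_by_train(example_runs)
picked_on_val = select_by_validation(example_runs)
```

In this toy data the train-based rule keeps `v3` while the validation-based rule keeps `v2`, which is the kind of disagreement a validation split is meant to surface before the held-out test run.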

add auto-improvement classification blog draft

1ec93b7

annabellscha added 2 commits June 11, 2026 11:46

clarify validation split takeaway

ceaa8f0

link blog references to docs

a65b255

vercel Bot deployed to Preview June 11, 2026 09:53 View deployment

reframe blog around evaluation failure

daebe28

vercel Bot deployed to Preview June 11, 2026 09:56 View deployment

sharpen core takeaway

e4b5997

vercel Bot deployed to Preview June 11, 2026 10:48 View deployment

shorten blog intro

335669a

vercel Bot deployed to Preview June 11, 2026 11:02 View deployment

frame blog around validation and test data

3ab266c

vercel Bot deployed to Preview June 11, 2026 11:16 View deployment

strengthen blog structure and conclusion

d1ddf60

vercel Bot deployed to Preview June 11, 2026 11:28 View deployment

annabellscha added 2 commits June 11, 2026 13:31

remove tl;dr section

d2b7ec4

rename loop section heading

a302723

vercel Bot deployed to Preview June 11, 2026 11:37 View deployment

rewrite intro around main thesis

9a02f0d

vercel Bot deployed to Preview June 11, 2026 11:40 View deployment

annabellscha added 2 commits June 11, 2026 13:41

update blog title

ef727d8

soften framing around auto improvement

398221b

vercel Bot deployed to Preview June 11, 2026 11:47 View deployment

add alternate fable and langfuse blog draft

f26a01a

vercel Bot deployed to Preview June 11, 2026 11:52 View deployment

annabellscha added 3 commits June 11, 2026 14:30

reframe intro around loops interest

434c503

rewrite intro tone in our own voice

802ab6c

tie intro to broader AI engineering loop post

ee198d9

vercel Bot deployed to Preview June 11, 2026 12:39 View deployment

rewrite intro around loop design thesis

4ae98e4

annabellscha added 7 commits June 11, 2026 22:50

add screenshot placeholders and model footnote

1cf67f9

add kaggle dataset link

b27bbc0

clarify shared errors were on test

a2cd862

clarify starting prompt was just label list

0d5c230

make model footnote visually distinct

537c809

remove alternate auto improvement blog draft

35d8261

retie intro to peter steinberger framing

ee55349

vercel Bot deployed to Preview June 11, 2026 21:02 View deployment

annabellscha added 2 commits June 11, 2026 23:05

add peter steinberger intro screenshot

ea7eddc

reframe intro around steipete and bcherny

5e45c8e

vercel Bot deployed to Preview June 11, 2026 21:08 View deployment

annabellscha added 3 commits June 11, 2026 23:09

revert opener to peter-only framing

5ebee00

restore bcherny opener

f4b9d3e

restore steipete intro and screenshot

8adcf41

vercel Bot deployed to Preview June 11, 2026 21:18 View deployment

expand round tables with v1 v3 and reasoning run

61d602a

vercel Bot deployed to Preview June 12, 2026 14:13 View deployment

soften model-choice footnote

f1f53f3

vercel Bot deployed to Preview June 12, 2026 14:21 View deployment

annabellscha added 3 commits June 12, 2026 16:22

soften claims in round two analysis

081e2a0

clarify fuzzy boundary explanation

20082e2

clarify fable stopped short of 95 percent

c233bfe

vercel Bot deployed to Preview June 12, 2026 14:28 View deployment

split fuzzy-boundary point into new paragraph

bacd3e3

vercel Bot deployed to Preview June 12, 2026 14:34 View deployment

clarify round two result takeaway

471ef99

vercel Bot deployed to Preview June 12, 2026 14:39 View deployment

Lotte-Verheyden reviewed Jun 12, 2026


rewrite auto-improvement blog framing

5729b20

vercel Bot deployed to Preview June 15, 2026 08:12 View deployment


