Add auto-improvement classification blog draft #3084
base: main
Changes from 52 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| --- | ||
| title: "97% on train, 82% on test: auto-improvement loops have a validation gap" | ||
| date: 2026/06/10 | ||
| description: "Claude Fable 5 had just come out, so we used it to run an auto-improvement loop end to end. It improved train accuracy fast. What it really taught us was how much a validation split matters." | ||
| tag: engineering | ||
| author: Annabell | ||
| --- | ||
|
|
||
| import { BlogHeader } from "@/components/blog/BlogHeader"; | ||
| import { Frame } from "@/components/Frame"; | ||
|
|
||
| <BlogHeader | ||
| title="97% on train, 82% on test: auto-improvement loops have a validation gap" | ||
| description="Claude Fable 5 had just come out, so we used it to run an auto-improvement loop end to end. It improved train accuracy fast. What it really taught us was how much a validation split matters." | ||
|
Member
I would address that steipete is talking about personal coding agents, and Langfuse generally is used for agents that you deploy for users |
||
| date="June 10, 2026" | ||
| authors={["annabellschafer"]} | ||
| /> | ||
|
|
||
| Loop-shaped workflows are having a moment. `@steipete` put it plainly: you should not be prompting coding agents anymore, you should be designing loops that prompt your agents. | ||
|
|
||
| <Frame> | ||
|  | ||
| </Frame> | ||
|
|
||
| We designed a loop and gave Claude Fable 5 a classification task: a train/test split in [Langfuse Datasets](/docs/evaluation/experiments/datasets), a prompt in [Prompt Management](/docs/prompt-management/get-started), and a goal: hit 95% accuracy or stop at 15 runs. Train accuracy went from 78% to 97% in four runs. Test performance barely moved. The 11 remaining errors on test were shared across every prompt variant. | ||
|
Member
I'd not mention too many specifics here, and instead mention this in the "the task and setup" section. This paragraph could become: "We designed a loop and gave Claude Fable 5 a classification task. The loop did its job, but it surfaced a dataset problem: while train accuracy improved, test performance barely moved." |
||
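A claim like "the remaining test errors were shared across every prompt variant" can be checked by intersecting the sets of misclassified example IDs per prompt version. A minimal Python sketch; the function name and the example IDs are hypothetical and not taken from the experiment:

```python
def shared_errors(errors_by_prompt: dict[str, set[str]]) -> set[str]:
    """Return the example IDs that every prompt version got wrong."""
    if not errors_by_prompt:
        return set()
    # Intersect the error sets of all prompt versions.
    return set.intersection(*errors_by_prompt.values())


# Hypothetical IDs, for illustration only.
example = {
    "v2": {"paper-a", "paper-b", "paper-c"},
    "v4": {"paper-b", "paper-c", "paper-d"},
}
print(shared_errors(example))
```

Errors that survive every variant are candidates for a labeling or dataset issue rather than a prompt issue.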
|
|
||
| The loop did its job. What it surfaced was a dataset problem. | ||
|
|
||
| ## The task and setup | ||
|
|
||
| Our task was to classify arXiv papers into one of 10 categories from title, authors, and abstract, using this [Kaggle dataset](https://www.kaggle.com/datasets/sumitm004/arxiv-scientific-research-papers-dataset). We picked classification because it gives you a crisp target function: exact-match accuracy. | ||
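Exact-match accuracy can be sketched in a few lines of Python. The whitespace and case normalization below is an assumption made for illustration; the post does not say how predictions were compared to labels:

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold label.

    Assumption: comparison ignores surrounding whitespace and letter case.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty dataset")
    hits = sum(
        pred.strip().casefold() == gold.strip().casefold()
        for pred, gold in zip(predictions, labels)
    )
    return hits / len(labels)
```

A scorer like this returns a single number per run, which is what makes the target easy for an agent to optimize against.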
|
|
||
| - a train split with 200 labeled examples and a held-out test split with 100 in [**Langfuse Datasets**](/docs/evaluation/experiments/datasets) | ||
| - a prompt in [**Prompt Management**](/docs/prompt-management/get-started) | ||
| - a small runner built with [**Langfuse Experiments via the SDK**](/docs/evaluation/experiments/experiments-via-sdk) | ||
| - `gpt-4o-mini` as the task model<sup id="fnref-model-choice"><a href="#fn-model-choice">1</a></sup> | ||
|
Member
I'd explain the improvement loop before this, and explicitly mention that there is a task model, used for the classification task, and an optimizer model, that runs experiments to improve that classification task. If not explained people might be confused because you mentioned using Claude Fable 5 in the intro, and here it mentions gpt-4o-mini
|
||
|
|
||
| The starting prompt was minimal, consisting only of a list of labels with instructions to pick one. After each run, the agent reviewed the errors, wrote [comments](/docs/observability/features/comments) on the [Langfuse trace](/docs/observability/data-model#traces), published a new [prompt version](/docs/prompt-management/features/prompt-version-control), and ran again. | ||
|
Member
Would be nice to have the first prompt version shown, could be in an expandable component if it takes up much space |
||
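The run, review, revise cycle with the two stated stop conditions (95% accuracy or 15 runs) can be sketched as a plain Python loop. `evaluate` and `revise` are hypothetical callables standing in for the experiment run and the agent's prompt revision; they are not Langfuse SDK functions:

```python
from typing import Callable


def run_improvement_loop(
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    initial_prompt: str,
    target_accuracy: float = 0.95,
    max_runs: int = 15,
) -> list[tuple[str, float]]:
    """Evaluate, then revise, until the target is hit or the run budget is spent."""
    history: list[tuple[str, float]] = []
    prompt = initial_prompt
    for _ in range(max_runs):
        accuracy = evaluate(prompt)  # one experiment run on the train split
        history.append((prompt, accuracy))
        if accuracy >= target_accuracy:
            break  # goal reached, stop early
        prompt = revise(prompt, accuracy)  # agent drafts the next prompt version
    return history
```

Note that the stop condition here looks only at the score `evaluate` returns. If that score is train accuracy, the loop has no signal that generalization is getting worse.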
|
|
||
| _Suggested platform screenshot: the Langfuse workbench for this loop, showing the dataset, prompt, and experiment setup together._ | ||
|
Member
Just flagging, I think this is still a todo in the doc
Member
(I like to mark with yellow, makes it impossible to look over :) ) |
||
|
|
||
| <Frame fullWidth> | ||
|  | ||
|
Member
would make the formatting better here
Member
I'd also try to remove the bottom line of text in each box. It feels a bit noisy now |
||
| </Frame> | ||
|
|
||
| ## Round 1: the hill sprint | ||
|
|
||
| The first round optimized fast: | ||
|
|
||
| | Prompt | Train | Test | Gap | | ||
| | --- | ---: | ---: | ---: | | ||
| | v1 - flat label list | 78.0% | - | - | | ||
| | v2 - general definitions | 90.5% | **84.0%** | 6.5 | | ||
| | v3 - sharpened boundary rules | 90.0% | - | - | | ||
| | v4 - train-derived precedents | **97.0%** | 82.0% | **15.0** | | ||
|
|
||
| Moving from a flat label list to general definitions was real progress. But once the agent started encoding concrete precedents from the training failures, train accuracy jumped while generalization got worse. The prompt that looked best on train was not the one that held up on test. | ||
|
Member
"was real progress" sounds a bit weird |
||
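The gap column is train accuracy minus test accuracy, in percentage points. A short Python sketch using the two round 1 rows that were run on both splits shows how the choice of selection metric changes which prompt version wins; the variable and function names are illustrative:

```python
# Figures taken from the round 1 table (accuracy in percent).
runs = [
    {"prompt": "v2", "train": 90.5, "test": 84.0},
    {"prompt": "v4", "train": 97.0, "test": 82.0},
]


def generalization_gap(train: float, test: float) -> float:
    """Train minus test accuracy, in percentage points."""
    return round(train - test, 1)


for run in runs:
    run["gap"] = generalization_gap(run["train"], run["test"])

best_on_train = max(runs, key=lambda r: r["train"])["prompt"]
best_on_test = max(runs, key=lambda r: r["test"])["prompt"]
print(best_on_train, best_on_test)
```

Selecting on train picks `v4`, while selecting on test picks `v2`.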
|
|
||
| _Suggested platform screenshot: the Langfuse runs overview comparing prompt versions and scores across the experiment history._ | ||
|
Member
also flagging this todo |
||
|
|
||
| ## Round 2: "generalize this time" | ||
|
|
||
| So we restarted from the more general prompt and changed the rules: no single-paper precedents, only class-level principles, and no touching the test set. | ||
|
Member
"no touching the test set": I assume they also couldn't touch the test set before? If so, remove this part. |
||
|
|
||
| | Prompt | Train | Test | Gap | | ||
| | --- | ---: | ---: | ---: | | ||
| | v2 - general definitions, round 1 | 90.5% | **84.0%** | 6.5 | | ||
| | v5 - reasoning field, round 2 | 84.0% | - | - | | ||
| | v9 - general principles, round 2 | 94.0% | 81.0% | 13.0 | | ||
|
|
||
| Only selected prompts were run on the held-out test split. | ||
|
|
||
| The disciplined second round did not produce better held-out results. Adding a `reasoning` field also did not help: in our runs, it seemed to encourage the model to rationalize surface cues instead of resolving the label boundary. | ||
|
Member
Here I'm a bit lost
|
||
|
|
||
| By the end, Fable's own analysis suggested that many of the remaining errors sat on genuinely fuzzy category boundaries, so it stopped before hitting the 95% target. Pushing further likely would have required adding increasingly specific case-by-case rules, rather than finding broader principles that generalized. | ||
|
Member
genuinely = AI tell |
||
|
|
||
| The final prompt version of round 2 hit 81.0% accuracy on test. That is a 13-point gap to the train set, and very much in line with the drop between train and test in round 1. It just took more turns to get to the point of overfitting. | ||
|
Member
I was confused with round 1 / round 2 and the iteration numbers. Can we use something other than numbers for the rounds? |
||
|
|
||
| In the end, the best-performing prompt on test was still `v2` from the first round. That iteration simply added broad descriptions to the label-only version, and it generalized best. | ||
|
|
||
| _Suggested platform screenshot: one recurring hard example in Langfuse, with the trace and annotation showing why the boundary case stayed unresolved._ | ||
|
|
||
| ## What the dataset would have needed | ||
|
|
||
| Based on these runs, the dataset likely needed three things: | ||
|
|
||
| **1. A real validation split.** We had train for fitting and test for the final check, but nothing in between. So the loop selected prompt versions on train accuracy alone. | ||
|
Member
This is not clear to me, does this mean we needed a third set? What would the function of that set be? |
||
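A validation split is a third set: the agent fits prompts against one part of the labeled data, prompt versions are selected on a separate validation part, and the test split is touched only for the final check. A minimal Python sketch, assuming a 75/25 split of the existing train examples; the fraction, the seed, and both function names are hypothetical:

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_fit_validation(
    examples: Sequence[T], val_fraction: float = 0.25, seed: int = 0
) -> tuple[list[T], list[T]]:
    """Shuffle once with a fixed seed, then carve off a validation part."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (fit, validation)


def select_on_validation(val_accuracy_by_prompt: dict[str, float]) -> str:
    """Pick the prompt version with the highest validation accuracy."""
    return max(val_accuracy_by_prompt, key=val_accuracy_by_prompt.get)
```

With 200 train examples this yields 150 for fitting and 50 for selection. For a 10-class task, a stratified split that keeps class proportions equal in both parts would likely be a better choice than the plain shuffle shown here.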
|
|
||
| **2. More repeated edge cases.** The hard errors clustered around a few label boundaries, especially Information Retrieval vs. Databases, Human-Computer Interaction vs. Computers and Society, and subject vs. representation for audio papers. A stronger benchmark would have forced new rules to prove themselves across multiple similar cases. | ||
|
Member
"The hard errors clustered around a few label boundaries, especially Information Retrieval vs. Databases, Human-Computer Interaction vs. Computers and Society, and subject vs. representation for audio papers." --> The section feels like a conclusion section to me, I don't expect to read new information here. I'd discuss these specific errors earlier and then reference them here
Member
benchmark --> dataset (we call it dataset earlier)
Member
"forced new rules to prove themselves across multiple similar cases." --> this sounds contradictory, what exactly does this mean? |
||
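To see which label boundaries the errors cluster around, one option is to count misclassifications per unordered label pair. A Python sketch; the function name is hypothetical and the sample rows in the usage lines are made up for illustration, not taken from the runs:

```python
from collections import Counter
from typing import Sequence


def boundary_counts(predictions: Sequence[str], labels: Sequence[str]) -> Counter:
    """Count errors per label boundary, ignoring the direction of the confusion."""
    return Counter(
        tuple(sorted((gold, pred)))
        for pred, gold in zip(predictions, labels)
        if pred != gold
    )


# Illustrative rows only.
gold = ["Information Retrieval", "Databases", "Databases", "Human-Computer Interaction"]
pred = ["Databases", "Information Retrieval", "Databases", "Computers and Society"]
print(boundary_counts(pred, gold).most_common())
```

A boundary that shows up only once in the data gives a new rule nothing to generalize across, which is one reading of the "more repeated edge cases" point.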
|
|
||
| **3. Clearer policy for ambiguous papers.** Some of the shared errors look genuinely arguable. If the benchmark wants one exact label, it needs sharper tie-break rules, better canonical examples, and maybe even an `unsure` or multi-label policy. | ||
|
Member
genuinely = AI tell
Member
I feel like this point and point 2 are about the same thing, maybe cut one of them?
Member
benchmark -> dataset |
||
|
|
||
| That is the real lesson: the loop did its job. It surfaced, quickly, that the next bottleneck was not another prompt tweak. It was the dataset. | ||
|
Member
I'd leave this out |
||
|
|
||
| ## Where this is actually useful today | ||
|
|
||
| None of this means "do not automate the loop." It means: automate the inner loop, own the outer one. | ||
|
Member
"automate the inner loop, not the outer one": what do you mean by that? In my head the 2 loops are
Member
Ah reading further I see what you mean with each. I wonder if the human-owned part is actually a loop though (the post also didn't describe it as a loop), would actually cut the inner vs outer loop here, I think it would confuse people |
||
|
|
||
| - **Agent-owned:** running experiments, scoring, per-error annotation, drafting hypothesis-driven prompt revisions, diffing errors across runs, flagging plateaus | ||
| - **Human-owned:** the target function, including the validation and held-out test data nobody optimizes against, dataset composition, when to restart with different constraints, and when to stop | ||
|
|
||
| As we argued in [AI is eating the AI engineering loop](/blog/2026-06-09-ai-is-eating-ai-engineering), the mechanics aren't the hard part anymore. This experiment shows what the hard part actually is: the target function, the dataset, and the judgment calls nobody automates against. | ||
|
Member
Reading this, I would actually frame this post as an illustration of how lack of human judgement in parts of the AI engineering loop result in suboptimal agents. |
||
|
|
||
| This is exactly what Langfuse is good for: [datasets](/docs/evaluation/experiments/datasets), [prompt versioning](/docs/prompt-management/features/prompt-version-control), [experiments](/docs/evaluation/core-concepts#experiments), and [trace comments](/docs/observability/features/comments) give the agent a workbench and an audit trail. | ||
|
|
||
| <Callout type="info"> | ||
| <span id="fn-model-choice"><sup>1</sup> We used <code>gpt-4o-mini</code> because one realistic production strategy for a narrow, repetitive classification task like this is to tune a cheaper model rather than default to a frontier model. A stronger model likely would have performed better out of the box, but that would have tested a different tradeoff.</span> | ||
| </Callout> | ||
the word "matters" has become an AI word for me
could say "the importance of a good validation split"