exp: terminal-bench-2-v1 baseline results in README by mikasenghaas · Pull Request #1656 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-12T16:27:18Z

Summary

Adds examples/tasksets/terminal_bench_2_v1/README.md (the taskset had none) with baseline results for three models on terminal-bench-2-v1 — same command per model, rlm harness on the modal runtime, served via Prime Intellect inference, n=1 rollout/task (89 tasks):

uv run eval terminal-bench-2-v1 --harness.runtime.type modal --harness.id rlm \
  --max-turns 100 --timeout.rollout 3600 -m <model>
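
Since the same command was used for each model, the three invocations can be sketched as argument lists. This is only an illustration: it assumes the model identifiers shown in the results table are the exact values passed to `-m`, and the `build_command` helper is hypothetical, not part of the repository. It prints the commands rather than running them.

```python
import shlex

# Model identifiers copied from the results table.
# Assumption: these are the exact strings passed to -m.
MODELS = [
    "deepseek/deepseek-v4-flash",
    "z-ai/glm-4.7",
    "qwen/qwen3.6-27b",
]


def build_command(model: str) -> list[str]:
    """Return the eval command from the PR description for one model (hypothetical helper)."""
    return [
        "uv", "run", "eval", "terminal-bench-2-v1",
        "--harness.runtime.type", "modal",
        "--harness.id", "rlm",
        "--max-turns", "100",
        "--timeout.rollout", "3600",
        "-m", model,
    ]


commands = [build_command(model) for model in MODELS]

# Print shell-ready command lines; nothing is executed here.
for argv in commands:
    print(shlex.join(argv))
```

Keeping the flags in one place makes it easy to confirm that only the `-m` value differs between runs.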

| Model | Accuracy | Corrected (excl. caps) | Reward | Errored | Capped | MC-400s | Wall |
|---|---|---|---|---|---|---|---|
| `deepseek/deepseek-v4-flash` | 48.3% (43/89) | 59.7% (43/72) | 0.483 | 0 | 17 | 23 | ~60 min |
| `z-ai/glm-4.7` | 29.2% (26/89) | 33.8% (26/77) | 0.292 | 0 | 12 | 72 | ~60 min |
| `qwen/qwen3.6-27b` | 18.0% (16/89) | 18.0% (16/89) | 0.180 | 1 | 0 | 72 | ~48 min |

Corrected = accuracy excluding rollouts that hit the turn/time cap (no capped rollout solved its task in any run).
Capped = the rollout hit the 100-turn or 3600 s limit rather than the agent stopping itself. qwen never capped (always agent_completed); it just finished early and wrong.
MC-400s = transient context-length 400s on model calls that the rollout retried and recovered from (not terminal errors).
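
The two accuracy columns can be recomputed from the solved and capped counts in the table. The sketch below is a sanity check under the stated condition that no capped rollout solved its task, so removing capped rollouts only shrinks the denominator; the `Run` class and its field names are illustrative, not code from the repository.

```python
from dataclasses import dataclass

TOTAL_TASKS = 89  # one rollout per task (n=1)


@dataclass(frozen=True)
class Run:
    """Counts for one model, copied from the results table (illustrative type)."""
    model: str
    solved: int
    capped: int

    @property
    def accuracy(self) -> float:
        # Raw accuracy over all tasks.
        return self.solved / TOTAL_TASKS

    @property
    def corrected(self) -> float:
        # Accuracy excluding capped rollouts. Valid only because
        # no capped rollout solved its task.
        return self.solved / (TOTAL_TASKS - self.capped)


RUNS = [
    Run("deepseek/deepseek-v4-flash", solved=43, capped=17),
    Run("z-ai/glm-4.7", solved=26, capped=12),
    Run("qwen/qwen3.6-27b", solved=16, capped=0),
]

for run in RUNS:
    print(f"{run.model}: {run.accuracy:.1%} raw, {run.corrected:.1%} excluding caps")
```

With these counts the denominators are 72, 77, and 89, which matches the fractions shown in the Corrected column.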

Note

Add baseline results README for terminal-bench-2-v1 taskset

Adds README.md documenting the terminal-bench-2-v1 taskset, including prerequisites (harbor CLI and container runtime), a command example for running with the rlm harness on Modal, and a results table with model accuracy and related metrics with explanatory footnotes.

Macroscope summarized 0660d4d.

Three models on terminal-bench-2-v1 via the rlm harness on the modal runtime (pi inference, n=1 rollout/task): deepseek/deepseek-v4-flash 48.3%, z-ai/glm-4.7 29.2%, qwen/qwen3.6-27b 18.0%. Table also reports corrected accuracy (excluding rollouts that hit the 100-turn / 3600s cap - none of which solved), capped + transient model-call-400 counts, and wall time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas wants to merge 1 commit into feat/nano-as-v1 from exp/terminal-bench-2-baselines
