
exp: terminal-bench-2-v1 baseline results in README #1656

Draft
mikasenghaas wants to merge 1 commit into feat/nano-as-v1 from exp/terminal-bench-2-baselines


mikasenghaas (Member) commented Jun 12, 2026

Summary

Adds examples/tasksets/terminal_bench_2_v1/README.md (the taskset had none) with baseline results for three models on terminal-bench-2-v1. Every model was run with the same command: rlm harness on the modal runtime, served via Prime Intellect inference, n=1 rollout per task (89 tasks):

uv run eval terminal-bench-2-v1 --harness.runtime.type modal --harness.id rlm \
  --max-turns 100 --timeout.rollout 3600 -m <model>

| Model | Accuracy | Corrected (excl. caps) | Reward | Errored | Capped | MC-400s | Wall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek/deepseek-v4-flash | 48.3% (43/89) | 59.7% (43/72) | 0.483 | 0 | 17 | 23 | ~60 min |
| z-ai/glm-4.7 | 29.2% (26/89) | 33.8% (26/77) | 0.292 | 0 | 12 | 72 | ~60 min |
| qwen/qwen3.6-27b | 18.0% (16/89) | 18.0% (16/89) | 0.180 | 1 | 0 | 72 | ~48 min |

  • Corrected = accuracy excluding rollouts that hit the turn or time cap. No capped rollout solved its task in any run, so only the denominator changes.
  • Capped = rollouts that hit the 100-turn or 3600 s limit rather than the agent stopping itself. qwen never capped (always agent_completed); it finished early and wrong.
  • MC-400s = transient model-call 400 responses caused by exceeding the context length, which the rollout retried or recovered from (not terminal errors).
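
As a sanity check on the Accuracy and Corrected columns, the arithmetic can be reproduced from the solved and capped counts reported above. This is only a minimal sketch: the `Run` dataclass and the function names are hypothetical helpers for illustration, not part of the eval tooling, and the corrected figure relies on the stated fact that no capped rollout solved its task.

```python
from dataclasses import dataclass

TOTAL_TASKS = 89  # tasks in terminal-bench-2-v1, as stated in the summary


@dataclass(frozen=True)
class Run:
    """Hypothetical container for the counts reported in the results table."""
    model: str
    solved: int  # rollouts that solved their task
    capped: int  # rollouts that hit the 100-turn or 3600 s limit


def accuracy(run: Run) -> float:
    """Plain accuracy over all tasks."""
    return run.solved / TOTAL_TASKS


def corrected_accuracy(run: Run) -> float:
    """Accuracy with capped rollouts removed from the denominator.

    The numerator is unchanged only because no capped rollout solved its task.
    """
    return run.solved / (TOTAL_TASKS - run.capped)


RUNS = [
    Run("deepseek/deepseek-v4-flash", solved=43, capped=17),
    Run("z-ai/glm-4.7", solved=26, capped=12),
    Run("qwen/qwen3.6-27b", solved=16, capped=0),
]

for run in RUNS:
    print(f"{run.model}: accuracy {accuracy(run):.1%}, "
          f"corrected {corrected_accuracy(run):.1%}")
```

With zero capped rollouts, as for qwen, the two figures coincide, which is why its row shows 18.0% in both columns.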

Note

Add baseline results README for terminal-bench-2-v1 taskset

Adds README.md documenting the terminal-bench-2-v1 taskset, including prerequisites (harbor CLI and container runtime), a command example for running with the rlm harness on Modal, and a results table with model accuracy and related metrics with explanatory footnotes.

Macroscope summarized 0660d4d.

Three models on terminal-bench-2-v1 via the rlm harness on the modal runtime (pi inference,
n=1 rollout/task): deepseek/deepseek-v4-flash 48.3%, z-ai/glm-4.7 29.2%, qwen/qwen3.6-27b
18.0%. Table also reports corrected accuracy (excluding rollouts that hit the 100-turn /
3600s cap - none of which solved), capped + transient model-call-400 counts, and wall time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>