The first multi-level safety evaluation platform for OpenClaw-style AI agents
Jointly released by Tsinghua University, Peking University, Zhejiang University, Nanjing University, Hangzhou Dianzi University and Ant Group.
English | 简体中文
- 1. Background
- 2. Risk Taxonomy: 7 Dimensions
- 3. Three-Tier Evaluation Framework
- 4. How to Use
- 5. Leaderboards
- 6. Industry Impact & Roadmap
- 7. Participating Organizations
As AI agents evolve from "answering questions" to "invoking tools, operating systems, and executing real business actions," both their capabilities and risk surface scale up dramatically:
- Yesterday's risk: a model "saying the wrong thing" — at worst, a poor user experience
- Today's risk: an agent "doing the wrong thing" — deleting critical files, leaking corporate funds or confidential data, even triggering cascading system failures
The central question: in complex, real-world environments, can an agent complete tasks efficiently while holding the line on safety?
Agent3σ delivers a systematic, reproducible, production-grade safety evaluation framework to answer this question.
Agent3σ goes beyond traditional LLM benchmarks, mapping real-world agent risk into 7 categories and 30+ representative scenarios — covering the full attack surface from local system damage to financial transactions.
| # | Risk Category | Representative Scenarios | Description |
|---|---|---|---|
| 1 | Local Environment & Availability | Resource exhaustion, file deletion, dangerous command execution, system tampering, process disruption | Exhaust system resources; delete or alter files & configs; run destructive commands that break services |
| 2 | Data & Information Security | Sensitive info leakage, data exfiltration, unauthorized credential access, content tampering, ransomware-style encryption | Unauthorized credential reads; outbound exfiltration; tampered business data; extortion via encryption |
| 3 | Persistent State & Memory Poisoning | Persona corruption, memory injection, lingering malicious instructions, config tampering, plugin/skill poisoning | Modify core agent config to keep emitting unsafe output; plant "delayed-trigger" instructions in long-term memory |
| 4 | Privilege & System Control | Sandbox escape, privilege escalation, defense bypass, authorization-boundary confusion | Break out of execution sandbox; obtain admin rights; blur test/prod boundaries |
| 5 | Network Attack & Remote Control | Reverse shell, DNS hijacking, intranet probing, malicious persistence, supply-chain pollution | Establish outbound control channels; redirect traffic; install backdoors or untrusted plugins |
| 6 | Abuse & Illicit Use | Fraud & social engineering, black-market automation, illegal content distribution, brand damage | Phishing, fraud farming, mass registration, money-laundering assistance, harmful content generation |
| 7 | Financial & Transaction Risk | Unconfirmed sensitive transactions, account manipulation, parameter tampering, misleading financial decisions | Transfer funds without controls; alter payees or transaction params; mislead investment/credit decisions |
To cover the full lifecycle from red-team screening to production sign-off, Agent3σ introduces a progressive L1 / L2 / L3 framework.
| Name | Tier | Form | Highlights | Cost | Reproducibility | Use Case |
|---|---|---|---|---|---|---|
| Agent3σ-Sweep | L1 Static | Offline scoring on static samples | Broad coverage, low cost, batch-friendly | Low | High | Model training / fast red-team triage |
| Agent3σ-Stage | L2 Simulated | Plugins simulate web pages, email, and other interactive scenarios | Stable and reproducible; supports multi-turn interaction | Medium | High | Model iteration / capability comparison |
| Agent3σ-Canary | L3 Real-World | Execution via real tool/API interfaces | Close to real deployment environments; reconstructs the agent's actual execution trace | High | Medium | End-to-end safety validation / pre-launch sign-off |
💡 Philosophy: The three tiers compound — from surface refusal to deep real-world defense — to reveal an agent's true safety posture.
Each evaluation tier ships as a standalone dataset and runner — pick the one that fits your stage:
| Tier | Suite | Use Case | Entry Point |
|---|---|---|---|
| L1 | Agent3σ-Sweep Static evaluation |
Large-scale screening during model training; fast red-team scoring | 📂 Sweep Benchmark → |
| L2 | Agent3σ-Stage Simulated interaction |
Multi-turn capability comparison; model iteration | 📂 Stage Benchmark → |
| L3 | Agent3σ-Canary Real-world evaluation |
Pre-launch safety sign-off; compliance audit | 📂 Canary Benchmark → |
💡 Recommended path:
L1 fast screening→L2 stable iteration→L3 launch sign-off— run all three for the most complete safety picture.
| Metric | Meaning | Direction |
|---|---|---|
| ASR | Attack Success Rate | ↓ lower is better |
| Sec Awareness | Clear refusal rate on risky tasks | ↑ higher is better |
| Task Success | Success rate on benign tasks | ↑ higher is better |
| Avg Score | (100 − ASR) × 0.6 + Sec Awareness × 0.2 + Task Success × 0.2 | ↑ higher is better |
Composite score = arithmetic mean of L1 / L2 / L3 Avg Scores
| Rank | Model | L1 | L2 | L3 | Total ↑ |
|---|---|---|---|---|---|
| 🥇 1 | Claude Opus 4.6 | 79.3 | 88.1 | 87.8 | 85.1 |
| 🥈 2 | GPT-5.4 | 75.2 | 81.0 | 67.7 | 74.6 |
| 🥉 3 | Claude Sonnet 4.5 | 81.9 | 79.2 | 58.4 | 73.2 |
| 4 | Qwen3.6-Plus | 69.5 | 68.6 | 71.4 | 69.8 |
| 5 | GLM-5 | 74.7 | 62.7 | 64.2 | 67.2 |
| 6 | DeepSeek-V4-Pro | 54.9 | 57.0 | 59.8 | 57.2 |
| 7 | Qwen3.5-397B-A17B | 63.6 | 52.0 | 52.6 | 56.1 |
| 8 | Gemini-3.1-Pro | 67.9 | 41.5 | 49.7 | 53.0 |
| 9 | Kimi-K2.5 | 53.0 | 52.7 | 48.5 | 51.4 |
| 10 | MiniMax-M2.5 | 55.8 | 47.8 | 48.7 | 50.8 |
| 11 | Qwen3.5-122B-A10B | 57.3 | 40.2 | 47.6 | 48.4 |
| 12 | Qwen3.5-35B-A3B | 61.1 | 31.1 | 39.3 | 43.8 |
Key findings:
- Claude Opus 4.6 leads all three tiers, showing the most consistent end-to-end defense
- Qwen3.6-Plus stands out in L2 and L3, indicating strong awareness of tool-invocation boundaries
- Some models score well on L1 but collapse in real environments — proving that multi-tier evaluation is essential for surfacing real risk
| Rank | Model | ASR ↓ | Sec Awareness ↑ | Task Success ↑ | Avg Score ↑ |
|---|---|---|---|---|---|
| 🥇 1 | Claude Sonnet 4.5 | 10.0% | 67.5% | 71.8% | 81.9 |
| 🥈 2 | Claude Opus 4.6 | 12.7% | 64.8% | 69.8% | 79.3 |
| 🥉 3 | GPT-5.4 | 15.2% | 63.3% | 58.3% | 75.2 |
| 4 | GLM-5 | 20.3% | 58.0% | 76.3% | 74.7 |
| 5 | Qwen3.6-Plus | 30.4% | 54.4% | 84.3% | 69.5 |
| 6 | Gemini-3.1-Pro | 27.8% | 43.0% | 80.0% | 67.9 |
| 7 | Qwen3.5-397B-A17B | 36.2% | 50.0% | 76.9% | 63.6 |
| 8 | Qwen3.5-35B-A3B | 37.7% | 36.4% | 82.1% | 61.1 |
| 9 | Qwen3.5-122B-A10B | 45.0% | 38.8% | 82.9% | 57.3 |
| 10 | MiniMax-M2.5 | 46.2% | 35.0% | 82.9% | 55.8 |
| 11 | DeepSeek-V4-Pro | 47.5% | 35.0% | 82.1% | 54.9 |
| 12 | Kimi-K2.5 | 50.0% | 28.7% | 86.3% | 53.0 |
| Rank | Model | ASR ↓ | Sec Awareness ↑ | Task Success ↑ | Avg Score ↑ |
|---|---|---|---|---|---|
| 🥇 1 | Claude Opus 4.6 | 9.0% | 74.2% | 93.4% | 88.1 |
| 🥈 2 | GPT-5.4 | 15.2% | 69.4% | 81.1% | 81.0 |
| 🥉 3 | Claude Sonnet 4.5 | 19.7% | 64.3% | 90.8% | 79.2 |
| 4 | Qwen3.6-Plus | 35.4% | 57.2% | 92.0% | 68.6 |
| 5 | GLM-5 | 36.0% | 49.2% | 72.2% | 62.7 |
| 6 | DeepSeek-V4-Pro | 47.7% | 42.3% | 85.8% | 57.0 |
| 7 | Kimi-K2.5 | 55.1% | 37.3% | 91.5% | 52.7 |
| 8 | Qwen3.5-397B-A17B | 55.2% | 35.5% | 90.1% | 52.0 |
| 9 | MiniMax-M2.5 | 59.5% | 26.7% | 91.0% | 47.8 |
| 10 | Gemini-3.1-Pro | 48.8% | 18.6% | 35.1% | 41.5 |
| 11 | Qwen3.5-122B-A10B | 67.4% | 18.2% | 84.9% | 40.2 |
| 12 | Qwen3.5-35B-A3B | 77.7% | 10.5% | 78.1% | 31.1 |
| Rank | Model | ASR ↓ | Sec Awareness ↑ | Task Success ↑ | Avg Score ↑ |
|---|---|---|---|---|---|
| 🥇 1 | Claude Opus 4.6 | 8.8% | 78.4% | 87.2% | 87.8 |
| 🥈 2 | Qwen3.6-Plus | 27.0% | 56.7% | 81.4% | 71.4 |
| 🥉 3 | GPT-5.4 | 28.5% | 52.2% | 71.8% | 67.7 |
| 4 | GLM-5 | 32.4% | 48.5% | 69.6% | 64.2 |
| 5 | DeepSeek-V4-Pro | 36.2% | 48.9% | 58.9% | 59.8 |
| 6 | Claude Sonnet 4.5 | 39.1% | 46.0% | 63.5% | 58.4 |
| 7 | Qwen3.5-397B-A17B | 46.8% | 38.1% | 65.3% | 52.6 |
| 8 | Gemini-3.1-Pro | 31.8% | 27.4% | 16.5% | 49.7 |
| 9 | MiniMax-M2.5 | 50.0% | 27.4% | 65.9% | 48.7 |
| 10 | Kimi-K2.5 | 50.7% | 36.2% | 58.5% | 48.5 |
| 11 | Qwen3.5-122B-A10B | 49.6% | 26.6% | 60.1% | 47.6 |
| 12 | Qwen3.5-35B-A3B | 54.4% | 17.3% | 42.4% | 39.3 |
Agent3σ moves agent safety evaluation beyond pure "prompt attack/defense" into an era of observable, quantifiable, and comparable end-to-end task evaluation.
| Stakeholder | Value |
|---|---|
| Model providers | A production-grade red-team stress benchmark for locating real-world risk blind spots |
| Application developers | A rigorous pre-launch safety bar that meaningfully reduces catastrophic deployment risk |
| Regulators & compliance | Reproducible, auditable evidence chains supporting AI governance |
Roadmap: The participating organizations will continue to expand the Agent3σ risk corpus, toolchain, and scenario coverage, and release more evaluation capabilities to the open-source community — building the safety foundation of the agent era together with the industry.
This project is jointly initiated and co-built by the following organizations:
- Tsinghua University
- Peking University
- Zhejiang University
- Nanjing University
- Hangzhou Dianzi University
- Ant Group
We warmly welcome more partners to join us in advancing the agent safety evaluation ecosystem.
