feat: add chaos resilience evaluators (failure communication, partial completion, recovery strategy) by ybdarrenwang · Pull Request #236 · strands-agents/evals

ybdarrenwang · 2026-05-29T16:33:53Z

Description

Adds three chaos-specific LLM-judge evaluators:

FailureCommunicationEvaluator — scores how well an agent communicates failures to the user (clarity, actionability, transparency, tone)
PartialCompletionEvaluator — scores what percentage of the user's goal was achieved despite failures (0.0–1.0 continuous)
RecoveryStrategyEvaluator — scores the appropriateness of the agent's recovery actions (exploration breadth, retry discipline, approach variation)

Each evaluator ships with a v0 system prompt, follows the existing Evaluator base class interface, and supports both sync and async evaluation paths.

Also addresses deferred PR #224 review comments:

Remove stale ToolCallFailure reference in plugin.py docstring
Replace Optional[X] with X | None union syntax in experiment.py
Add validation that ChaosCase.effects dict keys are restricted to known categories
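
As a rough illustration of the last two items, the sketch below pairs PEP 604 union syntax with a key check on an effects dict. It is a minimal stand-in, not the library's code: `ChaosCaseSketch`, the `description` field, and the category names in `KNOWN_EFFECT_CATEGORIES` are all hypothetical, and the real `ChaosCase` may validate differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical category names; the real set is defined by the library.
KNOWN_EFFECT_CATEGORIES = frozenset({"tool_effects", "model_effects"})


@dataclass
class ChaosCaseSketch:
    """Illustrative stand-in for ChaosCase; not the library's class."""

    name: str
    effects: dict[str, list[str]] = field(default_factory=dict)
    description: str | None = None  # PEP 604 union instead of Optional[str]

    def __post_init__(self) -> None:
        # Reject any top-level key that is not a known effect category.
        unknown = sorted(set(self.effects) - KNOWN_EFFECT_CATEGORIES)
        if unknown:
            raise ValueError(f"Unknown effect categories: {unknown}")
```

Failing at construction time means a misspelled category surfaces when the case is defined rather than being silently ignored during an experiment run.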

Related Issues

#114

Documentation PR

strands-agents/docs#836

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2026-05-29T16:39:32Z

Assessment: Comment

Well-structured PR that follows the established evaluator pattern correctly — each evaluator extends Evaluator[InputT, OutputT], returns list[EvaluationOutput], uses Agent(...) for LLM calls, and ships with versioned prompt templates.

Review Categories

Type annotations: All three evaluators import Union from typing_extensions, which AGENTS.md explicitly prohibits, instead of using the PEP 604 | syntax that existing evaluators (e.g., correctness_evaluator.py) already use correctly.
Code duplication: The sync/async method pairs copy-paste the full body; other evaluators in the repo (e.g., conciseness_evaluator.py) demonstrate extracting a _create_evaluation_output helper.
Test structure: Test files are placed in tests/strands_evals/chaos/ rather than the mirrored tests/strands_evals/chaos/evaluators/ directory, and the new ChaosCase validation lacks a test.
Discoverability: Evaluators are not re-exported from chaos/__init__.py.

Prompt templates are thorough and well-designed with clear rubrics and edge case handling.
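
The code duplication point can be sketched as follows: the sync and async methods differ only in how they obtain the judge verdict, so the conversion into outputs can live in one helper. This is a sketch under assumptions, not the repo's implementation: `EvaluationOutputSketch`, `RecoveryEvaluatorSketch`, the stubbed `_judge` methods, and the field names are hypothetical, and only the helper name `_create_evaluation_output` comes from the review.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class EvaluationOutputSketch:
    """Illustrative stand-in for EvaluationOutput; fields are assumptions."""

    score: float
    reason: str


class RecoveryEvaluatorSketch:
    """Hypothetical evaluator whose sync and async paths share one helper."""

    def _judge(self, transcript: str) -> dict:
        # Stub standing in for the LLM-judge call; returns a fixed verdict.
        return {"score": 0.5, "reason": f"judged {len(transcript)} characters"}

    async def _judge_async(self, transcript: str) -> dict:
        # Async stub; a real implementation would await the model call here.
        await asyncio.sleep(0)
        return self._judge(transcript)

    def _create_evaluation_output(self, verdict: dict) -> list[EvaluationOutputSketch]:
        # Single place where a judge verdict is turned into outputs.
        return [
            EvaluationOutputSketch(
                score=float(verdict["score"]), reason=verdict["reason"]
            )
        ]

    def evaluate(self, transcript: str) -> list[EvaluationOutputSketch]:
        return self._create_evaluation_output(self._judge(transcript))

    async def evaluate_async(self, transcript: str) -> list[EvaluationOutputSketch]:
        return self._create_evaluation_output(await self._judge_async(transcript))
```

With this shape, a change to how outputs are built is made once and both paths stay in agreement.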

github-actions · 2026-05-29T17:34:51Z

Assessment: Comment

The evaluator implementations are clean, correct, and consistent with the rest of the codebase. Most issues from the previous review appear to have already been addressed (PEP 604 unions ✓, _build_output helpers ✓, validation test ✓).

Remaining Items

Test placement: Per AGENTS.md, tests/strands_evals/ should mirror src/strands_evals/ exactly. Evaluator tests should live at tests/strands_evals/chaos/evaluators/ to mirror the source at src/strands_evals/chaos/evaluators/. This matches how tests/strands_evals/evaluators/ mirrors src/strands_evals/evaluators/.
Discoverability: The new evaluators are exported from chaos/evaluators/__init__.py but not re-exported from chaos/__init__.py. Users doing from strands_evals.chaos import ... won't see them — they'd need to know about the nested evaluators subpackage.

Prompt templates are thoughtfully designed with clear evaluation rubrics, mandatory gate conditions, and explicit guidance on edge cases (no-failure baseline, transient vs permanent errors, goal-based vs tool-based subtask decomposition).
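
The mirror rule in the test placement item can be expressed as a small path mapping. The helper below is a hypothetical sketch (`mirrored_test_dir` is not part of the repo); it assumes the source and test roots named in the review and only computes where tests for a given source directory would be expected.

```python
from pathlib import PurePosixPath


def mirrored_test_dir(
    source_dir: str,
    src_root: str = "src/strands_evals",
    tests_root: str = "tests/strands_evals",
) -> str:
    """Return the test directory that mirrors a source directory."""
    # relative_to raises ValueError if source_dir is not under src_root.
    relative = PurePosixPath(source_dir).relative_to(src_root)
    return str(PurePosixPath(tests_root) / relative)
```

For example, `mirrored_test_dir("src/strands_evals/chaos/evaluators")` yields `tests/strands_evals/chaos/evaluators`, which is the location the review asks for.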

poshinchen

I prefer to have these evaluators in strands_evals/evaluators/chaos/*. Either of the import paths below works for me, but I prefer the latter.

from strands_evals.chaos.evaluators import ...
from strands_evals.evaluators.chaos import ...

github-actions · 2026-06-02T17:27:10Z

Assessment: Comment

Clean, well-structured implementation that follows the established evaluator pattern correctly and addresses all previously-flagged issues from the first review round.

Remaining Items

Dead config: pyproject.toml has a ruff ignore for a non-existent path (src/strands_evals/chaos/evaluators/prompt_templates/*) — suggests residual confusion from a directory structure change.
Test placement: Test files are in tests/strands_evals/chaos/ but source is at src/strands_evals/evaluators/chaos/ — mirror rule says tests should be at tests/strands_evals/evaluators/chaos/.
Discoverability: Chaos evaluators live in a different namespace (evaluators.chaos) from the rest of the chaos API (chaos.*). Consider re-exporting or documenting the import path in the ChaosExperiment docstring.

Prompt templates are thoughtfully designed with mandatory gate conditions, error-type-aware retry evaluation, and goal-based (not tool-based) subtask decomposition.

github-actions · 2026-06-02T17:27:43Z

Assessment: Comment

Clean, well-structured PR. The evaluator implementations correctly follow all established patterns and the previous review's critical issues have been addressed. Two remaining structural items to fix.

Review Details

pyproject.toml: Contains a dead ruff ignore path (src/strands_evals/chaos/evaluators/prompt_templates/*) pointing to a non-existent directory — should be removed since the correct path is already on line 141.
Test placement: Evaluator tests are at tests/strands_evals/chaos/ but their source is at src/strands_evals/evaluators/chaos/, so the mirror should be tests/strands_evals/evaluators/chaos/.

Prompt templates are excellent — thoughtful rubrics with mandatory gate conditions, error type taxonomies, and clear guidance on scoring edge cases.
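
Whether an ignore path is dead can be checked mechanically rather than by inspection. The helper below is a hypothetical sketch, not a ruff feature: it takes the glob patterns as a plain list (how they are read out of pyproject.toml is left to the caller) and reports the ones that match nothing under the repository root.

```python
from pathlib import Path


def dead_ignore_patterns(repo_root: str, patterns: list[str]) -> list[str]:
    """Return the glob patterns that match no file under repo_root."""
    root = Path(repo_root)
    # Path.glob yields nothing when an intermediate directory is missing,
    # so a pattern for a moved or deleted directory is reported as dead.
    return [pattern for pattern in patterns if not any(root.glob(pattern))]
```

Running such a check against the checked-out branch gives a definite answer about which of the two prompt_templates paths exists.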

github-actions · 2026-06-02T17:54:53Z

Assessment: Comment

Solid implementation that follows all established evaluator patterns correctly. The three evaluators, prompt templates, ChaosCase validation, and deferred cleanup items are all well-executed.

Review Details

Test placement: Evaluator tests should be at tests/strands_evals/evaluators/chaos/ to mirror source at src/strands_evals/evaluators/chaos/ per AGENTS.md.
Discoverability: Consider re-exporting evaluators from chaos/__init__.py so users don't need to know about the evaluators.chaos subpackage.
Previous review correction: The pyproject.toml comments from prior reviews are incorrect — the added path (src/strands_evals/evaluators/chaos/prompt_templates/*) is valid and exists.

Prompt templates are thoughtfully crafted with mandatory gate conditions, error-type-aware retry evaluation, and goal-based subtask decomposition.

github-actions Bot reviewed May 29, 2026

Comment thread src/strands_evals/chaos/evaluators/failure_communication_evaluator.py Outdated

github-actions Bot reviewed May 29, 2026

Comment thread src/strands_evals/evaluators/chaos/failure_communication_evaluator.py

github-actions Bot reviewed May 29, 2026

Comment thread src/strands_evals/chaos/case.py

poshinchen requested changes Jun 2, 2026

github-actions Bot reviewed Jun 2, 2026

Comment thread pyproject.toml Outdated

github-actions Bot reviewed Jun 2, 2026

Comment thread pyproject.toml Outdated

ybdarrenwang added 5 commits June 2, 2026 17:47

07fdd26 ship 3 resilience evaluators
715c046 keep only v0 prompts; address comments from strands-agents#224
0098d07 add unit tests
f64052f address review comments
53f2bf5 move chaos evaluator folder

ybdarrenwang force-pushed the feature/resilience-eval branch from fd78e10 to 53f2bf5 Compare June 2, 2026 17:48

ybdarrenwang wants to merge 5 commits into strands-agents:main from ybdarrenwang:feature/resilience-eval
