
feat: add chaos resilience evaluators (failure communication, partial completion, recovery strategy)#236

Open
ybdarrenwang wants to merge 5 commits into strands-agents:main from ybdarrenwang:feature/resilience-eval

Collaborator

@ybdarrenwang ybdarrenwang commented May 29, 2026

Description

Adds three chaos-specific LLM-judge evaluators:

  • FailureCommunicationEvaluator — scores how well an agent communicates failures to the user (clarity, actionability, transparency, tone)
  • PartialCompletionEvaluator — scores what percentage of the user's goal was achieved despite failures (0.0–1.0 continuous)
  • RecoveryStrategyEvaluator — scores the appropriateness of the agent's recovery actions (exploration breadth, retry discipline, approach variation)

Each evaluator ships with a v0 system prompt, follows the existing Evaluator base class interface, and supports both sync and async evaluation paths.
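As a rough illustration of what a continuous 0.0–1.0 partial-completion score means, the sketch below computes the fraction of goal-level subtasks that were completed. The evaluator in this PR is an LLM judge, so this is not its implementation. `Subtask` and `partial_completion_score` are hypothetical names introduced only for this example, and the handling of an empty subtask list is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    """Hypothetical goal-level subtask; not a type from strands_evals."""

    description: str
    completed: bool


def partial_completion_score(subtasks: list[Subtask]) -> float:
    """Return the fraction of subtasks completed, always within [0.0, 1.0]."""
    if not subtasks:
        # Assumption: with nothing to complete, treat the goal as fully met.
        return 1.0
    done = sum(1 for subtask in subtasks if subtask.completed)
    return done / len(subtasks)


if __name__ == "__main__":
    plan = [
        Subtask("look up the order", completed=True),
        Subtask("issue the refund", completed=False),
    ]
    print(partial_completion_score(plan))
```

Counting goal-level subtasks rather than individual tool calls keeps the score tied to what the user asked for, so an agent that reaches the goal through a fallback tool is not penalized for the tool that failed.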

Also addresses deferred PR #224 review comments:

  • Remove stale ToolCallFailure reference in plugin.py docstring
  • Replace Optional[X] with X | None union syntax in experiment.py
  • Add validation that ChaosCase.effects dict keys are restricted to known categories

Related Issues

#114

Documentation PR

strands-agents/docs#836

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread src/strands_evals/chaos/evaluators/failure_communication_evaluator.py Outdated
Comment thread src/strands_evals/chaos/case.py
@github-actions

Assessment: Comment

Well-structured PR that follows the established evaluator pattern correctly — each evaluator extends Evaluator[InputT, OutputT], returns list[EvaluationOutput], uses Agent(...) for LLM calls, and ships with versioned prompt templates.

Review Categories
  • Type annotations: All three evaluators import Union from typing_extensions, which AGENTS.md explicitly prohibits, instead of using the PEP 604 | syntax that existing evaluators (e.g., correctness_evaluator.py) already use.
  • Code duplication: The sync/async method pairs copy-paste the full body; other evaluators in the repo (e.g., conciseness_evaluator.py) demonstrate extracting a _create_evaluation_output helper.
  • Test structure: Test files are placed in tests/strands_evals/chaos/ rather than the mirrored tests/strands_evals/chaos/evaluators/ directory, and the new ChaosCase validation lacks a test.
  • Discoverability: Evaluators are not re-exported from chaos/__init__.py.
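The code-duplication point can be illustrated with a small sketch in which the sync and async paths differ only in how the judge is called and share one output-building helper. All names here (`OutputSketch`, `JudgeEvaluatorSketch`, `_call_judge`, `_build_output`) are hypothetical stand-ins rather than the strands_evals API, and the judge is stubbed so that no LLM is invoked.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class OutputSketch:
    """Hypothetical stand-in for EvaluationOutput."""

    score: float
    reason: str


class JudgeEvaluatorSketch:
    """Illustrative evaluator; judge calls are stubbed, no LLM is invoked."""

    def _call_judge(self, text: str) -> dict:
        # Stub: real code would send the prompt to an LLM judge here.
        return {"score": 0.75, "reason": f"judged {len(text)} characters"}

    async def _call_judge_async(self, text: str) -> dict:
        await asyncio.sleep(0)  # yield to the event loop, as a real async call would
        return self._call_judge(text)

    def _build_output(self, raw: dict) -> list[OutputSketch]:
        # Shared post-processing: clamp the score and wrap it in the output type.
        score = min(1.0, max(0.0, float(raw["score"])))
        return [OutputSketch(score=score, reason=raw["reason"])]

    def evaluate(self, text: str) -> list[OutputSketch]:
        return self._build_output(self._call_judge(text))

    async def evaluate_async(self, text: str) -> list[OutputSketch]:
        return self._build_output(await self._call_judge_async(text))
```

With this shape, a change to score handling lands in one place, and the two public methods cannot drift apart.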

Prompt templates are thorough and well-designed with clear rubrics and edge case handling.

@github-actions

Assessment: Comment

The evaluator implementations are clean, correct, and consistent with the rest of the codebase. Most issues from the previous review appear to have already been addressed (PEP 604 unions ✓, _build_output helpers ✓, validation test ✓).

Remaining Items
  • Test placement: Per AGENTS.md, tests/strands_evals/ should mirror src/strands_evals/ exactly. Evaluator tests should live at tests/strands_evals/chaos/evaluators/ to mirror the source at src/strands_evals/chaos/evaluators/. This matches how tests/strands_evals/evaluators/ mirrors src/strands_evals/evaluators/.
  • Discoverability: The new evaluators are exported from chaos/evaluators/__init__.py but not re-exported from chaos/__init__.py. Users doing from strands_evals.chaos import ... won't see them — they'd need to know about the nested evaluators subpackage.
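The mirror rule for test placement can be expressed as a small path mapping. This is a sketch of the convention as the review describes it, not tooling from the repository; `expected_test_dir` and its default root names are assumptions made for the example.

```python
from pathlib import PurePosixPath


def expected_test_dir(
    source_dir: str, src_root: str = "src", tests_root: str = "tests"
) -> str:
    """Map a source directory to the test directory that mirrors it."""
    # relative_to raises ValueError if source_dir is not under src_root.
    relative = PurePosixPath(source_dir).relative_to(src_root)
    return str(PurePosixPath(tests_root) / relative)


if __name__ == "__main__":
    print(expected_test_dir("src/strands_evals/chaos/evaluators"))
```

Applied to the source location named in the review, the mapping yields tests/strands_evals/chaos/evaluators, which is the directory the reviewer asks for.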

Prompt templates are thoughtfully designed with clear evaluation rubrics, mandatory gate conditions, and explicit guidance on edge cases (no-failure baseline, transient vs permanent errors, goal-based vs tool-based subtask decomposition).

Contributor

@poshinchen poshinchen left a comment

I prefer to have these evaluators in strands_evals/evaluators/chaos/*. Either of the options below works for me, but I prefer the latter.

  • from strands_evals.chaos.evaluators import ...
  • from strands_evals.evaluators.chaos import ...

Comment thread pyproject.toml Outdated

github-actions Bot commented Jun 2, 2026

Assessment: Comment

Clean, well-structured implementation that follows the established evaluator pattern correctly and addresses all previously-flagged issues from the first review round.

Remaining Items
  • Dead config: pyproject.toml has a ruff ignore for a non-existent path (src/strands_evals/chaos/evaluators/prompt_templates/*) — suggests residual confusion from a directory structure change.
  • Test placement: Test files are in tests/strands_evals/chaos/ but source is at src/strands_evals/evaluators/chaos/ — mirror rule says tests should be at tests/strands_evals/evaluators/chaos/.
  • Discoverability: Chaos evaluators live in a different namespace (evaluators.chaos) from the rest of the chaos API (chaos.*). Consider re-exporting or documenting the import path in the ChaosExperiment docstring.
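Whether an ignore path is actually dead can be checked mechanically before removing it. The sketch below reports glob patterns that match nothing under a repository root. `unmatched_patterns` is a hypothetical helper, and it uses pathlib globbing, which is only an approximation of how ruff resolves per-file-ignores patterns.

```python
from pathlib import Path


def unmatched_patterns(repo_root: Path, patterns: list[str]) -> list[str]:
    """Return the glob patterns that match no path under repo_root."""
    # next(..., None) stops at the first match, so large trees are not fully walked.
    return [p for p in patterns if next(repo_root.glob(p), None) is None]


if __name__ == "__main__":
    candidates = [
        "src/strands_evals/chaos/evaluators/prompt_templates/*",
        "src/strands_evals/evaluators/chaos/prompt_templates/*",
    ]
    # Run from the repository root; prints the patterns with no matching files.
    print(unmatched_patterns(Path("."), candidates))
```

Running a check like this against the working tree makes it clear which of the two paths exists on the branch under review.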

Prompt templates are thoughtfully designed with mandatory gate conditions, error-type-aware retry evaluation, and goal-based (not tool-based) subtask decomposition.

Comment thread pyproject.toml Outdated

github-actions Bot commented Jun 2, 2026

Assessment: Comment

Clean, well-structured PR. The evaluator implementations correctly follow all established patterns and the previous review's critical issues have been addressed. Two remaining structural items to fix.

Review Details
  • pyproject.toml: Contains a dead ruff ignore path (src/strands_evals/chaos/evaluators/prompt_templates/*) pointing to a non-existent directory — should be removed since the correct path is already on line 141.
  • Test placement: Evaluator tests are at tests/strands_evals/chaos/ but their source is at src/strands_evals/evaluators/chaos/, so the mirror should be tests/strands_evals/evaluators/chaos/.

Prompt templates are excellent — thoughtful rubrics with mandatory gate conditions, error type taxonomies, and clear guidance on scoring edge cases.

@ybdarrenwang ybdarrenwang force-pushed the feature/resilience-eval branch from fd78e10 to 53f2bf5 on June 2, 2026, 17:48

github-actions Bot commented Jun 2, 2026

Assessment: Comment

Solid implementation that follows all established evaluator patterns correctly. The three evaluators, prompt templates, ChaosCase validation, and deferred cleanup items are all well-executed.

Review Details
  • Test placement: Evaluator tests should be at tests/strands_evals/evaluators/chaos/ to mirror source at src/strands_evals/evaluators/chaos/ per AGENTS.md.
  • Discoverability: Consider re-exporting evaluators from chaos/__init__.py so users don't need to know about the evaluators.chaos subpackage.
  • Previous review correction: The pyproject.toml comments from prior reviews are incorrect — the added path (src/strands_evals/evaluators/chaos/prompt_templates/*) is valid and exists.

Prompt templates are thoughtfully crafted with mandatory gate conditions, error-type-aware retry evaluation, and goal-based subtask decomposition.

