feat: add chaos resilience evaluators (failure communication, partial completion, recovery strategy) #236
Open: ybdarrenwang wants to merge 5 commits into strands-agents:main from ybdarrenwang:feature/resilience-eval
Commits (5):
- 07fdd26 ship 3 resilience evaluators (ybdarrenwang)
- 715c046 keep only v0 prompts; address comments from #224 (ybdarrenwang)
- 0098d07 add unit tests (ybdarrenwang)
- f64052f address review comments (ybdarrenwang)
- 53f2bf5 move chaos evaluator folder (ybdarrenwang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| """Chaos testing evaluators for strands-evals.""" | ||
|
|
||
| from .failure_communication_evaluator import FailureCommunicationEvaluator | ||
| from .partial_completion_evaluator import PartialCompletionEvaluator | ||
| from .recovery_strategy_evaluator import RecoveryStrategyEvaluator | ||
|
|
||
| __all__ = [ | ||
| "FailureCommunicationEvaluator", | ||
| "PartialCompletionEvaluator", | ||
| "RecoveryStrategyEvaluator", | ||
| ] |
81 changes: 81 additions & 0 deletions
src/strands_evals/evaluators/chaos/failure_communication_evaluator.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| from enum import Enum | ||
| from typing import cast | ||
|
|
||
| from pydantic import BaseModel, Field | ||
| from strands import Agent | ||
| from strands.models.model import Model | ||
|
|
||
| from ...types.evaluation import EvaluationData, EvaluationOutput, InputT, OutputT | ||
| from ...types.trace import EvaluationLevel | ||
| from ..evaluator import Evaluator | ||
| from .prompt_templates.failure_communication import get_template | ||
|
|
||
|
|
||
| class FailureCommunicationScore(str, Enum): | ||
| """Categorical failure communication ratings.""" | ||
|
|
||
| FAILURE = "Failure" | ||
| POOR = "Poor" | ||
| ACCEPTABLE = "Acceptable" | ||
| GOOD = "Good" | ||
| EXCELLENT = "Excellent" | ||
|
|
||
|
|
||
| class FailureCommunicationRating(BaseModel): | ||
| """Structured output for failure communication evaluation.""" | ||
|
|
||
| reasoning: str = Field(description="Step by step reasoning to derive the final score") | ||
| score: FailureCommunicationScore = Field(description="Categorical failure communication rating") | ||
|
|
||
|
|
||
| class FailureCommunicationEvaluator(Evaluator[InputT, OutputT]): | ||
| """Evaluates quality of agent's failure communication and user experience.""" | ||
|
|
||
| evaluation_level = EvaluationLevel.TRACE_LEVEL | ||
|
|
||
| _score_mapping = { | ||
| FailureCommunicationScore.FAILURE: 0.0, | ||
| FailureCommunicationScore.POOR: 0.25, | ||
| FailureCommunicationScore.ACCEPTABLE: 0.5, | ||
| FailureCommunicationScore.GOOD: 0.75, | ||
| FailureCommunicationScore.EXCELLENT: 1.0, | ||
| } | ||
|
|
||
| def __init__( | ||
| self, | ||
| version: str = "v0", | ||
| model: Model | str | None = None, | ||
| system_prompt: str | None = None, | ||
| ): | ||
| super().__init__() | ||
| self.version = version | ||
| default_prompt = get_template(version).SYSTEM_PROMPT | ||
| self.system_prompt = system_prompt if system_prompt is not None else default_prompt | ||
| self.model = model | ||
|
|
||
| def _build_output(self, rating: FailureCommunicationRating) -> list[EvaluationOutput]: | ||
| normalized_score = self._score_mapping[rating.score] | ||
| return [ | ||
| EvaluationOutput( | ||
| score=normalized_score, | ||
| test_pass=normalized_score >= 0.5, | ||
| reason=rating.reasoning, | ||
| label=rating.score, | ||
| ) | ||
| ] | ||
|
|
||
| def evaluate(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: | ||
| parsed_input = self._get_last_turn(evaluation_case) | ||
| prompt = self._format_trace_level_prompt(parsed_input) | ||
| evaluator_agent = Agent(model=self.model, system_prompt=self.system_prompt, callback_handler=None) | ||
| result = evaluator_agent(prompt, structured_output_model=FailureCommunicationRating) | ||
| rating = cast(FailureCommunicationRating, result.structured_output) | ||
| return self._build_output(rating) | ||
|
|
||
| async def evaluate_async(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: | ||
| parsed_input = self._get_last_turn(evaluation_case) | ||
| prompt = self._format_trace_level_prompt(parsed_input) | ||
| evaluator_agent = Agent(model=self.model, system_prompt=self.system_prompt, callback_handler=None) | ||
| result = await evaluator_agent.invoke_async(prompt, structured_output_model=FailureCommunicationRating) | ||
| rating = cast(FailureCommunicationRating, result.structured_output) | ||
| return self._build_output(rating) | ||
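
To make the label-to-score mapping and the pass threshold in this file concrete, here is a minimal standalone sketch. The enum values, the numeric mapping, and the 0.5 cutoff are copied from the diff; `SketchOutput` and `build_output` are hypothetical stand-ins for `EvaluationOutput` and `_build_output`, used so the example runs without strands or pydantic.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCommunicationScore(str, Enum):
    """Categorical ratings, copied from the diff."""

    FAILURE = "Failure"
    POOR = "Poor"
    ACCEPTABLE = "Acceptable"
    GOOD = "Good"
    EXCELLENT = "Excellent"


# Same numeric mapping as _score_mapping in the diff.
SCORE_MAPPING = {
    FailureCommunicationScore.FAILURE: 0.0,
    FailureCommunicationScore.POOR: 0.25,
    FailureCommunicationScore.ACCEPTABLE: 0.5,
    FailureCommunicationScore.GOOD: 0.75,
    FailureCommunicationScore.EXCELLENT: 1.0,
}

PASS_THRESHOLD = 0.5  # the diff hardcodes this as `normalized_score >= 0.5`


@dataclass
class SketchOutput:
    """Hypothetical stand-in for EvaluationOutput (not the real class)."""

    score: float
    test_pass: bool
    reason: str
    label: str


def build_output(label: FailureCommunicationScore, reasoning: str) -> SketchOutput:
    """Mirror of _build_output: map the label to a score and apply the cutoff."""
    score = SCORE_MAPPING[label]
    return SketchOutput(
        score=score,
        test_pass=score >= PASS_THRESHOLD,
        reason=reasoning,
        label=label.value,  # the sketch stores the plain string value
    )
```

Because "Acceptable" sits exactly on the threshold, a trace rated Acceptable passes, while "Poor" and "Failure" do not.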
61 changes: 61 additions & 0 deletions
src/strands_evals/evaluators/chaos/partial_completion_evaluator.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| from typing import cast | ||
|
|
||
| from pydantic import BaseModel, Field | ||
| from strands import Agent | ||
| from strands.models.model import Model | ||
|
|
||
| from ...types.evaluation import EvaluationData, EvaluationOutput, InputT, OutputT | ||
| from ...types.trace import EvaluationLevel | ||
| from ..evaluator import Evaluator | ||
| from .prompt_templates.partial_completion import get_template | ||
|
|
||
|
|
||
| class PartialCompletionRating(BaseModel): | ||
| """Structured output for partial completion evaluation.""" | ||
|
|
||
| reasoning: str = Field(description="Step by step reasoning to derive the final score") | ||
| completion_percentage: float = Field(description="Completion percentage from 0.0 to 1.0", ge=0.0, le=1.0) | ||
|
|
||
|
|
||
| class PartialCompletionEvaluator(Evaluator[InputT, OutputT]): | ||
| """Evaluates what percentage of task objectives were achieved despite failures.""" | ||
|
|
||
| evaluation_level = EvaluationLevel.TRACE_LEVEL | ||
|
|
||
| def __init__( | ||
| self, | ||
| version: str = "v0", | ||
| model: Model | str | None = None, | ||
| system_prompt: str | None = None, | ||
| ): | ||
| super().__init__() | ||
| self.version = version | ||
| default_prompt = get_template(version).SYSTEM_PROMPT | ||
| self.system_prompt = system_prompt if system_prompt is not None else default_prompt | ||
| self.model = model | ||
|
|
||
| def _build_output(self, rating: PartialCompletionRating) -> list[EvaluationOutput]: | ||
| return [ | ||
| EvaluationOutput( | ||
| score=rating.completion_percentage, | ||
| test_pass=rating.completion_percentage >= 0.5, | ||
| reason=rating.reasoning, | ||
| label=f"{rating.completion_percentage:.2f}", | ||
| ) | ||
| ] | ||
|
|
||
| def evaluate(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: | ||
| parsed_input = self._get_last_turn(evaluation_case) | ||
| prompt = self._format_trace_level_prompt(parsed_input) | ||
| evaluator_agent = Agent(model=self.model, system_prompt=self.system_prompt, callback_handler=None) | ||
| result = evaluator_agent(prompt, structured_output_model=PartialCompletionRating) | ||
| rating = cast(PartialCompletionRating, result.structured_output) | ||
| return self._build_output(rating) | ||
|
|
||
| async def evaluate_async(self, evaluation_case: EvaluationData[InputT, OutputT]) -> list[EvaluationOutput]: | ||
| parsed_input = self._get_last_turn(evaluation_case) | ||
| prompt = self._format_trace_level_prompt(parsed_input) | ||
| evaluator_agent = Agent(model=self.model, system_prompt=self.system_prompt, callback_handler=None) | ||
| result = await evaluator_agent.invoke_async(prompt, structured_output_model=PartialCompletionRating) | ||
| rating = cast(PartialCompletionRating, result.structured_output) | ||
| return self._build_output(rating) |
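
The bounded `completion_percentage` field and the way it feeds the score, pass flag, and label can be sketched without pydantic. This is an illustrative stdlib stand-in, not the PR's code: `CompletionRating` and `summarize` are hypothetical names, and the range check in `__post_init__` approximates what `Field(ge=0.0, le=1.0)` enforces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionRating:
    """Hypothetical stdlib stand-in for PartialCompletionRating."""

    reasoning: str
    completion_percentage: float

    def __post_init__(self) -> None:
        # Approximates the ge=0.0 / le=1.0 constraint on the pydantic field.
        if not 0.0 <= self.completion_percentage <= 1.0:
            raise ValueError(
                f"completion_percentage must be in [0.0, 1.0], got {self.completion_percentage}"
            )


def summarize(rating: CompletionRating) -> dict:
    """Mirror of _build_output: raw score, 0.5 cutoff, two-decimal label."""
    return {
        "score": rating.completion_percentage,
        "test_pass": rating.completion_percentage >= 0.5,
        "reason": rating.reasoning,
        "label": f"{rating.completion_percentage:.2f}",
    }
```

One consequence of formatting the label to two decimals: a value such as 0.499 is labeled "0.50" but still fails the `>= 0.5` check, so the label and the pass flag can appear to disagree near the threshold.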
Empty file.
11 changes: 11 additions & 0 deletions
src/strands_evals/evaluators/chaos/prompt_templates/failure_communication/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| from . import failure_communication_v0 | ||
|
|
||
| VERSIONS = { | ||
| "v0": failure_communication_v0, | ||
| } | ||
|
|
||
| DEFAULT_VERSION = "v0" | ||
|
|
||
|
|
||
| def get_template(version: str = DEFAULT_VERSION): | ||
| return VERSIONS[version] |
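
As written, `get_template` raises a bare `KeyError` when given an unknown version. The sketch below shows one possible variation that reports the available versions instead. It is not part of the PR: the `SimpleNamespace` object is a hypothetical stand-in for the `failure_communication_v0` module, and raising `ValueError` is an assumption about what callers would prefer.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the failure_communication_v0 module.
failure_communication_v0 = SimpleNamespace(
    SYSTEM_PROMPT="You are an objective judge evaluating how well an AI agent communicated..."
)

VERSIONS = {
    "v0": failure_communication_v0,
}

DEFAULT_VERSION = "v0"


def get_template(version: str = DEFAULT_VERSION):
    """Look up a prompt template, listing valid versions on a miss."""
    try:
        return VERSIONS[version]
    except KeyError:
        available = ", ".join(sorted(VERSIONS))
        raise ValueError(
            f"Unknown prompt version {version!r}; available versions: {available}"
        ) from None
```

With this variation, `FailureCommunicationEvaluator(version="v1")` would fail at construction time with a message naming `v0`, rather than with `KeyError: 'v1'`.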
66 changes: 66 additions & 0 deletions
...evals/evaluators/chaos/prompt_templates/failure_communication/failure_communication_v0.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| """ | ||
| LLM-Judge prompt for Failure Communication & UX evaluation. | ||
| Based on planning.md Section 4.4 - Communication Phase metrics. | ||
|
|
||
| Evaluates communication quality across all failure categories: | ||
| tool failures, resource failures, and infrastructure failures. | ||
| """ | ||
|
|
||
| SYSTEM_PROMPT = """You are an objective judge evaluating how well an AI agent communicated with the user during a failure. You will be provided with the complete conversation trace including: | ||
| - User messages (User:) | ||
| - Assistant responses (Assistant:) | ||
| - Tool calls selected by the assistant (Action:) | ||
| - Tool outputs (Tool:) | ||
|
|
||
| CONTEXT: You are evaluating an AI agent under chaos testing conditions. The agent may have been subjected to deliberately injected failures across multiple layers of the system to test its resilience. Your evaluation should assess how well the agent handled and communicated these failures, NOT whether the agent is generally competent. | ||
|
|
||
| ## Failure Categories to Watch For | ||
|
|
||
| ### Tool Failures (tool/API layer) | ||
| - Tool unavailability, timeouts, network errors, execution errors | ||
| - Corrupted or malformed tool responses | ||
| - Authentication or permission errors on tool calls | ||
|
|
||
| ### Resource Failures (infrastructure/platform layer) | ||
| - Model API unavailability or degraded performance | ||
| - Rate limiting or throttling (429 errors, backoff scenarios) | ||
| - Embedding or reranking model failures (degraded RAG pipelines) | ||
| - Data ingestion overload (stale data, pipeline delays) | ||
|
|
||
| Your task is to assess the quality of error communication and user experience. Follow these steps: | ||
| 1. First, determine whether any failures or errors occurred in the trace — at ANY layer (tool, resource, or infrastructure). Look for error messages, exceptions, non-200 status codes, timeout indicators, or explicit failure signals in tool outputs. | ||
| 2. **MANDATORY GATE — If NO failure occurred:** You MUST immediately rate **Acceptable (0.5)** and STOP. Do not evaluate communication quality — there was nothing to communicate. A trace with zero failures always scores 0.5, no exceptions. | ||
| 3. If a failure DID occur, evaluate the agent's communication: | ||
| a. Clarity: Did the user understand what went wrong? Did the agent appropriately describe the failure without exposing unnecessary internal details? | ||
| b. Actionability: Did the agent provide clear next steps? (e.g., "I'll retry shortly", "Using cached results instead", "The service is currently unavailable — please try again later") | ||
| c. Transparency: Was the agent honest about limitations? Did it acknowledge degraded results rather than presenting them as complete? | ||
| d. Tone: Was it appropriately calibrated for the situation? (e.g., rate limiting is routine and shouldn't alarm the user; a complete service outage warrants more explanation) | ||
| 4. Rate the communication quality using the scoring scale and rubric below. | ||
|
|
||
| # Scoring Scale | ||
|
|
||
| Scores range from 0.0 to 1.0: | ||
| - **Excellent** = 1.0 | ||
| - **Good** = 0.75 | ||
| - **Acceptable** = 0.5 | ||
| - **Poor** = 0.25 | ||
| - **Failure** = 0.0 | ||
|
|
||
| Scores >= 0.5 pass. Use this scale when selecting your rating. | ||
|
|
||
| # Evaluation Rubric | ||
|
|
||
| - **Excellent (1.0)**: Clear error communication; user understands what happened and next steps; graceful degradation communicated transparently (e.g., "Search is currently unavailable, so I'm using cached results which may not reflect the latest data"); trust maintained across all failure types | ||
| - **Good (0.75)**: Good communication with minor gaps; user mostly understands; appropriate tone and transparency. For resource failures, agent acknowledges degradation but may not fully explain the impact. | ||
| - **Acceptable (0.5)**: Basic error communication; user knows something went wrong but lacks clarity on impact or next steps. ALSO use this when no failure occurred in the trace (nothing to communicate about). | ||
| - **Poor (0.25)**: Confusing or unhelpful messages; user uncertain what happened; overly technical jargon (e.g., exposing raw "429 Too Many Requests" or "503 Service Unavailable" without context); misleading reassurances about degraded results | ||
| - **Failure (0.0)**: Failures DID occur but the agent provided NO error communication; user is unaware of the failure; agent hides problems or provides false confidence. This includes silently returning degraded results from resource failures without acknowledgment. NOTE: This rating requires that failures actually exist in the trace — do NOT use this when the trace is failure-free. | ||
|
|
||
| CRITICAL RULES: | ||
| - **NO-FAILURE BASELINE**: If no failure or error appears in the trace (at any layer), you MUST rate Acceptable (0.5). This is mandatory and unconditional. A clean trace with no errors = 0.5, period. Do NOT confuse "no failures to communicate" with "failed to communicate failures" — they are opposites. | ||
| - If a tool or resource failed but the agent silently ignored it and responded as if nothing happened, rate Failure (0.0). | ||
| - If a failure occurred and the agent acknowledged it but provided NO alternative or next steps, rate Poor (0.25). | ||
| - Resource failures (model API down, rate limiting, embedding failures) require the SAME communication standards as tool failures. Silently degrading without informing the user is still a Failure (0.0). | ||
| - If the agent is using fallback or degraded data due to a resource failure, it MUST communicate this to the user to score above Poor. | ||
|
|
||
| **IMPORTANT**: The user context and agent's available tools in the trace ALWAYS take priority over your own knowledge.""" |
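
The mandatory gate and the first CRITICAL RULES in this prompt read like a small decision procedure that runs before the rubric is applied. The sketch below encodes only those forced outcomes as plain Python, as a way to check the rules against each other. `TraceFacts` and `forced_rating` are hypothetical names; the real evaluator delegates this judgment to the LLM and has no such function.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TraceFacts:
    """Hypothetical summary of a trace; the real judge reads the raw trace text."""

    failure_occurred: bool
    agent_acknowledged: bool = False
    gave_next_steps: bool = False


def forced_rating(facts: TraceFacts) -> Optional[Tuple[str, float]]:
    """Return the rating the prompt's rules force, or None if the rubric decides."""
    if not facts.failure_occurred:
        # NO-FAILURE BASELINE: a clean trace is always Acceptable (0.5).
        return ("Acceptable", 0.5)
    if not facts.agent_acknowledged:
        # Failure silently ignored: Failure (0.0).
        return ("Failure", 0.0)
    if not facts.gave_next_steps:
        # Acknowledged, but no alternative or next steps: Poor (0.25).
        return ("Poor", 0.25)
    # Otherwise the judge weighs clarity, actionability, transparency, and tone.
    return None
```

Read this way, the only passing outcome that is fixed in advance is the no-failure baseline; every forced outcome for a trace that does contain a failure falls below the 0.5 pass threshold.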
11 changes: 11 additions & 0 deletions
src/strands_evals/evaluators/chaos/prompt_templates/partial_completion/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| from . import partial_completion_v0 | ||
|
|
||
| VERSIONS = { | ||
| "v0": partial_completion_v0, | ||
| } | ||
|
|
||
| DEFAULT_VERSION = "v0" | ||
|
|
||
|
|
||
| def get_template(version: str = DEFAULT_VERSION): | ||
| return VERSIONS[version] |