feat: add chaos testing module for fault injection #224
0e06f09 to 28a0679
Review Summary
Assessment: Request Changes
This PR introduces a well-structured chaos testing module with clean separation between effects, scenarios, plugin, and experiment orchestration. The ContextVar-based design for concurrency safety and the composition with the existing
Review Categories
The overall design is thoughtful and the test coverage is good. Addressing the
Review Summary (Follow-up)
Assessment: Comment
All feedback from the previous review has been thoroughly addressed.
Remaining Items
These are straightforward fixes. Once addressed, this looks good to merge.
2b51586 to bb023d6
Review Summary (Round 3)
Assessment: Request Changes
All prior feedback has been resolved. One important serialization issue remains that would cause data loss in error reporting and persistence scenarios.
Details
The architecture, ContextVar design, plugin implementation, and test structure are all solid.
Review Summary (Round 4)
Assessment: Approve
All previously identified issues have been thoroughly addressed: discriminated union serialization is working correctly, async task handling is properly implemented with tests,
Minor items noted
The architecture is solid: ContextVar lifecycle, discriminated union serialization, plugin hooks, and the composition pattern with base Experiment are all well-designed.
Description
Introduces a chaos testing module for fault injection during agent evaluation. Enables systematic testing of agent resilience under tool failures and response corruption without modifying agent code.
Key capabilities:
- Timeout, NetworkError, ExecutionError, ValidationError) and post-hook (corrupt tool response: TruncateFields, RemoveFields, CorruptValues), each with effect-specific parameters (e.g., corrupt_ratio, max_length, remove_ratio)
- Case with an effects field that carries the failure injection config. Provides ChaosCase.expand(cases, effect_maps) to generate the Cartesian product of base cases × named effect maps.
- ChaosPlugin hooks into Strands' native BeforeToolCallEvent/AfterToolCallEvent system; reads the active ChaosCase from a ContextVar (zero chaos concepts in user task code)
- ChaosExperiment composes the base Experiment to run ChaosCase objects, managing ContextVar lifecycle per case for thread/async safety
Related Issues
#114
Documentation PR
strands-agents/docs#836
Type of Change
New feature
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
hatch run prepare
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
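For readers unfamiliar with the Cartesian expansion mentioned under Key capabilities, the following is a minimal sketch of the idea. `SketchCase`, the standalone `expand` function, the `name[label]` naming scheme, and the shape of `effect_maps` (label mapped to a tool-name-to-effect-name dict) are all assumptions made for illustration; the real `ChaosCase.expand` signature and field types may differ.

```python
from dataclasses import dataclass, field, replace
from itertools import product


@dataclass
class SketchCase:
    """Hypothetical stand-in for a case that carries an effects config."""
    name: str
    input: str
    effects: dict = field(default_factory=dict)


def expand(cases, effect_maps):
    """Pair every base case with every named effect map (cases x maps)."""
    expanded = []
    for case, (label, effects) in product(cases, effect_maps.items()):
        # Copy the case so the base cases stay untouched, and tag the name
        # with the effect-map label (an assumed naming convention).
        expanded.append(
            replace(case, name=f"{case.name}[{label}]", effects=dict(effects))
        )
    return expanded


base_cases = [
    SketchCase(name="lookup", input="Find the weather in Oslo"),
    SketchCase(name="booking", input="Book a table for two"),
]
effect_maps = {
    "timeout": {"search": "Timeout"},
    "corrupt": {"search": "CorruptValues"},
}
chaos_cases = expand(base_cases, effect_maps)
```

Two base cases and two effect maps yield four expanded cases, so the size of the run grows multiplicatively with each added map.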