In MuZeroGameBuffer, I am getting index out of bounds errors in _compute_target_policy_non_reanalyzed when running Sampled EfficientZero on a custom environment. I think the line policy_tmp[legal_action] = distributions[index] may be wrong: policy_tmp and distributions have the same shape, but legal_action is drawn from a very large, sparse set of actions, so it seems the index should be used instead. Normally, policy_shape is set to self._cfg.model.action_space_size, but for Sampled EfficientZero it is set to self._cfg.model.num_of_sampled_actions instead.
One possible solution, I think, is to override _compute_target_policy_reanalyzed in SampledMuZeroGameBuffer, then replace
policy_tmp[legal_action] = policy[index]
with
policy_tmp[index] = policy[index]
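To make the difference concrete, here is a minimal standalone sketch of the two assignment styles. It does not call the real buffer code: K, the action ids, the visit counts, and the helper names fill_by_action and fill_by_index are all made up for illustration, and it assumes the target is a flat list of length policy_shape.

```python
# Hypothetical values, not taken from any real config or environment.
K = 4                                  # stands in for num_of_sampled_actions
legal_actions = [3, 250, 1024, 9000]   # sparse action ids from a large action space
visit_counts = [10, 5, 3, 2]           # one entry per sampled action

def fill_by_action(distributions, legal_actions, policy_shape):
    """Mimics policy_tmp[legal_action] = distributions[index]."""
    policy_tmp = [0] * policy_shape
    for index, legal_action in enumerate(legal_actions):
        policy_tmp[legal_action] = distributions[index]
    return policy_tmp

def fill_by_index(distributions, legal_actions, policy_shape):
    """Mimics the proposed policy_tmp[index] = distributions[index]."""
    policy_tmp = [0] * policy_shape
    for index, _ in enumerate(legal_actions):
        policy_tmp[index] = distributions[index]
    return policy_tmp

# Indexing by action id fails as soon as an id is >= policy_shape.
try:
    fill_by_action(visit_counts, legal_actions, K)
    raised = False
except IndexError:
    raised = True

# Indexing by position always stays inside the K slots.
fixed = fill_by_index(visit_counts, legal_actions, K)
print(raised, fixed)
```

With these made-up values, the action-indexed version writes slot 3 and then fails on action id 250, while the position-indexed version fills all K slots.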
I'm not sure about the following:
- whether it makes sense to update only the Sampled variants here (rather than the base MuZeroGameBuffer);
- whether the "base" MuZero implementation is correct here either, or whether the "bug" just doesn't matter there because index == legal_action in all non-sampled environments;
- whether the fix has to be applied to both the "reanalyzed" and "non_reanalyzed" versions of the function;
- whether there is another solution, like adjusting policy_shape.
I think this bug didn't show up when I ran SampledEfficientZero on Connect4 because there are only 6 possible actions, which is less than the "K" setting, so I never got an index out of bounds error. But that is probably affecting the performance of the algorithm, if it is indeed a bug.
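The silent case can be sketched the same way. The values below are hypothetical (they are not the real Connect4 config), and the helper names are again made up. The point is only that when every action id is smaller than K, both assignment styles run without error but put the same visit counts in different slots. Which layout is right depends on how the training code reads the target, which I have not verified.

```python
# Hypothetical values: a small action space with K larger than any action id.
K = 20
legal_actions = [0, 2, 5]   # all ids < K, so no IndexError is possible
visit_counts = [6, 3, 1]    # one entry per legal/sampled action

def fill_by_action(distributions, legal_actions, policy_shape):
    """Mimics policy_tmp[legal_action] = distributions[index]."""
    policy_tmp = [0] * policy_shape
    for index, legal_action in enumerate(legal_actions):
        policy_tmp[legal_action] = distributions[index]
    return policy_tmp

def fill_by_index(distributions, legal_actions, policy_shape):
    """Mimics policy_tmp[index] = distributions[index]."""
    policy_tmp = [0] * policy_shape
    for index, _ in enumerate(legal_actions):
        policy_tmp[index] = distributions[index]
    return policy_tmp

by_action = fill_by_action(visit_counts, legal_actions, K)
by_index = fill_by_index(visit_counts, legal_actions, K)

print(by_action[:6])  # [6, 0, 3, 0, 0, 1]
print(by_index[:6])   # [6, 3, 1, 0, 0, 0]
```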