Fix data subset refuter index misalignment by TasfinMahmud · Pull Request #1546 · py-why/dowhy

TasfinMahmud · 2026-05-26T17:24:16Z

Preserves index in encoding.py to prevent Unalignable boolean Series errors in distance_matching_estimator.

Closes #1372

Changes

1. `dowhy/utils/encoding.py`

Removed the two reset_index(drop=True) calls in one_hot_encode that were discarding the original DataFrame index after encoding. This preserves index alignment when the function is called on subsets of data (e.g., by DataSubsetRefuter).

2. `dowhy/causal_estimators/distance_matching_estimator.py`

Changed estimate_effect() to re-encode self._observed_common_causes fresh on every call instead of reusing the stale cached version from fit(). This ensures the encoded data always matches the current (potentially subsetted) DataFrame index, preventing index misalignment even if encoding preserves index correctly.

Why both changes?

The encoding.py fix alone prevents the immediate crash, but the estimator would still carry a stale cached reference from the fitting phase with the original (pre-subset) index. Re-encoding on each estimate_effect() call is the more robust fix that handles all refuter scenarios correctly.

…ssue py-why#1372) Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>

github-actions · 2026-06-01T14:59:10Z

🤖 This is an automated response from Repo Assist.

Welcome to the project, @TasfinMahmud! Thanks for tracking down and fixing this bug.

This PR addresses issue #1372 — the Unalignable boolean Series error that occurs when DataSubsetRefuter is used with categorical columns and DistanceMatchingEstimator.

Heads-up — overlap with an existing PR: There is an existing Repo Assist PR (#1533) that makes the identical encoding.py fix (removing the two reset_index(drop=True) calls). However, your PR goes further with an additional change in distance_matching_estimator.py that re-encodes the data fresh on every estimate_effect() call, rather than reusing the stale self._observed_common_causes cached during fit(). This is actually the more robust fix: even with the corrected encoding, the cached object from the fitting phase would still carry the original (pre-subset) index and would mismatch the refuter's sampled data.

A few suggestions:

Please add Closes #1372 to the PR body so the issue auto-closes on merge.
Consider adding a focused regression test for the DistanceMatchingEstimator + DataSubsetRefuter + categorical columns path. PR [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 includes a test for index preservation in one_hot_encode itself, but not for the estimator integration.
The maintainers may want to consolidate [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 (which has the encoding.py fix + a test) with the distance_matching_estimator.py change from this PR. Both approaches are correct; the question is which vehicle to merge.

Thanks again for the contribution!

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run
gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

… with categorical columns (py-why#1372) Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>

TasfinMahmud · 2026-06-02T10:21:34Z

Thanks for the thorough review, @github-actions[bot]! I've addressed all three suggestions:

Closes data subset refuter bug when dataframe has categorical columns #1372 — Added to the PR body so the issue auto-closes on merge.
Regression test — I've pushed a new commit with a focused integration test ( est_data_subset_refuter_with_categorical_columns) that exercises the exact failing path from Issue data subset refuter bug when dataframe has categorical columns #1372: DistanceMatchingEstimator + DataSubsetRefuter + categorical common causes. The test creates a DataFrame with a pd.Categorical column, runs distance matching, and then runs the data subset refuter — which would previously raise ValueError: Unalignable boolean Series provided as indexer.
Regarding overlap with PR [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 — I understand that [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 covers the encoding.py fix with its own test for one_hot_encode index preservation. My PR goes a step further by also fixing distance_matching_estimator.py to re-encode data fresh in estimate_effect() rather than reusing the stale cached self._observed_common_causes from it(). This is the more robust fix because even with correct index preservation in encoding, the cached object would still carry the original (pre-subset) index. Happy to help consolidate the two PRs if the maintainers prefer that approach!

amit-sharma · 2026-06-06T04:12:59Z

@all-contributors please add @TasfinMahmud for code

allcontributors · 2026-06-06T04:13:08Z

@amit-sharma

I've put up a pull request to add @TasfinMahmud! 🎉

Fix data subset refuter index misalignment for categorical columns (I…

98b091b

…ssue py-why#1372) Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>

TasfinMahmud force-pushed the fix-data-subset-categorical-bug branch from 8646f73 to 98b091b Compare May 26, 2026 17:48

github-actions Bot mentioned this pull request Jun 1, 2026

[Repo Assist] Monthly Activity 2026-06 #1559

Open

41 tasks

Add regression test for DistanceMatchingEstimator + DataSubsetRefuter…

b58e943

… with categorical columns (py-why#1372) Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>

allcontributors Bot mentioned this pull request Jun 6, 2026

docs: add TasfinMahmud as a contributor for code #1568

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix data subset refuter index misalignment#1546

Fix data subset refuter index misalignment#1546
TasfinMahmud wants to merge 2 commits into
py-why:mainfrom
TasfinMahmud:fix-data-subset-categorical-bug

TasfinMahmud commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

TasfinMahmud commented Jun 2, 2026

Uh oh!

amit-sharma commented Jun 6, 2026

Uh oh!

allcontributors Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TasfinMahmud commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

1. dowhy/utils/encoding.py

2. dowhy/causal_estimators/distance_matching_estimator.py

Why both changes?

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

TasfinMahmud commented Jun 2, 2026

Uh oh!

amit-sharma commented Jun 6, 2026

Uh oh!

allcontributors Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TasfinMahmud commented May 26, 2026 •

edited

Loading

1. `dowhy/utils/encoding.py`

2. `dowhy/causal_estimators/distance_matching_estimator.py`