Skip to content

Fix data subset refuter index misalignment#1546

Open
TasfinMahmud wants to merge 2 commits into
py-why:mainfrom
TasfinMahmud:fix-data-subset-categorical-bug
Open

Fix data subset refuter index misalignment#1546
TasfinMahmud wants to merge 2 commits into
py-why:mainfrom
TasfinMahmud:fix-data-subset-categorical-bug

Conversation

@TasfinMahmud
Copy link
Copy Markdown

@TasfinMahmud TasfinMahmud commented May 26, 2026

Preserves index in encoding.py to prevent Unalignable boolean Series errors in distance_matching_estimator.

Closes #1372

Changes

1. dowhy/utils/encoding.py

  • Removed the two reset_index(drop=True) calls in one_hot_encode that were discarding the original DataFrame index after encoding. This preserves index alignment when the function is called on subsets of data (e.g., by DataSubsetRefuter).

2. dowhy/causal_estimators/distance_matching_estimator.py

  • Changed estimate_effect() to re-encode self._observed_common_causes fresh on every call instead of reusing the stale cached version from fit(). This ensures the encoded data always matches the current (potentially subsetted) DataFrame index, preventing index misalignment even if encoding preserves index correctly.

Why both changes?

The encoding.py fix alone prevents the immediate crash, but the estimator would still carry a stale cached reference from the fitting phase with the original (pre-subset) index. Re-encoding on each estimate_effect() call is the more robust fix that handles all refuter scenarios correctly.

…ssue py-why#1372)

Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>
@TasfinMahmud TasfinMahmud force-pushed the fix-data-subset-categorical-bug branch from 8646f73 to 98b091b Compare May 26, 2026 17:48
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

🤖 This is an automated response from Repo Assist.

Welcome to the project, @TasfinMahmud! Thanks for tracking down and fixing this bug.

This PR addresses issue #1372 — the Unalignable boolean Series error that occurs when DataSubsetRefuter is used with categorical columns and DistanceMatchingEstimator.

Heads-up — overlap with an existing PR: There is an existing Repo Assist PR (#1533) that makes the identical encoding.py fix (removing the two reset_index(drop=True) calls). However, your PR goes further with an additional change in distance_matching_estimator.py that re-encodes the data fresh on every estimate_effect() call, rather than reusing the stale self._observed_common_causes cached during fit(). This is actually the more robust fix: even with the corrected encoding, the cached object from the fitting phase would still carry the original (pre-subset) index and would mismatch the refuter's sampled data.

A few suggestions:

Thanks again for the contribution!

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

… with categorical columns (py-why#1372)

Signed-off-by: Tasfin Mahmud <tasfinmahmud1@gmail.com>
@TasfinMahmud
Copy link
Copy Markdown
Author

Thanks for the thorough review, @github-actions[bot]! I've addressed all three suggestions:

  1. Closes data subset refuter bug when dataframe has categorical columns #1372 — Added to the PR body so the issue auto-closes on merge.

  2. Regression test — I've pushed a new commit with a focused integration test ( est_data_subset_refuter_with_categorical_columns) that exercises the exact failing path from Issue data subset refuter bug when dataframe has categorical columns #1372: DistanceMatchingEstimator + DataSubsetRefuter + categorical common causes. The test creates a DataFrame with a pd.Categorical column, runs distance matching, and then runs the data subset refuter — which would previously raise ValueError: Unalignable boolean Series provided as indexer.

  3. Regarding overlap with PR [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 — I understand that [Repo Assist] fix: preserve DataFrame index in one_hot_encode to fix DataSubsetRefuter with categorical columns #1533 covers the encoding.py fix with its own test for one_hot_encode index preservation. My PR goes a step further by also fixing distance_matching_estimator.py to re-encode data fresh in estimate_effect() rather than reusing the stale cached self._observed_common_causes from it(). This is the more robust fix because even with correct index preservation in encoding, the cached object would still carry the original (pre-subset) index. Happy to help consolidate the two PRs if the maintainers prefer that approach!

@amit-sharma
Copy link
Copy Markdown
Member

@all-contributors please add @TasfinMahmud for code

@allcontributors
Copy link
Copy Markdown
Contributor

@amit-sharma

I've put up a pull request to add @TasfinMahmud! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

data subset refuter bug when dataframe has categorical columns

2 participants