[Repo Assist] fix: implement asymptotic CI/SE via Delta method for LinearRegressionEstimator with effect modifiers#1432
Conversation
… in LinearRegressionEstimator (issue #336) The _estimate_confidence_intervals and _estimate_std_error methods in LinearRegressionEstimator previously raised NotImplementedError when effect modifiers were present. Implement the Delta method (Gelman & Hill, ARM Book Ch.9): - ATE = b_T + sum_j(b_{TX_j} * E[X_j]) — a linear combination of OLS coefs - Contrast vector c encodes which coefficients contribute to the ATE given the feature ordering: [const, treatments, common_causes, interactions] - Var(ATE) = c' * Σ * c where Σ is the OLS parameter covariance matrix - SE(ATE) = |scale| * sqrt(Var(ATE)), CI uses t-distribution Also adds four regression tests covering single/multiple effect modifiers, SE positivity, and consistency with the no-modifier path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
… add tests ## What the AI's PR (py-why#1432) got right The overall approach is correct and well-structured: - Correctly identifies the Delta method as the solution: for ATE = c'β, Var(ATE) = c'Σc using model.cov_params() from statsmodels - Correctly uses scipy.stats.t with model.df_resid for finite-sample CIs - Correctly scales by (treatment_value - control_value) consistent with the existing no-modifier code path - max(var_ate, 0.0) guard against floating point negatives is good practice - _estimate_std_error and _estimate_confidence_intervals are both updated consistently via the shared _ate_and_se_for_treatment helper ## What needed fixing Bug: _ate_and_se_for_treatment used len(names) to count columns when building the contrast vector, but categorical variables are one-hot encoded by _encode() and expand into multiple columns (k-1 columns for k categories, with drop_first=True). This made interaction_start point at the wrong coefficient index, silently producing incorrect CIs with no error raised. Concretely: a 3-level categorical common cause W produces 2 encoded columns, but len(observed_common_causes_names) = 1, so interaction_start was off by 1, selecting a confounder dummy coefficient instead of the T·X interaction term. The same issue affected n_effect_modifiers when effect modifiers are categorical — len(effect_modifier_names) would undercount encoded columns, causing the em_means slice to be too short. ## Fixes applied 1. Replace len(self._observed_common_causes_names) with self._observed_common_causes.shape[1] to count actual encoded columns 2. Derive n_effect_modifiers from len(em_means) where em_means comes from self._effect_modifiers.mean(axis=0).to_numpy() — the already-encoded DataFrame — so the count always matches the actual column layout 3. Add an assert that n_params equals the expected total, turning silent wrong-index bugs into an immediate, descriptive error if column ordering ever changes in _build_features ## Tests added (TestLinearRegressionAsymptoticCI) - test_ci_no_error_continuous_common_cause: baseline, no raise for continuous W - test_ci_no_error_categorical_common_cause: no raise for 3-level categorical W - test_ci_uses_actual_encoded_column_count_not_name_count: regression test that explicitly verifies shape[1] > len(names) for categorical W and that the internal assert passes (proving the right index is used) - test_ci_contains_estimate: CI brackets the estimated ATE value All 11 tests pass (7 existing + 4 new). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kf-rahman
left a comment
There was a problem hiding this comment.
Hi — thanks for the automated draft. I reviewed the code and the overall approach is solid, but there is a bug with categorical variables that needs fixing before this can be merged. Here's my full review.
What the PR gets right
The Delta method is the correct approach. For ATE = c'β, Var(ATE) = c'Σc using model.cov_params() from statsmodels is the standard, textbook solution. For OLS it's actually exact, not just an approximation.
Specific things done well:
scipy.stats.twithmodel.df_residfor finite-sample CIs — correct- Scaling by
(treatment_value - control_value)is consistent with the existing no-modifier code path max(var_ate, 0.0)guard against floating point negatives is good defensive coding- Both
_estimate_std_errorand_estimate_confidence_intervalsare updated via the shared_ate_and_se_for_treatmenthelper — clean design
Bug: categorical variables produce silently wrong CIs
_ate_and_se_for_treatment counts common cause and effect modifier columns using variable name counts:
n_common_causes = len(self._observed_common_causes_names) # counts names
n_effect_modifiers = len(self._effect_modifier_names) # counts namesBut _encode() one-hot encodes categorical variables with drop_first=True, so a variable with k levels becomes k-1 columns, not 1. This means interaction_start points at the wrong coefficient index — silently, with no error raised.
Concrete example: a 3-level categorical common cause W produces 2 encoded columns, but len(names) = 1. So interaction_start is off by 1 and grabs a confounder dummy coefficient instead of the T·X interaction term. The same issue applies to categorical effect modifiers.
I verified this with a synthetic dataset:
len(observed_common_causes_names) = 1 ← what the PR uses
observed_common_causes.shape[1] = 2 ← actual encoded columns
interaction_start (buggy): 3 → coefficient 'x3' (a W dummy — wrong)
interaction_start (fixed): 4 → coefficient 'x4' (the T·X term — correct)
Fixes
1. Use encoded column counts instead of name counts:
# Replace:
n_common_causes = len(self._observed_common_causes_names)
n_effect_modifiers = len(self._effect_modifier_names)
em_means = np.asarray(self._effect_modifiers.mean(axis=0))
# With:
n_common_causes = self._observed_common_causes.shape[1] if self._observed_common_causes is not None else 0
em_means = self._effect_modifiers.mean(axis=0).to_numpy()
n_effect_modifiers = len(em_means)2. Add an assert to catch ordering mismatches early (instead of silently wrong CIs):
assert n_params == 1 + n_treatments + n_common_causes + n_treatments * n_effect_modifiers, (
f"Model has {n_params} params but expected "
f"{1 + n_treatments + n_common_causes + n_treatments * n_effect_modifiers}. "
"Column ordering assumption in _ate_and_se_for_treatment may be broken."
)3. Add tests covering categorical common causes — the existing tests only use continuous variables and would not catch this bug. See branch kf-rahman/dowhy:fix/issue-336-categorical-encoding for the full implementation with 4 new tests in TestLinearRegressionAsymptoticCI:
test_ci_no_error_continuous_common_causetest_ci_no_error_categorical_common_causetest_ci_uses_actual_encoded_column_count_not_name_count(regression test for this exact bug)test_ci_contains_estimate
All 11 tests pass (7 existing + 4 new).
The fix is straightforward — happy to help get this merged once the categorical encoding issue is addressed.
|
Hi @kf-rahman thank you for this review of the PR and for catching this implementation bug! Yes, could you push your fix to the branch for this PR? repo-assist/fix-issue-336-linear-regression-asymptotic-ci-4b5b9900c6c0a820 Once you do that, we can run the full suite of tests and merge it in |
|
I don't have write access to push directly to the branch, so I've opened a PR targeting it instead: #1444 It contains the fix merged with your existing tests — all 4 original tests are preserved and the 4 new categorical encoding tests are added on top. @emrekiciman |
…hod CI _ate_and_se_for_treatment was computing interaction_start using len(variable_names) instead of the actual number of encoded columns. For a categorical variable with k levels, one-hot encoding (drop_first=True) produces k-1 columns, so the index was wrong for any multi-level categorical common cause or effect modifier — silently yielding incorrect CIs. Fixes: - Use self._observed_common_causes.shape[1] (encoded width) instead of len(self._observed_common_causes_names) for n_common_causes - Use self._effect_modifiers.mean(axis=0).to_numpy() (from encoded DataFrame) and derive n_effect_modifiers from its length - Add an assertion that checks n_params == expected_params to catch any future column-ordering regressions loudly rather than silently Tests added: - test_ci_no_error_with_categorical_common_cause: verifies a 3-level categorical common cause produces valid CIs - test_ci_uses_encoded_column_count_not_name_count: regression test that verifies finite bounds and positive SE for a 4-level categorical common cause (the original bug scenario) Bug reported and fix approach credited to @kf-rahman (PR #1444 / issue #336). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
Commit pushed:
|
|
🤖 This is an automated update from Repo Assist. Thank you What was changed (commit
PR #1444 (which contained the same fix) can now be closed since the changes are incorporated here.
Note 🔒 Integrity filter blocked 125 itemsThe following items were blocked because they don't meet the GitHub integrity level.
To allow these resources, lower tools:
github:
min-integrity: approved # merged | approved | unapproved | none
|
…asymptotic-ci-4b5b9900c6c0a820
Signed-off-by: Emre Kiciman <emrek@microsoft.com>
|
@copilot resolve the merge conflicts in this pull request |
…asymptotic-ci-4b5b9900c6c0a820 Signed-off-by: GitHub <noreply@github.com> Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com>
Merge conflicts resolved in commit |
…asymptotic-ci-4b5b9900c6c0a820
* fix: identify all valid mediators, not just the first (#1334) identify_mediation() broke after finding the first valid mediator, silently dropping parallel mediators from NIE/NDE estimands. Signed-off-by: Kaushik Rajan <kaushi@alumni.ncsu.edu> * fix: rename treatment node D->X in parallel-mediators test graph The backdoor identifier tests hardcode action_nodes=["X"]; the graph fixture must use X as the treatment label to match. Signed-off-by: Kaushik Rajan <kaushi@alumni.ncsu.edu> * test: add comment explaining X/Y naming convention in parallel-mediators fixture Signed-off-by: Kaushik Rajan <kaushi@alumni.ncsu.edu> * fix: sort mediators for deterministic estimand output eligible_variables is a set, so iteration order is non-deterministic across runs. Signed-off-by: Kaushik Rajan <kaushi@alumni.ncsu.edu> --------- Signed-off-by: Kaushik Rajan <kaushi@alumni.ncsu.edu> Co-authored-by: Emre Kıcıman <emrek@microsoft.com>
) (#1467) * fix: support multiple treatments in PlaceboTreatmentRefuter (closes #251) With multiple treatments, _refute_once was assigning a 2-D array as a single 'placebo' column, raising: ValueError: Wrong number of items passed N, placement implies 1 Root cause: the single-column assign(placebo=...) pattern cannot accept a 2-D array or multi-column DataFrame. Fix: * Add _get_placebo_names(): returns ['placebo'] for one treatment (backward-compatible) and ['placebo_<name>'] per treatment otherwise. * Extract _generate_random_placebo(): generates a per-treatment random Series respecting the original dtype (float/bool/int/category). * _refute_once now iterates over treatments, creating one placebo column per treatment using the above helpers. * refute_placebo_treatment sets identified_estimand.treatment_variable to the full list of placebo column names. * Add two parametrized tests covering 3-treatment permute and random data cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * fix: address review comments in placebo_treatment_refuter - Use pd.api.types helpers (is_float_dtype/is_bool_dtype/is_integer_dtype) instead of brittle dtype name substring checks; handles pandas extension types like Int64/Float64 correctly - Preserve original category set and ordering in categorical placebo via pd.Categorical with explicit categories/ordered params - Fix PERMUTE path to permute each treatment column individually as a pandas Series (preserving dtype), then reset index to avoid alignment issues instead of converting whole DataFrame to a numpy array Signed-off-by: GitHub Copilot <copilot@github.com> Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/cf64f618-f9bd-448f-9d78-ad2e6d9f4af5 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> * fix: add explanatory comment for set_axis index reset in PERMUTE path Signed-off-by: GitHub Copilot <copilot@github.com> Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/cf64f618-f9bd-448f-9d78-ad2e6d9f4af5 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> --------- Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> Co-authored-by: Emre Kıcıman <emrek@microsoft.com>
#814) (#1465) * fix: correct exact_match_cols logic in DistanceMatchingEstimator (closes #814) Three bugs in estimate_effect() with exact_match_cols: 1. Inner loop used global numtreatedunits instead of the per-group treated count, causing IndexError / wrong summation when group sizes differ from the overall dataset size. 2. No guard when a group has zero control units, which caused NearestNeighbors.fit() to receive an empty array. 3. After the groupby loop the names 'treated' and 'control' were left pointing at the last group's subsets, so the ATC branch (and ATE = ATT + ATC) operated on stale, partial data. Fix: - Rename loop variables to group_treated / group_control so outer names are never clobbered. - Iterate over group_treated.shape[0] (not numtreatedunits). - Skip groups with no control units (same as the existing skip for no treated units). - Accumulate total_treated_matched for a correct per-sample ATT average. - Restore treated/control from saved copies before the ATC block. Also adds tests/causal_estimators/test_distance_matching_estimator.py with 12 tests covering ATT/ATC/ATE estimation, matched_indices population, exact matching correctness, and edge-case error handling. Note: C901 complexity on estimate_effect was pre-existing (21); this change adds one branch (complexity 22). A dedicated refactor PR would be appropriate to bring it below 10. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Update tests/causal_estimators/test_distance_matching_estimator.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Emre Kıcıman <emrek@microsoft.com> * fix: populate matched_indices_att in exact_match_cols branch and fix test method_params - Initialize self.matched_indices_att = {} before the groupby loop in the exact-match branch - Populate matched_indices_att per group_treated row using original dataframe indices - Fix tests to use method_params={"fit_params": {"exact_match_cols": [...]}} so exact_match_cols is correctly routed to fit() (not just __init__) - Update test_exact_match_restricts_matches_to_same_group to iterate matched_indices_att.items() and verify each matched control has the same W_cat value as the treated unit Signed-off-by: GitHub Copilot <copilot@github.com> Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/58b2fe09-f297-4b0e-a6f9-a8f4c987b350 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> * style: break long line in exact-match branch for 120-char limit Signed-off-by: GitHub Copilot <copilot@github.com> Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/58b2fe09-f297-4b0e-a6f9-a8f4c987b350 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> --------- Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: Emre Kıcıman <emrek@microsoft.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Emre Kıcıman <emrek@microsoft.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com>
Bumps [idna](https://github.com/kjd/idna) from 3.10 to 3.15. - [Release notes](https://github.com/kjd/idna/releases) - [Changelog](https://github.com/kjd/idna/blob/master/HISTORY.md) - [Commits](kjd/idna@v3.10...v3.15)
…1 lambdas in docs config (#1536) * Initial plan * Fix flaky auto-model test and E731 in docs conf Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/8cee7c0c-7525-48fe-acd0-4aa62c812736 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com>
|
Commit pushed:
|
🤖 This is an automated PR from Repo Assist, an AI assistant.
Closes #336.
Root Cause
LinearRegressionEstimator._estimate_confidence_intervalsand_estimate_std_errorboth raisedNotImplementedErrorwhenever effect modifiers were present. The TODO comment pointed to Gelman & Hill ARM Book Chapter 9.Fix: Delta Method
When effect modifiers are present, the Average Treatment Effect is a linear combination of OLS coefficients:
where
cis the contrast vector (matching the feature column ordering produced by_build_features:[const, treatments, common_causes, interactions]) andΣis the OLS parameter covariance matrix (model.cov_params()).The implementation:
_ate_and_se_for_treatment(treatment_index)— builds the contrast vectorc, computesATE = c'βandSE = sqrt(c'Σc)._estimate_confidence_intervalsloops over all treatments, applies the t-distribution margin (scipy.stats.t.ppfwithmodel.df_residdegrees of freedom) and returns shape(n_treatments, 2)matching the existing no-modifier return shape._estimate_std_errorreturns per-treatment SEs scaled by|treatment_value - control_value|.Multiple treatments and multiple effect modifiers are both handled correctly.
Changes
dowhy/causal_estimators/linear_regression_estimator.py— new_ate_and_se_for_treatmenthelper; replacedraise NotImplementedErrorin_estimate_confidence_intervalsand_estimate_std_errortests/causal_estimators/test_linear_regression_estimator.py— addedTestLinearRegressionAsymptoticCIwith 4 tests:NotImplementedErrorraised for single treatment + single EMTest Status
ast.parseon both changed files)black --checkpassesisort --checkpassesNote
🔒 Integrity filter blocked 46 items
The following items were blocked because they don't meet the GitHub integrity level.
list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".To allow these resources, lower
min-integrityin your GitHub frontmatter: