[Repo Assist] fix(cit): correct conditional_MI entropy computation for discrete variables #1547

Draft
github-actions[bot] wants to merge 1 commit into main from repo-assist/fix-conditional-mi-entropy-computation-cdeb66eb3dcdc68d


🤖 This is an automated PR from Repo Assist.

Problem

The conditional_MI function in dowhy/utils/cit.py always produced incorrect values for the conditional mutual information (CMI) of discrete variables.

Root cause: Iterating over a pandas DataFrame yields column labels (strings), not row values. For a single-column DataFrame data[['Foo']], zip(X_DataFrame, Z_tuples) produced exactly one pair, ('Foo', first_z_tuple), regardless of sample size.
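
The iteration behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the code from cit.py: the DataFrame contents and the names x_frame, z_tuples, pairs_from_frame and pairs_from_series are made up for illustration.

```python
import pandas as pd

# Toy data; the column names and values are illustrative assumptions.
data = pd.DataFrame({"Foo": [0, 1, 1, 0], "Z": [1, 1, 0, 0]})

x_frame = data[["Foo"]]  # single-column DataFrame, as in the buggy code path
z_tuples = list(data[["Z"]].itertuples(index=False, name=None))  # one tuple per row

# Iterating a DataFrame yields its column labels, so zip stops after one pair.
pairs_from_frame = list(zip(x_frame, z_tuples))

# Iterating a Series yields the row values, so zip produces one pair per sample.
pairs_from_series = list(zip(x_frame["Foo"], z_tuples))

print(len(pairs_from_frame), len(pairs_from_series))  # 1 4
```

With four rows, the DataFrame path still yields a single pair whose first element is the label 'Foo', which is why the joint entropies collapsed to zero.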

This made:

  • H(X,Z) = H(Y,Z) = H(X,Y,Z) = 0 (entropy of a 1-element sequence)
  • CMI = 0 + 0 - 0 - H(Z) = -H(Z) (always negative)
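
The arithmetic above follows from the identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The sketch below is a standard-library illustration of that identity, not the dowhy implementation; the helper names entropy and cmi and the toy samples are assumptions. The last computation mimics the bug by feeding one-element sequences into the three joint terms.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Shannon entropy (bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cmi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), in bits."""
    return (
        entropy(list(zip(x, z)))
        + entropy(list(zip(y, z)))
        - entropy(list(zip(x, y, z)))
        - entropy(list(z))
    )


# All 8 combinations of three fair bits: X, Y, Z are mutually independent.
x, y, z = zip(*product([0, 1], repeat=3))

cmi_independent = cmi(x, y, z)  # 2 + 2 - 3 - 1 = 0 bits
cmi_dependent = cmi(x, x, z)    # Y = X: 2 + 2 - 2 - 1 = 1 bit

# Mimic the bug: each joint sequence has a single element, so its entropy is 0.
cmi_buggy = (
    entropy([("Foo", z[0])])
    + entropy([("Bar", z[0])])
    - entropy([("Foo", "Bar", z[0])])
    - entropy(list(z))
)  # 0 + 0 - 0 - H(Z) = -1 bit
```

Because the degenerate value is -H(Z) regardless of how X and Y relate, it carries no information about dependence.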

Since the GraphRefuter uses the threshold cmi_val <= 0.05, a CMI of -H(Z) always passes, meaning the conditional independence test for discrete variables silently did nothing — it always declared CI to hold, even when variables were strongly dependent.

The same bug also caused a KeyError when column names had more than one character (e.g., 'Foo'): list('Foo') evaluates to ['F', 'o', 'o'], which then fails the DataFrame index lookup. This was reported in #949.
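
A small reproduction of the KeyError path, under the assumption that the column argument was passed through list() before indexing; the DataFrame and variable names here are illustrative and not taken from cit.py.

```python
import pandas as pd

# Toy frame; contents are illustrative assumptions.
data = pd.DataFrame({"Foo": [0, 1], "Bar": [1, 0]})

# list() on a string splits it into characters.
split_labels = list("Foo")  # ['F', 'o', 'o']

# None of 'F', 'o', 'o' is a column label, so the lookup raises KeyError.
try:
    data[split_labels]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Wrapping the string in a list keeps it as a single column label.
wrapped = data[["Foo"]]
```

A single-character name such as 'X' survives list() unchanged, which is presumably why the problem only surfaced with longer column names.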

Changes

dowhy/utils/cit.py:

  • Wrap string x/y arguments in a list ([x]) to avoid character iteration
  • Call .squeeze(axis=1) on single-column DataFrames to get a Series — iterating a Series yields row values correctly
  • Remove two redundant intermediate variables for Z
  • Rename lambda parameter from shadowing outer x to v for clarity
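
The first two changes can be sketched as follows. This is an approximation of the approach described above rather than the actual diff; the variable names and toy data are assumptions.

```python
import pandas as pd

# Toy frame; contents are illustrative assumptions.
data = pd.DataFrame({"Foo": [0, 1, 1, 0], "Bar": [1, 1, 0, 0]})

x = "Foo"

# Wrap a bare string so it is treated as one column label, not as characters.
x_cols = [x] if isinstance(x, str) else list(x)

# squeeze(axis=1) collapses only the column axis: a single-column DataFrame
# becomes a Series, whose iteration yields row values.
x_values = data[x_cols].squeeze(axis=1)

# Even with a single row the result stays a Series, because axis=1 leaves the
# row axis alone (a bare squeeze() would return a scalar here).
one_row = data[["Foo"]].head(1).squeeze(axis=1)

# A multi-column selection is returned unchanged as a DataFrame.
two_cols = data[["Foo", "Bar"]].squeeze(axis=1)
```

Restricting the squeeze to axis=1 keeps the return type predictable when the input happens to have only one sample.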

tests/utils/test_cit.py (new file):

  • test_multi_char_column_names: regression test for #949 ("Error handling column names in CIT as used by CausalModel.graph_refute method"); asserts no KeyError on 'Foo', 'Bar', 'Baz'
  • test_single_char_column_names: ensures single-char names still work
  • test_independent_variables_low_cmi: CMI < 0.05 for truly independent discrete variables
  • test_dependent_variables_high_cmi: CMI > 0.5 for fully dependent discrete variables

Before vs After

| Scenario | Before (broken) | After (fixed) |
|---|---|---|
| Independent X, Y \| Z | -H(Z) ≈ -1.0 bit | ~0.001 bits ✓ |
| Fully dependent X = Y \| Z | -H(Z) ≈ -1.0 bit | ~1.0 bit ✓ |
| Column name 'Foo' | KeyError | Works ✓ |

Test Status

The four new unit tests pass locally (verified via direct Python execution with numpy, pandas, scipy). The full CI suite requires Poetry with all optional extras.

Closes #949

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

…iables

The conditional_MI function was producing incorrect values because iterating
over a pandas DataFrame yields column labels (strings), not row values.
For a single-column DataFrame df[['Foo']], zip(df, Z_tuples) produced only
a single pair ('Foo', first_z_tuple) regardless of sample size, making
H(X,Z) = H(Y,Z) = H(X,Y,Z) = 0 and CMI = -H(Z) (always negative).

Fixes:
- Wrap string column names in a list to prevent character-by-character
  iteration (also fixes the KeyError from #949)
- Call .squeeze(axis=1) to convert single-column DataFrames to Series,
  so zip(X_series, Z_tuples) correctly iterates over n sample values
- Remove redundant intermediate variable assignments for Z

With the corrected implementation:
- CMI for independent X, Y | Z is ~0 (correctly < 0.05 threshold)
- CMI for fully dependent X = Y | Z is ~1 bit (correctly > 0.05 threshold)

Tests added in tests/utils/test_cit.py covering both the column-name bug
and the entropy correctness for independent and dependent discrete variables.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Error handling column names in CIT as used by CausalModel.graph_refute method

0 participants