[Repo Assist] fix(cit): correct conditional_MI entropy computation for discrete variables #1547
Draft
github-actions[bot] wants to merge 1 commit into
…iables
The conditional_MI function was producing incorrect values because iterating
over a pandas DataFrame yields column labels (strings), not row values.
For a single-column DataFrame df[['Foo']], zip(df, Z_tuples) produced only
a single pair ('Foo', first_z_tuple) regardless of sample size, making
H(X,Z) = H(Y,Z) = H(X,Y,Z) = 0 and CMI = -H(Z) (always negative).
Fixes:
- Wrap string column names in a list to prevent character-by-character
iteration (also fixes the KeyError from #949)
- Call .squeeze(axis=1) to convert single-column DataFrames to Series,
so zip(X_series, Z_tuples) correctly iterates over n sample values
- Remove redundant intermediate variable assignments for Z
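The first two fixes can be sketched as follows. The helper column_values is hypothetical, written only to illustrate the idea; it is not a function from the patch, and the toy data is assumed:

```python
import pandas as pd


def column_values(data, col):
    """Hypothetical helper: return row values for one or more columns."""
    # A bare string would be split into characters by list(); wrap it instead.
    cols = [col] if isinstance(col, str) else list(col)
    frame = data[cols]
    # A single-column DataFrame becomes a Series, which iterates over rows.
    return frame.squeeze(axis=1) if frame.shape[1] == 1 else frame


df = pd.DataFrame({"Foo": [0, 1, 0, 1], "Z": [0, 0, 1, 1]})
z_tuples = list(df[["Z"]].itertuples(index=False, name=None))

chars = list("Foo")                  # the character split behind the KeyError
x_values = column_values(df, "Foo")  # Series of row values
pairs = list(zip(x_values, z_tuples))
print(chars, len(pairs))
```

After the squeeze, zip pairs each sample of X with its conditioning tuple, giving n pairs for n rows.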
With the corrected implementation:
- CMI for independent X, Y | Z is ~0 (correctly < 0.05 threshold)
- CMI for fully dependent X = Y | Z is ~1 bit (correctly > 0.05 threshold)
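These two values can be checked by hand with a plug-in entropy estimate. The sketch below is a standalone illustration in plain Python, not the dowhy implementation; the function names and the balanced toy samples are assumptions chosen so the result is exact:

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Plug-in Shannon entropy in bits of a sequence of hashable samples."""
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_mi(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(list(zs)))


# Independent case: every (x, y, z) combination appears exactly once.
xs, ys, zs = zip(*product([0, 1], repeat=3))
cmi_independent = conditional_mi(xs, ys, zs)

# Fully dependent case: Y is a copy of X, and X is balanced within each Z.
dx, dz = zip(*product([0, 1], repeat=2))
cmi_dependent = conditional_mi(dx, dx, dz)

# Pre-fix behaviour: one joint observation gives zero joint entropies.
one_pair = [("Foo", (0,))]
cmi_buggy = entropy(one_pair) + entropy(one_pair) - entropy(one_pair) - entropy(list(dz))

print(cmi_independent, cmi_dependent, cmi_buggy)
```

On this toy data the buggy value equals -H(Z) = -1 bit, which is below the 0.05 threshold even though the dependent case should fail it.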
Tests added in tests/utils/test_cit.py covering both the column-name bug
and the entropy correctness for independent and dependent discrete variables.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This was referenced May 27, 2026
🤖 This is an automated PR from Repo Assist.
Problem
The conditional_MI function in dowhy/utils/cit.py always produced incorrect values for the conditional mutual information of discrete variables.
Root cause: Iterating over a pandas DataFrame yields column labels (strings), not row values. For a single-column DataFrame data[['Foo']], zip(X_DataFrame, Z_tuples) produced exactly one pair, ('Foo', first_z_tuple), regardless of sample size.
This made:
- H(X,Z) = H(Y,Z) = H(X,Y,Z) = 0 (entropy of a 1-element sequence)
- CMI = 0 + 0 - 0 - H(Z) = -H(Z) (always negative)
Since the GraphRefuter uses the threshold cmi_val <= 0.05, a CMI of -H(Z) always passes, so the conditional independence test for discrete variables silently did nothing: it always declared CI to hold, even when variables were strongly dependent.
The same bug also caused a KeyError when column names had more than one character (e.g., 'Foo'): list('Foo') → ['F', 'o', 'o'], which then fails the DataFrame index lookup. This was reported in #949.
Changes
dowhy/utils/cit.py:
- Wrap string x/y arguments in a list ([x]) to avoid character iteration
- Call .squeeze(axis=1) on single-column DataFrames to get a Series; iterating a Series yields row values correctly
- Rename Zx to v for clarity
tests/utils/test_cit.py (new file):
- test_multi_char_column_names: regression for "Error handling column names in CIT as used by CausalModel.graph_refute method" #949 (no KeyError on 'Foo', 'Bar', 'Baz')
- test_single_char_column_names: ensures single-char names still work
- test_independent_variables_low_cmi: CMI < 0.05 for truly independent discrete variables
- test_dependent_variables_high_cmi: CMI > 0.5 for fully dependent discrete variables
Before vs After
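- Independent X, Y | Z: CMI was -H(Z) ≈ -1.0 bit before, ~0 after
- Fully dependent X = Y | Z: CMI was -H(Z) ≈ -1.0 bit before, ~1 bit after
- Column name 'Foo': KeyError before, no KeyError after
Test Status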
The four new unit tests pass locally (verified via direct Python execution with numpy, pandas, scipy). The full CI suite requires Poetry with all optional extras.
Closes #949