Add GCM example notebook: auditing CNN predictions for spurious correlations using chest X-ray data #1524

Open
sanzits wants to merge 3 commits into py-why:main from sanzits:add-chest-xray-gcm-notebook

Conversation

@sanzits sanzits commented May 14, 2026

What the notebook demonstrates

A CNN trained on chest X-rays can achieve decent AUC and still be learning the wrong things. The notebook shows that roughly 28% of what drives this model's predictions comes from image brightness — a property of the scanner hardware, not the patient's lungs. We use DoWhy's GCM framework to open the black box and separate the legitimate clinical signal from the spurious scanner artifact. The core message is simple: good predictive accuracy does not mean the model learned the right causal relationships.

Which DoWhy features it uses
Three GCM tools, each answering a different question:

gcm.intrinsic_causal_influence: "of everything driving model predictions, how much does each variable causally own?" This is the headline analysis. It tells you that image brightness owns 4.8x more of the model's predictive variance than the actual clinical signal (opacity score).
gcm.arrow_strength: "how load-bearing is each individual edge in the causal graph?" Strength is measured by KL divergence: if you removed that edge, how much would the prediction distribution change? The brightness→prediction edge (0.014) is roughly 6 to 7 times stronger than the opacity→prediction edge (0.002); the exact ratio depends on the unrounded values.
gcm.interventional_samples: "what actually happens if we swap the scanner?" This runs a do-intervention: force every image to have the brightness of scanner group 1, then propagate that change through the causal graph. Mean predictions shift from 0.44 to 0.56, a 12-percentage-point change with no change to any patient's clinical condition.
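
To make the scanner-swap idea concrete, here is a minimal standard-library sketch of a do-intervention on a toy structural causal model. It is not the notebook's code and does not call DoWhy: the graph (scanner → brightness → prediction ← opacity), the function names, and every coefficient are made up for illustration, so the numbers it prints are unrelated to the results quoted above.

```python
import random
from statistics import fmean


def predict(rng, force_scanner=None):
    """Draw one prediction from a toy SCM: scanner -> brightness -> prediction <- opacity."""
    scanner = rng.randint(0, 1)              # natural scanner assignment (group 0 or 1)
    opacity = rng.gauss(0.0, 1.0)            # clinical signal, independent of the scanner
    brightness_noise = rng.gauss(0.0, 0.05)
    if force_scanner is not None:
        scanner = force_scanner              # do(scanner = value): override the natural assignment
    brightness = 0.3 + 0.4 * scanner + brightness_noise
    # Made-up "model" that leans on brightness far more than on opacity.
    return 0.2 + 0.5 * brightness + 0.02 * opacity


def mean_prediction(n, seed, force_scanner=None):
    rng = random.Random(seed)                # same seed -> same patients and same noise
    return fmean(predict(rng, force_scanner) for _ in range(n))


observed = mean_prediction(20_000, seed=0)
intervened = mean_prediction(20_000, seed=0, force_scanner=1)
shift = intervened - observed
print(f"observed={observed:.3f}  do(scanner=1)={intervened:.3f}  shift={shift:+.3f}")
```

Because both runs reuse the same seed, every simulated patient keeps the same opacity and noise; only the scanner changes, which is what distinguishes an intervention from simply filtering the data to scanner group 1. In DoWhy the analogous step is a call to gcm.interventional_samples on a fitted causal model, with the intervention passed as a dictionary that maps a node name to a function returning the forced value.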

The dataset and how to get it
NIH Chest X-ray Dataset, published by Wang et al. at CVPR 2017. It contains 112,120 frontal-view chest X-rays from 30,805 unique patients, with 14 disease labels. The notebook uses a 15,000-image stratified subset (first 2 folders) targeting the Infiltration label (about 15% prevalence).
How to get it: it's publicly available on Kaggle at https://www.kaggle.com/datasets/nih-chest-xrays/data. You need a free Kaggle account. The notebook's first cell has the exact download and setup instructions including how to create the stratified subset and generate the feature CSV that DoWhy consumes.
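
For readers who want a feel for what "stratified" means here, the following is a minimal sketch of drawing a subset that preserves label prevalence. It is an assumption about the general technique, not the notebook's actual setup cell: the function name stratified_subset, the toy label list, and the subset size are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(labels, subset_size, seed=0):
    """Return sorted indices of a subset that keeps each class's share of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Each class contributes in proportion to its share of the full data.
        k = round(subset_size * len(indices) / len(labels))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical labels: 1 = Infiltration, 0 = anything else, at 15% prevalence.
labels = [1] * 150 + [0] * 850
subset = stratified_subset(labels, subset_size=100)
prevalence = sum(labels[i] for i in subset) / len(subset)
print(len(subset), prevalence)
```

Per-class rounding means the returned subset can be off by an image or two from subset_size when the class shares do not divide evenly.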

Where it should be listed
Real world-inspired examples section, alongside the microservice latency notebook and the counterfactual medical case. Level: Advanced. Task: Intrinsic causal influence, intervention, and root cause analysis via GCM.
It fits here rather than "Examples on benchmark datasets" because the point isn't to benchmark DoWhy on a standard dataset; it's to show a real-world use case where causal auditing of a deployed model reveals something that standard ML evaluation would miss.

…lations using chest X-ray data

Signed-off-by: Sanchit <sanzit.s@gmail.com>
Contributor

github-actions Bot commented Jun 1, 2026

🤖 This is an automated response from Repo Assist.

Welcome, @sanzits! This is a compelling real-world demonstration of DoWhy's GCM capabilities. The use case — auditing a chest X-ray CNN to reveal that a disproportionate share of its predictive signal comes from scanner hardware brightness rather than clinical pathology — is exactly the kind of story that motivates the GCM framework and makes it concrete for practitioners.

A few notes for review:

External data dependency: The notebook requires downloading the NIH Chest X-ray Dataset from Kaggle, so it cannot be executed in automated CI. Please add a clear note at the top of the notebook about the data download requirement, and make sure the notebook is marked so that CI skips it. Check whether it should carry the @pytest.mark.skip decorator in test_notebooks.py or a similar exclusion.

Docs indexing: Please ensure the notebook is referenced in the appropriate RST index file under docs/source/example_notebooks/ so it appears in the rendered documentation site.

CNN training time: If the notebook includes full model training from scratch, runtime may be very long for anyone following along. Consider noting approximate wall-clock time and whether a pre-trained weights option is feasible for the demo.

Despite these practical considerations, the analytical story (train model → construct causal graph → apply GCM intrinsic influence / arrow strength / interventional samples → interpret) is clean and instructive. The intrinsic_causal_influence result quantifying scanner artifact vs. clinical signal is a great headline finding that illustrates why causal auditing matters beyond accuracy metrics.

Author

sanzits commented Jun 2, 2026

Hello, thanks for the detailed feedback! I've addressed all three points:

Added a data download warning at the top of the notebook with instructions for the NIH Chest X-ray dataset
Committed the notebook with all cell outputs pre-saved so it doesn't need to re-execute during the docs build
Added approximate training time (~45–60 min on GPU, ~4–6 hrs on CPU)

Happy to make any further changes. Thanks!

sanzits added 2 commits June 4, 2026 17:03
…ing time note

Signed-off-by: Sanchit <sanzit.s@gmail.com>
…ing time note

Signed-off-by: Sanchit <sanzit.s@gmail.com>
@sanzits sanzits force-pushed the add-chest-xray-gcm-notebook branch from d2ee462 to dba5fe8 Compare June 4, 2026 22:03