[Repo Assist] perf(gcm): avoid redundant get_ordered_predecessors calls and pre-allocated DataFrame in fitting_sampling by github-actions[bot] · Pull Request #1544 · py-why/dowhy

github-actions · 2026-05-26T14:03:01Z

🤖 This is an automated pull request from Repo Assist.

Summary

Two performance improvements in dowhy/gcm/fitting_sampling.py:

1. Eliminate redundant `get_ordered_predecessors` call in `fit_causal_model_of_target`

Previously, get_ordered_predecessors(graph, node) was called twice per non-root node during fit:

# Call 1 — inside .fit()
X=training_data[get_ordered_predecessors(causal_model.graph, target_node)].to_numpy()
# Call 2 — for PARENTS_DURING_FIT
causal_model.graph.nodes[target_node][PARENTS_DURING_FIT] = get_ordered_predecessors(causal_model.graph, target_node)

Fix: compute once and reuse:

ordered_predecessors = get_ordered_predecessors(causal_model.graph, target_node)
causal_model.causal_mechanism(target_node).fit(X=training_data[ordered_predecessors]..., ...)
causal_model.graph.nodes[target_node][PARENTS_DURING_FIT] = ordered_predecessors

For every non-root node, this halves the number of sorted(graph.predecessors(node)) calls made per gcm.fit().
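The effect of computing the predecessors once can be illustrated with a small, dependency-free sketch. Everything here is hypothetical and only mimics the pattern: `PARENTS` is a toy graph, `ordered_predecessors` stands in for `get_ordered_predecessors`, and the `fit_node_*` functions record what a real fit would consume instead of fitting anything.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy graph: each node maps to the set of its parents.
PARENTS: Dict[str, set] = {"X": set(), "Y": {"X"}, "Z": {"X", "Y"}}

# Mutable counter so the lookup function can record how often it runs.
stats = {"lookups": 0}


def ordered_predecessors(node: str) -> List[str]:
    """Stand-in for get_ordered_predecessors: the sorted parents of a node."""
    stats["lookups"] += 1
    return sorted(PARENTS[node])


def fit_node_before(node: str, fitted: Dict[str, dict]) -> None:
    """Old pattern: one lookup for the fit inputs, another for bookkeeping."""
    record: dict = {}
    if PARENTS[node]:  # non-root node
        record["inputs"] = ordered_predecessors(node)
    record["parents_during_fit"] = ordered_predecessors(node)
    fitted[node] = record


def fit_node_after(node: str, fitted: Dict[str, dict]) -> None:
    """New pattern: look the parents up once and reuse the result."""
    parents = ordered_predecessors(node)
    record: dict = {}
    if parents:  # non-root node
        record["inputs"] = parents
    record["parents_during_fit"] = parents
    fitted[node] = record


def run(fit_node: Callable[[str, Dict[str, dict]], None]) -> Tuple[int, Dict[str, dict]]:
    """Fit every node with the given strategy and report the lookup count."""
    stats["lookups"] = 0
    fitted: Dict[str, dict] = {}
    for node in PARENTS:
        fit_node(node, fitted)
    return stats["lookups"], fitted


calls_before, fitted_before = run(fit_node_before)
calls_after, fitted_after = run(fit_node_after)
print(calls_before, calls_after)  # prints: 5 3
```

In this toy graph the root node X costs one lookup either way, so the total drops from 5 to 3 rather than exactly in half, while the recorded results are identical.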

2. Replace pre-allocated DataFrame with dict of arrays in `draw_samples`

Previously, a DataFrame was pre-allocated with pd.DataFrame(np.empty(...)), so it initially held uninitialized data, and was then filled column by column, which can trigger repeated copy operations in pandas 2.x (copy-on-write).

Fix: accumulate samples in a Dict[node, np.ndarray] and create the DataFrame once at the end:

drawn_samples: Dict[Any, np.ndarray] = {}
for node in sorted_nodes:
    ...
    drawn_samples[node] = causal_mechanism.draw_samples(parent_data).squeeze()
return pd.DataFrame(drawn_samples, columns=sorted_nodes)

This also removes the now-unnecessary _parent_samples_of helper, which was private to fitting_sampling.py. Separate copies of it exist in whatif.py and _noise.py and are unchanged.
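As a rough, self-contained illustration of the accumulate-then-build pattern, the sketch below samples a toy three-node model in topological order. The graph, the `draw_node` function, and the additive-noise mechanism are assumptions made for the example; they are not the DoWhy implementation.

```python
from typing import Dict, List, Optional

import numpy as np
import pandas as pd

# Hypothetical toy model: node -> sorted list of parents, in topological order.
PARENTS: Dict[str, List[str]] = {"X": [], "Y": ["X"], "Z": ["X", "Y"]}
sorted_nodes = ["X", "Y", "Z"]
num_samples = 5
rng = np.random.default_rng(0)


def draw_node(parent_data: Optional[np.ndarray], n: int) -> np.ndarray:
    """Toy mechanism: sum of the parent columns plus Gaussian noise, shape (n, 1)."""
    noise = rng.normal(size=(n, 1))
    if parent_data is None:  # root node: noise only
        return noise
    return parent_data.sum(axis=1, keepdims=True) + noise


# Accumulate one 1-D array per node instead of writing into a pre-allocated frame.
drawn_samples: Dict[str, np.ndarray] = {}
for node in sorted_nodes:
    parents = PARENTS[node]
    # Parents were sampled earlier in the loop, so their arrays are already available.
    parent_data = np.column_stack([drawn_samples[p] for p in parents]) if parents else None
    drawn_samples[node] = draw_node(parent_data, num_samples).squeeze()

# Build the DataFrame exactly once, after all columns exist.
samples = pd.DataFrame(drawn_samples, columns=sorted_nodes)
print(samples.shape)  # prints: (5, 3)
```

Because parent values are read straight from the dict, no intermediate DataFrame lookup is needed while sampling, which is why a helper for fetching parent samples from the frame becomes unnecessary in this pattern.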

Root Cause

No bug — these are proactive performance improvements. The pre-allocated DataFrame approach worked but incurred unnecessary overhead.

Trade-offs

No behaviour change: column order is preserved via columns=sorted_nodes
No API change: public function signatures are unchanged
Minor correctness improvement: avoids uninitialized data in the intermediate state
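The column-order claim can be checked in isolation. In this minimal sketch the dict keys are deliberately inserted out of order, and the node names and values are made up for the example:

```python
import numpy as np
import pandas as pd

# Keys inserted in a different order than the desired column order.
data = {
    "Z": np.array([5.0, 6.0]),
    "X": np.array([1.0, 2.0]),
    "Y": np.array([3.0, 4.0]),
}
sorted_nodes = ["X", "Y", "Z"]

# The columns argument selects and orders the dict entries.
frame = pd.DataFrame(data, columns=sorted_nodes)
print(list(frame.columns))  # prints: ['X', 'Y', 'Z']
```

One caveat to keep in mind: a name listed in `columns` that is missing from the dict yields an all-NaN column rather than an error, so every node in sorted_nodes needs an entry in the dict.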

Test Status

Tests could not be run in this environment (no Python runtime with dependencies available). CI will validate. The changes are semantically equivalent transformations with no logic changes.

Relevant test files:

tests/gcm/test_data_generator.py — exercises draw_samples
tests/gcm/test_graph.py — exercises fit and draw_samples
tests/gcm/test_confidence_intervals_cms.py

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run
gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

…ocated DataFrame in fitting_sampling

In fit_causal_model_of_target, get_ordered_predecessors was called twice for every non-root node (once for fitting, once for PARENTS_DURING_FIT). Store the result once per node.

In draw_samples, the pre-allocated pd.DataFrame(np.empty(...)) was filled column-by-column, which triggers repeated copy operations in pandas 2.x (copy-on-write). Switch to a dict of numpy arrays and construct the DataFrame once at the end. Also removes the now-unnecessary _parent_samples_of helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

github-actions Bot added the labels automation, enhancement (New feature or request), performance, and repo-assist on May 26, 2026

This was referenced May 26, 2026

[Repo Assist] Monthly Activity 2026-05 #1494

Closed

[Repo Assist] Monthly Activity 2026-06 #1559

Open

github-actions Bot mentioned this pull request Jun 2, 2026

[Repo Assist] perf(gcm): avoid pre-allocated DataFrame in compute_data_from_noise and compute_noise_from_data #1561

Draft

github-actions[bot] wants to merge 1 commit into main from repo-assist/perf-fitting-sampling-20260526-d067d6062e6e6591
