[Repo Assist] perf(gcm): avoid pre-allocated DataFrame in compute_data_from_noise and compute_noise_from_data by github-actions[bot] · Pull Request #1561 · py-why/dowhy

github-actions · 2026-06-02T14:21:43Z

🤖 This is an automated pull request from Repo Assist, an AI assistant.

Summary

Extends the dict-of-arrays accumulation pattern (introduced in #1544 for fitting_sampling.py) to two more functions in dowhy/gcm/_noise.py.

Affected functions

compute_data_from_noise and compute_noise_from_data both used the pre-allocated DataFrame anti-pattern:

# Before (inefficient with pandas 2.x CoW)
data = pd.DataFrame(np.empty((noise_data.shape[0], len(sorted_nodes))), columns=sorted_nodes)
for node in sorted_nodes:
    ...
    data[node] = ...  # triggers copy on each assignment
return data

# After (no intermediate copies)
data: Dict[Any, np.ndarray] = {}
for node in sorted_nodes:
    ...
    data[node] = ...  # plain dict insertion
return pd.DataFrame(data, columns=sorted_nodes)  # single allocation
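
As a minimal, self-contained sketch of the two patterns above (not the actual dowhy code), the following compares them on a toy problem. The three-node chain, the `parents_of` mapping, and `toy_mechanism` are hypothetical stand-ins for the causal graph and the node mechanisms. The only thing it demonstrates is that both variants build the same DataFrame.

```python
from typing import Any, Dict, List

import numpy as np
import pandas as pd

# Hypothetical toy graph X -> Y -> Z; in dowhy this would come from the causal model.
sorted_nodes: List[str] = ["X", "Y", "Z"]
parents_of: Dict[str, List[str]] = {"X": [], "Y": ["X"], "Z": ["Y"]}


def toy_mechanism(parent_values: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Illustrative additive mechanism: row-wise sum of the parents plus noise."""
    return parent_values.sum(axis=1) + noise


def with_preallocated_frame(noise_data: pd.DataFrame) -> pd.DataFrame:
    # Old pattern: allocate the full frame first, then overwrite column by column.
    data = pd.DataFrame(np.empty((noise_data.shape[0], len(sorted_nodes))), columns=sorted_nodes)
    for node in sorted_nodes:
        noise = noise_data[node].to_numpy()
        if not parents_of[node]:
            data[node] = noise
        else:
            data[node] = toy_mechanism(data[parents_of[node]].to_numpy(), noise)
    return data


def with_dict_accumulation(noise_data: pd.DataFrame) -> pd.DataFrame:
    # New pattern: collect plain arrays in a dict, build the frame once at the end.
    data: Dict[Any, np.ndarray] = {}
    for node in sorted_nodes:
        noise = noise_data[node].to_numpy()
        if not parents_of[node]:
            data[node] = noise
        else:
            parent_matrix = np.column_stack([data[p] for p in parents_of[node]])
            data[node] = toy_mechanism(parent_matrix, noise)
    return pd.DataFrame(data, columns=sorted_nodes)


rng = np.random.default_rng(0)
noise_data = pd.DataFrame(rng.normal(size=(5, 3)), columns=sorted_nodes)
old_result = with_preallocated_frame(noise_data)
new_result = with_dict_accumulation(noise_data)
pd.testing.assert_frame_equal(old_result, new_result)
```

Passing `columns=sorted_nodes` to the final constructor keeps the column order explicit rather than relying on dict insertion order.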

`compute_data_from_noise` — additional improvement

The parent-data lookup moved from DataFrame column-selection to direct dict access:

# Before: DataFrame slice → triggers another copy
data[get_ordered_predecessors(causal_model.graph, node)].to_numpy()

# After: build parent matrix directly from already-computed arrays
parents = get_ordered_predecessors(causal_model.graph, node)
np.column_stack([data[p] for p in parents])

np.column_stack on 1-D arrays produces the same (N, k) shape as the original .to_numpy() call, so the contract with .evaluate() is unchanged.
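
The shape claim can be checked in isolation with a small sketch. The column names and the `parents` list are made up for illustration. The last few lines only show NumPy behavior for an empty list; the excerpt above does not show how parentless nodes are handled in `_noise.py`, so that part is a general caveat rather than a statement about the patch.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, 9.0]})
arrays = {name: frame[name].to_numpy() for name in frame.columns}

parents = ["c", "a"]  # hypothetical ordered predecessors of some node

from_frame = frame[parents].to_numpy()                      # DataFrame column selection
from_dict = np.column_stack([arrays[p] for p in parents])   # dict of 1-D arrays

# A single parent still yields a 2-D (N, 1) matrix, not a flat vector.
single = np.column_stack([arrays["a"]])

# np.column_stack needs at least one array, so an empty parent list
# has to be handled before this call.
try:
    np.column_stack([])
    empty_parents_ok = True
except ValueError:
    empty_parents_ok = False
```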

Why this matters

With pandas ≥ 2.0 copy-on-write semantics (opt-in in 2.x), assigning to a column of a pre-allocated DataFrame can force an internal copy of the backing block on every iteration. For a graph with n nodes, this adds up to O(n²) memory work. Accumulating results in a plain dict and constructing the DataFrame once at the end reduces that to O(n).

These functions are called inside tight loops (get_noise_dependent_function → _get_exact_noise_dependent_function) and, through compute_noise_from_data, in counterfactual and noise-estimation workflows, so the saving compounds.
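
One way to check the cost on a given setup is a small timing harness such as the sketch below. The helper names, the column count, and the row count are arbitrary choices for illustration, and the sketch makes no claim about which variant wins: the result depends on the pandas version and on whether copy-on-write is enabled.

```python
import time
from typing import Callable, Dict, List

import numpy as np
import pandas as pd


def fill_preallocated(columns: List[str], n_rows: int) -> pd.DataFrame:
    # Pre-allocate the whole frame, then overwrite one column per iteration.
    data = pd.DataFrame(np.empty((n_rows, len(columns))), columns=columns)
    for i, col in enumerate(columns):
        data[col] = np.full(n_rows, float(i))
    return data


def fill_via_dict(columns: List[str], n_rows: int) -> pd.DataFrame:
    # Collect arrays in a dict and construct the frame once.
    data: Dict[str, np.ndarray] = {}
    for i, col in enumerate(columns):
        data[col] = np.full(n_rows, float(i))
    return pd.DataFrame(data, columns=columns)


def best_of(func: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time over a few repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


columns = [f"n{i}" for i in range(200)]
n_rows = 1_000
t_pre = best_of(lambda: fill_preallocated(columns, n_rows))
t_dict = best_of(lambda: fill_via_dict(columns, n_rows))
print(f"pre-allocated: {t_pre:.4f}s, dict: {t_dict:.4f}s")
```

Varying the number of columns while holding `n_rows` fixed shows how each variant scales with the number of nodes.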

Trade-offs

No API change — function signatures and return types are identical.
np.column_stack with 1-D arrays produces identical shape to the original DataFrame .to_numpy().
The noise_samples_of_ancestors function in the same file uses the same pre-allocation pattern but is intentionally left for a follow-up, as its partial-ancestor skip logic makes the change slightly more involved.

Test Status

Tests could not be run in this environment (no Python runtime with dependencies). CI will validate.

Relevant test files:

tests/gcm/test_noise.py — directly exercises compute_data_from_noise and compute_noise_from_data
tests/gcm/test_counterfactuals.py — exercises the counterfactual workflow that calls these functions
tests/gcm/test_graph.py

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run
gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

…nd compute_noise_from_data

Replace pd.DataFrame(np.empty(...)) pre-allocation with dict-of-arrays accumulation in two functions in _noise.py, following the same pattern applied to fitting_sampling.py.

- compute_data_from_noise: accumulate node arrays in a dict, build parent arrays with np.column_stack instead of DataFrame column-selection, and create the output DataFrame once at the end.
- compute_noise_from_data: accumulate noise arrays in a dict, create the output DataFrame once at the end (parent lookup still uses the immutable observed_data DataFrame, so no change there).

With pandas 2.x copy-on-write semantics, writing to a pre-allocated DataFrame column-by-column triggers repeated implicit copies. Using a dict avoids all intermediate DataFrame copies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

github-actions Bot added automation repo-assist labels Jun 2, 2026

github-actions Bot mentioned this pull request Jun 3, 2026

[Repo Assist] Monthly Activity 2026-06 #1559

Open

41 tasks

github-actions[bot] wants to merge 1 commit into main from repo-assist/perf-noise-dict-accumulation-2026-06-9268ad97bc5f0a3a
