[Repo Assist] perf(gcm): avoid pre-allocated DataFrame in compute_data_from_noise and compute_noise_from_data#1561

Draft
github-actions[bot] wants to merge 1 commit into main from repo-assist/perf-noise-dict-accumulation-2026-06-9268ad97bc5f0a3a

@github-actions github-actions Bot commented Jun 2, 2026

🤖 This is an automated pull request from Repo Assist, an AI assistant.

Summary

Extends the dict-of-arrays accumulation pattern (introduced in #1544 for fitting_sampling.py) to two more functions in dowhy/gcm/_noise.py.

Affected functions

compute_data_from_noise and compute_noise_from_data both used the pre-allocated DataFrame anti-pattern:

# Before (inefficient with pandas 2.x CoW)
data = pd.DataFrame(np.empty((noise_data.shape[0], len(sorted_nodes))), columns=sorted_nodes)
for node in sorted_nodes:
    ...
    data[node] = ...  # triggers copy on each assignment
return data

# After (no intermediate copies)
data: Dict[Any, np.ndarray] = {}
for node in sorted_nodes:
    ...
    data[node] = ...  # plain dict insertion
return pd.DataFrame(data, columns=sorted_nodes)  # single allocation
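
To make the before/after comparison concrete, here is a minimal, self-contained sketch of both patterns on a toy linear chain. The helper names `build_preallocated` and `build_from_dict`, the node names, and the `2.0 * parent + noise` mechanism are hypothetical and only illustrate the accumulation pattern; they are not the code in dowhy/gcm/_noise.py.

```python
from typing import Any, Dict, List

import numpy as np
import pandas as pd


def build_preallocated(noise: pd.DataFrame, order: List[str]) -> pd.DataFrame:
    # Old pattern: allocate the full frame up front, then assign one column per node.
    data = pd.DataFrame(np.empty((noise.shape[0], len(order))), columns=order)
    for i, node in enumerate(order):
        parent = data[order[i - 1]].to_numpy() if i > 0 else 0.0
        data[node] = 2.0 * parent + noise[node].to_numpy()
    return data


def build_from_dict(noise: pd.DataFrame, order: List[str]) -> pd.DataFrame:
    # New pattern: collect plain arrays in a dict, build the frame once at the end.
    data: Dict[Any, np.ndarray] = {}
    for i, node in enumerate(order):
        parent = data[order[i - 1]] if i > 0 else 0.0
        data[node] = 2.0 * parent + noise[node].to_numpy()
    return pd.DataFrame(data, columns=order)


rng = np.random.default_rng(0)
order = ["X", "Y", "Z"]  # hypothetical topological order of a chain X -> Y -> Z
noise = pd.DataFrame(rng.normal(size=(5, 3)), columns=order)

old = build_preallocated(noise, order)
new = build_from_dict(noise, order)

# Both patterns should yield the same values, dtypes, and column order.
pd.testing.assert_frame_equal(old, new)
```

Passing `columns=order` to the final constructor keeps the column order explicit rather than relying on dict insertion order.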

compute_data_from_noise — additional improvement

The parent-data lookup moved from DataFrame column-selection to direct dict access:

# Before: DataFrame slice → triggers another copy
data[get_ordered_predecessors(causal_model.graph, node)].to_numpy()

# After: build parent matrix directly from already-computed arrays
parents = get_ordered_predecessors(causal_model.graph, node)
np.column_stack([data[p] for p in parents])

np.column_stack on 1-D arrays produces the same (N, k) shape as the original .to_numpy() call, so the contract with .evaluate() is unchanged.
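
The shape claim can be checked in isolation. The sketch below uses a hypothetical two-column dict (the names `A` and `B` are illustrative only) and compares DataFrame column selection with `np.column_stack`, including the single-parent case.

```python
import numpy as np
import pandas as pd

n_rows = 4
# Hypothetical already-computed node arrays, each 1-D with n_rows entries.
data = {
    "A": np.arange(n_rows, dtype=float),
    "B": np.arange(n_rows, dtype=float) * 10.0,
}
parents = ["B", "A"]  # parent order determines column order in both variants

# Old lookup: select columns from a DataFrame, then convert to a 2-D array.
from_frame = pd.DataFrame(data)[parents].to_numpy()

# New lookup: stack the 1-D arrays as columns of a 2-D array.
from_dict = np.column_stack([data[p] for p in parents])

assert from_frame.shape == from_dict.shape == (n_rows, len(parents))
assert np.array_equal(from_frame, from_dict)

# A single parent still yields a 2-D (n_rows, 1) array rather than a 1-D one.
single = np.column_stack([data["A"]])
assert single.shape == (n_rows, 1)
```

Note that `np.column_stack` raises a `ValueError` on an empty list, so this sketch assumes nodes without parents are handled on a separate code path.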

Why this matters

With pandas ≥ 2.0 copy-on-write semantics (opt-in in pandas 2.x), assigning to a column of a pre-allocated DataFrame forces an internal copy of the backing block on every iteration. For graphs with many nodes, this adds O(N²) memory work, where N here is the number of nodes. Accumulating results in a plain dict and constructing the DataFrame once at the end reduces that to O(N).

These two functions are called inside tight loops in get_noise_dependent_function / _get_exact_noise_dependent_function, and compute_noise_from_data also runs in counterfactual and noise-estimation workflows, so the saving compounds.

Trade-offs

  • No API change — function signatures and return types are identical.
  • np.column_stack with 1-D arrays produces identical shape to the original DataFrame .to_numpy().
  • The noise_samples_of_ancestors function in the same file uses the same pre-allocation pattern but is intentionally left for a follow-up, as its partial-ancestor skip logic makes the change slightly more involved.

Test Status

Tests could not be run in this environment (no Python runtime with dependencies). CI will validate.

Relevant test files:

  • tests/gcm/test_noise.py — directly exercises compute_data_from_noise and compute_noise_from_data
  • tests/gcm/test_counterfactuals.py — exercises the counterfactual workflow that calls these functions
  • tests/gcm/test_graph.py

Generated by 🌈 Repo Assist, see workflow run. Learn more.

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@11c9a2c442e519ff2b427bf58679f5a525353f76

…nd compute_noise_from_data

Replace pd.DataFrame(np.empty(...)) pre-allocation with dict-of-arrays
accumulation in two functions in _noise.py, following the same pattern
applied to fitting_sampling.py.

- compute_data_from_noise: accumulate node arrays in a dict, build parent
  arrays with np.column_stack instead of DataFrame column-selection, and
  create the output DataFrame once at the end.
- compute_noise_from_data: accumulate noise arrays in a dict, create the
  output DataFrame once at the end (parent lookup still uses the immutable
  observed_data DataFrame, so no change there).

With pandas 2.x copy-on-write semantics, writing to a pre-allocated
DataFrame column-by-column triggers repeated implicit copies. Using a
dict avoids all intermediate DataFrame copies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>