[Repo Assist] perf(gcm): avoid pre-allocated DataFrame in compute_data_from_noise and compute_noise_from_data #1561
Draft
github-actions[bot] wants to merge 1 commit into
…nd compute_noise_from_data

Replace pd.DataFrame(np.empty(...)) pre-allocation with dict-of-arrays accumulation in two functions in _noise.py, following the same pattern applied to fitting_sampling.py.

- compute_data_from_noise: accumulate node arrays in a dict, build parent arrays with np.column_stack instead of DataFrame column-selection, and create the output DataFrame once at the end.
- compute_noise_from_data: accumulate noise arrays in a dict, create the output DataFrame once at the end (parent lookup still uses the immutable observed_data DataFrame, so no change there).

With pandas 2.x copy-on-write semantics, writing to a pre-allocated DataFrame column-by-column triggers repeated implicit copies. Using a dict avoids all intermediate DataFrame copies.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
🤖 This is an automated pull request from Repo Assist, an AI assistant.
Summary
Extends the dict-of-arrays accumulation pattern (introduced in #1544 for fitting_sampling.py) to two more functions in dowhy/gcm/_noise.py.

Affected functions
compute_data_from_noise and compute_noise_from_data both used the pre-allocated DataFrame anti-pattern: a pd.DataFrame(np.empty(...)) that is then filled column by column.

compute_data_from_noise: additional improvement
The parent-data lookup moved from DataFrame column-selection to direct dict access. np.column_stack on 1-D arrays produces the same (N, k) shape as the original .to_numpy() call, so the contract with .evaluate() is unchanged.

Why this matters
With pandas ≥ 2.0 copy-on-write semantics, assigning to a column of a pre-allocated DataFrame forces an internal copy of the backing block on every iteration. For graphs with many nodes, this adds O(N²) memory work. Accumulating results in a plain dict and constructing the DataFrame once at the end reduces that to O(N).

These two functions are called inside tight loops in get_noise_dependent_function → _get_exact_noise_dependent_function, and during compute_noise_from_data in counterfactual and noise-estimation workflows, so the saving compounds.

Trade-offs
- np.column_stack with 1-D arrays produces identical shape to the original DataFrame .to_numpy().
- The noise_samples_of_ancestors function in the same file uses the same pre-allocation pattern but is intentionally left for a follow-up, as its partial-ancestor skip logic makes the change slightly more involved.

Test Status
Tests could not be run in this environment (no Python runtime with dependencies). CI will validate.
Relevant test files:
- tests/gcm/test_noise.py: directly exercises compute_data_from_noise and compute_noise_from_data
- tests/gcm/test_counterfactuals.py: exercises the counterfactual workflow that calls these functions
- tests/gcm/test_graph.py
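
Illustrative sketch

The following is a minimal, self-contained sketch of the dict-of-arrays pattern described in this PR. It is not the actual diff to dowhy/gcm/_noise.py. The graph `TOY_GRAPH`, the function `toy_mechanism`, and both `data_from_noise_*` helpers are hypothetical stand-ins invented for illustration; the real code works on a causal model and calls each mechanism's .evaluate(). The sketch only aims to show that accumulating 1-D arrays in a dict and stacking parents with np.column_stack yields the same result as filling a pre-allocated DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy graph (not DoWhy's API): node -> parents, in topological order.
TOY_GRAPH = {"X": [], "Y": ["X"], "Z": ["X", "Y"]}


def toy_mechanism(parent_samples, noise):
    """Hypothetical stand-in for a mechanism: row-wise sum of parents plus noise."""
    return parent_samples.sum(axis=1) + noise


def data_from_noise_preallocated(noise_data):
    # Old pattern: pre-allocate a DataFrame, then assign one column per node.
    data = pd.DataFrame(
        np.empty((noise_data.shape[0], len(TOY_GRAPH))), columns=list(TOY_GRAPH)
    )
    for node, parents in TOY_GRAPH.items():
        noise = noise_data[node].to_numpy()
        if not parents:
            data[node] = noise
        else:
            # Column selection on the partially filled DataFrame, shape (N, k).
            data[node] = toy_mechanism(data[parents].to_numpy(), noise)
    return data


def data_from_noise_dict(noise_data):
    # New pattern: accumulate 1-D arrays in a dict, build the DataFrame once.
    columns = {}
    for node, parents in TOY_GRAPH.items():
        noise = noise_data[node].to_numpy()
        if not parents:
            columns[node] = noise
        else:
            # Stacking k 1-D arrays of length N gives the same (N, k) shape.
            parent_samples = np.column_stack([columns[p] for p in parents])
            columns[node] = toy_mechanism(parent_samples, noise)
    return pd.DataFrame(columns)


rng = np.random.default_rng(0)
noise_data = pd.DataFrame(rng.normal(size=(5, 3)), columns=list(TOY_GRAPH))
old = data_from_noise_preallocated(noise_data)
new = data_from_noise_dict(noise_data)
```

Because the dict preserves insertion order, the resulting DataFrame has the same column order as the pre-allocated version as long as nodes are visited in the same order.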