
Port AtomicLocal integration to GPU #1163

Merged: mfherbst merged 4 commits into JuliaMolSim:master from abussy:simpson on Feb 19, 2026

Conversation

@abussy (Collaborator) commented on Oct 7, 2025

This PR ports the integration of the AtomicLocal form factors to the GPU.

In standard calculations, the instantiation of the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story. In particular, for stress calculations on the GPU (not yet generally available, see JuliaMolSim/DftFunctionals.jl#23), this step becomes vastly more expensive than anything else.

For efficiency, the form factors are only calculated for the set of unique G-vector norms, rather than for each individual G. With ForwardDiff Duals, however, two G vectors with the same value can differ in their partials, so their norms no longer coincide as Dual numbers. As a result, many more integrations must take place.
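
As a minimal illustration of the norm issue (hypothetical values, plain ForwardDiff rather than DFTK code):

using ForwardDiff: Dual, value, partials
using LinearAlgebra: norm

# Two G vectors with the same value (1, 0, 0) but different partials,
# e.g. because they respond differently to a lattice strain.
G1 = [Dual(1.0, 1.0), Dual(0.0, 0.0), Dual(0.0, 0.0)]
G2 = [Dual(1.0, 0.0), Dual(0.0, 1.0), Dual(0.0, 0.0)]

value(norm(G1)) == value(norm(G2))        # true:  the primal norms agree
partials(norm(G1)) == partials(norm(G2))  # false: the dual parts differ, so the
                                          # two norms cannot share one integration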

This PR introduces a special code path for the GPU, where a map! kernel is launched over the G-vector norms and each GPU thread takes care of one integration on the atomic grid. This path only triggers for UPF pseudopotentials on the GPU.
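
Schematically, the GPU path does something like the following. This is only a rough sketch with hypothetical names (p_norms, r, dr, vloc), written as a hand-rolled CUDA.jl kernel; the actual PR uses a map!-based kernel launched over the norms and handles details this sketch omits (e.g. the treatment of the long-range Coulomb tail of the local potential).

using CUDA

# p_norms: unique |G| values; r, dr: radial grid points and quadrature weights;
# vloc: local potential sampled on the radial grid. All plain device vectors,
# so the kernel only ever sees isbits data.
function form_factor_kernel!(ints, p_norms, r, dr, vloc)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(p_norms)
        p = p_norms[i]
        acc = zero(eltype(ints))
        # One thread integrates 4π ∫ r² vloc(r) j₀(p r) dr on the atomic grid,
        # with j₀(x) = sin(x)/x the zeroth-order spherical Bessel function.
        for j in eachindex(r)
            x = p * r[j]
            j0 = iszero(x) ? one(x) : sin(x) / x
            acc += 4π * r[j]^2 * vloc[j] * j0 * dr[j]
        end
        ints[i] = acc
    end
    return nothing
end

ints = similar(p_norms)
@cuda threads=256 blocks=cld(length(p_norms), 256) form_factor_kernel!(
    ints, p_norms, r, dr, vloc)

Since Duals are isbits, the same kernel also runs when p_norms carries ForwardDiff partials, which is exactly the stress and force case this PR targets.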

Notes:

  • This PR will drastically speed up stress calculations on the GPU. While the new function will also be used for the instantiation of a standard PlaneWaveBasis, its impact there will be smaller.
  • The AtomicLocal term will remain expensive on the CPU. Since everything else also gets more expensive there, it is less of an obvious bottleneck. However, I think it would be trivial to parallelize it over Julia threads (see the sketch after this list).
  • Due to the complex dispatching taking place, and the fact that the ElementPsp{<:PspUpf} type is far from being isbits, I did not manage to come up with a single code path that runs on both CPU and GPU.
  • The instantiation of the Xc term suffers from a similar problem, to be treated in a separate PR.
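
For reference, a minimal sketch of what the Julia-threads parallelization mentioned above could look like (eval_form_factor is a hypothetical stand-in for a single radial integration, not code from this PR):

using Base.Threads: @threads

function form_factors_threaded(eval_form_factor, p_norms::AbstractVector)
    # Assumes the form factor has the same element type as the norms (e.g. Dual).
    ints = similar(p_norms)
    @threads for i in eachindex(p_norms)
        # Each Julia thread handles a chunk of the unique |G| values.
        ints[i] = eval_form_factor(p_norms[i])
    end
    ints
end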

@mfherbst (Member) left a comment:

Thanks for this suggestion. I think it becomes increasingly clear that local_potential_fourier(element, p) is not the right interface. I think it's better to fix that rather than adding yet another level of indirection with the atomic_local_inner_loop!.

What I have in mind could be an interface where one gets multiple ps at once, perhaps in combination with an in-place version where the result is placed directly into CPU or GPU memory. That allows specialisation directly at the level of the PspUpf implementation.

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

What do you think @abussy ?

Comment thread src/gpu/local.jl (outdated), on the line:

ints_cpu = to_cpu(ints)
@mfherbst (Member) commented:

This is weird. The only reason this is on the CPU is that we needed it for something. Now you essentially put form_factors on the CPU only to move it to the GPU once this function call is over, right? Can this not directly be a GPU array in the GPU version of the code?

@abussy (Collaborator, Author) commented on Oct 8, 2025

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

I fully agree. The proposed solution happens to be the one with the least amount of change to the current code, but some change would allow a much smoother integration.

As you suggest, allowing local_potential_fourier(element, p) to treat multiple ps at once would pass the loop logic down to the specific pseudopotential implementation. If we remain general and pass an AbstractArray of ps, we might even be able to write architecture-agnostic code. I'll look into it.
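
Concretely, such an interface could look roughly like the following (a sketch only; the batched psp-level eval_psp_local_fourier method is hypothetical at this point):

# Element-level entry point receives all p values at once ...
function local_potential_fourier(el::ElementPsp, ps::AbstractVector{<:Real})
    # ... and forwards them to a batched psp-level method, which can then decide
    # internally whether to run a plain CPU loop or launch a GPU kernel.
    eval_psp_local_fourier(el.psp, ps)
end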

@Technici4n (Collaborator) commented:

In standard calculations, the instantiation of the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story.

Could we fix this on the CPU somehow? I wonder what exactly makes them so slow. In principle it should only be twice as slow.

@abussy (Collaborator, Author) commented on Oct 8, 2025

I wonder what exactly makes them so slow. In principle it should only be twice as slow.

There is a whole machinery to only perform integrations for unique values of norm(G) (see here). When Duals are involved, the number of unique norms explodes. My interpretation is that G vectors with the same value can have different partials, so the overall norms end up different.

@abussy (Collaborator, Author) commented on Oct 8, 2025

This last commit is a proof of concept and not a final product. It illustrates how one might implement the loop over norm(G) at the Psp level (here only implemented for <:PspUpf).

To make it general, one needs to:

  • Modify all function signatures of element.jl so that arrays are expected, and make sure the underlying implementations match that
  • Update the rest of the code to this new convention

This approach has the advantage of having a single code base running on both CPU and GPU, and of not over-complicating the high-level calls. However, the low-level implementation of the integrals becomes a bit awkward.

Before proceeding, I'd like to make sure we agree on the way forward.

@abussy (Collaborator, Author) commented on Feb 3, 2026

I am reviving this PR because running any kind of stress calculation on the GPU is extremely slow without it. I have also noticed a large impact on the performance of AtomicLocal forces, which is particularly important when running an SCF with the DFTK.ScfConvergenceForce() callback. Finally, similar optimizations must take place for the Xc and AtomicNonlocal terms, and merging this PR is a prerequisite to starting that work.

Essentially, this PR introduces vectorized versions of the local_potential_fourier() functions for the various types of elements. In most cases, it is a simple map(). However, for the PspUpf case, a highly optimized GPU implementation is provided.

Note: code up to date with master.

@Technici4n (Collaborator) commented on Feb 3, 2026

I wonder if we could use fast spherical Bessel transforms somehow, see e.g. https://doi.org/10.1016/j.cpc.2009.09.020. Maybe this could be a student project idea?

(Not saying we shouldn't merge this PR ;) )

@mfherbst (Member) left a comment:

Thanks

Comment threads:
  • src/pseudo/PspHgh.jl (outdated)
  • src/pseudo/NormConservingPsp.jl (outdated)
  • src/elements.jl (outdated)
  • src/common/quadrature.jl (outdated)
  • src/terms/local.jl (outdated)
  • src/common/quadrature.jl (outdated)
  • src/pseudo/PspUpf.jl (outdated)
  • src/pseudo/PspUpf.jl (outdated)
  • src/architecture.jl (outdated)
  • src/terms/local.jl
@abussy (Collaborator, Author) commented on Feb 18, 2026

Implemented the requested and suggested changes. When generalizing the vectorized fallbacks of the type:

function local_potential_fourier(el::Element, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    # Evaluate point-wise on the CPU, then move the result back to the original device.
    to_device(arch, map(p -> local_potential_fourier(el, p), to_cpu(ps)))
end

I also made sure they are GPU-safe; otherwise some non-UPF calculations would fail on the GPU.

Note: up to date with current master

@mfherbst merged commit b6880b8 into JuliaMolSim:master on Feb 19, 2026
10 checks passed
@mfherbst (Member) commented:

Thanks!
