
Port AtomicLocal integration to GPU #1163

Merged: mfherbst merged 4 commits into JuliaMolSim:master from abussy:simpson on Feb 19, 2026

Conversation

@abussy (Collaborator) commented on Oct 7, 2025

This PR ports the integration of the AtomicLocal form factors to the GPU.

In standard calculations, the instantiation of the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story. In particular, for stress calculations on the GPU (not yet generally available, see JuliaMolSim/DftFunctionals.jl#23), this step becomes vastly more expensive than anything else.

For efficiency, the form factors are only calculated for the set of unique G-vector norms, rather than for each individual G. With ForwardDiff Duals, however, two G vectors with the same value can differ in their partials, so their norms no longer coincide as Dual numbers. As a result, many more integrations must take place.
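
As a minimal illustration of the norm issue (hypothetical values, plain ForwardDiff rather than DFTK code):

using ForwardDiff: Dual, value, partials
using LinearAlgebra: norm

# Two G vectors with the same value (1, 0, 0) but different partials,
# e.g. because they respond differently to a lattice strain.
G1 = [Dual(1.0, 1.0), Dual(0.0, 0.0), Dual(0.0, 0.0)]
G2 = [Dual(1.0, 0.0), Dual(0.0, 1.0), Dual(0.0, 0.0)]

value(norm(G1)) == value(norm(G2))        # true:  the primal norms agree
partials(norm(G1)) == partials(norm(G2))  # false: the dual parts differ, so the
                                          # two norms cannot share one integration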

This PR introduces a special code path for the GPU, where a map! kernel is launched over the G-vector norms and each GPU thread takes care of one integration on the atomic grid. This path only triggers for UPF pseudopotentials on the GPU.
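
Schematically, the GPU path does something like the following. This is only a rough sketch with hypothetical names (p_norms, r, dr, vloc), written as a hand-rolled CUDA.jl kernel; the actual PR uses a map!-based kernel launched over the norms and handles details this sketch omits (e.g. the treatment of the long-range Coulomb tail of the local potential).

using CUDA

# p_norms: unique |G| values; r, dr: radial grid points and quadrature weights;
# vloc: local potential sampled on the radial grid. All plain device vectors,
# so the kernel only ever sees isbits data.
function form_factor_kernel!(ints, p_norms, r, dr, vloc)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(p_norms)
        p = p_norms[i]
        acc = zero(eltype(ints))
        # One thread integrates 4π ∫ r² vloc(r) j₀(p r) dr on the atomic grid,
        # with j₀(x) = sin(x)/x the zeroth-order spherical Bessel function.
        for j in eachindex(r)
            x = p * r[j]
            j0 = iszero(x) ? one(x) : sin(x) / x
            acc += 4π * r[j]^2 * vloc[j] * j0 * dr[j]
        end
        ints[i] = acc
    end
    return nothing
end

ints = similar(p_norms)
@cuda threads=256 blocks=cld(length(p_norms), 256) form_factor_kernel!(
    ints, p_norms, r, dr, vloc)

Since Duals are isbits, the same kernel also runs when p_norms carries ForwardDiff partials, which is exactly the stress and force case this PR targets.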

Notes:

  • This PR will drastically speed up stress calculations on the GPU. While the new function will also be used for the instantiation of a standard PlaneWaveBasis, its impact there will be smaller.
  • The AtomicLocal term will remain expensive on the CPU. Since everything else also gets more expensive there, it is less of an obvious bottleneck. However, I think it would be trivial to parallelize it over Julia threads (see the sketch after this list).
  • Due to the complex dispatching taking place, and the fact that the ElementPsp{<:PspUpf} type is far from being isbits, I did not manage to come up with a single code path that runs on both CPU and GPU.
  • The instantiation of the Xc term suffers from a similar problem, to be treated in a separate PR.
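
For reference, a minimal sketch of what the Julia-threads parallelization mentioned above could look like (eval_form_factor is a hypothetical stand-in for a single radial integration, not code from this PR):

using Base.Threads: @threads

function form_factors_threaded(eval_form_factor, p_norms::AbstractVector)
    # Assumes the form factor has the same element type as the norms (e.g. Dual).
    ints = similar(p_norms)
    @threads for i in eachindex(p_norms)
        # Each Julia thread handles a chunk of the unique |G| values.
        ints[i] = eval_form_factor(p_norms[i])
    end
    ints
end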

@mfherbst (Member) left a comment:

Thanks for this suggestion. I think it becomes increasingly clear that local_potential_fourier(element, p) is not the right interface. I think it's better to fix that rather than adding yet another level of indirection with the atomic_local_inner_loop!.

What I have in mind could be an interface where one gets multiple ps at once, perhaps in combination with an in-place version where the result is placed directly into CPU or GPU memory. That allows specialisation directly at the level of the PspUpf implementation.

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

What do you think @abussy ?

Comment thread src/gpu/local.jl (outdated), on the line:

ints_cpu = to_cpu(ints)
@mfherbst (Member) commented:

This is weird. The only reason this is on the CPU is that we needed it for something. Now you essentially put form_factors on the CPU only to move it to the GPU once this function call is over, right? Can this not directly be a GPU array in the GPU version of the code?

@abussy (Collaborator, Author) commented on Oct 8, 2025

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

I fully agree. The proposed solution happens to be the one with the least amount of change to the current code, but some change would allow a much smoother integration.

As you suggest, allowing local_potential_fourier(element, p) to treat multiple ps at once would pass the loop logic down to the specific pseudopotential implementation. If we remain general and pass an AbstractArray of ps, we might even be able to write architecture-agnostic code. I'll look into it.
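
Concretely, such an interface could look roughly like the following (a sketch only; the batched psp-level eval_psp_local_fourier method is hypothetical at this point):

# Element-level entry point receives all p values at once ...
function local_potential_fourier(el::ElementPsp, ps::AbstractVector{<:Real})
    # ... and forwards them to a batched psp-level method, which can then decide
    # internally whether to run a plain CPU loop or launch a GPU kernel.
    eval_psp_local_fourier(el.psp, ps)
end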

@Technici4n (Collaborator) commented:

In standard calculations, the instantiation of the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story.

Could we fix this on the CPU somehow? I wonder what exactly makes them so slow. In principle it should only be twice as slow.

@abussy (Collaborator, Author) commented on Oct 8, 2025

I wonder what exactly makes them so slow. In principle it should only be twice as slow.

There is a whole machinery to only perform integrations for unique values of norm(G) (see here). When Duals are involved, the number of unique norms explodes. My interpretation is that G vectors with the same value can have different partials, so the overall norms end up different.

@abussy (Collaborator, Author) commented on Oct 8, 2025

This last commit is a proof of concept and not a final product. It illustrates how one might implement the loop over norm(G) at the Psp level (here only implemented for <:PspUpf).

To make it general, one needs to:

  • Modify all function signatures of element.jl so that arrays are expected, and make sure the underlying implementations match that
  • Update the rest of the code to this new convention

This approach has the advantage of having a single code base running on both CPU and GPU, and of not over-complicating the high-level calls. However, the low-level implementation of the integrals becomes a bit awkward.

Before proceeding, I'd like to make sure we agree on the way forward.

@abussy (Collaborator, Author) commented on Feb 3, 2026

I am reviving this PR because running any kind of stress calculation on the GPU is extremely slow without it. I have also noticed a large impact on the performance of AtomicLocal forces, which is particularly important when running an SCF with the DFTK.ScfConvergenceForce() callback. Finally, similar optimizations must take place for the Xc and AtomicNonlocal terms, and merging this PR is a prerequisite to starting that work.

Essentially, this PR introduces vectorized versions of the local_potential_fourier() functions for the various types of elements. In most cases, it is a simple map(). However, for the PspUpf case, a highly optimized GPU implementation is provided.

Note: code up to date with master.

@Technici4n (Collaborator) commented on Feb 3, 2026

I wonder if we could use fast spherical Bessel transforms somehow, see e.g. https://doi.org/10.1016/j.cpc.2009.09.020. Maybe this could be a student project idea?

(Not saying we shouldn't merge this PR ;) )

@mfherbst (Member) left a comment:

Thanks

Comment threads:
  • src/pseudo/PspHgh.jl (outdated)
  • src/pseudo/NormConservingPsp.jl (outdated)
  • src/elements.jl (outdated)
  • src/common/quadrature.jl (outdated)
  • src/terms/local.jl (outdated)
  • src/common/quadrature.jl (outdated)
  • src/pseudo/PspUpf.jl (outdated)
  • src/pseudo/PspUpf.jl (outdated)
  • src/architecture.jl (outdated)
  • src/terms/local.jl
@abussy (Collaborator, Author) commented on Feb 18, 2026

Implemented the requested and suggested changes. When generalizing the vectorized fallbacks of the type:

function local_potential_fourier(el::Element, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    # Evaluate point-wise on the CPU, then move the result back to the original device.
    to_device(arch, map(p -> local_potential_fourier(el, p), to_cpu(ps)))
end

I also made sure they are GPU-safe; otherwise some non-UPF calculations would fail on the GPU.

Note: up to date with current master

@mfherbst merged commit b6880b8 into JuliaMolSim:master on Feb 19, 2026
10 checks passed
@mfherbst (Member) commented:

Thanks!
