
Optimizing XC instantiation for GPUs #1262

Open
abussy wants to merge 5 commits into JuliaMolSim:master from abussy:hankel

Conversation

@abussy (Collaborator) commented Feb 20, 2026

This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.

Particular care was given to the hankel function in workarounds/forwarddiff_rules.jl in order to make it GPU-compatible. A side effect is the removal of allocations, which also marginally (by a few percent) speeds up XC instantiation on the CPU.
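
To illustrate the idea (a minimal sketch under assumptions, not the actual DFTK implementation): the radial Hankel transform f̃(p) = 4π ∫ r² f(r) j_l(p r) dr can be accumulated into a scalar instead of materialising an intermediate array, which removes allocations and lets the same code run inside GPU kernels and with ForwardDiff.Dual numbers. The trapezoidal quadrature and the SpecialFunctions Bessel call below are assumptions of the sketch; DFTK uses its own quadrature and a GPU-safe Bessel routine.

using SpecialFunctions: sphericalbesselj

function hankel_sketch(rgrid::AbstractVector, r2_f::AbstractVector, l::Integer, p::T) where {T <: Real}
    acc = zero(promote_type(T, eltype(r2_f)))
    @inbounds for i in 1:(length(rgrid) - 1)
        dr  = rgrid[i+1] - rgrid[i]
        fi  = r2_f[i]   * sphericalbesselj(l, p * rgrid[i])
        fi1 = r2_f[i+1] * sphericalbesselj(l, p * rgrid[i+1])
        acc += dr * (fi + fi1) / 2   # trapezoidal rule, accumulated in place
    end
    4T(π) * acc
end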

For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):

|       | master | this PR |
|-------|--------|---------|
| A100  | 65 s   | 2 s     |
| MI250 | 66 s   | 2 s     |
using DFTK
using PseudoPotentialData
using CUDA
setup_threading()

arch = DFTK.GPU(CuArray)

Ecut = 50.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-10
temperature = 0.01

factor = 3
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

# rattle a few atoms
positions[5] = positions[5] + [0.01, 0.0, -0.01]
positions[10] = positions[10] + [0.01, 0.0, -0.01]
positions[15] = positions[15] + [0.01, 0.0, -0.01]
positions[20] = positions[20] + [0.01, 0.0, -0.01]
positions[25] = positions[25] + [0.01, 0.0, -0.01]

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())
                  
# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)

# actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, is_converged=ScfConvergenceEnergy(tol))
forces = compute_forces(scfres)
stress = compute_stresses_cart(scfres)
@show DFTK.timer

Comment thread src/pseudo/PspUpf.jl
Comment on lines +266 to +267
rgrid = to_device(arch, @view psp.rgrid[1:psp.ircut])
r2_ρcore = to_device(arch, @view psp.r2_ρcore[1:psp.ircut])
Member

I agree with what you do here. But I think it's worth having a small discussion, as this now emerges as a more general pattern. Some questions:

  • Right now we implement the "single-p" version in all pseudos and then add vectorised versions for the cases where we want extra speedup on the GPU. Would it make more sense to go the other way: implement the vectorised version and provide the "single-p" version as a fallback, i.e. just calling the vector-p version with [p] (see the sketch after this list)? My thinking is that it probably makes sense to do exactly the same thing (prefer the vectorised version for the GPU) for the HGH pseudos and for possible UPF pseudos we implement in the future. So it's probably the better pattern to make the vector version the thing you actually have to implement as a developer.
  • What about the other functions that also take a single p right now? Should we not move those to vector versions as well, at the very least for consistency?
  • In line with that, does it make sense to put rgrid and r2_ρcore onto the GPU from the start?
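
As a sketch of that first bullet (purely illustrative, not code from this PR): the vectorised method would be the one each pseudopotential actually implements, and the single-p method would become a generic one-line fallback wrapping p in a vector.

# Hypothetical fallback if the vector-p version were the primary interface
function eval_psp_local_fourier(psp::NormConservingPsp, p::T) where {T <: Real}
    only(eval_psp_local_fourier(psp, [p]))
end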

Collaborator Author

I think this is an important discussion, and here is my take on it:

On a conceptual level, I think it makes more sense to have a "single-p" version as a default and a "vector-p" version as a special case: it follows human thinking a bit more closely. That being said, optimized "vector-p" functions are many times more efficient on the GPU and also have strong potential for CPU shared-memory parallelism (OpenMP-threading style), should this become necessary.

I also agree that using the "vector-p" version as a default would push future developments in the right direction from the start. It is, however, not straightforward to write a GPU-compatible map call, and it is unlikely to happen by chance. Since most people develop and test on the CPU, there is a real risk of introducing faulty code. In that regard, using the current pattern of:

# Fallback vectorized implementation for non GPU-optimized code.
function eval_psp_local_fourier(psp::NormConservingPsp, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    to_device(arch, map(p -> eval_psp_local_fourier(psp, p), to_cpu(ps)))
end

carries an explicit message regarding GPUs: this piece of code does not work (or has not been tested) on the GPU. Therefore, I personally lean slightly towards the "single-p" version as a default.

does it make sense to put rgrid and r2_ρcore onto the GPU from the start?

I think this would introduce a lot of complication: everything in the PSP code would need to work with GPU arrays. Performance-wise, the data transfer to the GPU is cheap because the arrays are small, and the PSP functions are rarely called (mostly during basis instantiation, so 2-3 times per calculation).

Member

there is a real risk of introducing faulty code.

Is this true? If people develop new pseudos, they are likely copying from existing implementations. If we do it right there, would it not either immediately work in a new implementation or reduce the refactoring effort needed to make it work efficiently on the GPU? In the sense that it is easier in the long run to fix that faulty code than to refactor every function one by one so that there also is a vector version? If you think I'm wrong here and having the vector code around will not reduce the refactoring effort, then I agree that, because of the conceptual simplicity, it makes sense to default to the single-p version.

In that regard, using the current pattern of:

Ok, but then we should use this consistently everywhere, possibly with a macro to avoid writing the same code 20 times (if it's not too cumbersome... macros always carry the danger of obfuscating). It's weird that the vector version only exists for some PSP functions and not others.

I think this would introduce a lot of complication:

That makes sense. Let's not do it for now.

Collaborator Author

Is this true?

I think the risk exists, yes. For example, while working on #1163, I looked into vectorizing PspHgh:

@doc raw"""
The local potential of a HGH pseudopotentials in reciprocal space
can be brought to the form ``Q(t) / (t^2 exp(t^2 / 2))``
where ``t = r_\text{loc} p`` and `Q`
is a polynomial of at most degree 8. This function returns `Q`.
"""
@inline function psp_local_polynomial(T, psp::PspHgh, t=Polynomial(T[0, 1]))
    rloc::T = psp.rloc
    Zion::T = psp.Zion
    # The polynomial prefactor P(t) (as used inside the { ... } brackets of equation
    # (5) of the HGH98 paper)
    P = (  psp.cloc[1]
         + psp.cloc[2] * (  3 -   t^2)
         + psp.cloc[3] * ( 15 -  10t^2 +  t^4)
         + psp.cloc[4] * (105 - 105t^2 + 21t^4 - t^6))
    4T(π) * rloc^2 * (-Zion + sqrt(T(π) / 2) * rloc * t^2 * P)
end

# [GTH98] (6) except they do it with plane waves normalized by 1/sqrt(Ω).
function eval_psp_local_fourier(psp::PspHgh, p::T) where {T <: Real}
    p == 0 && return zero(T)  # Compensating charge background
    t::T = p * psp.rloc
    psp_local_polynomial(T, psp, t) * exp(-t^2 / 2) / t^2
end

A function like the following works fine on the CPU, but fails on the GPU because instances of Polynomial are not isbits:

function eval_psp_local_fourier(psp::PspHgh, ps::AbstractVector{T}) where {T <: Real}
    map(ps) do p
        if p == 0
            zero(T)  # Compensating charge background
        else
            t = p * psp.rloc
            psp_local_polynomial(T, psp, t) * exp(-t^2 / 2) / t^2
        end
    end
end

I am sure there are ways to circumvent this, but I didn't spend the time because I didn't identify obvious performance bottlenecks.
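
For instance (a hypothetical sketch with made-up names, not code from this PR): pulling the few scalar parameters out of the struct and evaluating the polynomial with plain arithmetic keeps the closure passed to map isbits-friendly, so it can be compiled for the GPU.

function eval_psp_local_fourier_sketch(psp::PspHgh, ps::AbstractVector{T}) where {T <: Real}
    rloc::T = psp.rloc
    Zion::T = psp.Zion
    # Capture plain scalars rather than the psp struct (whose Vector fields are not isbits).
    c1, c2, c3, c4 = ntuple(i -> i <= length(psp.cloc) ? T(psp.cloc[i]) : zero(T), 4)
    z       = zero(T)
    fourpi  = 4T(π)
    sqrtp2  = sqrt(T(π) / 2)
    map(ps) do p
        p == 0 && return z  # Compensating charge background
        t = p * rloc
        P = (c1 + c2 * (3 - t^2)
                + c3 * (15 - 10t^2 + t^4)
                + c4 * (105 - 105t^2 + 21t^4 - t^6))
        Q = fourpi * rloc^2 * (-Zion + sqrtp2 * rloc * t^2 * P)
        Q * exp(-t^2 / 2) / t^2
    end
end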

It would probably be possible to rework all current PSP implementations to be vectorized by default in a GPU-compatible way. If properly commented, these could also serve as a base for new Psp implementations, which would hopefully result in GPU-compatible code from the start.

This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.

Member

Regarding the Polynomial: This was implemented a long, long time ago for some idea that never really matured. So if this is the obstacle, it can be completely removed. In fact this would imply we can even remove the Polynomials dependency from DFTK completely, which is a good thing ... given that this rather obscure use case is the only place we need the package at all.

This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.

I agree with that, and I think we should indeed only do it if it's worth it. What is most important in my opinion is consistency across our API and something that works by default. I think for now it would be fine to just consistently introduce vector-based interfaces for all similar PSP function calls, which fall back to the single-p versions. It's just a little weird to single out some functions over others, in my opinion.

Somewhat independent, I think, is the question whether the vector version should be the default and the dispatch reversed. This we can discuss elsewhere (in an issue?).

Collaborator Author

Ok, sounds reasonable. I'll keep working on this PR and systematically introduce vectorized alternatives to the PSP functions. I'll try to reduce code duplication if I can and see whether some clever macro can be used. I'll also open an issue for the discussion on what should be the default.

Member

Thanks, perfect. If you prefer, we can also first merge this PR and #1265 or somehow arrange things differently.

Collaborator Author

I think it's ok to do it as part of this PR, and adapt #1265 afterwards. This way, things are clean from the start. If someone has an urgent need to perform lots of fast stress calculations in the near future, then we can merge first.

Member

Don't think so, let's not rush.

Contributor

Just a few thoughts here:

  1. The vector-p-as-default makes a lot of sense if using NumSBT or another FFT-based algorithm for the spherical Bessel transform. There, the single-p implementation would be calling the vector-p and returning a single value (very bad, performance-wise).
  2. Polynomials will probably be needed for ultrasoft/PAW as far as I remember. There, the augmentation charges close to the core are often represented as polynomials.

Comment thread src/workarounds/forwarddiff_rules.jl Outdated
Comment thread src/workarounds/forwarddiff_rules.jl
Comment thread src/density_methods.jl
@mfherbst (Member)

This MPI failure we start to see a bit more consistently now. I'm a bit confused about it.

@abussy (Collaborator Author) commented Feb 26, 2026

This MPI failure we start to see a bit more consistently now. I'm a bit confused about it

I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not.

@abussy (Collaborator Author) commented Mar 10, 2026

I generalized the coexistence of scalar and vectorized functions to all Element types and all NormConservingPsp types. I introduced macros that define default vectorized functions based on their scalar counterparts when there is no need for an optimized implementation.

I chose to implement the vectorized functions at the lowest Psp level, e.g. in PspUpf.jl and PspHgh.jl, rather than generalizing over the abstract NormConservingPsp. This leaves the choice of scalar or vector as the default to the individual pseudopotential implementation. At the NormConservingPsp level, it is simply expected that all such pseudos implement both a scalar and a vector version, but the details are abstracted away. The documentation was adapted accordingly.
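
As a rough illustration of what such a macro could look like (hypothetical name and details; the macro actually introduced in the PR may differ):

# Define a vectorized fallback method from an existing scalar method: move the
# input to the CPU, map the scalar function over it, and move the result back.
macro vectorize_psp_function(fn)
    quote
        function $(esc(fn))(psp::NormConservingPsp, ps::AbstractVector{T}) where {T <: Real}
            arch = architecture(ps)
            to_device(arch, map(p -> $(esc(fn))(psp, p), to_cpu(ps)))
        end
    end
end

# Usage: one line per scalar function that needs no GPU-optimized implementation.
@vectorize_psp_function eval_psp_density_valence_fourier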

Edit: concerning the failing MPI tests, it looks like the failure always happens in the same test, namely in "Anisotropic strain sensitivity using ForwardDiff". I have been unable to reproduce it locally so far, though.

Comment thread src/elements.jl Outdated
Comment on lines +179 to +196
# Vectorized versions of the above, specific implementation depending on the Psp type
function local_potential_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_local_fourier(el.psp, ps)
end
function local_potential_real(el::ElementPsp, rs::AbstractVector{T}) where {T <: Real}
    eval_psp_local_real(el.psp, rs)
end
function valence_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    if has_valence_density(el.psp)
        eval_psp_density_valence_fourier(el.psp, ps)
    else
        gaussian_valence_charge_density_fourier(el, ps)
    end
end
function core_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_density_core_fourier(el.psp, ps)
end

@abussy (Collaborator Author), Mar 10, 2026

I find the code duplication annoying here. I tried to define the Psp dispatch in a more compact way:

function local_potential_fourier(el::ElementPsp, p) 
    eval_psp_local_fourier(el.psp, p)
end

or with

function local_potential_fourier(el::ElementPsp, p::Union{T, AbstractVector{T}}) where {T <: Real}
    eval_psp_local_fourier(el.psp, p)
end

But in either case, I run into ambiguity issues with the generic Element fallback with signature:
local_potential_fourier(el::Element, ps::AbstractVector{T}) where {T}

I am happy to take suggestions.

Member

I see your problem. Would it not be a possible solution to pass all evaluations of this function on to el.psp, i.e. simply have functions of the form

core_charge_density_fourier(el::ElementPsp, p) = eval_psp_density_core_fourier(el.psp, p)

and then it's the job of PspHgh and consorts to provide both a version for p::AbstractVector and one for p::Real?

Member

See my review. Hope this works, if not, happy to think more about it.

Collaborator Author

That's one of the things I initially tried, but at the moment it does not work because of the @vectorize_element_function macro, which defines core_charge_density_fourier(el::Element, p::AbstractVector{T}).

Because ElementPsp <: Element but p::AbstractVector{T} is more specific than plain p, neither method is more specific than the other, hence the ambiguity.
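
The ambiguity can be reproduced in isolation (a standalone toy example, not DFTK code):

abstract type Element end
struct ElementPsp <: Element end

f(el::Element, ps::AbstractVector{T}) where {T <: Real} = "generic vectorized fallback"
f(el::ElementPsp, p) = "ElementPsp, scalar or vector"

f(ElementPsp(), [1.0])  # MethodError: neither method is more specific for this call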

I see a potential way out using another of your suggestions:

for fn in [:valence_charge_density_fourier, :gaussian_valence_charge_density_fourier, :core_charge_density_fourier, :local_potential_fourier, :local_potential_real]
    @eval begin
        function DFTK.$fn(el::Element, arg::AbstractVector{T}) where {T <: Real}
            arch = architecture(arg)
            to_device(arch, map(p -> $fn(el, p), to_cpu(arg)))
        end
    end
end

where I could also do a loop over all Elements except ElementPsp. I'll look into it.

@mfherbst (Member) left a comment

I think I agree with this mostly. We should see if we can remove some code duplication at the element level by fully deferring the vector -> map-over-single-p dispatch to the dispatches on the PSP structs.

Comment thread src/elements.jl Outdated
Comment thread src/pseudo/NormConservingPsp.jl Outdated
@abussy (Collaborator Author) commented Apr 23, 2026

I reorganized a bit, and I was able to reduce code duplication in elements.jl. Unfortunately, that comes at the cost of extra complication in the generation of vectorized code (an extra loop over Element types, excluding ElementPsp).

Comment thread src/elements.jl
end
function gaussian_valence_charge_density_fourier(el::Element, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    to_device(arch, map(p -> gaussian_valence_charge_density_fourier(el, p), to_cpu(ps)))
Member

I'm surprised this needs to be executed on the CPU. The implementation above looks so simple.

Collaborator Author

It should work on the GPU; I was on autopilot. I'll check and change it.

Comment thread src/elements.jl
# elementwise. This is GPU safe and generic. Performance critical functions should have their
# own GPU-optimized implementation. Note: explicit loop over Element types in order to avoid
# ambiguities with the specific ElementPsp functions.
for fn in [:gaussian_valence_charge_density_fourier, :core_charge_density_fourier,
Member

I trust you checked we still really need this after the simplifications in the ElementPsp interface you did most recently ... so while this is still cursed, it's probably the least evil solution.

But we should make sure that once new ElementXyz types are added, the tests break loudly until one adds the element type to the list below. Is this the case?

Collaborator Author

Yes, while it is more compact than the previous solution, it is absolutely necessary that the abstract Element is not used here (because of the ambiguity with the ElementPsp functions). If there was a way to just say "all elements except ElementPsp", we would be less error-prone when new element types are added. I do not know of such a way, though.

That being said, when adding a new element type, tests will not pass if the important vectorized functions are not provided.
