Optimizing XC instantiation for GPUs #1262
Conversation
```julia
rgrid = to_device(arch, @view psp.rgrid[1:psp.ircut])
r2_ρcore = to_device(arch, @view psp.r2_ρcore[1:psp.ircut])
```
I agree with what you do here. But I think it's worth having a small discussion as this does now emerge as a more general pattern. Some questions:
- Right now we implement the "single-`p`" version in all pseudos and then make vectorised versions for the cases where we want extra speedup on the GPU. Would it make more sense to go the other way and implement the vectorised version, then provide the "single-`p`" version as a fallback, i.e. just calling the vector-`p` version with `[p]`? My thinking is that it probably makes sense to do exactly the same thing (prefer the vectorised version for the GPU) for the HGH pseudos and for possible UPF pseudos we implement in the future. So it's probably the better pattern to make the vector version the thing you actually have to implement as a developer.
- What about the other functions that also take a single `p` right now? Should we not move those to vector versions as well, at the very least for consistency?
- In line with that, does it make sense to put `rgrid` and `r2_ρcore` onto the GPU from the start?
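The "vector-first" pattern proposed above can be sketched as follows. This is a hedged illustration with toy types and names (`ToyPsp`, `eval_local_fourier`, and the formula are all invented here, not DFTK's actual API):

```julia
# Sketch of the vector-first pattern: the vectorised method is what a pseudo
# implementation must provide, and the scalar method is a generic fallback
# that wraps the argument in a one-element vector.
abstract type AbstractToyPsp end

struct ToyPsp <: AbstractToyPsp
    rloc::Float64
end

# Primary, vectorised implementation (illustrative formula).
eval_local_fourier(psp::ToyPsp, ps::AbstractVector{<:Real}) =
    map(p -> exp(-(p * psp.rloc)^2 / 2), ps)

# Generic scalar fallback: call the vector version with [p] and unwrap.
eval_local_fourier(psp::AbstractToyPsp, p::Real) = only(eval_local_fourier(psp, [p]))
```

With this shape, new pseudopotential types only need to define the vectorised method, and the scalar variant comes for free.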
I think this is an important discussion, and here is my take on it:
On a conceptual level, I think it makes more sense to have a "single-p" version as a default, and a "vector-p" as a special case: it follows human thinking a bit more closely. That being said, optimized "vector-p" functions are many times more efficient on the GPU, and also have a strong potential for CPU shared memory parallelism (OMP threading style), if this becomes necessary.
I also agree that using the "vector-p" as a default would push future developments in the right direction from the start. It is however not straightforward to write a GPU-compatible map call, and it is unlikely to happen by chance. Since most people develop and test on the CPU, there is a real risk of introducing faulty code. In that regard, using the current pattern of:
DFTK.jl/src/pseudo/NormConservingPsp.jl, lines 89 to 93 in b3c1e9a
carries an explicit message regarding GPUs: this piece of code does not work (or has not been tested) on the GPU. Therefore, I personally slightly lean towards the "single-`p`" version as a default.
does it make sense to put `rgrid` and `r2_ρcore` onto the GPU from the start?
I think this would introduce a lot of complication: everything in the PSP code would need to work with GPU arrays. Performance wise, the data transfer towards the GPU is cheap, because arrays are small, and the PSP functions are rarely called (mostly basis instantiation, so 2-3 times per calculation).
there is a real risk of introducing faulty code.
Is this true? If people develop new pseudos, they are likely copying from existing implementations. If we do it right there, would it not either immediately work in a new implementation or reduce the effort in refactoring such that it works efficiently on the GPU? In the sense that in the long run it is easier to fix that faulty code than to refactor every function one by one so that there also is a vector version. If you think I'm wrong here and having the vector code around will not reduce refactoring effort, then I agree that because of the conceptual simplicity it makes sense to default to the single-`p` version.
In that regard, using the current pattern of:
Ok, but then we should use this consistently everywhere, with possibly a macro to avoid writing the same code 20 times (if it's not too cumbersome ... macros always have the danger of obfuscating). It's weird that the vector version only exists for some PSP functions and not others.
I think this would introduce a lot of complication:
That makes sense. Let's not do it for now.
Is this true ?
I think the risk exists, yes. For example, while working on #1163, I looked into vectorizing PspHgh:
Lines 110 to 135 in b3c1e9a
A function like the following works fine on the CPU, but fails on the GPU because instances of `Polynomial` are not isbits:
```julia
function eval_psp_local_fourier(psp::PspHgh, ps::AbstractVector{T}) where {T <: Real}
    map(ps) do p
        if p == 0
            zero(T)  # Compensating charge background
        else
            t = p * psp.rloc
            psp_local_polynomial(T, psp, t) * exp(-t^2 / 2) / t^2
        end
    end
end
```
I am sure there are ways to circumvent this, but I didn't spend the time because I didn't identify obvious performance bottlenecks.
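For context, here is a minimal illustration of the isbits constraint, using toy structs rather than the actual `Polynomials.Polynomial` type: GPU kernels can only capture bits-type data, so a struct wrapping a coefficient `Vector` (as `Polynomial` does internally) fails that check, while a tuple of coefficients passes:

```julia
# A struct holding a Vector is heap-backed and therefore not isbits; closures
# capturing it cannot be compiled into GPU kernels. A plain tuple of
# coefficients is isbits and works fine as a kernel argument.
struct PolyLike
    coeffs::Vector{Float64}     # heap allocation -> not isbits
end

struct TupleCoeffs
    coeffs::NTuple{3, Float64}  # plain bits -> isbits
end

isbitstype(PolyLike)     # false
isbitstype(TupleCoeffs)  # true
```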
It would probably be possible to rework all current PSP implementations to be vectorized by default in a GPU compatible way. If properly commented, these could also be used as a base for new Psp implementations, which would hopefully result in GPU compatible code from the start.
This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.
Regarding the Polynomial: This was implemented a long, long time ago for some idea that never really matured. So if this is the obstacle, it can be completely removed. In fact this would imply we can even remove the Polynomials dependency from DFTK completely, which is a good thing ... given that this rather obscure use case is the only place we need the package at all.
This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.
I agree with that, and I think we should indeed only do it if it's worth it. What is most important in my opinion is consistency across our API and something that works by default. I think for now it would be fine to just consistently introduce vector-based interfaces for all similar PSP function calls that fall back to the single-`p` versions. It's just a little weird to single out some functions over others, in my opinion.
I think somewhat independently is the question whether the vector version should be the default and the dispatch reverted. This we can discuss elsewhere (in an issue?).
Ok, sounds reasonable. I'll keep working on this PR and systematically introduce vectorized alternatives to the PSP functions. I'll try to reduce code duplication if I can, I'll see if some clever macro can be used. I'll also open an issue for the discussion on what should be the default.
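One possible shape for such a macro, sketched here purely hypothetically (the name `@vectorize_psp_function` and the design are invented for illustration, not DFTK's actual macro), generates the map-based vectorised fallback from a scalar method:

```julia
# Hypothetical macro generating a vectorised fallback for a scalar function.
# Given a function name, it defines a method taking an AbstractVector that
# simply maps the scalar method over the elements.
macro vectorize_psp_function(fn)
    quote
        function $(esc(fn))(psp, xs::AbstractVector{T}) where {T <: Real}
            map(x -> $(esc(fn))(psp, x), xs)
        end
    end
end

toy_eval(psp, x::Real) = 2x        # scalar implementation
@vectorize_psp_function toy_eval   # defines toy_eval(psp, ::AbstractVector)
```

Note the usual caveat from this thread still applies: a plain `map` fallback like this is only GPU-safe if the scalar function itself is GPU compatible.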
Thanks, perfect. If you prefer, we can also first merge this PR and #1265 or somehow arrange things differently.
I think it's ok to do it as part of this PR, and adapt #1265 afterwards. This way, things are clean from the start. If someone has an urgent need to perform lots of fast stress calculations in the near future, then we can merge first.
Don't think so, let's not rush.
Just a few thoughts here:
- The vector-`p`-as-default makes a lot of sense if using NumSBT or another FFT-based algorithm for the spherical Bessel transform. There, the single-`p` implementation would be calling the vector-`p` one and returning a single value (very bad, performance-wise).
- `Polynomials` will probably be needed for ultrasoft/PAW as far as I remember. There, the augmentation charges close to the core are often represented as polynomials.
We are starting to see this MPI failure a bit more consistently now. I'm a bit confused about it.
I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not.
I generalized the dual existence of scalar and vectorized functions for all the PSP functions. I chose to implement the vectorized functions at the lowest level of Psp.

Edit: concerning the failing MPI tests, it looks like the failure always happens in the same test.
```julia
# Vectorized versions of the above, specific implementation depending on the Psp type
function local_potential_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_local_fourier(el.psp, ps)
end
function local_potential_real(el::ElementPsp, rs::AbstractVector{T}) where {T <: Real}
    eval_psp_local_real(el.psp, rs)
end
function valence_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    if has_valence_density(el.psp)
        eval_psp_density_valence_fourier(el.psp, ps)
    else
        gaussian_valence_charge_density_fourier(el, ps)
    end
end
function core_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_density_core_fourier(el.psp, ps)
end
```
I find the code duplication annoying here. I tried to define the Psp dispatch in a more compact way:
```julia
function local_potential_fourier(el::ElementPsp, p)
    eval_psp_local_fourier(el.psp, p)
end
```
or with
```julia
function local_potential_fourier(el::ElementPsp, p::Union{T, AbstractVector{T}}) where {T <: Real}
    eval_psp_local_fourier(el.psp, p)
end
```
But in either case, I run into ambiguity issues with the generic Element fallback with signature:
```julia
local_potential_fourier(el::Element, ps::AbstractVector{T}) where {T}
```
I am happy to take suggestions.
I see your problem. Would it not be a possible solution to pass on all evaluations of this function to the `el.psp`, i.e. simply have functions of the form

```julia
core_charge_density_fourier(el::ElementPsp, p) = eval_psp_density_core_fourier(el.psp, p)
```

and then it's the job of the PspHgh and consorts to provide both a version for `p::AbstractVector` and `p::Real`?
See my review. Hope this works, if not, happy to think more about it.
That's one of the things I initially tried, but at the moment it does not work because of the `@vectorize_element_function` macro which defines `core_charge_density_fourier(el::Element, p::AbstractVector{T})`.
Because `ElementPsp <: Element`, but `p::AbstractVector{T}` is more specific than just `p`, there is an ambiguity.
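That ambiguity can be reproduced minimally with toy types (`ToyElement`/`ToyElementPsp` and `g` are stand-ins here, not the real DFTK hierarchy):

```julia
# One method is more specific in the first argument, the other in the second,
# so Julia cannot pick a winner and the call raises an ambiguity MethodError.
abstract type ToyElement end
struct ToyElementPsp <: ToyElement end

g(el::ToyElement, p::AbstractVector{<:Real}) = :generic_vector_fallback
g(el::ToyElementPsp, p) = :psp_forwarding

# g(ToyElementPsp(), [1.0]) is ambiguous and throws a MethodError.
```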
I see a potential way out using another of your suggestions:
```julia
for fn in [:valence_charge_density_fourier, :gaussian_valence_charge_density_fourier,
           :core_charge_density_fourier, :local_potential_fourier, :local_potential_real]
    @eval begin
        function DFTK.$fn(el::Element, arg::AbstractVector{T}) where {T <: Real}
            arch = architecture(arg)
            to_device(arch, map(p -> $fn(el, p), to_cpu(arg)))
        end
    end
end
```
where I could also do a loop over all Elements except ElementPsp. I'll look into it.
mfherbst left a comment
I think I agree with this mostly. We should see if we can remove some code duplication at the element level by deferring the vector -> map over single elements dispatch fully towards the dispatches on PSP structs.
I reorganized a bit, and I was able to reduce the code duplication.
```julia
function gaussian_valence_charge_density_fourier(el::Element, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    to_device(arch, map(p -> gaussian_valence_charge_density_fourier(el, p), to_cpu(ps)))
end
```
I'm surprised this needs to be executed on the CPU. The implementation above looks so simple.
It should work on the GPU, I was on auto-pilot. I'll check and change it.
```julia
# elementwise. This is GPU safe and generic. Performance critical functions should have their
# own GPU-optimized implementation. Note: explicit loop over Element types in order to avoid
# ambiguities with the specific ElementPsp functions.
for fn in [:gaussian_valence_charge_density_fourier, :core_charge_density_fourier,
```
I trust you checked we still really need this after the simplifications in the ElementPsp interface you did most recently ... so while this is still cursed, it's probably the least evil solution.
But we should make sure that once new ElementXyz types are added, the tests break loudly until one adds the element type to the list below. Is this the case ?
Yes, while it is more compact than the previous solution, it is absolutely necessary that the abstract `Element` is not used here (because of ambiguity with the `ElementPsp` functions). If there was a way to just say "all elements except `ElementPsp`", we would be less error-prone when new element types are added. I do not know of such a way though.
That being said, when adding a new element type, tests will not pass if the important vectorized functions are not provided.
This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.
Particular care was given to the `hankel` function in `workarounds/forwarddiff_rules.jl`, in order to make it GPU compatible. A side effect is the suppression of allocations, which also marginally (a few %) speeds up XC instantiation on the CPU.

For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):
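To sketch what an allocation-free radial transform of this kind can look like, here is a hedged illustration (the function `hankel0`, the l = 0 restriction, and the unit quadrature weights are all assumptions for this sketch; DFTK's actual `hankel` differs): the quadrature is accumulated in a scalar loop instead of building intermediate arrays.

```julia
# Sketch: l = 0 spherical-Bessel (Hankel-type) transform on a radial grid,
#   sum_i r2f[i] * sin(p * r[i]) / (p * r[i]),
# accumulated without allocating any temporary arrays. Unit quadrature
# weights are assumed for simplicity; a real implementation would fold in
# the integration weights of the radial grid.
function hankel0(rgrid::AbstractVector, r2f::AbstractVector, p::Real)
    acc = zero(promote_type(eltype(rgrid), eltype(r2f), typeof(p)))
    @inbounds for i in eachindex(rgrid, r2f)
        x = p * rgrid[i]
        acc += r2f[i] * (iszero(x) ? one(x) : sin(x) / x)  # sinc handles p = 0
    end
    acc
end
```

Because the loop body only touches scalars and `isbits` data, a kernel of this shape is also straightforward to map over a vector of `p` values on the GPU.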