Optimizing XC instantiation for GPUs #1262
Conversation
```julia
rgrid = to_device(arch, @view psp.rgrid[1:psp.ircut])
r2_ρcore = to_device(arch, @view psp.r2_ρcore[1:psp.ircut])
```
I agree with what you do here. But I think it's worth having a small discussion as this does now emerge as a more general pattern. Some questions:
- Right now we implement the "single-`p`" version in all pseudos and then make vectorised versions for the cases where we want extra speedup on the GPU. Would it make more sense to go the other way and implement the vectorised version, then provide the "single-`p`" version as a fallback, i.e. just calling the vector-`p` version with `[p]`? My thinking is that it probably makes sense to do exactly the same thing (prefer the vectorised version for the GPU) for the HGH pseudos and for possible UPF pseudos we implement in the future. So it's probably the better pattern to make the vector version the thing you actually have to implement as a developer.
- What about the other functions that also take a single `p` right now? Should we not move those to vector versions as well, at the very least for consistency?
- In line with that, does it make sense to put `rgrid` and `r2_ρcore` onto the GPU from the start?
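The "vector-first" pattern proposed above can be sketched as follows. This is a hedged illustration with toy types and names (`ToyPsp`, `eval_local_fourier`, and the formula are all invented here, not DFTK's actual API):

```julia
# Sketch of the vector-first pattern: the vectorised method is what a pseudo
# implementation must provide, and the scalar method is a generic fallback
# that wraps the argument in a one-element vector.
abstract type AbstractToyPsp end

struct ToyPsp <: AbstractToyPsp
    rloc::Float64
end

# Primary, vectorised implementation (illustrative formula).
eval_local_fourier(psp::ToyPsp, ps::AbstractVector{<:Real}) =
    map(p -> exp(-(p * psp.rloc)^2 / 2), ps)

# Generic scalar fallback: call the vector version with [p] and unwrap.
eval_local_fourier(psp::AbstractToyPsp, p::Real) = only(eval_local_fourier(psp, [p]))
```

With this shape, new pseudopotential types only need to define the vectorised method, and the scalar variant comes for free.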
I think this is an important discussion, and here is my take on it:
On a conceptual level, I think it makes more sense to have a "single-p" version as a default, and a "vector-p" as a special case: it follows human thinking a bit more closely. That being said, optimized "vector-p" functions are many times more efficient on the GPU, and also have a strong potential for CPU shared memory parallelism (OMP threading style), if this becomes necessary.
I also agree that using the "vector-p" as a default would push future developments in the right direction from the start. It is however not straightforward to write a GPU-compatible map call, and it is unlikely to happen by chance. Since most people develop and test on the CPU, there is a real risk of introducing faulty code. In that regard, using the current pattern of:
DFTK.jl/src/pseudo/NormConservingPsp.jl, lines 89 to 93 in b3c1e9a
carries an explicit message regarding GPUs: this piece of code does not work (or has not been tested) on the GPU. Therefore, I personally slightly lean towards the "single-`p`" version as a default.
does it make sense to put `rgrid` and `r2_ρcore` onto the GPU from the start?
I think this would introduce a lot of complication: everything in the PSP code would need to work with GPU arrays. Performance wise, the data transfer towards the GPU is cheap, because arrays are small, and the PSP functions are rarely called (mostly basis instantiation, so 2-3 times per calculation).
there is a real risk of introducing faulty code.
Is this true? If people develop new pseudos, they are likely copying from existing implementations. If we do it right there, would it not either immediately work in a new implementation or reduce the effort in refactoring such that it works efficiently on the GPU? In the sense that in the long run it is easier to fix that faulty code than to refactor every function one by one so that there also is a vector version. If you think I'm wrong here and having the vector code around will not reduce refactoring effort, then I agree that because of the conceptual simplicity it makes sense to default to the single-`p` version.
In that regard, using the current pattern of:
Ok, but then we should use this consistently everywhere, with possibly a macro to avoid writing the same code 20 times (if it's not too cumbersome ... macros always have the danger of obfuscating). It's weird that the vector version only exists for some PSP functions and not others.
I think this would introduce a lot of complication:
That makes sense. Let's not do it for now.
Is this true ?
I think the risk exists, yes. For example, while working on #1163, I looked into vectorizing PspHgh:
Lines 110 to 135 in b3c1e9a
A function like the following works fine on the CPU, but fails on the GPU because instances of `Polynomial` are not isbits:
```julia
function eval_psp_local_fourier(psp::PspHgh, ps::AbstractVector{T}) where {T <: Real}
    map(ps) do p
        if p == 0
            zero(T)  # Compensating charge background
        else
            t = p * psp.rloc
            psp_local_polynomial(T, psp, t) * exp(-t^2 / 2) / t^2
        end
    end
end
```
I am sure there are ways to circumvent this, but I didn't spend the time because I didn't identify obvious performance bottlenecks.
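For context, here is a minimal illustration of the isbits constraint, using toy structs rather than the actual `Polynomials.Polynomial` type: GPU kernels can only capture bits-type data, so a struct wrapping a coefficient `Vector` (as `Polynomial` does internally) fails that check, while a tuple of coefficients passes:

```julia
# A struct holding a Vector is heap-backed and therefore not isbits; closures
# capturing it cannot be compiled into GPU kernels. A plain tuple of
# coefficients is isbits and works fine as a kernel argument.
struct PolyLike
    coeffs::Vector{Float64}     # heap allocation -> not isbits
end

struct TupleCoeffs
    coeffs::NTuple{3, Float64}  # plain bits -> isbits
end

isbitstype(PolyLike)     # false
isbitstype(TupleCoeffs)  # true
```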
It would probably be possible to rework all current PSP implementations to be vectorized by default in a GPU compatible way. If properly commented, these could also be used as a base for new Psp implementations, which would hopefully result in GPU compatible code from the start.
This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.
Regarding the Polynomial: This was implemented a long, long time ago for some idea that never really matured. So if this is the obstacle, it can be completely removed. In fact this would imply we can even remove the Polynomials dependency from DFTK completely, which is a good thing ... given that this rather obscure use case is the only place we need the package at all.
This is a lot of work, and a major refactoring of the code. I am willing to commit to it, but I'd rather have everybody on board with it before I do.
I agree with that, and I think we should indeed only do it if it's worth it. What is most important in my opinion is consistency across our API and something that works by default. I think for now it would be fine to just consistently introduce vector-based interfaces for all similar PSP function calls that fall back to the single-`p` versions. It's just a little weird to single out some functions over others, in my opinion.
I think somewhat independently is the question whether the vector version should be the default and the dispatch reverted. This we can discuss elsewhere (in an issue?).
Ok, sounds reasonable. I'll keep working on this PR and systematically introduce vectorized alternatives to the PSP functions. I'll try to reduce code duplication if I can, I'll see if some clever macro can be used. I'll also open an issue for the discussion on what should be the default.
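One possible shape for such a macro, sketched here purely hypothetically (the name `@vectorize_psp_function` and the design are invented for illustration, not DFTK's actual macro), generates the map-based vectorised fallback from a scalar method:

```julia
# Hypothetical macro generating a vectorised fallback for a scalar function.
# Given a function name, it defines a method taking an AbstractVector that
# simply maps the scalar method over the elements.
macro vectorize_psp_function(fn)
    quote
        function $(esc(fn))(psp, xs::AbstractVector{T}) where {T <: Real}
            map(x -> $(esc(fn))(psp, x), xs)
        end
    end
end

toy_eval(psp, x::Real) = 2x        # scalar implementation
@vectorize_psp_function toy_eval   # defines toy_eval(psp, ::AbstractVector)
```

Note the usual caveat from this thread still applies: a plain `map` fallback like this is only GPU-safe if the scalar function itself is GPU compatible.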
Thanks, perfect. If you prefer, we can also first merge this PR and #1265 or somehow arrange things differently.
I think it's ok to do it as part of this PR, and adapt #1265 afterwards. This way, things are clean from the start. If someone has an urgent need to perform lots of fast stress calculations in the near future, then we can merge first.
Don't think so, let's not rush.
Just a few thoughts here:
- The vector-`p`-as-default makes a lot of sense if using NumSBT or another FFT-based algorithm for the spherical Bessel transform. There, the single-`p` implementation would be calling the vector-`p` one and returning a single value (very bad, performance-wise).
- `Polynomials` will probably be needed for ultrasoft/PAW as far as I remember. There, the augmentation charges close to the core are often represented as polynomials.
We are starting to see this MPI failure a bit more consistently now. I'm a bit confused about it.
I am also utterly confused. It certainly started happening after the extension of the MPI tests (#1248). It is hard to pin down, but I'll try and find out if the crash is systematically triggered by the same test or not.
I generalized the dual existence of scalar and vectorized functions for all the PSP functions. I chose to implement the vectorized functions at the lowest level of Psp.

Edit: concerning the failing MPI tests, it looks like the failure always happens in the same test.
```julia
# Vectorized versions of the above, specific implementation depending on the Psp type
function local_potential_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_local_fourier(el.psp, ps)
end
function local_potential_real(el::ElementPsp, rs::AbstractVector{T}) where {T <: Real}
    eval_psp_local_real(el.psp, rs)
end
function valence_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    if has_valence_density(el.psp)
        eval_psp_density_valence_fourier(el.psp, ps)
    else
        gaussian_valence_charge_density_fourier(el, ps)
    end
end
function core_charge_density_fourier(el::ElementPsp, ps::AbstractVector{T}) where {T <: Real}
    eval_psp_density_core_fourier(el.psp, ps)
end
```
I find the code duplication annoying here. I tried to define the Psp dispatch in a more compact way:
```julia
function local_potential_fourier(el::ElementPsp, p)
    eval_psp_local_fourier(el.psp, p)
end
```
or with
```julia
function local_potential_fourier(el::ElementPsp, p::Union{T, AbstractVector{T}}) where {T <: Real}
    eval_psp_local_fourier(el.psp, p)
end
```
But in either case, I run into ambiguity issues with the generic Element fallback with signature:
```julia
local_potential_fourier(el::Element, ps::AbstractVector{T}) where {T}
```
I am happy to take suggestions.
I see your problem. Would it not be a possible solution to pass on all evaluations of this function to the `el.psp`, i.e. simply have functions of the form

```julia
core_charge_density_fourier(el::ElementPsp, p) = eval_psp_density_core_fourier(el.psp, p)
```

and then it's the job of the PspHgh and consorts to provide both a version for `p::AbstractVector` and `p::Real`?
See my review. Hope this works, if not, happy to think more about it.
That's one of the things I initially tried, but at the moment it does not work because of the `@vectorize_element_function` macro which defines `core_charge_density_fourier(el::Element, p::AbstractVector{T})`.
Because `ElementPsp <: Element`, but `p::AbstractVector{T}` is more specific than just `p`, there is an ambiguity.
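That ambiguity can be reproduced minimally with toy types (`ToyElement`/`ToyElementPsp` and `g` are stand-ins here, not the real DFTK hierarchy):

```julia
# One method is more specific in the first argument, the other in the second,
# so Julia cannot pick a winner and the call raises an ambiguity MethodError.
abstract type ToyElement end
struct ToyElementPsp <: ToyElement end

g(el::ToyElement, p::AbstractVector{<:Real}) = :generic_vector_fallback
g(el::ToyElementPsp, p) = :psp_forwarding

# g(ToyElementPsp(), [1.0]) is ambiguous and throws a MethodError.
```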
I see a potential way out using another of your suggestions:
```julia
for fn in [:valence_charge_density_fourier, :gaussian_valence_charge_density_fourier,
           :core_charge_density_fourier, :local_potential_fourier, :local_potential_real]
    @eval begin
        function DFTK.$fn(el::Element, arg::AbstractVector{T}) where {T <: Real}
            arch = architecture(arg)
            to_device(arch, map(p -> $fn(el, p), to_cpu(arg)))
        end
    end
end
```
where I could also do a loop over all Elements except ElementPsp. I'll look into it.
mfherbst left a comment
I think I agree with this mostly. We should see if we can remove some code duplication at the element level by deferring the vector -> map over single elements dispatch fully towards the dispatches on PSP structs.
I reorganized a bit, and I was able to reduce the code duplication.
```julia
function gaussian_valence_charge_density_fourier(el::Element, ps::AbstractVector{T}) where {T <: Real}
    arch = architecture(ps)
    to_device(arch, map(p -> gaussian_valence_charge_density_fourier(el, p), to_cpu(ps)))
end
```
I'm surprised this needs to be executed on the CPU. The implementation above looks so simple.
It should work on the GPU, I was on auto-pilot. I'll check and change it.
```julia
# elementwise. This is GPU safe and generic. Performance critical functions should have their
# own GPU-optimized implementation. Note: explicit loop over Element types in order to avoid
# ambiguities with the specific ElementPsp functions.
for fn in [:gaussian_valence_charge_density_fourier, :core_charge_density_fourier,
```
I trust you checked we still really need this after the simplifications in the ElementPsp interface you did most recently ... so while this is still cursed, it's probably the least evil solution.
But we should make sure that once new ElementXyz types are added, the tests break loudly until one adds the element type to the list below. Is this the case ?
Yes, while it is more compact than the previous solution, it is absolutely necessary that the abstract `Element` is not used here (because of ambiguity with the `ElementPsp` functions). If there was a way to just say "all elements except `ElementPsp`", we would be less error-prone when new element types are added. I do not know of such a way though.
That being said, when adding a new element type, tests will not pass if the important vectorized functions are not provided.
This PR optimizes the instantiation of the XC term on the GPU, with a particularly large impact when Duals are involved (stress and response calculations). It follows the same design patterns as #1163.
Particular care was given to the `hankel` function in `workarounds/forwarddiff_rules.jl`, in order to make it GPU compatible. A side effect is the suppression of allocations, which also marginally (a few %) speeds up XC instantiation on the CPU.

For illustration, here are XC instantiation timings for the input below (stress of a 3x3x3 rattled aluminium supercell):
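To sketch what an allocation-free radial transform of this kind can look like, here is a hedged illustration (the function `hankel0`, the l = 0 restriction, and the unit quadrature weights are all assumptions for this sketch; DFTK's actual `hankel` differs): the quadrature is accumulated in a scalar loop instead of building intermediate arrays.

```julia
# Sketch: l = 0 spherical-Bessel (Hankel-type) transform on a radial grid,
#   sum_i r2f[i] * sin(p * r[i]) / (p * r[i]),
# accumulated without allocating any temporary arrays. Unit quadrature
# weights are assumed for simplicity; a real implementation would fold in
# the integration weights of the radial grid.
function hankel0(rgrid::AbstractVector, r2f::AbstractVector, p::Real)
    acc = zero(promote_type(eltype(rgrid), eltype(r2f), typeof(p)))
    @inbounds for i in eachindex(rgrid, r2f)
        x = p * rgrid[i]
        acc += r2f[i] * (iszero(x) ? one(x) : sin(x) / x)  # sinc handles p = 0
    end
    acc
end
```

Because the loop body only touches scalars and `isbits` data, a kernel of this shape is also straightforward to map over a vector of `p` values on the GPU.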