
Commit ff410ee

lkdvos and Jutho authored
[Feature] buffer- and stack-based allocator strategies (#251)
* add stack-based `allocator` interface functions
* small formatting changes
* add checkpoint postprocessor
* rewrite Bumper extension in terms of new functions
* Add native BufferAllocator
* add tests
* add note on thread-safety
* move buffer growth to allocation instead of reset
* update docs
* Update src/implementation/allocator.jl
* avoid reallocating if the size hasn't changed
* refactor buffer size determination
* Update src/implementation/allocator.jl
* small fix
* rewrite with `resize!`
* try to address comments on the docs

Co-authored-by: Jutho <Jutho@users.noreply.github.com>
1 parent ae9d80f commit ff410ee

File tree

12 files changed: +304 −27 lines

docs/src/man/backends.md

Lines changed: 24 additions & 1 deletion
@@ -129,11 +129,14 @@ In particular, this means that the backend will be selected **first**, while onl
 ```@docs
 TensorOperations.DefaultAllocator
 TensorOperations.ManualAllocator
+TensorOperations.BufferAllocator
 ```

 By default, the `DefaultAllocator` is used, which uses Julia's built-in memory management system.
 Optionally, it can be useful to use the `ManualAllocator`, as the manual memory management reduces the pressure on the garbage collector.
 In particular in multi-threaded applications, this can sometimes lead to a significant performance improvement.
+On the other hand, for repeated (but thread-safe!) `@tensor` calls, the `BufferAllocator` is a lightweight slab allocator that pre-allocates a buffer for temporaries, falling back to Julia's default if needed.
+Upon repeated use it will automatically resize the buffer to accommodate the requested temporaries, avoiding repeated reallocation.

 Finally, users can also opt to use the `Bumper.jl` system, which pre-allocates a slab of memory that can be re-used afterwards.
 This is available through a package extension for `Bumper`.
@@ -165,4 +168,24 @@ TensorOperations.CUDAAllocator

 Users can also define their own allocators, to facilitate experimentation with new implementations.
 Here, no restriction is made on the type of the allocator, and any object can be passed as an allocator.
-The required implemented methods are [`tensoralloc`](@ref) and [`tensorfree!`](@ref).
+The core methods that can be customized for an allocator are:
+
+* [`tensoralloc`](@ref): Allocate a tensor of a given type and structure. This method receives a flag indicating whether the output lives only within an `@tensor` block or persists outside of it. Temporary tensors can be allocated from internal buffers or pools, while permanent tensors should use a standard allocation strategy.
+* [`tensorfree!`](@ref): Explicitly free a tensor, if applicable. For custom allocators that manage internal pools or buffers, this can be used to track when temporaries are no longer needed.
+
+Here we are guaranteeing that the code generated by `@tensor` will contain exactly one matching call to `tensorfree!` for every `tensoralloc` call that has the temporary flag, as soon as the temporary object is no longer needed.
+
+!!! warning
+    Special care should be taken when using these functions directly, as it is then up to the user to ensure every allocated temporary is freed appropriately.
+    Invalid usage such as not freeing a tensor, freeing it twice, or freeing a tensor that was not allocated as temporary is considered undefined behavior,
+    and can lead to memory leaks and/or segmentation faults.
+
+For allocators that manage reusable buffers or maintain state across multiple contractions, the following helper methods can be useful to indicate that it is safe to free all objects that were allocated after a given checkpoint.
+
+* [`allocator_checkpoint!`](@ref): Save the current state of the allocator (e.g., the current offset in a buffer). This can be called before a sequence of tensor operations to capture the allocation state.
+* [`allocator_reset!`](@ref): Restore the allocator to a previously saved checkpoint, effectively releasing all allocations made since the checkpoint was taken.
+
+Here we are guaranteeing that every created checkpoint will be restored, and all temporary allocations that are enclosed within this scope will no longer be accessed.
+Additionally, if multiple checkpoints are created, they will be restored in a first-in-last-out order.
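
The documentation above stays abstract; a short usage sketch (hypothetical tensor names and sizes, requires the TensorOperations package) could look like:

```julia
using TensorOperations

# reuse a single BufferAllocator across many identical contractions;
# after the first few calls the internal buffer has grown large enough
# and the temporaries no longer go through Julia's GC
A = randn(10, 10, 10)
B = randn(10, 10, 10)
alloc = TensorOperations.BufferAllocator()

for _ in 1:100
    @tensor allocator = alloc C[i, j] := A[i, k, l] * B[l, k, j]
end
```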

docs/src/man/interface.md

Lines changed: 15 additions & 0 deletions
@@ -94,6 +94,21 @@ TensorOperations.tensoralloc_add
 TensorOperations.tensoralloc_contract
 ```

+For allocators that manage reusable buffers or maintain state across multiple tensor
+operations, the following helper functions provide an interface for managing allocation
+regions:
+
+```@docs
+TensorOperations.allocator_checkpoint!
+TensorOperations.allocator_reset!
+```
+
+These can be used to save the current allocator state before a region of code and restore it
+afterwards, where it is guaranteed that all temporary objects allocated within that region
+are released. This is particularly useful for nested tensor operations or when repeatedly
+evaluating the same tensor contraction with an allocator like
+[`TensorOperations.BufferAllocator`](@ref).
+
 ## Utility

 Some of the optional keywords for `@tensor` can be accessed only after implementing the
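
As a sketch of how these interface functions compose (using the names exported by this commit; the element type and shape here are made up):

```julia
using TensorOperations

alloc = TensorOperations.BufferAllocator()

# save the allocator state before a region that creates temporaries
cp = allocator_checkpoint!(alloc)

# request a temporary (istemp = Val(true)) through the allocator interface
tmp = tensoralloc(Array{Float64, 2}, (4, 4), Val(true), alloc)

# ... use `tmp` ...

# restore the state: everything allocated since `cp` may no longer be accessed
allocator_reset!(alloc, cp)
```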

ext/TensorOperationsBumperExt.jl

Lines changed: 7 additions & 19 deletions
@@ -3,8 +3,10 @@ module TensorOperationsBumperExt
 using TensorOperations
 using Bumper

+const BumperBuffer = Union{SlabBuffer, AllocBuffer}
+
 function TensorOperations.tensoralloc(
-        ::Type{A}, structure, ::Val{istemp}, buf::Union{SlabBuffer, AllocBuffer}
+        ::Type{A}, structure, ::Val{istemp}, buf::BumperBuffer
     ) where {A <: AbstractArray, istemp}
     # TODO: remove the `ndims` check if this is fixed in Bumper / StrideArraysCore
     if istemp && ndims(A) > 0
@@ -14,37 +16,23 @@ function TensorOperations.tensoralloc(
     end
 end

-function TensorOperations.blas_contract!(
-        C, A, pA, B, pB, pAB, α, β,
-        backend, allocator::Union{SlabBuffer, AllocBuffer}
-    )
-    @no_escape allocator begin
-        C = Base.@invoke TensorOperations.blas_contract!(
-            C, A, pA, B, pB, pAB, α, β, backend, allocator::Any
-        )
-    end
-    return C
-end
+TensorOperations.allocator_checkpoint!(alloc::BumperBuffer) = Bumper.checkpoint_save(alloc)
+TensorOperations.allocator_reset!(::BumperBuffer, cp) = Bumper.checkpoint_restore!(cp)

 function TensorOperations._butensor(src, ex...)
     buf_sym = gensym("buffer")
-    cp_sym = gensym("checkpoint")
-    res_sym = gensym("result")

     # TODO: there is no check for doubled tensor kwargs
     newex = quote
         $buf_sym = $(Expr(:call, GlobalRef(Bumper, :default_buffer)))
-        $cp_sym = $(Expr(:call, GlobalRef(Bumper, :checkpoint_save), buf_sym))
-        $res_sym = $(
+        $(
             Expr(
                 :macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
                 src, :(allocator = $buf_sym), ex...
             )
         )
-        $(Expr(:call, GlobalRef(Bumper, :checkpoint_restore!), cp_sym))
-        $res_sym
     end
-    return return Base.remove_linenums!(newex)
+    return Base.remove_linenums!(newex)
 end

 end
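
With this extension loaded, the generic checkpoint API simply delegates to Bumper's own checkpointing. A hypothetical direct use (the buffer size is arbitrary):

```julia
using Bumper, TensorOperations

buf = AllocBuffer(2^20)   # a 1 MiB bump buffer from Bumper.jl

cp = TensorOperations.allocator_checkpoint!(buf)   # forwards to Bumper.checkpoint_save
# ... run `@tensor allocator = buf ...` blocks here ...
TensorOperations.allocator_reset!(buf, cp)         # forwards to Bumper.checkpoint_restore!
```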

src/TensorOperations.jl

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ export @cutensor, @butensor
 export ncon
 export tensorcopy!, tensoradd!, tensortrace!, tensorcontract!, tensorproduct!, tensorscalar
 export tensorcopy, tensoradd, tensortrace, tensorcontract, tensorproduct, scalartype
-export tensoralloc, tensorfree!
+export tensoralloc, tensorfree!, allocator_checkpoint!, allocator_reset!

 export IndexTuple, Index2Tuple, linearize

src/implementation/allocator.jl

Lines changed: 104 additions & 0 deletions
@@ -41,10 +41,48 @@ block, which will thus still be managed using Julia's GC. The other tensors will
 """
 struct ManualAllocator end

+"""
+    BufferAllocator(; sizehint = 0)
+
+Allocator that uses a pre-allocated buffer for storing temporary tensors.
+When the buffer is full, the allocator falls back on Julia's default allocation mechanism
+to create temporary tensors, but keeps track of how much additional memory is required.
+When the buffer is fully reset, the buffer is automatically resized to ensure subsequent
+contractions will now fit in the buffer.
+
+!!! warning
+    This allocator is **not** thread-safe, and it is the user's responsibility to avoid running
+    the same allocator on concurrent jobs. For concurrent usage, it is recommended to either
+    manually use a separate buffer per task, or leverage Bumper.jl through [`@butensor`](@ref)
+    instead.
+"""
+mutable struct BufferAllocator{Storage}
+    buffer::Storage
+    offset::UInt
+    max_offset::UInt
+
+    function BufferAllocator{Storage}(; sizehint::Integer = 0) where {Storage}
+        T = eltype(Storage)
+        (isbitstype(T) && sizeof(T) == 1) ||
+            throw(ArgumentError("Buffer should have elements that take up a single byte."))
+        n = _buffersz(sizehint)
+        return new{Storage}(Storage(undef, n), 0, 0)
+    end
+end
+
+const DefaultStorageType = @static isdefined(Core, :Memory) ? Memory{UInt8} : Vector{UInt8}
+BufferAllocator(; kwargs...) = BufferAllocator{DefaultStorageType}(; kwargs...)
+
+# allocate buffers in sizes that are powers of 2
+_buffersz(x::Integer) = iszero(x) ? x : Base.nextpow(2, x)
+
 # ------------------------------------------------------------------------------------------
 # Generic implementation
 # ------------------------------------------------------------------------------------------
+
+# function that mimics the operations that are applied to the scalars during contraction
 tensorop(args...) = +(*(args...), *(args...))
+
 """
     promote_contract(args...)
@@ -172,6 +210,7 @@ tensorfree!(C, allocator = DefaultAllocator()) = nothing
 # ------------------------------------------------------------------------------------------
 # ManualAllocator implementation
 # ------------------------------------------------------------------------------------------
+
 function tensoralloc(
         ::Type{A}, structure, ::Val{istemp}, ::ManualAllocator
     ) where {A <: AbstractArray, istemp}
@@ -186,3 +225,68 @@ function tensorfree!(C::PtrArray, ::ManualAllocator)
     free(C)
     return nothing
 end
+
+# ------------------------------------------------------------------------------------------
+# BufferAllocator implementation
+# ------------------------------------------------------------------------------------------
+
+# length in bytes
+Base.length(buffer::BufferAllocator) = length(buffer.buffer)
+Base.isempty(buffer::BufferAllocator) = iszero(buffer.offset)
+Base.pointer(buffer::BufferAllocator) = pointer(buffer.buffer)
+Base.pointer(buffer::BufferAllocator, offset) = pointer(buffer) + offset
+
+Base.empty!(buffer::BufferAllocator) = (buffer.offset = 0; buffer)
+
+function Base.resize!(buffer::BufferAllocator, n::Integer)
+    isempty(buffer) || error("Cannot resize a buffer that still contains elements")
+    n = _buffersz(n)
+    n == length(buffer) || (buffer.buffer = similar(buffer.buffer, n))
+    return buffer
+end
+function Base.resize!(buffer::BufferAllocator{<:Vector}, n::Integer)
+    isempty(buffer) || error("Cannot resize a buffer that still contains elements")
+    n = _buffersz(n)
+    n == length(buffer) || resize!(buffer.buffer, n)
+    return buffer
+end
+
+function Base.sizehint!(buffer::BufferAllocator, n::Integer; shrink::Bool = false)
+    buffer.max_offset = shrink ? n : max(buffer.max_offset, n)
+    return buffer
+end
+
+# how many bytes should be reserved
+allocation_size(::Type{T}, structure::Base.Dims) where {T} = prod(structure) * sizeof(T)
+
+function tensoralloc(
+        ::Type{A}, structure, ::Val{istemp}, buffer::BufferAllocator
+    ) where {A <: AbstractArray, istemp}
+    if istemp
+        T = eltype(A)
+        offset = buffer.offset + allocation_size(T, structure)
+        sizehint!(buffer, offset)

+        # grow buffer if empty
+        isempty(buffer) && resize!(buffer, buffer.max_offset)
+
+        # Use pointer if there is enough space
+        if offset < length(buffer)
+            ptr = convert(Ptr{T}, pointer(buffer, buffer.offset))
+            buffer.offset = offset
+            return Base.unsafe_wrap(Array, ptr, structure)
+        end
+    end
+
+    # Allocate default if not
+    return A(undef, structure)
+end
+
+allocator_checkpoint!(buffer::BufferAllocator) = buffer.offset
+
+function allocator_reset!(buffer::BufferAllocator, checkpoint)
+    checkpoint ≤ buffer.offset ||
+        throw(ArgumentError("Invalid checkpoint: `allocator_reset!` has to be called in reverse order on saved checkpoints"))
+    buffer.offset = checkpoint
+    return buffer
+end
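
The growth and reset logic above can be illustrated with a self-contained toy model of the bookkeeping (pure Julia, no pointers; `ToyBuffer` and the `toy_*` names are illustrative, not part of the package):

```julia
# toy model of BufferAllocator's offset bookkeeping
mutable struct ToyBuffer
    len::Int         # current capacity in bytes
    offset::Int      # bytes handed out so far
    max_offset::Int  # high-water mark used for the next resize
end
ToyBuffer() = ToyBuffer(0, 0, 0)

# same power-of-2 rounding as `_buffersz`
toy_buffersz(x) = iszero(x) ? 0 : nextpow(2, x)

function toy_alloc!(b::ToyBuffer, nbytes)
    newoffset = b.offset + nbytes
    b.max_offset = max(b.max_offset, newoffset)            # mirrors sizehint!
    b.offset == 0 && (b.len = toy_buffersz(b.max_offset))  # grow only while empty
    newoffset < b.len || return :fallback                  # does not fit: default allocation
    b.offset = newoffset
    return :buffer
end

toy_checkpoint(b::ToyBuffer) = b.offset
function toy_reset!(b::ToyBuffer, cp)
    cp <= b.offset || error("checkpoints must be restored in reverse order")
    b.offset = cp
    return b
end
```

Running a small sequence shows the fallback-then-regrow behavior: a 100-byte request fits in the freshly sized 128-byte buffer, a second 100-byte request falls back, and after a reset the buffer regrows to 256 bytes so both fit.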

src/implementation/blascontract.jl

Lines changed: 5 additions & 3 deletions
@@ -14,12 +14,14 @@ function blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
         TupleTools.getindices(indCinoBA, tpAB[1]),
         TupleTools.getindices(indCinoBA, tpAB[2]),
     )
-
+    cp = allocator_checkpoint!(allocator)
     if contract_memcost(C, A, pA, B, pB, pAB) <= contract_memcost(C, B, rpB, A, rpA, rpAB)
-        return _blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
+        C = _blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
     else
-        return _blas_contract!(C, B, rpB, A, rpA, rpAB, α, β, backend, allocator)
+        C = _blas_contract!(C, B, rpB, A, rpA, rpAB, α, β, backend, allocator)
     end
+    allocator_reset!(allocator, cp)
+    return C
 end
 # specialised fast path for matrix matrix multiplication
 function blas_contract!(
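
The edit brackets the branch between a checkpoint and a reset, so any permutation temporaries created inside `_blas_contract!` are released at once. This recurring pattern could be factored as a hypothetical helper (not part of the package):

```julia
using TensorOperations

# hypothetical convenience wrapper; `f` is a closure that may allocate temporaries
function with_checkpoint(f, allocator)
    cp = allocator_checkpoint!(allocator)
    result = f()
    allocator_reset!(allocator, cp)
    return result
end
```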

src/indexnotation/postprocessors.jl

Lines changed: 17 additions & 0 deletions
@@ -113,3 +113,20 @@ function insertallocator(ex, allocator)
         )
     )
 end
+
+# TODO: this is currently only marking a single checkpoint per `@tensor` call.
+"""
+    insertcheckpoints(ex, allocator)
+
+Insert the [`allocator_checkpoint!`](@ref) and [`allocator_reset!`](@ref) calls before and after tensor contractions.
+"""
+function insertcheckpoints(ex, allocator)
+    cp = gensym("checkpoint")
+    res = gensym("result")
+    return quote
+        $cp = $(GlobalRef(TensorOperations, :allocator_checkpoint!))($allocator)
+        $res = $ex
+        $(GlobalRef(TensorOperations, :allocator_reset!))($allocator, $cp)
+        $res
+    end
+end
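
The effect of this postprocessor can be demonstrated with stand-in checkpoint functions (a self-contained toy; the real code resolves the calls via `GlobalRef`s into TensorOperations):

```julia
# stand-ins for allocator_checkpoint! / allocator_reset!, using a Vector as "allocator"
demo_checkpoint!(alloc) = length(alloc)
demo_reset!(alloc, cp) = (resize!(alloc, cp); alloc)

# same wrapping scheme as `insertcheckpoints`
function wrap_with_checkpoints(ex, allocator)
    cp = gensym("checkpoint")
    res = gensym("result")
    return quote
        $cp = demo_checkpoint!($allocator)
        $res = $ex
        demo_reset!($allocator, $cp)
        $res
    end
end

scratch = Int[]
wrapped = wrap_with_checkpoints(:(push!(scratch, 1); 42), :scratch)
```

Evaluating `wrapped` yields the wrapped expression's value while rolling `scratch` back to its checkpointed length.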

src/indexnotation/tensormacros.jl

Lines changed: 4 additions & 1 deletion
@@ -67,6 +67,7 @@ function tensorparser(tensorexpr, kwargs...)
             push!(parser.postprocessors, ex -> insertbackend(ex, backend))
         end
         push!(parser.postprocessors, ex -> insertallocator(ex, allocator))
+        push!(parser.postprocessors, ex -> insertcheckpoints(ex, allocator))
         break
     end
 end
@@ -302,7 +303,8 @@ will transfer all arrays to the GPU before performing the requested operations.
 output is an existing host array, the result will be transferred back. If a new array is
 created (i.e. using `:=`), it will remain on the GPU device and it is up to the user to
 transfer it back. This macro is equivalent to
-`@tensor backend=cuTENSORBackend() allocator=CUDAAllocator() tensor_expr`.
+
+    @tensor backend = cuTENSORBackend() allocator = CUDAAllocator() tensor_expr

 !!! note
     This macro requires the cuTENSOR library to be installed and loaded. This can be
@@ -316,6 +318,7 @@ macro cutensor(ex...)
 end

 function _cutensor end
+
 """
     @butensor tensor_expr

src/interface.jl

Lines changed: 20 additions & 0 deletions
@@ -267,6 +267,26 @@ See also [`tensoralloc`](@ref).
 """
 function tensorfree! end

+"""
+    allocator_checkpoint!(allocator) -> checkpoint
+
+Mark a checkpoint for a given allocator.
+This can for example be used to implement stack-based buffer schemes, such as provided by Bumper.jl.
+
+See also [`allocator_reset!`](@ref).
+"""
+allocator_checkpoint!(allocator) = nothing
+
+"""
+    allocator_reset!(allocator, checkpoint)
+
+Reset a given allocator to the state provided by the marked `checkpoint`.
+This can for example be used to implement stack-based buffer schemes, such as provided by Bumper.jl.
+
+See also [`allocator_checkpoint!`](@ref).
+"""
+allocator_reset!(allocator, checkpoint) = allocator
+
 #-------------------------------------------------------------------------------------------
 # Utility
 #-------------------------------------------------------------------------------------------
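
A minimal custom allocator satisfying this interface might look as follows (a self-contained sketch that just counts live temporaries; in real use these would be methods of `TensorOperations.tensoralloc` etc. rather than standalone functions):

```julia
# toy allocator that tracks how many temporaries are alive
mutable struct CountingAllocator
    live::Int
end
CountingAllocator() = CountingAllocator(0)

function tensoralloc(
        ::Type{A}, structure, ::Val{istemp}, alloc::CountingAllocator
    ) where {A <: AbstractArray, istemp}
    istemp && (alloc.live += 1)
    return A(undef, structure)   # delegate the actual storage to Julia's GC
end
tensorfree!(C, alloc::CountingAllocator) = (alloc.live -= 1; nothing)

allocator_checkpoint!(alloc::CountingAllocator) = alloc.live
allocator_reset!(alloc::CountingAllocator, cp) = (alloc.live = cp; alloc)
```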
