
Commit ff410ee

lkdvos and Jutho authored
[Feature] buffer- and stack-based allocator strategies (#251)
* add stack-based `allocator` interface functions
* small formatting changes
* add checkpoint postprocessor
* rewrite Bumper extension in terms of new functions
* Add native BufferAllocator
* add tests
* add note on thread-safety
* move buffer growth to allocation instead of reset
* update docs
* Update src/implementation/allocator.jl
* avoid reallocating if the size hasn't changed
* refactor buffer size determination
* Update src/implementation/allocator.jl
* small fix
* rewrite with `resize!`
* try to address comments on the docs

Co-authored-by: Jutho <Jutho@users.noreply.github.com>
1 parent ae9d80f commit ff410ee

File tree

12 files changed: +304 −27 lines

docs/src/man/backends.md

Lines changed: 24 additions & 1 deletion
@@ -129,11 +129,14 @@ In particular, this means that the backend will be selected **first**, while onl
 ```@docs
 TensorOperations.DefaultAllocator
 TensorOperations.ManualAllocator
+TensorOperations.BufferAllocator
 ```

 By default, the `DefaultAllocator` is used, which uses Julia's built-in memory management system.
 Optionally, it can be useful to use the `ManualAllocator`, as the manual memory management reduces the pressure on the garbage collector.
 In particular in multi-threaded applications, this can sometimes lead to a significant performance improvement.
+On the other hand, for repeated (but thread-safe!) `@tensor` calls, the `BufferAllocator` is a lightweight slab allocator that pre-allocates a buffer for temporaries, falling back to Julia's default if needed.
+Upon repeated use it will automatically resize the buffer to accommodate the requested temporaries, avoiding repeated reallocation.

 Finally, users can also opt to use the `Bumper.jl` system, which pre-allocates a slab of memory that can be re-used afterwards.
 This is available through a package extension for `Bumper`.
@@ -165,4 +168,24 @@ TensorOperations.CUDAAllocator

 Users can also define their own allocators, to facilitate experimentation with new implementations.
 Here, no restriction is made on the type of the allocator, and any object can be passed as an allocator.
-The required implemented methods are [`tensoralloc`](@ref) and [`tensorfree!`](@ref).
+The core methods that can be customized for an allocator are:
+
+* [`tensoralloc`](@ref): Allocate a tensor of a given type and structure. This method receives a flag indicating whether the output lives only within an `@tensor` block or persists outside of it. Temporary tensors can be allocated from internal buffers or pools, while permanent tensors should use a standard allocation strategy.
+* [`tensorfree!`](@ref): Explicitly free a tensor, if applicable. For custom allocators that manage internal pools or buffers, this can be used to track when temporaries are no longer needed.
+
+Here we are guaranteeing that the code generated by `@tensor` will contain exactly one matching call to `tensorfree!` for every `tensoralloc` call that has the temporary flag, as soon as the temporary object is no longer needed.
+
+!!! warning
+    Special care should be taken when using these functions directly, as it is then up to the user to ensure every allocated temporary is freed appropriately.
+    Invalid usage such as not freeing a tensor, freeing it twice, or freeing a tensor that was not allocated as temporary is considered undefined behavior,
+    and can lead to memory leaks and/or segmentation faults.
+
+For allocators that manage reusable buffers or maintain state across multiple contractions, the following helper methods can be useful to indicate that it is safe to free all objects that were allocated after a given checkpoint.
+
+* [`allocator_checkpoint!`](@ref): Save the current state of the allocator (e.g., the current offset in a buffer). This can be called before a sequence of tensor operations to capture the allocation state.
+* [`allocator_reset!`](@ref): Restore the allocator to a previously saved checkpoint, effectively releasing all allocations made since the checkpoint was taken.
+
+Here we are guaranteeing that every created checkpoint will be restored, and all temporary allocations that are enclosed within this scope will no longer be accessed.
+Additionally, if multiple checkpoints are created, they will be restored in a first-in-last-out order.
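
The documentation above stays abstract; a short usage sketch (hypothetical tensor names and sizes, requires the TensorOperations package) could look like:

```julia
using TensorOperations

# reuse a single BufferAllocator across many identical contractions;
# after the first few calls the internal buffer has grown large enough
# and the temporaries no longer go through Julia's GC
A = randn(10, 10, 10)
B = randn(10, 10, 10)
alloc = TensorOperations.BufferAllocator()

for _ in 1:100
    @tensor allocator = alloc C[i, j] := A[i, k, l] * B[l, k, j]
end
```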

docs/src/man/interface.md

Lines changed: 15 additions & 0 deletions
@@ -94,6 +94,21 @@ TensorOperations.tensoralloc_add
 TensorOperations.tensoralloc_contract
 ```

+For allocators that manage reusable buffers or maintain state across multiple tensor
+operations, the following helper functions provide an interface for managing allocation
+regions:
+
+```@docs
+TensorOperations.allocator_checkpoint!
+TensorOperations.allocator_reset!
+```
+
+These can be used to save the current allocator state before a region of code and restore it
+afterwards, where it is guaranteed that all temporary objects allocated within that region
+are released. This is particularly useful for nested tensor operations or when repeatedly
+evaluating the same tensor contraction with an allocator like
+[`TensorOperations.BufferAllocator`](@ref).
+
 ## Utility

 Some of the optional keywords for `@tensor` can be accessed only after implementing the
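
As a sketch of how these interface functions compose (using the names exported by this commit; the element type and shape here are made up):

```julia
using TensorOperations

alloc = TensorOperations.BufferAllocator()

# save the allocator state before a region that creates temporaries
cp = allocator_checkpoint!(alloc)

# request a temporary (istemp = Val(true)) through the allocator interface
tmp = tensoralloc(Array{Float64, 2}, (4, 4), Val(true), alloc)

# ... use `tmp` ...

# restore the state: everything allocated since `cp` may no longer be accessed
allocator_reset!(alloc, cp)
```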

ext/TensorOperationsBumperExt.jl

Lines changed: 7 additions & 19 deletions
@@ -3,8 +3,10 @@ module TensorOperationsBumperExt
 using TensorOperations
 using Bumper

+const BumperBuffer = Union{SlabBuffer, AllocBuffer}
+
 function TensorOperations.tensoralloc(
-        ::Type{A}, structure, ::Val{istemp}, buf::Union{SlabBuffer, AllocBuffer}
+        ::Type{A}, structure, ::Val{istemp}, buf::BumperBuffer
     ) where {A <: AbstractArray, istemp}
     # TODO: remove the `ndims` check if this is fixed in Bumper / StrideArraysCore
     if istemp && ndims(A) > 0
@@ -14,37 +16,23 @@ function TensorOperations.tensoralloc(
     end
 end

-function TensorOperations.blas_contract!(
-        C, A, pA, B, pB, pAB, α, β,
-        backend, allocator::Union{SlabBuffer, AllocBuffer}
-    )
-    @no_escape allocator begin
-        C = Base.@invoke TensorOperations.blas_contract!(
-            C, A, pA, B, pB, pAB, α, β, backend, allocator::Any
-        )
-    end
-    return C
-end
+TensorOperations.allocator_checkpoint!(alloc::BumperBuffer) = Bumper.checkpoint_save(alloc)
+TensorOperations.allocator_reset!(::BumperBuffer, cp) = Bumper.checkpoint_restore!(cp)

 function TensorOperations._butensor(src, ex...)
     buf_sym = gensym("buffer")
-    cp_sym = gensym("checkpoint")
-    res_sym = gensym("result")

     # TODO: there is no check for doubled tensor kwargs
     newex = quote
         $buf_sym = $(Expr(:call, GlobalRef(Bumper, :default_buffer)))
-        $cp_sym = $(Expr(:call, GlobalRef(Bumper, :checkpoint_save), buf_sym))
-        $res_sym = $(
+        $(
             Expr(
                 :macrocall, GlobalRef(TensorOperations, Symbol("@tensor")),
                 src, :(allocator = $buf_sym), ex...
             )
         )
-        $(Expr(:call, GlobalRef(Bumper, :checkpoint_restore!), cp_sym))
-        $res_sym
     end
-    return return Base.remove_linenums!(newex)
+    return Base.remove_linenums!(newex)
 end

 end
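
With this extension loaded, the generic checkpoint API simply delegates to Bumper's own checkpointing. A hypothetical direct use (the buffer size is arbitrary):

```julia
using Bumper, TensorOperations

buf = AllocBuffer(2^20)   # a 1 MiB bump buffer from Bumper.jl

cp = TensorOperations.allocator_checkpoint!(buf)   # forwards to Bumper.checkpoint_save
# ... run `@tensor allocator = buf ...` blocks here ...
TensorOperations.allocator_reset!(buf, cp)         # forwards to Bumper.checkpoint_restore!
```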

src/TensorOperations.jl

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ export @cutensor, @butensor
 export ncon
 export tensorcopy!, tensoradd!, tensortrace!, tensorcontract!, tensorproduct!, tensorscalar
 export tensorcopy, tensoradd, tensortrace, tensorcontract, tensorproduct, scalartype
-export tensoralloc, tensorfree!
+export tensoralloc, tensorfree!, allocator_checkpoint!, allocator_reset!

 export IndexTuple, Index2Tuple, linearize

src/implementation/allocator.jl

Lines changed: 104 additions & 0 deletions
@@ -41,10 +41,48 @@ block, which will thus still be managed using Julia's GC. The other tensors will
 """
 struct ManualAllocator end

+"""
+    BufferAllocator(; sizehint = 0)
+
+Allocator that uses a pre-allocated buffer for storing temporary tensors.
+When the buffer is full, the allocator falls back on Julia's default allocation mechanism
+to create temporary tensors, but keeps track of how much additional memory is required.
+When the buffer is fully reset, the buffer is automatically resized to ensure subsequent
+contractions will now fit in the buffer.
+
+!!! warning
+    This allocator is **not** thread-safe, and it is the user's responsibility to avoid running
+    the same allocator on concurrent jobs. For concurrent usage, it is recommended to either
+    manually use a separate buffer per task, or leverage Bumper.jl through [`@butensor`](@ref)
+    instead.
+"""
+mutable struct BufferAllocator{Storage}
+    buffer::Storage
+    offset::UInt
+    max_offset::UInt
+
+    function BufferAllocator{Storage}(; sizehint::Integer = 0) where {Storage}
+        T = eltype(Storage)
+        (isbitstype(T) && sizeof(T) == 1) ||
+            throw(ArgumentError("Buffer should have elements that take up a single byte."))
+        n = _buffersz(sizehint)
+        return new{Storage}(Storage(undef, n), 0, 0)
+    end
+end
+
+const DefaultStorageType = @static isdefined(Core, :Memory) ? Memory{UInt8} : Vector{UInt8}
+BufferAllocator(; kwargs...) = BufferAllocator{DefaultStorageType}(; kwargs...)
+
+# allocate buffers in sizes that are powers of 2
+_buffersz(x::Integer) = iszero(x) ? x : Base.nextpow(2, x)
+
 # ------------------------------------------------------------------------------------------
 # Generic implementation
 # ------------------------------------------------------------------------------------------
+
+# function that mimics the operations that are applied to the scalars during contraction
 tensorop(args...) = +(*(args...), *(args...))
+
 """
     promote_contract(args...)
@@ -172,6 +210,7 @@ tensorfree!(C, allocator = DefaultAllocator()) = nothing
 # ------------------------------------------------------------------------------------------
 # ManualAllocator implementation
 # ------------------------------------------------------------------------------------------
+
 function tensoralloc(
         ::Type{A}, structure, ::Val{istemp}, ::ManualAllocator
     ) where {A <: AbstractArray, istemp}
@@ -186,3 +225,68 @@ function tensorfree!(C::PtrArray, ::ManualAllocator)
     free(C)
     return nothing
 end
+
+# ------------------------------------------------------------------------------------------
+# BufferAllocator implementation
+# ------------------------------------------------------------------------------------------
+
+# length in bytes
+Base.length(buffer::BufferAllocator) = length(buffer.buffer)
+Base.isempty(buffer::BufferAllocator) = iszero(buffer.offset)
+Base.pointer(buffer::BufferAllocator) = pointer(buffer.buffer)
+Base.pointer(buffer::BufferAllocator, offset) = pointer(buffer) + offset
+
+Base.empty!(buffer::BufferAllocator) = (buffer.offset = 0; buffer)
+
+function Base.resize!(buffer::BufferAllocator, n::Integer)
+    isempty(buffer) || error("Cannot resize a buffer that still contains elements")
+    n = _buffersz(n)
+    n == length(buffer) || (buffer.buffer = similar(buffer.buffer, n))
+    return buffer
+end
+function Base.resize!(buffer::BufferAllocator{<:Vector}, n::Integer)
+    isempty(buffer) || error("Cannot resize a buffer that still contains elements")
+    n = _buffersz(n)
+    n == length(buffer) || resize!(buffer.buffer, n)
+    return buffer
+end
+
+function Base.sizehint!(buffer::BufferAllocator, n::Integer; shrink::Bool = false)
+    buffer.max_offset = shrink ? n : max(buffer.max_offset, n)
+    return buffer
+end
+
+# how many bytes should be reserved
+allocation_size(::Type{T}, structure::Base.Dims) where {T} = prod(structure) * sizeof(T)
+
+function tensoralloc(
+        ::Type{A}, structure, ::Val{istemp}, buffer::BufferAllocator
+    ) where {A <: AbstractArray, istemp}
+    if istemp
+        T = eltype(A)
+        offset = buffer.offset + allocation_size(T, structure)
+        sizehint!(buffer, offset)

+        # grow buffer if empty
+        isempty(buffer) && resize!(buffer, buffer.max_offset)
+
+        # Use pointer if there is enough space
+        if offset < length(buffer)
+            ptr = convert(Ptr{T}, pointer(buffer, buffer.offset))
+            buffer.offset = offset
+            return Base.unsafe_wrap(Array, ptr, structure)
+        end
+    end
+
+    # Allocate default if not
+    return A(undef, structure)
+end
+
+allocator_checkpoint!(buffer::BufferAllocator) = buffer.offset
+
+function allocator_reset!(buffer::BufferAllocator, checkpoint)
+    checkpoint ≤ buffer.offset ||
+        throw(ArgumentError("Invalid checkpoint: `allocator_reset!` has to be called in reverse order on saved checkpoints"))
+    buffer.offset = checkpoint
+    return buffer
+end
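
The growth and reset logic above can be illustrated with a self-contained toy model of the bookkeeping (pure Julia, no pointers; `ToyBuffer` and the `toy_*` names are illustrative, not part of the package):

```julia
# toy model of BufferAllocator's offset bookkeeping
mutable struct ToyBuffer
    len::Int         # current capacity in bytes
    offset::Int      # bytes handed out so far
    max_offset::Int  # high-water mark used for the next resize
end
ToyBuffer() = ToyBuffer(0, 0, 0)

# same power-of-2 rounding as `_buffersz`
toy_buffersz(x) = iszero(x) ? 0 : nextpow(2, x)

function toy_alloc!(b::ToyBuffer, nbytes)
    newoffset = b.offset + nbytes
    b.max_offset = max(b.max_offset, newoffset)            # mirrors sizehint!
    b.offset == 0 && (b.len = toy_buffersz(b.max_offset))  # grow only while empty
    newoffset < b.len || return :fallback                  # does not fit: default allocation
    b.offset = newoffset
    return :buffer
end

toy_checkpoint(b::ToyBuffer) = b.offset
function toy_reset!(b::ToyBuffer, cp)
    cp <= b.offset || error("checkpoints must be restored in reverse order")
    b.offset = cp
    return b
end
```

Running a small sequence shows the fallback-then-regrow behavior: a 100-byte request fits in the freshly sized 128-byte buffer, a second 100-byte request falls back, and after a reset the buffer regrows to 256 bytes so both fit.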

src/implementation/blascontract.jl

Lines changed: 5 additions & 3 deletions
@@ -14,12 +14,14 @@ function blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
         TupleTools.getindices(indCinoBA, tpAB[1]),
         TupleTools.getindices(indCinoBA, tpAB[2]),
     )
-
+    cp = allocator_checkpoint!(allocator)
     if contract_memcost(C, A, pA, B, pB, pAB) <= contract_memcost(C, B, rpB, A, rpA, rpAB)
-        return _blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
+        C = _blas_contract!(C, A, pA, B, pB, pAB, α, β, backend, allocator)
     else
-        return _blas_contract!(C, B, rpB, A, rpA, rpAB, α, β, backend, allocator)
+        C = _blas_contract!(C, B, rpB, A, rpA, rpAB, α, β, backend, allocator)
     end
+    allocator_reset!(allocator, cp)
+    return C
 end
 # specialised fast path for matrix matrix multiplication
 function blas_contract!(
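
The edit brackets the branch between a checkpoint and a reset, so any permutation temporaries created inside `_blas_contract!` are released at once. This recurring pattern could be factored as a hypothetical helper (not part of the package):

```julia
using TensorOperations

# hypothetical convenience wrapper; `f` is a closure that may allocate temporaries
function with_checkpoint(f, allocator)
    cp = allocator_checkpoint!(allocator)
    result = f()
    allocator_reset!(allocator, cp)
    return result
end
```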

src/indexnotation/postprocessors.jl

Lines changed: 17 additions & 0 deletions
@@ -113,3 +113,20 @@ function insertallocator(ex, allocator)
         )
     )
 end
+
+# TODO: this is currently only marking a single checkpoint per `@tensor` call.
+"""
+    insertcheckpoints(ex, allocator)
+
+Insert the [`allocator_checkpoint!`](@ref) and [`allocator_reset!`](@ref) calls before and after tensor contractions.
+"""
+function insertcheckpoints(ex, allocator)
+    cp = gensym("checkpoint")
+    res = gensym("result")
+    return quote
+        $cp = $(GlobalRef(TensorOperations, :allocator_checkpoint!))($allocator)
+        $res = $ex
+        $(GlobalRef(TensorOperations, :allocator_reset!))($allocator, $cp)
+        $res
+    end
+end
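
The effect of this postprocessor can be demonstrated with stand-in checkpoint functions (a self-contained toy; the real code resolves the calls via `GlobalRef`s into TensorOperations):

```julia
# stand-ins for allocator_checkpoint! / allocator_reset!, using a Vector as "allocator"
demo_checkpoint!(alloc) = length(alloc)
demo_reset!(alloc, cp) = (resize!(alloc, cp); alloc)

# same wrapping scheme as `insertcheckpoints`
function wrap_with_checkpoints(ex, allocator)
    cp = gensym("checkpoint")
    res = gensym("result")
    return quote
        $cp = demo_checkpoint!($allocator)
        $res = $ex
        demo_reset!($allocator, $cp)
        $res
    end
end

scratch = Int[]
wrapped = wrap_with_checkpoints(:(push!(scratch, 1); 42), :scratch)
```

Evaluating `wrapped` yields the wrapped expression's value while rolling `scratch` back to its checkpointed length.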

src/indexnotation/tensormacros.jl

Lines changed: 4 additions & 1 deletion
@@ -67,6 +67,7 @@ function tensorparser(tensorexpr, kwargs...)
             push!(parser.postprocessors, ex -> insertbackend(ex, backend))
         end
         push!(parser.postprocessors, ex -> insertallocator(ex, allocator))
+        push!(parser.postprocessors, ex -> insertcheckpoints(ex, allocator))
         break
     end
 end
@@ -302,7 +303,8 @@ will transfer all arrays to the GPU before performing the requested operations.
 output is an existing host array, the result will be transferred back. If a new array is
 created (i.e. using `:=`), it will remain on the GPU device and it is up to the user to
 transfer it back. This macro is equivalent to
-`@tensor backend=cuTENSORBackend() allocator=CUDAAllocator() tensor_expr`.
+
+    @tensor backend = cuTENSORBackend() allocator = CUDAAllocator() tensor_expr

 !!! note
     This macro requires the cuTENSOR library to be installed and loaded. This can be
@@ -316,6 +318,7 @@ macro cutensor(ex...)
 end

 function _cutensor end
+
 """
     @butensor tensor_expr

src/interface.jl

Lines changed: 20 additions & 0 deletions
@@ -267,6 +267,26 @@ See also [`tensoralloc`](@ref).
 """
 function tensorfree! end

+"""
+    allocator_checkpoint!(allocator) -> checkpoint
+
+Mark a checkpoint for a given allocator.
+This can for example be used to implement stack-based buffer schemes, such as provided by Bumper.jl.
+
+See also [`allocator_reset!`](@ref).
+"""
+allocator_checkpoint!(allocator) = nothing
+
+"""
+    allocator_reset!(allocator, checkpoint)
+
+Reset a given allocator to the state provided by the marked `checkpoint`.
+This can for example be used to implement stack-based buffer schemes, such as provided by Bumper.jl.
+
+See also [`allocator_checkpoint!`](@ref).
+"""
+allocator_reset!(allocator, checkpoint) = allocator
+
 #-------------------------------------------------------------------------------------------
 # Utility
 #-------------------------------------------------------------------------------------------
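
A minimal custom allocator satisfying this interface might look as follows (a self-contained sketch that just counts live temporaries; in real use these would be methods of `TensorOperations.tensoralloc` etc. rather than standalone functions):

```julia
# toy allocator that tracks how many temporaries are alive
mutable struct CountingAllocator
    live::Int
end
CountingAllocator() = CountingAllocator(0)

function tensoralloc(
        ::Type{A}, structure, ::Val{istemp}, alloc::CountingAllocator
    ) where {A <: AbstractArray, istemp}
    istemp && (alloc.live += 1)
    return A(undef, structure)   # delegate the actual storage to Julia's GC
end
tensorfree!(C, alloc::CountingAllocator) = (alloc.live -= 1; nothing)

allocator_checkpoint!(alloc::CountingAllocator) = alloc.live
allocator_reset!(alloc::CountingAllocator, cp) = (alloc.live = cp; alloc)
```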
