A high-performance, pure Delphi machine learning framework. No Python. No external DLLs. Just fast, native code.
- Arena-Based Memory: Zero allocation/deallocation during training
- SIMD Assembly: AVX-512 and SSE kernels with CPUID auto-detection
- Automatic Differentiation: Full autograd with computation graphs
- Thread Pool Parallelization: Efficient multi-core utilization
- Zero Dependencies: Pure Delphi, compiles standalone
- N-Dimensional Tensors: Full Shape/Strides support for any dimensionality
- Broadcasting: NumPy-style automatic shape broadcasting
- Batch Operations: Batched matrix multiplication for 3D+ tensors
- Training Visualization: Live loss charts and confusion matrices
- Model Persistence: Save/Load trained models to disk
| NeuralDelphi unit | Role |
|---|---|
| ML.Arena | Linear memory allocator (zero GC overhead) |
| ML.Tensor | N-D tensor views with Shape/Strides |
| ML.Ops | SIMD kernels + parallel ops + broadcasting |
| ML.Graph | Computation graph + autograd |
ML.Arena is the foundation of NeuralDelphi's performance. It implements a linear allocator (also called a "bump allocator" or "arena allocator") that pre-allocates a large contiguous block of memory.
Key Concepts:
- `TMemPtr`: An `Integer` index into the arena, not a pointer. This avoids pointer arithmetic issues and makes the system 32/64-bit agnostic.
- `TArena.Alloc(Count)`: O(1) allocation that just increments the head pointer. No free lists, no fragmentation.
- `TArena.Reset()`: O(1) deallocation that sets the head to 0. All memory is "freed" instantly.
- `GetSavePoint()` / `Restore()`: Critical for the graph architecture. Allows resetting only temporary activations while keeping persistent parameters.
Why This Matters:
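Traditional `GetMem`/`FreeMem` calls carry real cost: heap bookkeeping on every call, occasional kernel calls when the heap grows, and fragmentation over time. During training, you might allocate millions of temporary tensors. The arena replaces each of those allocations with a single integer increment.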
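To make the save point mechanism concrete, here is a minimal sketch of a bump allocator. `TSketchArena` and its capacity in floats are assumptions made for illustration; the real `TArena` may be laid out differently.

```pascal
program ArenaSketch;
{$APPTYPE CONSOLE}

type
  TMemPtr = Integer; // an index into the arena, not a pointer

  // Hypothetical minimal arena, for illustration only.
  TSketchArena = record
    Data: TArray<Single>;
    Head: TMemPtr;
    procedure Init(CapacityInFloats: Integer);
    function Alloc(Count: Integer): TMemPtr;
    function GetSavePoint: TMemPtr;
    procedure Restore(SavePoint: TMemPtr);
    procedure Reset;
  end;

procedure TSketchArena.Init(CapacityInFloats: Integer);
begin
  SetLength(Data, CapacityInFloats); // the only heap allocation
  Head := 0;
end;

function TSketchArena.Alloc(Count: Integer): TMemPtr;
begin
  Assert(Head + Count <= Length(Data), 'arena exhausted');
  Result := Head;   // hand out the current head...
  Inc(Head, Count); // ...and bump it: O(1)
end;

function TSketchArena.GetSavePoint: TMemPtr;
begin
  Result := Head;
end;

procedure TSketchArena.Restore(SavePoint: TMemPtr);
begin
  Head := SavePoint; // everything allocated after the save point is dropped
end;

procedure TSketchArena.Reset;
begin
  Head := 0;
end;

var
  Arena: TSketchArena;
  W, Tmp, Mark: TMemPtr;
begin
  Arena.Init(1024);
  W := Arena.Alloc(16);       // persistent parameters
  Mark := Arena.GetSavePoint; // boundary after the parameters
  Tmp := Arena.Alloc(100);    // temporary activations
  Arena.Restore(Mark);        // drops Tmp, keeps W
  Writeln(W, ' ', Tmp, ' ', Arena.Head); // prints: 0 16 16
end.
```

Because a restore only moves the head index, the next temporary allocation reuses the same slots, which is why per-step activations cost nothing to free.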
Example:
```pascal
Arena := TArena.Create(256);  // Allocate 256 MB block
W1 := Arena.Alloc(8 * 2);     // Allocate 16 floats (8x2 matrix)
W2 := Arena.Alloc(1 * 8);     // Allocate 8 floats (1x8 matrix)
// ... use W1, W2 ...
Arena.Reset;                  // Free everything instantly
```

`TTensor` is a lightweight record (not a class!) that acts as a view into the arena. It supports arbitrary dimensions with NumPy-style shape and strides.
Key Fields:
- `DataPtr: TMemPtr`: Index into the arena where the tensor data lives
- `GradPtr: TMemPtr`: Index for gradients (allocated on demand during the backward pass)
- `Shape: TArray<Integer>`: Dimensions, e.g., `[32, 3, 224, 224]` for a batch of images
- `Strides: TArray<Integer>`: Memory strides for each dimension (row-major)
- `RequiresGrad: Boolean`: Whether this tensor needs gradients computed
Key Methods:
- `NDim`: Returns the number of dimensions
- `ElementCount`: Total number of elements (product of the shape)
- `IsContiguous`: Checks whether the memory layout matches the strides
- `GetLinearIndex(Indices)`: Converts N-D indices to a linear index
- `Reshape(NewShape)`: Zero-copy view with a new shape
- `Transpose(Dim0, Dim1)`: Zero-copy dimension swap
- `Squeeze` / `Unsqueeze`: Remove/add dimensions of size 1
- `RawData(Arena)`: Returns a `PSingle` pointer for direct memory access
- `RawGrad(Arena)`: Returns the gradient pointer, or `nil` if not allocated
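The relationship between shape, strides, and linear indices can be sketched as follows. `RowMajorStrides` and `LinearIndex` are hypothetical helper names used for illustration and are not taken from `ML.Tensor`.

```pascal
program StridesSketch;
{$APPTYPE CONSOLE}

// Row-major strides: the last dimension is contiguous, and each earlier
// stride is the product of all later dimension sizes.
function RowMajorStrides(const Shape: array of Integer): TArray<Integer>;
var
  I, Acc: Integer;
begin
  SetLength(Result, Length(Shape));
  Acc := 1;
  for I := High(Shape) downto 0 do
  begin
    Result[I] := Acc;
    Acc := Acc * Shape[I];
  end;
end;

// Linear index = sum over dimensions of index * stride.
function LinearIndex(const Indices, Strides: array of Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Indices) do
    Inc(Result, Indices[I] * Strides[I]);
end;

var
  S: TArray<Integer>;
begin
  S := RowMajorStrides([32, 8, 64]);
  Writeln(S[0], ' ', S[1], ' ', S[2]);  // prints: 512 64 1
  Writeln(LinearIndex([1, 2, 3], S));   // prints: 643
end.
```

Under this scheme a zero-copy transpose can be expressed by swapping two entries in both the shape and the strides; the data itself never moves, which is also why a transposed view is no longer contiguous.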
Why Records, Not Classes:
- Zero heap allocation overhead
- Value semantics (can copy freely)
- Cache-friendly (all data in one contiguous block)
Example:
```pascal
var
  T, T2, T3: TTensor;
begin
  T := TTensor.Create(Arena, [32, 8, 64], True); // 3D tensor, needs gradients
  // T.Shape = [32, 8, 64]
  // T.Strides = [512, 64, 1] (row-major)
  // T.ElementCount = 16384

  // Zero-copy reshape
  T2 := T.Reshape([32, 512]); // Same data, different view

  // Zero-copy transpose
  T3 := T.Transpose(0, 1); // Swaps first two dimensions
end;
```

ML.Ops contains three layers: pure ASM kernels, parallel execution, and high-level tensor ops with broadcasting support.
1. TKernels - Pure Assembly Math Kernels
Hand-written x64 SSE assembly for maximum performance. These are stateless functions that operate on raw pointers.
- `DotProduct(A, B, Count)`: SIMD dot product using `MOVUPS`, `MULPS`, `ADDPS`, `HADDPS`. Processes 4 floats at once.
- `VectorAdd(A, B, Out, Count)`: Element-wise addition with SSE `ADDPS`.
- `VectorMul(A, B, Out, Count)`: Element-wise multiplication with SSE `MULPS`.
- `Transpose(Src, Dst, Rows, Cols)`: Block-based matrix transpose (8x8 blocks) for cache efficiency.
2. TMLParallel - Thread Pool Wrapper
Wraps System.Threading.TParallel.For with a threshold check. Only parallelizes if workload is substantial (>256 elements) to avoid overhead.
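A threshold check of this kind might look like the sketch below. `ForEachIndex` is a hypothetical name and not the actual `TMLParallel` interface; only `TParallel.For` comes from the Delphi RTL.

```pascal
program ParallelThresholdSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Threading;

const
  ParallelThreshold = 256; // the threshold quoted above

// Hypothetical helper: run Body(0..Count-1), in parallel only when the
// workload is large enough to amortize the scheduling overhead.
procedure ForEachIndex(Count: Integer; const Body: TProc<Integer>);
var
  I: Integer;
begin
  if Count > ParallelThreshold then
    TParallel.For(0, Count - 1, Body) // both bounds are inclusive
  else
    for I := 0 to Count - 1 do
      Body(I);
end;

var
  Data: TArray<Single>;
begin
  SetLength(Data, 1000);
  ForEachIndex(Length(Data),
    procedure(I: Integer)
    begin
      Data[I] := I * 2.0; // each index is written by exactly one task
    end);
  Writeln(Data[999]:0:1); // prints: 1998.0
end.
```

The body must be safe to run concurrently: here every iteration writes a distinct element, so no synchronization is needed.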
3. TOps - High-Level Tensor Operations with Broadcasting
Combines kernels + parallelism + tensor management. Supports NumPy-style broadcasting.
Broadcasting Example:
```pascal
// [32, 10] + [10] = [32, 10] (bias broadcast across batch)
// [8, 1] + [1, 8] = [8, 8] (outer product style)
TOps.Add(Arena, A, B, Out); // Automatically broadcasts if shapes are compatible
```

Batch MatMul:

```pascal
// [Batch, M, K] @ [Batch, K, N] -> [Batch, M, N]
// Processes each batch independently with parallel inner loops
TOps.MatMul(Arena, A, B, Out); // Works with 2D, 3D, or higher
```

Key Operations:
- `MatMul`: Batched matrix multiplication. Transposes B for cache-friendly access, parallelizes rows, uses the SIMD dot product.
- `Add` / `Mul`: Element-wise with broadcasting support.
- `ReLU` / `LeakyReLU` / `Sigmoid`: Activation functions.
- `MSE` / `CrossEntropy`: Loss functions.
- `*Backward`: Gradient computation with broadcasting-aware reduction.
ML.Graph is the "brain" of NeuralDelphi. It implements automatic differentiation by building a computation graph.
Key Concepts:
1. Computation Graph:
Each operation creates a TNode that records:
- Operation type (`opMatMul`, `opReLU`, etc.)
- Input node indices (parents)
- Output tensor
- Whether gradients are needed
2. Forward Pass: Operations are executed immediately as you build the graph:

```pascal
W := Graph.Param([8, 2]);  // Creates param node with shape [8, 2]
X := Graph.Input([2, 1]);  // Creates input placeholder
H := Graph.MatMul(W, X);   // Executes MatMul, creates node
A := Graph.LeakyReLU(H);   // Executes LeakyReLU, creates node
```

3. Backward Pass: Traverses the graph in reverse, computing gradients using the chain rule:

```pascal
Graph.Backward(LossNode); // Computes gradients for all nodes requiring them
```

4. Memory Architecture:
- `MarkParamsEnd()`: Called after all `Param()` calls. Marks the boundary between persistent parameters and temporary activations.
- `ResetActivations()`: Resets the arena to the param save point. Wipes activations but keeps parameters intact.
Key Methods:
- `Param(Shape)`: Creates a trainable parameter with an N-D shape. Pre-allocates gradients.
- `Input(Shape)`: Creates an input placeholder with an N-D shape.
- `Step(LearningRate, GradClip)`: Updates parameters with gradient clipping.
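The record-then-reverse idea behind the graph can be illustrated on scalars. The sketch below is a deliberately simplified stand-in: `TSketchNode`, `Leaf`, `NodeAdd`, and `NodeMul` are hypothetical names, and the real `TNode` holds tensors rather than single values.

```pascal
program AutogradSketch;
{$APPTYPE CONSOLE}

type
  TOpKind = (opLeaf, opAdd, opMul);

  // Scalar stand-in for a graph node: op type, parent indices, value, gradient.
  TSketchNode = record
    Op: TOpKind;
    A, B: Integer;
    Value, Grad: Double;
  end;

var
  Nodes: TArray<TSketchNode>;

function AddNode(Op: TOpKind; A, B: Integer; Value: Double): Integer;
begin
  Result := Length(Nodes);
  SetLength(Nodes, Result + 1);
  Nodes[Result].Op := Op;
  Nodes[Result].A := A;
  Nodes[Result].B := B;
  Nodes[Result].Value := Value;
  Nodes[Result].Grad := 0;
end;

function Leaf(Value: Double): Integer;
begin
  Result := AddNode(opLeaf, -1, -1, Value);
end;

// Forward values are computed eagerly while the graph is recorded.
function NodeAdd(A, B: Integer): Integer;
begin
  Result := AddNode(opAdd, A, B, Nodes[A].Value + Nodes[B].Value);
end;

function NodeMul(A, B: Integer): Integer;
begin
  Result := AddNode(opMul, A, B, Nodes[A].Value * Nodes[B].Value);
end;

procedure Backward(Loss: Integer);
var
  I, PA, PB: Integer;
  G: Double;
begin
  Nodes[Loss].Grad := 1.0;
  // Nodes are appended in topological order, so a reverse scan visits
  // each node only after all of its consumers.
  for I := Loss downto 0 do
  begin
    PA := Nodes[I].A;
    PB := Nodes[I].B;
    G := Nodes[I].Grad;
    case Nodes[I].Op of
      opAdd:
        begin
          Nodes[PA].Grad := Nodes[PA].Grad + G;
          Nodes[PB].Grad := Nodes[PB].Grad + G;
        end;
      opMul:
        begin
          Nodes[PA].Grad := Nodes[PA].Grad + G * Nodes[PB].Value;
          Nodes[PB].Grad := Nodes[PB].Grad + G * Nodes[PA].Value;
        end;
    end;
  end;
end;

var
  W, X, Bias, Y: Integer;
begin
  W := Leaf(3.0);
  X := Leaf(2.0);
  Bias := Leaf(1.0);
  Y := NodeAdd(NodeMul(W, X), Bias); // y = w*x + b = 7
  Backward(Y);
  Writeln(Nodes[W].Grad:0:1, ' ', Nodes[X].Grad:0:1, ' ', Nodes[Bias].Grad:0:1);
  // prints: 2.0 3.0 1.0
end.
```

Gradients are accumulated with `+` rather than assigned, so a node that feeds several consumers receives the sum of their contributions.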
The included demo (XOR_Demo.dpr) trains a neural network to learn the XOR function with a real-time visual heatmap, live loss chart, and interactive control panel.
| Control | Default | Description |
|---|---|---|
| Learning Rate | 0.5 | How fast to learn. Higher = faster but may overshoot |
| Hidden Neurons | 16 | Network capacity. More = better fit, slower training |
| Grad Clip | 5.0 | Maximum gradient magnitude. Prevents exploding gradients |
| Start/Stop | - | Toggle training on/off |
| Reset Network | - | Reinitialize with random weights |
| Save/Load Model | - | Persist trained weights to disk |
| Loss Chart | - | Live visualization of training loss |
Tips:
- Learning rate 0.5-1.0 works well for XOR
- 8-32 hidden neurons is plenty
- Watch the loss chart flatten as the network converges
XOR (exclusive OR) is a classic non-linearly separable problem that requires a hidden layer:
| Input A | Input B | Expected Output | Visual |
|---|---|---|---|
| 0 | 0 | 0 | Red |
| 0 | 1 | 1 | Blue |
| 1 | 0 | 1 | Blue |
| 1 | 1 | 0 | Red |
```
Input(2) → MatMul(W1: Hx2) → Add(B1: Hx1) → LeakyReLU →
MatMul(W2: 1xH) → Add(B2: 1x1) → Sigmoid → Output(1)
```

Where H = Hidden Neurons (configurable via UI)
Visualization:
- Heatmap shows network's prediction for every (x, y) coordinate
- Red = predicts 0, Blue = predicts 1
- Corners show actual XOR truth table
- Updates in real-time as network learns
- RAD Studio 11+ (Delphi)
- Platform: Windows x64 (for SIMD assembly)
- Open `XOR_Demo.dpr` in RAD Studio
- Select 64-bit Windows target
- Build and Run (F9)
Note: 32-bit builds use scalar fallbacks (no SIMD)
The MNIST_Demo.dpr demonstrates training a CNN on handwritten digits with live training visualization.
- Conv2D layers with im2col + GEMM optimization
- Live loss chart showing training progress
- Confusion matrix visualizing classification errors
- Save/Load trained models
- Bulk data loading for fast dataset initialization
```
Input[1,28,28] → Conv1(16ch, 3x3, stride=2) → ReLU →
Conv2(32ch, 3x3, stride=2) → ReLU →
Flatten → Dense(128) → ReLU →
Dense(10) → Softmax → Output
```
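The spatial size after each stride-2 convolution follows the usual formula `out = floor((in + 2*pad - kernel) / stride) + 1`. The padding used by the demo is not stated here, so the sketch below evaluates both `pad = 0` and `pad = 1` as assumptions; `ConvOutSize` is a hypothetical helper name.

```pascal
program ConvShapeSketch;
{$APPTYPE CONSOLE}

// out = floor((in + 2*pad - kernel) / stride) + 1
function ConvOutSize(InSize, Kernel, Stride, Pad: Integer): Integer;
begin
  Result := (InSize + 2 * Pad - Kernel) div Stride + 1;
end;

var
  Pad, S1, S2: Integer;
begin
  for Pad := 0 to 1 do
  begin
    S1 := ConvOutSize(28, 3, 2, Pad); // after Conv1 (16 channels)
    S2 := ConvOutSize(S1, 3, 2, Pad); // after Conv2 (32 channels)
    Writeln('pad=', Pad, ': ', S1, 'x', S1, ' -> ', S2, 'x', S2,
      ', flatten=', 32 * S2 * S2);
  end;
  // prints:
  // pad=0: 13x13 -> 6x6, flatten=1152
  // pad=1: 14x14 -> 7x7, flatten=1568
end.
```

The flattened length is the input width of the `Dense(128)` layer, so it is worth checking against the actual padding before changing kernel size or stride.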
```
NeuralDelphi/
├── ML.Arena.pas       # Memory arena allocator
├── ML.Tensor.pas      # N-D tensor with Shape/Strides
├── ML.Ops.pas         # Math operations + SIMD + im2col Conv2D
├── ML.Graph.pas       # Computation graph + autograd
├── XOR_Demo.dpr       # Interactive XOR visualization
├── MNIST_Demo.dpr     # CNN digit classification demo
├── MNIST_Loader.pas   # Fast MNIST dataset loader
├── LICENSE            # MIT License
└── README.md
```

| Operation | Description | Broadcasting | Batched |
|---|---|---|---|
| MatMul | Matrix multiplication `C = A @ B` | No | Yes (3D+) |
| Add | Element-wise addition `C = A + B` | Yes | Yes |
| Mul | Element-wise multiplication `C = A * B` | Yes | Yes |
| Conv2D | 2D convolution (im2col + GEMM) | No | Yes |
| MaxPool2D | Max pooling with gradient routing | No | Yes |
| Dropout | Inverted dropout regularization | No | Yes |

| Operation | Formula | Use Case |
|---|---|---|
| ReLU | `f(x) = max(0, x)` | Standard activation |
| LeakyReLU | `f(x) = max(αx, x)` | Prevents dying ReLU |
| Sigmoid | `f(x) = 1/(1+e^(-x))` | Binary classification output |
| Tanh | `f(x) = tanh(x)` | Centered around 0 |
| Softmax | `f(x_i) = e^(x_i) / Σ e^(x_j)` | Multi-class output |
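As a scalar reference for the formulas in the table, the functions can be written as below. This is an illustrative sketch, not the SIMD code in `ML.Ops`; subtracting the maximum before exponentiating in `Softmax` is a standard stability trick and an assumption about how one would implement it, not a statement about the library.

```pascal
program ActivationSketch;
{$APPTYPE CONSOLE}

function ReLU(X: Double): Double;
begin
  if X > 0 then Result := X else Result := 0;
end;

// Equals max(Alpha*x, x) for 0 < Alpha < 1.
function LeakyReLU(X, Alpha: Double): Double;
begin
  if X > 0 then Result := X else Result := Alpha * X;
end;

function Sigmoid(X: Double): Double;
begin
  Result := 1.0 / (1.0 + Exp(-X));
end;

// In-place softmax; subtracting the maximum keeps every exponent <= 0.
procedure Softmax(var V: array of Double);
var
  I: Integer;
  M, Sum: Double;
begin
  M := V[0];
  for I := 1 to High(V) do
    if V[I] > M then M := V[I];
  Sum := 0;
  for I := 0 to High(V) do
  begin
    V[I] := Exp(V[I] - M);
    Sum := Sum + V[I];
  end;
  for I := 0 to High(V) do
    V[I] := V[I] / Sum;
end;

var
  V: array[0..2] of Double;
  Total: Double;
begin
  V[0] := 1000; V[1] := 1001; V[2] := 1002; // naive Exp(1000) would overflow
  Softmax(V);
  Total := V[0] + V[1] + V[2];
  Writeln(Sigmoid(0):0:2);            // prints: 0.50
  Writeln(LeakyReLU(-2.0, 0.01):0:2); // prints: -0.02
  Writeln(Total:0:4);                 // prints: 1.0000
end.
```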

| Operation | Formula | Use Case |
|---|---|---|
| MSE | `L = (1/n) Σ (pred - target)²` | Regression |
| CrossEntropy | `L = -Σ (target * log(pred))` | Classification |
| SoftmaxCrossEntropy | Combined softmax + CE | Numerically stable classification |

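One common way to get the numerical stability mentioned for the combined loss is the log-sum-exp form: for a one-hot target with class index `t`, `L = log(Σ e^(z_j)) - z_t`. The sketch below assumes that formulation and a hypothetical function signature; the library's implementation may differ.

```pascal
program SoftmaxCESketch;
{$APPTYPE CONSOLE}

// Cross-entropy of softmax(Logits) against a one-hot target, computed as
// logsumexp(Logits) - Logits[Target] without forming the probabilities.
function SoftmaxCrossEntropy(const Logits: array of Double;
  Target: Integer): Double;
var
  I: Integer;
  M, Sum: Double;
begin
  M := Logits[0];
  for I := 1 to High(Logits) do
    if Logits[I] > M then M := Logits[I];
  Sum := 0;
  for I := 0 to High(Logits) do
    Sum := Sum + Exp(Logits[I] - M); // every exponent is <= 0
  Result := M + Ln(Sum) - Logits[Target];
end;

begin
  // Two equally likely classes: loss = ln(2)
  Writeln(SoftmaxCrossEntropy([0.0, 0.0], 0):0:4); // prints: 0.6931
  // Large logits stay finite because the maximum is subtracted first
  Writeln(SoftmaxCrossEntropy([1000.0, 0.0], 0):0:4);
end.
```

A further benefit of fusing the two steps is that the gradient with respect to the logits reduces to `softmax(z) - target`, which avoids dividing by small probabilities.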
NeuralDelphi follows NumPy broadcasting semantics:
- Align shapes from the right: `[32, 10]` and `[10]` align as `[32, 10]` + `[1, 10]`
- Dimensions must match or be 1: `[8, 1]` + `[1, 8]` → `[8, 8]`
- Result shape is the element-wise max: `[32, 1, 64]` + `[1, 8, 64]` → `[32, 8, 64]`

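These three rules translate into a short shape computation. `TryBroadcastShapes` below is a hypothetical helper written for illustration; the library's own `CanBroadcast` and `BroadcastShapes` may have different signatures.

```pascal
program BroadcastSketch;
{$APPTYPE CONSOLE}

// Returns False if the shapes are incompatible; otherwise fills Res with
// the broadcast shape.
function TryBroadcastShapes(const A, B: array of Integer;
  out Res: TArray<Integer>): Boolean;
var
  N, I, DA, DB: Integer;
begin
  N := Length(A);
  if Length(B) > N then N := Length(B);
  SetLength(Res, N);
  for I := 1 to N do // I counts dimensions from the right
  begin
    // A missing leading dimension behaves like size 1.
    if I <= Length(A) then DA := A[Length(A) - I] else DA := 1;
    if I <= Length(B) then DB := B[Length(B) - I] else DB := 1;
    if (DA <> DB) and (DA <> 1) and (DB <> 1) then
      Exit(False);
    if DA > DB then Res[N - I] := DA else Res[N - I] := DB;
  end;
  Result := True;
end;

var
  R: TArray<Integer>;
begin
  if TryBroadcastShapes([32, 1, 64], [1, 8, 64], R) then
    Writeln(R[0], ' ', R[1], ' ', R[2]); // prints: 32 8 64
  if TryBroadcastShapes([32, 10], [10], R) then
    Writeln(R[0], ' ', R[1]);            // prints: 32 10
  if not TryBroadcastShapes([32, 10], [7], R) then
    Writeln('incompatible');             // prints: incompatible
end.
```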
Helper Functions:
```pascal
if CanBroadcast(ShapeA, ShapeB) then
  OutShape := BroadcastShapes(ShapeA, ShapeB);
// During element-wise ops, use BroadcastIndex to map output → input indices
```

MatMul supports N-dimensional tensors where the last two dimensions are the matrix dimensions:

```pascal
// A: [Batch, M, K] × B: [Batch, K, N] → Out: [Batch, M, N]
// Each batch is multiplied independently
A := TTensor.Create(Arena, [32, 64, 128]);   // 32 matrices of 64x128
B := TTensor.Create(Arena, [32, 128, 256]);  // 32 matrices of 128x256
Out := TTensor.Create(Arena, [32, 64, 256]); // Result: 32 matrices of 64x256
TOps.MatMul(Arena, A, B, Out);               // Parallel across batches and rows
```

1. SIMD (Single Instruction, Multiple Data)
- SSE: Processes 4 floats simultaneously using SSE registers
- AVX-512: Processes 16 floats simultaneously (when supported)
- Automatic CPU feature detection with fallback chain
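`DotProduct`: up to ~4x (SSE) or ~16x (AVX-512) faster than scalar code. These figures are the gain from the vector width alone; real speedups also depend on memory bandwidth and data alignment.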
2. Cache-Friendly Matrix Multiplication
- Transposes matrix B before multiplication
- Block-based transpose (8x8 blocks) for L1 cache efficiency
- Reduces cache misses by ~80%
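The effect of transposing B can be seen in a plain scalar version: after the transpose, the inner loop reads both operands sequentially, which is also the access pattern a SIMD dot product needs. This sketch uses a simple full transpose instead of 8x8 blocks, and `MatMulTransposedB` is a hypothetical name.

```pascal
program MatMulSketch;
{$APPTYPE CONSOLE}

type
  TVec = TArray<Single>;

// C[M x N] = A[M x K] * B[K x N], all matrices flat and row-major.
procedure MatMulTransposedB(const A, B: TVec; var C: TVec; M, K, N: Integer);
var
  BT: TVec;
  I, J, P: Integer;
  Acc: Single;
begin
  // Transpose B once so that column J of B becomes contiguous row J of BT.
  SetLength(BT, N * K);
  for P := 0 to K - 1 do
    for J := 0 to N - 1 do
      BT[J * K + P] := B[P * N + J];

  SetLength(C, M * N);
  for I := 0 to M - 1 do
    for J := 0 to N - 1 do
    begin
      Acc := 0;
      for P := 0 to K - 1 do // both rows are walked sequentially
        Acc := Acc + A[I * K + P] * BT[J * K + P];
      C[I * N + J] := Acc;
    end;
end;

var
  A, B, C: TVec;
begin
  A := [1.0, 2.0, 3.0, 4.0]; // [[1, 2], [3, 4]]
  B := [5.0, 6.0, 7.0, 8.0]; // [[5, 6], [7, 8]]
  MatMulTransposedB(A, B, C, 2, 2, 2);
  Writeln(C[0]:0:0, ' ', C[1]:0:0, ' ', C[2]:0:0, ' ', C[3]:0:0);
  // prints: 19 22 43 50
end.
```

The transpose costs one extra pass over B, which is repaid because each of its rows is then reused for every row of A.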
3. Thread Pool Parallelization
- Uses Delphi RTL's `TParallel.For` (reuses threads)
- Threshold: only parallelizes if >256 elements
4. Gradient Clipping
- Configurable max gradient norm
- Prevents exploding gradients during training
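A parameter update with clipping might be sketched as follows. The text describes the limit both as a maximum magnitude and as a maximum norm, so this example assumes clipping by global L2 norm; `SgdStep` is a hypothetical name and the library's `Step` may clamp element-wise instead.

```pascal
program ClipStepSketch;
{$APPTYPE CONSOLE}

// Hypothetical SGD step: if the gradient's L2 norm exceeds GradClip, the
// whole gradient is rescaled so that its norm equals GradClip.
procedure SgdStep(var Params: array of Single; const Grads: array of Single;
  LearningRate, GradClip: Single);
var
  I: Integer;
  SumSq, Norm, Scale: Double;
begin
  SumSq := 0;
  for I := 0 to High(Grads) do
    SumSq := SumSq + Grads[I] * Grads[I];
  Norm := Sqrt(SumSq);
  if Norm > GradClip then
    Scale := GradClip / Norm
  else
    Scale := 1.0;
  for I := 0 to High(Params) do
    Params[I] := Params[I] - LearningRate * Scale * Grads[I];
end;

var
  P, G: array[0..1] of Single;
begin
  P[0] := 1.0; P[1] := 1.0;
  G[0] := 30.0; G[1] := 40.0;       // norm = 50
  SgdStep(P, G, 0.5, 5.0);          // gradient is rescaled to (3, 4)
  Writeln(P[0]:0:2, ' ', P[1]:0:2); // prints: -0.50 -1.00
end.
```

Rescaling by the norm preserves the gradient's direction, whereas clamping each element independently can change it.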
5. Persistent Parameters
- Parameters allocated once, gradients pre-allocated
- `ResetActivations()` only wipes temporary tensors
6. Model Persistence
- Binary format for fast save/load
- Preserves all parameters and architecture
- Version-compatible format for future updates
- N-dimensional tensor support
- Broadcasting for element-wise ops
- Batch matrix multiplication
- Interactive hyperparameter tuning
- Model save/load persistence
- Conv2D with im2col + GEMM optimization
- MNIST demo with live visualization
- AVX-512/SSE kernels with CPUID detection
- MaxPool2D and Dropout layers
- Loss charts and confusion matrices
- Proper Softmax Jacobian backward pass
- He weight initialization (correct fan_in)
- GPU acceleration (CUDA/OpenCL)
- Additional optimizers (Adam, RMSprop)
Contributions welcome! Areas of interest:
- Additional layer types (e.g., BatchNorm)
- Performance optimizations
- More demos and examples
- Documentation
MIT License; see LICENSE for details.
Built with ❤️ in Delphi