leighrobertabbott/NeuralDelphi

NeuralDelphi

Delphi · MIT License · Windows

A high-performance, pure Delphi machine learning framework. No Python. No external DLLs. Just fast, native code.


✨ Features

  • 🚀 Arena-Based Memory: Zero allocation/deallocation during training
  • ⚡ SIMD Assembly: AVX-512 and SSE kernels with CPUID auto-detection
  • 🔄 Automatic Differentiation: Full autograd with computation graphs
  • 🧵 Thread Pool Parallelization: Efficient multi-core utilization
  • 📦 Zero Dependencies: Pure Delphi, compiles standalone
  • 🎛️ N-Dimensional Tensors: Full Shape/Strides support for any dimensionality
  • 📡 Broadcasting: NumPy-style automatic shape broadcasting
  • 🔢 Batch Operations: Batched matrix multiplication for 3D+ tensors
  • 📊 Training Visualization: Live loss charts and confusion matrices
  • 💾 Model Persistence: Save/Load trained models to disk

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        NeuralDelphi                              │
├──────────────────────────────────────────────────────────────────┤
│  ML.Arena    │  Linear memory allocator (zero GC overhead)       │
│  ML.Tensor   │  N-D tensor views with Shape/Strides              │
│  ML.Ops      │  SIMD kernels + parallel ops + broadcasting       │
│  ML.Graph    │  Computation graph + autograd                     │
└──────────────────────────────────────────────────────────────────┘

Component Details

ML.Arena.pas - Memory Management

The foundation of NeuralDelphi's performance. Implements a linear allocator (also called a "bump allocator" or "arena allocator") that pre-allocates a large contiguous block of memory.

Key Concepts:

  • TMemPtr: An Integer index into the arena, not a pointer. This avoids pointer arithmetic issues and makes the system 32/64-bit agnostic.
  • TArena.Alloc(Count): O(1) allocation - just increments the head pointer. No free lists, no fragmentation.
  • TArena.Reset(): O(1) deallocation - sets head to 0. All memory is "freed" instantly.
  • GetSavePoint() / Restore(): Critical for the graph architecture. Allows resetting only temporary activations while keeping persistent parameters.

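Why This Matters: Every GetMem/FreeMem call carries a cost (heap-manager bookkeeping, locking when several threads allocate, fragmentation, and occasional requests to the OS). During training, you might allocate millions of temporary tensors. The arena replaces each of those calls with a single index increment and releases all of them with one reset.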

Example:

Arena := TArena.Create(256);        // Allocate 256MB block
W1 := Arena.Alloc(8 * 2);          // Allocate 16 floats (8x2 matrix)
W2 := Arena.Alloc(1 * 8);          // Allocate 8 floats (1x8 matrix)
// ... use W1, W2 ...
Arena.Reset;                        // Free everything instantly
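
The allocator described above can be sketched in a few lines. This is a minimal illustration, not the code in ML.Arena.pas: the TSketchArena type, its Init method, and the element-count capacity are assumptions made for the example, while Alloc, Reset, GetSavePoint, and Restore mirror the names listed under Key Concepts.

```delphi
program ArenaSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical stand-in for TArena: one preallocated block of Single
  // values plus a Head index marking the next free slot.
  TSketchArena = record
    Data: TArray<Single>;
    Head: Integer;
    procedure Init(Capacity: Integer);
    function Alloc(Count: Integer): Integer;   // returns an index, not a pointer
    function GetSavePoint: Integer;
    procedure Restore(SavePoint: Integer);
    procedure Reset;
  end;

procedure TSketchArena.Init(Capacity: Integer);
begin
  SetLength(Data, Capacity);
  Head := 0;
end;

function TSketchArena.Alloc(Count: Integer): Integer;
begin
  if Head + Count > Length(Data) then
    raise Exception.Create('Arena exhausted');
  Result := Head;        // the caller's block starts at the old head
  Inc(Head, Count);      // O(1): bump the head past the block
end;

function TSketchArena.GetSavePoint: Integer;
begin
  Result := Head;
end;

procedure TSketchArena.Restore(SavePoint: Integer);
begin
  Head := SavePoint;     // everything allocated after the savepoint is dropped
end;

procedure TSketchArena.Reset;
begin
  Head := 0;
end;

var
  Arena: TSketchArena;
  W1, Act, Mark: Integer;
begin
  Arena.Init(1024);
  W1 := Arena.Alloc(16);          // persistent parameters
  Mark := Arena.GetSavePoint;     // boundary between params and activations
  Act := Arena.Alloc(100);        // temporary activations
  Arena.Restore(Mark);            // drop activations, keep parameters
  Writeln(W1, ' ', Act, ' ', Arena.Head);   // prints "0 16 16"
end.
```

Because allocations are plain indices, restoring a savepoint invalidates every block handed out after it without touching the blocks before it, which is the property the graph relies on for ResetActivations().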

ML.Tensor.pas - N-Dimensional Tensor Abstraction

A lightweight record (not a class!) that acts as a view into the arena. Supports arbitrary dimensions with NumPy-style shape and strides.

Key Fields:

  • DataPtr: TMemPtr: Index into arena where tensor data lives
  • GradPtr: TMemPtr: Index for gradients (allocated on-demand during backward pass)
  • Shape: TArray<Integer>: Dimensions, e.g., [32, 3, 224, 224] for batch of images
  • Strides: TArray<Integer>: Memory strides for each dimension (row-major)
  • RequiresGrad: Boolean: Whether this tensor needs gradients computed

Key Methods:

  • NDim: Returns number of dimensions
  • ElementCount: Total number of elements (product of shape)
  • IsContiguous: Checks whether the strides describe a dense row-major layout (no gaps or reordering)
  • GetLinearIndex(Indices): Converts N-D indices to linear index
  • Reshape(NewShape): Zero-copy view with new shape
  • Transpose(Dim0, Dim1): Zero-copy dimension swap
  • Squeeze / Unsqueeze: Add/remove dimensions of size 1
  • RawData(Arena): Returns PSingle pointer for direct memory access
  • RawGrad(Arena): Returns gradient pointer, or nil if not allocated

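Why Records, Not Classes:

  • No per-tensor object allocation or destructor call (element data lives in the arena; only the Shape and Strides arrays are managed)
  • Value semantics (can copy freely; a copy is another view of the same arena data)
  • Cache-friendly (element data sits in one contiguous arena block)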

Example:

var
  T, T2, T3: TTensor;
begin
  // Arena is an existing TArena instance
  T := TTensor.Create(Arena, [32, 8, 64], True);  // 3D tensor, needs gradients
  // T.Shape = [32, 8, 64]
  // T.Strides = [512, 64, 1]  (row-major)
  // T.ElementCount = 16384

  // Zero-copy reshape
  T2 := T.Reshape([32, 512]);  // Same data, different view

  // Zero-copy transpose
  T3 := T.Transpose(0, 1);     // Swaps first two dimensions: shape [8, 32, 64]
end;
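
The stride values in the comments above follow from the row-major rule. The sketch below shows how they can be derived and how an N-D index maps to a linear offset. RowMajorStrides and LinearIndex are hypothetical free functions written for this example; the real logic lives inside TTensor and may differ.

```delphi
program StridesSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Row-major strides: the last dimension has stride 1 and each earlier
// dimension's stride is the product of all later dimension sizes.
function RowMajorStrides(const Shape: array of Integer): TArray<Integer>;
var
  I, Acc: Integer;
begin
  SetLength(Result, Length(Shape));
  Acc := 1;
  for I := High(Shape) downto 0 do
  begin
    Result[I] := Acc;
    Acc := Acc * Shape[I];
  end;
end;

// Linear offset of an N-D index: sum of Indices[d] * Strides[d].
function LinearIndex(const Indices, Strides: array of Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Indices) do
    Inc(Result, Indices[I] * Strides[I]);
end;

var
  S: TArray<Integer>;
begin
  S := RowMajorStrides([32, 8, 64]);
  Writeln(S[0], ' ', S[1], ' ', S[2]);    // prints "512 64 1"
  Writeln(LinearIndex([1, 2, 3], S));     // 1*512 + 2*64 + 3*1, prints 643
end.
```

A zero-copy transpose fits the same model: swapping two entries of Shape and the matching two entries of Strides changes which element each index reaches without moving any data.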

ML.Ops.pas - Mathematical Operations

Contains three layers: Pure ASM Kernels, Parallel Execution, and High-Level Tensor Ops with Broadcasting Support.

1. TKernels - Pure Assembly Math Kernels

Hand-written x64 SSE assembly for maximum performance. These are stateless functions that operate on raw pointers.

  • DotProduct(A, B, Count): SIMD dot product using MOVUPS, MULPS, ADDPS, and HADDPS (the last is an SSE3 instruction). Processes 4 floats at once.
  • VectorAdd(A, B, Out, Count): Element-wise addition with SSE ADDPS.
  • VectorMul(A, B, Out, Count): Element-wise multiplication with SSE MULPS.
  • Transpose(Src, Dst, Rows, Cols): Block-based matrix transpose (8x8 blocks) for cache efficiency.

2. TMLParallel - Thread Pool Wrapper

Wraps System.Threading.TParallel.For with a threshold check. Only parallelizes if the workload is substantial (>256 elements) to avoid scheduling overhead.

3. TOps - High-Level Tensor Operations with Broadcasting

Combines kernels + parallelism + tensor management. Supports NumPy-style broadcasting.

Broadcasting Example:

// [32, 10] + [10] = [32, 10]  (bias broadcast across batch)
// [8, 1] + [1, 8] = [8, 8]    (outer product style)
TOps.Add(Arena, A, B, Out);     // Automatically broadcasts if shapes compatible

Batch MatMul:

// [Batch, M, K] @ [Batch, K, N] -> [Batch, M, N]
// Processes each batch independently with parallel inner loops
TOps.MatMul(Arena, A, B, Out);  // Works with 2D, 3D, or higher

Key Operations:

  • MatMul: Batched matrix multiplication. Transposes B for cache-friendly access, parallelizes rows, uses SIMD dot product.
  • Add / Mul: Element-wise with broadcasting support.
  • ReLU / LeakyReLU / Sigmoid: Activation functions.
  • MSE / CrossEntropy: Loss functions.
  • *Backward: Gradient computation with broadcasting-aware reduction.

ML.Graph.pas - Computation Graph & Autograd

The "brain" of NeuralDelphi. Implements automatic differentiation by building a computation graph.

Key Concepts:

1. Computation Graph: Each operation creates a TNode that records:

  • Operation type (opMatMul, opReLU, etc.)
  • Input node indices (parents)
  • Output tensor
  • Whether gradients are needed

2. Forward Pass: Operations are executed immediately as you build the graph:

W := Graph.Param([8, 2]);        // Creates param node with shape [8, 2]
X := Graph.Input([2, 1]);         // Creates input placeholder
H := Graph.MatMul(W, X);         // Executes MatMul, creates node
A := Graph.LeakyReLU(H);         // Executes LeakyReLU, creates node

3. Backward Pass: Traverses the graph in reverse, computing gradients using the chain rule:

Graph.Backward(LossNode);  // Computes gradients for all nodes requiring them

4. Memory Architecture:

  • MarkParamsEnd(): Called after all Param() calls. Marks the boundary between persistent parameters and temporary activations.
  • ResetActivations(): Resets arena to param savepoint. Wipes activations but keeps parameters intact.
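
To make the forward and backward passes concrete, here is a scalar sketch of the same idea: each operation runs eagerly, records a node with its parent indices, and Backward walks the nodes in reverse applying the chain rule. Everything here (TSketchNode, AddNode, NodeAdd, NodeMul) is hypothetical and operates on Double values; the real TNode works on tensors and supports many more operations.

```delphi
program TapeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TOpKind = (opLeaf, opAdd, opMul);

  // Scalar stand-in for TNode; the real nodes hold tensors, not Doubles.
  TSketchNode = record
    Op: TOpKind;
    A, B: Integer;          // parent node indices, -1 for leaves
    Value, Grad: Double;
  end;

var
  Nodes: TArray<TSketchNode>;

// Appends a node whose forward value has already been computed.
function AddNode(Op: TOpKind; A, B: Integer; Value: Double): Integer;
begin
  Result := Length(Nodes);
  SetLength(Nodes, Result + 1);
  Nodes[Result].Op := Op;
  Nodes[Result].A := A;
  Nodes[Result].B := B;
  Nodes[Result].Value := Value;
  Nodes[Result].Grad := 0;
end;

function NodeAdd(A, B: Integer): Integer;
begin
  Result := AddNode(opAdd, A, B, Nodes[A].Value + Nodes[B].Value);
end;

function NodeMul(A, B: Integer): Integer;
begin
  Result := AddNode(opMul, A, B, Nodes[A].Value * Nodes[B].Value);
end;

// Nodes are stored in creation order, so iterating downward from the
// root reaches each node only after all nodes that consume it.
procedure Backward(Root: Integer);
var
  I, PA, PB: Integer;
  G: Double;
begin
  Nodes[Root].Grad := 1.0;
  for I := Root downto 0 do
  begin
    PA := Nodes[I].A;
    PB := Nodes[I].B;
    G := Nodes[I].Grad;
    case Nodes[I].Op of
      opAdd:   // d(a+b)/da = 1, d(a+b)/db = 1
        begin
          Nodes[PA].Grad := Nodes[PA].Grad + G;
          Nodes[PB].Grad := Nodes[PB].Grad + G;
        end;
      opMul:   // d(a*b)/da = b, d(a*b)/db = a
        begin
          Nodes[PA].Grad := Nodes[PA].Grad + G * Nodes[PB].Value;
          Nodes[PB].Grad := Nodes[PB].Grad + G * Nodes[PA].Value;
        end;
    end;
  end;
end;

var
  W, X, Bias, Y: Integer;
begin
  W := AddNode(opLeaf, -1, -1, 3.0);
  X := AddNode(opLeaf, -1, -1, 2.0);
  Bias := AddNode(opLeaf, -1, -1, 1.0);
  Y := NodeAdd(NodeMul(W, X), Bias);     // y = w*x + b = 7
  Backward(Y);
  Writeln(Nodes[Y].Value:0:1, ' ', Nodes[W].Grad:0:1, ' ',
    Nodes[X].Grad:0:1, ' ', Nodes[Bias].Grad:0:1);   // prints "7.0 2.0 3.0 1.0"
end.
```

Gradients are accumulated with += rather than assigned, so a node that feeds several consumers receives the sum of their contributions.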

Key Methods:

  • Param(Shape): Creates trainable parameter with N-D shape. Pre-allocates gradients.
  • Input(Shape): Creates input placeholder with N-D shape.
  • Step(LearningRate, GradClip): Updates parameters with gradient clipping.
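
A parameter update with clipping could look like the sketch below. It assumes plain SGD and a per-element clamp to [-GradClip, GradClip]; the README does not say whether Step clips per element or by overall norm, so treat the clamp (and the SgdStep name) as an illustrative assumption.

```delphi
program StepSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Assumed update rule: clamp each gradient to [-GradClip, GradClip],
// then apply plain SGD. The real Step may clip by norm instead.
procedure SgdStep(var Params: array of Single; const Grads: array of Single;
  LearningRate, GradClip: Single);
var
  I: Integer;
  G: Single;
begin
  for I := 0 to High(Params) do
  begin
    G := Grads[I];
    if G > GradClip then
      G := GradClip
    else if G < -GradClip then
      G := -GradClip;
    Params[I] := Params[I] - LearningRate * G;
  end;
end;

var
  P: array[0..1] of Single = (1.0, 1.0);
  G: array[0..1] of Single = (0.5, 100.0);
begin
  SgdStep(P, G, 0.5, 5.0);
  // First weight: 1 - 0.5*0.5; second: gradient 100 is clamped to 5.
  Writeln(P[0]:0:2, ' ', P[1]:0:2);    // prints "0.75 -1.50"
end.
```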

🎯 XOR Demo

The included demo (XOR_Demo.dpr) trains a neural network to learn the XOR function with a real-time visual heatmap, live loss chart, and interactive control panel.

Interactive Controls

| Control | Default | Description |
|---|---|---|
| Learning Rate | 0.5 | How fast to learn. Higher = faster but may overshoot |
| Hidden Neurons | 16 | Network capacity. More = better fit, slower training |
| Grad Clip | 5.0 | Maximum gradient magnitude. Prevents exploding gradients |
| Start/Stop | - | Toggle training on/off |
| Reset Network | - | Reinitialize with random weights |
| Save/Load Model | - | Persist trained weights to disk |
| Loss Chart | - | Live visualization of training loss |

Tips:

  • Learning rate 0.5-1.0 works well for XOR
  • 8-32 hidden neurons is plenty
  • Watch the loss chart flatten as the network converges

The XOR Problem

XOR (exclusive OR) is a classic non-linearly separable problem that requires a hidden layer:

| Input A | Input B | Expected Output | Visual |
|---|---|---|---|
| 0 | 0 | 0 | Red |
| 0 | 1 | 1 | Blue |
| 1 | 0 | 1 | Blue |
| 1 | 1 | 0 | Red |

Network Architecture

Input(2) → MatMul(W1: Hx2) → Add(B1: Hx1) → LeakyReLU →
          MatMul(W2: 1xH) → Add(B2: 1x1) → Sigmoid → Output(1)

Where H = Hidden Neurons (configurable via UI)

Visualization:

  • Heatmap shows network's prediction for every (x, y) coordinate
  • Red = predicts 0, Blue = predicts 1
  • Corners show actual XOR truth table
  • Updates in real-time as network learns

🔧 Building

Requirements

  • RAD Studio 11+ (Delphi)
  • Platform: Windows x64 (for SIMD assembly)

Steps

  1. Open XOR_Demo.dpr in RAD Studio
  2. Select 64-bit Windows target
  3. Build and Run (F9)

Note: 32-bit builds use scalar fallbacks (no SIMD)

🎯 MNIST Demo

The MNIST_Demo.dpr demonstrates training a CNN on handwritten digits with live training visualization.

Features

  • Conv2D layers with im2col + GEMM optimization
  • Live loss chart showing training progress
  • Confusion matrix visualizing classification errors
  • Save/Load trained models
  • Bulk data loading for fast dataset initialization

Network Architecture

Input[1,28,28] → Conv1(16ch, 3x3, stride=2) → ReLU →
                 Conv2(32ch, 3x3, stride=2) → ReLU →
                 Flatten → Dense(128) → ReLU →
                 Dense(10) → Softmax → Output
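
The size of the Flatten output depends on the convolution padding, which the diagram above does not state. The sketch below applies the standard output-size formula to both common choices; ConvOutSize and FlattenSize are hypothetical helpers, and which padding MNIST_Demo.dpr actually uses is not confirmed here.

```delphi
program ConvShapeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Standard convolution output-size formula for one spatial dimension.
function ConvOutSize(InSize, Kernel, Stride, Padding: Integer): Integer;
begin
  Result := (InSize + 2 * Padding - Kernel) div Stride + 1;
end;

// Flattened feature count after the two stride-2 3x3 convolutions,
// assuming the same padding is used in both layers.
function FlattenSize(Padding: Integer): Integer;
var
  S1, S2: Integer;
begin
  S1 := ConvOutSize(28, 3, 2, Padding);
  S2 := ConvOutSize(S1, 3, 2, Padding);
  Result := 32 * S2 * S2;     // 32 output channels from Conv2
end;

begin
  Writeln(FlattenSize(0));    // 28 -> 13 -> 6, prints 1152
  Writeln(FlattenSize(1));    // 28 -> 14 -> 7, prints 1568
end.
```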

📁 Project Structure

NeuralDelphi/
├── ML.Arena.pas      # Memory arena allocator
├── ML.Tensor.pas     # N-D tensor with Shape/Strides
├── ML.Ops.pas        # Math operations + SIMD + im2col Conv2D
├── ML.Graph.pas      # Computation graph + autograd
├── XOR_Demo.dpr      # Interactive XOR visualization
├── MNIST_Demo.dpr    # CNN digit classification demo
├── MNIST_Loader.pas  # Fast MNIST dataset loader
├── LICENSE           # MIT License
└── README.md

📊 Supported Operations

Core Operations

| Operation | Description | Broadcasting | Batched |
|---|---|---|---|
| MatMul | Matrix multiplication C = A @ B | No | ✅ 3D+ |
| Add | Element-wise addition C = A + B | ✅ | ✅ |
| Mul | Element-wise multiplication C = A * B | ✅ | ✅ |
| Conv2D | 2D Convolution (im2col + GEMM) | No | ✅ |
| MaxPool2D | Max pooling with gradient routing | No | ✅ |
| Dropout | Inverted dropout regularization | No | ✅ |

Activation Functions

| Operation | Formula | Use Case |
|---|---|---|
| ReLU | f(x) = max(0, x) | Standard activation |
| LeakyReLU | f(x) = max(αx, x) | Prevents dying ReLU |
| Sigmoid | f(x) = 1/(1+e^(-x)) | Binary classification output |
| Tanh | f(x) = tanh(x) | Centered around 0 |
| Softmax | f(x_i) = e^(x_i) / Σe^(x_j) | Multi-class output |
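
The formulas in the activation table translate directly into scalar code. The sketch below is a plain reference version written for this example, not the SIMD or parallel code in ML.Ops.pas; the max-subtraction in Softmax is a common stability technique and is an assumption about how a robust implementation would be written.

```delphi
program ActivationSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

function ReLU(X: Double): Double;
begin
  if X > 0 then Result := X else Result := 0;
end;

function LeakyReLU(X, Alpha: Double): Double;
begin
  // Equals max(Alpha*X, X) when 0 < Alpha < 1.
  if X > 0 then Result := X else Result := Alpha * X;
end;

function Sigmoid(X: Double): Double;
begin
  Result := 1.0 / (1.0 + Exp(-X));
end;

// Softmax with the maximum subtracted first; the result is unchanged
// mathematically but Exp can no longer overflow.
function Softmax(const X: array of Double): TArray<Double>;
var
  I: Integer;
  MaxX, Sum: Double;
begin
  SetLength(Result, Length(X));
  MaxX := X[0];
  for I := 1 to High(X) do
    if X[I] > MaxX then MaxX := X[I];
  Sum := 0;
  for I := 0 to High(X) do
  begin
    Result[I] := Exp(X[I] - MaxX);
    Sum := Sum + Result[I];
  end;
  for I := 0 to High(X) do
    Result[I] := Result[I] / Sum;
end;

var
  P: TArray<Double>;
begin
  Writeln(ReLU(-2.0):0:2, ' ', LeakyReLU(-2.0, 0.01):0:2, ' ',
    Sigmoid(0.0):0:2);                // prints "0.00 -0.02 0.50"
  P := Softmax([1000.0, 1000.0]);     // a naive Exp(1000) would overflow
  Writeln(P[0]:0:2, ' ', P[1]:0:2);   // prints "0.50 0.50"
end.
```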

Loss Functions

| Operation | Formula | Use Case |
|---|---|---|
| MSE | L = (1/n)Σ(pred - target)² | Regression |
| CrossEntropy | L = -Σ(target * log(pred)) | Classification |
| SoftmaxCrossEntropy | Combined softmax + CE | Numerically stable classification |
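
The reason the combined SoftmaxCrossEntropy is more stable is that it never forms the softmax probabilities and then takes their logarithm; it computes the log-probabilities directly from the logits. The sketch below shows one common way to do this (the log-sum-exp identity) next to a plain MSE. It is an illustration under that assumption, not the code from ML.Ops.pas.

```delphi
program LossSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Mean squared error: (1/n) * sum((pred - target)^2).
function MSE(const Pred, Target: array of Double): Double;
var
  I: Integer;
  D: Double;
begin
  Result := 0;
  for I := 0 to High(Pred) do
  begin
    D := Pred[I] - Target[I];
    Result := Result + D * D;
  end;
  Result := Result / Length(Pred);
end;

// Fused softmax + cross-entropy on raw logits, using the identity
// log softmax(z)_i = z_i - (M + ln(sum_j exp(z_j - M))), M = max(z).
function SoftmaxCrossEntropy(const Logits, Target: array of Double): Double;
var
  I: Integer;
  M, Sum, LogSumExp: Double;
begin
  M := Logits[0];
  for I := 1 to High(Logits) do
    if Logits[I] > M then M := Logits[I];
  Sum := 0;
  for I := 0 to High(Logits) do
    Sum := Sum + Exp(Logits[I] - M);
  LogSumExp := M + Ln(Sum);
  Result := 0;
  for I := 0 to High(Logits) do
    Result := Result - Target[I] * (Logits[I] - LogSumExp);
end;

begin
  Writeln(MSE([1.0, 2.0], [0.0, 0.0]):0:2);                  // prints "2.50"
  Writeln(SoftmaxCrossEntropy([0.0, 0.0], [1.0, 0.0]):0:4);  // ln 2, prints "0.6931"
end.
```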

🎛️ Broadcasting Rules

NeuralDelphi follows NumPy broadcasting semantics:

  1. Align shapes from the right: [32, 10] and [10] align as [32, 10] + [1, 10]
  2. Dimensions must match or be 1: [8, 1] + [1, 8] → [8, 8]
  3. Result shape is element-wise max: [32, 1, 64] + [1, 8, 64] → [32, 8, 64]

Helper Functions:

if CanBroadcast(ShapeA, ShapeB) then
  OutShape := BroadcastShapes(ShapeA, ShapeB);
  
// During element-wise ops, use BroadcastIndex to map output → input indices
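
The three rules can be checked and applied in a single right-to-left pass over both shapes. The sketch below folds the roles of CanBroadcast and BroadcastShapes into one hypothetical function, TryBroadcast, for brevity; the real helpers keep them separate and their signatures may differ.

```delphi
program BroadcastSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns True and fills OutShape when A and B are
// broadcast-compatible under the three rules above.
function TryBroadcast(const A, B: array of Integer;
  out OutShape: TArray<Integer>): Boolean;
var
  N, I, DA, DB: Integer;
begin
  N := Length(A);
  if Length(B) > N then
    N := Length(B);
  SetLength(OutShape, N);
  for I := 1 to N do          // I counts dimensions from the right
  begin
    // A missing leading dimension behaves like size 1.
    if I <= Length(A) then DA := A[Length(A) - I] else DA := 1;
    if I <= Length(B) then DB := B[Length(B) - I] else DB := 1;
    if (DA <> DB) and (DA <> 1) and (DB <> 1) then
      Exit(False);
    if DA > DB then OutShape[N - I] := DA else OutShape[N - I] := DB;
  end;
  Result := True;
end;

var
  S: TArray<Integer>;
begin
  if TryBroadcast([32, 1, 64], [1, 8, 64], S) then
    Writeln(S[0], ' ', S[1], ' ', S[2]);          // prints "32 8 64"
  Writeln(TryBroadcast([32, 10], [10], S));       // prints TRUE
  Writeln(TryBroadcast([32, 10], [32], S));       // prints FALSE (10 vs 32)
end.
```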

🔢 Batch Matrix Multiplication

MatMul supports N-dimensional tensors where the last two dimensions are the matrix dimensions:

// A: [Batch, M, K]  ×  B: [Batch, K, N]  →  Out: [Batch, M, N]
// Each batch is multiplied independently

A := TTensor.Create(Arena, [32, 64, 128]);   // 32 matrices of 64x128
B := TTensor.Create(Arena, [32, 128, 256]);  // 32 matrices of 128x256
Out := TTensor.Create(Arena, [32, 64, 256]); // Result: 32 matrices of 64x256

TOps.MatMul(Arena, A, B, Out);  // Parallel across batches and rows
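
For contiguous row-major tensors, "each batch is multiplied independently" amounts to offsetting into the three flat buffers by one matrix per batch. The sketch below shows that indexing with a naive sequential triple loop; BatchMatMul is a hypothetical name, and the real TOps.MatMul adds transposition, SIMD, and parallelism on top.

```delphi
program BatchMatMulSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// C[b] = A[b] * B[b] for b in 0..Batch-1, all buffers flat and row-major.
procedure BatchMatMul(const A, B: array of Single; var C: array of Single;
  Batch, M, K, N: Integer);
var
  Bt, I, J, P, OffA, OffB, OffC: Integer;
  Sum: Single;
begin
  for Bt := 0 to Batch - 1 do
  begin
    // Each batch entry starts one full matrix further into its buffer.
    OffA := Bt * M * K;
    OffB := Bt * K * N;
    OffC := Bt * M * N;
    for I := 0 to M - 1 do
      for J := 0 to N - 1 do
      begin
        Sum := 0;
        for P := 0 to K - 1 do
          Sum := Sum + A[OffA + I * K + P] * B[OffB + P * N + J];
        C[OffC + I * N + J] := Sum;
      end;
  end;
end;

var
  A: array[0..3] of Single = (1, 2, 3, 4);   // two 1x2 matrices
  B: array[0..3] of Single = (5, 6, 7, 8);   // two 2x1 matrices
  C: array[0..1] of Single;                  // two 1x1 results
begin
  BatchMatMul(A, B, C, 2, 1, 2, 1);
  Writeln(C[0]:0:0, ' ', C[1]:0:0);          // 1*5+2*6 and 3*7+4*8: prints "17 53"
end.
```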

🧠 Performance Optimizations

1. SIMD (Single Instruction, Multiple Data)

  • SSE: Processes 4 floats simultaneously using SSE registers
  • AVX-512: Processes 16 floats simultaneously (when supported)
  • Automatic CPU feature detection with fallback chain
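  • DotProduct: up to ~4x faster (SSE) or ~16x faster (AVX-512) than scalar code; these figures are lane-width upper bounds, and memory bandwidth usually limits the real gain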
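
What the SSE kernel does can be modelled in plain Pascal: keep four independent partial sums (one per SIMD lane), combine them at the end (the job HADDPS does in the assembly version), and handle leftover elements one at a time. The sketch below is that scalar model, written for illustration; it is not the assembly from TKernels, and the DotProduct4 name is an assumption.

```delphi
program DotSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Scalar model of a 4-lane SIMD dot product.
function DotProduct4(const A, B: array of Single): Single;
var
  Acc: array[0..3] of Single;
  I, Lane, Blocks: Integer;
begin
  for Lane := 0 to 3 do
    Acc[Lane] := 0;
  Blocks := Length(A) div 4;
  // Main loop: each iteration stands in for one MULPS + ADDPS pair.
  for I := 0 to Blocks - 1 do
    for Lane := 0 to 3 do
      Acc[Lane] := Acc[Lane] + A[I * 4 + Lane] * B[I * 4 + Lane];
  // Horizontal add of the four lanes.
  Result := (Acc[0] + Acc[1]) + (Acc[2] + Acc[3]);
  // Scalar tail for counts that are not a multiple of 4.
  for I := Blocks * 4 to High(A) do
    Result := Result + A[I] * B[I];
end;

var
  A: array[0..5] of Single = (1, 2, 3, 4, 5, 6);
  B: array[0..5] of Single = (2, 2, 2, 2, 2, 2);
begin
  Writeln(DotProduct4(A, B):0:0);   // 2 * (1+2+3+4+5+6), prints "42"
end.
```

Note that the lane-wise summation order differs from a simple left-to-right loop, so results can differ from a scalar reference in the last few bits for general floating-point inputs.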

2. Cache-Friendly Matrix Multiplication

  • Transposes matrix B before multiplication
  • Block-based transpose (8x8 blocks) for L1 cache efficiency
  • Reduces cache misses by ~80%
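
The benefit of transposing B is that every output element becomes a dot product of two contiguous rows, which is also the shape of input the DotProduct kernel expects. The sketch below shows the idea with a simple element-by-element transpose and a scalar dot product; the 8x8 blocking and SIMD are left out, and the function names are assumptions made for the example.

```delphi
program MatMulSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Scalar stand-in for the SIMD DotProduct kernel.
function Dot(const A, B: array of Single; OffA, OffB, Count: Integer): Single;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to Count - 1 do
    Result := Result + A[OffA + I] * B[OffB + I];
end;

// C[M x N] = A[M x K] * B[K x N], all row-major flat arrays.
procedure MatMulTransposed(const A, B: array of Single;
  var C: array of Single; M, K, N: Integer);
var
  BT: TArray<Single>;
  I, J, P: Integer;
begin
  // Step 1: BT[N x K] = transpose(B), so each column of B becomes a
  // contiguous row.
  SetLength(BT, N * K);
  for P := 0 to K - 1 do
    for J := 0 to N - 1 do
      BT[J * K + P] := B[P * N + J];
  // Step 2: every output element is a dot product of two contiguous rows.
  for I := 0 to M - 1 do
    for J := 0 to N - 1 do
      C[I * N + J] := Dot(A, BT, I * K, J * K, K);
end;

var
  A: array[0..5] of Single = (1, 2, 3, 4, 5, 6);      // 2x3
  B: array[0..5] of Single = (7, 8, 9, 10, 11, 12);   // 3x2
  C: array[0..3] of Single;                           // 2x2
begin
  MatMulTransposed(A, B, C, 2, 3, 2);
  Writeln(C[0]:0:0, ' ', C[1]:0:0, ' ', C[2]:0:0, ' ', C[3]:0:0);
  // prints "58 64 139 154"
end.
```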

3. Thread Pool Parallelization

  • Uses Delphi RTL's TParallel.For (reuses threads)
  • Threshold: only parallelizes if >256 elements
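
A threshold check of this kind can be a thin wrapper around TParallel.For. The sketch below is one way to write it: ForEachIndex and AllDoubled are hypothetical names, the 256 cut-off is the value quoted above, and the actual TMLParallel interface may look different.

```delphi
program ParallelSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Threading;

const
  ParallelThreshold = 256;   // cut-off quoted in the README

// Hypothetical wrapper: run Body(I) for I in 0..Count-1, using the RTL
// thread pool only when the workload is large enough to amortise the
// scheduling cost.
procedure ForEachIndex(Count: Integer; const Body: TProc<Integer>);
var
  I: Integer;
begin
  if Count > ParallelThreshold then
    TParallel.For(0, Count - 1, Body)
  else
    for I := 0 to Count - 1 do
      Body(I);
end;

// Verifies that every element equals twice its index.
function AllDoubled(const Values: array of Single): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 0 to High(Values) do
    if Values[I] <> I * 2 then
      Exit(False);
end;

var
  Data: TArray<Single>;
begin
  SetLength(Data, 1000);
  ForEachIndex(Length(Data),
    procedure(Idx: Integer)
    begin
      Data[Idx] := Idx * 2;   // each index is written by exactly one task
    end);
  Writeln(AllDoubled(Data));  // prints TRUE
end.
```

The body must only touch data belonging to its own index (or otherwise synchronise), since above the threshold the iterations run concurrently and in no particular order.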

4. Gradient Clipping

  • Configurable max gradient norm
  • Prevents exploding gradients during training

5. Persistent Parameters

  • Parameters allocated once, gradients pre-allocated
  • ResetActivations() only wipes temporary tensors

6. Model Persistence

  • Binary format for fast save/load
  • Preserves all parameters and architecture
  • Version-compatible format for future updates

🚧 Roadmap

  • N-dimensional tensor support
  • Broadcasting for element-wise ops
  • Batch matrix multiplication
  • Interactive hyperparameter tuning
  • Model save/load persistence
  • Conv2D with im2col + GEMM optimization
  • MNIST demo with live visualization
  • AVX-512/SSE kernels with CPUID detection
  • MaxPool2D and Dropout layers
  • Loss charts and confusion matrices
  • Proper Softmax Jacobian backward pass
  • He weight initialization (correct fan_in)
  • GPU acceleration (CUDA/OpenCL)
  • Additional optimizers (Adam, RMSprop)

🤝 Contributing

Contributions welcome! Areas of interest:

  • Additional layer types (e.g. BatchNorm)
  • Performance optimizations
  • More demos and examples
  • Documentation

📜 License

MIT License: see LICENSE for details.


Built with ❤️ in Delphi
