The Hexagon NPU FastRPC backend provides hardware acceleration for GGML operations on Qualcomm's Hexagon NPU. Built from scratch on Qualcomm's FastRPC framework, it is a low-level alternative to QNN SDK-based implementations: by bypassing high-level abstractions and programming directly with HVX intrinsics, it offloads compute-intensive operations from the CPU to the specialized Hexagon NPU hardware, targeting maximum performance and power efficiency for machine learning workloads on Snapdragon platforms.
- Built from Scratch: Complete custom implementation using Qualcomm's FastRPC framework for direct hardware control
- Minimal Abstraction Layer: Lightweight, efficient abstraction objects on the host side for seamless GGML graph/tensor/buffer offloading
- Zero-Copy Communication: FastRPC and ION buffers enable efficient data sharing between CPU and NPU domains
- Raw HVX Intrinsics: Critical operations like `mul_mat` and `add` implemented with direct Hexagon Vector Extensions
- Custom Thread Pool: 4-thread parallel execution matching NPU hardware capabilities with intelligent load balancing
- VTCM Management: Thread-specific VTCM operations that fully leverage high-speed VTCM to reduce memory bandwidth pressure
- L2 Cache Optimization: Prefetching and cache-aware memory access patterns for maximum bandwidth utilization
- Hardware-Accelerated Formats: Q4_0, Q8_0, Q4_K quantized data types with custom HVX dequantization functions
- Mixed Precision Operations: Support for quantized (Q4_0, Q8_0, Q4_K) and FP32 mixed operations with efficient conversion tables
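To make the quantized formats concrete, here is a scalar reference of Q4_0 dequantization, the format the HVX kernels accelerate. The struct mirrors the GGML Q4_0 block layout (32 weights per block, two 4-bit values per byte), except that the per-block scale is stored as a plain `float` here for clarity rather than GGML's fp16; the backend replaces the inner loop with HVX vector intrinsics.

```cpp
#include <cstdint>

// Simplified Q4_0 block: 32 weights, packed as 16 bytes of nibbles plus a
// per-block scale (stored as float here; GGML uses fp16 on disk).
constexpr int QK4_0 = 32;

struct BlockQ4_0 {
    float   d;              // per-block scale
    uint8_t qs[QK4_0 / 2];  // 16 bytes = 32 packed 4-bit values
};

// Scalar reference dequantization: y[i] = (nibble - 8) * d.
// Low nibbles fill the first half of the block, high nibbles the second.
inline void dequantize_q4_0(const BlockQ4_0 &b, float *y) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        y[i]             = ((b.qs[i] & 0x0F) - 8) * b.d;  // low nibble
        y[i + QK4_0 / 2] = ((b.qs[i] >> 4)   - 8) * b.d;  // high nibble
    }
}
```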
- Matrix Multiplication (`mul_mat`): Custom HVX implementation with 4-thread parallelization for transformer workloads
- Element-wise Operations: Add and multiply operations with broadcasting support using direct HVX intrinsics
- RMS Normalization: Hand-optimized kernels for layer normalization operations common in modern LLMs
- Graph-level Execution: Entire computation graphs executed on NPU to minimize CPU-NPU memory transfers and maximize efficiency
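The 4-way parallelization scheme for `mul_mat` can be sketched as a row-band partition: each worker computes a contiguous band of output rows. This is a conceptual host-side sketch only; the actual backend dispatches the bands to QURT threads and replaces the scalar dot product with HVX intrinsics, and `mat_vec_mul`/`kNumWorkers` are illustrative names, not the backend's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int kNumWorkers = 4;  // matches the NPU's 4 hardware threads

// y = A * x, with A stored row-major as rows x cols.
// Each worker handles a contiguous band of output rows.
void mat_vec_mul(const float *A, const float *x, float *y,
                 size_t rows, size_t cols) {
    auto worker = [&](size_t r0, size_t r1) {
        for (size_t r = r0; r < r1; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < cols; ++c) {
                acc += A[r * cols + c] * x[c];  // HVX dot product in the real kernel
            }
            y[r] = acc;
        }
    };
    std::vector<std::thread> pool;
    size_t band = (rows + kNumWorkers - 1) / kNumWorkers;  // ceil(rows / workers)
    for (int t = 0; t < kNumWorkers; ++t) {
        size_t r0 = t * band;
        size_t r1 = std::min(rows, r0 + band);
        if (r0 < r1) pool.emplace_back(worker, r0, r1);
    }
    for (auto &th : pool) th.join();
}
```

Static banding keeps the partitioning cheap; the backend's load balancing refines this when rows have uneven cost.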
- Android: Primary target platform with Snapdragon SoCs featuring Hexagon NPU (8 Gen 1+, 8cx Gen 3+)
- Linux: Development and testing support for Hexagon-enabled platforms
- Windows on ARM: Cross-platform compatibility for Snapdragon-based Windows devices
- Minimal Overhead: FastRPC implementation provides direct hardware access with virtually no abstraction penalties
- Maximum Hardware Utilization: Custom implementation leverages all 4 hardware threads and HVX units on the Hexagon NPU
- Memory Bandwidth Optimization: Graph-level execution and zero-copy transfers reduce bottlenecks between CPU and NPU
- Power Efficiency: Direct NPU execution typically provides 2-5x better performance-per-watt compared to CPU execution
- Scalable Architecture: Designed to efficiently handle both small and large model workloads
- Host Device Management: `host_device.cpp` - NPU device interface and lifecycle management
- Host Graph Coordination: `graph.cpp` - graph creation, caching, and execution coordination
- Buffer Management: `buffer.cpp` - RPC memory buffers, ION allocation, and zero-copy data transfer
- Type Conversion Utilities: Efficient host-device data format conversion and RPC interface helpers
- Device Runtime: `device.cpp` - core NPU-side runtime executing on Hexagon hardware
- Graph Execution Engine: `graph.cpp` - NPU-side graph computation with 4-thread parallelization
- HVX Operation Kernels: `op_impl.cpp` - hand-optimized HVX intrinsic implementations
- Quantization Kernels: `quants.cpp` - HVX-optimized dequantization for Q4_0, Q8_0, and Q4_K
- Thread Management: `thread_pool.hpp` - custom 4-thread pool built on QURT primitives
- Memory Management: `vtcm_mem.hpp` - VTCM allocation with RAII semantics
- Interface Definition: `hexagon_npu.idl` - defines the RPC contract between the host CPU and the Hexagon NPU
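The RAII pattern behind `vtcm_mem.hpp` can be illustrated with a minimal sketch: acquire VTCM in the constructor, release it in the destructor, and delete the copy operations so each region has exactly one owner. This is an assumed shape, not the file's actual interface, and `std::malloc`/`std::free` stand in for the QURT VTCM allocation calls used on device.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative RAII wrapper for a VTCM region (hypothetical class name).
// On the NPU the constructor/destructor would call the QURT VTCM APIs;
// malloc/free stand in for them in this host-buildable sketch.
class VtcmMem {
  public:
    explicit VtcmMem(size_t size) : size_(size), ptr_(std::malloc(size)) {
        if (!ptr_) throw std::bad_alloc();  // fail loudly if VTCM is exhausted
    }
    ~VtcmMem() { std::free(ptr_); }  // region released when the owner goes out of scope

    VtcmMem(const VtcmMem &) = delete;             // single ownership:
    VtcmMem &operator=(const VtcmMem &) = delete;  // no accidental double-free

    void  *data() const { return ptr_; }
    size_t size() const { return size_; }

  private:
    size_t size_;
    void  *ptr_;
};
```

Scoping VTCM this way lets each worker thread claim its slice for the duration of a kernel and guarantees release even on an error path.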
We benchmarked the Hexagon NPU FastRPC backend against CPU-only execution. The benchmarks focus on large matrix multiplication, a critical compute bottleneck in transformer-based models.
We extended `test-backend-ops` to include large matrix multiplication scenarios that represent typical LLM inference patterns:
```diff
diff --git tests/test-backend-ops.cpp tests/test-backend-ops.cpp
index 9ec24d9f23c5bc93b1b1e98e890e1186632358f7..584150154eee761f2d300504c525d38265fe3eb0 100644
--- tests/test-backend-ops.cpp
+++ tests/test-backend-ops.cpp
@@ -4239,6 +4239,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  1, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  8, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 1024, {3, 2}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b,  8192, 1,  8192, {1, 1}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16384, 1, 16384, {1, 1}, {1, 1}));
         }
     }
     for (ggml_type type_a : other_types) {
```

The following table compares execution time (lower is better) across precision formats for a large 16384×16384 matrix multiplied by a 16384×1 vector on the devices listed below:
| device | commit | type | src0 dims | src1 dims | cpu_time (µs) | host_total (µs) | host_param_update (µs) | device_total (µs) | device_dequant (µs) | device_compute (µs) |
|---|---|---|---|---|---|---|---|---|---|---|
| 8gen2 | 8409dd1e9 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 |
| 8gen2 | 8409dd1e9 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 |
| 8gen2 | 8409dd1e9 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 |
The benchmark results reveal several important insights about the Hexagon NPU FastRPC implementation:
- Dequantization Bottleneck:
  - For quantized formats, 90-96% of NPU time is spent on dequantization, making this our primary optimization target
- Compute Efficiency:
  - When data resides in VTCM, the NPU shows excellent computational performance (~10,000 µs) for matrix multiplication operations
  - This represents a ~4x improvement over CPU FP32 performance when comparing pure computation time
  - However, overall NPU performance is currently limited by memory transfers and dequantization overhead
- Memory Access Patterns:
  - Pure F32 computation on the NPU (~90,000 µs) is slower than on the CPU (~39,000 µs) due to memory access patterns
  - The significant gap between F32 and post-dequantization Q4_0/Q8_0 execution suggests optimization potential through better VTCM utilization
- Communication Overhead:
  - The host_param_update time (700-800 µs) represents FastRPC communication overhead
  - While minimal for large operations, this overhead becomes proportionally significant for smaller tensor operations
  - Batching operations into larger computation graphs would help amortize these costs
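The amortization argument is simple arithmetic, sketched below with a hypothetical helper (not part of the backend) using the ~750 µs `host_param_update` cost from the table: dispatching ops one RPC call at a time pays that overhead per op, while submitting a whole graph in one call pays it once.

```cpp
// Back-of-the-envelope model of FastRPC overhead amortization.
// rpc_overhead_us: fixed per-call cost (~750 us measured above).
// ops_per_call:    how many graph operations one RPC call carries.
double per_op_overhead_us(double rpc_overhead_us, int ops_per_call) {
    return rpc_overhead_us / ops_per_call;
}
```

With 100 ops per graph the per-op overhead drops from ~750 µs to ~7.5 µs, which is why graph-level execution matters most for the small tensors found between the large matmuls.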
- Extended Operation Coverage: Additional custom HVX implementations for more GGML operations (attention, softmax, etc.)
- Dynamic Thread Scheduling: Runtime load balancing and work-stealing across the 4 hardware threads
- Performance Profiling Suite: Custom profiling tools for FastRPC execution paths and bottleneck analysis
- Advanced Graph Fusion: Sophisticated operation fusion techniques to minimize memory transfers and maximize NPU utilization