Enterprise-grade metrics validated at production scale
Operation: one evaluation of a fixed AST, counted per successful root-find within tolerance
| Metric | Value | Notes |
|---|---|---|
| Performance Improvement | >10x faster | SIMD vectorization + race-to-idle |
| Energy Efficiency | >5x improvement | Reduced compute + cooling savings |
| ARM64 Latency | 1.6ms / 100K | Edge deployment (RPi 5, Jetson) |
| CPU Speedup | 13.7x vs baseline | SIMD on hot paths |
| CPU Energy Savings | 10-30% per op | Race-to-idle strategy |
| API Latency | Sub-1ms | End-to-end response time |
| Throughput (Peak) | ~193K ops/sec | Sustained on edge hardware |
| Throughput (Per Node) | ~15K ops/sec | Typical per-node |
| Energy Efficiency (CPU) | ~0.3 J/FLOP | Industry-leading class |
| Scale Verification | 10,000+ nodes | Enterprise verified |
Source: CPU benchmarks from benchmark_data.md
FMA kernel (x = 1.5 * x + 2.0) // 426M ops/J @ ~72W
| Metric | Value | Notes |
|---|---|---|
| Test Duration | 600.0013s | 10 min continuous FMA loop |
| Throughput | 30.7B ops/sec | FP16 precision, 50M-element vector |
| Energy Efficiency | 426M ops/J | Measured at ~72W via NVML |
| vs SIMD Baseline | 1,023x faster | 30.7B vs 30M ops/sec SIMD |
| Backend | CUDA FP16 | Vulkan also available (80% CUDA perf) |
The power-optimized "Gold Master" configuration achieves 53% power savings (33.64W vs 72W TDP) and 0.88 Billion ops/Joule energy efficiency, more than double the efficiency of the throughput configuration. Choose this mode when energy cost and thermal constraints outweigh raw throughput needs.
| Metric | Target | Gold Master Actual | Improvement |
|---|---|---|---|
| Throughput | 8.3 B ops/sec | 29.67 B ops/sec | 3.5x |
| Power Usage | < 72 W (TDP) | 33.64 W | 53% savings |
| Energy Efficiency | 0.4 B ops/J | 0.88 B ops/J | 2.1x |
| Thermals | < 70°C | < 37°C (sustained) | Cool-running |
| Stability | 99.9% | 100% (0 throttles/errors) | Perfect |
Note: This Gold Master run used power-optimized settings achieving lower power draw at the cost of slightly lower throughput vs the 30.7B FMA benchmark. Both configurations are validated and available for deployment based on workload requirements.
Summary: SIMD baseline ~30M ops/sec (target) vs GPU throughput 30.7B ops/sec (L4 FP16, 10-min sustained), a 1,023x speedup over the SIMD baseline.
Source: GPU benchmarks measured 2025-11-07; see the repository for details.
Independent testing via Phoronix Test Suite on NVIDIA H100 80GB HBM3
FP64 85.8 TFLOPS — the raw compute that powers Black-Scholes, Monte Carlo, and deterministic financial models. Verified. Reproducible. Bit-exact.
- Dec 29, 2025: Exceeds NVIDIA spec of 67 TFLOPS
- Dec 30, 2025: Cross-validated repeatability
- Dec 30, 2025: Cryptographic verification support

All benchmarks independently verified via OpenBenchmarking.org, Dec 29-30, 2025.
Cross-platform SIMD with automatic detection and fallback
LuxiEdge automatically selects the best SIMD instruction set available on your hardware:
01. AVX-512 (newest Intel/AMD)
02. AVX2 (most x86_64 systems)
03. NEON (ARM devices)
| Platform | SIMD | GPU | Status |
|---|---|---|---|
| x86_64 (Intel/AMD) | AVX-512, AVX2, SSE4.2 | CUDA, Vulkan | Production Ready |
| ARM64 (Apple M1/M2, RPi 5) | NEON | Metal (planned) | Production Ready |
| NVIDIA GPUs | N/A | CUDA FP16/FP32 | Production Ready |
| AMD GPUs | N/A | Vulkan | Beta |
| RISC-V | RVV (planned) | N/A | Roadmap |