1.6ms/100K // 10x Performance // 5x Efficiency // 30.7B ops/sec

Performance Benchmarks

Enterprise-grade metrics validated at production scale

Operation: per evaluation of fixed AST // per successful root with tolerance

Core Performance Metrics

Metric Value Notes
Performance Improvement >10x faster SIMD vectorization + race-to-idle
Energy Efficiency >5x improvement Reduced compute + cooling savings
ARM64 Latency 1.6ms / 100K Edge deployment (RPi 5, Jetson)
CPU Speedup 13.7x vs baseline SIMD on hot paths
CPU Energy Savings 10-30% per op Race-to-idle strategy
API Latency Sub-1ms End-to-end response time
Throughput (Peak) ~193K ops/sec Sustained on edge hardware
Throughput (Per Node) ~15K ops/sec Typical per-node
Energy Efficiency (CPU) ~0.3 J/FLOP Industry-leading class
Scale Verification 10,000+ nodes Enterprise verified

Source: CPU benchmarks from benchmark_data.md

GPU Benchmarks // NVIDIA L4

GPU Validation // FP16 FMA Loop // 10-min Sustained

30.7B ops/sec

FMA kernel (x = 1.5 * x + 2.0) // 426M ops/J @ ~72W

Metric Value Notes
Test Duration 600.0013s 10 min continuous FMA loop
Throughput 30.7B ops/sec FP16 precision, 50M-element vector
Energy Efficiency 426M ops/J Measured at ~72W via NVML
vs SIMD Baseline 1,023x faster 30.7B vs 30M ops/sec SIMD
Backend CUDA FP16 Vulkan also available (80% CUDA perf)
Gold Master // Power-Optimized Nov 24, 2025

'Investment Grade' L4 Verification

Two Deployment Configurations

[+]
Throughput Mode: 30.7B ops/sec @ ~72W, 426M ops/J
[+]
Power Mode: 29.67B ops/sec @ 33.64W, 880M ops/J

The power-optimized "Gold Master" configuration achieves 53% power savings (33.64W vs 72W TDP) and 0.88 Billion ops/Joule energy efficiency—more than double the throughput configuration's efficiency. Choose this mode when energy cost and thermal constraints outweigh raw throughput needs.

Metric Target Gold Master Actual Improvement
Throughput 8.3B Ops/Sec 29.67 Billion Ops/Sec 3.5x
Power Usage < 72 W (TDP) 33.64 Watts 53% Savings
Energy Efficiency 0.4 B Ops/J 0.88 Billion Ops/Joule 2.1x
Thermals < 70C < 37C (Sustained) Cool-Running
Stability 99.9% 100% (0 Throttles/Errors) Perfect

Note: This Gold Master run used power-optimized settings achieving lower power draw at the cost of slightly lower throughput vs the 30.7B FMA benchmark. Both configurations are validated and available for deployment based on workload requirements.

Performance Comparison

30M

SIMD Baseline

ops/sec (target)

30.7B

GPU Throughput

ops/sec (L4 FP16, 10-min sustained)

1,023x

Speedup

vs SIMD baseline

Source: GPU benchmarks measured 2025-11-07. See repository for details (if available).

OpenBenchmarking.org Verified

Third-Party Verified Benchmarks

Independent testing via Phoronix Test Suite on NVIDIA H100 80GB HBM3

FP64 85.8 TFLOPS — the raw compute that powers Black-Scholes, Monte Carlo, and deterministic financial models. Verified. Reproducible. Bit-exact.

GPU Compute Suite

Marketing Baseline

Dec 29, 2025

FP64 Double Precision 85,819 GFLOPS
Single Precision 53,949 GFLOPS
Half Precision (FP16) 27,218 GFLOPS
SHA-256 Hashcat 15.13B H/s

Exceeds NVIDIA spec of 67 TFLOPS

View on OpenBenchmarking.org →

GPU Compute Validation

Confirmed

Dec 30, 2025

FP64 Double Precision 78,655 GFLOPS
Single Precision 51,893 GFLOPS
Half Precision (FP16) 26,978 GFLOPS
Deviation 0.59% (13 samples)

Cross-validated repeatability

View on OpenBenchmarking.org →

SHA-256 Cryptographic

SHA-256 Verified

Dec 30, 2025

Throughput 15.08B H/s
Ranking 82nd percentile
vs 121 GPUs Tested 2.7x median

Cryptographic verification support

View on OpenBenchmarking.org →

System Configuration

Hardware: 2x Intel Xeon Platinum 8468 (96 cores), 1520GB RAM, NVIDIA H100 80GB HBM3
Software: Ubuntu 22.04, CUDA 12.4, GCC 11.4.0
Environment: Docker container on Lambda Labs cloud

All benchmarks independently verified via OpenBenchmarking.org | Dec 29-30, 2025

AVX-512 Performance (November 10, 2025)

Intel/AMD High-Performance SIMD // 25% Faster than AVX2

x86_64 Vector Extensions

Cross-platform SIMD with automatic detection and fallback

AVX-512 (Latest Intel/AMD)

  • Throughput: 2.83-3.40 Gelem/s
  • Improvement: 25% faster than AVX2
  • Platforms: Intel Xeon (Ice Lake+), AMD EPYC (Zen 4+)
  • Vector Width: 512-bit (8x f64)

AVX2 (General Purpose)

  • Throughput: 193k-30M ops/sec
  • Speedup: 13.7x vs scalar baseline
  • Platforms: Most Intel/AMD CPUs (2013+)
  • Vector Width: 256-bit (4x f64)

Auto-Detection & Fallback

LuxiEdge automatically selects the best SIMD instruction set available on your hardware:

01. AVX-512

Newest Intel/AMD

02. AVX2

Most x86_64 systems

03. NEON

ARM devices

Platform Support Matrix

Platform SIMD GPU Status
x86_64 (Intel/AMD) AVX-512, AVX2, SSE4.2 CUDA, Vulkan Production Ready
ARM64 (Apple M1/M2, RPi 5) NEON Metal (planned) Production Ready
NVIDIA GPUs N/A CUDA FP16/FP32 Production Ready
AMD GPUs N/A Vulkan Beta
RISC-V RVV (planned) N/A Roadmap