Enterprise-grade metrics validated at production scale
Operation: one evaluation of a fixed AST (one successful root found within tolerance)
| Metric | Value | Notes |
|---|---|---|
| Cross-Platform Determinism | Bit-Exact | Verified on ARM and x86; RISC-V and WASM planned |
| Aggregate Throughput (H100) | 286.94B† ops/sec | Peak validated performance (TestFort). Validation scope: non-linear function suite, cross-platform determinism, and GPU endurance. Validation of remaining function categories is in progress. |
| Performance Improvement | >10x faster | SIMD vectorization + race-to-idle |
| Energy Efficiency | >5x improvement | Reduced compute + cooling savings |
| ARM64 Latency | 1.6 ms per 100K ops | Edge deployment (RPi 5, Jetson) |
| CPU Speedup | 13.7x vs baseline | SIMD on hot paths |
| CPU Energy Savings | 10-30% per op | Race-to-idle strategy |
| API Latency | Sub-1ms | End-to-end response time |
| Throughput (Peak) | ~193K ops/sec | Peak sustained rate on edge hardware |
| Throughput (Per Node) | ~15K ops/sec | Typical per-node |
Source: CPU benchmarks from benchmark_data.md
† Metrics validated against non-linear function suite. Full engine validation in progress.
FMA kernel (x = 1.5 * x + 2.0): 426M ops/J @ ~72W
| Metric | Value | Notes |
|---|---|---|
| Test Duration | 600.0013s | 10 min continuous FMA loop |
| Throughput | 30.7B ops/sec | FP16 precision, 50M-element vector |
| Energy Efficiency | 426M ops/J | Measured at ~72W via NVML |
| Backend | CUDA FP16 | Vulkan backend also available (~80% of CUDA throughput) |
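The benchmarked operation is a single fused multiply-add applied elementwise over a large vector. A minimal scalar C sketch of the arithmetic (the production kernel runs in FP16 on CUDA; this version only illustrates what each "op" computes):

```c
#include <stddef.h>

/* Scalar sketch of the benchmarked FMA kernel: x = 1.5 * x + 2.0,
 * applied elementwise. Each element update counts as one op; the
 * production kernel runs this in FP16 on a 50M-element vector. */
void fma_kernel(float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = 1.5f * x[i] + 2.0f;
}
```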
The power-optimized "Gold Master" configuration achieves 53% power savings (33.64W vs 72W TDP) and 0.88 Billion ops/Joule energy efficiency: more than double the throughput configuration's efficiency. Choose this mode when energy cost and thermal constraints outweigh raw throughput needs.
| Metric | Target | Gold Master Actual | Improvement |
|---|---|---|---|
| Throughput | 8.3B Ops/Sec | 29.67 Billion Ops/Sec | 3.5x |
| Power Usage | < 72 W (TDP) | 33.64 Watts | 53% Savings |
| Energy Efficiency | 0.4 B Ops/J | 0.88 Billion Ops/Joule | 2.1x |
| Thermals | < 70C | < 37C (Sustained) | Cool-Running |
| Stability | 99.9% | 100% (0 Throttles/Errors) | Perfect |
Note: This Gold Master run used power-optimized settings achieving lower power draw at the cost of slightly lower throughput vs the 30.7B FMA benchmark. Both configurations are validated and available for deployment based on workload requirements.
Source: GPU benchmarks measured 2025-11-07. See TestFort Report for details.
Independent testing via Phoronix Test Suite on NVIDIA H100 80GB HBM3
FP64 85.8 TFLOPS: the raw compute that powers Black-Scholes, Monte Carlo, and deterministic financial models. Verified. Reproducible. Bit-exact.
- Dec 29, 2025: Exceeds NVIDIA spec of 67 TFLOPS. View on OpenBenchmarking.org →
- Dec 30, 2025: Cross-validated repeatability. View on OpenBenchmarking.org →
- Dec 30, 2025: Cryptographic verification support. View on OpenBenchmarking.org →

All benchmarks independently verified via OpenBenchmarking.org | Dec 29-30, 2025
Lu(x)iEdge guarantees bit-identical results across diverse hardware architectures. Whether running on a data center H100 or an edge RISC-V controller, the output for a given expression and input vector is guaranteed to be identical down to the last bit.
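One way to check this guarantee in practice is to hash the raw bit pattern of each platform's output buffer and compare digests; comparing float values with `==` would miss sign-of-zero and NaN-payload differences. A sketch using 64-bit FNV-1a (an assumed choice for illustration, not necessarily the digest Lu(x)iEdge uses):

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over raw bytes. Hash the output buffer on each
 * platform; equal digests mean bit-identical results. */
uint64_t fnv1a64(const void *buf, size_t len) {
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}
```

Note that `0.0` and `-0.0` compare equal as doubles but hash differently, which is exactly the distinction a bit-exactness check must preserve.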
Cross-platform SIMD with automatic detection and fallback
Lu(x)iEdge automatically selects the best SIMD instruction set available on your hardware:
1. AVX-512 (newest Intel/AMD CPUs)
2. AVX2 (most x86_64 systems)
3. NEON (ARM devices)
| Platform | SIMD | GPU | Status |
|---|---|---|---|
| x86_64 (Intel/AMD) | AVX-512, AVX2, SSE4.2 | CUDA, Vulkan | Production Ready |
| ARM64 (Apple M1/M2, RPi 5) | NEON | Metal (planned) | Production Ready |
| NVIDIA GPUs | N/A | CUDA FP16/FP32 | Production Ready |
| AMD GPUs | N/A | Vulkan | Beta |
| RISC-V | RVV (planned) | N/A | Roadmap |