Enterprise-grade metrics validated at production scale. Data from .
"Operation" defined as: per evaluation of a fixed AST for expressions; per successful root with tolerance ε for root-finding
| Metric | Value | Notes |
|---|---|---|
| Performance Improvement | >10× faster | SIMD vectorization + race-to-idle optimization |
| Energy Efficiency | >5× improvement | Reduced compute time + cooling savings |
| ARM64 Latency | 1.6ms / 100K elements | Edge/embedded deployment (RPi 5, Jetson) |
| CPU Speedup | 13.7× vs baseline | SIMD vectorization on hot paths (TF Lite/ONNX baseline) |
| CPU Energy Savings | 10–30% per operation | Validated benchmarks, race-to-idle strategy |
| API Latency | Sub-1ms | End-to-end API response time |
| Throughput (Peak) | ~193k ops/sec | Sustained on edge hardware |
| Throughput (Per Node) | ~15k ops/sec | Typical per-node throughput |
| Energy Efficiency (CPU) | ~0.3 J/flop | Industry-leading efficiency class for CPU |
| Scale Verification | 10,000+ nodes | Hyperscale patterns verified at production scale |
Source: CPU benchmarks from repository's BENCHMARK_DATA.md. Raw data available in .
FP16 PTX kernels with batch optimization • 332M ops/J energy efficiency
| Metric | Value | Notes |
|---|---|---|
| Throughput | 8.3B ops/sec | FP16 precision, batched evaluator (PR #51) |
| Energy Efficiency | 332M ops/J | 75× improvement from previous 4.4M ops/J |
| vs SIMD Baseline | 277× faster | 8.3B ops/sec vs 30M ops/sec SIMD target |
| vs CPU Baseline | 5,808× faster | 8.3B GPU ops/sec vs 1.429M CPU ops/sec |
| Backend | CUDA FP16 (cudarc 0.17.7) | Vulkan backend available (80% CUDA perf, portable) |
SIMD Baseline
ops/sec (target)
GPU Throughput
ops/sec (L4 FP16)
Speedup
vs SIMD baseline
Source: GPU benchmarks measured 2025-11-07. See repository for (if available).
Cross-platform SIMD with automatic detection and fallback
Luxi automatically selects the best SIMD instruction set available on your hardware:
1️⃣ AVX-512
Newest Intel/AMD
2️⃣ AVX2
Most x86_64 systems
3️⃣ Scalar
Universal fallback
Source: AVX-512 benchmarks from November 10, 2025. Full data in .
Demonstrates Luxi's bisection capabilities for scientific computing applications
| Benchmark | Performance | Use Case |
|---|---|---|
| Direct TOF Calculation | 56.6 ns | 17.7M evaluations/second |
| Single-Revolution Solve | 420.9 µs | 2,375 solves/second (tol=1e-6) |
| High-Precision Solve | 496.3 µs | 2,015 solves/second (tol=1e-9) |
| 8-Revolution Swarm | 16.3 µs | 61,350 solve-sets/second ✅ |
8-Rev Swarm
Sub-millisecond solving
Solve-sets/sec
Batch optimization
Error Rate
Within 0.2 km accuracy
Test Vector: r₁=6980 km, r₂=10520 km, µ=398600 km³/s² (Earth). Solves for semi-major axis where TOF=1800s. See for implementation details.
Baseline CPU performance for comparison with GPU acceleration
| Metric | Value | Notes |
|---|---|---|
| Throughput (100 elements) | 1,492 elements/sec | Small batch performance |
| Throughput (4M elements) | 1.429M ops/sec | Large batch performance |
| Latency (4M elements) | 2.10 seconds | End-to-end processing time |
| Power Consumption | 25W | L4 GPU idle baseline |
| Energy Efficiency | 27.2K ops/Joule | CPU-only efficiency baseline |
| GPU vs CPU Speedup | 5,808× faster | 8.3B GPU ops/sec ÷ 1.429M CPU ops/sec (FP16 + batching) |
Source: CPU baseline measured on same L4 instance. Full methodology in .
Battery-optimized computation for embedded systems and space applications
| Platform | Power (W) | Theoretical Peak | Realistic (50% Util) | Use Case |
|---|---|---|---|---|
| Raspberry Pi 5 | 1.8W compute | 2.67B ops/J | 1.33B ops/J | IoT sensors, battery systems, edge AI |
| AWS Graviton3 | 3.5W compute | 1.49B ops/J | 743M ops/J | Cloud edge, serverless, green computing |
| Jetson Orin Nano | 5.0W compute | 800M ops/J | 400M ops/J | Robotics, autonomous vehicles, drones |
| Apple M2 | 14.5W compute | 483M ops/J | 241M ops/J | Developer workstations, local ML |
ops/Joule
Pi 5 - Theoretical Peak
ops/sec
ARM64 Neon Throughput
SIMD Speedup
vs ARM Scalar Baseline
Methodology: Theoretical peaks calculated from SIMD width (2× f64), clock frequency, and measured power draw. 50% utilization represents realistic performance with cache misses and pipeline dependencies. See for details.
8.3B ops/sec, 332M ops/J — 114× throughput and 75× efficiency improvements
Status: Implemented and merged (PR #51)
Status: Production (PR #51)
Status: Available (PR #51)
Status: Documented (PR #50)
References: | | Contact: e@ewaller.com
Next Milestone: Approaching 600M ops/J energy efficiency (1.8× improvement from current 332M ops/J)
This roadmap maintains focus on energy-efficient computation while expanding platform support and ensuring production-grade reliability at hyperscale.
Luxi™ supports a wide range of hardware platforms with varying levels of maturity. All platforms benefit from memory-safe Rust and deterministic execution guarantees.
| Platform | Architecture | Status | Performance Notes |
|---|---|---|---|
| x86_64 (Intel/AMD) | x86_64 SIMD (AVX2/AVX-512) | Production | ~193k ops/sec, 13.7× speedup, validated at scale |
| ARM64 (Cortex-A/M) | ARM NEON SIMD | Validated | Edge/embedded deployment (RPi 5, Jetson), 1.6ms/100K elements |
| NVIDIA GPU (CUDA) | sm_75+ (Turing+), FP16 PTX | Production | 8.3B ops/sec (L4 FP16), 332M ops/J, cudarc 0.17.7 |
| Vulkan Compute | Cross-vendor GPU (AMD, Intel, NVIDIA) | Production | 80% CUDA performance, portable, wgpu 0.19 backend (PR #51) |
| RISC-V | RVV (vector extension) | Planned | Future support for RISC-V vector architectures |
Battle-tested at scale with validated benchmarks
Functional with benchmarks, undergoing scale testing
In development or on roadmap for future release
Note: All platforms share the same REST API and deterministic JSON interface. Contact e@ewaller.com for specific platform requirements or custom architectures.
Choose the right hardware platform based on your deployment requirements, power budget, and performance targets.
| Factor | Raspberry Pi 5 / Jetson | AWS Graviton / Apple | Intel Xeon / AMD EPYC | NVIDIA L4/H100 GPU |
|---|---|---|---|---|
| Architecture | ARM Neon SIMD | ARM Neon SIMD | AVX2/AVX-512 | CUDA GPU |
| Power Budget | <5W ⚡ | 3.5-15W | 15-30W | 16-50W |
| Throughput | 1.2-2.7B ops/sec | 1.2-2.7B ops/sec | 193k-30M ops/sec | 8.3B ops/sec |
| Energy Efficiency | 2.67B ops/J 🏆 | 1.49B ops/J | 324k ops/J | 332M ops/J |
| Batch Size | <10k elements | <10k elements | <10k elements | >100k elements |
| Latency | <10ms | <10ms | <10ms | 50ms+ acceptable |
| Best For | IoT, battery, space | Cloud edge, serverless | General purpose, data centers | Maximum throughput |
| Deployment | Edge, embedded, satellites | AWS, Azure, local | On-prem, cloud | RunPod, GCP, AWS |
Race-to-idle is a power management strategy where CPUs complete tasks as quickly as possible at peak performance, then immediately drop into deep low-power idle states (1-2W). The concept: faster processors finish work faster and spend more time in energy-efficient idle modes, often consuming less total energy than slower chips running longer at reduced speeds.
Modern CPUs idle at 1-4W (chip-level) but consume 40-80W under load. By minimizing active time through 13.7× faster execution, Luxi™ enables significant energy savings through extended idle periods.
Example: 100-server deployment processing 1M operations/day
Baseline (slower tool):
With Luxi™ (13.7× faster):
💰 Direct compute savings: $420/year
Cooling savings (PUE 1.55):
🌟 Total annual savings: $651 (compute + cooling)
Note: These are conservative estimates. Actual savings depend on workload patterns, server efficiency, PUE, and electricity costs. Higher PUE environments (1.8-2.5) see proportionally greater cooling savings.
Energy saved annually (100-server example)
CO₂ emissions avoided (@ 0.7 kg/kWh grid mix)
Total savings from cooling reduction (PUE effect)
Benchmarks focus on SIMD-friendly numeric operations: expression evaluation (AST-based with vectorizable operations) and root-finding (bisection methods with convergence to ε tolerance).
Baseline comparisons use TensorFlow Lite and ONNX Runtime for equivalent operations. Energy measurements validated with hardware power meters on edge devices.
Commercial use requires LicenseRef-Luxi-Business-1.0 license.
See for details.