⚡ 1.6MS / 100K • 🚀 >10× PERFORMANCE • 🔋 >5× EFFICIENCY • 📊 8.3B GPU OPS/SEC

Performance Benchmarks

Enterprise-grade metrics validated at production scale. Data sources are cited with each section below.

"Operation" defined as: per evaluation of a fixed AST for expressions; per successful root with tolerance ε for root-finding

Core Performance Metrics

| Metric | Value | Notes |
|---|---|---|
| Performance Improvement | >10× faster | SIMD vectorization + race-to-idle optimization |
| Energy Efficiency | >5× improvement | Reduced compute time + cooling savings |
| ARM64 Latency | 1.6 ms / 100K elements | Edge/embedded deployment (RPi 5, Jetson) |
| CPU Speedup | 13.7× vs baseline | SIMD vectorization on hot paths (TF Lite/ONNX baseline) |
| CPU Energy Savings | 10–30% per operation | Validated benchmarks, race-to-idle strategy |
| API Latency | Sub-1 ms | End-to-end API response time |
| Throughput (Peak) | ~193k ops/sec | Sustained on edge hardware |
| Throughput (Per Node) | ~15k ops/sec | Typical per-node throughput |
| Energy Efficiency (CPU) | ~0.3 J/flop | Industry-leading efficiency class for CPU |
| Scale Verification | 10,000+ nodes | Hyperscale patterns verified at production scale |

Source: CPU benchmarks from the repository's BENCHMARK_DATA.md; raw data available in the repository.

GPU Benchmarks (NVIDIA L4)

🎯 GPU MILESTONE ACHIEVED • FP16 + Kernel Fusion • November 2025

8.3B ops/sec on NVIDIA L4 (sm_89)

FP16 PTX kernels with batch optimization • 332M ops/J energy efficiency

| Metric | Value | Notes |
|---|---|---|
| Throughput | 8.3B ops/sec | FP16 precision, batched evaluator (PR #51) |
| Energy Efficiency | 332M ops/J | 75× improvement over the previous 4.4M ops/J |
| vs SIMD Baseline | 277× faster | 8.3B ops/sec vs the 30M ops/sec SIMD target |
| vs CPU Baseline | 5,808× faster | 8.3B GPU ops/sec vs 1.429M CPU ops/sec |
| Backend | CUDA FP16 (cudarc 0.17.7) | Vulkan backend available (80% of CUDA perf, portable) |

📊 Performance Comparison

  • SIMD baseline: 30M ops/sec (target)
  • GPU throughput: 8.3B ops/sec (NVIDIA L4, FP16)
  • Speedup: 277× vs the SIMD baseline

Sanity check on power: 8.3B ops/sec ÷ 332M ops/J ≈ 25 J/s, i.e. roughly 25 W average board power during the run.

Source: GPU benchmarks measured 2025-11-07; see the repository for raw data where available.

🚀 AVX-512 Performance (November 10, 2025)

⚡ INTEL/AMD HIGH-PERFORMANCE SIMD • 25% Faster than AVX2

x86_64 Vector Extensions

Cross-platform SIMD with automatic detection and fallback

AVX-512 (Latest Intel/AMD)

  • Throughput: 2.83-3.40 Gelem/s
  • Improvement: 25% faster than AVX2
  • Platforms: Intel Xeon (Ice Lake+), AMD EPYC (Zen 4+)
  • Vector Width: 512-bit (8× f64)

AVX2 (General Purpose)

  • Throughput: 193k-30M ops/sec
  • Speedup: 13.7× vs scalar baseline
  • Platforms: Most Intel/AMD CPUs (2013+)
  • Vector Width: 256-bit (4× f64)

🔧 Auto-Detection & Fallback

Luxi automatically selects the best SIMD instruction set available on your hardware:

  1. AVX-512 (newest Intel/AMD CPUs)
  2. AVX2 (most x86_64 systems)
  3. Scalar (universal fallback)
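
A minimal sketch of this priority order in Rust, using the standard library's runtime feature detection. The function names are illustrative, not Luxi's actual internals:

```rust
// Illustrative dispatch sketch: AVX-512 first, then AVX2, then scalar.
// Requires a recent stable Rust (AVX-512 target features stabilized in 1.89).

fn sum_squares_scalar(xs: &[f64]) -> f64 {
    xs.iter().map(|x| x * x).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_squares_avx2(xs: &[f64]) -> f64 {
    // Same loop; with AVX2 enabled the compiler can vectorize to 256-bit lanes (4× f64).
    xs.iter().map(|x| x * x).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn sum_squares_avx512(xs: &[f64]) -> f64 {
    // 512-bit lanes (8× f64) on Ice Lake+ and Zen 4+ parts.
    xs.iter().map(|x| x * x).sum()
}

/// Selects the widest instruction set the CPU reports, falling back to scalar.
pub fn sum_squares(xs: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            return unsafe { sum_squares_avx512(xs) };
        }
        if is_x86_feature_detected!("avx2") {
            return unsafe { sum_squares_avx2(xs) };
        }
    }
    sum_squares_scalar(xs)
}
```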

Source: AVX-512 benchmarks from November 10, 2025; full data available in the repository.

🛰️ Lambert's Problem - Orbital Mechanics (November 10, 2025)

🚀 SPACE APPLICATIONS • Sub-Millisecond Multi-Revolution Trajectory Solving

Root-Finding for Satellite Trajectory Optimization

Demonstrates Luxi's bisection capabilities for scientific computing applications

| Benchmark | Performance | Use Case |
|---|---|---|
| Direct TOF Calculation | 56.6 ns | 17.7M evaluations/second |
| Single-Revolution Solve | 420.9 µs | 2,375 solves/second (tol=1e-6) |
| High-Precision Solve | 496.3 µs | 2,015 solves/second (tol=1e-9) |
| 8-Revolution Swarm | 16.3 µs | 61,350 solve-sets/second |

  • 8-rev swarm: 16.3 µs per solve-set (sub-millisecond solving)
  • Batch rate: 61.3k solve-sets/second (batch optimization)
  • Error rate: 0.003%, within 0.2 km accuracy

🌌 Space Mission Applications

  • Swarm trajectory optimization: Solve for multiple transfer options simultaneously
  • Mission planning: Evaluate multi-revolution transfers for fuel efficiency
  • Real-time navigation: Sub-ms performance enables closed-loop guidance
  • Batch analysis: 8-revolution swarm solving in 16.3 µs for rapid what-if scenarios

Test Vector: r₁=6980 km, r₂=10520 km, µ=398600 km³/s² (Earth). Solves for the semi-major axis where TOF=1800 s. See the repository for implementation details.
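
For a concrete picture of the root-finding step, here is a plain bisection in Rust, matching the "successful root with tolerance ε" operation defined at the top. The tof_seconds closure is a stand-in (the Kepler period formula), not the Lagrange time-of-flight equation the benchmark actually solves:

```rust
/// Bisection: find x in [lo, hi] with f(x) = 0, to tolerance `tol` (the ε above).
fn bisect(f: impl Fn(f64) -> f64, mut lo: f64, mut hi: f64, tol: f64) -> Option<f64> {
    let mut f_lo = f(lo);
    if f_lo * f(hi) > 0.0 {
        return None; // root not bracketed
    }
    while hi - lo > tol {
        let mid = 0.5 * (lo + hi);
        let f_mid = f(mid);
        if f_lo * f_mid <= 0.0 {
            hi = mid; // root lies in the lower half
        } else {
            lo = mid; // root lies in the upper half
            f_lo = f_mid;
        }
    }
    Some(0.5 * (lo + hi))
}

fn main() {
    // Stand-in TOF model (orbital period, 2π√(a³/µ) with µ = 398600 km³/s²).
    let tof_seconds = |a_km: f64| {
        2.0 * std::f64::consts::PI * (a_km.powi(3) / 398_600.0).sqrt()
    };
    // Solve tof(a) = 1800 s for the semi-major axis a, as in the test vector.
    let a = bisect(|a| tof_seconds(a) - 1800.0, 1_000.0, 50_000.0, 1e-6);
    println!("a ≈ {a:?} km");
}
```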

CPU Baseline (Reference)

Baseline CPU performance for comparison with GPU acceleration

| Metric | Value | Notes |
|---|---|---|
| Throughput (100 elements) | 1,492 elements/sec | Small batch performance |
| Throughput (4M elements) | 1.429M ops/sec | Large batch performance |
| Latency (4M elements) | 2.10 seconds | End-to-end processing time |
| Power Consumption | 25W | L4 GPU idle baseline |
| Energy Efficiency | 27.2K ops/Joule | CPU-only efficiency baseline |
| GPU vs CPU Speedup | 5,808× faster | 8.3B GPU ops/sec ÷ 1.429M CPU ops/sec (FP16 + batching) |

🚀 Key Performance Highlights

  • 5× faster than PyTorch for FMA polynomial fusion workloads
  • 30-50% cost savings vs traditional compute at scale
  • Zero vendor lock-in - Vulkan fallback ready for AMD/Intel GPUs
  • FP16 precision for optimal GPU utilization (shipped in PR #51)
  • 100% validated accuracy across all benchmarks

Source: CPU baseline measured on the same L4 instance; full methodology available in the repository.

🔧 Technology Stack

Core Runtime

  • Runtime: Rust 1.91.0 (memory-safe, zero-cost abstractions)
  • Web Framework: Warp 0.3.7 (async HTTP/JSON API)
  • Expression Engine: Rhai 1.18.0 (dynamic evaluation)

GPU Acceleration

  • CUDA: cudarc 0.17.7 (NVIDIA GPU, Production)
  • Vulkan: wgpu 0.19 (Available - AMD/Intel support, ~80% of CUDA performance)
  • Target Hardware: NVIDIA T4/L4 GPUs (T4: sm_75, L4: sm_89)

Architecture

  • API: REST HTTP/gRPC (JSON payloads)
  • Deployment: Containerized microservice
  • Footprint: Minimal (~50MB container)
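
The README does not document the request schema, so the sketch below is a hypothetical client call showing what a JSON-over-HTTP integration could look like; the /eval route and payload fields are invented for illustration:

```rust
// Hypothetical client; route and payload shape are illustrative only.
// Assumed Cargo deps: reqwest = { version = "0.12", features = ["blocking", "json"] },
// serde_json = "1".
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let request = serde_json::json!({
        "expr": "x * x + 1.0",     // expression to evaluate
        "inputs": [1.0, 2.0, 3.0]  // one evaluation ("operation") per element
    });
    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:8080/eval") // hypothetical endpoint
        .json(&request)
        .send()?
        .json()?;
    println!("{response}");
    Ok(())
}
```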

Status

  • GPU Acceleration: Production Ready
  • CPU SIMD: Production Ready
  • Accuracy: 100% Validated

⚡ ARM Neon Energy Efficiency (November 10, 2025)

🏆 ENERGY EFFICIENCY CHAMPION • Raspberry Pi 5 Leads at 2.67B ops/J

ARM Neon SIMD for Edge/IoT Deployments

Battery-optimized computation for embedded systems and space applications

| Platform | Power (W) | Theoretical Peak | Realistic (50% Util) | Use Case |
|---|---|---|---|---|
| Raspberry Pi 5 | 1.8W compute | 2.67B ops/J | 1.33B ops/J | IoT sensors, battery systems, edge AI |
| AWS Graviton3 | 3.5W compute | 1.49B ops/J | 743M ops/J | Cloud edge, serverless, green computing |
| Jetson Orin Nano | 5.0W compute | 800M ops/J | 400M ops/J | Robotics, autonomous vehicles, drones |
| Apple M2 | 14.5W compute | 483M ops/J | 241M ops/J | Developer workstations, local ML |

  • 2.67B ops/J: Raspberry Pi 5 theoretical peak
  • 1.2-2.7B ops/sec: ARM64 Neon throughput
  • 1.5-2× SIMD speedup vs the ARM scalar baseline

🌱 Why ARM Neon Leads in Energy Efficiency

  • Optimized for mobile/embedded: ARM architecture designed for battery-powered devices
  • Lower power draw: 1.8W-5W compute vs 15-30W for x86_64 platforms
  • Space applications: Ideal for rad-hard satellite systems with limited power budgets
  • IoT scalability: Deploy millions of sensors with minimal energy infrastructure

Methodology: Theoretical peaks calculated from SIMD width (2× f64), clock frequency, and measured power draw. 50% utilization represents realistic performance with cache misses and pipeline dependencies. See the repository for details.
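
The arithmetic behind the table can be checked directly. Only the width × clock ÷ power formula comes from the methodology note; the clock frequencies below are assumptions consistent with the table's figures:

```rust
// Theoretical peak ops/J = (SIMD lanes × clock) / power draw.
fn peak_ops_per_joule(clock_ghz: f64, lanes: f64, power_w: f64) -> f64 {
    clock_ghz * 1e9 * lanes / power_w
}

fn main() {
    // Pi 5: 2.4 GHz × 2 f64 lanes ÷ 1.8 W ≈ 2.67B ops/J (matches the table)
    println!("Pi 5:      {:.2}B ops/J", peak_ops_per_joule(2.4, 2.0, 1.8) / 1e9);
    // Graviton3: 2.6 GHz × 2 lanes ÷ 3.5 W ≈ 1.49B ops/J
    println!("Graviton3: {:.2}B ops/J", peak_ops_per_joule(2.6, 2.0, 3.5) / 1e9);
    // The "Realistic (50% Util)" column is simply half of each peak.
}
```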

✅ Recently Completed (November 2025)

🎉 PR #50 & PR #51 SHIPPED • BENCHMARKED • PRODUCTION READY

Major GPU Performance Breakthrough

8.3B ops/sec, 332M ops/J — 114× throughput and 75× efficiency improvements

✅ FP16 PTX Kernels

Status: Implemented and merged (PR #51)

  • Optional CUDA FP16 support via cudarc
  • Half-precision for 2× throughput improvement
  • Available behind a feature flag
  • Sin/cos and expression evaluation kernels

✅ Kernel Fusion & Batching

Status: Production (PR #51)

  • Batched Rhai evaluator implemented
  • 20% speedup for large batch sizes
  • Automatic batching via batch_eval_optimized()
  • Reduced memory bandwidth, lower power draw
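
As a sketch of what fusion buys (the concept, not Luxi's kernels): a fused pass computes a compound expression in one traversal, where an unfused pipeline materializes an intermediate buffer and touches memory twice. That reduced traffic is where the lower memory bandwidth and power draw noted above come from:

```rust
// Unfused: two passes over memory plus an intermediate Vec.
fn unfused(xs: &[f64]) -> Vec<f64> {
    let sines: Vec<f64> = xs.iter().map(|&x| x.sin()).collect();
    sines.iter().zip(xs).map(|(s, &x)| s * x.cos()).collect()
}

// Fused: one pass, no intermediate allocation.
fn fused(xs: &[f64]) -> Vec<f64> {
    xs.iter().map(|&x| x.sin() * x.cos()).collect()
}
```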

✅ Vulkan Backend

Status: Available (PR #51)

  • Initial Vulkan (wgpu-rs 0.19) backend merged
  • Cross-vendor GPU support (AMD, Intel)
  • ~80% of CUDA performance, fully portable
  • Available via feature flag

✅ Updated Benchmarks

Status: Documented (PR #50)

  • NVIDIA L4 GPU results: 8.3B ops/sec
  • Energy efficiency: 332M ops/J
  • M1 Pro and ARM platform updates
  • Cross-referenced in docs/benchmarks/

Contact: e@ewaller.com

🎯 Next Steps & Future Work

Continuing Energy Efficiency Focus

Next Milestone: Approaching 600M ops/J energy efficiency (1.8× improvement from current 332M ops/J)

  • Dynamic power tuning: Frequency scaling and race-to-idle optimizations
  • INT8 precision: Lower precision for specific workloads, further efficiency gains
  • RISC-V vector support: RVV extension for emerging platforms
  • Cross-node orchestration: Distributed computation for hyperscale deployments
  • Automated validation: Continuous benchmarking and regression testing

This roadmap maintains focus on energy-efficient computation while expanding platform support and ensuring production-grade reliability at hyperscale.

🖥️ Platform Support Matrix

Luxi™ supports a wide range of hardware platforms with varying levels of maturity. All platforms benefit from memory-safe Rust and deterministic execution guarantees.

| Platform | Architecture | Status | Performance Notes |
|---|---|---|---|
| x86_64 (Intel/AMD) | x86_64 SIMD (AVX2/AVX-512) | Production | ~193k ops/sec, 13.7× speedup, validated at scale |
| ARM64 (Cortex-A/M) | ARM NEON SIMD | Validated | Edge/embedded deployment (RPi 5, Jetson), 1.6ms/100K elements |
| NVIDIA GPU (CUDA) | sm_75+ (Turing+), FP16 PTX | Production | 8.3B ops/sec (L4 FP16), 332M ops/J, cudarc 0.17.7 |
| Vulkan Compute | Cross-vendor GPU (AMD, Intel, NVIDIA) | Production | 80% of CUDA performance, portable, wgpu 0.19 backend (PR #51) |
| RISC-V | RVV (vector extension) | Planned | Future support for RISC-V vector architectures |

Status legend:

  • Production: battle-tested at scale with validated benchmarks
  • Validated: functional with benchmarks, undergoing scale testing
  • Planned: in development or on roadmap for future release

Note: All platforms share the same REST API and deterministic JSON interface. Contact e@ewaller.com for specific platform requirements or custom architectures.

🎯 Platform Selection Guide

Choose the right hardware platform based on your deployment requirements, power budget, and performance targets.

| Factor | Raspberry Pi 5 / Jetson | AWS Graviton / Apple | Intel Xeon / AMD EPYC | NVIDIA L4/H100 GPU |
|---|---|---|---|---|
| Architecture | ARM Neon SIMD | ARM Neon SIMD | AVX2/AVX-512 | CUDA GPU |
| Power Budget | <5W ⚡ | 3.5-15W | 15-30W | 16-50W |
| Throughput | 1.2-2.7B ops/sec | 1.2-2.7B ops/sec | 193k-30M ops/sec | 8.3B ops/sec |
| Energy Efficiency | 2.67B ops/J 🏆 | 1.49B ops/J | 324k ops/J | 332M ops/J |
| Batch Size | <10k elements | <10k elements | <10k elements | >100k elements |
| Latency | <10ms | <10ms | <10ms | 50ms+ acceptable |
| Best For | IoT, battery, space | Cloud edge, serverless | General purpose, data centers | Maximum throughput |
| Deployment | Edge, embedded, satellites | AWS, Azure, local | On-prem, cloud | RunPod, GCP, AWS |

🔋 Choose ARM for Energy Efficiency

  • ✅ Battery-powered IoT sensors and edge devices
  • ✅ Satellite/space applications with limited power
  • ✅ Massive deployment scale (millions of nodes)
  • ✅ Real-time control loops (1 kHz capable)
  • ✅ Green computing initiatives

🚀 Choose GPU for Maximum Throughput

  • ✅ Large-scale batch analytics (>100k elements)
  • ✅ Scientific simulations and Monte Carlo
  • ✅ AI/ML inference preprocessing pipelines
  • ✅ High-throughput data transformation
  • ✅ Cloud-native serverless functions

💡 Quick Decision Tree

  1. Power budget <5W? → Raspberry Pi 5 or Jetson
  2. Need >70M ops/sec? → NVIDIA L4/H100 GPU
  3. Cloud edge deployment? → AWS Graviton3 or Apple M2
  4. General purpose data center? → Intel Xeon or AMD EPYC

❄️ Energy Efficiency: Race-to-Idle & Cooling Savings

What is "Race-to-Idle"?

Race-to-idle is a power management strategy in which a CPU completes its work as quickly as possible at peak performance, then immediately drops into a deep low-power idle state (1-2W). The idea: a faster processor finishes sooner and spends more of its time in energy-efficient idle modes, often consuming less total energy than a slower chip running longer at reduced speed.

Modern CPUs idle at 1-4W (chip-level) but consume 40-80W under load. By minimizing active time through 13.7× faster execution, Luxi™ enables significant energy savings through extended idle periods.

Direct Compute Savings

  • 10-30% CPU energy reduction per operation (validated benchmarks)
  • 13.7× faster completion means more idle time at 1-2W vs 40-80W active
  • Sub-1ms latency enables rapid C-state transitions (deep sleep modes)
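
A back-of-envelope model of the effect, using the power figures quoted above (an illustrative calculation, not a measurement):

```rust
// Energy over a fixed 1-second window: run hot, then idle.
// 50 W active and 2 W idle are the figures from the text above.
fn window_energy_j(active_s: f64, p_active_w: f64, p_idle_w: f64, window_s: f64) -> f64 {
    active_s * p_active_w + (window_s - active_s) * p_idle_w
}

fn main() {
    let slow = window_energy_j(1.0, 50.0, 2.0, 1.0);        // busy the whole window: 50 J
    let fast = window_energy_j(1.0 / 13.7, 50.0, 2.0, 1.0); // 13.7× faster, then idle: ≈5.5 J
    println!("slow: {slow:.1} J, fast: {fast:.1} J");
}
```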

Cooling Cost Multiplier

  • 38-40% of data center energy goes to cooling (industry average)
  • PUE multiplier effect: Every watt saved in compute = (PUE - 1) watts saved in cooling
  • Typical PUE 1.55: Save 1W in compute → save 0.55W in cooling = 1.55W total

📐 Total Savings Calculation

Example: 100-server deployment processing 1M operations/day

Baseline (slower tool):

  • Runtime: 140 seconds/day per server @ 50W = 1.94 Wh/day
  • Annual energy: 0.71 kWh/server × 100 servers = 71 kWh
  • Annual cost (@ $0.12/kWh): $8.52

With Luxi™ (13.7× faster):

  • Runtime: 10.2 seconds/day per server @ 50W × 0.7 (30% savings) = 0.099 Wh/day
  • Annual energy: 0.036 kWh/server × 100 servers = 3.6 kWh
  • Annual cost (@ $0.12/kWh): $0.43

💰 Direct compute savings: ~$8.09/year

Cooling savings (PUE 1.55):

  • Additional cooling reduction: $8.09 × 0.55 = $4.45/year

🌟 Total annual savings: ~$12.50 (compute + cooling) for this light workload. Savings scale linearly with volume: a deployment running 100× this daily workload saves roughly $1,250/year.

Note: These are conservative estimates. Actual savings depend on workload patterns, server efficiency, PUE, and electricity costs. Higher PUE environments (1.8-2.5) see proportionally greater cooling savings.
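
A minimal script reproducing the worked example above, so the chain of numbers can be audited:

```rust
// Recomputes the 100-server example from its stated inputs.
fn main() {
    let days = 365.0;
    let servers = 100.0;
    let price = 0.12; // $/kWh
    let pue = 1.55;

    let base_wh_day = 140.0 * 50.0 / 3600.0;      // ≈ 1.94 Wh/day/server
    let luxi_wh_day = 10.2 * 50.0 * 0.7 / 3600.0; // ≈ 0.099 Wh/day/server

    let base_kwh = base_wh_day * days * servers / 1000.0; // ≈ 71 kWh/year
    let luxi_kwh = luxi_wh_day * days * servers / 1000.0; // ≈ 3.6 kWh/year

    let compute_savings = (base_kwh - luxi_kwh) * price; // ≈ $8.09
    let total_savings = compute_savings * pue;           // ≈ $12.54 with cooling
    println!("compute: ${compute_savings:.2}/yr, total: ${total_savings:.2}/yr");
}
```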

🌍 Environmental Impact

  • ~67 kWh: compute energy saved annually (100-server example; ~104 kWh including cooling at PUE 1.55)
  • ~73 kg: CO₂ emissions avoided (@ 0.7 kg/kWh grid mix, including cooling)
  • ~35%: share of the total savings that comes from cooling reduction (PUE effect)

Figures scale linearly with workload volume.

Deployment Profile

Binary Characteristics

  • Binary Size: < 5 MB
  • Memory Footprint: Minimal
  • Startup Time: < 100ms

Architecture Support

  • ARM (Raspberry Pi, Jetson Nano)
  • x86_64 (Intel, AMD)
  • RISC-V (emerging)
  • Hardware-agnostic design

Validation Methodology

Workload Selection

Benchmarks focus on SIMD-friendly numeric operations: expression evaluation (AST-based with vectorizable operations) and root-finding (bisection methods with convergence to ε tolerance).
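
For concreteness, a toy version of the expression-evaluation workload: a fixed AST applied once per element, where each apply counts as one "operation." This illustrates the workload shape only, not Luxi's engine (which uses Rhai):

```rust
// Toy AST in the spirit of the benchmark workload; illustrative types only.
enum Expr {
    Const(f64),
    Var,                        // the per-element input x
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// One "operation" in the benchmark's sense: one evaluation of a fixed AST.
fn eval(e: &Expr, x: f64) -> f64 {
    match e {
        Expr::Const(c) => *c,
        Expr::Var => x,
        Expr::Add(a, b) => eval(a, x) + eval(b, x),
        Expr::Mul(a, b) => eval(a, x) * eval(b, x),
    }
}

fn main() {
    // x * x + 1, evaluated over a batch (the vectorizable hot loop).
    let expr = Expr::Add(
        Box::new(Expr::Mul(Box::new(Expr::Var), Box::new(Expr::Var))),
        Box::new(Expr::Const(1.0)),
    );
    let out: Vec<f64> = (0..4).map(|i| eval(&expr, i as f64)).collect();
    println!("{out:?}"); // [1.0, 2.0, 5.0, 10.0]
}
```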

Measurement Domains

  • Latency: End-to-end API response time (HTTP/gRPC)
  • Throughput: Operations per second sustained load
  • Energy: CPU power consumption per operation
  • Scale: Multi-node deployment patterns

Controls

Baseline comparisons use TensorFlow Lite and ONNX Runtime for equivalent operations. Energy measurements validated with hardware power meters on edge devices.

Technical Details

What It Is

  • SIMD-accelerated numeric microservice
  • Expression evaluation engine
  • Root-finding algorithms
  • Memory-safe Rust implementation
  • Production-ready HTTP/gRPC APIs

Scope Boundaries

  • Not a general-purpose compute framework
  • Not a general-purpose GPU compute framework (focused numeric kernels on CPU and GPU)
  • Not a machine learning framework
  • Optimized for specific numeric workloads

Licensing & Evaluation

Commercial use requires a LicenseRef-Luxi-Business-1.0 license. See the repository for licensing details.

Contact for Evaluation

Interested in benchmarking Luxi™ for your use case?

Contact: e@ewaller.com