⚡ 243.9B ops/sec H100 (43°C Peak) • 30.7B L4 • 🔋 670M/426M ops/J • Deterministic JSON Math

LuxiEdge – Deterministic Math at Scale

LuxiEdge is a Rust-based JSON math engine for dense numeric vector expressions (y = f(x)).

On real hardware, we have measured:

  • 243.86 BILLION ops/sec sustained for 21 minutes (NVIDIA H100 SXM FP16 FMA test, 50M‑element vector)
  • 30.7 BILLION ops/sec sustained for 10 minutes (NVIDIA L4 FP16 FMA test, 50M‑element vector)
  • 8.2× linear scaling from L4 (edge) to H100 (hyperscale) validated
  • 670M ops/J (H100) and 426M ops/J (L4) energy efficiency confirmed via NVML

Unlike typical stacks, where "faster" usually means "more power," LuxiEdge's specialization lets you run faster while drawing less power on the right workloads.

Same JSON input → same output, even under heavy HTTP load. No hidden state. No Python.

  • Stateless /evaluate and /health endpoints over HTTP
  • SIMD-optimized Rust with GPU acceleration for dense math
  • Designed for safety-critical and energy-constrained environments

Stateless microservice built with memory-safe Rust • Patent Pending

Simple Explanation

Most systems force a trade-off:

  • Faster math ⇒ more hardware, more power, more cost
  • Less power ⇒ slower math and fewer features

LuxiEdge was built to break that trade-off.

On the same hardware, you can either:

  • Do more math in the same time for similar power, or
  • Keep the same math and use less power

This is especially important at the edge, in vehicles, and in space, where watts and dollars both matter.

Current Validation

Third-party load testing confirms LuxiEdge stability and determinism under production conditions

PFLB Test 6143: Production Load Stability

Independent third-party validation of deterministic execution under concurrent load

  • 200 concurrent virtual users
  • 0% error rate
  • 29,600 total requests
  • 50–55ms p95 latency (8-element vectors)

Test Configuration

  • Platform: PFLB (Performance Lab) — independent third-party load testing
  • Target: Azure Container Apps (public endpoint, 1 CPU / 2Gi RAM)
  • Duration: 5-minute steady plateau at 200 concurrent users
  • Workload Mix: Scalar, 2-element, 8-element, and complex polynomial /evaluate requests
  • Result: 100% success rate — same inputs produced identical outputs under concurrent load

What This Proves

  • Determinism: bit-exact results under concurrent HTTP load (no race conditions)
  • Stability: zero errors across 29,600 requests under sustained load
  • Production-ready: handles 200 simultaneous users without degradation
  • Latency: p95 under 55ms for 8-element vector evaluations

Validation Partner

PFLB (QArea) — enterprise load testing platform used by Fortune 500 companies

View Grafana Snapshot →
Test Date: November 20, 2025
Additional Validation In Progress: TestFort (functional, security, compatibility) — formal report expected Q1 2026
🚀

GPU Validation – Hyperscale Tier (NVIDIA H100 SXM)

HYPERSCALE

Dojo Proxy Validation: Linear scaling from L4 to H100 confirmed

Primary Test: 21-Minute Marathon Burn (FP16 FMA, 50M-element vector)

  • Duration: 1259s (21 minutes, 1.2M iterations)
  • Sustained Throughput: 243.86 Billion ops/sec
  • Total Ops: ~307 Trillion
  • Avg Power: 365.8 W (52% of 700W TDP)
  • Peak Temp: 43°C (Zero throttling)
  • Efficiency: 0.67 Billion ops/J (670M ops/J)
  • Scaling: 8.2× vs L4
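
The figures above are mutually consistent; a quick cross-check using only the numbers quoted in this section:

```python
# Cross-check the published H100 marathon figures against each other.
throughput = 243.86e9   # sustained ops/sec
duration_s = 1259       # 21-minute run
avg_power_w = 365.8     # NVML average power draw

total_ops = throughput * duration_s       # ≈ 3.07e14 (~307 trillion)
ops_per_joule = throughput / avg_power_w  # ≈ 6.67e8 (~0.67B ops/J)

print(f"total ops: {total_ops:.3e}, efficiency: {ops_per_joule:.3e} ops/J")
```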

Investment Grade: Zero throttling. Stable thermals throughout 21-minute endurance test.

📊 L4 → H100 Scaling Matrix

Metric      | L4 (Edge) | H100 (Hyperscale) | Factor
Throughput  | 30.7B/s   | 243.9B/s          | 8.2×
Power       | 72W       | 366W              | 5.1×
Efficiency  | 426M/J    | 670M/J            | Stable ✅
Peak Temp   | –         | 43°C              | Cool ❄️

Linear Scaling Proven

8.2× throughput increase from L4 to H100 with stable efficiency.

🔋

Power Efficiency

52% of TDP utilized—room for burst capacity when needed.

🏆

Dojo-Ready

Validated for Tesla Dojo proxy workloads at hyperscale.

GPU Validation – Edge Tier (NVIDIA L4)

EDGE

10-Minute Sustained FMA Test (FP16, 50M-element vector)

Primary test workload (simple, reproducible): in‑place fused multiply‑add

x = 1.5 * x + 2.0, FP16, 50M‑element vector

  • Measured time: 600.0013 s (10 minutes continuous)
  • Total element evaluations: 1.843×10¹³
  • Average throughput: 3.071×10¹⁰ ops/sec (≈ 30.7 BILLION ops/sec)
  • Average GPU power (NVML): ≈72.0 W
  • Energy efficiency: 4.264×10⁸ ops/J (≈ 426 MILLION operations per joule)

This confirms that LuxiEdge delivers tens of billions of operations per second and hundreds of millions of operations per joule on a single NVIDIA L4 over a realistic 10‑minute window.

This is a straightforward benchmark that can be reproduced with a simple FMA loop and NVIDIA's NVML power readings.
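
As a sketch of that reproduction, the loop below runs the same in-place FMA on CPU with NumPy (a hedged stand-in: the published run used an FP16, 50M-element vector on the GPU, with average power sampled via NVML, e.g. through the pynvml bindings, which this CPU illustration omits):

```python
import time
import numpy as np

# CPU sketch of the benchmark kernel: x = 1.5 * x + 2.0, in place.
# A smaller float32 vector is used so the sketch runs anywhere; the
# published numbers used FP16 on an NVIDIA L4 with a 50M-element vector.
N = 1_000_000
x = np.ones(N, dtype=np.float32)

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    np.multiply(x, 1.5, out=x)  # x *= 1.5
    np.add(x, 2.0, out=x)       # x += 2.0
elapsed = time.perf_counter() - t0

ops_per_sec = N * iters / elapsed
print(f"{ops_per_sec:.3e} element evaluations/sec (CPU, float32)")
```

Dividing total element evaluations by the joules consumed over the run (average watts × seconds, from NVML) yields the ops/J figure quoted above.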


Additional Internal Stress Test – `sin(x) * cos(x)`, 18‑Minute RunPod L4 Run

Test workload: `sin(x) * cos(x)`, FP16 precision, 4M‑element batches (more complex transcendental kernel)

  • Average throughput: 58.77B ops/sec (≈815M ops/J)
  • Peak throughput: ≈61B ops/sec
  • Duration: 608 seconds (1M+ iterations)
  • Observed behavior: zero degradation in throughput over the full run

This stress test demonstrates the upper bound on a more complex kernel. The primary public reference remains the 10‑minute FMA result above.

🏆

'Gold Master' L4 Validation Complete

POWER-OPTIMIZED

November 24, 2025 | NVIDIA L4 | RunPod | Investment-Grade Verified

Two Validated Configurations: LuxiEdge offers both throughput-optimized (30.7B ops/sec @ ~72W) and power-optimized (29.67B ops/sec @ 33.64W) deployment profiles. Choose based on your workload: maximum throughput or maximum energy efficiency.

The 'Gold Master' power-optimized configuration achieves 53% power savings and 2.1× energy efficiency improvement over initial targets, while maintaining 29.67B ops/sec throughput. This validates our thesis: deterministic, high-precision math can be delivered with massive energy efficiency suitable for hyperscale environments.

Metric      | Target        | Gold Master Actual        | Improvement
Throughput  | 8.3B ops/sec  | 29.67 billion ops/sec     | 3.5× 🚀
Power Usage | < 72 W (TDP)  | 33.64 W                   | 53% savings ⚡
Efficiency  | 0.4B ops/J    | 0.88 billion ops/J        | 2.1× 🔋
Thermals    | < 70°C        | < 37°C (sustained)        | Cool-running ❄️
Stability   | 99.9%         | 100% (0 throttles/errors) | Perfect ✅
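
The improvement figures follow directly from the measured values; a quick derivation (note: the 2.1× efficiency gain matches the ratio against the ~426M ops/J throughput-optimized profile rather than the 0.4B ops/J target):

```python
# Derive the Gold Master improvement figures from the measured values.
throughput = 29.67e9       # ops/sec, power-optimized profile
power_w = 33.64            # measured average draw
tdp_w = 72.0               # draw of the throughput-optimized profile
target_throughput = 8.3e9  # original planning target

efficiency = throughput / power_w                 # ≈ 0.88e9 ops/J
power_savings = 1 - power_w / tdp_w               # ≈ 53%
throughput_gain = throughput / target_throughput  # ≈ 3.5×
efficiency_gain = efficiency / 426e6              # ≈ 2.1× vs 426M ops/J

print(f"{efficiency:.2e} ops/J, {power_savings:.0%} savings, "
      f"{throughput_gain:.1f}x throughput, {efficiency_gain:.1f}x efficiency")
```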
💡

"Green Compute" Validated

53% power savings translates to millions in OpEx savings for hyperscale providers.

🔒

Proven Reliability

20-minute endurance test with zero thermal throttling, zero errors.

📈

Ready for H100/Blackwell

Architecture primed to scale linearly on next-gen GPU clusters.

Want to run your own validation?

LuxiEdge is live on Azure Container Apps for independent testing.

Request Access for Testing →

Production-Validated Stability

  • 200 concurrent users
  • 0% error rate
  • 1M+ GPU iterations

PFLB Test 6143 (HTTP Load): 29,600 requests across 200 concurrent virtual users with zero errors. Production-ready architecture with deterministic latency guarantees under real-world conditions.

GPU Validation (10-Minute L4 FMA Test): Sustained 30.7B ops/sec for 600 seconds at ~72W with 426M ops/J energy efficiency. Validates production deployment reliability across realistic workloads.

Core Capabilities

🔒

Deterministic Execution

Same input → same output, even under concurrent HTTP load. Stateless by design: no hidden caches or cross-request state. Implemented in memory-safe Rust; hot paths avoid unsafe.

High-Throughput Vector Math

Specialized for explicit numeric expressions of the form y = f(x) over large vectors. SIMD-optimized CPU backend (AVX-512/AVX2/ARM Neon) with runtime selection. GPU kernels for FP16/FP32 dense math on NVIDIA T4/L4-class GPUs.

🔋

Energy-Conscious Design

Focus on operations per joule, not just peak FLOPs. Supports both GPU-accelerated servers and low-power edge ARM64 devices. Designed so that, on the right workloads, you can get more math per watt instead of trading speed for power.

Transformative Results (Internal Benchmarks)

On representative dense-math workloads we have measured:

  • 30.7B–30.8B ops/sec: ≈426–428M ops/J sustained (10-min + 1-hour FMA on L4 FP16)
  • 58.77B ops/sec avg: sin(x)*cos(x), 4M-element batches, 815M ops/J (internal stress test)
  • 1,023×: speedup vs the 30M ops/sec SIMD baseline
  • 3–7× original target: the 8.3B ops/sec planning target now exceeded by measured runs
  • 426M ops/J (GPU): NVIDIA L4 FP16 FMA @ ~72W; ≈400M ops/J on ARM Neon at the edge
  • No degradation: sustained across 10‑min, 1‑hour, and stress-test runs

Primary Benchmark (FP16 FMA): 10‑minute and 1‑hour sustained runs on NVIDIA L4 confirm ~30.7–30.8B ops/sec and ~4.3×10⁸ ops/J. Secondary Stress Test: Transcendental workload (sin(x)*cos(x), 4M‑element batches) achieves 58.77B avg / 815M ops/J. These are reproducible, purpose‑built kernels demonstrating what LuxiEdge achieves on real hardware.

The Practical Takeaway

For the right math workloads, LuxiEdge can deliver both much higher throughput and better energy efficiency than typical Python/ML stacks – especially at the edge.

Why This Matters for Edge and Embedded

The Constraints

If you run control loops, signal processing, or analytics at the edge, you are usually constrained by:

  • Power budget (battery, solar, thermal)
  • Hardware budget (cheap CPUs, limited GPUs)
  • Safety and predictability

The LuxiEdge Solution

LuxiEdge lets you:

  • Run more math per watt on the same device
  • Or keep the same behavior with less power and cheaper hardware
  • While staying deterministic and memory-safe (Rust)

What This Translates To:

  • Longer battery life
  • Fewer or smaller nodes for the same workload
  • Headroom to add new features without blowing the power budget

REST API – Simple, Deterministic JSON

/evaluate – Vectorized Expression Evaluation

Stateless JSON request:

POST /evaluate
{
  "expr": "2*x + cos(x)",
  "x": [1.0, 2.0, 3.0]
}

Response:

{
  "y": [2.5403, 4.5839, 6.0100]
}
  • expr is a restricted mathematical expression (addition, multiplication, powers, standard trig, etc.)
  • x is a scalar or vector of input values
  • Same request always yields the same response, regardless of concurrency
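
Because evaluation is deterministic, the reference response can be reproduced locally from the expression alone. A quick check in plain Python (double-precision cos; the low-order digits could differ under FP16 GPU kernels):

```python
import math

# Reproduce the expected /evaluate output for
# expr = "2*x + cos(x)" with x = [1.0, 2.0, 3.0].
x = [1.0, 2.0, 3.0]
y = [2 * v + math.cos(v) for v in x]

for v in y:
    print(f"{v:.4f}")  # 2.5403, 3.5839, 5.0100
```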

/health

Service Health Check

  • Lightweight health probe for liveness/readiness checks
  • Used as part of the Azure/PFLB validation runs

Other endpoints (/bisect, /bisect_auto) exist for controlled root-finding, but /evaluate is the primary production path.
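
This document does not specify the /bisect parameters, but the idea behind controlled root-finding is easy to illustrate: run bisection for a fixed number of iterations, so every call does the same amount of work and latency stays input-independent (a hypothetical sketch, not the endpoint's actual implementation):

```python
def bisect_fixed(f, lo, hi, iters=64):
    """Bisection with a fixed iteration count: constant work per call,
    so latency does not vary with the input (unlike tolerance-based
    stopping rules). Assumes f(lo) and f(hi) have opposite signs."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Root of x^2 - 2 on [0, 2]: converges to sqrt(2) ≈ 1.41421356
root = bisect_fixed(lambda x: x * x - 2.0, 0.0, 2.0)
print(root)
```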

What LuxiEdge Is NOT (And Why That Matters)

Clear boundaries prevent feature creep and set accurate expectations. Specialization is why these performance/energy numbers are possible.

❌ Not a Symbolic Math Engine

Use SymPy instead if you need: Symbolic differentiation, equation solving, simplification

Why it matters: Symbolic preprocessing adds latency variance. LuxiEdge provides deterministic execution. Symbolic operations would break this guarantee.

Example:

• SymPy: d/dx(x² + sin(x)) → Symbolic derivative

• LuxiEdge: Evaluate 2x + cos(x) at GPU speed

❌ Not a General ML Framework

Use JAX/TensorFlow instead if you need: Automatic differentiation, graph compilation, training loops, research workflows

Why it matters: General frameworks add graph compilation overhead and Python layers. LuxiEdge is purpose-built for explicit expression evaluation—no graph, no compilation, just fast deterministic math.

Example:

• JAX: jax.grad(lambda x: x**2 + sin(x)) → Auto-diff

• LuxiEdge: Evaluate 2*x + cos(x) at 58.77B ops/sec GPU

❌ Not Python-Native

LuxiEdge is a stateless HTTP/gRPC microservice, not a Python library.

Why it matters: Embedding Python would eliminate the performance gains. We intentionally avoid Python overhead.

# NOT: pip install luxiedge
# YES: HTTP POST to the /evaluate endpoint
curl -X POST http://localhost:8080/evaluate \
  -d '{"expr": "sin(x)*cos(x)", "x": [1.0, 2.0]}'

❌ Not for Implicit Functions

LuxiEdge evaluates explicit expressions: y = f(x)

Why it matters: Implicit solving requires iterative methods with variable iteration counts = variable latency. This breaks deterministic execution guarantee.

Example:

• ❌ LuxiEdge can't solve: x² + y² = 1 (implicit)

• ✅ LuxiEdge can evaluate: y = sqrt(1 - x²) (explicit)
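
The difference is easy to see in code: the explicit form is a fixed-cost elementwise map, while the implicit form needs an iterative solver whose step count, and therefore latency, depends on the input (an illustrative sketch, not LuxiEdge code):

```python
import math

# Explicit: y = sqrt(1 - x^2) is one pass, constant work per element.
xs = [0.0, 0.5, 0.8]
ys = [math.sqrt(1 - x * x) for x in xs]

# Implicit: solving x^2 + y^2 = 1 for y by Newton iteration takes a
# number of steps that depends on the input, so latency varies.
def solve_implicit(x, y0=1.0, tol=1e-12):
    y, steps = y0, 0
    while abs(x * x + y * y - 1.0) > tol:
        y -= (x * x + y * y - 1.0) / (2.0 * y)  # Newton step on g(y)
        steps += 1
    return y, steps

for x in xs:
    y, steps = solve_implicit(x)
    print(f"x={x}: y={y:.6f} in {steps} Newton steps")
```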

❌ Not a NumPy Replacement

Use NumPy for: General array operations, linear algebra, broadcasting

LuxiEdge complements NumPy for: Mathematical expression evaluation at GPU speed

import requests
import numpy as np

x = np.linspace(0, 2*np.pi, 1000)  # NumPy builds the input grid

# LuxiEdge: evaluate the expression at GPU speed
response = requests.post('http://...', ...)

❌ Not for Custom SIMD Code

Use xsimd/Highway if you need: Custom SIMD intrinsics, hand-optimized kernels

LuxiEdge provides: Pre-optimized cross-platform SIMD (AVX-512/AVX2/ARM Neon) with automatic runtime selection. No custom coding needed.

✅ What LuxiEdge ACTUALLY IS

The world's fastest deterministic JSON math engine for dense numeric vector expressions (y = f(x))

  • Evaluates explicit expressions (y = f(x)) at ≈30.7B ops/sec sustained (10-min) on NVIDIA L4 GPU, with peak stress-test performance of 58.77B ops/sec
  • Guarantees deterministic execution (same input → same output, always)
  • Provides memory-safe Rust (zero unsafe code in hot paths)
  • Scales from edge to data center (ARM64 Neon to NVIDIA L4)
  • Integrates via REST/gRPC API (stateless microservice)
  • Delivers hundreds of millions of operations per joule

How LuxiEdge Compares

Understanding where LuxiEdge excels vs. general frameworks

vs. JAX (Google)

JAX excels at ML research with automatic differentiation and JIT compilation. LuxiEdge delivers production-grade deterministic execution for safety-critical systems where bit-exact reproducibility matters more than research flexibility.

LuxiEdge Advantage:

  • Deterministic execution (no JIT variance)
  • Memory-safe Rust (no Python overhead)
  • 30.7–30.8B ops/sec GPU performance sustained (10-min + 1-hour L4 runs)

vs. CuPy (NVIDIA)

CuPy provides GPU-accelerated NumPy operations with Python integration. LuxiEdge delivers ≈30.7B ops/sec sustained with deterministic execution guarantees and cross-platform SIMD optimization (x86/ARM/GPU) beyond just GPU.

LuxiEdge Advantage:

  • Deterministic execution (CuPy has Python variance)
  • Cross-platform SIMD (not GPU-only)
  • No Python overhead (REST API integration)

vs. TensorFlow

TensorFlow is a general ML framework with graph compilation overhead. LuxiEdge is purpose-built for mathematical computation, delivering a more-than-thousandfold speedup over interpreted evaluation with a memory-safe Rust architecture.

LuxiEdge Advantage:

  • Specialized domain (no framework overhead)
  • Deterministic execution (no graph compilation variance)
  • Minimal binary footprint vs 500MB+ frameworks

vs. NumExpr

NumExpr optimizes CPU-based NumPy expressions (0.95-4x speedup). LuxiEdge delivers ≈30.7B ops/sec sustained with deterministic execution and cross-platform SIMD for real-time control systems.

LuxiEdge Advantage:

  • GPU acceleration (30.7–30.8B ops/sec sustained vs CPU-only)
  • Deterministic execution
  • ARM64 edge deployment (≈400M ops/J)

vs. SymPy

SymPy solves symbolic math problems (differentiation, simplification, solving). LuxiEdge evaluates explicit expressions at GPU speed with deterministic results. Complementary, not competitive.

LuxiEdge Advantage:

  • 1000× faster for explicit evaluation
  • Deterministic execution
  • GPU acceleration

Enterprise Use Cases

⚙️

Industrial Control

Real-time expression evaluation for manufacturing automation, process control systems, and robotics motion planning with deterministic execution guarantees.

  • Sub-millisecond response times
  • Embedded ARM64 deployment
  • Safety-critical applications
🤖

AI/ML Pipelines

Accelerate pre/post-processing for machine learning workloads, data normalization, feature engineering, and batch transformations.

  • ≈30.7B ops/sec GPU throughput (10-min sustained NVIDIA L4 FP16)
  • Deterministic preprocessing
  • Seamless TensorFlow/PyTorch integration
🚗

Autonomous Systems

Power path planning, sensor fusion calculations, and real-time decision-making for autonomous vehicles, drones, and navigation systems.

  • Edge-optimized deployment
  • Minimal power consumption
  • Predictable latency profiles
🛰️

Orbital Mechanics & Space

Fast computation for satellite trajectory optimization and orbital mechanics. Efficient math for power-limited space applications.

  • Deterministic trajectory calculations
  • ARM Neon for power-limited satellites
  • Memory-safe for mission-critical systems

Enterprise & Strategic Partnerships

🤝 NDA Partner Program Available

Luxi™ is available for white-label licensing, strategic partnerships, and custom enterprise deployments. Our NDA Partner Program provides early access to roadmap features and dedicated integration support.

📋

White-Label Licensing

Deploy under your brand with custom SLAs and support agreements

🔒

Protected IP

Patent Pending • Commercial license (LicenseRef-Luxi-Business-1.0) with NDA coverage

🚀

Dedicated Support

Direct engineering access and custom integration guidance