
Non-provisional patent filed January 31, 2026  ·  Issuance expected early July 2026

Energy-efficient.
Auditable.
Transformer compute.

A Rust-based transformer engine with CUDA, Metal, and WebGPU support. Optimizes joules per useful token through minimized HBM data movement, online-softmax causal attention, and full-layer geodesic fusion.

TRADE Default Production Path

Geodesic fused transformer layer on GPU: LN → packed QKV GEMM → attention + Wo → residual + LN₂ → MLP. Tuned against measured ms/layer and joules/token rather than theoretical throughput. Runs in fp16 by default. Validated from laptop CPU to data-center GPU (tested on RunPod H100 instances).
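As a rough illustration of the "packed QKV GEMM" step, the three projections can be computed as one matrix multiply against concatenated weights and then split by offset. This is a minimal CPU sketch, not the engine's kernel: the `[Q | K | V]` column layout and the names `packed_qkv` and `split_qkv` are assumptions made for this example.

```rust
/// y = x * W, where x has length d_model and W is row-major [d_model, 3 * d_model].
/// ASSUMPTION: columns are packed as [Q | K | V]; the engine's real layout is not documented here.
fn packed_qkv(x: &[f32], w: &[f32], d_model: usize) -> Vec<f32> {
    let cols = 3 * d_model;
    let mut y = vec![0.0f32; cols];
    for (r, xr) in x.iter().enumerate() {
        let row = &w[r * cols..(r + 1) * cols];
        for (yc, wc) in y.iter_mut().zip(row) {
            *yc += xr * wc; // one pass over W produces Q, K and V together
        }
    }
    y
}

/// Split the packed result into Q, K, V views without copying.
fn split_qkv(y: &[f32], d_model: usize) -> (&[f32], &[f32], &[f32]) {
    let (q, rest) = y.split_at(d_model);
    let (k, v) = rest.split_at(d_model);
    (q, k, v)
}

fn main() {
    let d_model = 2;
    let x = [1.0f32, 2.0];
    // Row-major 2 x 6 weight matrix: [Wq | Wk | Wv]
    let w: [f32; 12] = [
        1.0, 0.0, 0.0, 1.0, 1.0, 1.0,
        0.0, 1.0, 1.0, 0.0, 1.0, 1.0,
    ];
    let y = packed_qkv(&x, &w, d_model);
    let (q, k, v) = split_qkv(&y, d_model);
    println!("q = {:?}, k = {:?}, v = {:?}", q, k, v);
}
```

Packing means the input activations are read once for all three projections, which is where the data-movement saving comes from; the offsets used to split the result are exactly the kind of stride that has to match between producer and consumer.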

AUDIT CPU Reference Path

CPU f32 reference execution providing cryptographic SHA-256 receipts over output bit patterns. Enables reproducibility verification and compliance auditing independent of GPU hardware. Suitable for quant finance and high-trust workloads requiring a verifiable audit trail.

Waller Operator

Online-softmax causal attention with O(N) HBM usage for attention scores — no materialized N×N matrix. Delivers measurable memory and energy reductions at long contexts while maintaining mathematical equivalence to standard softmax attention.
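The general online-softmax technique can be sketched on the CPU as follows. This is the textbook streaming formulation, not the Waller Operator's kernel: each query row keeps only a running maximum, a running normalizer, and a running weighted sum, so no N×N score matrix is stored. The function names and the single-head row-major layout are assumptions for this example.

```rust
/// Causal attention for one head with a streaming softmax.
/// q, k, v are row-major [n, d]. Per-row state is O(d); no n x n matrix is allocated.
fn causal_attention_online(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    for i in 0..n {
        let qi = &q[i * d..(i + 1) * d];
        let mut m = f32::NEG_INFINITY; // running max of scores seen so far
        let mut l = 0.0f32; // running sum of exp(score - m)
        let mut acc = vec![0.0f32; d]; // running weighted sum of V rows
        for j in 0..=i {
            // causal mask: only keys 0..=i are visited
            let kj = &k[j * d..(j + 1) * d];
            let s = scale * qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>();
            let m_new = m.max(s);
            let correction = (m - m_new).exp(); // rescales old state; 0 on the first step
            let p = (s - m_new).exp();
            l = l * correction + p;
            for (a, vj) in acc.iter_mut().zip(&v[j * d..(j + 1) * d]) {
                *a = *a * correction + p * vj;
            }
            m = m_new;
        }
        for (o, a) in out[i * d..(i + 1) * d].iter_mut().zip(&acc) {
            *o = a / l;
        }
    }
    out
}

/// Reference version that stores each full score row before normalizing.
fn causal_attention_reference(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    for i in 0..n {
        let qi = &q[i * d..(i + 1) * d];
        let scores: Vec<f32> = (0..=i)
            .map(|j| scale * qi.iter().zip(&k[j * d..(j + 1) * d]).map(|(a, b)| a * b).sum::<f32>())
            .collect();
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        for (j, e) in exps.iter().enumerate() {
            for c in 0..d {
                out[i * d + c] += e / sum * v[j * d + c];
            }
        }
    }
    out
}

fn main() {
    let (n, d) = (5, 3);
    let q: Vec<f32> = (0..n * d).map(|i| ((i * 7 % 5) as f32 - 2.0) * 0.5).collect();
    let k: Vec<f32> = (0..n * d).map(|i| ((i * 3 % 7) as f32 - 3.0) * 0.25).collect();
    let v: Vec<f32> = (0..n * d).map(|i| (i % 4) as f32).collect();
    let a = causal_attention_online(&q, &k, &v, n, d);
    let b = causal_attention_reference(&q, &k, &v, n, d);
    let max_diff = a.iter().zip(&b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max);
    println!("max |online - reference| = {:e}", max_diff);
}
```

The two functions compute the same mathematical quantity, but they accumulate in a different order, so in floating point they should be compared with a small tolerance rather than bit for bit.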

WNSM

Waller Null-Space Multiplexing carries auxiliary payloads (e.g., scaling proofs) in MLP null-space with provable zero impact (0.00e+0 difference) on primary outputs. Enables in-band verification data without altering model behavior.
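The WNSM construction itself is not described on this page. The sketch below only illustrates the underlying linear-algebra fact it relies on: a component that lies in the null space of a matrix does not change the matrix-vector product. The matrix, the null vector `n`, and the helpers `embed`, `dot`, and `matvec` are hypothetical, and payload recovery here assumes the primary signal is orthogonal to `n`.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

/// y = W * h for row-major W of shape [rows, cols].
fn matvec(w: &[f32], h: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    (0..rows).map(|r| dot(&w[r * cols..(r + 1) * cols], h)).collect()
}

/// Add `payload` times the null vector `n` to the hidden vector `h`.
fn embed(h: &[f32], n: &[f32], payload: f32) -> Vec<f32> {
    h.iter().zip(n).map(|(hi, ni)| hi + payload * ni).collect()
}

fn main() {
    // Hypothetical 2 x 3 projection. n = [1, 1, 1] satisfies W * n = 0.
    let w: [f32; 6] = [1.0, 0.0, -1.0, 0.0, 1.0, -1.0];
    let n = [1.0f32, 1.0, 1.0];
    // ASSUMPTION: the primary signal is orthogonal to n, so the payload can be read back.
    let h = [1.0f32, 1.0, -2.0];

    let carrier = embed(&h, &n, 3.0);
    let y_plain = matvec(&w, &h, 2, 3);
    let y_carrier = matvec(&w, &carrier, 2, 3);
    let max_diff = y_plain
        .iter()
        .zip(&y_carrier)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    let recovered = dot(&carrier, &n) / dot(&n, &n);

    println!("max output difference: {:e}", max_diff);
    println!("recovered payload: {}", recovered);
}
```

The values are small integers, so every operation is exact in f32. With arbitrary weights a naive version of this idea would typically differ at rounding level, so an exactly zero difference depends on how the null-space component is constructed and applied.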

Deterministic Receipts

Bit-exact reproducibility across runs and hardware via SHA-256 over f32 bit patterns. 50+ passing tests. CPU full proof kit. GPU gates were re-validated after the recent QKV stride fixes.
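Hashing bit patterns rather than numeric values matters because floats that compare equal can still differ in their bits. The sketch below builds the byte stream such a receipt could be computed over; the function name `receipt_preimage` and the little-endian byte order are assumptions, and the SHA-256 step itself is left to whatever implementation is in use.

```rust
/// Serialize f32 outputs as their raw IEEE-754 bit patterns.
/// ASSUMPTION: little-endian byte order; the engine's actual canonical form is not documented here.
fn receipt_preimage(output: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(output.len() * 4);
    for x in output {
        bytes.extend_from_slice(&x.to_bits().to_le_bytes());
    }
    bytes
}

fn main() {
    let a = [0.0f32, 1.5];
    let b = [-0.0f32, 1.5];
    // 0.0 and -0.0 are numerically equal but have different sign bits,
    // so a digest over the bytes would distinguish the two outputs.
    println!("values equal:       {}", a == b);
    println!("bit patterns equal: {}", receipt_preimage(&a) == receipt_preimage(&b));
}
```

Feeding these bytes to any standard SHA-256 implementation yields a digest that changes if a single output bit changes, which is a stricter check than comparison within a tolerance.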

Determinism and kernel performance have been validated in prior third-party testing (December 2025 TestFort report on foundational numeric kernels). Current full-layer TRADE/AUDIT benchmarks and quant receipt integration are available in the private repository for NDA review.

API Example
// Configure and run a decoder layer with WNSM + audit receipt
let config = Config {
    d_model: 512,
    n_heads:  8,
    d_ff:     2048,
    lane:     Lane::Trade,   // or Lane::Audit for CPU f32 reference
};

let decoder = WNSM_GAE_Decoder::new(&config)?;

// Forward pass — returns output tensor + optional WNSM payload
let (output, payload) = decoder.forward(&input, mask)?;

// Generate SHA-256 receipt over output bit patterns
let receipt = decoder.receipt(&output)?;
println!("SHA-256: {}", receipt.hex());

Energy Modeling

Honest HBM data-movement accounting, not theoretical FLOP counts. Strong reductions versus naive score-matrix attention at longer sequences. Full-layer benchmarks versus PyTorch baselines are available in the repository. Measured in ms/layer and joules/token.
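To make the accounting idea concrete, here is a minimal, idealized byte-count model for one attention head. It is not the engine's energy model: it assumes Q, K, and V are each read once and the output written once, that the materialized variant writes the N×N scores to HBM and reads them back once, and it ignores the causal halving of the score matrix. The function names are hypothetical.

```rust
/// Idealized HBM traffic in bytes for streaming attention:
/// read Q, K, V once and write the output once, each n * d elements.
fn hbm_bytes_streaming(n: u64, d: u64, elem_bytes: u64) -> u64 {
    4 * n * d * elem_bytes
}

/// Same traffic plus one write and one read of the n x n score matrix.
fn hbm_bytes_materialized(n: u64, d: u64, elem_bytes: u64) -> u64 {
    hbm_bytes_streaming(n, d, elem_bytes) + 2 * n * n * elem_bytes
}

fn main() {
    let (d, elem_bytes) = (64u64, 2u64); // head dim 64, fp16
    for n in [1024u64, 4096, 16384] {
        let s = hbm_bytes_streaming(n, d, elem_bytes);
        let m = hbm_bytes_materialized(n, d, elem_bytes);
        println!(
            "n = {:>6}: streaming {:>12} B, materialized {:>12} B, ratio {:.1}x",
            n,
            s,
            m,
            m as f64 / s as f64
        );
    }
}
```

In this model streaming traffic grows linearly with sequence length while the materialized variant grows quadratically. Multiplying a byte count by a measured energy-per-byte figure for the target hardware gives an energy estimate, and real kernels may re-read K and V tiles, so measured numbers take precedence over the model.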

Verification & Testing

50+ passing tests covering correctness, determinism, and WNSM null-space impact. CPU full proof kit validates the entire layer against independent reference implementations. GPU gate suite re-validates after each architectural change.

Technical Characteristics
  • 01 O(N) HBM attention — no materialized score matrix; memory and energy scale linearly with sequence length
  • 02 Dual execution lanes — TRADE (GPU performance) and AUDIT (CPU f32 + SHA-256 receipts) in one codebase
  • 03 Ultra-low footprint — smaller than a typical phone photo; minimal attack surface
  • 04 Cross-platform — CPU reference baseline plus CUDA, Metal, and WebGPU GPU backends

Designed for workloads where long-context efficiency, reproducibility, and measured energy performance are essential: edge deployment, scientific computing, defense, and high-trust inference.

The engine evolves while preserving its core verification and energy-modeling discipline. Ongoing work targets further single-kernel geodesic optimizations, additional dtype support (BF16/FP8), and deeper layer fusion as outlined in GEODESIC_SWEEP_DESIGN.md and LUXIEDGE_BUILD_ROADMAP.md.

All technical claims are based on code and benchmarks in the attention-transformer-v2 repository. The repository is currently private; full source is available under NDA.

Current Status

The TRADE GPU path and AUDIT CPU path are production-ready. Core deterministic kernels are fully functional. The Waller Operator and WNSM components have produced correct results in testing, with verified zero null-space impact. The GPU gate suite was re-validated following the recent QKV stride fixes. The recent integration of luxi-quant online statistics supports enhanced receipt backtesting for quantitative finance workloads. Full production integration and additional dtype support (BF16/FP8) remain in active development.

Ready for confidential technical discussion

Contact: e@ewaller.com

Eric Waller  ·  Proprietary technology  ·  Full source available under NDA