The benchmark function tot computes p*q + p*q*r, reusing the shared subtotal p*q: two multiplies and an add, implemented equivalently in ilo and a set of comparison languages. The first interpreter ran it in 1,696ns. The current JIT does it in 3ns. That’s a 565x speedup across five distinct layers, each one motivated by profiling the previous.
All five layers still exist as runtime options. The JIT compiles the register VM’s bytecode. Non-eligible functions fall back to the register VM. You can run any layer directly via CLI flags (--run-jit, --run-cranelift, etc.) or benchmark them all side by side with --bench.
Layer 1: Tree-walk interpreter (1,696ns - 16x slower than CPython)
The starting point. Parse the source into an AST, walk the tree recursively, evaluate each node.
fn eval_node(&mut self, expr: &Expr) -> Value {
    match expr {
        Expr::Call(name, args) => {
            let vals: Vec<Value> = args.iter()
                .map(|a| self.eval_node(a))
                .collect();
            self.call_function(name, vals)
        }
        // ...
    }
}
Every operation allocates: values are heap-allocated enums, and every function call builds a Vec of arguments. For tot(10, 20, 30), that’s 4 function calls (tot, multiply, multiply, add), each one cloning and matching.
1,696ns for two multiplies and an add.
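The allocation pattern comes from the value representation itself. A minimal sketch of the kind of enum that forces it (illustrative, not ilo's actual definition):

```rust
// Illustrative tree-walker value type, not ilo's actual definition.
// Strings and lists live behind heap allocations, and Clone is deep,
// so passing a value into a call can mean an allocation plus a copy.
#[derive(Clone)]
enum Value {
    Num(f64),
    Str(String),      // heap-allocated buffer
    List(Vec<Value>), // heap-allocated, clones recursively
}
```

Every `eval_node` returning a `Value` by value, and every argument `Vec`, pays this cost even when the value is just a number.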
Layer 2: Stack-based bytecode VM (~226ns, 7.5x faster)
Compile the AST to bytecode once, then execute a tight loop. The stack VM follows the classic model: push operands, pop for operations, push results.
PUSH p
PUSH q
MUL
PUSH p
PUSH q
MUL
PUSH r
MUL
ADD
RET
10 instructions, including a redundant multiply: with no registers to hold the subtotal, the stack code shown recomputes p*q. The stack pointer bounces on every instruction. But the dispatch loop is a simple match on u8 opcodes, and NaN boxing means every value is 8 bytes, Copy, with no heap allocation for numbers.
NaN boxing is the foundation that makes everything after this possible. The stack becomes Vec<u64> and numbers are stored as raw double bits with no encoding overhead.
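The idea, sketched in a few lines (ilo's actual tag layout is an assumption here):

```rust
// NaN-boxing sketch. A real f64 is stored as its raw bits; non-number
// values are tagged inside the payload of a quiet NaN, a bit pattern no
// ordinary arithmetic result produces. The tag values are illustrative.
const QNAN: u64 = 0x7ff8_0000_0000_0000;
const TAG_NIL: u64 = QNAN | 1; // example tag, not ilo's actual encoding

fn box_num(n: f64) -> u64 {
    n.to_bits() // zero-cost: the double's bits ARE the boxed value
}

fn is_num(v: u64) -> bool {
    // Anything outside the quiet-NaN tag space decodes as a plain double.
    (v & QNAN) != QNAN
}

fn unbox_num(v: u64) -> f64 {
    f64::from_bits(v)
}
```

Numbers round-trip through the box with no shifts, no masks, and no allocation, which is what lets the stack become a flat `Vec<u64>`.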
Layer 3: Register-based bytecode VM (171ns, 9.9x faster - 1.6x slower than CPython)
The register VM borrows LuaJIT’s 32-bit packed instruction format:
[OP:8 | A:8 | B:8 | C:8] - three registers
[OP:8 | A:8 | Bx:16] - register + 16-bit operand
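Packing and unpacking the three-register form is a handful of shifts. A sketch (which end of the word each field occupies is an assumption; only the 8/8/8/8 split matters):

```rust
// Pack/unpack for the [OP:8 | A:8 | B:8 | C:8] instruction format.
fn encode_abc(op: u8, a: u8, b: u8, c: u8) -> u32 {
    (op as u32) | ((a as u32) << 8) | ((b as u32) << 16) | ((c as u32) << 24)
}

fn decode_abc(ins: u32) -> (u8, u8, u8, u8) {
    (ins as u8, (ins >> 8) as u8, (ins >> 16) as u8, (ins >> 24) as u8)
}
```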
The same function:
MUL r3, r0, r1 -- r3 = p * q (s)
MUL r4, r3, r2 -- r4 = s * r (t)
ADD r5, r3, r4 -- r5 = s + t
RET r5
4 instructions instead of 10, no stack pointer, operands stay in registers, and variable lookup is just a register index.
The compiler also tracks which registers hold numbers at compile time. When both operands are known numeric, it emits OP_ADD_NN instead of OP_ADD, skipping type checks entirely. For tot, every operation is numeric, so the entire execution is type-check-free.
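The decision itself is cheap at compile time. A sketch of the selection logic (compiler internals assumed; `known_num[r]` records whether register r provably holds a number at this point):

```rust
// Sketch of numeric specialization: pick the unchecked opcode when both
// operand registers are statically known to hold numbers. Names are
// illustrative, not ilo's actual compiler types.
#[derive(Debug, PartialEq)]
enum AddOp {
    Add,   // generic: checks operand types at runtime
    AddNn, // both operands known numeric: no type checks
}

fn select_add(known_num: &[bool], b: usize, c: usize) -> AddOp {
    if known_num[b] && known_num[c] {
        AddOp::AddNn
    } else {
        AddOp::Add
    }
}
```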
Superinstructions fuse common pairs. LOADK + MUL becomes MULK_N, one dispatch instead of two.
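The fusion is a peephole pass over the emitted instructions. A sketch (the instruction set and the safety condition are illustrative; a real pass must also check that the constant register is dead after the pair):

```rust
// Peephole fusion sketch: LOADK r, k immediately followed by MUL a, b, r
// collapses into MULK_N a, b, k, assuming r is not read again afterwards.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ins {
    LoadK(u8, u16),     // r = constants[k]
    Mul(u8, u8, u8),    // a = b * c
    MulKN(u8, u8, u16), // a = b * constants[k], one dispatch
}

fn fuse(code: &[Ins]) -> Vec<Ins> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < code.len() {
        match (code[i], code.get(i + 1)) {
            (Ins::LoadK(r, k), Some(&Ins::Mul(a, b, c))) if c == r => {
                out.push(Ins::MulKN(a, b, k));
                i += 2; // consumed the pair
            }
            (ins, _) => {
                out.push(ins);
                i += 1;
            }
        }
    }
    out
}
```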
Layer 4: VmState reuse (83ns, 20x faster - 1.3x faster than CPython)
A simple but effective change. Instead of allocating a fresh stack for every call, reuse a VmState struct across invocations:
Standard: allocate stack -> run -> deallocate (every call)
VmState: allocate once -> run -> run -> run (amortised)
The gap between 171ns and 83ns is almost entirely Vec::with_capacity. In a tight benchmark loop, allocation overhead is a significant fraction of total time.
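The pattern, sketched (struct and field names are assumptions, and the dispatch loop is stubbed out):

```rust
// VmState-reuse sketch: the register file is allocated once; each call
// clears it (retaining capacity) instead of allocating a fresh Vec.
struct VmState {
    regs: Vec<u64>,
}

impl VmState {
    fn new(capacity: usize) -> Self {
        VmState { regs: Vec::with_capacity(capacity) }
    }

    // The real dispatch loop is elided; this stub just loads arguments
    // and returns the first register to show the allocation-free path.
    fn run(&mut self, args: &[u64]) -> u64 {
        self.regs.clear(); // O(1) for u64: no Drop calls, no free
        self.regs.extend_from_slice(args);
        self.regs[0]
    }
}
```

`clear` on a `Vec<u64>` only resets the length, so after the first call the steady state involves no allocator traffic at all.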
Layer 5: JIT compilation (3ns, 565x faster - 36x faster than CPython)
The register VM still pays for dispatch: decode the u32 instruction, match on the opcode, read register indices, execute. For tot, that’s 4 dispatches for 4 instructions. The JIT eliminates dispatch entirely by emitting native ARM64 instructions.
ilo’s VM registers R0-R30 map 1:1 to ARM64 floating-point registers d0-d30. Function arguments arrive in d0-d7 per AAPCS64, which is exactly how the VM lays out parameters. So the “JIT” for numeric functions is mechanical: walk the bytecode, emit the corresponding ARM64 instruction.
ilo source: tot p:n q:n r:n>n;s=*p q;t=*s r;+s t
VM bytecode: MUL_NN R3,R0,R1 | MUL_NN R4,R3,R2 | ADD_NN R5,R3,R4 | RET R5
ARM64 native: fmul d3,d0,d1 | fmul d4,d3,d2 | fadd d0,d3,d4 | ret
4 native instructions. The compiled buffer is mmapped executable and called as a function pointer.
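The per-opcode lowering is a bit-packing job. A sketch of the two encoders this function needs, using the scalar double-precision encodings from the Arm architecture manual (the emitter API around them is an assumption):

```rust
// FP data-processing (2 source), double precision, per the Arm ARM:
// base | Rm<<16 | Rn<<5 | Rd, with the opcode field distinguishing
// FMUL (0000) from FADD (0010).
fn fmul_d(rd: u32, rn: u32, rm: u32) -> u32 {
    0x1E60_0800 | (rm << 16) | (rn << 5) | rd
}

fn fadd_d(rd: u32, rn: u32, rm: u32) -> u32 {
    0x1E60_2800 | (rm << 16) | (rn << 5) | rd
}

const RET: u32 = 0xD65F_03C0; // ret (branch to x30)

// The body of tot, as the JIT would emit it:
fn emit_tot() -> Vec<u32> {
    vec![
        fmul_d(3, 0, 1), // fmul d3, d0, d1
        fmul_d(4, 3, 2), // fmul d4, d3, d2
        fadd_d(0, 3, 4), // fadd d0, d3, d4
        RET,             // ret
    ]
}
```

Because the VM's register numbers are the hardware register numbers, no register allocation happens at JIT time; the loop over bytecode is the whole compiler for this class of function.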
Three backends exist: hand-rolled ARM64 (zero dependencies, aarch64 only), Cranelift (cross-platform, behind a feature flag), and LLVM via inkwell (heaviest optimiser, requires LLVM 18). For this function, ARM64 and Cranelift perform within a nanosecond of each other; both emit the same 4 floating-point instructions.
Where it sits
All numbers on Apple MacBook Pro M3 Max, release build, 10k iterations after warmup. Each language implements the same tot function. The benchmark scripts and implementations are in the research/explorations/ folder.
ilo backends                 Per call    vs Interpreter
-------------------------------------------------------
Rust interpreter             1,696 ns         1.0x
Register VM                    171 ns         9.9x
ilo → Python (transpiled)      107 ns        15.9x
Register VM (reused)            83 ns        20.4x
Custom JIT (arm64)               3 ns       565.3x ★
Cranelift JIT                    2 ns       848.0x ★

External - interpreted       Per call    vs Interpreter
-------------------------------------------------------
CPython                        108 ns        15.7x
Ruby                            51 ns        33.3x
PHP                             45 ns        37.7x
Lua                             36 ns        47.1x

External - JIT               Per call    vs Interpreter
-------------------------------------------------------
PyPy3                          149 ns        11.4x
V8 / Node.js                    19 ns        89.3x
LuaJIT                           2 ns       848.0x

External - AOT               Per call    vs Interpreter
-------------------------------------------------------
Go                               1 ns     1,696.0x
Rust (rustc -O)                0.3 ns     5,653.3x
C (cc -O2)                     0.3 ns     5,653.3x
The ARM64 JIT is 6.3x faster than V8 and within 1ns of LuaJIT on this benchmark. The gap to LuaJIT is likely calling convention overhead: LuaJIT’s trace compiler avoids the extern "C" function pointer call entirely.
The caveat
This is a three-multiply-and-add benchmark on a tiny instruction set. The JIT only handles pure-numeric functions: no strings, no lists, no control flow, no function calls. Non-eligible functions fall back to the register VM.
As ilo’s vocabulary and instruction set expand, these numbers will change. Branching, loops, and function calls each bring new costs. The 565x number is real but narrow. The register VM at 83ns is the more representative baseline for general code today.