The benchmark function tot computes p*q + p*q*r, reusing the shared subtotal p*q: two multiplies and an add, implemented equivalently in ilo and a set of comparison languages. The first interpreter ran it in 1,696ns. The current JIT does it in 3ns. That’s a 565x speedup across five distinct layers, each one motivated by profiling the previous.
All five layers still exist as runtime options. The JIT compiles the register VM’s bytecode. Non-eligible functions fall back to the register VM. You can run any layer directly via CLI flags (--run-jit, --run-cranelift, etc.) or benchmark them all side by side with --bench.
Layer 1: Tree-walk interpreter (1,696ns - 16x slower than CPython)
The starting point. Parse the source into an AST, walk the tree recursively, evaluate each node.
fn eval_node(&mut self, expr: &Expr) -> Value {
    match expr {
        Expr::Call(name, args) => {
            let vals: Vec<Value> = args.iter()
                .map(|a| self.eval_node(a))
                .collect();
            self.call_function(name, vals)
        }
        // ...
    }
}
Every operation allocates: values are heap-allocated enums, and every function call builds a Vec of arguments. For tot(10, 20, 30), that’s 4 function calls (tot, multiply, multiply, add), each one cloning and matching.
1,696ns for two multiplies and an add.
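The allocation pattern comes from the value representation itself. A minimal sketch of the kind of enum that forces it (illustrative, not ilo's actual definition):

```rust
// Illustrative tree-walker value type, not ilo's actual definition.
// Strings and lists live behind heap allocations, and Clone is deep,
// so passing a value into a call can mean an allocation plus a copy.
#[derive(Clone)]
enum Value {
    Num(f64),
    Str(String),      // heap-allocated buffer
    List(Vec<Value>), // heap-allocated, clones recursively
}
```

Every `eval_node` returning a `Value` by value, and every argument `Vec`, pays this cost even when the value is just a number.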
Layer 2: Stack-based bytecode VM (~226ns, 7.5x faster)
Compile the AST to bytecode once, then execute a tight loop. The stack VM follows the classic model: push operands, pop for operations, push results.
PUSH p
PUSH q
MUL
PUSH p
PUSH q
MUL
PUSH r
MUL
ADD
RET
10 instructions, including a redundant multiply: with no registers to hold the subtotal, the stack code shown recomputes p*q. The stack pointer bounces on every instruction. But the dispatch loop is a simple match on u8 opcodes, and NaN boxing means every value is 8 bytes, Copy, with no heap allocation for numbers.
NaN boxing is the foundation that makes everything after this possible. The stack becomes Vec<u64> and numbers are stored as raw double bits with no encoding overhead.
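The idea, sketched in a few lines (ilo's actual tag layout is an assumption here):

```rust
// NaN-boxing sketch. A real f64 is stored as its raw bits; non-number
// values are tagged inside the payload of a quiet NaN, a bit pattern no
// ordinary arithmetic result produces. The tag values are illustrative.
const QNAN: u64 = 0x7ff8_0000_0000_0000;
const TAG_NIL: u64 = QNAN | 1; // example tag, not ilo's actual encoding

fn box_num(n: f64) -> u64 {
    n.to_bits() // zero-cost: the double's bits ARE the boxed value
}

fn is_num(v: u64) -> bool {
    // Anything outside the quiet-NaN tag space decodes as a plain double.
    (v & QNAN) != QNAN
}

fn unbox_num(v: u64) -> f64 {
    f64::from_bits(v)
}
```

Numbers round-trip through the box with no shifts, no masks, and no allocation, which is what lets the stack become a flat `Vec<u64>`.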
Layer 3: Register-based bytecode VM (171ns, 9.9x faster - 1.6x slower than CPython)
The register VM borrows LuaJIT’s 32-bit packed instruction format:
[OP:8 | A:8 | B:8 | C:8] - three registers
[OP:8 | A:8 | Bx:16] - register + 16-bit operand
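Packing and unpacking the three-register form is a handful of shifts. A sketch (which end of the word each field occupies is an assumption; only the 8/8/8/8 split matters):

```rust
// Pack/unpack for the [OP:8 | A:8 | B:8 | C:8] instruction format.
fn encode_abc(op: u8, a: u8, b: u8, c: u8) -> u32 {
    (op as u32) | ((a as u32) << 8) | ((b as u32) << 16) | ((c as u32) << 24)
}

fn decode_abc(ins: u32) -> (u8, u8, u8, u8) {
    (ins as u8, (ins >> 8) as u8, (ins >> 16) as u8, (ins >> 24) as u8)
}
```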
The same function:
MUL r3, r0, r1 -- r3 = p * q (s)
MUL r4, r3, r2 -- r4 = s * r (t)
ADD r5, r3, r4 -- r5 = s + t
RET r5
4 instructions instead of 10, no stack pointer, operands stay in registers, and variable lookup is just a register index.
The compiler also tracks which registers hold numbers at compile time. When both operands are known numeric, it emits OP_ADD_NN instead of OP_ADD, skipping type checks entirely. For tot, every operation is numeric, so the entire execution is type-check-free.
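The decision itself is cheap at compile time. A sketch of the selection logic (compiler internals assumed; `known_num[r]` records whether register r provably holds a number at this point):

```rust
// Sketch of numeric specialization: pick the unchecked opcode when both
// operand registers are statically known to hold numbers. Names are
// illustrative, not ilo's actual compiler types.
#[derive(Debug, PartialEq)]
enum AddOp {
    Add,   // generic: checks operand types at runtime
    AddNn, // both operands known numeric: no type checks
}

fn select_add(known_num: &[bool], b: usize, c: usize) -> AddOp {
    if known_num[b] && known_num[c] {
        AddOp::AddNn
    } else {
        AddOp::Add
    }
}
```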
Superinstructions fuse common pairs. LOADK + MUL becomes MULK_N, one dispatch instead of two.
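The fusion is a peephole pass over the emitted instructions. A sketch (the instruction set and the safety condition are illustrative; a real pass must also check that the constant register is dead after the pair):

```rust
// Peephole fusion sketch: LOADK r, k immediately followed by MUL a, b, r
// collapses into MULK_N a, b, k, assuming r is not read again afterwards.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Ins {
    LoadK(u8, u16),     // r = constants[k]
    Mul(u8, u8, u8),    // a = b * c
    MulKN(u8, u8, u16), // a = b * constants[k], one dispatch
}

fn fuse(code: &[Ins]) -> Vec<Ins> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < code.len() {
        match (code[i], code.get(i + 1)) {
            (Ins::LoadK(r, k), Some(&Ins::Mul(a, b, c))) if c == r => {
                out.push(Ins::MulKN(a, b, k));
                i += 2; // consumed the pair
            }
            (ins, _) => {
                out.push(ins);
                i += 1;
            }
        }
    }
    out
}
```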
Layer 4: VmState reuse (83ns, 20x faster - 1.3x faster than CPython)
A simple but effective change. Instead of allocating a fresh stack for every call, reuse a VmState struct across invocations:
Standard: allocate stack -> run -> deallocate (every call)
VmState: allocate once -> run -> run -> run (amortised)
The gap between 171ns and 83ns is almost entirely Vec::with_capacity. In a tight benchmark loop, allocation overhead is a significant fraction of total time.
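The pattern, sketched (struct and field names are assumptions, and the dispatch loop is stubbed out):

```rust
// VmState-reuse sketch: the register file is allocated once; each call
// clears it (retaining capacity) instead of allocating a fresh Vec.
struct VmState {
    regs: Vec<u64>,
}

impl VmState {
    fn new(capacity: usize) -> Self {
        VmState { regs: Vec::with_capacity(capacity) }
    }

    // The real dispatch loop is elided; this stub just loads arguments
    // and returns the first register to show the allocation-free path.
    fn run(&mut self, args: &[u64]) -> u64 {
        self.regs.clear(); // O(1) for u64: no Drop calls, no free
        self.regs.extend_from_slice(args);
        self.regs[0]
    }
}
```

`clear` on a `Vec<u64>` only resets the length, so after the first call the steady state involves no allocator traffic at all.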
Layer 5: JIT compilation (3ns, 565x faster - 36x faster than CPython)
The register VM still pays for dispatch: decode the u32 instruction, match on the opcode, read register indices, execute. For tot, that’s 4 dispatches for 4 instructions. The JIT eliminates dispatch entirely by emitting native ARM64 instructions.
ilo’s VM registers R0-R30 map 1:1 to ARM64 floating-point registers d0-d30. Function arguments arrive in d0-d7 per AAPCS64, which is exactly how the VM lays out parameters. So the “JIT” for numeric functions is mechanical: walk the bytecode, emit the corresponding ARM64 instruction.
ilo source: tot p:n q:n r:n>n;s=*p q;t=*s r;+s t
VM bytecode: MUL_NN R3,R0,R1 | MUL_NN R4,R3,R2 | ADD_NN R5,R3,R4 | RET R5
ARM64 native: fmul d3,d0,d1 | fmul d4,d3,d2 | fadd d0,d3,d4 | ret
4 native instructions. The compiled buffer is mmapped executable and called as a function pointer.
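The per-opcode lowering is a bit-packing job. A sketch of the two encoders this function needs, using the scalar double-precision encodings from the Arm architecture manual (the emitter API around them is an assumption):

```rust
// FP data-processing (2 source), double precision, per the Arm ARM:
// base | Rm<<16 | Rn<<5 | Rd, with the opcode field distinguishing
// FMUL (0000) from FADD (0010).
fn fmul_d(rd: u32, rn: u32, rm: u32) -> u32 {
    0x1E60_0800 | (rm << 16) | (rn << 5) | rd
}

fn fadd_d(rd: u32, rn: u32, rm: u32) -> u32 {
    0x1E60_2800 | (rm << 16) | (rn << 5) | rd
}

const RET: u32 = 0xD65F_03C0; // ret (branch to x30)

// The body of tot, as the JIT would emit it:
fn emit_tot() -> Vec<u32> {
    vec![
        fmul_d(3, 0, 1), // fmul d3, d0, d1
        fmul_d(4, 3, 2), // fmul d4, d3, d2
        fadd_d(0, 3, 4), // fadd d0, d3, d4
        RET,             // ret
    ]
}
```

Because the VM's register numbers are the hardware register numbers, no register allocation happens at JIT time; the loop over bytecode is the whole compiler for this class of function.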
Three backends exist: hand-rolled ARM64 (zero dependencies, aarch64 only), Cranelift (cross-platform, behind a feature flag), and LLVM via inkwell (heaviest optimiser, requires LLVM 18). For this function, ARM64 and Cranelift perform within a nanosecond of each other; both emit the same 4 floating-point instructions.
Where it sits
All numbers on Apple MacBook Pro M3 Max, release build, 10k iterations after warmup. Each language implements the same tot function. The benchmark scripts and implementations are in the research/explorations/ folder.
ilo backends                 Per call    vs Interpreter
-------------------------------------------------------
Rust interpreter             1,696 ns         1.0x
Register VM                    171 ns         9.9x
ilo → Python (transpiled)      107 ns        15.9x
Register VM (reused)            83 ns        20.4x
Custom JIT (arm64)               3 ns       565.3x ★
Cranelift JIT                    2 ns       848.0x ★

External - interpreted       Per call    vs Interpreter
-------------------------------------------------------
CPython                        108 ns        15.7x
Ruby                            51 ns        33.3x
PHP                             45 ns        37.7x
Lua                             36 ns        47.1x

External - JIT               Per call    vs Interpreter
-------------------------------------------------------
PyPy3                          149 ns        11.4x
V8 / Node.js                    19 ns        89.3x
LuaJIT                           2 ns       848.0x

External - AOT               Per call    vs Interpreter
-------------------------------------------------------
Go                               1 ns     1,696.0x
Rust (rustc -O)                0.3 ns     5,653.3x
C (cc -O2)                     0.3 ns     5,653.3x
The ARM64 JIT is 6.3x faster than V8 and within 1ns of LuaJIT on this benchmark. The gap to LuaJIT is likely calling convention overhead: LuaJIT’s trace compiler avoids the extern "C" function pointer call entirely.
The caveat
This is a three-multiply-and-add benchmark on a tiny instruction set. The JIT only handles pure-numeric functions: no strings, no lists, no control flow, no function calls. Non-eligible functions fall back to the register VM.
As ilo’s vocabulary and instruction set expand, these numbers will change. Branching, loops, and function calls each bring new costs. The 565x number is real but narrow. The register VM at 83ns is the more representative baseline for general code today.