ilo: A Programming Language for AI Agents, Not Humans

From Stack VM to Register VM: A 31% Speedup by Changing the Model

The first version of ilo’s bytecode VM was stack-based. Push, pop, push, pop. It’s the classic approach. CPython does it, the JVM does it, Lua 4 did it. It’s simple to implement and simple to compile to.

Stack VMs: the comfortable default

A stack VM for tot(p, q, r) = p*q + p*q*r looks like this:

PUSH p
PUSH q
MUL
PUSH p
PUSH q
MUL
PUSH r
MUL
ADD
RET

That’s 10 instructions, lots of memory traffic. Every operation pops operands off the stack and pushes the result back. The stack pointer bounces up and down constantly.

Register VMs: fewer instructions, less overhead

LuaJIT uses a register-based design with 32-bit packed instructions. I borrowed the format:

[OP:8 | A:8 | B:8 | C:8]   - three registers
[OP:8 | A:8 | Bx:16]       - register + 16-bit operand

The same function in register form:

MUL  r3, r0, r1    -- r3 = p * q
MUL  r4, r3, r2    -- r4 = r3 * r
ADD  r5, r3, r4    -- r5 = r3 + r4
RET  r5

4 instructions instead of 10, with operands sitting in registers between operations rather than being pushed and popped through a stack pointer.

A local-variable reference (Expr::Ref) becomes free in the register model: it just resolves to the register slot the variable already lives in. In the stack VM, every reference needed its own PUSH instruction.

The numbers

Before (stack-based VM):

tot(10, 20, 30):  ~156ns/call  (standard)
                   ~95ns/call  (VmState, reused allocations)

After (register-based VM):

tot(10, 20, 30):  ~125ns/call  (standard)
                   ~66ns/call  (VmState, reused allocations)

That’s a 20% improvement on the standard path and 31% on the optimised path, from executing fewer instructions and moving data less.

The compilation is trickier

The downside of register VMs is that the compiler is more complex. Instead of just pushing and popping, you need to allocate registers:

struct RegCompiler {
    chunks: Vec<Chunk>,
    func_names: Vec<String>,
    current: Chunk,
    locals: Vec<(String, u8)>,
    next_reg: u8,
    max_reg: u8,
    reg_is_num: [bool; 256],  // track which registers are known numeric
}

That reg_is_num array is key. It tracks which registers are known to hold numbers at compile time. When both operands of an ADD are known numeric, the compiler emits OP_ADD_NN instead of OP_ADD, which skips runtime type checks entirely.

Register allocation is currently a simple bump allocator. Grab the next slot. There’s no register pressure management, no spilling. With 256 available registers and most functions using fewer than 20, this hasn’t been a problem yet.

VmState: the other half of the win

A big chunk of the performance gain came from VmState, a struct that holds the stack and call frames across invocations:

Standard VM:   allocate stack → run → deallocate   (every call)
VmState:       allocate once → run → run → run     (amortised)

The gap between standard (125ns) and VmState (66ns) tells you how much allocation overhead matters in a tight loop. Almost half the total time was just Vec::with_capacity.

Type-specialized opcodes

Once the register compiler tracks numeric types, you get a cascade of optimisations:

// Generic ADD: check types, handle strings, handle numbers
OP_ADD => {
    let bv = reg!(b);
    let cv = reg!(c);
    if bv.is_number() && cv.is_number() {
        reg_set!(a, NanVal::number(bv.as_number() + cv.as_number()));
    } else if bv.is_string() && cv.is_string() {
        // string concatenation...
    }
}

// Specialized ADD_NN: both operands known numeric at compile time
OP_ADD_NN => {
    let bv = reg!(b);
    let cv = reg!(c);
    let result = NanVal::number(bv.as_number() + cv.as_number());
    // Skip drop_rc - numbers don't need it
    unsafe { *self.stack.as_mut_ptr().add(a) = result; }
}

The specialised path skips type checks and refcount management and goes straight to the arithmetic. For a function like tot where every operation is numeric, the entire execution is type-check-free.

Superinstructions

The final optimisation was fusing common instruction pairs. LOADK r1, const followed by ADD r0, r1, r2 is a pattern that shows up whenever you do arithmetic with a literal. Fusing them into a single ADDK_N r0, r1, const_idx eliminates one dispatch cycle and one register allocation:

// Before: 2 dispatches
LOADK r3, [pool: 30.0]
MUL_NN r4, r2, r3

// After: 1 dispatch
MULK_N r4, r2, [pool: 30.0]

This is straight from the LuaJIT playbook. Mike Pall’s insight was that reducing dispatch count matters more than reducing instruction complexity, because the dispatch itself (the match/switch, the jump table lookup) is the bottleneck.

The stack-to-register rewrite touched the compiler, the VM loop, the constant pool, and the calling convention, and the payoff showed up in the benchmark numbers above.

Don’t optimise the dispatch loop until you’ve minimised the number of dispatches. Making individual stack operations faster matters less than doing fewer of them.