ilo: A Programming Language for AI Agents, Not Humans

Stealing From LuaJIT and V8: What Works in Rust and What Doesn't

Building a fast bytecode VM means reading LuaJIT source code and V8 blog posts. I tried to bring their ideas into a Rust VM. Some transplanted cleanly. Others fought the language at every step.

What I stole

NaN boxing (from LuaJIT)

LuaJIT packs every value into 64 bits by hiding non-number types in the unused payload space of quiet NaNs. Numbers are stored as-is (just the raw double bits); everything else gets a type tag in the high NaN payload bits and a pointer in the low 48 bits.

In Rust:

#[derive(Clone, Copy)]
pub(crate) struct NanVal(u64);

const QNAN: u64        = 0x7FFC_0000_0000_0000;
const TAG_STRING: u64  = 0x7FFD_0000_0000_0000;
const TAG_LIST: u64    = 0x7FFE_0000_0000_0000;
const PTR_MASK: u64    = 0x0000_FFFF_FFFF_FFFF;

It works, but every accessor is unsafe. The Copy derive is the whole point: the stack becomes a Vec<NanVal> (eight bytes per value), and numbers are zero-cost. The catch is that you are manually managing Rc reference counts through raw pointers.
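The accessors on top of those constants look roughly like this. This is a self-contained sketch, not ilo's actual code: it repeats the declarations above, and `from_string_ptr` stands in for the real Rc-backed constructor.

```rust
// Sketch of NaN-boxing encode/decode. Repeats the constants above so the
// block stands alone; the string side is a simplified stand-in.
const QNAN: u64       = 0x7FFC_0000_0000_0000;
const TAG_STRING: u64 = 0x7FFD_0000_0000_0000;
const PTR_MASK: u64   = 0x0000_FFFF_FFFF_FFFF;

#[derive(Clone, Copy)]
pub struct NanVal(u64);

impl NanVal {
    pub fn from_f64(n: f64) -> Self {
        NanVal(n.to_bits()) // numbers are the raw double bits
    }
    pub fn is_number(self) -> bool {
        // anything outside the tag space is a real double (including
        // hardware NaNs, which use 0x7FF8..., below our QNAN pattern)
        (self.0 & QNAN) != QNAN
    }
    pub fn as_f64(self) -> f64 {
        f64::from_bits(self.0)
    }
    pub fn from_string_ptr(p: *const u8) -> Self {
        // assumes pointers fit in 48 bits, true on mainstream platforms
        NanVal(TAG_STRING | (p as u64 & PTR_MASK))
    }
    pub fn is_string(self) -> bool {
        (self.0 & !PTR_MASK) == TAG_STRING
    }
    pub fn as_ptr(self) -> *const u8 {
        (self.0 & PTR_MASK) as *const u8
    }
}
```

The `is_number` check works because arithmetic never produces a NaN with bit pattern 0x7FFC or above, so the tag space is invisible to real doubles.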

Register-based bytecode (from LuaJIT)

LuaJIT uses 32-bit packed instructions with register operands instead of a stack. I used the same encoding:

[OP:8 | A:8 | B:8 | C:8]   - three registers
[OP:8 | A:8 | Bx:16]       - register + 16-bit operand

Translates directly. This is just data layout, no language friction at all. The register VM reduced tot(p, q, r) from 12 stack operations to 4 register instructions.
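The packing itself is a few shifts. A sketch, assuming OP sits in the low byte as in LuaJIT; the field names are generic, not ilo's opcode set:

```rust
// Encode/decode for the 32-bit packed formats above, OP in the low byte.
#[derive(Debug, PartialEq)]
struct Abc { op: u8, a: u8, b: u8, c: u8 }

fn encode_abc(op: u8, a: u8, b: u8, c: u8) -> u32 {
    (op as u32) | (a as u32) << 8 | (b as u32) << 16 | (c as u32) << 24
}

fn decode_abc(ins: u32) -> Abc {
    Abc {
        op: ins as u8,          // `as u8` truncates to the low byte
        a: (ins >> 8) as u8,
        b: (ins >> 16) as u8,
        c: (ins >> 24) as u8,
    }
}

// the [OP:8 | A:8 | Bx:16] variant: B and C merge into one 16-bit operand
fn decode_bx(ins: u32) -> (u8, u8, u16) {
    (ins as u8, (ins >> 8) as u8, (ins >> 16) as u16)
}
```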

Superinstructions (from LuaJIT)

Fuse common instruction pairs into single opcodes. LOADK + ADD becomes ADDK_N, one dispatch instead of two.

Works, but Rust’s match makes it verbose. In C, you’d use computed gotos or a jump table. In Rust, each superinstruction is another arm in a giant match block. The compiler probably turns it into a jump table anyway, but you can’t verify that without checking the assembly.
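Here is what the fused arm looks like in a toy dispatch loop. The opcode numbering and the (op, a, b, c) instruction tuples are invented for illustration:

```rust
// Toy dispatch loop with one fused superinstruction.
const OP_LOADK: u8 = 0;  // regs[a] = consts[b]
const OP_ADD: u8 = 1;    // regs[a] = regs[b] + regs[c]
const OP_ADDK_N: u8 = 2; // fused LOADK + ADD: regs[a] = regs[b] + consts[c]
const OP_HALT: u8 = 3;

fn run(code: &[(u8, usize, usize, usize)], consts: &[f64], regs: &mut [f64]) {
    let mut pc = 0;
    loop {
        let (op, a, b, c) = code[pc];
        pc += 1;
        match op {
            OP_LOADK => regs[a] = consts[b],
            OP_ADD => regs[a] = regs[b] + regs[c],
            // one dispatch instead of two, and no temporary register
            OP_ADDK_N => regs[a] = regs[b] + consts[c],
            OP_HALT => return,
            _ => unreachable!(),
        }
    }
}
```

The fused path also frees the scratch register the LOADK would have needed, which matters when registers are 8-bit indices.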

Unchecked indexing in the dispatch loop (inspired by both)

Both LuaJIT and V8 avoid bounds checks in their innermost loops. In Rust:

macro_rules! reg {
    ($idx:expr) => { unsafe { *self.stack.get_unchecked($idx) } }
}

Necessary for performance, but dangerous. Bounds checks in the inner loop are measurably slow; removing them buys C-like speed at C-like risk. A bad register index is undefined behaviour, a segfault at best and silent corruption at worst, instead of a clean panic.
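One middle ground, a sketch rather than what ilo ships, is to keep the check in debug builds via debug_assert!, so tests and fuzzing still catch bad indices while release builds stay unchecked. This variant takes the stack explicitly instead of going through self:

```rust
// Bounds-checked in debug builds, unchecked in release.
macro_rules! reg {
    ($stack:expr, $idx:expr) => {{
        debug_assert!($idx < $stack.len(), "bad register index {}", $idx);
        // SAFETY: the compiler must never emit an out-of-range index
        unsafe { *$stack.get_unchecked($idx) }
    }};
}

fn read_reg(stack: &[u64], idx: usize) -> u64 {
    reg!(stack, idx)
}
```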

What didn’t translate

Computed gotos

LuaJIT 2’s interpreter is hand-written assembly using threaded dispatch, the same effect GCC’s computed goto extension gives you in C. Each opcode handler ends with a direct jump to the next handler, never returning to a central dispatch switch, which cuts dispatch cost substantially.

Rust doesn’t have computed gotos. You get a match statement. The compiler might lower it to a jump table (and usually does for dense opcode ranges), but you can’t force it. And you can’t do the “threaded dispatch” trick where each handler dispatches the next one directly.

Not expressible in safe Rust, and unsafe Rust can’t express it cleanly either. You’d need inline assembly or a C shim, at which point you’ve left Rust.
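The closest approximation Rust does allow is a table of handler function pointers, which removes the central match but still pays an indirect call per instruction. A toy sketch with an invented Vm and opcodes:

```rust
// Approximating threaded dispatch with a handler table indexed by opcode.
// Not a true computed goto: there is still one indirect call per step.
struct Vm {
    pc: usize,
    acc: i64,
    code: Vec<(u8, i64)>, // (opcode, immediate operand)
}

type Handler = fn(&mut Vm) -> bool; // returns false to halt

fn op_add_imm(vm: &mut Vm) -> bool {
    vm.acc += vm.code[vm.pc].1;
    vm.pc += 1;
    true
}

fn op_halt(_vm: &mut Vm) -> bool {
    false
}

// opcode value indexes straight into the table
const HANDLERS: [Handler; 2] = [op_add_imm, op_halt];

fn run(vm: &mut Vm) {
    // no central match: fetch the opcode, jump through the table
    while HANDLERS[vm.code[vm.pc].0 as usize](vm) {}
}
```

Whether this beats a match in practice depends on how well the compiler lowers the match; it is worth benchmarking both rather than assuming.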

Tagged pointer tricks

V8 uses the low bit of a machine word to distinguish small integers (Smis) from heap objects. Since heap objects are always aligned, the low bit of a real pointer is naturally 0, so V8 tags heap pointers by setting it to 1 and leaves Smis with a low bit of 0, using the remaining 31 bits for the integer value.

In Rust, this kind of raw pointer manipulation collides with Rc and Box, which expect properly aligned, typed pointers. You can do it with raw *const pointers, but you lose all of Rust’s memory management.

Possible, but you end up rewriting the allocator. NaN boxing turned out to be a better fit: doubles are stored untouched, and the tags live in NaN payload bits that float hardware never produces, so it sidesteps Rust’s pointer model instead of fighting it.
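For comparison, here is roughly what Smi-style tagging looks like if you do commit to raw words. The Tagged type and its layout are illustrative only; a real VM would own the allocator behind it:

```rust
// Smi-style tagging: low bit 0 means a small integer in the upper bits,
// low bit 1 marks a heap pointer. Illustrative, not ilo's value model.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Tagged(usize);

impl Tagged {
    fn from_smi(n: isize) -> Self {
        Tagged((n << 1) as usize) // low bit stays 0
    }
    fn is_smi(self) -> bool {
        self.0 & 1 == 0
    }
    fn as_smi(self) -> isize {
        (self.0 as isize) >> 1 // arithmetic shift restores the sign
    }
    fn from_heap(p: *const u8) -> Self {
        debug_assert!(p as usize & 1 == 0, "heap objects must be aligned");
        Tagged(p as usize | 1) // set the tag bit
    }
    fn as_heap(self) -> *const u8 {
        (self.0 & !1) as *const u8
    }
}
```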

Inline caching

V8’s hidden classes and inline caches are how V8 makes property access fast. The first time you access obj.name, V8 patches the call site to go directly to the right memory offset next time.

This requires self-modifying code or code patching, which is architecturally hostile to Rust. You’d need mutable access to the bytecode while executing it, or a separate IC table with unsafe pointer magic.

I didn’t attempt this. ilo doesn’t have objects/classes in the V8 sense, so inline caching isn’t relevant yet. But if I ever need it, it’ll be a fight.
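For the record, the IC-table flavour (no code patching) can be sketched in safe Rust. The shape ids and object layout here are invented for illustration, not anything ilo has:

```rust
// Side-table inline cache: each access site remembers (shape, slot) and
// falls back to a name lookup on miss.
struct Obj {
    shape: u32, // objects with the same field layout share a shape id
    slots: Vec<i64>,
    names: Vec<&'static str>,
}

#[derive(Default)]
struct CallSiteCache {
    cached: Option<(u32, usize)>, // (shape id, slot index)
    misses: u32,
}

fn get_prop(obj: &Obj, name: &str, ic: &mut CallSiteCache) -> i64 {
    if let Some((shape, slot)) = ic.cached {
        if shape == obj.shape {
            return obj.slots[slot]; // hit: no name lookup at all
        }
    }
    // miss: slow lookup, then remember the slot for this shape
    ic.misses += 1;
    let slot = obj.names.iter().position(|n| *n == name).unwrap();
    ic.cached = Some((obj.shape, slot));
    obj.slots[slot]
}
```

This gets the monomorphic fast path without mutating bytecode; what it cannot do is V8's trick of patching the dispatch itself so even the cache check disappears.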

Tracing JIT

LuaJIT’s trace compiler records hot paths through the interpreter, optimises them, and compiles them to native code. V8’s TurboFan relies on similar speculation and deoptimisation machinery, but compiles whole functions rather than traces.

Building a tracing JIT in Rust is theoretically possible but the complexity is enormous. You need to record traces (easy), optimise them (hard), emit native code (done that, see my ARM64 JIT post), and handle deoptimisation back to the interpreter when guards fail (this is the really hard part).

Out of scope. My hand-rolled ARM64 JIT handles numeric-only functions. A full tracing JIT would be a multi-year project.

The benchmarks

I wrote equivalent benchmarks in LuaJIT, V8, and ilo to see where things stand. Same function: tot(p, q, r) = p*q + p*q*r.

LuaJIT benchmark:

local function tot(p, q, r)
    local s = p * q
    local t = s * r
    return s + t
end

-- warmup with varying inputs to trigger JIT
for i = 1, 1000 do
    tot(i, i+1, i+2)
end

-- benchmark with fixed inputs
local start = os.clock()
for i = 1, 10000 do
    r = tot(10, 20, 30)
end
local elapsed_ns = (os.clock() - start) * 1e9
print(elapsed_ns)

V8 benchmark:

function tot(p, q, r) {
    const s = p * q;
    const t = s * r;
    return s + t;
}

// warmup with varying inputs to trigger JIT
for (let i = 0; i < 1000; i++) {
    tot(i, i+1, i+2);
}

let r;
const start = process.hrtime.bigint();
for (let i = 0; i < 10000; i++) {
    r = tot(10, 20, 30);
}
const elapsed_ns = process.hrtime.bigint() - start;
console.log(elapsed_ns);

LuaJIT and V8 both JIT-compile this to native code after warmup. My register VM with superinstructions is interpreted. The ARM64 JIT compiles to native but without the sophisticated optimisations those runtimes have.

The gap is real. LuaJIT and V8 are significantly faster after warmup because their JITs produce optimised machine code for the entire function. But the ilo VM holds up respectably for an interpreter, and the ARM64 JIT gets close on pure arithmetic.

You can borrow LuaJIT and V8’s ideas (NaN boxing, register bytecode, superinstructions) and get real wins. Their deepest tricks (computed gotos, tracing JIT, inline caches) are either not expressible or require dropping out of Rust into inline assembly or a C shim.

It’s a tradeoff. Rust gives you memory safety guarantees that C doesn’t. The price is that certain low-level performance tricks require unsafe or simply aren’t expressible. For a hobby VM, the tradeoff is fine. I’ll take the safety and accept the overhead. For a production runtime competing with V8? You’d need a very different approach.