ilo: A Programming Language for AI Agents, Not Humans

NaN Boxing in Rust: Cramming Every Value Into 8 Bytes

LuaJIT packs every value into 64 bits, hiding non-number values inside unused NaN bit patterns. The technique is called NaN boxing. I implemented it in Rust for ilo’s VM.

The problem

A typical dynamic language VM needs to represent multiple types: numbers, strings, booleans, lists, nil, etc. The naive approach is an enum:

use std::collections::HashMap;

pub enum Value {
    Number(f64),
    Text(String),
    Bool(bool),
    Nil,
    List(Vec<Value>),
    Record { type_name: String, fields: HashMap<String, Value> },
    Ok(Box<Value>),
    Err(Box<Value>),
}

This is fine for an interpreter, and it’s what ilo’s tree-walker uses. But for a bytecode VM where the stack is the hot path, it’s expensive. Each Value is at least 24 bytes (the largest variant plus a discriminant). Cloning requires matching on every variant. The stack becomes a Vec<Value> with scattered heap allocations and poor cache behaviour.

The trick

IEEE 754 doubles reserve a whole class of bit patterns for NaN (Not a Number): exponent bits all ones, mantissa non-zero. That leaves quadrillions of valid NaN bit patterns. The standard only requires those few bits to be set. The rest are “don’t care” bits.

NaN boxing exploits this: if the value is a regular number, store it as-is. If it’s anything else, encode a type tag and a pointer into the NaN payload bits.

const QNAN: u64       = 0x7FFC_0000_0000_0000;
const TAG_NIL: u64    = QNAN;
const TAG_TRUE: u64   = QNAN | 1;
const TAG_FALSE: u64  = QNAN | 2;
const TAG_STRING: u64 = 0x7FFD_0000_0000_0000;
const TAG_LIST: u64   = 0x7FFE_0000_0000_0000;
const TAG_RECORD: u64 = 0x7FFF_0000_0000_0000;
const TAG_OK: u64     = 0xFFFC_0000_0000_0000;
const TAG_ERR: u64    = 0xFFFD_0000_0000_0000;
const PTR_MASK: u64   = 0x0000_FFFF_FFFF_FFFF;
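
Every one of these tags has to decode as a NaN, or a tagged value could collide with a real number. A quick self-contained check of the constants above:

```rust
fn main() {
    const QNAN: u64 = 0x7FFC_0000_0000_0000;
    // The eight tag constants from above: nil, true, false,
    // string, list, record, ok, err.
    const TAGS: [u64; 8] = [
        QNAN,
        QNAN | 1,
        QNAN | 2,
        0x7FFD_0000_0000_0000,
        0x7FFE_0000_0000_0000,
        0x7FFF_0000_0000_0000,
        0xFFFC_0000_0000_0000,
        0xFFFD_0000_0000_0000,
    ];
    for tag in TAGS {
        // Exponent all ones + non-zero mantissa => NaN, never a usable double.
        assert!(f64::from_bits(tag).is_nan());
    }
}
```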

Now every value fits in a single u64:

#[derive(Clone, Copy)]
pub(crate) struct NanVal(u64);

That Copy derive is doing a lot of work. The stack becomes Vec<u64>. Contiguous memory, 8 bytes per slot, no heap chasing.
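
Nil and the booleans are the easy cases: the tag is the entire value, no pointer needed. A sketch with illustrative constructor names (not ilo’s actual API):

```rust
const QNAN: u64 = 0x7FFC_0000_0000_0000;
const TAG_NIL: u64 = QNAN;
const TAG_TRUE: u64 = QNAN | 1;
const TAG_FALSE: u64 = QNAN | 2;

#[derive(Clone, Copy)]
struct NanVal(u64);

impl NanVal {
    // The singletons are compile-time constants: no allocation, ever.
    const NIL: NanVal = NanVal(TAG_NIL);

    fn from_bool(b: bool) -> Self {
        NanVal(if b { TAG_TRUE } else { TAG_FALSE })
    }

    fn is_nil(self) -> bool {
        self.0 == TAG_NIL
    }

    fn as_bool(self) -> bool {
        // Only meaningful when the value is known to be a boolean.
        self.0 == TAG_TRUE
    }
}

fn main() {
    assert!(NanVal::NIL.is_nil());
    assert!(NanVal::from_bool(true).as_bool());
    assert!(!NanVal::from_bool(false).as_bool());
}
```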

Number operations have no overhead

With NaN boxing, numbers are just floats. No encoding, no decoding. You check whether a value is a number by testing that its bits don’t match the quiet-NaN tag:

#[inline]
pub(crate) fn is_number(self) -> bool {
    (self.0 & QNAN) != QNAN
}

#[inline]
pub(crate) fn as_number(self) -> f64 {
    f64::from_bits(self.0)
}

For a numeric workload, this means zero overhead on the hot path. No tag checks, no unboxing. The bits in the register are already the double you want.
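
A self-contained demonstration that the number path really is just bit casts (the constructor name is illustrative):

```rust
const QNAN: u64 = 0x7FFC_0000_0000_0000;

#[derive(Clone, Copy)]
struct NanVal(u64);

impl NanVal {
    fn from_number(n: f64) -> Self {
        // The payload IS the float: no shifting, no masking.
        NanVal(n.to_bits())
    }

    fn is_number(self) -> bool {
        (self.0 & QNAN) != QNAN
    }

    fn as_number(self) -> f64 {
        f64::from_bits(self.0)
    }
}

fn main() {
    let v = NanVal::from_number(3.25);
    assert!(v.is_number());
    assert_eq!(v.as_number(), 3.25);
    // One machine word per value, same as a bare f64.
    assert_eq!(std::mem::size_of::<NanVal>(), 8);
}
```

Note that an ordinary arithmetic NaN (0x7FF8_…) still classifies as a number here, because it sets only one of the two tag bits in QNAN; the boxed tags deliberately set both.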

The unsafe cost

Non-numeric values store Rc pointers in the NaN payload:

fn heap_string(s: String) -> Self {
    let rc = Rc::new(HeapObj::Str(s));
    let ptr = Rc::into_raw(rc) as u64;
    NanVal(TAG_STRING | (ptr & PTR_MASK))
}

We turn an Rc into a raw pointer, stuff it into the low 48 bits, and tag the high bits. To read it back:

unsafe fn as_heap_ref<'a>(self) -> &'a HeapObj {
    let ptr = (self.0 & PTR_MASK) as *const HeapObj;
    unsafe { &*ptr }
}

Reference counting is manual:

#[inline(always)]
fn clone_rc(self) {
    if self.is_heap() {
        let ptr = (self.0 & PTR_MASK) as *const HeapObj;
        unsafe { Rc::increment_strong_count(ptr); }
    }
}

#[inline(always)]
fn drop_rc(self) {
    if self.is_heap() {
        let ptr = (self.0 & PTR_MASK) as *const HeapObj;
        unsafe { Rc::decrement_strong_count(ptr); }
    }
}
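
Putting the pieces together as a runnable sketch, with a single heap type and an illustrative tag mask. It relies on mainstream 64-bit platforms keeping userland pointers within the low 48 bits, which is what makes PTR_MASK a lossless round-trip:

```rust
use std::rc::Rc;

const TAG_STRING: u64 = 0x7FFD_0000_0000_0000;
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000; // hypothetical: high 16 bits hold the tag
const PTR_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

enum HeapObj {
    Str(String),
}

#[derive(Clone, Copy)]
struct NanVal(u64);

impl NanVal {
    fn heap_string(s: String) -> Self {
        let rc = Rc::new(HeapObj::Str(s));
        // Leak the Rc into a raw pointer; the high bits remember the type.
        let ptr = Rc::into_raw(rc) as u64;
        NanVal(TAG_STRING | (ptr & PTR_MASK))
    }

    fn is_heap(self) -> bool {
        // In this sketch, strings are the only heap values.
        (self.0 & TAG_MASK) == TAG_STRING
    }

    // SAFETY: the value must hold a live heap pointer.
    unsafe fn as_heap_ref<'a>(self) -> &'a HeapObj {
        let ptr = (self.0 & PTR_MASK) as *const HeapObj;
        unsafe { &*ptr }
    }

    fn clone_rc(self) {
        if self.is_heap() {
            let ptr = (self.0 & PTR_MASK) as *const HeapObj;
            unsafe { Rc::increment_strong_count(ptr) }
        }
    }

    fn drop_rc(self) {
        if self.is_heap() {
            let ptr = (self.0 & PTR_MASK) as *const HeapObj;
            unsafe { Rc::decrement_strong_count(ptr) }
        }
    }
}

fn main() {
    let v = NanVal::heap_string("hello".to_string());
    let copy = v; // bitwise copy: the refcount did NOT go up...
    copy.clone_rc(); // ...until we bump it by hand.
    v.drop_rc(); // one owner left, data still alive
    let HeapObj::Str(s) = unsafe { copy.as_heap_ref() };
    assert_eq!(s, "hello");
    copy.drop_rc(); // last owner: frees the allocation
}
```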

This is far from idiomatic Rust. You’re manually managing reference counts, casting raw pointers, transmuting bit patterns. The borrow checker can’t help you here. One missed drop_rc and you’ve got a memory leak. One extra and you’ve got a use-after-free.

Where LuaJIT’s approach fights Rust

LuaJIT does this in C, where pointer casting is routine. In Rust, every one of these operations is unsafe. And not “technically unsafe but obviously fine” - unsafe in the sense that the compiler can’t verify correctness.

The dispatch loop is where it gets gnarly. Every register access goes through unchecked indexing:

macro_rules! reg {
    ($idx:expr) => { unsafe { *self.stack.get_unchecked($idx) } }
}
macro_rules! reg_set {
    ($idx:expr, $val:expr) => { unsafe {
        let slot = self.stack.as_mut_ptr().add($idx);
        (*slot).drop_rc();
        *slot = $val;
    } }
}

This is necessary for performance. Bounds checks in the inner loop are measurable. But it means a compiler bug or a bad register allocation will segfault instead of panicking.
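
To see the pattern in context, here is a stripped-down, runnable imitation of the register access. The opcode layout and names are invented for illustration, and unlike the real reg_set! this one skips the old slot’s refcount drop, since plain number bits make it a no-op:

```rust
// Minimal sketch of the register-access pattern, not ilo's actual VM.
struct Vm {
    stack: Vec<u64>,
}

macro_rules! reg {
    ($vm:expr, $idx:expr) => {
        unsafe { *$vm.stack.get_unchecked($idx) }
    };
}

macro_rules! reg_set {
    ($vm:expr, $idx:expr, $val:expr) => {
        unsafe { *$vm.stack.get_unchecked_mut($idx) = $val }
    };
}

impl Vm {
    // ADD a, b, c: r[a] = r[b] + r[c], with numbers stored as raw f64 bits.
    fn op_add(&mut self, a: usize, b: usize, c: usize) {
        let x = f64::from_bits(reg!(self, b));
        let y = f64::from_bits(reg!(self, c));
        reg_set!(self, a, (x + y).to_bits());
    }
}

fn main() {
    let mut vm = Vm { stack: vec![0; 4] };
    vm.stack[1] = 2.0f64.to_bits();
    vm.stack[2] = 3.5f64.to_bits();
    vm.op_add(0, 1, 2);
    assert_eq!(f64::from_bits(vm.stack[0]), 5.5);
}
```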

Was it worth it?

Before NaN boxing, the VM was running tot(10, 20, 30) at about 172ns per call. After: 156ns, then down to 66ns after the register VM rewrite. The NaN boxing itself wasn’t the whole story. It enabled the register VM by making the stack representation uniform. It was a necessary foundation.

The real payoff was in what it unlocked. Once every value is a u64, you can skip reference counting for numbers entirely:

OP_MOVE => {
    let v = reg!(b);
    if !v.is_number() { v.clone_rc(); }
    reg_set!(a, v);
}

Numbers are the common case in arithmetic-heavy code. Skipping RC for them is a meaningful win.

NaN boxing is a proven technique but it’s hostile to Rust’s safety model. You’re opting out of everything the language gives you. The tests and RC balancing are the only safety net.

The performance win isn’t from NaN boxing alone. It’s from what it enables: a uniform stack representation, Copy semantics for the common case, and elimination of enum matching in the hot path.