ilo’s VM optimisation led to an obvious question: what if I just emitted raw ARM64 instructions and jumped to them? No LLVM, no Cranelift, no framework. Just bytes in memory, mapped executable.
The idea
ilo’s VM already compiles to a register-based bytecode where numeric functions use registers r0-r30. ARM64 has 32 floating-point registers, d0-d31, so the mapping is one-to-one. Function arguments arrive in d0-d7 per the AAPCS64 calling convention, which is exactly how the VM already lays out parameters.
So for a numeric-only function, the “JIT” is almost mechanical: walk the bytecode, emit the corresponding ARM64 floating-point instruction, done.
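To make that concrete, here is the shape of the translation (hypothetical bytecode mnemonics — ilo’s actual opcode names differ):

```
; VM bytecode           ; emitted ARM64
ADD r2, r0, r1          fadd d2, d0, d1
MUL r3, r2, r2          fmul d3, d2, d2
RET r3                  fmov d0, d3
                        ret
```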
ARM64 instruction encoding
ARM64 instructions are fixed-width 32-bit words. The floating-point arithmetic ones are pleasantly regular:
fn arm64_fadd(rd: u8, rn: u8, rm: u8) -> u32 {
    // FADD Dd, Dn, Dm (double-precision)
    0x1E602800 | ((rm as u32) << 16) | ((rn as u32) << 5) | rd as u32
}
fn arm64_fsub(rd: u8, rn: u8, rm: u8) -> u32 {
    // FSUB Dd, Dn, Dm
    0x1E603800 | ((rm as u32) << 16) | ((rn as u32) << 5) | rd as u32
}
fn arm64_fmul(rd: u8, rn: u8, rm: u8) -> u32 {
    // FMUL Dd, Dn, Dm
    0x1E600800 | ((rm as u32) << 16) | ((rn as u32) << 5) | rd as u32
}
fn arm64_fdiv(rd: u8, rn: u8, rm: u8) -> u32 {
    // FDIV Dd, Dn, Dm
    0x1E601800 | ((rm as u32) << 16) | ((rn as u32) << 5) | rd as u32
}
fn arm64_ret() -> u32 {
    0xD65F03C0 // RET (branch to x30)
}
Each one is just a base encoding with register numbers shifted into place. You can write the whole instruction encoder in about 30 lines.
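It’s easy to sanity-check the encoders against hand-computed words from the ARM64 encoding tables (a minimal sketch — the expected values below were worked out from the bit layout above):

```rust
// FADD Dd, Dn, Dm (double-precision), as in the encoder above.
fn arm64_fadd(rd: u8, rn: u8, rm: u8) -> u32 {
    0x1E602800 | ((rm as u32) << 16) | ((rn as u32) << 5) | rd as u32
}

fn main() {
    // fadd d0, d1, d2
    assert_eq!(arm64_fadd(0, 1, 2), 0x1E622820);
    // fadd d3, d3, d3 — each register field lands in a distinct bit position
    assert_eq!(arm64_fadd(3, 3, 3), 0x1E632863);
    println!("encodings ok");
}
```

Piping the resulting words into a disassembler (e.g. `llvm-objdump`) is a quick way to catch an off-by-one in a shift.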
Compilation
The compiler walks the bytecode and emits native instructions:
match op {
    OP_ADD_NN => { self.emit(arm64_fadd(a, b, c)); }
    OP_SUB_NN => { self.emit(arm64_fsub(a, b, c)); }
    OP_MUL_NN => { self.emit(arm64_fmul(a, b, c)); }
    OP_DIV_NN => { self.emit(arm64_fdiv(a, b, c)); }
    OP_MOVE => { if a != b { self.emit(arm64_fmov(a, b)); } }
    OP_NEG => { self.emit(arm64_fneg(a, b)); }
    OP_RET => {
        if a != 0 { self.emit(arm64_fmov(0, a)); }
        self.emit(arm64_ret());
    }
    _ => return None, // not eligible
}
Constants are trickier. They get placed in a data section after the code and loaded via PC-relative addressing (ADR to compute the address, LDR to load the double).
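The encoders for that pair might look like this (a sketch, assuming the constant pool sits within ADR’s ±1 MB range of the code; `arm64_ldr_d` is the unsigned-offset form of the 64-bit FP load):

```rust
// ADR Xd, #offset — PC-relative address, ±1 MB range.
// The 21-bit immediate is split: low 2 bits go to bits 29-30 (immlo),
// the remaining 19 bits to bits 5-23 (immhi).
fn arm64_adr(rd: u8, offset: i32) -> u32 {
    let imm = (offset as u32) & 0x1F_FFFF;
    0x10000000 | ((imm & 3) << 29) | ((imm >> 2) << 5) | rd as u32
}

// LDR Dt, [Xn, #imm] — load a double; imm must be a multiple of 8,
// encoded scaled in bits 10-21.
fn arm64_ldr_d(rt: u8, rn: u8, imm_bytes: u32) -> u32 {
    0xFD400000 | ((imm_bytes / 8) << 10) | ((rn as u32) << 5) | rt as u32
}

fn main() {
    // adr x9, #8 — address two instructions ahead
    assert_eq!(arm64_adr(9, 8), 0x10000049);
    // ldr d1, [x9]
    assert_eq!(arm64_ldr_d(1, 9, 0), 0xFD400121);
    println!("constant-load encodings ok");
}
```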
The scary part: making memory executable
macOS on Apple Silicon has W^X enforcement and requires special JIT entitlements. The allocation dance:
// Allocate with MAP_JIT
let ptr = libc::mmap(
    std::ptr::null_mut(),
    alloc_size,
    libc::PROT_READ | libc::PROT_WRITE,
    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_JIT,
    -1, 0,
);
let buffer = ptr as *mut u8;
// Allow writing to JIT memory (macOS-specific)
pthread_jit_write_protect_np(0);
// Copy code and constant pool into the buffer
std::ptr::copy_nonoverlapping(code_ptr, buffer, code_bytes);
std::ptr::copy_nonoverlapping(const_ptr, buffer.add(code_bytes), const_bytes);
// Re-enable write protection
pthread_jit_write_protect_np(1);
// Make executable
libc::mprotect(ptr, alloc_size, libc::PROT_READ | libc::PROT_EXEC);
// Flush instruction cache (critical on ARM!)
sys_icache_invalidate(ptr, alloc_size);
That last line is important. ARM has separate instruction and data caches. If you don’t flush, you might execute stale cache contents. On x86 you can get away without it because instruction cache coherency is maintained by hardware. On ARM, it’s your problem.
Calling the compiled code
The compiled buffer is a function pointer. Calling it means transmute-ing a raw pointer to an extern "C" fn:
fn call_3(&self, a0: f64, a1: f64, a2: f64) -> f64 {
    let f: extern "C" fn(f64, f64, f64) -> f64 =
        unsafe { std::mem::transmute(self.ptr) };
    f(a0, a1, a2)
}
I wrote variants for 0-8 arguments. The arguments land directly in d0-d7 per the calling convention, which is where the compiled code expects them.
What it can and can’t do
The JIT only handles numeric-only functions. No strings, no lists, no control flow, no function calls. There’s an eligibility check:
pub(crate) fn is_jit_eligible(chunk: &Chunk) -> bool {
    for &inst in &chunk.code {
        let op = (inst >> 24) as u8;
        match op {
            OP_ADD_NN | OP_SUB_NN | OP_MUL_NN | OP_DIV_NN |
            OP_ADDK_N | OP_SUBK_N | OP_MULK_N | OP_DIVK_N |
            OP_MOVE | OP_NEG | OP_RET => {}
            OP_LOADK => {
                // bx: constant-pool index from the instruction's operand
                // field (decoded per ilo's bytecode layout)
                let bx = (inst & 0xFFFF) as usize;
                if !matches!(chunk.constants[bx], Value::Number(_)) {
                    return false;
                }
            }
            _ => return false,
        }
    }
    true
}
If any instruction touches heap memory, control flow, or non-numeric types, it falls back to the bytecode VM. This is a tiny fraction of real programs, but it’s the fraction that matters for benchmarking pure arithmetic.
The whole thing is 351 lines
A working ARM64 JIT backend (instruction encoding, compilation, memory management, constant pools, cache flushing) in 351 lines of Rust. It’s minimal. No optimiser, no register allocator (it uses the bytecode’s register assignments directly), no debug info. It runs tot(10, 20, 30) at native speed.
When this approach makes sense
If you need a production JIT, use Cranelift or LLVM. But for a restricted instruction set like this one, the ARM64 encoding is clean and well-documented, the macOS JIT APIs are fiddly but workable, and the whole thing is small enough to hold in your head.