ilo: A Programming Language for AI Agents, Not Humans
A performant language designed for LLMs to write. Optimised for token and character length, and rethought for non-human use cases.
Overview
ilo is a language that AI agents write code in. Not a framework for building agents. A compile target - the language an LLM outputs when it needs to express a program as cheaply and correctly as possible. The only metric is total tokens from intent to working code: spec loading + generation + context loading + error feedback + retries.
Problem
When an LLM writes Python, verbose syntax wastes tokens, ambiguous grammar causes retries, and human-readable formatting burns context window. Every token spent generating, reading, or retrying costs real time and money. Agents need a language optimised for them to write in, not for humans to read.
Constraints
- Must be learnable by an LLM from a short spec alone (until foundation models train on the language natively)
- Must achieve lower token cost than Python for equivalent programs
- Must maintain high generation accuracy (10/10) across diverse task types
Approach
I built nine syntax variants and benchmarked them all with Claude Haiku, measuring token count, character count, and cold-LLM generation accuracy - can Haiku write correct ilo programs from just a spec? I let the data pick the winner, then built a full VM in Rust. There are now four execution backends: a tree-walking interpreter, a register VM, a hand-rolled ARM64 JIT, and a Cranelift JIT (the default, with interpreter fallback).
Key Decisions
Prefix notation instead of infix
(a * b) + c → +*a b c
Eliminates parentheses at every nesting level. Across 25 expression patterns: 22% fewer tokens and 42% fewer characters vs infix.
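A minimal sketch in Rust of why this works (my illustration, not ilo's actual evaluator): every prefix operator has fixed arity and consumes exactly two sub-expressions, so a flat token stream parses unambiguously with no grouping characters at any depth.

fn eval<'a>(toks: &mut impl Iterator<Item = &'a str>) -> f64 {
    match toks.next().expect("unexpected end of expression") {
        "+" => eval(toks) + eval(toks), // fixed arity: exactly two operands follow
        "*" => eval(toks) * eval(toks),
        lit => lit.parse().expect("numeric literal"),
    }
}

fn main() {
    // (2 * 3) + 4 becomes "+ * 2 3 4" - deeper nesting adds operators, never parens.
    let mut toks = "+ * 2 3 4".split_whitespace();
    assert_eq!(eval(&mut toks), 10.0);
}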
Positional arguments instead of named parameters
tot p:n q:n r:n>n;s=*p q;t=*s r;+s t
Eliminates parens, colons, and repeated parameter names. Single largest token reduction across all variants. I expected parameter-swap errors but they never materialised - 10/10 accuracy across all task types.
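The same property extends to user-defined functions: arity is fixed at the definition, so a call site needs no delimiters at all. A sketch of the idea (assumed structure, not the ilo compiler):

enum Expr { Num(f64), Call(&'static str, Vec<Expr>) }

fn parse(toks: &mut impl Iterator<Item = &'static str>) -> Expr {
    match toks.next().expect("unexpected end of input") {
        // tot's arity (3) is known from its definition, so the parser
        // simply consumes the next three expressions as its arguments.
        "tot" => Expr::Call("tot", (0..3).map(|_| parse(toks)).collect()),
        lit => Expr::Num(lit.parse().expect("numeric literal")),
    }
}

fn main() {
    let mut toks = "tot 2 3 4".split_whitespace();
    let _ast = parse(&mut toks);
    assert!(toks.next().is_none()); // no parens, commas, or names needed - nothing ambiguous, nothing left over
}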
Single-character sigils instead of English keywords
?<x 0 !neg ~pos
(? conditional, ! effect, ~ transform)
Sigils can't be confused with variable names or hallucinated into natural-language variations.
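A sketch of the mechanism (token names are my assumption): each sigil lexes as a dedicated token kind rather than an identifier, so there is no English keyword for a model to paraphrase into "if", "when", or "unless".

#[derive(Debug, PartialEq)]
enum Token { Cond, Effect, Transform, Other(char) }

fn lex(c: char) -> Token {
    match c {
        '?' => Token::Cond,      // conditional
        '!' => Token::Effect,    // effect
        '~' => Token::Transform, // transform
        other => Token::Other(other),
    }
}

fn main() {
    assert_eq!(lex('?'), Token::Cond); // '?' can only ever be one thing
}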
Static verifier before execution
verify: undefined variable 'y' in 'f'
hint: did you mean 'x'?
All calls resolve, all types align, all dependencies exist - checked before running anything. Reports all errors at once with did-you-mean hints. Catches malformed programs before execution, cutting retry cycles.
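A sketch of how such hints can be produced (my construction; ilo's verifier may differ): compare the unknown name against every name in scope and suggest the closest one within a small edit-distance budget.

// Classic dynamic-programming Levenshtein distance.
fn levenshtein(a: &str, b: &str) -> usize {
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let subst = prev[j] + usize::from(ca != cb);
            cur.push(subst.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    *prev.last().unwrap()
}

// Suggest the in-scope name closest to the unknown one, if any is within two edits.
fn suggest<'a>(unknown: &str, in_scope: &[&'a str]) -> Option<&'a str> {
    in_scope.iter()
        .map(|&name| (levenshtein(unknown, name), name))
        .filter(|&(d, _)| d <= 2)
        .min_by_key(|&(d, _)| d)
        .map(|(_, name)| name)
}

fn main() {
    assert_eq!(suggest("y", &["x", "tot"]), Some("x")); // hint: did you mean 'x'?
}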
NaN boxing for value representation
Every value fits in 8 bytes. Numbers are zero-cost (just raw double bits). The stack becomes a Vec<u64> with contiguous memory and no pointer chasing into the heap.
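A sketch of the representation (the tag layout below is the classic Lox-style scheme; ilo's exact bit assignments are my assumption):

// Quiet-NaN bits plus one spare bit, so boxed values can never
// collide with a real double - including genuine NaNs.
const QNAN: u64 = 0x7ffc_0000_0000_0000;
const TAG_TRUE: u64 = 3;

fn box_num(n: f64) -> u64 { n.to_bits() }        // zero-cost: raw double bits
fn as_num(v: u64) -> f64 { f64::from_bits(v) }
fn is_num(v: u64) -> bool { (v & QNAN) != QNAN } // anything outside the NaN payload space
fn box_true() -> u64 { QNAN | TAG_TRUE }         // non-numbers live inside it

fn main() {
    let stack: Vec<u64> = vec![box_num(1.5), box_true()]; // contiguous, 8 bytes per value
    assert!(is_num(stack[0]) && as_num(stack[0]) == 1.5);
    assert!(!is_num(stack[1]));
}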
Register-based VM instead of stack-based
Reduced instruction count by 67% and improved performance by 31% over the stack-based design. Fewer dispatches matter more than simpler instructions.
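A sketch of the difference (opcode names assumed): for (a * b) + c, a stack VM issues five instructions - three pushes, a mul, an add - each a separate dispatch, while a register VM names its operands explicitly and needs two.

enum Op {
    Mul { dst: u8, a: u8, b: u8 },
    Add { dst: u8, a: u8, b: u8 },
}

fn run(code: &[Op], regs: &mut [f64]) {
    for op in code {
        match *op {
            Op::Mul { dst, a, b } => regs[dst as usize] = regs[a as usize] * regs[b as usize],
            Op::Add { dst, a, b } => regs[dst as usize] = regs[a as usize] + regs[b as usize],
        }
    }
}

fn main() {
    // (a * b) + c with a=2, b=3, c=4 in r0..r2: two dispatches total.
    let mut regs = [2.0, 3.0, 4.0, 0.0];
    let code = [Op::Mul { dst: 3, a: 0, b: 1 }, Op::Add { dst: 3, a: 3, b: 2 }];
    run(&code, &mut regs);
    assert_eq!(regs[3], 10.0);
}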
Tech Stack
Rust, ilo, Language Design, AI Agents, JIT
Result & Impact
- Token efficiency: 0.33x Python (287 tokens vs 871)
- Character efficiency: 0.22x Python (787 chars vs 3635)
- LLM generation accuracy: 10/10 across 4 task types
- VM performance: 83ns/call register VM, 2ns/call JIT (tot benchmark)
An LLM given only the spec writes correct ilo programs with no prior training. Tested across workflow, data pipeline, decision, and API orchestration tasks. Caveat: the current instruction set is small (arithmetic, matching, basic control flow). These numbers reflect a simple benchmark. As the vocabulary and instruction set expand, performance characteristics will change.
Learnings
- Positional arguments are the single biggest token saver. I expected parameter-swap errors but they never appeared. 10/10 accuracy across all task types.
- Prefix notation compounds savings at every nesting level. The deeper the expression, the more tokens saved.
- Abbreviations don't save tokens. Most tokenisers already encode common English words as a single token. What does cost tokens: hyphens. A hyphenated name is at least 2 tokens because the hyphen forces a split (see the sketch after this list).
- Spec quality matters more than syntax cleverness. Better examples in the spec moved accuracy from 8/10 to 10/10.
- Reporting all verification errors at once with hints is cheaper than letting agents discover them through execution retries.
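The hyphen claim is easy to check empirically. A sketch using the tiktoken-rs crate (the crate and its API are my assumption here, not part of ilo):

use tiktoken_rs::cl100k_base;

fn main() {
    let bpe = cl100k_base().unwrap();
    for name in ["total", "tot", "runtotal", "run-total"] {
        // Print how each spelling tokenises; the hyphen forces a split.
        println!("{name}: {} tokens", bpe.encode_ordinary(name).len());
    }
}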