ilo: A Programming Language for AI Agents, Not Humans

Nine Syntax Experiments: What I Learned Testing Language Designs Against LLMs

I built nine different syntax variants for ilo and tested them all against Claude Haiku. Most of my initial assumptions about what would save tokens were wrong.

The setup

Each variant implements the same five programs: a simple function, one with dependencies, a data transform, a tool interaction with error handling, and a multi-step workflow. For each variant I measured token count (using the cl100k_base tokenizer) and character count, then threw it at Claude Haiku with zero prior knowledge to see whether it could generate correct code from just a spec and examples.
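
The measurement itself is small. A minimal sketch, assuming the tiktoken library (the full harness lives in compare.py, which comes up again later):

import tiktoken

# Same encoding as the benchmarks below: cl100k_base.
enc = tiktoken.get_encoding("cl100k_base")

def measure(source: str) -> tuple[int, int]:
    # Return (token count, character count) for a program's source text.
    return len(enc.encode(source)), len(source)

tokens, chars = measure("tot p:n q:n r:n>n;s=*p q;t=*s r;+s t")  # idea9's sample function
print(f"{tokens} tokens, {chars} chars")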

The variants

Here’s what the same simple function looks like across a few of them.

idea1, readable, Haskell-ish:

total price:number quantity:number rate:number -> number
  subtotal = multiply price quantity
  tax = multiply subtotal rate
  add subtotal tax

idea7, dense wire format:

total price:n quantity:n rate:n>n;subtotal=*price quantity;tax=*subtotal rate;+subtotal tax

idea9, ultra-dense with short names (the winner):

tot p:n q:n r:n>n;s=*p q;t=*s r;+s t

The results

Idea                          Tokens   vs Py   Chars   vs Py   Score
--------------------------------------------------------------------
python-baseline                  871   1.00x    3635   1.00x       -
idea1                            921   1.06x    3108   0.86x    10.0
idea1-compact                    677   0.78x    2564   0.71x    10.0
idea2-tool-calling               983   1.13x    3203   0.88x    10.0
idea3-constrained-decoding       598   0.69x    2187   0.60x    10.0
idea4-ast-bytecode               584   0.67x    1190   0.33x     9.8
idea5-workflow-dag               710   0.82x    2603   0.72x    10.0
idea6-mcp-composition            956   1.10x    2978   0.82x     9.5
idea7-dense-wire                 351   0.40x    1292   0.36x    10.0
idea8-ultra-dense                285   0.33x     901   0.25x    10.0
idea9-ultra-dense-short          287   0.33x     787   0.22x    10.0

idea9 uses 0.33x the tokens and 0.22x the characters of Python, with 10/10 generation accuracy. Only idea4 (a bytecode-like format, 9.8) and idea6 (9.5) dipped below a perfect score.

The surprises

Positional arguments are the biggest win. Going from charge(pid:pid, amt:amt) to just charge pid amt eliminates parens, colons, and repeated parameter names. That one change was the single largest token reduction I measured. Across 10 variants and 4 task types, positional args scored 10/10 with no parameter-swap errors.
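
You can check the size of that win directly against cl100k_base; this snippet isn’t from the harness, just a quick demonstration:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Keyword-style call vs. the positional form: same information, fewer tokens.
for form in ["charge(pid:pid, amt:amt)", "charge pid amt"]:
    print(f"{form!r}: {len(enc.encode(form))} tokens")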

Short variable names don’t save tokens. order and ord are both single tokens in cl100k_base. The tokenizer already handles common English words efficiently. Abbreviating only saves characters, not tokens. That’s why idea8 (285 tokens) and idea9 (287 tokens) are nearly identical despite idea9 being 114 characters shorter.
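
The claim is easy to verify with the same tokenizer:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Both spellings cost one token each; only the character count differs.
for name in ["order", "ord"]:
    print(f"{name!r}: {len(enc.encode(name))} token(s), {len(name)} chars")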

Sigils beat keywords, but not by as much as you’d think. Replacing match with ? and for with @ saves characters more than tokens. The real win is disambiguation. A sigil can never be confused with a variable name, which reduces generation errors.

Spec quality trumps syntax cleverness. I went from 8/10 to 10/10 generation accuracy just by adding better operator examples to the spec. The spec is part of the prompt. If it’s ambiguous, the LLM will struggle no matter how clean your syntax is.

The bytecode format was a trap. idea4 (integer-ID based AST) had great character efficiency (0.33x) but was one of only two variants that dropped below 10/10 accuracy. Making things machine-optimal at the character level can make them harder for LLMs to work with.
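
To give a flavor of why, here is a hypothetical integer-ID encoding of the sample function. This is illustrative only, not idea4’s actual format:

# Hypothetical opcode table: 1=fn, 2=mul, 3=add, 4=let, 5=ref
# Params occupy slots 0-2 (price, quantity, rate); lets fill slots 3-4.
program = [1, 3,                      # function of three parameters
           [4, [2, [5, 0], [5, 1]]],  # slot 3: subtotal = price * quantity
           [4, [2, [5, 3], [5, 2]]],  # slot 4: tax = subtotal * rate
           [3, [5, 3], [5, 4]]]       # return subtotal + tax

# Dense on the wire, but every symbol the model emits is an opaque
# integer, and one wrong ID silently changes the program's meaning.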

The cold-LLM test

The real test was what I called the “cold-LLM” evaluation. Give Haiku the spec and examples, then ask it to write completely new programs in unfamiliar task domains. Not reformatting; understanding the language and producing novel code.
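
Each trial looked roughly like this. A sketch using the Anthropic Python SDK; the model id, prompt wording, and scoring here are illustrative, not the exact harness:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def cold_llm_attempt(spec: str, examples: str, task: str) -> str:
    # The model sees only the spec and example programs -- no fine-tuning,
    # no prior exposure to ilo -- and must produce a novel program.
    prompt = (
        f"Here is the spec for a small language called ilo:\n{spec}\n\n"
        f"Example programs:\n{examples}\n\n"
        f"Write an ilo program for this task: {task}\n"
        "Output only the program."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # illustrative model id
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

Each response is then scored out of 10, which is where the numbers below come from.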

Per-task breakdown (Full test, /10):

Idea                        workflow  data_pipe  decision  api_orch
--------------------------------------------------------------------
idea7-dense-wire               10.0      10.0      10.0      10.0
idea8-ultra-dense               10.0      10.0      10.0      10.0
idea9-ultra-dense-short         10.0      10.0      10.0      10.0
idea4-ast-bytecode              10.0      10.0       9.0      10.0
idea6-mcp-composition            9.0      10.0       9.0      10.0

The dense formats were learnable, not just parseable.

What I’d do differently

idea1 through idea3 were designed by intuition, without proper benchmarks. The comparison harness in compare.py was what shifted the later variants from guesswork to evidence.