ilo: A Programming Language for AI Agents, Not Humans

Data Pipelines in 50 Tokens

I built ilo for AI agents to write, but the use case I kept coming back to was data munging. Fetch something from an API, parse it, reshape it, write it somewhere. The kind of task that’s ten lines of Python with three imports, or forty tokens of boilerplate before you get to the actual logic.

ilo v0.7.0 closes that loop. With grp, flat, sum, avg, rgx, and structured wr, the full pipeline is now expressible.

Common data pipeline shape

A data task has five stages:

fetch → parse → transform → aggregate → write

Each stage maps to builtins that already exist or were added in v0.7.0:

Stage       Builtins
Fetch       get / $, post, env, rd
Parse       jpar, jpth, rdb, spl, rgx
Transform   map, flt, fld, srt, unq, trm, flat
Aggregate   grp, sum, avg
Write       wr path data "csv" / "tsv" / "json"

A real example

Pull JSON from an API, extract prices, group by category, average each group, write to CSV.

The human-readable version:

cat x:?>{a;b}=x;a              -- extract category from record
price x:?>{a;b}=x;b            -- extract price from record
fetch url:t>L ?                 -- fetch JSON, parse, extract items
  data=jpar! ($! url)           -- GET url, parse JSON, auto-unwrap errors
  jpth "$.items[*]" data        -- extract items array via JSON path

proc url:t out:t>R t t          -- full pipeline entry point
  items=fetch! url              -- call fetch, unwrap result
  groups=grp cat items          -- group items by category
  rows=map avg (mvals groups)   -- average each group's prices
  wr out rows "csv"             -- write results as CSV

The version an AI agent would generate:

c x:?>{a;b}=x;a p x:?>{a;b}=x;b f u:t>L ?;d=jpar! ($! u);jpth "$.items[*]" d r u:t o:t>R t t;i=f! u;g=grp c i;v=map avg (mvals g);wr o v "csv"

Same program. The dense form is what the agent emits - no whitespace, no newlines, short names. ilo --expanded reformats it to the version above for humans to read. Both run identically.

What each new builtin does

grp fn xs groups a list by a key function: it calls fn on each element to produce a key and returns a map from key to list of elements. This is the operation that makes most data reshaping possible - without it you’re writing manual loops and map insertions.

cl x:n>t;>x 5{"big"}{"small"}   -- classify by threshold
f xs:L n>M t L n;grp cl xs     -- → {"small": [1,3], "big": [8,9]}
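To pin down the semantics, here is a rough Python equivalent of grp as described above - a sketch inferred from the description, not ilo's implementation; cl mirrors the threshold classifier from the ilo example:

```python
from collections import defaultdict

def grp(fn, xs):
    # Group elements by the key fn produces, preserving encounter order.
    groups = defaultdict(list)
    for x in xs:
        groups[fn(x)].append(x)
    return dict(groups)

def cl(x):
    # Classify by threshold, as in the ilo example.
    return "big" if x > 5 else "small"

print(grp(cl, [1, 3, 8, 9]))  # {'small': [1, 3], 'big': [8, 9]}
```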

sum xs and avg xs handle numeric aggregation. sum returns 0 for an empty list; avg errors on empty - there’s no meaningful average of nothing.

f xs:L n>n;sum xs    -- f 1,2,3,4,5 → 15
f xs:L n>n;avg xs    -- f 2,4,6 → 4
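The edge-case behavior maps onto Python's stdlib directly - shown here as a comparison sketch; statistics.mean likewise refuses an empty input:

```python
import statistics

assert sum([]) == 0                 # empty sum is well-defined
assert sum([1, 2, 3, 4, 5]) == 15
assert statistics.mean([2, 4, 6]) == 4

try:
    statistics.mean([])             # average of nothing is an error
except statistics.StatisticsError as e:
    print("avg of []:", e)
```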

flat xs flattens one level of nesting. [[1,2],[3]] becomes [1,2,3]. Non-list elements pass through unchanged. Useful after operations that produce lists of lists - like splitting each line of a file.
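A Python sketch of that one-level flatten, following the description above (not ilo's implementation):

```python
def flat(xs):
    # Flatten exactly one level; non-list elements pass through unchanged.
    out = []
    for x in xs:
        if isinstance(x, list):
            out.extend(x)
        else:
            out.append(x)
    return out

print(flat([[1, 2], [3]]))     # [1, 2, 3]
print(flat([[1, 2], 3, [4]]))  # [1, 2, 3, 4]
```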

rgx pat s does regex matching. Without capture groups it returns all matches. With capture groups it returns the captures from the first match. Empty list on no match.

f s:t>L t;rgx "\d+" s    -- f "abc 123 def 456" → ["123", "456"]
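The two modes - all matches without groups, first-match captures with groups - look like this in Python's re module; a sketch of the described behavior, not ilo's code:

```python
import re

def rgx(pat, s):
    # No capture groups: return every match.
    # With capture groups: return the captures of the first match.
    # Empty list when nothing matches.
    regex = re.compile(pat)
    if regex.groups == 0:
        return regex.findall(s)
    m = regex.search(s)
    return list(m.groups()) if m else []

print(rgx(r"\d+", "abc 123 def 456"))         # ['123', '456']
print(rgx(r"(\w+)@(\w+)", "mail: bob@host"))  # ['bob', 'host']
print(rgx(r"\d+", "no digits"))               # []
```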

wr path data "csv" extends the existing wr builtin with a third argument for structured output. Pass "csv", "tsv", or "json". CSV and TSV handle quoting automatically. The two-argument form (wr path text) still works for raw text.
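For a sense of what "handle quoting automatically" means, here is the same behavior via Python's csv module - a value containing the delimiter comes out quoted (the sample rows are made up):

```python
import csv
import io

rows = [["category", "avg"], ["toys, games", 9.5], ["books", 4.0]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quoting applied only where needed
print(buf.getvalue())
# category,avg
# "toys, games",9.5
# books,4.0
```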

Why these six

I looked at what I was reaching for when writing data scripts and kept hitting the same gaps. I could fetch and parse fine. I could filter and map. But the moment I needed to group rows by a column, or sum a field, or write CSV output, I had to drop out of ilo and use something else.

These six builtins are the smallest set where not having them meant the pipeline couldn’t finish in ilo.

Token cost

The CSV pipeline above is around 50 tokens. The equivalent Python - requests.get, json.loads, itertools.groupby or a dict comprehension, statistics.mean, csv.writer - runs closer to 150, plus imports. For an AI agent paying per token, that’s a 3x difference on a routine task.

More importantly, each stage is a single call. There’s no setup, no context manager, no iterator protocol. The agent generates one line per stage and moves on.
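For comparison, a stdlib-only sketch of that Python version, with the network fetch stubbed out as an inline JSON string - the payload shape and field names are invented for illustration:

```python
import csv
import io
import json
import statistics
from collections import defaultdict

# Stand-in for the fetch stage; requests.get(url).text would go here.
payload = ('{"items": [{"cat": "books", "price": 4},'
           ' {"cat": "books", "price": 6},'
           ' {"cat": "toys", "price": 9}]}')

items = json.loads(payload)["items"]      # parse + extract

groups = defaultdict(list)                # group prices by category
for it in items:
    groups[it["cat"]].append(it["price"])

rows = [[k, statistics.mean(v)] for k, v in groups.items()]  # aggregate

buf = io.StringIO()                       # write as CSV
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```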

Composing with >>

The proc function from the example above uses intermediate bindings - groups, rows, keys. With >>, the same logic chains without naming anything:

proc url:t out:t>R t t                             -- same pipeline, with >>
  rows=fetch! url >> grp cat >> mvals >> map avg   -- fetch → group → average
  wr out rows "csv"                                -- write results as CSV

The agent version:

c x:?>{a;b}=x;a p x:?>{a;b}=x;b f u:t>L ?;d=jpar! ($! u);jpth "$.items[*]" d r u:t o:t>R t t;v=f! u >> grp c >> mvals >> map avg;wr o v "csv"

Group by category, get the values (lists of items per category), average each group. The whole transform collapses into one chained line, with no intermediate names except the final result that gets written.

Using named functions instead of lambdas means cat and avg are just names, one token each. The pipe removes the bindings between stages, so the pipeline reads left to right with minimal syntax.
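The >> chaining can be mimicked in Python with a small pipe helper - a sketch to show the shape, with hypothetical stages standing in for grp cat and map avg:

```python
from functools import reduce

def pipe(x, *fns):
    # Thread a value left to right through fns, like ilo's >> chain.
    return reduce(lambda acc, f: f(acc), fns, x)

# Hypothetical stages for illustration:
double = lambda xs: [n * 2 for n in xs]
total = sum

print(pipe([1, 2, 3], double, total))  # 12
```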