Trying DuckDB With an AI Query Chat

I came across Evidence.dev, an open-source BI tool that runs SQL in the browser via DuckDB-WASM. DuckDB is the part I had not used directly, so I built a small project to try it: an AI chat that turns plain-English questions into SQL and runs them locally against the data.

Repo at github.com/danieljohnmorris/duckdb-chat. The pipeline is: chat box, Claude emits SQL, DuckDB-WASM runs it in the browser, result renders as a table.

DuckDB-WASM runs in a Web Worker

const bundles = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(bundles);
const worker = await duckdb.createWorker(bundle.mainWorker);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
const conn = await db.connect();

getJsDelivrBundles() picks the right WASM bundle based on browser capabilities (parallel build if SharedArrayBuffer plus threads is available, single-threaded otherwise). The query engine runs off the main thread, so a slow query does not freeze the page.

It reads CSV, TSV, and parquet directly

DuckDB has built-in readers for the common file formats. No ETL step, no schema declaration:

CREATE OR REPLACE TABLE starlink AS
SELECT * FROM read_csv('starlink.tsv', delim='\t', header=true, all_varchar=true);

I bundled Jonathan McDowell’s GCAT Starlink subset (12,191 rows, 3.5 MB TSV) and loaded it on boot. The first run failed because the file’s first column header is #JCAT and DuckDB took the # as a comment marker. A one-line regex replace cleaned the header and the load worked.

It range-requests remote parquet from S3

DuckDB-WASM’s httpfs extension can register a remote parquet as a queryable view, with nothing pre-downloaded:

CREATE OR REPLACE VIEW speedtest AS
SELECT * FROM read_parquet('https://ookla-open-data.s3.amazonaws.com/.../tiles.parquet');

The view points at Ookla’s Open Data Q4 2024 fixed-broadband tiles, around 3 million rows in a 360 MB file. Geographic averages, no ISP column, unrelated to Starlink. I added it to test whether the remote-parquet path worked end-to-end.

When you ask “median fixed-broadband download speed in Q4 2024”, DuckDB-WASM:

  1. issues a HEAD to learn the file size,
  2. range-requests the last few KB to read the column-offset footer,
  3. range-requests only the avg_d_kbps column chunks across all row groups,
  4. computes the median in browser memory.

The 360 MB parquet stays in S3. The browser holds chunks for avg_d_kbps and discards the rest. The Ookla bucket sends Access-Control-Allow-Origin: * and exposes ETag, so range requests are not blocked by CORS. I checked with curl -I before wiring the view in.

It speaks Postgres-ish SQL

date_trunc, EXTRACT, try_strptime, try_cast, window functions, quantile_cont. Most of what Claude generated worked first try. The model occasionally writes a SQLite-style or BigQuery-style expression that DuckDB rejects, but the error renders verbatim in the chat and a re-ask usually fixes it.

The Anthropic call from the browser

The page has an API key input. The key writes to localStorage and is read on each submit. Each question becomes a fetch to api.anthropic.com:

fetch('https://api.anthropic.com/v1/messages', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': apiKey,
    'anthropic-version': '2023-06-01',
    'anthropic-dangerous-direct-browser-access': 'true'
  },
  body: JSON.stringify(payload)
});

The anthropic-dangerous-direct-browser-access header is Anthropic’s explicit opt-in for browser-origin requests. The system prompt has both schemas plus a few-shot block of question-to-SQL examples, marked cache_control: { type: 'ephemeral' }. After the first call the prompt is cached; subsequent input tokens are billed at the cached rate.

There’s no server in this stack: the static site sits on any static host, the Anthropic call goes browser-to-api.anthropic.com, and the Ookla parquet goes browser-to-S3 directly.

What I’d reuse

A static site that loads CSV or parquet into DuckDB-WASM and answers SQL questions about it is a small but interesting pattern. The whole stack is a static bundle, with no database server or API gateway behind it. Locally bundled data never leaves the browser; remote data goes browser-to-S3 directly. I expect to reuse this for any small-to-medium dataset I want to share without standing up a backend.