Parquet vs CSV file format

The short answer to why parquet beats CSV at scale: imagine a table with 100 columns and 10 million rows where you want the average of one column. A CSV reader has to walk every byte of all 100 columns in every row to find the one you want. A parquet reader fetches the column you asked for and ignores the other 99. Same data, same query, and, if the columns are of similar width, roughly two orders of magnitude less data read. The rest of the post is the mechanism behind that.

Parquet is an open-source columnar file format, originally developed by Cloudera and Twitter and now maintained as Apache Parquet. Pandas, polars, DuckDB, Spark, BigQuery, Athena, Trino, ClickHouse, and Snowflake all read it natively.

CSV stores one row at a time:

id,name,price
1,foo,12.50
2,bar,9.99

Parquet stores the same data column-wise. Rows are split into row groups, and within each row group all id values sit together, then all name values, then all price values, each in its own compressed chunk. A metadata footer at the end of the file lists the schema and the byte offset of every column chunk in every row group:

id:    1, 2
name:  "foo", "bar"
price: 12.50, 9.99

footer: schema + byte offsets per column chunk

Each column is a contiguous chunk a reader can locate via the footer without scanning anything else.
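To make the footer idea concrete, here is a toy columnar format in plain Python. It is not the Parquet encoding (real Parquet uses Thrift metadata, binary encodings, and row groups); the `TOY1` magic bytes, the JSON chunks, and both function names are invented for this sketch. It only shows the pattern: contiguous column chunks, then a footer of offsets, then the footer length at the very end.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # made-up magic bytes for this toy format


def write_columnar(columns: dict) -> bytes:
    """Write each column as one contiguous chunk, then a footer of offsets."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    footer = {}
    for name, values in columns.items():
        chunk = json.dumps(values).encode("utf-8")
        footer[name] = {"offset": buf.tell(), "length": len(chunk)}
        buf.write(chunk)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    # The last 4 bytes record the footer length, so a reader can find the
    # footer by looking only at the tail of the file.
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()


def read_column(data: bytes, name: str) -> list:
    """Read one column via the footer, without touching any other chunk."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    entry = footer[name]
    start = entry["offset"]
    return json.loads(data[start:start + entry["length"]])


blob = write_columnar({
    "id": [1, 2],
    "name": ["foo", "bar"],
    "price": [12.50, 9.99],
})
print(read_column(blob, "price"))  # [12.5, 9.99]
```

`read_column` slices exactly three regions: the 4-byte length, the footer, and the one chunk it was asked for. The id and name bytes are never parsed.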

On a large file, that layout pays off three ways:

  1. Queries get cheaper. A query that wants one column reads only that column. The footer tells the reader exactly where the column lives, so it jumps straight there. CSV makes the reader walk every row, character by character, to skip the columns it does not want.
  2. Files get smaller. Encoding and compression run per column. A column of prices holds nothing but numbers; a column of timestamps holds nothing but timestamps. Type-aware encodings like dictionary and delta shrink those values first, and a general-purpose codec like zstd then compresses the result tightly. A CSV compressed with gzip sees rows of mixed numbers, strings, and delimiters and has less structure to work with.
  3. Files become queryable over HTTP. A reader fetches the footer first, learns where the columns live, then range-requests only the byte ranges it needs. This is what lets DuckDB-WASM in the browser answer SQL queries against a 360 MB parquet file on S3 without downloading the whole file.

AWS measured this directly on a TPC-H benchmark in Amazon Athena: the same query took 2.1 seconds and scanned 2.0 GB on a parquet table, versus 11.9 seconds and 23.7 GB on the equivalent gzipped CSV (AWS Athena performance tuning). With the parquet sorted on the filter column, the same query dropped to 1.1 seconds and scanned 38.8 MB.

CSV is fine for a few thousand rows you’ll open in a spreadsheet. Past that, the analytical tooling defaults to parquet because the bytes-on-disk and bytes-on-the-wire numbers stop being comparable.