Columnar storage is an on-disk data layout where each column of a table is stored contiguously, instead of each row. The layout reduces I/O for queries that read a subset of columns from a wide table, and enables compression ratios of 5–10× by placing values of the same type adjacent to one another on disk. ClickHouse, Snowflake, BigQuery, and DuckDB all use it as their default storage model.
How columnar storage works #
A row-oriented database writes each tuple as a contiguous record on a page — every column of row 1, then every column of row 2. A columnar layout inverts this: every value of column A is stored together, every value of column B is stored together. A query that touches three columns of a 50-column table reads ~6% of the bytes a row store would read for the same query.
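The inversion can be sketched in a few lines of Python. This is a toy model, not any engine's real page format: a 50-column table of 4-byte integers, serialized once row-major and once column-major, to show where the ~6% figure comes from.

```python
import struct

# Hypothetical 50-column table of 4-byte ints, 1000 rows (illustrative sizes).
NUM_COLS, NUM_ROWS = 50, 1000
table = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Row layout: every column of row 1, then every column of row 2, ...
# Reading any one column means scanning every row's full record.
row_major = b"".join(struct.pack(f"{NUM_COLS}i", *row) for row in table)

# Column layout: all of column A, then all of column B, ...
# Reading three columns touches only three contiguous byte runs.
col_major = b"".join(
    struct.pack(f"{NUM_ROWS}i", *(row[c] for row in table))
    for c in range(NUM_COLS)
)

bytes_total = len(row_major)          # same total bytes in either layout
bytes_for_3_cols = 3 * NUM_ROWS * 4   # contiguous reads in the column layout
print(bytes_for_3_cols / bytes_total) # 0.06, i.e. ~6% of the bytes
```

The total bytes on disk are identical; only which bytes a three-column query must touch changes.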
Adjacent values in a column share a type and often share patterns — sorted timestamps, repeated categoricals, monotonic IDs — so encodings like dictionary, run-length, delta, and Gorilla compress them far better than the mixed-type tuples in a row store. A general-purpose codec (LZ4, ZSTD) sits on top of the encoded column. See how columnar compression works for the encoding stack in detail.
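A minimal sketch of that stack on a low-cardinality column: dictionary encoding, then run-length encoding, with zlib standing in for LZ4/ZSTD as the general-purpose codec on top. The column data and sizes are illustrative.

```python
import zlib

# Low-cardinality categorical column, as analytics tables often hold.
values = ["GET", "GET", "GET", "POST", "GET", "PUT"] * 10_000

# Dictionary encoding: store each distinct string once, then small integer codes.
dictionary = sorted(set(values))               # ['GET', 'POST', 'PUT']
codes = bytes(dictionary.index(v) for v in values)

# Run-length encoding on the codes: [code, run_length] pairs.
runs = []
for code in codes:
    if runs and runs[-1][0] == code:
        runs[-1][1] += 1
    else:
        runs.append([code, 1])

# A general-purpose codec sits on top (zlib here, standing in for LZ4/ZSTD).
raw = "".join(values).encode()                 # the column with no encoding at all
compressed = zlib.compress(bytes(b for pair in runs for b in pair))
print(len(raw), len(compressed))
```

The encoded-then-compressed column is a small fraction of the raw bytes, and the pipeline stays fully reversible: replaying the runs through the dictionary reproduces the original values.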
Most engines store per-block metadata — min, max, count, and sometimes Bloom filters — so the planner can skip whole blocks during predicate evaluation without decompressing them. This is "predicate pushdown into the storage layer" and is central to how columnar engines stay fast on selective scans.
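A toy version of that skipping logic, with a hypothetical block size and a sorted column standing in for an event-time column; real engines apply the same min/max test before ever decompressing a block.

```python
# Sorted column split into blocks, each with min/max/count metadata.
BLOCK_ROWS = 8                      # illustrative; real blocks hold thousands of rows
column = list(range(64))            # e.g. a monotonically increasing event time
blocks = [column[i:i + BLOCK_ROWS] for i in range(0, len(column), BLOCK_ROWS)]
stats = [(min(b), max(b), len(b)) for b in blocks]   # written alongside each block

def scan(lo, hi):
    """Return values in [lo, hi], skipping blocks the metadata rules out."""
    out, blocks_read = [], 0
    for block, (mn, mx, _count) in zip(blocks, stats):
        if mx < lo or mn > hi:
            continue                # predicate cannot match: skip, never decompress
        blocks_read += 1
        out.extend(v for v in block if lo <= v <= hi)
    return out, blocks_read

values, read = scan(24, 31)
print(read, "of", len(blocks), "blocks read")   # 1 of 8
```

The more selective the predicate and the better the data clusters, the more blocks the metadata eliminates before any real I/O happens.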
| Layout | Example engines |
|---|---|
| Row | Postgres heap, MySQL InnoDB |
| Column | Snowflake, Parquet, ORC, ClickHouse, DuckDB, IBM DB2 BLU |
Inside columnar storage formats #
PAX (Partition Attributes Across) was introduced by Ailamaki et al. in the 2001 VLDB paper Weaving Relations for Cache Performance. It groups rows into pages like a row store, then arranges columns within each page contiguously, preserving cache locality for full-row reads while keeping the I/O wins for column-subset scans. This inspired many of today's columnar formats.
Most modern columnar formats — Snowflake micro-partitions, Parquet, ORC, DB2 BLU — are PAX-inspired and use row groups. A file or partition is split into bounded chunks of rows (bounded by row count or by bytes, from a few thousand rows up to hundreds of MB), and within each chunk, columns are stored contiguously. The row group is a first-class boundary: it has its own metadata, its own min/max stats, and compression and encoding reset at every row group boundary. Critically, these boundaries are fixed at write time — once a Parquet file is on disk, the row group sizing is baked in. Your only recourse if it turns out wrong is to rewrite the file.
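A toy writer makes the write-time nature of row groups concrete. Nothing here is any real format's layout; ROW_GROUP_SIZE and the stats structure are illustrative.

```python
import struct

ROW_GROUP_SIZE = 4   # rows per group; real writers use thousands of rows or ~100 MB

def write_file(rows):
    """Sketch of a PAX-style file: bounded row groups, columns contiguous inside each."""
    groups = []
    for i in range(0, len(rows), ROW_GROUP_SIZE):
        chunk = rows[i:i + ROW_GROUP_SIZE]
        cols = list(zip(*chunk))                 # column-major within the group
        groups.append({
            "stats": [{"min": min(c), "max": max(c)} for c in cols],  # per-group metadata
            "data": [struct.pack(f"{len(c)}i", *c) for c in cols],
        })
    return groups   # boundaries are now fixed; resizing them means rewriting the file

rows = [(i, i * 10) for i in range(10)]
f = write_file(rows)
print([len(g["data"][0]) // 4 for g in f])   # rows per group: [4, 4, 2]
```

Once `write_file` returns, every group's size, stats, and column chunks are frozen together; there is no in-place way to re-chunk the data.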
ClickHouse MergeTree doesn't do this within a part. Inside a Wide-format part, each column is a single contiguous file — colA.bin contains every value of colA for the entire part, end to end. Compression blocks span the whole column. Granules (8192 rows by default) exist as index marks, not as container boundaries.
That said, if you zoom out to the table level, parts themselves look a bit like row groups. Each INSERT creates a new part, every part holds a bounded set of rows, and a column across the whole table lives in many separate files (part1/colA.bin, part2/colA.bin, …). But unlike Parquet row groups, parts aren't a fixed write-time decision — they're a continuous operational concern. Background merges combine small parts into larger ones over time, so the on-disk layout you have today isn't the layout you'll have next week. As parts grow, column streams get longer and compression improves. Sizing thresholds (min_bytes_for_wide_part, max_bytes_to_merge_at_max_space_in_pool) are tunable, and partitioning implicitly bounds part size from above.
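The marks-not-containers idea can be sketched as follows. This is simplified: a real MergeTree mark also stores offsets into the compressed .bin file, not just a row number.

```python
import bisect

GRANULE = 8_192   # MergeTree's default index_granularity

# One continuous column stream for the whole part: no container boundaries.
col = list(range(100_000))

# Sparse marks, one per granule; here each mark is just its starting row number.
marks = list(range(0, len(col), GRANULE))

def granule_for(row):
    """Binary-search the sparse marks to find the granule holding a row."""
    return bisect.bisect_right(marks, row) - 1

print(granule_for(20_000))   # granule 2, covering rows 16384..24575
```

Because the marks are metadata on the side, the engine can read any slice of the continuous column stream; nothing about compression or encoding resets at a granule boundary.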
When to use columnar storage #
Columnar storage wins when:
- Queries scan many rows but read few columns (the typical analytical pattern).
- Aggregations, filters, and group-bys dominate the workload.
- Data is append-mostly — bulk inserts followed by reads, with few per-row updates.
- Storage cost matters: 5–10× compression typical, up to 30× on low-cardinality columns.
Columnar storage loses when:
- Queries are point lookups by primary key returning whole rows.
- The dominant write pattern is single-row updates.
- The workload is thousands of small transactional queries per second per node.
For the head-to-head against row stores, see row-oriented vs column-oriented databases.
How columnar storage differs from a columnar database #
Columnar storage is a file layout. A columnar database is a complete system that uses the layout. The database adds:
- A query planner aware of the column-major layout.
- A vectorised execution engine that processes batches of column values per call.
- An indexing strategy (sparse primary indexes, skip indexes, Bloom filters).
- A write path that handles the merge and compaction overhead columnar layouts introduce.
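The vectorised-execution point in the list above can be illustrated with a toy batch loop. BATCH and the column data are made up; the pattern is the point: one operator call per batch of column values, with a tight loop inside, instead of per-row dispatch.

```python
BATCH = 1_024   # values per batch; a hypothetical size

price = list(range(10_000))
qty = [2] * 10_000

def scan_batches(col):
    """Yield the column in fixed-size batches, as a vectorised scan operator would."""
    for i in range(0, len(col), BATCH):
        yield col[i:i + BATCH]

# One call per batch, amortising dispatch overhead across 1024 values.
total = 0
for p, q in zip(scan_batches(price), scan_batches(qty)):
    total += sum(x * y for x, y in zip(p, q))
print(total)
```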
Parquet is a columnar storage format. ClickHouse and DuckDB are columnar databases. A columnar database like ClickHouse reads columnar storage formats: its native MergeTree layout as well as interchange formats like Parquet. See columnar storage formats for the format-vs-engine distinction in detail, and the columnar database overview for the wider primer.