Vectorised query execution processes batches of values, typically 1024 to 4096 per operator call, instead of one row at a time. The batch model amortises interpreter overhead, lets compilers emit SIMD instructions, and keeps tight working sets in CPU cache. It is the dominant execution model for analytical databases including ClickHouse, DuckDB, Snowflake, and Databricks Photon.
The technique was introduced in the MonetDB/X100 paper by Peter Boncz, Marcin Zukowski, and Niels Nes at CIDR 2005, which measured the older Volcano iterator model spending 50–80% of CPU time on interpretation rather than real work. Modern columnar databases pair vectorised execution with column storage to compound the I/O and CPU wins.
Why does vectorised query execution matter? #
Vectorised execution is why an analytical engine like ClickHouse can query billions of rows in under a second on a single laptop core. For database users, the benefit shows up as lower cloud-compute bills, dashboards that update in real time instead of overnight, and modern CPUs spending cycles on real work instead of the 50–80% interpreter tax the iterator model carries.
For someone choosing a database for analytics in 2026, the decision is mostly between vectorised engines. ClickHouse, DuckDB, Snowflake, Databricks Photon, Apache DataFusion, and Velox all use the model. Row-oriented systems like PostgreSQL, MySQL, and SQLite stay with the iterator model and pay an order-of-magnitude penalty on aggregation-heavy queries.
How does vectorised query execution work? #
Vectorised execution turns each query operator into a function that takes a batch of column values, processes them in a tight inner loop, and returns the resulting batch. Batches are sized to fit in L1 or L2 cache; 1024 to 4096 values is typical. The compiler emits SIMD instructions automatically because the loop has no virtual calls or row-by-row branches.
The hot loop looks like a numerical kernel rather than a database operator:
```c
// Filter: keep rows where price > 100. Operates on a 1024-wide column.
for (int i = 0; i < BATCH_SIZE; i++) {
    selection_vector[count] = i;   // speculative write; overwritten if the row fails
    count += (price[i] > 100);     // branchless; auto-vectorises to AVX2 vpcmpgtd
}
```
AVX2 processes 8 32-bit integers per instruction; AVX-512 processes 16; ARM NEON processes 4. The original MonetDB/X100 paper measured speedups of 4–30× across TPC-H queries against a row-at-a-time interpreter baseline once batching, cache locality, and SIMD compounded.
Components of a vectorised operator #
A vectorised operator has three moving parts.
- Batch (or chunk): A contiguous slice of one or more columns sized to fit in L1/L2 cache. DuckDB fixes its `DataChunk` at 2048 rows; ClickHouse's `Block` is operator-tunable; Apache DataFusion uses an Arrow `RecordBatch`.
- Selection vector: An array of row indices marking which rows in the batch survived the most recent predicate. Filters write to the selection vector instead of copying data, so downstream operators skip eliminated rows without rebuilding the batch.
- Operator fusion: Adjacent expressions are merged into a single pass, so `a * b + c + 1` runs as one kernel rather than three. ClickHouse fuses hot expressions at runtime via LLVM-based query compilation.
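The filter-plus-selection-vector interaction can be sketched in a few lines. This is an illustrative kernel, not any engine's actual API: `filter_gt`, `sum_selected`, and the buffer names are invented here, but the pattern — a speculative write plus a branchless counter, with the aggregate reading only surviving indices — is the one the list above describes.

```cpp
#include <cassert>
#include <cstdint>

constexpr int BATCH_SIZE = 1024;

// Filter kernel: records the indices of rows with price > threshold.
// Writes indices into sel[] and returns the survivor count.
int filter_gt(const int32_t* price, int32_t threshold, int* sel) {
    int count = 0;
    for (int i = 0; i < BATCH_SIZE; i++) {
        sel[count] = i;                   // speculative write
        count += (price[i] > threshold);  // branchless advance
    }
    return count;
}

// Downstream aggregate: sums only the rows the filter kept,
// without copying or compacting the original batch.
int64_t sum_selected(const int32_t* price, const int* sel, int count) {
    int64_t total = 0;
    for (int j = 0; j < count; j++)
        total += price[sel[j]];
    return total;
}
```

Note that the filtered batch is never materialised: the aggregate indexes through `sel`, which is why selection vectors are cheaper than compacting columns after every predicate.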
How does vectorised execution differ from the Volcano iterator model? #
The Volcano iterator model returns one row per next() call through a chain of virtual function calls; vectorisation pulls the loop inside the operator and processes 1024 or more rows per call. The batch model amortises interpreter overhead, which the MonetDB/X100 authors measured at 50–80% of CPU time under the iterator model, and lets the compiler emit SIMD.
The Volcano model, introduced by Goetz Graefe in 1994, is simple to reason about but carries two costs. Virtual dispatch and per-row branch prediction dominate the runtime on analytical queries. And hot inner loops cannot auto-vectorise because the compiler cannot see across the virtual call boundary.
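The per-row cost is easiest to see in code. The sketch below is a toy Volcano pipeline with invented operator names, not code from any real engine: every row crosses at least one virtual `next()` boundary, which is exactly the call the compiler cannot see across when deciding whether to vectorise.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy Volcano-style pipeline: one row per virtual next() call.
struct Operator {
    virtual std::optional<int> next() = 0;  // returns one row, or nothing at end
    virtual ~Operator() = default;
};

struct Scan : Operator {
    const std::vector<int>& rows;
    size_t pos = 0;
    explicit Scan(const std::vector<int>& r) : rows(r) {}
    std::optional<int> next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Operator {
    Operator& child;
    explicit Filter(Operator& c) : child(c) {}
    std::optional<int> next() override {
        while (auto row = child.next())   // virtual dispatch for every row
            if (*row > 100) return row;   // predicate re-entered per row
        return std::nullopt;
    }
};

// Consumer pulls one row at a time through the whole chain.
long volcano_sum(Operator& root) {
    long total = 0;
    while (auto row = root.next()) total += *row;
    return total;
}
```

Each surviving row here costs two virtual calls and a branch; a vectorised engine pays that dispatch once per 1024-row batch instead.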
| Property | Volcano iterator | Vectorised |
|---|---|---|
| Rows per next() call | 1 | 1024–4096 |
| Per-row interpreter cost | 50–80% of CPU time | Negligible |
| SIMD-friendly | No | Yes |
| Cache locality | Poor | Tight column buffer |
| Engines | Postgres, MySQL, SQLite | ClickHouse, DuckDB, Snowflake |
Is vectorisation the same as SIMD? #
No. Vectorisation is a batch-processing execution model; SIMD is one implementation choice for the inner loop. A vectorised engine running scalar code still beats the iterator model because batching amortises dispatch overhead and improves cache locality. SIMD adds another 4–16× on top, depending on the data type and instruction set.
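The distinction is architectural, so a sketch may help: the interface below is batched (one virtual call delivers up to 1024 rows) while the inner loop is plain scalar code. All names here are illustrative assumptions, not a real engine's API; the point is that the batch boundary alone removes per-row dispatch, with or without SIMD.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int BATCH_SIZE = 1024;

// Batched interface: one virtual call fills up to BATCH_SIZE rows,
// so dispatch cost is paid per batch rather than per row.
struct BatchSource {
    virtual int next_batch(int32_t* out) = 0;  // rows written, 0 at end
    virtual ~BatchSource() = default;
};

struct VectorSource : BatchSource {
    const std::vector<int32_t>& data;
    size_t pos = 0;
    explicit VectorSource(const std::vector<int32_t>& d) : data(d) {}
    int next_batch(int32_t* out) override {
        int n = static_cast<int>(std::min<size_t>(BATCH_SIZE, data.size() - pos));
        std::copy_n(data.begin() + pos, n, out);
        pos += n;
        return n;
    }
};

// Scalar inner loop: no SIMD intrinsics anywhere, yet the tight loop
// with no virtual calls per row is already the vectorised model.
int64_t batched_sum(BatchSource& src) {
    int32_t buf[BATCH_SIZE];
    int64_t total = 0;
    for (int n; (n = src.next_batch(buf)) > 0; )
        for (int i = 0; i < n; i++) total += buf[i];
    return total;
}
```

Whether the compiler then turns the inner loop into SIMD is a separate, second-order win on top of the batching itself.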
The third option is row-at-a-time JIT compilation: instead of batching, compile the entire query plan into native code at runtime so virtual calls disappear. PostgreSQL has shipped optional LLVM JIT since version 11 (2018); SingleStore uses code generation throughout. JIT eliminates dispatch overhead but does not naturally produce SIMD without further machinery, which is why hybrid engines like Databricks Photon JIT hot expressions and run them over batches. ClickHouse uses runtime CPU dispatch to pick between non-vectorised, AVX2, and AVX-512 kernels per host, with SSE 4.2 as the floor.
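The runtime-dispatch pattern mentioned above can be sketched with compiler builtins. This is a hedged illustration of the general technique, not ClickHouse's actual code: the function names are invented, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang x86 features, so the AVX2 path is guarded out on other toolchains and architectures.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback kernel, compiled for the baseline instruction set.
static int64_t sum_generic(const int32_t* v, int n) {
    int64_t t = 0;
    for (int i = 0; i < n; i++) t += v[i];
    return t;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// Same source, compiled with AVX2 enabled for this one function;
// the compiler is free to auto-vectorise the loop with 256-bit ops.
__attribute__((target("avx2")))
static int64_t sum_avx2(const int32_t* v, int n) {
    int64_t t = 0;
    for (int i = 0; i < n; i++) t += v[i];
    return t;
}
#endif

// Pick the widest kernel the running CPU supports, once at startup.
using SumFn = int64_t (*)(const int32_t*, int);
SumFn pick_sum_kernel() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_generic;
}
```

A real engine compiles each kernel variant in a separate translation unit with its own flags; the one-attribute-per-function form above is just the most compact way to show the dispatch decision.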
When does vectorised execution help? #
Vectorised execution helps when each operator processes hundreds of rows per call doing similar work — analytical scans, filters, hash builds, and aggregations all fit. It pays poorly on workloads that defeat batching, where the cost of pulling rows into a batch isn't recovered by the inner loop. Most analytical engines default to vectorised pipelines anyway because the overhead in the bad cases is small.
Databricks Photon, a C++ rewrite of Spark's row-at-a-time JVM execution, reported up to 12× speedups on TPC-DS at SIGMOD 2022. That gap sits on top of whatever columnar storage already buys, and lands inside the 4–30× range the X100 paper measured nearly two decades earlier on different hardware.
Vectorisation pays poorly on:
- Point lookups: `WHERE id = 42` touches one row, so there is nothing to amortise across a batch.
- Disk-bound scans: when a query is bottlenecked on I/O, CPU savings are invisible.
- Per-row branching: regex evaluation or JSON-path traversal, where each row takes a different code path, forces SIMD lanes to diverge, and the compiler can't pack them.
Which engines use vectorised execution? #
Most modern analytical databases use vectorised execution, including ClickHouse, DuckDB, Snowflake, Databricks Photon, Apache DataFusion, and Velox. Older row-oriented databases like PostgreSQL, MySQL, and SQLite stay with the Volcano iterator model, though PostgreSQL has shipped optional LLVM JIT compilation since version 11 (2018) as a partial alternative.
| Engine | Execution model | Notes |
|---|---|---|
| ClickHouse | Vectorised | Custom batch sizes per operator; runtime SIMD dispatch |
| DuckDB | Vectorised | 2048-row chunks; auto-vectorised C++ |
| Snowflake | Vectorised | Closed-source; documented in the 2016 SIGMOD paper |
| Databricks Photon | Vectorised | C++ engine replacing Spark's row-at-a-time JVM execution (SIGMOD 2022) |
| PostgreSQL | Iterator + optional JIT | LLVM JIT optional from version 11 |
| SQLite | Iterator | Pure Volcano model |