Alexey Milovidov at Web Summit 2025: Working with massive datasets at scale

Alexey Milovidov

A practical deep-dive into working with massive datasets in ClickHouse, using training data collections as fascinating case studies. This presentation gets hands-on with real data engineering challenges—loading, analyzing, and comparing datasets that range from Common Crawl's web pages to GitHub repositories, Wikipedia, Reddit, and beyond.

The demo walks through ingesting the FineWeb dataset (81.5 terabytes of web data) directly from Hugging Face into ClickHouse, then explores creative analytical approaches across different data sources. You'll see techniques for calculating "style fingerprints" of websites, tracking word trends across platforms, and mapping a billion photos geographically, all using SQL.
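The exact commands weren't published with this abstract, so here is a minimal sketch of that ingestion step, assuming the public HuggingFaceFW/fineweb Parquet layout on Hugging Face (the file glob, bucket range, and table name are illustrative, not the demo's actual values):

    -- Pull FineWeb Parquet files straight from Hugging Face over HTTPS.
    -- ClickHouse infers the schema from the files; the {000..099} glob
    -- expands to many URLs that are fetched in parallel.
    CREATE TABLE fineweb
    ENGINE = MergeTree
    ORDER BY tuple()
    AS SELECT *
    FROM url('https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2024-10/{000..099}_00000.parquet', Parquet)
    SETTINGS max_insert_threads = 16;

For scale: 81.5TB in 72 minutes works out to roughly 19GB/s of sustained ingest.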

  • Loading the 81.5TB FineWeb dataset from Hugging Face in 72 minutes using parallel ingestion (sketched above)
  • Automatic schema inference from Parquet files, with better compression ratios than the source format (first sketch after this list)
  • Comparing writing styles across platforms (Wikipedia vs Reddit vs Hacker News vs Bluesky) using token analysis (second sketch below)
  • Creating website "style fingerprints" with hash-based vectorization to find similar domains (third sketch below)
  • Tracking word trends over time and running geographic queries over multimodal datasets such as 1 billion photos (fourth sketch below)
  • Practical examples working with GitHub (300TB), Wikipedia (45GB), and other major open datasets
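Schema inference can be previewed without creating a table at all. A minimal sketch, reusing the assumed FineWeb URL from above:

    -- DESCRIBE reads the Parquet metadata and proposes ClickHouse column
    -- types, so no hand-written schema is needed before loading.
    DESCRIBE TABLE url('https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2024-10/000_00000.parquet', Parquet);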
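For the cross-platform style comparison, one way to do token analysis in plain SQL is to split each document into lowercase tokens and compare per-million frequencies of a word across platforms. The table and column names here (wikipedia, hackernews, text) are assumptions, not the talk's actual schema:

    -- Count how often a token appears per million tokens on each platform.
    WITH tokenized AS
    (
        SELECT 'wikipedia' AS platform, arrayJoin(splitByNonAlpha(lower(text))) AS token
        FROM wikipedia
        UNION ALL
        SELECT 'hackernews', arrayJoin(splitByNonAlpha(lower(text)))
        FROM hackernews
    )
    SELECT platform, countIf(token = 'actually') / count() * 1e6 AS per_million
    FROM tokenized
    GROUP BY platform;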
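The "style fingerprint" idea can be approximated with hash-based vectorization: hash every token into a small fixed number of buckets, count hits per bucket to get one vector per domain, then rank domain pairs by cosine distance. A sketch over an assumed pages(domain, text) table; the 128 buckets and the pairwise join are illustrative choices (the join is quadratic, so in practice you would restrict it to a candidate set):

    -- One 128-dimensional count vector per domain, compared pairwise.
    WITH vectors AS
    (
        SELECT
            domain,
            -- sumForEach adds the per-token one-hot arrays element-wise.
            sumForEach(arrayMap(i -> toUInt64(i = bucket), range(128))) AS fp
        FROM
        (
            SELECT domain, cityHash64(arrayJoin(splitByNonAlpha(lower(text)))) % 128 AS bucket
            FROM pages
        )
        GROUP BY domain
    )
    SELECT a.domain AS d1, b.domain AS d2, cosineDistance(a.fp, b.fp) AS dist
    FROM vectors AS a CROSS JOIN vectors AS b
    WHERE a.domain < b.domain
    ORDER BY dist ASC
    LIMIT 10;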
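Finally, trend tracking and geographic aggregation both reduce to GROUP BY. A sketch with assumed tables reddit(created_at, text) and photos(lat, lon); the search word and grid resolution are arbitrary:

    -- Monthly share of Reddit comments mentioning a word.
    SELECT
        toStartOfMonth(created_at) AS month,
        countIf(positionCaseInsensitive(text, 'clickhouse') > 0) / count() AS share
    FROM reddit
    GROUP BY month
    ORDER BY month;

    -- Bucket a billion photo coordinates into a ~0.1 degree grid for mapping.
    SELECT round(lon, 1) AS grid_lon, round(lat, 1) AS grid_lat, count() AS photos
    FROM photos
    GROUP BY grid_lon, grid_lat
    ORDER BY photos DESC
    LIMIT 100;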