Alexey Milovidov at Web Summit 2025: Working with massive datasets at scale

Alexey Milovidov

A practical deep-dive into working with massive datasets in ClickHouse, using training data collections as fascinating case studies. This presentation gets hands-on with real data engineering challenges—loading, analyzing, and comparing datasets that range from Common Crawl's web pages to GitHub repositories, Wikipedia, Reddit, and beyond.

The demo walks through ingesting the FineWeb dataset (81.5 terabytes of web data) directly from Hugging Face into ClickHouse, then explores creative analytical approaches across different data sources. You'll see techniques for calculating "style fingerprints" of websites, tracking word trends across platforms, and mapping a billion photos geographically, all using SQL.
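The exact commands weren't published with this abstract, so here is a minimal sketch of that ingestion step, assuming the public HuggingFaceFW/fineweb Parquet layout on Hugging Face (the file glob, bucket range, and table name are illustrative, not the demo's actual values):

    -- Pull FineWeb Parquet files straight from Hugging Face over HTTPS.
    -- ClickHouse infers the schema from the files; the {000..099} glob
    -- expands to many URLs that are fetched in parallel.
    CREATE TABLE fineweb
    ENGINE = MergeTree
    ORDER BY tuple()
    AS SELECT *
    FROM url('https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2024-10/{000..099}_00000.parquet', Parquet)
    SETTINGS max_insert_threads = 16;

For scale: 81.5TB in 72 minutes works out to roughly 19GB/s of sustained ingest.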

  • Loading the 81.5TB FineWeb dataset from Hugging Face in 72 minutes using parallel ingestion (sketched above)
  • Automatic schema inference from Parquet files, with better compression ratios than the source format (first sketch after this list)
  • Comparing writing styles across platforms (Wikipedia vs Reddit vs Hacker News vs Bluesky) using token analysis (second sketch below)
  • Creating website "style fingerprints" with hash-based vectorization to find similar domains (third sketch below)
  • Tracking word trends over time and running geographic queries over multimodal datasets such as 1 billion photos (fourth sketch below)
  • Practical examples working with GitHub (300TB), Wikipedia (45GB), and other major open datasets
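Schema inference can be previewed without creating a table at all. A minimal sketch, reusing the assumed FineWeb URL from above:

    -- DESCRIBE reads the Parquet metadata and proposes ClickHouse column
    -- types, so no hand-written schema is needed before loading.
    DESCRIBE TABLE url('https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/main/data/CC-MAIN-2024-10/000_00000.parquet', Parquet);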
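For the cross-platform style comparison, one way to do token analysis in plain SQL is to split each document into lowercase tokens and compare per-million frequencies of a word across platforms. The table and column names here (wikipedia, hackernews, text) are assumptions, not the talk's actual schema:

    -- Count how often a token appears per million tokens on each platform.
    WITH tokenized AS
    (
        SELECT 'wikipedia' AS platform, arrayJoin(splitByNonAlpha(lower(text))) AS token
        FROM wikipedia
        UNION ALL
        SELECT 'hackernews', arrayJoin(splitByNonAlpha(lower(text)))
        FROM hackernews
    )
    SELECT platform, countIf(token = 'actually') / count() * 1e6 AS per_million
    FROM tokenized
    GROUP BY platform;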
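The "style fingerprint" idea can be approximated with hash-based vectorization: hash every token into a small fixed number of buckets, count hits per bucket to get one vector per domain, then rank domain pairs by cosine distance. A sketch over an assumed pages(domain, text) table; the 128 buckets and the pairwise join are illustrative choices (the join is quadratic, so in practice you would restrict it to a candidate set):

    -- One 128-dimensional count vector per domain, compared pairwise.
    WITH vectors AS
    (
        SELECT
            domain,
            -- sumForEach adds the per-token one-hot arrays element-wise.
            sumForEach(arrayMap(i -> toUInt64(i = bucket), range(128))) AS fp
        FROM
        (
            SELECT domain, cityHash64(arrayJoin(splitByNonAlpha(lower(text)))) % 128 AS bucket
            FROM pages
        )
        GROUP BY domain
    )
    SELECT a.domain AS d1, b.domain AS d2, cosineDistance(a.fp, b.fp) AS dist
    FROM vectors AS a CROSS JOIN vectors AS b
    WHERE a.domain < b.domain
    ORDER BY dist ASC
    LIMIT 10;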
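Finally, trend tracking and geographic aggregation both reduce to GROUP BY. A sketch with assumed tables reddit(created_at, text) and photos(lat, lon); the search word and grid resolution are arbitrary:

    -- Monthly share of Reddit comments mentioning a word.
    SELECT
        toStartOfMonth(created_at) AS month,
        countIf(positionCaseInsensitive(text, 'clickhouse') > 0) / count() AS share
    FROM reddit
    GROUP BY month
    ORDER BY month;

    -- Bucket a billion photo coordinates into a ~0.1 degree grid for mapping.
    SELECT round(lon, 1) AS grid_lon, round(lat, 1) AS grid_lat, count() AS photos
    FROM photos
    GROUP BY grid_lon, grid_lat
    ORDER BY photos DESC
    LIMIT 100;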