Amazon 客户评论
该数据集包含超过 1.5 亿条 Amazon 商品的客户评论。数据以存储在 AWS S3 中的 snappy 压缩 Parquet 文件形式提供,压缩后总大小为 49GB。下面我们逐步演示如何将其导入 ClickHouse。
注意
下面的查询是在 Production 环境的 ClickHouse Cloud 实例上执行的。更多信息请参阅 "Playground 规格说明"。
加载数据集
- 在不将数据插入 ClickHouse 的情况下,我们可以直接在原处对其进行查询。先取出几行数据,看看它们的样子:
这些行如下所示:
- 让我们在 ClickHouse 中定义一个名为
amazon_reviews的新MergeTree表来存储这些数据:
- 下面的
INSERT命令使用了s3Cluster表函数,它可以利用集群中所有节点并行处理多个 S3 文件。我们还使用通配符来插入所有名称以https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet开头的文件:
提示
在 ClickHouse Cloud 中,集群名称为 default。请将 default 更改为你的集群名称……或者如果你没有集群,可以使用 s3 表函数(而不是 s3Cluster)。
- 该查询执行时间很短——平均每秒大约处理 300,000 行数据。大约 5 分钟内你就应该能看到所有行都已插入:
- Let's see how much space our data is using:
The original data was about 70G, but compressed in ClickHouse it takes up about 30G.
Example queries
- Let's run some queries. Here are the top 10 most-helpful reviews in the dataset:
注意
This query is using a projection to speed up performance.
- Here are the top 10 products in Amazon with the most reviews:
- Here are the average review ratings per month for each product (an actual Amazon job interview question!):
- Here are the total number of votes per product category. This query is fast because
product_categoryis in the primary key:
- Let's find the products with the word "awful" occurring most frequently in the review. This is a big task - over 151M strings have to be parsed looking for a single word:
Notice the query time for such a large amount of data. The results are also a fun read!
- We can run the same query again, except this time we search for awesome in the reviews: