Processing one trillion rows
Compute the minimum, maximum, and mean temperatures per weather station, sorted alphabetically, across one trillion rows.
The example is inspired by the One Trillion Row Challenge from Coiled, the company behind Dask, a popular open-source library for distributed computing in Python.
The source code is in our GitHub repository at https://github.com/kubox-ai/1trc
This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1
and process one trillion rows of data. It gives you two options for tackling this challenge:
- ClickHouse – A powerful, high-performance analytics database.
- Daft and Ray – A dynamic duo for distributed computing and cutting-edge data processing.
We’re excited to have you explore and experiment with Kubox. Feel free to dive in and share your feedback as we continue to enhance this project!
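To give a flavour of the challenge query itself, below is a minimal sketch of the aggregation using Daft on a Ray cluster. It is not the tuned implementation from this repository: the column names (`station`, `measure`), the glob path, and the exact Daft calls are assumptions that may need adjusting to your Daft version and the dataset schema.

```python
import daft
from daft import col

# Run Daft on an existing Ray cluster (assumes Ray is already running on the
# Kubox cluster); omit this line to run locally instead.
daft.context.set_runner_ray()

# Read the Parquet files from the dataset. Note: the bucket is Requester Pays,
# so S3 access must be configured accordingly (not shown here).
df = daft.read_parquet("s3://coiled-datasets-rp/1trc/*.parquet")

# Min / mean / max temperature per station, sorted alphabetically.
result = (
    df.groupby("station")
      .agg(
          col("measure").min().alias("min_temp"),
          col("measure").mean().alias("mean_temp"),
          col("measure").max().alias("max_temp"),
      )
      .sort("station")
)

result.show()
```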
| Metric / Framework | Daft + Ray | ClickHouse |
|---|---|---|
| Startup time | 320s | 313s |
| Running time | 1189s | 527s |
| Delete time | 122s | 123s |
| Estimated cost | $2.75 | $1.37 |
Work in progress…
Dataset
The One Trillion Row Challenge originated as an ambitious benchmark task:
- Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
- Dataset:
  - Format: Parquet
  - Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
  - Location: s3://coiled-datasets-rp/1trc (an AWS S3 Requester Pays bucket)
Here is how to download one of the 24 MiB files:
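Below is a minimal sketch using boto3. Because the bucket is Requester Pays, every request must opt in with `RequestPayer="requester"` (and you will be billed for the transfer). The object key shown is an assumption for illustration; list the bucket to see the actual file names.

```python
import boto3

# Create an S3 client in the bucket's region.
s3 = boto3.client("s3", region_name="us-east-1")

# NOTE: the object key below is assumed for illustration; replace it with a
# real file name from the 1trc prefix.
s3.download_file(
    Bucket="coiled-datasets-rp",
    Key="1trc/measurements-0.parquet",
    Filename="measurements-0.parquet",
    # Required for Requester Pays buckets: the caller pays for the request.
    ExtraArgs={"RequestPayer": "requester"},
)
```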