At its core, Kubox offers a freemium CLI that lets you quickly spin up Kubernetes clusters in your own cloud, giving you the flexibility to deploy your own data processing frameworks and tools.
The source code is available in our GitHub repository at https://github.com/kubox-ai/1trc. This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS `us-east-1` and process 1 trillion rows of data. It gives you two options for tackling this challenge:
- ClickHouse – A powerful, high-performance analytics database.
- Daft and Ray – A dynamic duo for distributed computing and cutting-edge data processing.
USD prices for AWS EC2 spot instances are used in the calculations below; spot prices fluctuate with current market rates.
| Metric/Framework | Daft + Ray | ClickHouse |
|---|---|---|
| Startup time | 320s | 313s |
| Running time | 1189s | 527s |
| Delete time | 122s | 123s |
| Estimated cost | $2.75 | $1.37 |
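To see how the cost and time rows relate, here is a small sketch that derives the implied hourly fleet rate from the table's own figures (total duration = startup + running + delete). The rates are illustrative only, since spot pricing varies from run to run:

```python
# Implied hourly fleet rate, derived from the benchmark table above.
# Spot prices fluctuate, so these are sample-run figures, not quotes.
runs = {
    "Daft + Ray": {"total_s": 320 + 1189 + 122, "cost_usd": 2.75},
    "ClickHouse": {"total_s": 313 + 527 + 123, "cost_usd": 1.37},
}

for name, run in runs.items():
    hourly = run["cost_usd"] / (run["total_s"] / 3600)
    print(f"{name}: {run['total_s']}s total, implied ~${hourly:.2f}/hour")
```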
Dataset
The One Trillion Row Challenge originated as an ambitious benchmark task:

- Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
- Dataset:
  - Format: Parquet
  - Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
  - Location: `s3://coiled-datasets-rp/1trc` (an AWS S3 Requester Pays bucket)
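The aggregation at the heart of the challenge is straightforward; the hard part is scale. As a minimal sketch of the computation itself, here it is with pandas on a tiny synthetic sample (the column names `station` and `measure` are assumed to match the 1trc Parquet files; verify against the actual dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for one 1trc Parquet file
# (assumed schema: station, measure)
df = pd.DataFrame({
    "station": ["Oslo", "Perth", "Oslo", "Perth", "Quito"],
    "measure": [-3.2, 31.5, 1.8, 28.0, 14.4],
})

# The challenge goal: min/mean/max temperature per station,
# sorted alphabetically by station name
result = (
    df.groupby("station")["measure"]
      .agg(["min", "mean", "max"])
      .sort_index()
)
print(result)
```

Both approaches in this repo express the same shape of query: ClickHouse as a `GROUP BY station` SQL statement, and Daft + Ray as a distributed groupby-aggregate over all 100,000 files.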
24 MiB file: