This example is inspired by the One Trillion Row Challenge from Coiled, the company behind Dask, a popular open-source library for distributed computing in Python.

At its core, Kubox offers a freemium CLI that lets you effortlessly spin up Kubernetes clusters in your own cloud. You have the flexibility to deploy your own data processing frameworks and tools.

The source code is in our GitHub repository at https://github.com/kubox-ai/1trc

This repository is a work in progress as we iterate and learn before the final submission. It contains the code to quickly spin up a Kubox cluster in AWS us-east-1 and process 1 trillion rows of data. It gives you two options for tackling this challenge:

  1. ClickHouse – A powerful, high-performance analytics database.
  2. Daft and Ray – A dynamic duo for distributed computing and cutting-edge data processing.

We’re excited to have you explore and experiment with Kubox. Feel free to dive in and share your feedback as we continue to enhance this project!

The calculations below use AWS EC2 spot instance prices in USD:

Metric / Framework     Daft + Ray    ClickHouse
Startup time           320s          313s
Running time           1189s         527s
Delete time            122s          123s
Estimated cost         $2.75         $1.37
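As a rough cross-check, the Daft + Ray run occupies the cluster for 320 + 1189 + 122 = 1631 s (about 0.45 h) versus 313 + 527 + 123 = 963 s (about 0.27 h) for ClickHouse, so most of the cost gap tracks the difference in total cluster time; the exact dollar figures also depend on the instance types and fleet size used for each run.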

Work in progress...

Dataset

The One Trillion Row Challenge originated as an ambitious benchmark task:

  • Goal: Compute the minimum, mean, and maximum temperatures per weather station, sorted alphabetically.
  • Dataset:
    • Format: Parquet
    • Size: 2.5 TB (100,000 files, each 24 MiB in size with 10 million rows)
    • Location: s3://coiled-datasets-rp/1trc (AWS S3 Requester Pays Bucket)
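To make the task concrete, here is a minimal sketch of the challenge query using Daft (option 2 above). The column names station and measure are assumptions based on the published 1trc dataset, and reading the requester-pays bucket may need AWS credentials and extra S3 IO configuration that is omitted here.

import daft

# Read all of the parquet files from the requester-pays bucket
# (credentials and requester-pays S3 configuration not shown).
df = daft.read_parquet("s3://coiled-datasets-rp/1trc/*.parquet")

# Min, mean, and max temperature per station, sorted alphabetically.
result = (
    df.groupby("station")
    .agg(
        daft.col("measure").min().alias("min"),
        daft.col("measure").mean().alias("mean"),
        daft.col("measure").max().alias("max"),
    )
    .sort("station")
)
result.show()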

Here is how to download one of the 24 MiB files:

aws s3 cp s3://coiled-datasets-rp/1trc/measurements-0.parquet . --request-payer requester
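To sanity-check the data locally, you can run the same aggregation over that single file with pandas. This is a sketch, again assuming the station and measure column names; it requires pandas with a parquet engine such as pyarrow installed.

import pandas as pd

# Load the single downloaded file and compute per-station stats.
df = pd.read_parquet("measurements-0.parquet")
stats = df.groupby("station")["measure"].agg(["min", "mean", "max"]).sort_index()
print(stats.head())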