Exploratory Data Analysis on Three Billion Rows

Problem

Traditional Data Platform as a Service (DPaaS) solutions such as Databricks, Amazon SageMaker or Google Vertex AI often lack the agility needed for multi-cloud and hybrid environments. The issue is compounded in edge and fog environments that produce terabytes of data daily. For example, running a query on the New York Taxi dataset (300 GB) stored in AWS S3 from a Google Colab notebook would incur 20 USD in egress costs and take over an hour just to download the data.
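The arithmetic behind figures like these can be sketched in a few lines of Python. The per-GB price and the bandwidth below are hypothetical placeholders chosen to land in the same ballpark as the numbers quoted above. They are not quoted AWS prices or measured Colab speeds, so substitute your own values.

```python
def egress_cost_usd(size_gb: float, price_per_gb_usd: float) -> float:
    """Cost of moving size_gb out of the cloud at a flat per-GB price."""
    return size_gb * price_per_gb_usd


def download_minutes(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Minutes needed to download size_gb at a sustained bandwidth (decimal units)."""
    size_mbit = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 Mbit
    return size_mbit / bandwidth_mbit_s / 60


# Hypothetical inputs: assumptions for illustration, not real prices or speeds.
DATASET_GB = 300
PRICE_PER_GB_USD = 0.07
BANDWIDTH_MBIT_S = 600

print(f"Egress cost: {egress_cost_usd(DATASET_GB, PRICE_PER_GB_USD):.2f} USD")
print(f"Download time: {download_minutes(DATASET_GB, BANDWIDTH_MBIT_S):.0f} minutes")
```

Both quantities grow linearly with dataset size, which is why moving the compute to the data becomes more attractive as the data grows.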

Solution

This example demonstrates how Kubox allows users to deploy an ephemeral Kubernetes cluster within their own AWS account and region, co-locating compute with data for faster Exploratory Data Analysis (EDA).

The source code is available in our GitHub repository at https://github.com/kubox-ai/notebooks

Clone the Kubox notebooks repository to your local machine:

git clone https://github.com/kubox-ai/notebooks.git

Create a Kubox cluster in your own AWS account:

kubox create -f cluster-basic.yaml

An example cluster-basic.yaml:

metadata:
  clusterName: notebook15
  clusterConfigDir: "./basic/cluster/config"
  pulumi:
    orgName: kubox-ai
    projectName: playground

tags:
  - key: "Environment"
    value: "dev"

# These will be passed to Flux
gitOps:
  directoryName: "./basic/cluster"

aws:
  region: "ap-southeast-2"
  nodes:
    - vmType: t2.medium
      count: 1
      role: control-plane
      spotInstance: false
    - vmType: c4.4xlarge
      count: 1
      role: worker
      # By default Kubox will create EC2 instances as spot.
      spotInstance: false
      awsIAMInstanceProfile: KuboxEC2InstanceRole

Deploying compute close to the data with an ephemeral approach not only optimises costs and time but also ensures up-to-date security, making environments resilient against advanced threats and enabling a new era of efficient, cloud-agnostic data platforms.

This notebook analyses NYC taxi trips from multiple data sources (Yellow Taxi, Green Taxi, and For-Hire Vehicle trips) between 2014 and 2023. It uses Daft, a unified engine for data engineering, analytics and ML/AI, for data manipulation and Seaborn for visualisation. The data is sourced from public NYC taxi datasets in Parquet format and provides insights into trip trends over time.

import daft
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

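# yellow_df, green_df and fhvhv_df are the Daft DataFrames for the three trip datasets.
# count_rows() returns a plain integer, so the three counts can be summed directly.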
total_rows = yellow_df.count_rows() + green_df.count_rows() + fhvhv_df.count_rows()

# Format the row count in millions
formatted_count = f"{total_rows / 1_000_000:0.2f}"

# Print the formatted count
print(f"Rows count (Millions): {formatted_count}")

# Rows count (Millions): 1957.38

# Group by pickup date and count the number of trips per day
yellow_result = (
    yellow_df.select(
        # Extract the calendar date from the pickup datetime column
        yellow_df["tpep_pickup_datetime"].dt.date().alias("date"),
        yellow_df["tpep_pickup_datetime"].alias("pick_datetime"),
        yellow_df["tpep_pickup_datetime"],
        # Constant 1 wherever fare_amount is non-null; appears to label the trip type
        (yellow_df["fare_amount"] * 0 + 1).alias("type"),
    )
    .groupby(["date", "type"])
    .agg(yellow_df["tpep_pickup_datetime"].count().alias("trip_count"))
)
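To make the shape of that result concrete, here is the same per-day count mirrored in plain Python. This is a toy sketch, not part of the notebook: the timestamps are made up for illustration, the daily_trip_counts helper is hypothetical, and the "type" column is left out for brevity.

```python
from collections import Counter
from datetime import datetime

# Made-up pickup timestamps, for illustration only (not taken from the dataset).
pickups = [
    datetime(2020, 1, 1, 8, 15),
    datetime(2020, 1, 1, 17, 40),
    datetime(2020, 1, 2, 9, 5),
]


def daily_trip_counts(pickup_times):
    """Count trips per calendar date, like grouping by "date" and counting rows."""
    return Counter(ts.date() for ts in pickup_times)


# One (date, trip_count) pair per day, in chronological order.
for day, trip_count in sorted(daily_trip_counts(pickups).items()):
    print(day, trip_count)
```

The Daft query produces the same kind of table, one row per date, but computes it in parallel over the Parquet files instead of over an in-memory list.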

Background

Data Locality is the practice of storing data near the processing unit that will consume it, reducing data movement and latency in data-intensive applications. As the amount of data increases, the concept of data locality becomes increasingly important. As more data is produced outside traditional data centres, particularly in edge and fog environments, having an agile, versatile platform that can be spun up anywhere (including cloud) within minutes is transformative. This approach not only reduces the time from data to insights but also slashes costs dramatically.

Dataset

The NYC Taxi dataset is one of the most popular and widely used datasets for analysing urban transportation trends. It contains detailed trip records of yellow taxis, green taxis, and for-hire vehicles (e.g., Uber and Lyft) operating in New York City since 2009. The data includes pickup and drop-off times, locations, fares, passenger counts, and much more, providing a rich resource for data analysis and visualisation.

How to Get It Easily?

aws s3 ls s3://kubox-public-datasets/nyc-tlc/ --request-payer requester

To make the dataset more accessible, it is hosted in an S3 bucket configured as Requester Pays. This means that the requester (you) pays for the request and data transfer costs, while Kubox continues to bear the storage costs. Requests must therefore be authenticated and must acknowledge the charge, which is what the --request-payer requester flag does. For more details, refer to the AWS Requester Pays documentation.
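The same listing can be done from Python. The sketch below assumes a boto3 S3 client is passed in. The helper name list_requester_pays_keys is hypothetical, and the only Requester Pays specific part is the RequestPayer="requester" argument on each call.

```python
from typing import Any, Iterator


def list_requester_pays_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under prefix, acknowledging Requester Pays charges."""
    kwargs = {"Bucket": bucket, "Prefix": prefix, "RequestPayer": "requester"}
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        # Listings are paginated; keep going until S3 reports no more pages.
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


# Usage with boto3 (requires AWS credentials; the requester is billed):
#   import boto3
#   s3 = boto3.client("s3")
#   for key in list_requester_pays_keys(s3, "kubox-public-datasets", "nyc-tlc/"):
#       print(key)
```

Taking the client as a parameter keeps the helper easy to exercise with a stub, so the pagination logic can be checked without touching S3 or incurring charges.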

The dataset was historically available on the Registry of Open Data on AWS. However, due to a known issue with that listing, the raw data had to be manually downloaded for this analysis.

What is the Official Source?

The official NYC Taxi and Limousine Commission (TLC) dataset can be downloaded from the TLC Trip Record Data page. This dataset has been the backbone of numerous transportation studies, including the renowned analysis by Todd W. Schneider, who explored over 3 billion taxi and for-hire vehicle trips. Check out his blog post, Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance, for deeper insights into this dataset.
