Exploratory data analysis on three billion rows
Analysing the New York taxi dataset.
Problem
Traditional Data Platform as a Service (DPaaS) solutions like Databricks, Amazon SageMaker or Google Vertex AI often lack the agility needed for multi-cloud and hybrid environments. The issue is compounded in edge and fog environments that produce terabytes of data daily.
For example, running a query on the New York Taxi dataset (300GB) stored in AWS S3 from a Google Colab Notebook would result in USD 20 in egress costs and take over an hour just to download the data.
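The figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it derives the per-GB rate and the network throughput implied by the quoted USD 20 and one hour, rather than using any published price list. The function names are hypothetical, and 1 GB is assumed to mean 10^9 bytes.

```python
# Back-of-the-envelope check of the egress example.
# Assumptions: 1 GB = 1e9 bytes; the rate is derived from the figures quoted
# in the text, not taken from an official price list.

DATASET_GB = 300        # dataset size quoted in the text
QUOTED_COST_USD = 20    # egress cost quoted in the text


def egress_cost_usd(size_gb: float, rate_usd_per_gb: float) -> float:
    """Cost of transferring size_gb out of the cloud at a flat per-GB rate."""
    return size_gb * rate_usd_per_gb


def download_hours(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Hours needed to download size_gb over a sustained link speed."""
    bits = size_gb * 8e9
    seconds = bits / (bandwidth_mbit_s * 1e6)
    return seconds / 3600


# Per-GB rate implied by the quoted cost (about USD 0.067 per GB).
implied_rate = QUOTED_COST_USD / DATASET_GB

# Sustained throughput needed to finish in exactly one hour (about 667 Mbit/s).
# Any slower link takes longer than an hour.
break_even_mbit_s = DATASET_GB * 8e9 / 3600 / 1e6

if __name__ == "__main__":
    print(f"Implied egress rate: USD {implied_rate:.3f} per GB")
    print(f"Throughput for a one-hour download: {break_even_mbit_s:.0f} Mbit/s")
    print(f"Download time at 500 Mbit/s: {download_hours(DATASET_GB, 500):.2f} h")
```

Running compute in the same region as the bucket avoids both terms: the data never crosses the cloud boundary, so there is no egress charge and no long download.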
Solution
This example demonstrates how Kubox allows users to deploy an ephemeral Kubernetes cluster within their own AWS account and region, co-locating compute with data for faster Exploratory Data Analysis (EDA).
The source code is available in our GitHub repository at https://github.com/kubox-ai/notebooks
Clone the Kubox notebooks repository to your local machine:
git clone https://github.com/kubox-ai/notebooks
Create a Kubox cluster in your own AWS account by:
An example cluster-basic.yaml:
This notebook analyses NYC taxi trips from multiple data sources (Yellow Taxi, Green Taxi, and For-Hire Vehicle trips) between 2014 and 2023. It uses Daft, a unified engine for data engineering, analytics and ML/AI, for data manipulation, and Seaborn for visualisation. The data comes from public NYC taxi datasets in Parquet format, and the analysis provides insights into trip trends over time.
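As a rough idea of what such an analysis can look like, here is a minimal sketch that counts trips per year with Daft and plots the result with Seaborn. It is not the notebook's actual code. The bucket layout (`<base>/<dataset>/*.parquet`), the helper names, and the pickup-column mapping are assumptions for illustration, and the `requester_pays` option assumes a Daft version whose `S3Config` supports it. Files with differing schemas across years may need extra handling.

```python
# Illustrative sketch only: layout, helper names and column mapping are assumed.

# Assumed pickup-timestamp column for each source; check the real schema first.
PICKUP_COLUMNS = {
    "yellow": "tpep_pickup_datetime",
    "green": "lpep_pickup_datetime",
    "fhv": "pickup_datetime",
}


def dataset_glob(base_uri: str, dataset: str) -> str:
    """Build a glob for all Parquet files of one dataset (hypothetical layout)."""
    return f"{base_uri.rstrip('/')}/{dataset}/*.parquet"


def trips_per_year(base_uri: str, dataset: str, first_year=2014, last_year=2023):
    """Return a pandas DataFrame with one row per year and a 'trips' count."""
    import daft  # third-party, imported lazily
    from daft import col
    from daft.io import IOConfig, S3Config

    # Requester Pays buckets need the reader to opt in to transfer charges.
    io_config = IOConfig(s3=S3Config(requester_pays=True))
    pickup = PICKUP_COLUMNS[dataset]

    df = daft.read_parquet(dataset_glob(base_uri, dataset), io_config=io_config)
    df = df.with_column("year", col(pickup).dt.year())
    df = df.where((col("year") >= first_year) & (col("year") <= last_year))
    counts = df.groupby("year").agg(col(pickup).count().alias("trips"))
    return counts.sort("year").to_pandas()


def plot_trips_per_year(trips):
    """Draw a yearly trend line from the DataFrame returned above."""
    import seaborn as sns  # third-party, imported lazily

    ax = sns.lineplot(data=trips, x="year", y="trips", marker="o")
    ax.set(xlabel="Year", ylabel="Trips")
    return ax
```

Daft builds the query lazily, so only the pickup column is read from each Parquet file before aggregation. Only the small per-year summary is converted to pandas for plotting.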
Background
Data locality is the practice of storing data near the processing unit that will consume it, reducing data movement and latency in data-intensive applications. It matters more as data volumes grow. As more data is produced outside traditional data centres, particularly in edge and fog environments, having an agile, versatile platform that can be spun up anywhere (including cloud) within minutes is transformative. This approach not only reduces the time from data to insights but also slashes costs dramatically.
Dataset
The NYC Taxi dataset is one of the most popular and widely used datasets for analysing urban transportation trends. It contains detailed trip records of yellow taxis, green taxis, and for-hire vehicles (e.g., Uber and Lyft) operating in New York City since 2009. The data includes pickup and drop-off times, locations, fares, passenger counts, and much more, providing a rich resource for data analysis and visualisation.
How to get it easily?
To make the dataset more accessible, it is hosted in an S3 bucket configured as Requester Pays. This means that the requester (you) pays for the data transfer and download costs, while Kubox continues to bear the storage costs. For more details, refer to the AWS Requester Pays documentation.
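In practice, reading from a Requester Pays bucket means every request must explicitly acknowledge the charge. Here is a minimal sketch using boto3, where `RequestPayer="requester"` is the documented way to do so. The helper names are hypothetical, and the bucket and key are supplied by the caller because the actual bucket name is not given here.

```python
# Illustrative sketch: helper names are hypothetical; bucket and key are
# placeholders supplied by the caller.


def requester_pays_kwargs(bucket: str, key: str) -> dict:
    """Arguments for S3 get_object that accept the Requester Pays charge."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download_object(bucket: str, key: str, dest_path: str) -> None:
    """Stream one object to disk, billing the transfer to the caller's account."""
    import boto3  # third-party, imported lazily; needs AWS credentials

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_kwargs(bucket, key))
    with open(dest_path, "wb") as out:
        for chunk in response["Body"].iter_chunks():
            out.write(chunk)
```

Requests that omit this acknowledgement are rejected with an access-denied error. Anonymous access does not work either, because AWS needs an authenticated account to bill.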
What is the Official Source?
The official NYC Taxi and Limousine Commission (TLC) dataset can be downloaded from the TLC Trip Record Data page. This dataset has been the backbone of numerous transportation studies, including the renowned analysis by Todd W. Schneider, who explored over 3 billion taxi and for-hire vehicle trips. Check out his blog post, Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance, for deeper insights into this dataset.