Problem
Traditional Data Platform as a Service (DPaaS) solutions such as Databricks, Amazon SageMaker, or Google Vertex AI often lack the agility needed for multi-cloud and hybrid environments. The issue is compounded in edge and fog environments that produce terabytes of data daily. For example, running a query on the New York Taxi dataset (300 GB) stored in AWS S3 from a Google Colab notebook would incur 20 USD in egress costs and take over an hour just to download the data.

Solution
This example demonstrates how Kubox allows users to deploy an ephemeral Kubernetes cluster within their own AWS account and region, co-locating compute with data for faster Exploratory Data Analysis (EDA).

The source code is available in our GitHub repository: https://github.com/kubox-ai/notebooks

Clone the Kubox notebooks repository to your local machine:
cluster-basic.yaml:
Deploying compute close to the data with an ephemeral approach not only optimises costs and time but also ensures up-to-date security, making environments resilient against advanced threats and enabling a new era of efficient, cloud-agnostic data platforms.
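
To make the cost and time argument from the Problem section concrete, the following is a rough back-of-the-envelope sketch. Only the 300 GB dataset size comes from the text above. The per-GB egress rate and the sustained download bandwidth are illustrative assumptions, not quoted provider prices or measured speeds, and the function names are hypothetical.

```python
def egress_cost_usd(size_gb: float, usd_per_gb: float) -> float:
    """Cost of moving size_gb out of a cloud region at a flat per-GB rate."""
    return size_gb * usd_per_gb


def transfer_minutes(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Minutes needed to download size_gb at a sustained bandwidth."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    return size_megabits / bandwidth_mbit_s / 60


DATASET_GB = 300                 # dataset size stated in the Problem section
ASSUMED_USD_PER_GB = 0.07        # assumption: illustrative flat egress rate
ASSUMED_BANDWIDTH_MBIT_S = 500   # assumption: sustained download speed

cost = egress_cost_usd(DATASET_GB, ASSUMED_USD_PER_GB)
minutes = transfer_minutes(DATASET_GB, ASSUMED_BANDWIDTH_MBIT_S)
print(f"Estimated egress cost: {cost:.2f} USD")
print(f"Estimated download time: {minutes:.0f} minutes")
```

Under these assumed values the estimate lands in the same range as the figures quoted above (around 20 USD and more than an hour). When compute runs in the same region as the data, both terms largely drop out, which is the point of co-locating the cluster with the bucket.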


Background
Data locality is the practice of storing data near the processing unit that will consume it, reducing data movement and latency in data-intensive applications. As the amount of data increases, the concept of data locality becomes increasingly important. As more data is produced outside traditional data centres, particularly in edge and fog environments, having an agile, versatile platform that can be spun up anywhere (including cloud) within minutes is transformative. This approach not only reduces the time from data to insights but also slashes costs dramatically.

Dataset
The NYC Taxi dataset is one of the most popular and widely used datasets for analysing urban transportation trends. It contains detailed trip records of yellow taxis, green taxis, and for-hire vehicles (e.g., Uber and Lyft) operating in New York City since 2009. The data includes pickup and drop-off times, locations, fares, passenger counts, and much more, providing a rich resource for data analysis and visualisation.

How to Get It Easily?
The dataset was historically available on the Registry of Open Data on AWS. However, due to this known issue, the raw data had to be manually downloaded for this analysis.
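
When downloading the raw data by hand, it can help to enumerate the monthly files up front. The sketch below is only an illustration: it assumes the trip records are published as one Parquet file per taxi type per month, named `<type>_tripdata_<YYYY>-<MM>.parquet`. That naming pattern and the helper name `monthly_file_names` are assumptions not confirmed by this article, so check them against the actual download source before relying on them.

```python
from datetime import date


def monthly_file_names(taxi_type: str, start: date, end: date) -> list[str]:
    """List one assumed file name per month from start to end, inclusive.

    Assumption: files follow the '<type>_tripdata_<YYYY>-<MM>.parquet' pattern.
    """
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names


# Example: yellow taxi files spanning a year boundary.
for name in monthly_file_names("yellow", date(2019, 11, 1), date(2020, 2, 1)):
    print(name)
```

The resulting list can then be fed to whatever download tool is in use, or compared against the files already present in the target bucket.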