Unified Engine for Data Engineering, Analytics & ML/AI
Daft exposes both SQL and Python DataFrame interfaces as first-class citizens and is written in Rust. Daft provides a snappy and delightful local interactive experience, but also seamlessly scales to petabyte-scale distributed workloads.
USE CASES
Daft exposes both SQL and Python DataFrame interfaces for:
Combine the performance of DuckDB, the Pythonic UX of Polars, and the scalability of Apache Spark for data engineering from MB to PB scale
- Scale ETL workflows effortlessly from local to distributed environments
- Enjoy a Python-first experience without JVM dependency hell
- Leverage native integrations with cloud storage, open catalogs, and data formats

# Aggregate data in an ETL job and write to Delta Lake
import daft
df = daft.from_pydict(
    {
        "product": ["keyboard", "mouse", "monitor", "keyboard"],
        "price": [28.5, 10.99, 100.99, 23.8],
        "quantity": [2, 5, 1, 3],
    }
)
df = df.groupby("product").agg(
    daft.col("price").mean().alias("avg_price"),
    daft.col("quantity").sum().alias("total_quantity"),
)
df.write_deltalake("aggregate_sales")
Blend the snappiness of DuckDB with the scalability of Spark/Trino for unified local and distributed analytics
- Utilize complementary SQL and Python interfaces for versatile analytics
- Perform snappy local exploration with DuckDB-like performance
- Seamlessly scale to the cloud, outperforming distributed engines like Spark and Trino

# Query parquet data from S3 using SQL
import daft
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))
)
df = daft.read_parquet(
    "s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/tpep_pickup_day=1/"
)
df = daft.sql(
    """
    SELECT
        VendorID,
        trip_distance,
        total_amount
    FROM df
    WHERE
        trip_distance > 1.0
        AND total_amount > 10.0
    """
)
df.show()
Streamline ML/AI workflows with efficient dataloading from open formats like Parquet and JPEG
- Load data efficiently from open formats directly into PyTorch or NumPy
- Schedule large-scale model batch inference on distributed GPU clusters
- Optimize data curation with advanced clustering, deduplication, and filtering

# Load images from URLs and decode them into PyTorch tensors
import daft
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))
)
df = daft.from_pydict(
    {
        "image_urls": [
            "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            "https://www.getdaft.io/_static/daftLogo.png",
            "https://www.getdaft.io/_static/use-cases-1.png",
        ],
    }
)
df = df.with_column("data", df["image_urls"].url.download())
df = df.with_column("images", df["data"].image.decode())
pytorch_ds = df.to_torch_iter_dataset()
for batch in pytorch_ds:
    images = batch["images"]
    print(images.shape)
"Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually."
Patrick Ames
Principal Engineer @ Amazon

"Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars has been replaced by simple Daft queries, running on internet-scale unstructured datasets."
Maurice Weber
PhD AI Researcher @ Together AI

"Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads."
Tony Wang
Data @ Anthropic, PhD @ Stanford
"Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integrations with Ray lets us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more."
Alexander Filipchik
Head of Infrastructure @ City Storage Systems (CloudKitchens)

CAPABILITIES
Blazing efficiency, designed for multimodal data.
Integrate with ML/AI libraries.
Daft plugs directly into your ML/AI stack through efficient zero-copy integrations with essential Python libraries such as PyTorch and Ray. It also allows requesting GPUs as a resource for running models.
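As a hedged sketch of the GPU resource request (assuming a recent Daft release where @daft.udf accepts num_gpus; the model call is a stand-in):

import daft

@daft.udf(return_dtype=daft.DataType.float64(), num_gpus=1)
def score(prices):
    # A real UDF would move the batch to the GPU and call a PyTorch model;
    # this stand-in just passes the values through.
    return [float(p) for p in prices.to_pylist()]

df = daft.from_pydict({"price": [28.5, 10.99, 100.99]})
df = df.with_column("score", score(df["price"]))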

Go Distributed and Out-of-Core.
Daft runs locally with a lightweight multithreaded backend. When your local machine is no longer sufficient, it scales seamlessly to run out-of-core on a distributed cluster.
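For illustration, a minimal sketch of switching to the distributed Ray runner (the cluster address is a placeholder); the DataFrame code itself does not change:

import daft

# Point Daft at a Ray cluster instead of the default local runner.
daft.context.set_runner_ray(address="ray://<head-node>:10001")

df = daft.read_parquet("s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/")
print(df.count_rows())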

Execute Complex Operations.
Daft can handle User-Defined Functions (UDFs) on DataFrame columns, allowing you to apply complex expressions and operations on Python objects with the full flexibility required for ML/AI.
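A minimal sketch of a column UDF using the @daft.udf decorator; the body is ordinary Python and could just as easily call NumPy or Pillow:

import daft

@daft.udf(return_dtype=daft.DataType.string())
def shout(products):
    # Arbitrary Python logic runs per batch of column values.
    return [p.upper() for p in products.to_pylist()]

df = daft.from_pydict({"product": ["keyboard", "mouse", "monitor"]})
df = df.with_column("product_loud", shout(df["product"]))
df.show()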

Native Support for Cloud Storage.
Daft integrates natively with cloud object stores such as AWS S3, using high-throughput async I/O to read and write open data formats like Parquet directly from the cloud.
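A hedged sketch of reading Parquet straight from S3 with an explicit IOConfig, reusing the public bucket from the examples above (assuming read_parquet accepts a per-read io_config):

import daft

io_config = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))
df = daft.read_parquet(
    "s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/tpep_pickup_day=1/",
    io_config=io_config,
)
df.show()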

Deliver Unmatched Speed.
Underneath its Python API, Daft is built in blazing fast Rust code. Rust powers Daft’s vectorized execution and async I/O, allowing Daft to outperform frameworks such as Spark.

COMMUNITY
Get updates, contribute code, or say hi!

We hold contributor syncs on the last Thursday of every month to discuss new features and technical deep dives. Add it to your calendar here.
Daft Engineering Blog

Join us as we explore innovative ways to handle vast datasets, optimize performance, and revolutionize your data workflows!

Take the next step with an easy tutorial.
MNIST Digit Classification
Use a simple deep learning model to run classification on the MNIST image dataset.
Running LLMs on the Red Pajamas Dataset
Perform a similarity search on Stack Exchange questions using language models and embeddings.
Querying Images with UDFs
Query the Open Images dataset to retrieve the top N “reddest” images using NumPy and Pillow inside Daft UDFs.
Image Generation on GPUs
Generate images from text prompts using a deep learning model (Mini DALL-E) and Daft UDFs.