Unified Engine for Data Engineering, Analytics & ML/AI

Written in Rust, Daft exposes both SQL and Python DataFrame interfaces as first-class citizens. It provides a snappy, delightful local interactive experience and scales seamlessly to petabyte-scale distributed workloads.

USE CASES

Daft exposes both SQL and Python DataFrame interfaces for:

Combine the performance of DuckDB, the Pythonic UX of Polars, and the scalability of Apache Spark for data engineering from MB to PB scale

  • Scale ETL workflows effortlessly from local to distributed environments
  • Enjoy a Python-first experience without JVM dependency hell
  • Leverage native integrations with cloud storage, open catalogs, and data formats
                          
# Aggregate data in an ETL job and write to Delta Lake

import daft

df = daft.from_pydict(
    {
        "product": ["keyboard", "mouse", "monitor", "keyboard"],
        "price": [28.5, 10.99, 100.99, 23.8],
        "quantity": [2, 5, 1, 3],
    }
)
df = df.groupby("product").agg(
    daft.col("price").mean().alias("avg_price"),
    daft.col("quantity").sum().alias("total_quantity"),
)
df.write_deltalake("aggregate_sales")
                          
                          

Blend the snappiness of DuckDB with the scalability of Spark/Trino for unified local and distributed analytics

  • Utilize complementary SQL and Python interfaces for versatile analytics
  • Perform snappy local exploration with DuckDB-like performance
  • Seamlessly scale to the cloud, outperforming distributed engines like Spark and Trino
                          
# Query parquet data from S3 using SQL

import daft

daft.set_planning_config(
    default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))
)
df = daft.read_parquet(
    "s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/tpep_pickup_day=1/"
)
df = daft.sql(
    """SELECT
    VendorID,
    trip_distance,
    total_amount
FROM
    df
WHERE
    trip_distance > 1.0
    AND total_amount > 10.0"""
)
df.show()
                          
                          

Streamline ML/AI workflows with efficient data loading from open formats like Parquet and JPEG

  • Load data efficiently from open formats directly into PyTorch or NumPy
  • Schedule large-scale model batch inference on distributed GPU clusters
  • Optimize data curation with advanced clustering, deduplication, and filtering
                          
# Load images from URLs and decode them into PyTorch tensors

import daft

daft.set_planning_config(
    default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True))
)
df = daft.from_pydict(
    {
        "image_urls": [
            "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            "https://www.getdaft.io/_static/daftLogo.png",
            "https://www.getdaft.io/_static/use-cases-1.png",
        ],
    }
)
df = df.with_column("data", df["image_urls"].url.download())
df = df.with_column("images", df["data"].image.decode())

pytorch_ds = df.to_torch_iter_dataset()
for batch in pytorch_ds:
    images = batch["images"]
    print(images.shape)
                          
                          

TRUSTED BY

"Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually."

Patrick Ames

Principal Engineer @ Amazon

"Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars has been replaced by simple Daft queries, running on internet-scale unstructured datasets."

Maurice Weber

PhD AI Researcher @ Together AI

"Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads."

Tony Wang

Data @ Anthropic, PhD @ Stanford

"Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integrations with Ray lets us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more."

Alexander Filipchik

Head Of Infrastructure at City Storage Systems (CloudKitchens)

CAPABILITIES

Blazing efficiency, designed for multimodal data.

Integrate with ML/AI libraries.

Daft plugs directly into your ML/AI stack through efficient zero-copy integrations with essential Python libraries such as PyTorch and Ray. It also lets you request GPUs as a resource for running models.
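
As a minimal sketch of handing data off to those libraries (assuming DataFrame.to_torch_iter_dataset and DataFrame.to_ray_dataset as in recent Daft releases, with PyTorch and Ray installed; column names are illustrative):

# Hand a Daft DataFrame to PyTorch and Ray (sketch)

import daft

df = daft.from_pydict(
    {
        "feature": [0.1, 0.2, 0.3],
        "label": [0, 1, 0],
    }
)

# Iterable dataset that can be consumed by a PyTorch training loop
torch_ds = df.to_torch_iter_dataset()

# Ray Dataset for further distributed processing (requires a Ray installation)
ray_ds = df.to_ray_dataset()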

read more

Go Distributed and Out-of-Core.

Daft runs locally with a lightweight multithreaded backend. When your local machine is no longer sufficient, it scales seamlessly to run out-of-core on a distributed cluster.
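
A minimal sketch of switching from the default local runner to a Ray cluster (the cluster address, bucket path, and column names are placeholders; daft.context.set_runner_ray is assumed to match your Daft version):

# Run the same query on a Ray cluster instead of the local multithreaded runner

import daft

# Point Daft at an existing Ray cluster (placeholder address);
# omitting the address starts Ray locally
daft.context.set_runner_ray(address="ray://head-node:10001")

df = daft.read_parquet("s3://my-bucket/events/*.parquet")  # placeholder path
df = df.groupby("event_type").agg(daft.col("amount").sum().alias("total_amount"))
df.show()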

read more

Execute Complex Operations.

Daft supports User-Defined Functions (UDFs) on DataFrame columns, letting you apply complex expressions and operations to Python objects with the full flexibility ML/AI workloads require.
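
A minimal sketch of a UDF (the function name and columns are illustrative; the @daft.udf decorator with a return_dtype and Series.to_pylist() are assumed to match your Daft version):

# Apply arbitrary Python logic to a column with a UDF

import daft


@daft.udf(return_dtype=daft.DataType.string())
def normalize_name(names):
    # The column arrives as a daft.Series; drop to plain Python for full flexibility
    return [n.strip().title() for n in names.to_pylist()]


df = daft.from_pydict({"name": ["  ada lovelace ", "alan TURING"]})
df = df.with_column("clean_name", normalize_name(df["name"]))
df.show()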

read more

Native Support for Cloud Storage.

Daft natively reads from and writes to cloud object storage such as AWS S3, Azure Blob Storage, and Google Cloud Storage. Its Rust-based async I/O delivers high-throughput access without extra connector libraries, configured through a simple IOConfig.
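
A minimal sketch of configuring credentials explicitly and passing them per read (the S3Config field names, bucket, and credentials below are assumptions/placeholders; check the I/O configuration docs for your Daft version):

# Read private S3 data with explicit credentials instead of the default credential chain

import daft

io_config = daft.io.IOConfig(
    s3=daft.io.S3Config(
        region_name="us-west-2",             # placeholder region
        key_id="YOUR_ACCESS_KEY_ID",         # placeholder credentials
        access_key="YOUR_SECRET_ACCESS_KEY",
    )
)

df = daft.read_parquet("s3://my-private-bucket/data/*.parquet", io_config=io_config)
df.show()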

read more

Deliver Unmatched Speed.

Underneath its Python API, Daft is built in blazing fast Rust code. Rust powers Daft’s vectorized execution and async I/O, allowing Daft to outperform frameworks such as Spark.

read more

ECOSYSTEM

Integrations? We'll build it.

Data Science and Machine Learning

Storage and Infrastructure


COMMUNITY

Get updates, contribute code, or say hi!

We hold contributor syncs on the last Thursday of every month to discuss new features and technical deep dives. Add them to your calendar.

Daft Engineering Blog

Join us as we explore innovative ways to handle vast datasets, optimize performance, and revolutionize your data workflows!

Take the next step with an easy tutorial.

MNIST Digit Classification

Use a simple deep learning model to run classification on the MNIST image dataset.

see tutorial

Running LLMs on the Red Pajamas Dataset

Perform a similarity search on Stack Exchange questions using language models and embeddings.

see tutorial

Querying Images with UDFs

Query the Open Images dataset to retrieve the top N "reddest" images using NumPy and Pillow inside Daft UDFs.

see tutorial

Image Generation on GPUs

Generate images from text prompts using a deep learning model (Mini DALL-E) and Daft UDFs.

see tutorial