Overview#
Welcome to Daft!
Daft is a unified data engine for data engineering, analytics, and ML/AI. It exposes both SQL and Python DataFrame interfaces as first-class citizens and is written in Rust. Daft provides a snappy and delightful local interactive experience, but also seamlessly scales to petabyte-scale distributed workloads.
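As a small taste of both interfaces, here is a minimal sketch (the data and column names are purely illustrative):

```python
import daft

# Build a tiny DataFrame from a Python dictionary (illustrative data)
df = daft.from_pydict({"name": ["alice", "bob", "carol"], "score": [0.9, 0.7, 0.95]})

# Python DataFrame interface: filter and project using expressions
df.where(daft.col("score") > 0.8).select("name").show()

# SQL interface: the same query, referring to the in-scope DataFrame `df`
daft.sql("SELECT name FROM df WHERE score > 0.8").show()
```

Both queries are planned lazily and only execute when results are requested, for example via `show()` or `collect()`.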
Use Cases#
Data Engineering
Combine the performance of DuckDB, the Pythonic UX of Polars, and the scalability of Apache Spark for data engineering from MB to PB scale
- Scale ETL workflows effortlessly from local to distributed environments
- Enjoy a Python-first experience without JVM dependency hell
- Leverage native integrations with cloud storage, open catalogs, and data formats
Data Analytics
Blend the snappiness of DuckDB with the scalability of Spark/Trino for unified local and distributed analytics
- Utilize complementary SQL and Python interfaces for versatile analytics
- Perform snappy local exploration with DuckDB-like performance
- Seamlessly scale to the cloud, outperforming distributed engines like Spark and Trino
ML/AI
Streamline ML/AI workflows with efficient dataloading from open formats like Parquet and JPEG
- Load data efficiently from open formats directly into PyTorch or NumPy (see the dataloading sketch after this list)
- Schedule large-scale model batch inference on distributed GPU clusters
- Optimize data curation with advanced clustering, deduplication, and filtering
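A minimal dataloading sketch, assuming a hypothetical Parquet path and that your installed Daft version provides `DataFrame.to_torch_iter_dataset()`:

```python
import daft
from torch.utils.data import DataLoader

# Lazily read a (hypothetical) Parquet dataset; nothing is materialized yet
df = daft.read_parquet("s3://my-bucket/training-data/*.parquet")

# Keep only the columns the model needs
df = df.select("features", "label")

# Stream rows into PyTorch; to_torch_iter_dataset() is assumed to be available
# in your Daft version and yields an iterable-style torch Dataset
dataset = df.to_torch_iter_dataset()
loader = DataLoader(dataset, batch_size=32)
```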
Technology#
Daft boasts strong integrations with technologies common across these workloads:
- Cloud Object Storage: Record-setting I/O performance for integrations with S3 cloud storage, battle-tested at exabyte-scale at Amazon
- ML/AI Python Ecosystem: First-class integrations with PyTorch and NumPy for efficient interoperability with your ML/AI stack
- Data Catalogs/Table Formats: Capabilities to effectively query table formats such as Apache Iceberg, Delta Lake and Apache Hudi
- Seamless Data Interchange: Zero-copy integration with Apache Arrow
- Multimodal/ML Data: Native functionality for data modalities such as tensors, images, URLs, long-form text, and embeddings (see the short sketch after this list)
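To give a feel for the multimodal column types, a short sketch that downloads images from a column of URLs and decodes them. The URLs are placeholders, and the `.url.download()`, `.image.decode()`, and `.image.resize()` expressions are assumed to be available in your Daft version:

```python
import daft
from daft import col

# A column of image URLs (placeholder values)
df = daft.from_pydict({
    "image_url": [
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
    ]
})

# Download the bytes behind each URL, then decode them into an image column
df = df.with_column("image", col("image_url").url.download().image.decode())

# Decoded images are first-class values that further expressions can operate on
df = df.with_column("thumbnail", col("image").image.resize(64, 64))
df.show()
```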
Learning Daft#
This user guide aims to help you master Daft for all of your data needs.
Looking to get started with Daft ASAP?
The Daft User Guide is a useful resource for taking deeper dives into specific Daft concepts, but if you are ready to jump into code, take a look at these resources:
- Quickstart: Itching to run some Daft code? Hit the ground running with our 10-minute quickstart notebook.
- API Documentation: Searchable documentation and reference material for Daft's public API.
Get Started#
- Install Daft from your terminal and discover more advanced installation options.
- Install Daft, create your first DataFrame, and get started with common DataFrame operations.
- Learn about the terminology related to Daft, such as DataFrames, Expressions, Query Plans, and more.
- Understand the different components of Daft under the hood.
Daft in Depth#
- Learn how to perform core DataFrame operations in Daft, including selection, filtering, joining, and sorting.
- Daft expressions enable computations on DataFrame columns using Python or SQL (see the combined sketch after this list).
- How to use Daft to read data from diverse sources like files, databases, and URLs.
- How to use Daft to write DataFrames to files or other destinations.
- Daft DataTypes define the types of data in a DataFrame, from simple primitives to complex structures.
- Daft supports SQL for constructing query plans and expressions, while integrating with Python expressions.
- Daft supports aggregations and grouping across entire DataFrames and within grouped subsets of data.
- Daft allows you to define custom UDFs to process data at scale with flexibility over inputs and outputs.
- Daft is built to work with multimodal data types, including URLs and images.
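As a taste of the topics above, a combined sketch of expressions, grouped aggregation, and a UDF. The data and the UDF are purely illustrative, and passing aggregation expressions to `agg()` is assumed to be supported by your Daft version:

```python
import daft
from daft import col

df = daft.from_pydict({
    "city": ["sf", "sf", "nyc"],
    "temp_c": [18.0, 21.0, 25.0],
})

# Expressions: derive a new column from an existing one
df = df.with_column("temp_f", col("temp_c") * 9 / 5 + 32)

# Aggregations and grouping: mean temperature per city
per_city = df.groupby("city").agg(col("temp_f").mean())

# UDF: a toy batch-wise function; return_dtype declares the output column type
@daft.udf(return_dtype=daft.DataType.string())
def label_city(cities):
    # `cities` arrives as a daft.Series; return one output value per input row
    return [c.upper() for c in cities.to_pylist()]

df = df.with_column("city_label", label_city(col("city")))
per_city.show()
df.show()
```

The same transformations can also be written in SQL via `daft.sql(...)`, which can reference in-scope DataFrames by variable name.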
More Resources#
Contribute to Daft#
If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us on GitHub 🚀
Take a look at the many issues tagged with good first issue in our repo. If any of them interest you, feel free to chime in on the issue itself, or join our Distributed Data Slack Community and send us a message in #daft-dev. Daft team members will be happy to assign an issue to you and provide guidance if needed!