Daft: The Distributed Python Dataframe

Daft is a fast and scalable Python dataframe for Complex Data and Machine Learning workloads.

Daft Python dataframes make it easy to load any data, such as PDF documents, images, protobufs, CSV, Parquet, and audio files, into a tabular dataframe structure for easy querying.

Get Started

Installing Daft is simple with pip:

pip install getdaft

More Resources

10-minutes to Daft

10-minute walkthrough of all of Daft's major functionality.

View Walkthrough

Tutorials

Hosted examples using Daft in various common use-cases.

View Tutorials

Docs

Developer documentation for referencing Daft APIs.

View Docs

Integrations

Daft is open source, and you can use any Python library when processing data in a dataframe. It also integrates with many other open-source technologies, plugging directly into your current infrastructure and systems.

Data Science and Machine Learning

NumPy: the Python numerical library
pandas: a Python dataframe library
Polars: a Python dataframe library
Ray: the Python distributed systems framework
Jupyter: notebooks for interactive computing

Storage

Apache Parquet: file formats
Apache Arrow: for efficient data serialization
AWS S3: for cloud storage
Google Cloud Storage: for cloud storage
Azure Blob Store: for cloud storage

Use-Cases

Data Science Experimentation

Daft enables data scientists and engineers to work from their preferred Python notebook environment for interactive experimentation on complex data.

Complex Data Warehousing

The Daft Python dataframe efficiently pipelines complex data from raw data lakes to clean, queryable datasets for analysis and reporting.

Machine Learning Training Dataset Curation

Modern Machine Learning is data-driven and relies on clean data. The Daft Python dataframe integrates with data-loading frameworks such as Ray and PyTorch to feed data to distributed model training.

Machine Learning Model Evaluation

Evaluating the performance of machine learning systems is challenging, but Daft Python dataframes make it easy to run models and SQL-style analyses at scale.


Key Features

Python UDF

Daft supports running User-Defined Functions (UDFs) on columns of Python objects: if Python supports it, Daft can handle it!

Interactive Computing

Daft embraces Python's dynamic and interactive nature, enabling fast, iterative experimentation on data in your notebook and on your laptop.

Distributed Computing

Daft integrates with frameworks such as Ray to run petabyte-scale dataframes on a cluster of machines in the cloud.