Introduction#
Daft is a distributed query engine with a DataFrame API. The two key concepts to Daft are:
DataFrame
: a Table-like structure that represents rows and columns of dataExpression
: a symbolic representation of computation that transforms columns of the DataFrame to a new one.
With Daft, you create DataFrame
from a variety of sources (e.g. reading data from files, data catalogs or from Python dictionaries) and use Expression
to manipulate data in that DataFrame. Let’s take a closer look at these two abstractions!
DataFrame#
Conceptually, a DataFrame is a “table” of data, with rows and columns.
Using this abstraction of a DataFrame, you can run common tabular operations such as:
Filtering rows:
df.where(...)
Creating new columns as a computation of existing columns:
df.with_column(...)
Joining two tables together:
df.join(...)
Sorting a table by the values in specified column(s):
df.sort(...)
Grouping and aggregations:
df.groupby(...).agg(...)
Daft DataFrames are:
Distributed: your data is split into Partitions and can be processed in parallel/on different machines
Lazy: computations are enqueued in a query plan which is then optimized and executed only when requested
Multimodal: columns can contain complex datatypes such as tensors, images and Python objects
Since Daft is lazy, it can actually execute the query plan on a variety of different backends. By default, it will run computations locally using Python multithreading. However if you need to scale to large amounts of data that cannot be processed on a single machine, using the Ray runner allows Daft to run computations on a Ray cluster instead.
Expressions#
The other important concept to understand when working with Daft are expressions.
Because Daft is “lazy”, it needs a way to represent computations that need to be performed on its data so that it can execute these computations at some later time. The answer to this is an Expression
!
The simplest Expressions are:
The column expression:
col("a")
which is used to refer to “some column named ‘a’”Or, if you already have an existing DataFrame
df
with a column named “a”, you can refer to its column with Python’s square bracket indexing syntax:df["a"]
The literal expression:
lit(100)
which represents a column that always takes on the provided value
Daft then provides an extremely rich Expressions library to allow you to compose different computations that need to happen. For example:
from daft import col, DataType
# Take the column named "a" and add 1 to each element
col("a") + 1
# Take the column named "a", cast it to a string and check each element, returning True if it starts with "1"
col("a").cast(DataType.string()).str.startswith("1")
Expressions are used in DataFrame operations, and the names of these Expressions are resolved to column names on the DataFrame that they are running on. Here is an example:
import daft
# Create a dataframe with a column "a" that has values [1, 2, 3]
df = daft.from_pydict({"a": [1, 2, 3]})
# Create new columns called "a_plus_1" and "a_startswith_1" using Expressions
df = df.select(
col("a"),
(col("a") + 1).alias("a_plus_1"),
col("a").cast(DataType.string()).str.startswith("1").alias("a_startswith_1"),
)
df.show()
+---------+------------+------------------+
| a | a_plus_1 | a_startswith_1 |
| Int64 | Int64 | Boolean |
+=========+============+==================+
| 1 | 2 | true |
+---------+------------+------------------+
| 2 | 3 | false |
+---------+------------+------------------+
| 3 | 4 | false |
+---------+------------+------------------+
(Showing first 3 rows)