User Defined Functions (UDFs)
- daft.udf(*, return_dtype: daft.types.ExpressionType, input_columns: dict[str, type], **kwargs) → Callable
Decorator for creating a UDF. This decorator wraps any custom Python code into a function that can be used to process columns of data in a Daft DataFrame.
Each UDF will process a batch of columnar data, and output a batch of columnar data as well. At runtime, Daft runs your UDF on a partition of data at a time, and your UDF will receive input batches of length equal to the partition size.
Note
UDFs are much slower than native Daft expressions because they run Python code instead of Daft’s optimized Rust kernels. You should only use UDFs when performing operations that are not supported by Daft’s native expressions, or when you need to run custom Python code.
The following example UDF, while simple, will be much slower than df["x"] + 100, since it runs as Python instead of as a Rust addition kernel using the Expressions API.

Example:
>>> @udf(
>>>     # Annotate the return dtype as an integer
>>>     return_dtype=int,
>>>     # Mark the `x` input parameter as a column, and tell Daft to pass it in as a list
>>>     input_columns={"x": list},
>>> )
>>> def add_val(x, val=1):
>>>     # Your custom Python code here
>>>     return [value + val for value in x]
To invoke your UDF, you can use the DataFrame.with_column method:

>>> df = DataFrame.from_pydict({"x": [1, 2, 3]})
>>> df = df.with_column("x_add_100", add_val(df["x"], val=100))
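As a quick sanity check of the UDF logic itself (plain Python, not a Daft API call), the new x_add_100 column would hold each input value plus 100:

>>> # Hypothetical standalone check of the add_val body, outside Daft:
>>> [value + 100 for value in [1, 2, 3]]
[101, 102, 103]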
UDF Function Outputs
The return_dtype argument specifies what type of column your UDF will return. For user convenience, you may specify a Python type such as str, int, float, bool and datetime.date, which will be converted into a Daft dtype for you.

Python types that are not recognized as Daft types will be represented as a Daft Python object dtype. For example, if you specify return_dtype=np.ndarray, then your returned column will have type PY[np.ndarray].

>>> @udf(
>>>     # Annotate the return dtype as a numpy array
>>>     return_dtype=np.ndarray,
>>>     input_columns={"x": list},
>>> )
>>> def create_np_zero_arrays(x, dim=128):
>>>     return [np.zeros((dim,)) for i in range(len(x))]
Your UDF needs to return a batch of columnar data, and can do so as any one of the following array types (a short sketch follows this list):

- Numpy Arrays (np.ndarray)
- Pandas Series (pd.Series)
- Polars Series (polars.Series)
- PyArrow Arrays (pa.Array or pa.ChunkedArray)
- Python lists (list or typing.List)
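For example, the earlier add_val UDF could just as well return a PyArrow array instead of a Python list. This is a minimal sketch (the add_val_arrow name is made up for illustration), assuming the return container can differ from the input container:

>>> import pyarrow as pa
>>> @udf(return_dtype=int, input_columns={"x": list})
>>> def add_val_arrow(x, val=1):
>>>     # Same logic as add_val, but returned as a PyArrow array
>>>     return pa.array([value + val for value in x])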
UDF Function Inputs
The input_columns argument is a dictionary. The keys specify which input parameters of your function are columns, and the values specify the array container type that Daft should use when passing data into your function.

Here's an example where the same column is used multiple times in a UDF, but passed in as different types to illustrate how this works!
>>> @udf(
>>>     return_dtype=int,
>>>     input_columns={"x_as_list": list, "x_as_numpy": np.ndarray, "x_as_pandas": pd.Series},
>>> )
>>> def example_func(x_as_list, x_as_numpy, x_as_pandas):
>>>     assert isinstance(x_as_list, list)
>>>     assert isinstance(x_as_numpy, np.ndarray)
>>>     assert isinstance(x_as_pandas, pd.Series)
>>>     return x_as_list
>>>
>>> df = df.with_column("foo", example_func(df["x"], df["x"], df["x"]))
In the above example, when your DataFrame is executed, Daft will pass batches of column data into your function as equal-length arrays. The actual type of those arrays will take on the types indicated by input_columns.

Input types supported by Daft UDFs and their respective type annotations (a Polars-based sketch follows the list):
- Numpy Arrays (np.ndarray)
- Pandas Series (pd.Series)
- Polars Series (polars.Series)
- PyArrow Arrays (pa.Array or pa.ChunkedArray)
- Python lists (list or typing.List)
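Here is a minimal sketch of requesting a Polars Series as the input container (the mean_shift name and logic are made up for illustration, not taken from the examples above):

>>> import polars as pl
>>> @udf(return_dtype=float, input_columns={"x": pl.Series})
>>> def mean_shift(x):
>>>     # x arrives as a polars.Series, so Series arithmetic works directly
>>>     return x - x.mean()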
Note
Certain array formats have some restrictions around the type of data that they can handle:
1. Null Handling: In Pandas and Numpy, nulls are represented as NaNs for numeric types, and Nones for non-numeric types. Additionally, the existence of nulls will trigger a type casting from integer to float arrays (see the short illustration after this note). If null handling is important to your use case, we recommend using one of the other available options.
2. Python Objects: PyArrow array formats cannot support object-type columns.
We recommend using Python lists if performance is not a major consideration, and using the arrow-native formats such as PyArrow arrays and Polars series if performance is important.
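To illustrate the null-handling caveat above, this is plain Pandas behavior (not Daft-specific): an integer column containing a null is silently promoted to a float dtype.

>>> import pandas as pd
>>> # A single None forces the integer values to be stored as float64 with NaN
>>> pd.Series([1, 2, None]).dtype
dtype('float64')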
Stateful UDFs
UDFs can also be created from classes, which allows for initialization of expensive state that can be shared between invocations of the class, for example downloading data or creating a model.
>>> @udf(return_dtype=int, input_columns={"features_col": np.ndarray})
>>> class RunModel:
>>>     def __init__(self):
>>>         # Perform expensive initializations
>>>         self._model = create_model()
>>>
>>>     def __call__(self, features_col):
>>>         return self._model(features_col)
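A hypothetical invocation, assuming a class-based UDF is applied to columns the same way as a function UDF (the "features" column name is a placeholder):

>>> df = df.with_column("model_results", RunModel(df["features"]))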
- Parameters
  - f – Function to wrap as a UDF. Accepts column inputs as Numpy arrays and returns a column of data as a Polars Series, Numpy array, Python list, or Pandas Series.
  - return_dtype – The return dtype of the UDF.
  - input_columns – Optional dictionary of input parameter names to their types. If provided, this will override type hints provided using the function's type annotations (see the sketch below).
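As a sketch of the annotation-based alternative mentioned for input_columns (assuming type hints are picked up when the dictionary is omitted; the add_one name is illustrative only):

>>> @udf(return_dtype=int)
>>> def add_one(x: list):
>>>     return [value + 1 for value in x]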