daft.DataFrame

class daft.DataFrame(builder: LogicalPlanBuilder)

A Daft DataFrame is a table of data. It has columns, where each column has a type and the same number of items (rows) as all other columns.
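A minimal sketch of building and previewing a DataFrame from in-memory data, assuming the daft.from_pydict classmethod:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
>>> df.show()  # prints a preview of the rows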

__init__(builder: LogicalPlanBuilder) → None

Constructs a DataFrame from a given LogicalPlanBuilder. Users are expected to create DataFrames via the classmethods on DataFrame rather than calling this constructor directly.

Parameters:

builder – LogicalPlanBuilder describing the steps required to arrive at this DataFrame

Methods

__init__(builder)

Constructs a DataFrame according to a given LogicalPlan.

agg(*to_agg)

Performs aggregations on this DataFrame.
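The signature suggests expression-based aggregations; a sketch, assuming aggregation expressions such as daft.col(...).sum() are accepted:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
>>> df.agg(col("a").sum(), col("b").mean()).show()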

agg_concat(*cols)

Performs a global list concatenation aggregation on the DataFrame.

agg_list(*cols)

Performs a global list aggregation on the DataFrame.

any_value(*cols)

Returns an arbitrary value from each column of this DataFrame.

collect([num_preview_rows])

Executes the entire DataFrame and materializes the results.
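Daft DataFrames are lazy: operations build up a plan that only runs when a method such as collect() is invoked. A small sketch:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> df = df.where(daft.col("a") > 1)  # lazy: nothing executes yet
>>> df = df.collect()                 # executes the plan and materializes results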

concat(other)

Concatenates two DataFrames together in a "vertical" concatenation.
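A sketch of vertical concatenation, assuming both DataFrames share the same schema:

>>> import daft
>>> df1 = daft.from_pydict({"a": [1, 2]})
>>> df2 = daft.from_pydict({"a": [3, 4]})
>>> df1.concat(df2).show()  # rows of df2 appended below rows of df1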

count(*cols)

Performs a global count on the DataFrame.

count_rows()

Executes the DataFrame to count the number of rows.

distinct()

Computes unique rows, dropping duplicates.

drop_nan(*cols)

Drops rows that contain NaNs.

drop_null(*cols)

Drops rows that contain NaNs or NULLs.
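A sketch of dropping rows with missing values in a given column:

>>> import daft
>>> df = daft.from_pydict({"a": [1.0, None, 3.0]})
>>> df.drop_null("a").show()  # keeps only rows where "a" is not null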

exclude(*names)

Drops columns from the current DataFrame by name.

explain([show_all, simple])

Prints the (logical and physical) plans that will be executed to produce this DataFrame.

explode(*columns)

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.
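A sketch of exploding a list column:

>>> import daft
>>> df = daft.from_pydict({"id": [1, 2], "tags": [["x", "y"], ["z"]]})
>>> df.explode(daft.col("tags")).show()  # one row per tag; "id" is duplicated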

groupby(*group_by)

Performs a GroupBy on the DataFrame for aggregation.
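Grouped aggregation composes with agg(); a sketch, again assuming expression-based aggregations:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"k": ["a", "a", "b"], "v": [1, 2, 3]})
>>> df.groupby("k").agg(col("v").sum()).show()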

into_partitions(num)

Splits or coalesces the DataFrame into num partitions.

iter_partitions()

Begins executing this DataFrame and returns an iterator over the partitions.

join(other[, on, left_on, right_on, how, ...])

Column-wise join of the current DataFrame with another DataFrame, similar to a SQL JOIN.
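A sketch of an inner join on a shared key column:

>>> import daft
>>> left = daft.from_pydict({"id": [1, 2], "x": ["a", "b"]})
>>> right = daft.from_pydict({"id": [2, 3], "y": ["c", "d"]})
>>> left.join(right, on="id", how="inner").show()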

limit(num)

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT.

max(*cols)

Performs a global max on the DataFrame.

mean(*cols)

Performs a global mean on the DataFrame.

min(*cols)

Performs a global min on the DataFrame.

num_partitions()

Returns the number of partitions of this DataFrame.

repartition(num, *partition_by)

Repartitions the DataFrame to num partitions.

sample(fraction[, with_replacement, seed])

Samples a fraction of rows from the DataFrame.

schema()

Returns the Schema of the DataFrame, which provides information about each column.

select(*columns)

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.
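A sketch of selecting and deriving columns with expressions, assuming Expression.alias for renaming:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df.select(col("a"), (col("a") + col("b")).alias("a_plus_b")).show()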

show([n])

Executes enough of the DataFrame in order to display the first n rows.

sort(by[, desc])

Sorts the DataFrame globally.
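A sketch of a global descending sort:

>>> import daft
>>> df = daft.from_pydict({"a": [3, 1, 2]})
>>> df.sort("a", desc=True).show()  # 3, 2, 1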

sum(*cols)

Performs a global sum on the DataFrame.

to_arrow([cast_tensors_to_ray_tensor_dtype])

Converts the current DataFrame to a pyarrow Table.

to_dask_dataframe([meta])

Converts the current Daft DataFrame to a Dask DataFrame.

to_pandas([cast_tensors_to_ray_tensor_dtype])

Converts the current DataFrame to a pandas DataFrame.

to_pydict()

Converts the current DataFrame to a Python dictionary.
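A sketch of round-tripping data back into Python, where each column name maps to a list of that column's values:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> d = df.to_pydict()  # a plain dict, e.g. {"a": [1, 2, 3]}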

to_ray_dataset()

Converts the current DataFrame to a Ray Dataset, which is useful for running distributed ML model training in Ray.

to_torch_iter_dataset()

Converts the current DataFrame into a Torch IterableDataset for use with PyTorch.

to_torch_map_dataset()

Converts the current DataFrame into a map-style Torch Dataset for use with PyTorch.

where(predicate)

Filters rows via a predicate expression, similar to SQL WHERE.
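Filtering composes with boolean expressions; a sketch:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> df.where(col("a") > 1).show()  # keeps the rows where "a" is 2 and 3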

with_column(column_name, expr[, ...])

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.
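A sketch of adding a derived column:

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> df.with_column("a_doubled", col("a") * 2).show()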

write_csv(root_dir[, partition_cols, io_config])

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

write_iceberg(table[, mode])

Writes the DataFrame to an Iceberg Table, returning a new DataFrame with the operations that occurred.

write_parquet(root_dir[, compression, ...])

Writes the DataFrame as Parquet files, returning a new DataFrame with paths to the files that were written.
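A sketch of writing Parquet files, where "output/" is a hypothetical local directory and the returned DataFrame lists the written file paths:

>>> import daft
>>> df = daft.from_pydict({"a": [1, 2, 3]})
>>> result = df.write_parquet("output/")  # "output/" is a hypothetical path
>>> result.show()                         # one row per written file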

Attributes

column_names

Returns the column names of the DataFrame as a list of strings.

columns

Returns the columns of the DataFrame as a list of Expressions.