daft.DataFrame

class DataFrame(builder: LogicalPlanBuilder)

A Daft DataFrame is a table of data.

It has columns, where each column has a type and the same number of items (rows) as all other columns.

__init__(builder: LogicalPlanBuilder) -> None

Constructs a DataFrame according to a given LogicalPlan.

Rather than calling this constructor directly, users are expected to create DataFrames via the classmethods on DataFrame.

Parameters:

builder – LogicalPlanBuilder describing the steps required to arrive at this DataFrame

Methods

__init__(builder)

Constructs a DataFrame according to a given LogicalPlan.

agg(*to_agg)

Perform aggregations on this DataFrame.

agg_concat(*cols)

Performs a global list concatenation aggregation on the DataFrame.

agg_list(*cols)

Performs a global list aggregation on the DataFrame.

agg_set(*cols)

Performs a global set aggregation on the DataFrame (ignoring nulls).

any_value(*cols)

Returns an arbitrary value for each of the given columns of this DataFrame.

collect([num_preview_rows])

Executes the entire DataFrame and materializes the results.

concat(other)

Concatenates two DataFrames together vertically (row-wise).

count(*cols)

Performs a global count on the DataFrame.

count_rows()

Executes the DataFrame to count the number of rows.

describe()

Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.

distinct()

Computes unique rows, dropping duplicates.

drop_nan(*cols)

Drops rows that contain NaNs.

drop_null(*cols)

Drops rows that contain NaNs or NULLs.

except_all(other)

Returns the set difference of two DataFrames, considering duplicates.

except_distinct(other)

Returns the set difference of two DataFrames.

exclude(*names)

Drops columns from the current DataFrame by name.

explain([show_all, format, simple, file])

Prints the (logical and physical) plans that will be executed to produce this DataFrame.

explode(*columns)

Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.

filter(predicate)

Filters rows via a predicate expression, similar to SQL WHERE.

groupby(*group_by)

Performs a GroupBy on the DataFrame for aggregation.

intersect(other)

Returns the intersection of two DataFrames.

intersect_all(other)

Returns the intersection of two DataFrames, including duplicates.

into_partitions(num)

Splits or coalesces DataFrame to num partitions.

iter_partitions([results_buffer_size])

Begins executing this DataFrame and returns an iterator over the partitions.

iter_rows([results_buffer_size, column_format])

Returns an iterator of rows for this DataFrame.

join(other[, on, left_on, right_on, how, ...])

Column-wise join of the current DataFrame with another DataFrame, similar to a SQL JOIN.

limit(num)

Limits the rows in the DataFrame to the first N rows, similar to a SQL LIMIT.

max(*cols)

Performs a global max on the DataFrame.

mean(*cols)

Performs a global mean on the DataFrame.

melt(ids[, values, variable_name, value_name])

Alias for unpivot.

min(*cols)

Performs a global min on the DataFrame.

num_partitions()

Returns the number of partitions of the DataFrame.

pivot(group_by, pivot_col, value_col, agg_fn)

Pivots a column of the DataFrame and performs an aggregation on the values.

repartition(num, *partition_by)

Repartitions DataFrame to num partitions.

sample(fraction[, with_replacement, seed])

Samples a fraction of rows from the DataFrame.

schema()

Returns the Schema of the DataFrame, which provides information about each column, as a Python object.

select(*columns)

Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.

show([n])

Executes enough of the DataFrame in order to display the first n rows.

sort(by[, desc, nulls_first])

Sorts DataFrame globally.

stddev(*cols)

Performs a global standard deviation on the DataFrame.

sum(*cols)

Performs a global sum on the DataFrame.

summarize()

Returns column statistics for the DataFrame.

to_arrow()

Converts the current DataFrame to a pyarrow Table.

to_arrow_iter([results_buffer_size])

Returns an iterator of pyarrow RecordBatches for this DataFrame.

to_dask_dataframe([meta])

Converts the current Daft DataFrame to a Dask DataFrame.

to_pandas([coerce_temporal_nanoseconds])

Converts the current DataFrame to a pandas DataFrame.

to_pydict()

Converts the current DataFrame to a Python dictionary.

to_pylist()

Converts the current DataFrame into a Python list.

to_ray_dataset()

Converts the current DataFrame to a Ray Dataset which is useful for running distributed ML model training in Ray.

to_torch_iter_dataset()

Converts the current DataFrame into a Torch IterableDataset for use with PyTorch.

to_torch_map_dataset()

Converts the current DataFrame into a map-style Torch Dataset for use with PyTorch.

transform(func, *args, **kwargs)

Apply a function that takes and returns a DataFrame.

union(other)

Returns the distinct union of two DataFrames.

union_all(other)

Returns the union of two DataFrames, including duplicates.

union_all_by_name(other)

Returns the union of two DataFrames, including duplicates, with columns matched by name.

union_by_name(other)

Returns the distinct union of two DataFrames, with columns matched by name.

unpivot(ids[, values, variable_name, value_name])

Unpivots a DataFrame from wide to long format.

where(predicate)

Filters rows via a predicate expression, similar to SQL WHERE.

with_column(column_name, expr)

Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.

with_column_renamed(existing, new)

Renames a column in the current DataFrame.

with_columns(columns)

Adds columns to the current DataFrame with Expressions, equivalent to a select with all current columns and the new ones.

with_columns_renamed(cols_map)

Renames multiple columns in the current DataFrame.

write_csv(root_dir[, write_mode, ...])

Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.

write_deltalake(table[, partition_cols, ...])

Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.

write_iceberg(table[, mode, io_config])

Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.

write_lance(uri[, mode, io_config])

Writes the DataFrame to a Lance table.

write_parquet(root_dir[, compression, ...])

Writes the DataFrame as parquet files, returning a new DataFrame with paths to the files that were written.

Attributes

column_names

Returns the column names of the DataFrame as a list of strings.

columns

Returns the columns of the DataFrame as a list of Expressions.