daft.DataFrame#
- class DataFrame(builder: LogicalPlanBuilder)[source]#
A Daft DataFrame is a table of data.
It has columns, where each column has a type and the same number of items (rows) as all other columns.
- __init__(builder: LogicalPlanBuilder) → None[source]#
Constructs a DataFrame according to a given LogicalPlan.
Users should instead call the classmethods on DataFrame to create a DataFrame.
- Parameters:
builder – LogicalPlan describing the steps required to arrive at this DataFrame
Methods
__init__(builder): Constructs a DataFrame according to a given LogicalPlan.
agg(*to_agg): Performs aggregations on this DataFrame.
agg_concat(*cols): Performs a global list concatenation agg on the DataFrame.
agg_list(*cols): Performs a global list agg on the DataFrame.
agg_set(*cols): Performs a global set agg on the DataFrame (ignoring nulls).
any_value(*cols): Returns an arbitrary value from each column of this DataFrame.
collect([num_preview_rows]): Executes the entire DataFrame and materializes the results.
concat(other): Concatenates two DataFrames together in a "vertical" concatenation.
count(*cols): Performs a global count on the DataFrame.
count_rows(): Executes the DataFrame to count the number of rows.
describe(): Returns the Schema of the DataFrame, which provides information about each column, as a new DataFrame.
distinct(): Computes unique rows, dropping duplicates.
drop_nan(*cols): Drops rows that contain NaNs.
drop_null(*cols): Drops rows that contain NaNs or NULLs.
except_all(other): Returns the set difference of two DataFrames, considering duplicates.
except_distinct(other): Returns the set difference of two DataFrames.
exclude(*names): Drops columns from the current DataFrame by name.
explain([show_all, format, simple, file]): Prints the (logical and physical) plans that will be executed to produce this DataFrame.
explode(*columns): Explodes a List column, where every element in each row's List becomes its own row, and all other columns in the DataFrame are duplicated across rows.
filter(predicate): Filters rows via a predicate expression, similar to SQL WHERE.
groupby(*group_by): Performs a GroupBy on the DataFrame for aggregation.
intersect(other): Returns the intersection of two DataFrames.
intersect_all(other): Returns the intersection of two DataFrames, including duplicates.
into_partitions(num): Splits or coalesces the DataFrame to num partitions.
iter_partitions([results_buffer_size]): Begins executing this DataFrame and returns an iterator over the partitions.
iter_rows([results_buffer_size, column_format]): Returns an iterator of rows for this DataFrame.
join(other[, on, left_on, right_on, how, ...]): Column-wise join of the current DataFrame with an other DataFrame, similar to a SQL JOIN.
limit(num): Limits the rows in the DataFrame to the first num rows, similar to a SQL LIMIT.
max(*cols): Performs a global max on the DataFrame.
mean(*cols): Performs a global mean on the DataFrame.
melt(ids[, values, variable_name, value_name]): Alias for unpivot.
min(*cols): Performs a global min on the DataFrame.
num_partitions(): Returns the number of partitions of the DataFrame.
pivot(group_by, pivot_col, value_col, agg_fn): Pivots a column of the DataFrame and performs an aggregation on the values.
repartition(num, *partition_by): Repartitions the DataFrame to num partitions.
sample(fraction[, with_replacement, seed]): Samples a fraction of rows from the DataFrame.
schema(): Returns the Schema of the DataFrame, which provides information about each column, as a Python object.
select(*columns): Creates a new DataFrame from the provided expressions, similar to a SQL SELECT.
show([n]): Executes enough of the DataFrame to display the first n rows.
sort(by[, desc, nulls_first]): Sorts the DataFrame globally.
stddev(*cols): Performs a global standard deviation on the DataFrame.
sum(*cols): Performs a global sum on the DataFrame.
summarize(): Returns column statistics for the DataFrame.
to_arrow(): Converts the current DataFrame to a pyarrow Table.
to_arrow_iter([results_buffer_size]): Returns an iterator of pyarrow RecordBatches for this DataFrame.
to_dask_dataframe([meta]): Converts the current Daft DataFrame to a Dask DataFrame.
to_pandas([coerce_temporal_nanoseconds]): Converts the current DataFrame to a pandas DataFrame.
to_pydict(): Converts the current DataFrame to a Python dictionary.
to_pylist(): Converts the current DataFrame into a Python list.
to_ray_dataset(): Converts the current DataFrame to a Ray Dataset, which is useful for running distributed ML model training in Ray.
to_torch_iter_dataset(): Converts the current DataFrame into a Torch IterableDataset for use with PyTorch.
to_torch_map_dataset(): Converts the current DataFrame into a map-style Torch Dataset for use with PyTorch.
transform(func, *args, **kwargs): Applies a function that takes and returns a DataFrame.
union(other): Returns the distinct union of two DataFrames.
union_all(other): Returns the union of two DataFrames, including duplicates.
union_all_by_name(other): Returns the union of two DataFrames, including duplicates, with columns matched by name.
union_by_name(other): Returns the distinct union of two DataFrames, with columns matched by name.
unpivot(ids[, values, variable_name, value_name]): Unpivots a DataFrame from wide to long format.
where(predicate): Filters rows via a predicate expression, similar to SQL WHERE.
with_column(column_name, expr): Adds a column to the current DataFrame with an Expression, equivalent to a select with all current columns and the new one.
with_column_renamed(existing, new): Renames a column in the current DataFrame.
with_columns(columns): Adds columns to the current DataFrame with Expressions, equivalent to a select with all current columns and the new ones.
with_columns_renamed(cols_map): Renames multiple columns in the current DataFrame.
write_csv(root_dir[, write_mode, ...]): Writes the DataFrame as CSV files, returning a new DataFrame with paths to the files that were written.
write_deltalake(table[, partition_cols, ...]): Writes the DataFrame to a Delta Lake table, returning a new DataFrame with the operations that occurred.
write_iceberg(table[, mode, io_config]): Writes the DataFrame to an Iceberg table, returning a new DataFrame with the operations that occurred.
write_lance(uri[, mode, io_config]): Writes the DataFrame to a Lance table.
write_parquet(root_dir[, compression, ...]): Writes the DataFrame as Parquet files, returning a new DataFrame with paths to the files that were written.
Attributes
column_names: Returns the column names of the DataFrame as a list of strings.
columns: Returns the columns of the DataFrame as a list of Expressions.