GroupBy
When performing aggregations such as sum, mean and count, you may often want to group data by certain keys and aggregate within those keys.
Calling df.groupby() returns a GroupedDataFrame object, which is a view of the original DataFrame with additional context on which keys to group on. You can then call an aggregation method to run that aggregation within each group, returning a new DataFrame.
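The group-then-aggregate flow described above can be modelled in plain Python. This is only a sketch of the semantics and does not call Daft: `group_rows` and the row dictionaries are hypothetical names introduced for illustration, and the data mirrors the `pet`/`age` columns used in the `agg` example on this page.

```python
from collections import defaultdict


def group_rows(rows, key):
    """Bucket row dictionaries by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)  # insertion order follows first appearance of each key


rows = [
    {"pet": "cat", "age": 1},
    {"pet": "dog", "age": 2},
    {"pet": "dog", "age": 3},
    {"pet": "cat", "age": 4},
]

# Step 1: attach grouping context (roughly what df.groupby("pet") records).
groups = group_rows(rows, "pet")

# Step 2: run each aggregation inside every group, one output row per group.
result = [
    {
        "pet": pet,
        "min_age": min(r["age"] for r in members),
        "max_age": max(r["age"] for r in members),
        "count": len(members),
    }
    for pet, members in groups.items()
]
```

Each group collapses to a single output row, which is why the aggregation methods return a new DataFrame rather than modifying the original.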
- class GroupedDataFrame(df: daft.dataframe.dataframe.DataFrame, group_by: daft.expressions.expressions.ExpressionsProjection)
- agg(*to_agg: Union[Expression, Iterable[Expression]]) -> DataFrame
Performs aggregations on this GroupedDataFrame. Different aggregations can be mixed in a single call.
For a full list of aggregation expressions, see Aggregation Expressions.
Example
>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({
...     "pet": ["cat", "dog", "dog", "cat"],
...     "age": [1, 2, 3, 4],
...     "name": ["Alex", "Jordan", "Sam", "Riley"]
... })
>>> grouped_df = df.groupby("pet").agg(
...     col("age").min().alias("min_age"),
...     col("age").max().alias("max_age"),
...     col("pet").count().alias("count"),
...     col("name").any_value()
... )
>>> grouped_df.show()
╭──────┬─────────┬─────────┬────────┬────────╮
│ pet  ┆ min_age ┆ max_age ┆ count  ┆ name   │
│ ---  ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
│ Utf8 ┆ Int64   ┆ Int64   ┆ UInt64 ┆ Utf8   │
╞══════╪═════════╪═════════╪════════╪════════╡
│ cat  ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog  ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
╰──────┴─────────┴─────────┴────────┴────────╯
(Showing first 2 of 2 rows)
- Parameters:
*to_agg (Union[Expression, Iterable[Expression]]) – aggregation expressions
- Returns:
DataFrame with grouped aggregations
- Return type:
DataFrame
- agg_concat(*cols: Union[Expression, str]) -> DataFrame
Performs grouped concat on this GroupedDataFrame.
- Returns:
DataFrame with grouped concatenated list per column.
- Return type:
DataFrame
- agg_list(*cols: Union[Expression, str]) -> DataFrame
Performs grouped list on this GroupedDataFrame.
- Returns:
DataFrame with grouped list per column.
- Return type:
DataFrame
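The difference between a grouped list and a grouped concat is easy to miss from the one-line descriptions. The plain-Python sketch below models it under one assumption: the grouped list collects each group's values into one list, while the grouped concat expects values that are already lists and joins them end to end. The helpers `grouped_list` and `grouped_concat` are hypothetical and are not part of Daft.

```python
from collections import defaultdict
from itertools import chain


def grouped_list(keys, values):
    """Collect every value of a group into one list per key."""
    out = defaultdict(list)
    for key, value in zip(keys, values):
        out[key].append(value)
    return dict(out)


def grouped_concat(keys, list_values):
    """Join the list-typed values of a group into one flat list per key."""
    nested = grouped_list(keys, list_values)
    return {key: list(chain.from_iterable(lists)) for key, lists in nested.items()}


keys = ["a", "a", "b"]

# Scalar column: models df.groupby("keys").agg_list("col")
listed = grouped_list(keys, [1, 2, 3])

# List column: models df.groupby("keys").agg_concat("col")
concatenated = grouped_concat(keys, [[1, 2], [3], [4, 5]])
```

Here `listed` maps `"a"` to `[1, 2]`, and `concatenated` maps `"a"` to `[1, 2, 3]`: the first wraps scalars in a list, the second flattens one level of existing lists.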
- any_value(*cols: Union[Expression, str]) -> DataFrame
Returns an arbitrary value from each group for each requested column. Values for each column are not guaranteed to be from the same row.
- Parameters:
*cols (Union[str, Expression]) – columns to get
- Returns:
DataFrame with any values.
- Return type:
DataFrame
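The caveat that values are not guaranteed to come from the same row can be illustrated with a plain-Python model. This sketch does not call Daft, and `grouped_any_value` is a hypothetical helper. Random choice is used here only as a modelling device for "arbitrary"; nothing on this page says Daft picks values randomly.

```python
import random
from collections import defaultdict


def grouped_any_value(rows, key, cols, rng):
    """Pick one value per column per group, choosing each column independently."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Each column is drawn on its own, so a result row may mix values
    # taken from different source rows of the same group.
    return {
        group: {col: rng.choice(members)[col] for col in cols}
        for group, members in groups.items()
    }


rows = [
    {"pet": "cat", "age": 1, "name": "Alex"},
    {"pet": "cat", "age": 4, "name": "Riley"},
    {"pet": "dog", "age": 2, "name": "Jordan"},
    {"pet": "dog", "age": 3, "name": "Sam"},
]

picked = grouped_any_value(rows, "pet", ["age", "name"], random.Random(0))
```

Under this model a result such as age 1 paired with the name "Riley" is possible for the "cat" group, so callers should not treat the output as a sampled row.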
- count(*cols: Union[Expression, str]) -> DataFrame
Performs grouped count on this GroupedDataFrame.
- Returns:
DataFrame with grouped count per column.
- Return type:
DataFrame
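A per-column grouped count can be modelled as follows. The sketch assumes that null (`None`) values are skipped, which is my understanding of Daft's default count behaviour but is not stated on this page, so check it against the Aggregation Expressions reference. `grouped_count` is a hypothetical helper and does not call Daft.

```python
from collections import defaultdict


def grouped_count(keys, values):
    """Count the non-null values of one column within each group."""
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        # Assumption: nulls do not contribute to the count.
        counts[key] += 0 if value is None else 1
    return dict(counts)


counts = grouped_count(["a", "a", "b", "b"], [1, None, 3, 4])
```

With this input, group "a" has one non-null value and group "b" has two.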
- map_groups(udf: Expression) -> DataFrame
Apply a user-defined function to each group. The name of the resultant column will default to the name of the first input column.
Example
>>> import daft, statistics
>>>
>>> df = daft.from_pydict({"group": ["a", "a", "a", "b", "b", "b"], "data": [1, 20, 30, 4, 50, 600]})
>>>
>>> @daft.udf(return_dtype=daft.DataType.float64())
... def std_dev(data):
...     return [statistics.stdev(data.to_pylist())]
>>>
>>> df = df.groupby("group").map_groups(std_dev(df["data"]))
>>> df.show()
╭───────┬────────────────────╮
│ group ┆ data               │
│ ---   ┆ ---                │
│ Utf8  ┆ Float64            │
╞═══════╪════════════════════╡
│ a     ┆ 14.730919862656235 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 331.62026476076517 │
╰───────┴────────────────────╯
(Showing first 2 of 2 rows)
- Parameters:
udf (Expression) – User-defined function to apply to each group.
- Returns:
DataFrame with grouped aggregations
- Return type:
DataFrame
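To make the per-group call pattern in the example above concrete, here is a plain-Python model of it: the function receives all values of one group at once and returns the result for that group. `map_groups_model` is a hypothetical stand-in that does not call Daft; the data matches the example.

```python
import statistics
from collections import defaultdict


def map_groups_model(keys, values, func):
    """Call `func` once per group, passing that group's values as a list."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: func(members) for key, members in groups.items()}


keys = ["a", "a", "a", "b", "b", "b"]
data = [1, 20, 30, 4, 50, 600]

# statistics.stdev is the sample standard deviation, as used in the example UDF.
per_group = map_groups_model(keys, data, statistics.stdev)
```

The function runs twice here, once with `[1, 20, 30]` and once with `[4, 50, 600]`, which corresponds to the two rows in the example output.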
- max(*cols: Union[Expression, str]) -> DataFrame
Performs grouped max on this GroupedDataFrame.
- Parameters:
*cols (Union[str, Expression]) – columns to max
- Returns:
DataFrame with grouped max.
- Return type:
DataFrame
- mean(*cols: Union[Expression, str]) -> DataFrame
Performs grouped mean on this GroupedDataFrame.
- Parameters:
*cols (Union[str, Expression]) – columns to mean
- Returns:
DataFrame with grouped mean.
- Return type:
DataFrame
- min(*cols: Union[Expression, str]) -> DataFrame
Performs grouped min on this GroupedDataFrame.
- Parameters:
*cols (Union[str, Expression]) – columns to min
- Returns:
DataFrame with grouped min.
- Return type:
DataFrame
- stddev(*cols: Union[Expression, str]) -> DataFrame
Performs grouped standard deviation on this GroupedDataFrame.
Example
>>> import daft
>>> df = daft.from_pydict({"keys": ["a", "a", "a", "b"], "col_a": [0,1,2,100]})
>>> df = df.groupby("keys").stddev()
>>> df.show()
╭──────┬───────────────────╮
│ keys ┆ col_a             │
│ ---  ┆ ---               │
│ Utf8 ┆ Float64           │
╞══════╪═══════════════════╡
│ a    ┆ 0.816496580927726 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ 0                 │
╰──────┴───────────────────╯
(Showing first 2 of 2 rows)
- Parameters:
*cols (Union[str, Expression]) – columns to stddev
- Returns:
DataFrame with grouped standard deviation.
- Return type:
DataFrame
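The values in the stddev example (about 0.8165 for `[0, 1, 2]` and 0 for the single-row group) are consistent with the population standard deviation, which divides by N. That differs from the sample standard deviation (`statistics.stdev`, dividing by N - 1) used in the map_groups example. This is inferred from the printed output, not from a statement on this page. The stdlib-only sketch below shows the two side by side.

```python
import statistics

group_a = [0, 1, 2]
group_b = [100]

# Population standard deviation: sqrt(sum((x - mean)^2) / N)
pop_a = statistics.pstdev(group_a)
pop_b = statistics.pstdev(group_b)

# Sample standard deviation: sqrt(sum((x - mean)^2) / (N - 1))
sample_a = statistics.stdev(group_a)

# A single-row group has no sample standard deviation at all.
try:
    statistics.stdev(group_b)
    single_row_sample_defined = True
except statistics.StatisticsError:
    single_row_sample_defined = False
```

If the sample standard deviation is required, a UDF passed to map_groups, as in the example earlier on this page, is one way to compute it.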