GroupBy

When performing aggregations such as sum, mean and count, you may often want to group data by certain keys and aggregate within those keys.

Calling df.groupby() returns a GroupedDataFrame object, which is a view of the original DataFrame with additional context on which keys to group by. You can then call an aggregation method on it to aggregate within each group; each such call returns a new DataFrame.
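
To make this two-step model concrete, here is a minimal plain-Python sketch of the idea. It is not Daft's implementation: the `GroupedRows` class and its `sum` method are hypothetical names used only for illustration. The grouped object holds just a reference to the rows and the key name, and the grouping work happens only when an aggregation is called.

```python
from collections import defaultdict


class GroupedRows:
    """Hypothetical stand-in for a grouped view: the original rows plus the grouping key."""

    def __init__(self, rows, key):
        self.rows = rows  # reference to the original data, not a copy
        self.key = key    # which field to group on

    def sum(self, column):
        # Grouping happens here, when an aggregation is requested.
        totals = defaultdict(int)
        for row in self.rows:
            totals[row[self.key]] += row[column]
        # Return a new list of rows, one per group.
        return [{self.key: k, column: v} for k, v in totals.items()]


rows = [
    {"pet": "cat", "age": 1},
    {"pet": "dog", "age": 2},
    {"pet": "dog", "age": 3},
    {"pet": "cat", "age": 4},
]
grouped = GroupedRows(rows, "pet")
result = grouped.sum("age")
```

The input rows are left untouched; `result` is a separate collection with one entry per distinct key.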

class GroupedDataFrame(df: daft.dataframe.dataframe.DataFrame, group_by: daft.expressions.expressions.ExpressionsProjection)
agg(*to_agg: Union[Expression, Iterable[Expression]]) → DataFrame

Performs aggregations on this GroupedDataFrame. Different aggregations can be mixed in a single call.

For a full list of aggregation expressions, see Aggregation Expressions.

Example

>>> import daft
>>> from daft import col
>>> df = daft.from_pydict({
...     "pet": ["cat", "dog", "dog", "cat"],
...     "age": [1, 2, 3, 4],
...     "name": ["Alex", "Jordan", "Sam", "Riley"]
... })
>>> grouped_df = df.groupby("pet").agg(
...     col("age").min().alias("min_age"),
...     col("age").max().alias("max_age"),
...     col("pet").count().alias("count"),
...     col("name").any_value()
... )
>>> grouped_df.show()
╭──────┬─────────┬─────────┬────────┬────────╮
│ pet  ┆ min_age ┆ max_age ┆ count  ┆ name   │
│ ---  ┆ ---     ┆ ---     ┆ ---    ┆ ---    │
│ Utf8 ┆ Int64   ┆ Int64   ┆ UInt64 ┆ Utf8   │
╞══════╪═════════╪═════════╪════════╪════════╡
│ cat  ┆ 1       ┆ 4       ┆ 2      ┆ Alex   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ dog  ┆ 2       ┆ 3       ┆ 2      ┆ Jordan │
╰──────┴─────────┴─────────┴────────┴────────╯

(Showing first 2 of 2 rows)

Parameters:

*to_agg (Union[Expression, Iterable[Expression]]) – aggregation expressions

Returns:

DataFrame with grouped aggregations

Return type:

DataFrame
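
The min, max and count columns of the example above can be cross-checked with a plain-Python sketch. This only illustrates what a mixed aggregation computes per group; it does not use Daft, and the variable names are made up for the illustration.

```python
from collections import defaultdict

pets = ["cat", "dog", "dog", "cat"]
ages = [1, 2, 3, 4]

# Collect the ages belonging to each pet.
ages_by_pet = defaultdict(list)
for pet, age in zip(pets, ages):
    ages_by_pet[pet].append(age)

# Apply several different aggregations to each group.
summary = {
    pet: {"min_age": min(group), "max_age": max(group), "count": len(group)}
    for pet, group in ages_by_pet.items()
}
```

Here `summary["cat"]` holds a minimum of 1, a maximum of 4 and a count of 2, matching the first row of the table.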

agg_concat(*cols: Union[Expression, str]) → DataFrame

Performs grouped concat on this GroupedDataFrame.

Returns:

DataFrame with grouped concatenated list per column.

Return type:

DataFrame

agg_list(*cols: Union[Expression, str]) → DataFrame

Performs grouped list on this GroupedDataFrame.

Returns:

DataFrame with grouped list per column.

Return type:

DataFrame
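
The difference between a grouped list and a grouped concat is easiest to see side by side. The plain-Python sketch below assumes that a grouped list collects each group's values into one list, and that a grouped concat takes list-valued columns and joins each group's lists end to end. It illustrates that reading only and is not Daft's code.

```python
from collections import defaultdict

keys = ["a", "a", "b"]
values = [1, 2, 3]                # scalar column, input to a grouped list
list_values = [[1, 2], [3], [4]]  # list column, input to a grouped concat

listed = defaultdict(list)
for k, v in zip(keys, values):
    listed[k].append(v)           # one element appended per row

concatenated = defaultdict(list)
for k, lst in zip(keys, list_values):
    concatenated[k].extend(lst)   # each row's list is spliced in
```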

any_value(*cols: Union[Expression, str]) → DataFrame

Returns an arbitrary value from each group for the given columns. Values for different columns are not guaranteed to come from the same row.

Parameters:

*cols (Union[str, Expression]) – columns to get

Returns:

DataFrame with any values.

Return type:

DataFrame
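
The caveat about rows matters when several columns are requested. The following plain-Python sketch, which does not use Daft, shows one outcome that the description permits: each column's value is picked independently, so the combined result need not correspond to any input row.

```python
rows = [
    {"pet": "cat", "age": 1, "name": "Alex"},
    {"pet": "cat", "age": 4, "name": "Riley"},
]

ages = [r["age"] for r in rows]
names = [r["name"] for r in rows]

# Each column is picked independently; here the picks come from different rows.
picked = {"pet": "cat", "age": ages[0], "name": names[-1]}

# The pair (1, "Riley") is a permitted result, yet it matches no input row.
matches_a_row = any(
    r["age"] == picked["age"] and r["name"] == picked["name"] for r in rows
)
```

If values from the same row are required, a different approach than an arbitrary per-column pick is needed.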

count(*cols: Union[Expression, str]) → DataFrame

Performs grouped count on this GroupedDataFrame.

Returns:

DataFrame with grouped count per column.

Return type:

DataFrame
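
A per-column count raises the question of how missing values are treated. The plain-Python sketch below assumes that nulls are skipped, which is a common convention but is not stated on this page; verify the behavior against your Daft version before relying on it.

```python
from collections import Counter

keys = ["a", "a", "b", "b"]
scores = [10, None, 30, 40]

# Assumption: a null (None) does not contribute to the count.
counts = Counter(k for k, s in zip(keys, scores) if s is not None)
```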

map_groups(udf: Expression) → DataFrame

Apply a user-defined function to each group. The name of the resultant column will default to the name of the first input column.

Example

>>> import daft, statistics
>>>
>>> df = daft.from_pydict({"group": ["a", "a", "a", "b", "b", "b"], "data": [1, 20, 30, 4, 50, 600]})
>>>
>>> @daft.udf(return_dtype=daft.DataType.float64())
... def std_dev(data):
...     return [statistics.stdev(data.to_pylist())]
>>>
>>> df = df.groupby("group").map_groups(std_dev(df["data"]))
>>> df.show()
╭───────┬────────────────────╮
│ group ┆ data               │
│ ---   ┆ ---                │
│ Utf8  ┆ Float64            │
╞═══════╪════════════════════╡
│ a     ┆ 14.730919862656235 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b     ┆ 331.62026476076517 │
╰───────┴────────────────────╯

(Showing first 2 of 2 rows)

Parameters:

udf (Expression) – User-defined function to apply to each group.

Returns:

DataFrame with the output of the user-defined function for each group

Return type:

DataFrame
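
The numbers in the example above can be reproduced without Daft. This plain-Python sketch shows the underlying pattern: split the values by group, then call the function once per group with that group's values.

```python
import statistics
from collections import defaultdict

groups = ["a", "a", "a", "b", "b", "b"]
data = [1, 20, 30, 4, 50, 600]

# Split the data column by group key.
by_group = defaultdict(list)
for g, x in zip(groups, data):
    by_group[g].append(x)

# Call the function once per group, passing that group's values.
result = {g: statistics.stdev(xs) for g, xs in by_group.items()}
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by N - 1), which is what the example's UDF uses.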

max(*cols: Union[Expression, str]) → DataFrame

Performs grouped max on this GroupedDataFrame.

Parameters:

*cols (Union[str, Expression]) – columns to take the maximum of

Returns:

DataFrame with grouped max.

Return type:

DataFrame

mean(*cols: Union[Expression, str]) → DataFrame

Performs grouped mean on this GroupedDataFrame.

Parameters:

*cols (Union[str, Expression]) – columns to average

Returns:

DataFrame with grouped mean.

Return type:

DataFrame

min(*cols: Union[Expression, str]) → DataFrame

Performs grouped min on this GroupedDataFrame.

Parameters:

*cols (Union[str, Expression]) – columns to take the minimum of

Returns:

DataFrame with grouped min.

Return type:

DataFrame

stddev(*cols: Union[Expression, str]) → DataFrame

Performs grouped standard deviation on this GroupedDataFrame.

Example

>>> import daft
>>> df = daft.from_pydict({"keys": ["a", "a", "a", "b"], "col_a": [0, 1, 2, 100]})
>>> df = df.groupby("keys").stddev()
>>> df.show()
╭──────┬───────────────────╮
│ keys ┆ col_a             │
│ ---  ┆ ---               │
│ Utf8 ┆ Float64           │
╞══════╪═══════════════════╡
│ a    ┆ 0.816496580927726 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ 0                 │
╰──────┴───────────────────╯

(Showing first 2 of 2 rows)

Parameters:

*cols (Union[str, Expression]) – columns to compute the standard deviation of

Returns:

DataFrame with grouped standard deviation.

Return type:

DataFrame
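
The values in the example above are consistent with the population standard deviation (dividing by N rather than N - 1): group a gives about 0.816 rather than 1, and the single-row group b gives 0. The sketch below uses Python's `statistics` module to compare the two conventions. It says nothing about how Daft computes the result internally.

```python
import statistics

group_a = [0, 1, 2]
group_b = [100]

population_a = statistics.pstdev(group_a)  # divides by N
sample_a = statistics.stdev(group_a)       # divides by N - 1

# A single value has a population standard deviation of zero;
# the sample version is undefined for fewer than two values.
population_b = statistics.pstdev(group_b)
```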

sum(*cols: Union[Expression, str]) → DataFrame

Performs grouped sum on this GroupedDataFrame.

Parameters:

*cols (Union[str, Expression]) – columns to sum

Returns:

DataFrame with grouped sums.

Return type:

DataFrame