daft.Expression.str.extract_all

Expression.str.extract_all(pattern: str | daft.expressions.expressions.Expression, index: int = 0) → Expression

Extracts the specified match group from all regex matches in each string in a string column.

Notes

This expression always returns a list of strings. If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, an empty list is returned.
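The rules above can be modeled for a single string with Python's standard `re` module. This is only a rough sketch of the documented behavior, not Daft's implementation: `extract_all_sketch` is a hypothetical helper name, `re` stands in for Daft's regex engine (whose syntax may differ in details), and dropping matches where an optional group did not participate is an assumption.

```python
import re


def extract_all_sketch(text, pattern, index=0):
    """Hypothetical per-string model of the documented extract_all rules."""
    compiled = re.compile(pattern)
    # A group index the pattern does not define yields an empty list.
    if index > compiled.groups:
        return []
    # index=0 selects the entire match; index=n selects capture group n.
    # Assumption: matches where the group did not participate are skipped.
    return [
        m.group(index)
        for m in compiled.finditer(text)
        if m.group(index) is not None
    ]


regex = r"(\d)(\d*)"
whole = extract_all_sketch("123-456", regex)       # entire matches
first = extract_all_sketch("123-456", regex, 1)    # first capture group
missing = extract_all_sketch("123-456", regex, 3)  # group 3 does not exist
no_match = extract_all_sketch("abc", regex)        # pattern never matches
```

Here `whole` holds one entry per match, `first` holds only the leading digit of each match, and both `missing` and `no_match` are empty lists.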

Example

>>> import daft
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("matches", df["x"].str.extract_all(regex)).collect()
╭─────────┬────────────╮
│ x       ┆ matches    │
│ ---     ┆ ---        │
│ Utf8    ┆ List[Utf8] │
╞═════════╪════════════╡
│ 123-456 ┆ [123, 456] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [789, 012] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [345, 678] │
╰─────────┴────────────╯

Extract the first capture group by passing index=1:

>>> df.with_column("matches", df["x"].str.extract_all(regex, 1)).collect()
╭─────────┬────────────╮
│ x       ┆ matches    │
│ ---     ┆ ---        │
│ Utf8    ┆ List[Utf8] │
╞═════════╪════════════╡
│ 123-456 ┆ [1, 4]     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [7, 0]     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [3, 6]     │
╰─────────┴────────────╯

Parameters:
  • pattern – The regex pattern to match, given as a string or an Expression

  • index – The index of the capture group to extract from each match; 0 (the default) returns the entire match

Returns:

A List[Utf8] expression containing the extracted group from every regex match

Return type:

Expression

See also

extract