daft.Expression.str.extract_all
- Expression.str.extract_all(pattern: str | daft.expressions.expressions.Expression, index: int = 0) -> Expression
Extracts the specified match group from all regex matches in each string in a string column.
Notes
This expression always returns a list of strings. If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, an empty list is returned.
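The rules in this note can be approximated for a single string with Python's standard `re` module. This is only a sketch of the documented behavior, not Daft's implementation: the helper name `extract_all_sketch` is hypothetical, Daft's regex engine may differ from `re` in syntax details, and skipping a group that did not participate in a match is an assumption.

```python
import re
from typing import List


def extract_all_sketch(text: str, pattern: str, index: int = 0) -> List[str]:
    """Hypothetical helper: return group `index` from every match of `pattern` in `text`."""
    compiled = re.compile(pattern)
    # A group index the pattern does not define yields an empty list.
    if index < 0 or index > compiled.groups:
        return []
    results = []
    for match in compiled.finditer(text):
        value = match.group(index)  # index 0 is the entire match
        # Assumption: a group that did not participate in the match is skipped.
        if value is not None:
            results.append(value)
    return results


print(extract_all_sketch("123-456", r"(\d)(\d*)"))     # entire matches
print(extract_all_sketch("123-456", r"(\d)(\d*)", 1))  # first capture group
print(extract_all_sketch("no digits", r"(\d)(\d*)"))   # no match: empty list
```

A string without matches and an out-of-range group index both produce an empty list, mirroring the note above.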
Example
>>> import daft
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", df["x"].str.extract_all(regex)).collect()
╭─────────┬────────────╮
│ x       ┆ match      │
│ ---     ┆ ---        │
│ Utf8    ┆ List[Utf8] │
╞═════════╪════════════╡
│ 123-456 ┆ [123, 456] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [789, 012] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [345, 678] │
╰─────────┴────────────╯
(Showing first 3 of 3 rows)
Extract the first capture group
>>> df.with_column("match", df["x"].str.extract_all(regex, 1)).collect()
╭─────────┬────────────╮
│ x       ┆ match      │
│ ---     ┆ ---        │
│ Utf8    ┆ List[Utf8] │
╞═════════╪════════════╡
│ 123-456 ┆ [1, 4]     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ [7, 0]     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ [3, 6]     │
╰─────────┴────────────╯
(Showing first 3 of 3 rows)
- Parameters:
pattern – The regex pattern to match, given as a string or an Expression
index – The index of the regex match group to extract; 0 (the default) returns the entire match
- Returns:
a List[Utf8] expression with the extracted regex matches
- Return type:
Expression
See also