daft.Expression.str.extract

Expression.str.extract(pattern: str | daft.expressions.expressions.Expression, index: int = 0) -> Expression

Extracts the specified match group from the first regex match in each string of a string column.

Notes

If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, a null value is returned.
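The rules above can be illustrated for a single value with Python's standard `re` module. This is only a sketch of the documented semantics, not Daft's implementation: `extract_first` is a hypothetical helper, the handling of a null input and a negative index is an assumption, and Daft's regex dialect may differ from Python's in its details.

```python
import re
from typing import Optional


def extract_first(text: Optional[str], pattern: str, index: int = 0) -> Optional[str]:
    """Hypothetical helper mirroring the documented behaviour for one value."""
    if text is None:
        # Assumption: a null input yields a null output.
        return None
    match = re.search(pattern, text)  # only the first match is considered
    if match is None:
        # The pattern does not match: null.
        return None
    if index < 0 or index > match.re.groups:
        # The requested group does not exist: null (negative index is an assumption).
        return None
    # Group 0 is the entire match; 1..n are the capture groups.
    return match.group(index)


regex = r"(\d)(\d*)"
whole = extract_first("123-456", regex)       # entire first match
first = extract_first("123-456", regex, 1)    # first capture group
missing = extract_first("123-456", regex, 3)  # the pattern has only two groups
no_match = extract_first("abc", regex)        # no digits to match
```

Here `whole` is `"123"` and `first` is `"1"`, while `missing` and `no_match` are both `None`, which corresponds to the null value described above.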

Example

>>> import daft
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", df["x"].str.extract(regex)).collect()
╭─────────┬─────────╮
│ x       ┆ match   │
│ ---     ┆ ---     │
│ Utf8    ┆ Utf8    │
╞═════════╪═════════╡
│ 123-456 ┆ 123     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 789     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 345     │
╰─────────┴─────────╯

Extract the first capture group:

>>> df.with_column("match", df["x"].str.extract(regex, 1)).collect()
╭─────────┬─────────╮
│ x       ┆ match   │
│ ---     ┆ ---     │
│ Utf8    ┆ Utf8    │
╞═════════╪═════════╡
│ 123-456 ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 789-012 ┆ 7       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 345-678 ┆ 3       │
╰─────────┴─────────╯

Parameters:
  • pattern – The regex pattern to match, given as a string or an Expression

  • index – The index of the regex match group to extract; 0 (the default) returns the entire match

Returns:

A String expression containing the extracted match group, or null where the pattern does not match or the group does not exist

Return type:

Expression

See also

extract_all