daft.Expression.str.extract


Expression.str.extract(pattern: str | daft.expressions.expressions.Expression, index: int = 0) → Expression

For each string in a string column, extracts the specified match group from the first regex match.

Notes

If index is 0, the entire match is returned. If the pattern does not match or the group does not exist, a null value is returned.
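The rules in this note can be modelled in plain Python. The following is only a sketch built on the standard `re` module, not Daft's implementation: the helper name `extract_first` is hypothetical, passing null inputs through as null is an assumption, and Daft's regex engine may differ from `re` in syntax details.

```python
import re
from typing import List, Optional


def extract_first(values: List[Optional[str]], pattern: str, index: int = 0) -> List[Optional[str]]:
    """Hypothetical pure-Python model of the extraction rules described above."""
    compiled = re.compile(pattern)
    results: List[Optional[str]] = []
    for value in values:
        if value is None:
            # Assumption: a null input yields a null output.
            results.append(None)
            continue
        # Only the first match in the string is considered.
        match = compiled.search(value)
        if match is None or index < 0 or index > compiled.groups:
            # No match, or the requested group does not exist: null.
            results.append(None)
        else:
            # Index 0 is the entire match; 1..n are the capture groups.
            results.append(match.group(index))
    return results


data = ["123-456", "789-012", "abc"]
whole = extract_first(data, r"(\d)(\d*)")       # entire first match per string
first = extract_first(data, r"(\d)(\d*)", 1)    # first capture group
missing = extract_first(data, r"(\d)(\d*)", 3)  # the pattern has only two groups
```

Here `"abc"` contains no digits, so it maps to a null in every result, and asking for group 3 of a two-group pattern yields null for every row.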

Example

>>> import daft
>>> regex = r"(\d)(\d*)"
>>> df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
>>> df.with_column("match", df["x"].str.extract(regex)).collect()
╭─────────┬───────╮
│ x       ┆ match │
│ ---     ┆ ---   │
│ Utf8    ┆ Utf8  │
╞═════════╪═══════╡
│ 123-456 ┆ 123   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 789-012 ┆ 789   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 345-678 ┆ 345   │
╰─────────┴───────╯

(Showing first 3 of 3 rows)

Extract the first capture group

>>> df.with_column("match", df["x"].str.extract(regex, 1)).collect()
╭─────────┬───────╮
│ x       ┆ match │
│ ---     ┆ ---   │
│ Utf8    ┆ Utf8  │
╞═════════╪═══════╡
│ 123-456 ┆ 1     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 789-012 ┆ 7     │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 345-678 ┆ 3     │
╰─────────┴───────╯

(Showing first 3 of 3 rows)

Parameters:
  • pattern – The regex pattern to match, given as a string or an Expression

  • index – The index of the match group to extract; 0 (the default) returns the entire match

Returns:

A String expression containing the extracted match, or null where the pattern does not match or the group does not exist

Return type:

Expression

See also

extract_all