daft.Expression.str.normalize

Expression.str.normalize(*, remove_punct: bool = False, lowercase: bool = False, nfd_unicode: bool = False, white_space: bool = False)

Normalizes a string so that values differing only in punctuation, case, Unicode composition, or whitespace can be treated as duplicates.

Note

All processing options are off by default; enable each one explicitly by passing it as a keyword argument.

Example

>>> import daft
>>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]})
>>> df = df.with_column("normalized", df["x"].str.normalize(remove_punct=True, lowercase=True, white_space=True))
>>> df.show()
╭───────────────┬─────────────╮
│ x             ┆ normalized  │
│ ---           ┆ ---         │
│ Utf8          ┆ Utf8        │
╞═══════════════╪═════════════╡
│ hello world   ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Hello, world! ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ HELLO,        ┆ hello world │
│ WORLD!!!!     ┆             │
╰───────────────┴─────────────╯

(Showing first 3 of 3 rows)

Parameters:
  • remove_punct – Whether to remove all ASCII punctuation characters.

  • lowercase – Whether to convert the string to lowercase.

  • nfd_unicode – Whether to normalize and decompose Unicode characters according to NFD.

  • white_space – Whether to normalize whitespace, replacing newlines and other whitespace characters with spaces and collapsing repeated spaces into a single space.

Returns:

A String expression containing the normalized text.

Return type:

Expression
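
To make the effect of each option concrete, here is a minimal pure-Python sketch of roughly what the four flags do to a single string. It is an approximation for illustration only, not Daft's implementation: the helper name normalize_text is hypothetical, and the order in which the steps run, the use of string.punctuation as the ASCII punctuation set, and the trimming of leading and trailing whitespace are all assumptions.

```python
import string
import unicodedata

# Translation table that deletes every ASCII punctuation character.
# Assumption: string.punctuation matches what "ASCII punctuation" means here.
_ASCII_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_text(s, *, remove_punct=False, lowercase=False,
                   nfd_unicode=False, white_space=False):
    """Hypothetical stand-in that mirrors the documented options."""
    if remove_punct:
        s = s.translate(_ASCII_PUNCT)
    if lowercase:
        s = s.lower()
    if nfd_unicode:
        # NFD splits precomposed characters into a base character
        # followed by its combining marks.
        s = unicodedata.normalize("NFD", s)
    if white_space:
        # Split on any run of whitespace (spaces, tabs, newlines) and rejoin
        # with single spaces. Assumption: surrounding whitespace is trimmed.
        s = " ".join(s.split())
    return s


rows = ["hello world", "Hello, world!", "HELLO,   \nWORLD!!!!"]
normalized = [
    normalize_text(r, remove_punct=True, lowercase=True, white_space=True)
    for r in rows
]
```

With these three flags enabled, all three inputs from the example above collapse to the same string, which is what makes the normalized column usable as a deduplication key. With nfd_unicode=True, a precomposed character such as "\u00e9" becomes "e" followed by the combining acute accent U+0301.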