daft.Expression.str.normalize
- Expression.str.normalize(*, remove_punct: bool = False, lowercase: bool = False, nfd_unicode: bool = False, white_space: bool = False)
Normalizes a string for more useful deduplication.
Note
All processing options are off by default.
Example
>>> import daft
>>> df = daft.from_pydict({"x": ["hello world", "Hello, world!", "HELLO, \nWORLD!!!!"]})
>>> df = df.with_column("normalized", df["x"].str.normalize(remove_punct=True, lowercase=True, white_space=True))
>>> df.show()
╭───────────────┬─────────────╮
│ x             ┆ normalized  │
│ ---           ┆ ---         │
│ Utf8          ┆ Utf8        │
╞═══════════════╪═════════════╡
│ hello world   ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Hello, world! ┆ hello world │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ HELLO,        ┆ hello world │
│ WORLD!!!!     ┆             │
╰───────────────┴─────────────╯
(Showing first 3 of 3 rows)
- Parameters:
remove_punct – Whether to remove all punctuation (ASCII).
lowercase – Whether to convert the string to lowercase.
nfd_unicode – Whether to normalize and decompose Unicode characters according to NFD.
white_space – Whether to normalize whitespace, replacing newlines and similar characters with spaces and collapsing repeated spaces into one.
- Returns:
A String expression containing the normalized values.
- Return type:
Expression
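To make the four options concrete, here is a minimal plain-Python sketch of what each flag is described as doing. It is not Daft's implementation: the helper name `normalize_sketch`, the order in which the steps run, and the trimming of leading and trailing whitespace are all assumptions made for illustration.

```python
import re
import string
import unicodedata


def normalize_sketch(text, *, remove_punct=False, lowercase=False,
                     nfd_unicode=False, white_space=False):
    """Hypothetical stand-in that mimics the documented options (not Daft's code)."""
    if remove_punct:
        # Drop ASCII punctuation only, matching the parameter description.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if lowercase:
        text = text.lower()
    if nfd_unicode:
        # NFD decomposes composed characters, e.g. "é" becomes "e" + combining accent.
        text = unicodedata.normalize("NFD", text)
    if white_space:
        # Collapse runs of whitespace (newlines included) into single spaces.
        # Trimming the ends is an assumption of this sketch.
        text = re.sub(r"\s+", " ", text).strip()
    return text


rows = ["hello world", "Hello, world!", "HELLO, \nWORLD!!!!"]
normalized = [
    normalize_sketch(r, remove_punct=True, lowercase=True, white_space=True)
    for r in rows
]
print(normalized)
print(len(set(normalized)))  # 1: all three rows collapse to the same value
```

With all flags left at their defaults the sketch returns its input unchanged, mirroring the note that every processing option is off by default.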