daft.Expression.str.tokenize_encode

Expression.str.tokenize_encode(tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None, use_special_tokens: bool | None = None) -> Expression

Encodes each string as a list of integer tokens using a tokenizer.

Uses openai/tiktoken for tokenization.

Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.

Note

If using this expression with Llama 3 tokens, note that Llama 3 performs extra preprocessing on strings in certain edge cases, which may result in slightly different encodings.

Parameters:
  • tokens_path – The name of a built-in tokenizer, or the path to a token file (local or remote; remote files are downloaded).

  • io_config (optional) – IOConfig to use when accessing remote storage.

  • pattern (optional) – Regex pattern used to split strings in the tokenization step. Required when loading tokens from a file.

  • special_tokens (optional) – Name of the set of special tokens to use. Currently only “llama3” is supported. Required when loading tokens from a file.

  • use_special_tokens (optional) – Whether to parse special tokens included in the input. Disabled by default; automatically enabled if special_tokens is provided.

Returns:

An expression with the encodings of the strings as lists of unsigned 32-bit integers.

Return type:

Expression
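
Example

A minimal usage sketch with a built-in tokenizer (the column name and input strings here are illustrative):

    import daft

    # Build a small DataFrame with a string column to tokenize.
    df = daft.from_pydict({"text": ["hello world", "daft is fast"]})

    # Encode each string into a list of UInt32 token IDs using cl100k_base.
    df = df.with_column("tokens", daft.col("text").str.tokenize_encode("cl100k_base"))
    df.show()

Loading tokens from a file in tiktoken format is sketched below; the path is hypothetical, and pattern and special_tokens are required in this case:

    # Hypothetical token file path; pattern and special_tokens must be
    # supplied when loading from a file instead of a built-in tokenizer.
    df = df.with_column(
        "tokens",
        daft.col("text").str.tokenize_encode(
            "s3://my-bucket/llama3/tokenizer.model",
            io_config=my_io_config,   # IOConfig for remote storage, assumed defined
            pattern=llama3_regex,     # split regex for the token file, assumed defined
            special_tokens="llama3",
        ),
    )

Each row of the resulting tokens column is a list of unsigned 32-bit integers, as described in the return type above.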