daft.Expression.str.tokenize_decode#

Expression.str.tokenize_decode(tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None) -> Expression

Decodes each list of integer tokens into a string using a tokenizer.

Uses openai/tiktoken for tokenization.

Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.

Parameters:
  • tokens_path – The name of a built-in tokenizer, or the path to a token file (remote paths are supported and will be downloaded).

  • io_config (optional) – IOConfig to use when accessing remote storage.

  • pattern (optional) – Regex pattern used to split strings during tokenization. Required when loading a tokenizer from a file.

  • special_tokens (optional) – Name of the set of special tokens to use. Currently only “llama3” is supported. Required when loading a tokenizer from a file.

Returns:

An expression with decoded strings.

Return type:

Expression
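To illustrate what decoding does, here is a minimal self-contained sketch using a hypothetical toy vocabulary in the spirit of a tiktoken table (token id mapped to a byte sequence); it does not import daft or tiktoken, and the vocabulary and `decode` helper are illustrative inventions, not part of either library:

```python
# Toy stand-in for a tiktoken-style vocabulary: token id -> bytes.
# A real tokenizer such as cl100k_base has on the order of 100k entries.
TOY_VOCAB: dict[int, bytes] = {
    0: b"hello",
    1: b" ",
    2: b"world",
    3: b"!",
}

def decode(tokens: list[int]) -> str:
    """Decode a list of integer token ids into a string by concatenating
    each token's byte sequence, then UTF-8 decoding the result."""
    return b"".join(TOY_VOCAB[t] for t in tokens).decode("utf-8")

print(decode([0, 1, 2, 3]))  # -> hello world!
```

In daft itself, the analogous call on a column of token lists would look like `df.select(df["tokens"].str.tokenize_decode("cl100k_base"))`, producing a string column.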