daft.Expression.str.tokenize_decode#
- Expression.str.tokenize_decode(tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None) Expression [source]#
Decodes each list of integer tokens into a string using a tokenizer.
Uses openai/tiktoken for tokenization.
Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base. Also supports loading tokens from a file in tiktoken format.
- Parameters:
tokens_path – The name of a built-in tokenizer, or the path to a token file (supports downloading).
io_config (optional) – IOConfig to use when accessing remote storage.
pattern (optional) – Regex pattern used to split strings during tokenization. Required when loading a tokenizer from a file.
special_tokens (optional) – Name of the set of special tokens to use. Currently only “llama3” is supported. Required when loading a tokenizer from a file.
- Returns:
An expression with decoded strings.
- Return type:
Expression
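Example#

A minimal usage sketch, assuming the companion Expression.str.tokenize_encode method and a small DataFrame built with daft.from_pydict; encoding and then decoding with the same built-in tokenizer should round-trip the original strings.

```python
import daft
from daft import col

# Build a small DataFrame of strings.
df = daft.from_pydict({"text": ["hello world", "daft is a dataframe library"]})

# Encode each string into a list of integer tokens with the cl100k_base tokenizer,
# then decode the token lists back into strings.
df = df.with_column("tokens", col("text").str.tokenize_encode("cl100k_base"))
df = df.with_column("decoded", col("tokens").str.tokenize_decode("cl100k_base"))

df.show()
```

When loading a tokenizer from a file instead of a built-in name, pass the file path as tokens_path along with the pattern and special_tokens arguments, and an io_config if the file lives in remote storage.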