daft.Expression.str.tokenize_encode
- Expression.str.tokenize_encode(tokens_path: str, *, io_config: IOConfig | None = None, pattern: str | None = None, special_tokens: str | None = None, use_special_tokens: bool | None = None) -> Expression
Encodes each string as a list of integer tokens using a tokenizer.
Uses openai/tiktoken for tokenization.
Supported built-in tokenizers: cl100k_base, o200k_base, p50k_base, p50k_edit, and r50k_base. Also supports loading tokens from a file in tiktoken format.

Note
If using this expression with Llama 3 tokens, note that Llama 3 does some extra preprocessing on strings in certain edge cases. This may result in slightly different encodings in these cases.
- Parameters:
tokens_path – The name of a built-in tokenizer, or the path to a token file (supports downloading).
io_config (optional) – IOConfig to use when accessing remote storage.
pattern (optional) – Regex pattern used to split strings during tokenization. Required when loading tokens from a file.
special_tokens (optional) – Name of the set of special tokens to use. Currently only “llama3” is supported. Required when loading tokens from a file.
use_special_tokens (optional) – Whether to parse special tokens included in the input. Disabled by default; automatically enabled if special_tokens is provided.
- Returns:
An expression with the encodings of the strings as lists of unsigned 32-bit integers.
- Return type:
Expression