daft.Expression.minhash
- Expression.minhash(num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'sha1'] = 'murmurhash3') → Expression
Runs the MinHash algorithm on the series.
For a string, calculates the minimum hash over all its ngrams, repeating with num_hashes permutations. Returns the result as a list of 32-bit unsigned integers.
Tokens for the ngrams are delimited by spaces. The strings are not normalized or pre-processed, so it is recommended to normalize the strings yourself before hashing.
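The description above can be illustrated with a small pure-Python sketch of the general MinHash technique. This is not Daft's implementation and will not reproduce its output values: the SHA-1 based `base_hash`, the affine permutation family modulo a Mersenne prime, the handling of strings shorter than `ngram_size`, and the helper names `minhash_sketch` and `estimate_jaccard` are all assumptions made for illustration.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the permutation family (assumption)
MAX_HASH = (1 << 32) - 1        # mask that keeps outputs in the 32-bit unsigned range


def base_hash(text: str, seed: int) -> int:
    """Stand-in 32-bit string hash; SHA-1 is used only because it ships with Python."""
    digest = hashlib.sha1(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash_sketch(text: str, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    """Return num_hashes minimum hash values computed over the space-delimited ngrams of text."""
    tokens = text.split(" ")  # tokens are delimited by spaces; no normalization is applied
    # Assumption: a string with fewer tokens than ngram_size yields one shingle (the whole string).
    count = max(len(tokens) - ngram_size + 1, 1)
    ngrams = [" ".join(tokens[i:i + ngram_size]) for i in range(count)]
    hashes = [base_hash(gram, seed) for gram in ngrams]

    # Derive one (a, b) pair per permutation from the seed.
    rng = random.Random(seed)
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    # For each permutation, keep the minimum permuted hash over all ngrams.
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Comparing two signatures position by position, as `estimate_jaccard` does, is the usual way MinHash output is used for near-duplicate detection. Because no normalization happens, strings that differ only in letter case produce different shingles in this sketch.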
- Parameters:
num_hashes – The number of hash permutations to compute.
ngram_size – The number of tokens in each shingle/ngram.
seed (optional) – Seed used for generating permutations and the initial string hashes. Defaults to 1.
hash_function (optional) – Hash function to use for initial string hashing. One of “murmurhash3”, “xxhash”, or “sha1”. Defaults to “murmurhash3”.