daft.Expression.minhash

Expression.minhash(num_hashes: int, ngram_size: int, seed: int = 1, hash_function: Literal['murmurhash3', 'xxhash', 'sha1'] = 'murmurhash3') -> Expression

Runs the MinHash algorithm on the series.

For each string, calculates the minimum hash over all of its ngrams, repeating this for each of num_hashes permutations. The result is returned as a list of 32-bit unsigned integers, one per permutation.

Tokens for the ngrams are delimited by spaces. The strings are not normalized or pre-processed, so it is recommended to normalize them yourself before calling this method.
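As a rough illustration of what such normalization might look like, the sketch below defines a hypothetical helper, normalize, that is not part of Daft. It assumes that case, punctuation, and runs of whitespace should not affect the result; adjust those choices to your data.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper (not a Daft API): prepare a string for space-delimited tokenization."""
    # Fold Unicode compatibility forms, then lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is neither a word character nor whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace so tokens are separated by exactly one space.
    return " ".join(text.split())


print(normalize("  The QUICK, brown   fox!  "))  # the quick brown fox
```

A step like this would be applied to the string column before minhash is called, so that near-duplicate strings produce the same space-delimited tokens.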

Parameters:
  • num_hashes – The number of hash permutations to compute.

  • ngram_size – The number of tokens in each shingle/ngram.

  • seed (optional) – Seed used for generating permutations and the initial string hashes. Defaults to 1.

  • hash_function (optional) – Hash function to use for initial string hashing. One of 'murmurhash3', 'xxhash', or 'sha1'. Defaults to 'murmurhash3'.
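To make the roles of the parameters concrete, here is a minimal pure-Python sketch of the general MinHash technique. It is an illustration only, not Daft's implementation: the function names toy_minhash and estimate_similarity, the SHA-1 based initial hash, the way the seed is mixed in, and the affine permutations modulo a Mersenne prime are all assumptions made for this example, so its output will not match the values Daft returns.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the toy permutations (an assumption)
MAX_U32 = (1 << 32) - 1         # keep results in the unsigned 32-bit range


def toy_minhash(text: str, num_hashes: int, ngram_size: int, seed: int = 1) -> list[int]:
    """Illustrative MinHash signature; NOT Daft's implementation."""
    # Tokens are delimited by single spaces, as described above.
    tokens = text.split(" ")
    # Build ngrams (shingles) of ngram_size consecutive tokens.
    # If the string has fewer tokens than ngram_size, use one shorter ngram.
    count = max(1, len(tokens) - ngram_size + 1)
    ngrams = [" ".join(tokens[i:i + ngram_size]) for i in range(count)]
    # Initial string hash: first 4 bytes of a seeded SHA-1 digest.
    base = [
        int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:4], "little")
        for g in ngrams
    ]
    # One (a, b) pair per permutation, derived deterministically from the seed.
    rng = random.Random(seed)
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]
    # For each permutation, keep the minimum permuted hash over all ngrams.
    return [min(((a * h + b) % MERSENNE_PRIME) & MAX_U32 for h in base) for a, b in perms]


def estimate_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


sig1 = toy_minhash("the quick brown fox jumps over the lazy dog", num_hashes=16, ngram_size=3)
sig2 = toy_minhash("the quick brown fox jumps over the lazy cat", num_hashes=16, ngram_size=3)
print(len(sig1), estimate_similarity(sig1, sig2))
```

In this sketch, a larger num_hashes gives a longer signature and a less noisy similarity estimate, a larger ngram_size makes each shingle more specific, and signatures are only comparable when they were produced with the same seed and hash function.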