SQL#

sql(sql: str, catalog: Optional[SQLCatalog] = None, register_globals: bool = True) DataFrame[source]#

Run a SQL query, returning the results as a DataFrame

Warning

This features is early in development and will likely experience API changes.

Examples

A simple example joining 2 dataframes together using a SQL statement, relying on Daft to detect the names of SQL tables using their corresponding Python variable names.

>>> import daft
>>>
>>> df1 = daft.from_pydict({"a": [1, 2, 3], "b": ["foo", "bar", "baz"]})
>>> df2 = daft.from_pydict({"a": [1, 2, 3], "c": ["daft", None, None]})
>>>
>>> # Daft automatically detects `df1` and `df2` from your Python global namespace
>>> result_df = daft.sql("SELECT * FROM df1 JOIN df2 ON df1.a = df2.a")
>>> result_df.show()
╭───────┬──────┬──────╮
│ a     ┆ b    ┆ c    │
│ ---   ┆ ---  ┆ ---  │
│ Int64 ┆ Utf8 ┆ Utf8 │
╞═══════╪══════╪══════╡
│ 1     ┆ foo  ┆ daft │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2     ┆ bar  ┆ None │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3     ┆ baz  ┆ None │
╰───────┴──────┴──────╯

(Showing first 3 of 3 rows)

A more complex example using a SQLCatalog to create a named table called "my_table", which can then be referenced from inside your SQL statement.

>>> import daft
>>> from daft.sql import SQLCatalog
>>>
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": ["foo", "bar", "baz"]})
>>>
>>> # Register dataframes as tables in SQL explicitly with names
>>> catalog = SQLCatalog({"my_table": df})
>>>
>>> daft.sql("SELECT a FROM my_table", catalog=catalog).show()
╭───────╮
│ a     │
│ ---   │
│ Int64 │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 2     │
├╌╌╌╌╌╌╌┤
│ 3     │
╰───────╯

(Showing first 3 of 3 rows)
Parameters:
  • sql (str) – SQL query to execute

  • catalog (SQLCatalog, optional) – Catalog of tables to use in the query. Defaults to None, in which case a catalog will be built from variables in the callers scope.

  • register_globals (bool, optional) – Whether to incorporate global variables into the supplied catalog, in which case a copy of the catalog will be made and the original not modified. Defaults to True.

Returns:

Dataframe containing the results of the query

Return type:

DataFrame

sql_expr(sql: str) Expression[source]#

Parses a SQL string into a Daft Expression

This function allows you to create Daft Expressions from SQL snippets, which can then be used in Daft operations or combined with other Daft Expressions.

Parameters:

sql (str) – A SQL string to be parsed into a Daft Expression.

Returns:

A Daft Expression representing the parsed SQL.

Return type:

Expression

Examples

Create a simple SQL expression:

>>> import daft
>>> expr = daft.sql_expr("1 + 2")
>>> print(expr)
lit(1) + lit(2)

Use SQL expression in a Daft DataFrame operation:

>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df = df.with_column("c", daft.sql_expr("a + b"))
>>> df.show()
╭───────┬───────┬───────╮
│ a     ┆ b     ┆ c     │
│ ---   ┆ ---   ┆ ---   │
│ Int64 ┆ Int64 ┆ Int64 │
╞═══════╪═══════╪═══════╡
│ 1     ┆ 4     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     ┆ 7     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     ┆ 9     │
╰───────┴───────┴───────╯

(Showing first 3 of 3 rows)

daft.sql_expr is also called automatically for you in some DataFrame operations such as filters:

>>> df = daft.from_pydict({"x": [1, 2, 3], "y": [4, 5, 6]})
>>> result = df.where("x < 3 AND y > 4")
>>> result.show()
╭───────┬───────╮
│ x     ┆ y     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 2     ┆ 5     │
╰───────┴───────╯

(Showing first 1 of 1 rows)

SQL Functions#

This is a full list of functions that can be used from within SQL.

AzureConfig(storage_account, access_key, ...)

Create configurations to be used when accessing Azure Blob Storage.

GCSConfig(project_id, credentials, token, ...)

Create configurations to be used when accessing Google Cloud Storage.

HTTPConfig(user_agent, bearer_token)

Create configurations for sending web requests.

S3Config(region_name, endpoint_url, key_id, ...)

Create configurations to be used when accessing an S3-compatible system.

abs(input)

Gets the absolute value of a number.

acos(input)

Calculates the inverse cosine (arc cosine) of a number.

acosh(input)

Calculates the inverse hyperbolic cosine of a number.

array_to_string(input, separator)

Joins elements of a list into a single string using a specified separator.

asin(input)

Calculates the inverse sine (arc sine) of a number.

asinh(input)

Calculates the inverse hyperbolic sine of a number.

atan(input)

Calculates the inverse tangent (arc tangent) of a number.

atan2(y, x)

Calculates the angle between the positive x-axis and the ray from (0,0) to (x,y).

atanh(input)

Calculates the inverse hyperbolic tangent of a number.

avg(input)

Calculates the average (mean) of non-null elements in the input expression.

capitalize(string_input)

Capitalizes the first character of the string

ceil(input)

Rounds a number up to the nearest integer.

clip(input, min, max)

Clips a number to a specified range.

concat(todo)

Concatenate the inputs into a single string

contains(string_input, substring)

Returns true if the string contains the specified substring

cos(input)

Calculates the cosine of an angle in radians.

cot(input)

Calculates the cotangent of an angle in radians.

count(input)

Counts the number of non-null elements in the input expression.

count_matches(input, pattern, whole_words, ...)

Counts the number of times a pattern, or multiple patterns, appears in the input.

date(input)

Extracts the date component from a datetime expression.

day(input)

Extracts the day component from a datetime expression.

dayofweek(input)

Extracts the day_of_week component from a datetime expression.

degrees(input)

Converts an angle from radians to degrees.

ends_with(string_input, substring)

Returns true if the string ends with the specified substring

exp(input, exponent)

Calculates the exponential of a number (e^x).

explode(input)

Expands a list column into multiple rows.

fill_nan(input, fill_value)

Replaces NaN values in the input expression with a specified fill value.

find(string_input, substring)

Returns the index of the first occurrence of a substring within the string

floor(input)

Rounds a number down to the nearest integer.

hash(input, seed)

Hashes the values in the input expression.

hour(input)

Extracts the hour component from a datetime expression.

image_crop(input_image, bounding_box)

Crops an image to a specified bounding box.

image_decode(input, mode, on_error)

Decodes an image from binary data.

image_encode(input_image, image_format)

Encodes an image into the specified image file format, returning a binary column of encoded bytes.

image_resize(input_image, width, height)

Resizes an image to the specified width and height.

image_to_mode(input_image, mode)

Converts an image to the specified mode (e.g.

is_inf(input)

Checks if the input expression is infinite (positive or negative infinity).

is_nan(input)

Checks if the input expression is NaN (Not a Number).

json_query(input, query)

Extracts a JSON object from a JSON string using a JSONPath expression.

left(string_input, length)

Returns the specified number of leftmost characters from the string

length(string_input)

Returns the length of the string

length_bytes(string_input)

Returns the length of the string in bytes

list_chunk(input, chunk_size)

Splits a list into chunks of a specified size.

list_count(input, mode)

Counts the number of elements in a list.

list_join(input, separator)

Joins elements of a list into a single string using a specified separator.

list_max(input)

Returns the maximum value in a list.

list_mean(input)

Calculates the mean (average) of values in a list.

list_min(input)

Returns the minimum value in a list.

list_slice(input, start, end)

Extracts a portion of a list from a start index to an end index.

list_sort(input, order)

Sorts the elements of a list in ascending or descending order.

list_sum(input)

Calculates the sum of values in a list.

ln(input)

Calculates the natural logarithm of a number.

log(input, base)

Calculates the natural logarithm of a number.

log10(input)

Calculates the base-10 logarithm of a number.

log2(input)

Calculates the base-2 logarithm of a number.

lower(string_input)

Converts the string to lowercase

lpad(string_input, length, pad)

Pads the string on the left side with the specified string until it reaches the specified length

lstrip(string_input)

Removes leading whitespace from the string

map_extract(input, key)

Retrieves the value associated with a given key from a map.

map_get(input, key)

Retrieves the value associated with a given key from a map.

max(input)

Finds the maximum value among non-null elements in the input expression.

mean(input)

Calculates the average (mean) of non-null elements in the input expression.

min(input)

Finds the minimum value among non-null elements in the input expression.

minhash(input, num_hashes, ngram_size, seed, ...)

Calculates the minimum hash over the inputs ngrams, repeating with num_hashes permutations.

minute(input)

Extracts the minute component from a datetime expression.

month(input)

Extracts the month component from a datetime expression.

normalize(input, remove_punct, lowercase, ...)

Normalizes a string for more useful deduplication and data cleaning.

not_nan(input)

Checks if the input expression is not NaN (Not a Number).

partitioning_days(input)

Extracts the number of days since epoch time from a datetime expression.

partitioning_hours(input)

Extracts the number of hours since epoch time from a datetime expression.

partitioning_iceberg_bucket(input, num_buckets)

Computes a bucket number for the input expression based the specified number of buckets using an Iceberg-specific hash.

partitioning_iceberg_truncate(input, width)

Truncates the input expression to a specified width.

partitioning_months(input)

Extracts the number of months since epoch time from a datetime expression.

partitioning_years(input)

Extracts the number of years since epoch time from a datetime expression.

radians(input)

Converts an angle from degrees to radians.

regexp_extract(string_input, pattern)

Extracts the first substring that matches the specified regular expression pattern

regexp_extract_all(string_input, pattern)

Extracts all substrings that match the specified regular expression pattern

regexp_match(string_input, pattern)

Returns true if the string matches the specified regular expression pattern

regexp_replace(string_input, pattern, ...)

Replaces all occurrences of a substring with a new string

regexp_split(string_input, delimiter)

Splits the string by the specified delimiter and returns an array of substrings

repeat(string_input, count)

Repeats the string the specified number of times

reverse(string_input)

Reverses the order of characters in the string

right(string_input, length)

Returns the specified number of rightmost characters from the string

round(input, precision)

Rounds a number to a specified number of decimal places.

rpad(string_input, length, pad)

Pads the string on the right side with the specified string until it reaches the specified length

rstrip(string_input)

Removes trailing whitespace from the string

second(input)

Extracts the second component from a datetime expression.

sign(input)

Returns the sign of a number (-1, 0, or 1).

sin(input)

Calculates the sine of an angle in radians.

split(string_input, delimiter)

Splits the string by the specified delimiter and returns an array of substrings

sqrt(input)

Calculates the square root of a number.

starts_with(string_input, substring)

Returns true if the string starts with the specified substring

stddev(input)

Calculates the standard deviation of non-null elements in the input expression.

stddev_samp(input)

Calculates the standard deviation of non-null elements in the input expression.

struct_extract(input, field)

Extracts a field from a struct expression by name.

struct_get(input, field)

Extracts a field from a struct expression by name.

sum(input)

Calculates the sum of non-null elements in the input expression.

tan(input)

Calculates the tangent of an angle in radians.

time(input)

Extracts the time component from a datetime expression.

to_date(string_input, format)

Parses the string as a date using the specified format.

to_datetime(string_input, format)

Parses the string as a datetime using the specified format.

tokenize_decode(input, token_path, ...)

Encodes each string as a list of integer tokens using a tokenizer.

tokenize_encode(input, token_path, ...)

Decodes each list of integer tokens into a string using a tokenizer.

unnest(input)

Expands a list column into multiple rows.

upper(string_input)

Converts the string to uppercase

year(input)

Extracts the year component from a datetime expression.