Expressions#

Expressions are how you can express computations that should be run over columns of data.

Creating Expressions#

Referring to a column in a DataFrame#

Most commonly you will be creating expressions by using the daft.col() function.

# Refers to column "A"
daft.col("A")
Output#
col(A)

The above code creates an Expression that refers to a column named "A".

Using SQL#

Daft can also parse valid SQL as expressions.

daft.sql_expr("A + 1")
Output#
col(A) + lit(1)

The above code will create an expression representing โ€œthe column named โ€˜xโ€™ incremented by 1โ€. For many APIs, sql_expr will actually be applied for you as syntactic sugar!

Literals#

You may find yourself needing to hardcode a โ€œsingle valueโ€ oftentimes as an expression. Daft provides a lit() helper to do so:

from daft import lit

# Refers to an expression which always evaluates to 42
lit(42)
Output#
lit(42)

This special lit() expression we just created evaluates always to the value 42.

Wildcard Expressions#

You can create expressions on multiple columns at once using a wildcard. The expression col("*") selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:

import daft
from daft import col

df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
df.select(col("*") * 3).show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ A     โ”† B     โ”‚
โ”‚ ---   โ”† ---   โ”‚
โ”‚ Int64 โ”† Int64 โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 3     โ”† 12    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 6     โ”† 15    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 9     โ”† 18    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Wildcards also work very well for accessing all members of a struct column:

import daft
from daft import col

df = daft.from_pydict({
    "person": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35}
    ]
})

# Access all fields of the 'person' struct
df.select(col("person.*")).show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ name     โ”† age   โ”‚
โ”‚ ---      โ”† ---   โ”‚
โ”‚ String   โ”† Int64 โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ Alice    โ”† 30    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ Bob      โ”† 25    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ Charlie  โ”† 35    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

In this example, we use the wildcard * to access all fields of the person struct column. This is equivalent to selecting each field individually (person.name, person.age), but is more concise and flexible, especially when dealing with structs that have many fields.

Composing Expressions#

Numeric Expressions#

Since column โ€œAโ€ is an integer, we can run numeric computation such as addition, division and checking its value. Here are some examples where we create new columns using the results of such computations:

# Add 1 to each element in column "A"
df = df.with_column("A_add_one", df["A"] + 1)

# Divide each element in column A by 2
df = df.with_column("A_divide_two", df["A"] / 2.)

# Check if each element in column A is more than 1
df = df.with_column("A_gt_1", df["A"] > 1)

df.collect()
Output#
+---------+-------------+----------------+-----------+
|       A |   A_add_one |   A_divide_two | A_gt_1    |
|   Int64 |       Int64 |        Float64 | Boolean   |
+=========+=============+================+===========+
|       1 |           2 |            0.5 | false     |
+---------+-------------+----------------+-----------+
|       2 |           3 |            1   | true      |
+---------+-------------+----------------+-----------+
|       3 |           4 |            1.5 | true      |
+---------+-------------+----------------+-----------+
(Showing first 3 of 3 rows)

Notice that the returned types of these operations are also well-typed according to their input types. For example, calling df["A"] > 1 returns a column of type Boolean.

Both the Float and Int types are numeric types, and inherit many of the same arithmetic Expression operations. You may find the full list of numeric operations in the Expressions API reference.

String Expressions#

Daft also lets you have columns of strings in a DataFrame. Letโ€™s take a look!

df = daft.from_pydict({"B": ["foo", "bar", "baz"]})
df.show()
Output#
+--------+
| B      |
| Utf8   |
+========+
| foo    |
+--------+
| bar    |
+--------+
| baz    |
+--------+
(Showing first 3 rows)

Unlike the numeric types, the string type does not support arithmetic operations such as * and /. The one exception to this is the + operator, which is overridden to concatenate two string expressions as is commonly done in Python. Letโ€™s try that!

df = df.with_column("B2", df["B"] + "foo")
df.show()
Output#
+--------+--------+
| B      | B2     |
| Utf8   | Utf8   |
+========+========+
| foo    | foofoo |
+--------+--------+
| bar    | barfoo |
+--------+--------+
| baz    | bazfoo |
+--------+--------+
(Showing first 3 rows)

There are also many string operators that are accessed through a separate .str.* โ€œmethod namespaceโ€.

For example, to check if each element in column โ€œBโ€ contains the substring โ€œaโ€, we can use the .str.contains method:

df = df.with_column("B2_contains_B", df["B2"].str.contains(df["B"]))
df.show()
Output#
+--------+--------+-----------------+
| B      | B2     | B2_contains_B   |
| Utf8   | Utf8   | Boolean         |
+========+========+=================+
| foo    | foofoo | true            |
+--------+--------+-----------------+
| bar    | barfoo | true            |
+--------+--------+-----------------+
| baz    | bazfoo | true            |
+--------+--------+-----------------+
(Showing first 3 rows)

You may find a full list of string operations in the Expressions API reference.

URL Expressions#

One special case of a String column you may find yourself working with is a column of URL strings.

Daft provides the .url.* method namespace with functionality for working with URL strings. For example, to download data from URLs:

df = daft.from_pydict({
    "urls": [
        "https://www.google.com",
        "s3://daft-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
    ],
})
df = df.with_column("data", df["urls"].url.download())
df.collect()
Output#
+----------------------+----------------------+
| urls                 | data                 |
| Utf8                 | Binary               |
+======================+======================+
| https://www.google.c | b'<!doctype          |
| om                   | html><html           |
|                      | itemscope="" itemtyp |
|                      | e="http://sche...    |
+----------------------+----------------------+
| s3://daft-public-    | b'\xff\xd8\xff\xe0\x |
| data/open-           | 00\x10JFIF\x00\x01\x |
| images/validation-   | 01\x01\x00H\x00H\... |
| images/0001e...      |                      |
+----------------------+----------------------+
(Showing first 2 of 2 rows)

This works well for URLs which are HTTP paths to non-HTML files (e.g. jpeg), local filepaths or even paths to a file in an object store such as AWS S3 as well!

JSON Expressions#

If you have a column of JSON strings, Daft provides the .json.* method namespace to run JQ-style filters on them. For example, to extract a value from a JSON object:

df = daft.from_pydict({
    "json": [
        '{"a": 1, "b": 2}',
        '{"a": 3, "b": 4}',
    ],
})
df = df.with_column("a", df["json"].json.query(".a"))
df.collect()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ json             โ”† a    โ”‚
โ”‚ ---              โ”† ---  โ”‚
โ”‚ Utf8             โ”† Utf8 โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•ก
โ”‚ {"a": 1, "b": 2} โ”† 1    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ {"a": 3, "b": 4} โ”† 3    โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

(Showing first 2 of 2 rows)

Daft uses jaq as the underlying executor, so you can find the full list of supported filters in the jaq documentation.

Logical Expressions#

Logical Expressions are an expression that refers to a column of type Boolean, and can only take on the values True or False.

df = daft.from_pydict({"C": [True, False, True]})

Daft supports logical operations such as & (and) and | (or) between logical expressions.

Comparisons#

Many of the types in Daft support comparisons between expressions that returns a Logical Expression.

For example, here we can compare if each element in column โ€œAโ€ is equal to elements in column โ€œBโ€:

df = daft.from_pydict({"A": [1, 2, 3], "B": [1, 2, 4]})

df = df.with_column("A_eq_B", df["A"] == df["B"])

df.collect()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ A     โ”† B     โ”† A_eq_B  โ”‚
โ”‚ ---   โ”† ---   โ”† ---     โ”‚
โ”‚ Int64 โ”† Int64 โ”† Boolean โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 1     โ”† 1     โ”† true    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2     โ”† 2     โ”† true    โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 3     โ”† 4     โ”† false   โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

(Showing first 3 of 3 rows)

Other useful comparisons can be found in the Expressions API reference.

If Else Pattern#

The .if_else() method is a useful expression to have up your sleeve for choosing values between two other expressions based on a logical expression:

df = daft.from_pydict({"A": [1, 2, 3], "B": [0, 2, 4]})

# Pick values from column A if the value in column A is bigger
# than the value in column B. Otherwise, pick values from column B.
df = df.with_column(
    "A_if_bigger_else_B",
    (df["A"] > df["B"]).if_else(df["A"], df["B"]),
)

df.collect()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ A     โ”† B     โ”† A_if_bigger_else_B โ”‚
โ”‚ ---   โ”† ---   โ”† ---                โ”‚
โ”‚ Int64 โ”† Int64 โ”† Int64              โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 1     โ”† 0     โ”† 1                  โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2     โ”† 2     โ”† 2                  โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 3     โ”† 4     โ”† 4                  โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

(Showing first 3 of 3 rows)

This is a useful expression for cleaning your data!

Temporal Expressions#

Daft provides rich support for working with temporal data types like Timestamp and Duration. Letโ€™s explore some common temporal operations:

Basic Temporal Operations#

You can perform arithmetic operations with timestamps and durations, such as adding a duration to a timestamp or calculating the duration between two timestamps:

import datetime

df = daft.from_pydict({
    "timestamp": [
        datetime.datetime(2021, 1, 1, 0, 1, 1),
        datetime.datetime(2021, 1, 1, 0, 1, 59),
        datetime.datetime(2021, 1, 1, 0, 2, 0),
    ]
})

# Add 10 seconds to each timestamp
df = df.with_column(
    "plus_10_seconds",
    df["timestamp"] + datetime.timedelta(seconds=10)
)

df.show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ timestamp                     โ”† plus_10_seconds               โ”‚
โ”‚ ---                           โ”† ---                           โ”‚
โ”‚ Timestamp(Microseconds, None) โ”† Timestamp(Microseconds, None) โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 2021-01-01 00:01:01           โ”† 2021-01-01 00:01:11           โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-01 00:01:59           โ”† 2021-01-01 00:02:09           โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-01 00:02:00           โ”† 2021-01-01 00:02:10           โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Temporal Component Extraction#

The .dt.* method namespace provides extraction methods for the components of a timestamp, such as year, month, day, hour, minute, and second:

df = daft.from_pydict({
    "timestamp": [
        datetime.datetime(2021, 1, 1, 0, 1, 1),
        datetime.datetime(2021, 1, 1, 0, 1, 59),
        datetime.datetime(2021, 1, 1, 0, 2, 0),
    ]
})

# Extract year, month, day, hour, minute, and second from the timestamp
df = df.with_columns({
    "year": df["timestamp"].dt.year(),
    "month": df["timestamp"].dt.month(),
    "day": df["timestamp"].dt.day(),
    "hour": df["timestamp"].dt.hour(),
    "minute": df["timestamp"].dt.minute(),
    "second": df["timestamp"].dt.second()
})

df.show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ timestamp                     โ”† year  โ”† month  โ”† day    โ”† hour   โ”† minute โ”† second โ”‚
โ”‚ ---                           โ”† ---   โ”† ---    โ”† ---    โ”† ---    โ”† ---    โ”† ---    โ”‚
โ”‚ Timestamp(Microseconds, None) โ”† Int32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”† UInt32 โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 2021-01-01 00:01:01           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 1      โ”† 1      โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-01 00:01:59           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 1      โ”† 59     โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-01 00:02:00           โ”† 2021  โ”† 1      โ”† 1      โ”† 0      โ”† 2      โ”† 0      โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Time Zone Operations#

You can parse strings as timestamps with time zones and convert between different time zones:

df = daft.from_pydict({
    "timestamp_str": [
        "2021-01-01 00:00:00.123 +0800",
        "2021-01-02 12:30:00.456 +0800"
    ]
})

# Parse the timestamp string with time zone and convert to New York time
df = df.with_column(
    "ny_time",
    df["timestamp_str"].str.to_datetime(
        "%Y-%m-%d %H:%M:%S%.3f %z",
        timezone="America/New_York"
    )
)

df.show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ timestamp_str                 โ”† ny_time                                           โ”‚
โ”‚ ---                           โ”† ---                                               โ”‚
โ”‚ Utf8                          โ”† Timestamp(Milliseconds, Some("America/New_York")) โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 2021-01-01 00:00:00.123 +0800 โ”† 2020-12-31 11:00:00.123 EST                       โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-02 12:30:00.456 +0800 โ”† 2021-01-01 23:30:00.456 EST                       โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Temporal Truncation#

The .dt.truncate() method allows you to truncate timestamps to specific time units. This can be useful for grouping data by time periods. For example, to truncate timestamps to the nearest hour:

df = daft.from_pydict({
    "timestamp": [
        datetime.datetime(2021, 1, 7, 0, 1, 1),
        datetime.datetime(2021, 1, 8, 0, 1, 59),
        datetime.datetime(2021, 1, 9, 0, 30, 0),
        datetime.datetime(2021, 1, 10, 1, 59, 59),
    ]
})

# Truncate timestamps to the nearest hour
df = df.with_column(
    "hour_start",
    df["timestamp"].dt.truncate("1 hour")
)

df.show()
Output#
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ timestamp                     โ”† hour_start                    โ”‚
โ”‚ ---                           โ”† ---                           โ”‚
โ”‚ Timestamp(Microseconds, None) โ”† Timestamp(Microseconds, None) โ”‚
โ•žโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ก
โ”‚ 2021-01-07 00:01:01           โ”† 2021-01-07 00:00:00           โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-08 00:01:59           โ”† 2021-01-08 00:00:00           โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-09 00:30:00           โ”† 2021-01-09 00:00:00           โ”‚
โ”œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ผโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ•Œโ”ค
โ”‚ 2021-01-10 01:59:59           โ”† 2021-01-10 01:00:00           โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ