Quickstart#
In this quickstart, you will learn the basics of Daft's DataFrame and SQL API and the features that set it apart from frameworks like Pandas, PySpark, Dask, and Ray.
Install Daft#
You can install Daft using pip. Run the following command in your terminal or notebook:
pip install daft
For more advanced installation options, please see Installation.
Create Your First Daft DataFrame#
See also DataFrame Creation. Let's create a DataFrame from a dictionary of columns:
import daft
df = daft.from_pydict({
    "A": [1, 2, 3, 4],
    "B": [1.5, 2.5, 3.5, 4.5],
    "C": [True, True, False, False],
    "D": [None, None, None, None],
})
df
+-------+---------+---------+------+
| A | B | C | D |
| Int64 | Float64 | Boolean | Null |
+=======+=========+=========+======+
| 1 | 1.5 | true | None |
+-------+---------+---------+------+
| 2 | 2.5 | true | None |
+-------+---------+---------+------+
| 3 | 3.5 | false | None |
+-------+---------+---------+------+
| 4 | 4.5 | false | None |
+-------+---------+---------+------+
(Showing first 4 of 4 rows)
You just created your first DataFrame!
Read From a Data Source#
Daft supports both local paths and paths to object storage such as AWS S3:
- CSV files:
daft.read_csv("s3://path/to/bucket/*.csv")
- Parquet files:
daft.read_parquet("/path/*.parquet")
- JSON line-delimited files:
daft.read_json("/path/*.json")
- Files on disk:
daft.from_glob_path("/path/*.jpeg")
Note
To work with other formats like Delta Lake and Iceberg, check out their respective pages.
Let's read in a Parquet file from a public S3 bucket. Note that this Parquet file is partitioned on the column country. This will be important later on.
# Set IO Configurations to use anonymous data access mode
daft.set_planning_config(default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True)))
df = daft.read_parquet("s3://daft-public-data/tutorials/10-min/sample-data-dog-owners-partitioned.pq/**")
df
+------------+-----------+-------+------+---------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------+---------+---------+
(No data to display: Dataframe not materialized)
Why does it say (No data to display: Dataframe not materialized) and where are the rows?
Execute Your DataFrame and View Data#
Daft DataFrames are lazy by default. This means that the contents will not be computed ("materialized") unless you explicitly tell Daft to do so. This is best practice for working with larger-than-memory datasets and parallel/distributed architectures.
The file we have just loaded only has 5 rows. You can materialize the whole DataFrame in memory easily using the df.collect() method:
df.collect()
+------------+-----------+-------+------------+----------------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------------+----------------+---------+
| Ernesto | Evergreen | 34 | 1990-04-03 | Canada | true |
| James | Jale | 62 | 1962-03-24 | Canada | true |
| Wolfgang | Winter | 23 | 2001-02-12 | Germany | None |
| Shandra | Shamas | 57 | 1967-01-02 | United Kingdom | true |
| Zaya | Zaphora | 40 | 1984-04-07 | United Kingdom | true |
+------------+-----------+-------+------------+----------------+---------+
(Showing first 5 of 5 rows)
To view just the first few rows, you can use the df.show() method:
df.show(3)
+------------+-----------+-------+------------+----------------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------------+----------------+---------+
| Ernesto | Evergreen | 34 | 1990-04-03 | Canada | true |
| James | Jale | 62 | 1962-03-24 | Canada | true |
| Wolfgang | Winter | 23 | 2001-02-12 | Germany | None |
+------------+-----------+-------+------------+----------------+---------+
(Showing first 3 of 5 rows)
Now let's take a look at some common DataFrame operations.
Select Columns#
You can select specific columns from your DataFrame with the df.select() method:
df.select("first_name", "has_dog").show()
+------------+---------+
| first_name | has_dog |
| Utf8 | Boolean |
+------------+---------+
| Ernesto | true |
| James | true |
| Wolfgang | None |
| Shandra | true |
| Zaya | true |
+------------+---------+
(Showing first 5 of 5 rows)
Select Rows#
You can filter rows using the df.where() method, which takes a Logical Expression predicate as input. In this case, we call the daft.col() function, which refers to the column with the provided name age:
df.where(daft.col("age") >= 40).show()
+------------+-----------+-------+------------+----------------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------------+----------------+---------+
| James | Jale | 62 | 1962-03-24 | Canada | true |
| Shandra | Shamas | 57 | 1967-01-02 | United Kingdom | true |
| Zaya | Zaphora | 40 | 1984-04-07 | United Kingdom | true |
+------------+-----------+-------+------------+----------------+---------+
(Showing first 3 of 3 rows)
Filtering can give you powerful optimizations when you are working with partitioned files or tables. Daft will use the predicate to read only the necessary partitions, skipping any data that is not relevant.
Note
As mentioned earlier, our Parquet file is partitioned on the country column, so queries with a country predicate will benefit from query optimization.
Exclude Data#
You can limit the number of rows in a DataFrame by calling the df.limit() method:
df.limit(2).show()
+------------+-----------+-------+------------+----------------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------------+----------------+---------+
| Ernesto | Evergreen | 34 | 1990-04-03 | Canada | true |
| James | Jale | 62 | 1962-03-24 | Canada | true |
+------------+-----------+-------+------------+----------------+---------+
(Showing first 2 of 2 rows)
To drop columns from the DataFrame, use the df.exclude() method:
df.exclude("DoB").show()
+------------+-----------+-------+----------------+---------+
| first_name | last_name | age | country | has_dog |
| Utf8 | Utf8 | Int64 | Utf8 | Boolean |
+------------+-----------+-------+----------------+---------+
| Ernesto | Evergreen | 34 | Canada | true |
| James | Jale | 62 | Canada | true |
| Wolfgang | Winter | 23 | Germany | None |
| Shandra | Shamas | 57 | United Kingdom | true |
| Zaya | Zaphora | 40 | United Kingdom | true |
+------------+-----------+-------+----------------+---------+
(Showing first 5 of 5 rows)
Transform Columns with Expressions#
Expressions are an API for defining computation that needs to happen over columns. For example, use the daft.col() expression together with the with_column() method to create a new column called full_name, joining the contents of the first_name column with the last_name column:
df = df.with_column("full_name", daft.col("first_name") + " " + daft.col("last_name"))
df.select("full_name", "age", "country", "has_dog").show()
+-------------------+-------+----------------+---------+
| full_name | age | country | has_dog |
| Utf8 | Int64 | Utf8 | Boolean |
+-------------------+-------+----------------+---------+
| Ernesto Evergreen | 34 | Canada | true |
| James Jale | 62 | Canada | true |
| Wolfgang Winter | 23 | Germany | None |
| Shandra Shamas | 57 | United Kingdom | true |
| Zaya Zaphora | 40 | United Kingdom | true |
+-------------------+-------+----------------+---------+
(Showing first 5 of 5 rows)
Alternatively, you can also run your column transformation using Expressions directly inside your df.select() method:
df.select((daft.col("first_name").alias("full_name") + " " + daft.col("last_name")), "age", "country", "has_dog").show()
+-------------------+-------+----------------+---------+
| full_name | age | country | has_dog |
| Utf8 | Int64 | Utf8 | Boolean |
+-------------------+-------+----------------+---------+
| Ernesto Evergreen | 34 | Canada | true |
| James Jale | 62 | Canada | true |
| Wolfgang Winter | 23 | Germany | None |
| Shandra Shamas | 57 | United Kingdom | true |
| Zaya Zaphora | 40 | United Kingdom | true |
+-------------------+-------+----------------+---------+
(Showing first 5 of 5 rows)
Sort Data#
You can sort a DataFrame with the df.sort() method. In this example, we chose to sort in ascending order:
df.sort(daft.col("age"), desc=False).show()
+------------+-----------+-------+------------+----------------+---------+
| first_name | last_name | age | DoB | country | has_dog |
| Utf8 | Utf8 | Int64 | Date | Utf8 | Boolean |
+------------+-----------+-------+------------+----------------+---------+
| Wolfgang | Winter | 23 | 2001-02-12 | Germany | None |
| Ernesto | Evergreen | 34 | 1990-04-03 | Canada | true |
| Zaya | Zaphora | 40 | 1984-04-07 | United Kingdom | true |
| Shandra | Shamas | 57 | 1967-01-02 | United Kingdom | true |
| James | Jale | 62 | 1962-03-24 | Canada | true |
+------------+-----------+-------+------------+----------------+---------+
(Showing first 5 of 5 rows)
Group and Aggregate Data#
You can group and aggregate your data using the df.groupby() and df.agg() methods. A groupby aggregation operation over a dataset happens in 2 steps:
- Split the data into groups based on some criteria using df.groupby()
- Specify how to aggregate the data for each group using df.agg()
df.groupby("country").agg(
    daft.col("age").mean().alias("avg_age"),
    daft.col("has_dog").count(),
).show()
+----------------+---------+---------+
| country | avg_age | has_dog |
| Utf8 | Float64 | UInt64 |
+----------------+---------+---------+
| Canada | 48 | 2 |
| Germany | 23 | 0 |
| United Kingdom | 48.5 | 2 |
+----------------+---------+---------+
(Showing first 3 of 3 rows)
Note
The alias() expression method renames the given column.
What's Next?#
Now that you have a basic sense of Daft's functionality and features, here are some more resources to help you get the most out of Daft:
- Check out the Core Concepts sections for more details
- Work with your favorite tools and integrations
- Coming from another framework? See the migration guides
- Try your hand at some Tutorials