Reading/Writing#
Daft can read data from a variety of sources, and write data to many destinations.
Reading Data#
From Files#
DataFrames can be loaded from file(s) on some filesystem, commonly your local filesystem or a remote cloud object store such as AWS S3.
Additionally, Daft can read data from a variety of container file formats, including CSV, line-delimited JSON and Parquet.
Daft supports file paths to a single file, a directory of files, and wildcards. It also supports paths to remote object storage such as AWS S3.
import daft
# You can read a single CSV file from your local filesystem
df = daft.read_csv("path/to/file.csv")
# You can also read folders of CSV files, or include wildcards to select for patterns of file paths
df = daft.read_csv("path/to/*.csv")
# Other formats such as parquet and line-delimited JSON are also supported
df = daft.read_parquet("path/to/*.parquet")
df = daft.read_json("path/to/*.json")
# Remote filesystems such as AWS S3 are also supported, and can be specified with their protocols
df = daft.read_csv("s3://mybucket/path/to/*.csv")
To learn more about each of these constructors, as well as the options that they support, consult the API documentation on creating DataFrames from files.
From Data Catalogs#
If you use catalogs such as Apache Iceberg or Hive, you may wish to consult our user guide on integrations with Data Catalogs: Daft integration with Data Catalogs.
From File Paths#
Daft also provides an easy utility to create a DataFrame from globbing a path. You can use the daft.from_glob_path()
method which will read a DataFrame of globbed filepaths.
df = daft.from_glob_path("s3://mybucket/path/to/images/*.jpeg")
# +----------+------+-----+
# | name | size | ... |
# +----------+------+-----+
# ...
This is especially useful for reading things such as a folder of images or documents into Daft. A common pattern is to then download data from these files into your DataFrame as bytes, using the .url.download()
method.
From Memory#
For testing, or small datasets that fit in memory, you may also create DataFrames using Python lists and dictionaries.
# Create DataFrame using a dictionary of {column_name: list_of_values}
df = daft.from_pydict({"A": [1, 2, 3], "B": ["foo", "bar", "baz"]})
# Create DataFrame using a list of rows, where each row is a dictionary of {column_name: value}
df = daft.from_pylist([{"A": 1, "B": "foo"}, {"A": 2, "B": "bar"}, {"A": 3, "B": "baz"}])
To learn more, consult the API documentation on creating DataFrames from in-memory data structures.
From Databases#
Daft can also read data from a variety of databases, including PostgreSQL, MySQL, Trino, and SQLite using the daft.read_sql()
method.
In order to partition the data, you can specify a partition column, which will allow Daft to read the data in parallel.
# Read from a PostgreSQL database
uri = "postgresql://user:password@host:port/database"
df = daft.read_sql(uri, "SELECT * FROM my_table")
# Read with a partition column
df = daft.read_sql(uri, "SELECT * FROM my_table", partition_col="date")
To learn more, consult the API documentation on daft.read_sql()
.
Writing Data#
The df.write_*(…) methods are used to write DataFrames to files or other destinations.
# Write to various file formats in a local folder
df.write_csv("path/to/folder/")
df.write_parquet("path/to/folder/")
# Write DataFrame to a remote filesystem such as AWS S3
df.write_csv("s3://mybucket/path/")
Note that because Daft is a distributed DataFrame library, by default it will produce multiple files (one per partition) at your specified destination.