Hugging Face Datasets
Daft can read datasets directly from Hugging Face through the hf://datasets/ protocol.
Because Hugging Face automatically converts all public datasets to Parquet format, you can read these datasets with the read_parquet method.
Note
The automatic Parquet conversion is limited to public datasets and to PRO/Enterprise datasets.
For other file formats, specify the path or glob pattern of the files you want to read, just as you would when reading from a local file system.
Reading Public Datasets
import daft
df = daft.read_parquet("hf://datasets/username/dataset_name")
This reads the entire dataset into a Daft DataFrame.
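Daft DataFrames are evaluated lazily, so it is often convenient to look at a few rows before working with the full dataset. The following is a minimal sketch of that idea, not part of Daft itself: `preview_dataset` and its `read_parquet` parameter are hypothetical names introduced here for illustration, and `username/dataset_name` is a placeholder repository ID. Only `daft.read_parquet`, `DataFrame.limit`, and `DataFrame.collect` are Daft calls.

```python
def preview_dataset(repo_id, rows=5, read_parquet=None):
    """Materialize only the first `rows` rows of a Hugging Face dataset.

    `read_parquet` is a hypothetical hook (not a Daft feature) that lets the
    reader be swapped out; by default daft.read_parquet is used.
    """
    if read_parquet is None:
        import daft  # imported lazily so the helper can be defined without Daft

        read_parquet = daft.read_parquet

    # Same path form as in the example above: hf://datasets/<user>/<dataset>
    df = read_parquet(f"hf://datasets/{repo_id}")

    # limit() narrows the query plan; collect() executes it.
    return df.limit(rows).collect()


# Usage (requires Daft and network access), with a placeholder repository ID:
# preview = preview_dataset("username/dataset_name", rows=10)
```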
You can also read individual files from a dataset instead of the whole dataset.
import daft
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")
# or a CSV file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")
# or all Parquet files matching a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
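The path forms above can be wrapped in a small helper that builds the hf://datasets/ path and picks a reader from the file extension. This is a sketch under stated assumptions rather than a Daft API: `hf_path`, `reader_name`, and `read_hf` are hypothetical names, and the extension-to-reader mapping is an illustrative choice. The readers it refers to (`daft.read_parquet`, `daft.read_csv`, `daft.read_json`) are existing Daft functions.

```python
from pathlib import PurePosixPath

HF_PREFIX = "hf://datasets/"

# Assumed mapping from file extension to the name of a Daft reader function.
READERS = {
    ".parquet": "read_parquet",
    ".csv": "read_csv",
    ".json": "read_json",
    ".jsonl": "read_json",
}


def hf_path(repo_id, pattern=""):
    """Build an hf://datasets/ path, optionally with a file path or glob."""
    base = HF_PREFIX + repo_id.strip("/")
    return f"{base}/{pattern.lstrip('/')}" if pattern else base


def reader_name(pattern):
    """Return the Daft reader name for the extension used in `pattern`."""
    suffix = PurePosixPath(pattern).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader configured for extension {suffix!r}")
    return READERS[suffix]


def read_hf(repo_id, pattern):
    """Read files from a Hugging Face dataset with the matching Daft reader."""
    import daft  # imported lazily; only needed when data is actually read

    return getattr(daft, reader_name(pattern))(hf_path(repo_id, pattern))


# Usage (requires Daft and network access), with a placeholder repository ID:
# df = read_hf("username/dataset_name", "**/*.parquet")
```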