daft.read_parquet

daft.read_parquet(path: Union[str, List[str]], schema_hints: Optional[Dict[str, DataType]] = None, io_config: Optional[IOConfig] = None, use_native_downloader: bool = True, coerce_int96_timestamp_unit: Optional[Union[str, TimeUnit]] = None, _multithreaded_io: Optional[bool] = None) → DataFrame

Creates a DataFrame from Parquet file(s).

Example

>>> df = daft.read_parquet("/path/to/file.parquet")
>>> df = daft.read_parquet("/path/to/directory")
>>> df = daft.read_parquet("/path/to/files-*.parquet")
>>> df = daft.read_parquet("s3://path/to/files-*.parquet")
Parameters:
  • path (str or list[str]) – Path to Parquet file(s) (allows for wildcards)

  • schema_hints (dict[str, DataType]) – A mapping from column names to DataTypes. Passing this option overrides the inferred schema for the specified columns with the given DataTypes.

  • io_config (IOConfig) – Config to be used with the native downloader (see the sketch after this list).

  • use_native_downloader – Whether to use the native downloader instead of PyArrow for reading Parquet. Defaults to True.

  • coerce_int96_timestamp_unit – TimeUnit to coerce INT96 timestamps to, e.g. "ns", "us", or "ms". Defaults to None.

  • _multithreaded_io – Whether to use multithreading for IO threads. Setting this to False can be helpful in reducing the amount of system resources (number of connections and thread contention) when running in the Ray runner. Defaults to None, which will let Daft decide based on the runner it is currently using.
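
For example, a minimal sketch of reading from a public S3 bucket with the native downloader while coercing INT96 timestamps to milliseconds. The bucket path is hypothetical, and this assumes S3Config supports anonymous (credential-free) access:

>>> from daft.io import IOConfig, S3Config
>>> io_config = IOConfig(s3=S3Config(region_name="us-west-2", anonymous=True))
>>> df = daft.read_parquet(
...     "s3://bucket/files-*.parquet",
...     io_config=io_config,
...     coerce_int96_timestamp_unit="ms",
... )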

Returns:

a DataFrame containing the parsed Parquet data

Return type:

DataFrame
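
Note that the returned DataFrame is evaluated lazily, so the read only executes when results are requested. A short usage sketch:

>>> df = daft.read_parquet("/path/to/file.parquet")
>>> df.show()  # triggers the read and displays a preview of rows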