daft.read_parquet


read_parquet(path: Union[str, List[str]], row_groups: Optional[List[List[int]]] = None, infer_schema: bool = True, schema: Optional[Dict[str, DataType]] = None, io_config: Optional[IOConfig] = None, file_path_column: Optional[str] = None, hive_partitioning: bool = False, use_native_downloader: bool = True, coerce_int96_timestamp_unit: Optional[Union[str, TimeUnit]] = None, schema_hints: Optional[Dict[str, DataType]] = None, _multithreaded_io: Optional[bool] = None, _chunk_size: Optional[int] = None) -> DataFrame

Creates a DataFrame from one or more Parquet files.

Example

>>> df = daft.read_parquet("/path/to/file.parquet")
>>> df = daft.read_parquet("/path/to/directory")
>>> df = daft.read_parquet("/path/to/files-*.parquet")
>>> df = daft.read_parquet("s3://path/to/files-*.parquet")
>>> df = daft.read_parquet("gs://path/to/files-*.parquet")
Parameters:
  • path (str or List[str]) – Path, or list of paths, to the Parquet file(s). A path may be a single file, a directory, or a wildcard pattern.

  • row_groups (List[int] or List[List[int]]) – List of row groups to read corresponding to each file.

  • infer_schema (bool) – Whether to infer the schema from the Parquet file(s). Defaults to True.

  • schema (dict[str, DataType]) – A schema that is used as the definitive schema for the Parquet file if infer_schema is False, otherwise it is used as a schema hint that is applied after the schema is inferred.

  • io_config (IOConfig) – Config to be used with the native downloader

  • file_path_column (str) – Include the source path(s) as a column with this name. Defaults to None.

  • hive_partitioning (bool) – Whether to infer Hive-style partitions from file paths and include them as columns in the DataFrame. Defaults to False.

  • use_native_downloader (bool) – Whether to use the native downloader instead of PyArrow for reading Parquet. Defaults to True.

  • coerce_int96_timestamp_unit (str or TimeUnit) – TimeUnit to coerce Int96 timestamps to, e.g. ns, us, or ms. Defaults to None.

  • _multithreaded_io – Whether to use multithreading for IO threads. Setting this to False can be helpful in reducing the amount of system resources (number of connections and thread contention) when running in the Ray runner. Defaults to None, which will let Daft decide based on the runner it is currently using.
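
The hive_partitioning option above refers to the key=value directory convention in file paths. As a rough illustration of that convention only (this is not Daft's implementation, and the helper name and example path are hypothetical), the sketch below extracts partition values from a path using the standard library:

```python
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> dict:
    """Hypothetical helper: collect key=value directory segments from a path.

    Illustrates the Hive-style layout only; Daft's own inference may differ,
    for example in how values are typed or decoded.
    """
    partitions = {}
    # Drop the final segment (the file name) and inspect each directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = unquote(value)
    return partitions


# Hypothetical path laid out in Hive style.
example = "s3://bucket/table/year=2024/month=01/part-0.parquet"
print(parse_hive_partitions(example))  # {'year': '2024', 'month': '01'}
```

With hive_partitioning=True, keys such as year and month in this example are the kind of values that would be surfaced as columns alongside the data read from the file.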

Returns:

parsed DataFrame

Return type:

DataFrame
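
The following is a minimal sketch of how several of the parameters might be combined, assuming Daft is installed. The wrapper function, the column names user_id and source_path, and the length check are assumptions made for illustration; only the read_parquet keyword arguments come from the signature above.

```python
from typing import List


def load_selected_row_groups(paths: List[str], row_groups: List[List[int]]):
    """Read specific row groups from each file and record each row's source path.

    Hypothetical wrapper; the column names used below are placeholders.
    """
    # Convenience check added by this sketch: one list of row-group
    # indices is expected per file path.
    if len(paths) != len(row_groups):
        raise ValueError("expected one list of row-group indices per path")

    # Imported lazily so this module can be loaded without Daft present.
    import daft
    from daft import DataType

    return daft.read_parquet(
        paths,
        row_groups=row_groups,
        # With infer_schema=True, `schema` acts as a hint applied after inference.
        infer_schema=True,
        schema={"user_id": DataType.int64()},
        file_path_column="source_path",
    )
```

A call such as load_selected_row_groups(["a.parquet", "b.parquet"], [[0], [0, 1]]) would request row group 0 from the first file and row groups 0 and 1 from the second, with the file names here being placeholders.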