daft.read_csv

read_csv(path: Union[str, List[str]], infer_schema: bool = True, schema: Optional[Dict[str, DataType]] = None, has_headers: bool = True, delimiter: Optional[str] = None, double_quote: bool = True, quote: Optional[str] = None, escape_char: Optional[str] = None, comment: Optional[str] = None, allow_variable_columns: bool = False, io_config: Optional[IOConfig] = None, file_path_column: Optional[str] = None, hive_partitioning: bool = False, use_native_downloader: bool = True, schema_hints: Optional[Dict[str, DataType]] = None, _buffer_size: Optional[int] = None, _chunk_size: Optional[int] = None) -> DataFrame

Creates a DataFrame from CSV file(s).

Example

>>> df = daft.read_csv("/path/to/file.csv")
>>> df = daft.read_csv("/path/to/directory")
>>> df = daft.read_csv("/path/to/files-*.csv")
>>> df = daft.read_csv("s3://path/to/files-*.csv")
Parameters:
  • path (str | list[str]) – Path(s) to CSV file(s) or a directory; wildcards are allowed

  • infer_schema (bool) – Whether to infer the schema of the CSV, defaults to True.

  • schema (dict[str, DataType]) – A schema that is used as the definitive schema for the CSV if infer_schema is False, otherwise it is used as a schema hint that is applied after the schema is inferred.

  • has_headers (bool) – Whether the CSV has a header or not, defaults to True

  • delimiter (str) – Delimiter used in the CSV, defaults to ","

  • double_quote (bool) – Whether to support double-quote escapes, defaults to True

  • escape_char (str) – Character to use as the escape character for quotes, defaults to the double quote character (")

  • comment (str) – Character to treat as the start of a comment line, or None to not support comments

  • allow_variable_columns (bool) – Whether to allow a variable number of columns in the CSV, defaults to False. If set to True, Daft will append nulls to rows with fewer columns than the schema, and ignore extra columns in rows with more columns

  • io_config (IOConfig) – Config to be used with the native downloader

  • file_path_column – Include the source path(s) as a column with this name. Defaults to None.

  • hive_partitioning – Whether to infer Hive-style partitions from file paths and include them as columns in the DataFrame. Defaults to False.

  • use_native_downloader – Whether to use the native downloader instead of PyArrow for reading the CSV. This is currently experimental.
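The allow_variable_columns behavior described above (pad short rows with nulls, ignore extra fields on long rows) can be sketched with the standard library's csv module. This is an illustrative approximation of the documented semantics, not Daft's implementation; the helper name read_variable_csv is hypothetical:

```python
import csv
import io

def read_variable_csv(text, column_names):
    """Approximate allow_variable_columns=True: pad short rows with None
    and drop fields beyond the schema width. Illustration only."""
    width = len(column_names)
    rows = []
    for record in csv.reader(io.StringIO(text)):
        # Truncate to the schema width, then pad with None if the row is short.
        padded = record[:width] + [None] * (width - len(record))
        rows.append(dict(zip(column_names, padded)))
    return rows

csv_text = "a,1\nb\nc,3,extra\n"
rows = read_variable_csv(csv_text, ["name", "value"])
# The short row "b" is padded with None; the extra field on row "c" is ignored.
```

With allow_variable_columns=False (the default), such ragged rows would instead be a parse error.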

Returns:

A DataFrame with the parsed contents of the CSV file(s)

Return type:

DataFrame