daft.set_execution_config

daft.set_execution_config(config: Optional[PyDaftExecutionConfig] = None, scan_tasks_min_size_bytes: Optional[int] = None, scan_tasks_max_size_bytes: Optional[int] = None, broadcast_join_size_bytes_threshold: Optional[int] = None, parquet_split_row_groups_max_files: Optional[int] = None, sort_merge_join_sort_with_aligned_boundaries: Optional[bool] = None, sample_size_for_sort: Optional[int] = None, num_preview_rows: Optional[int] = None, parquet_target_filesize: Optional[int] = None, parquet_target_row_group_size: Optional[int] = None, parquet_inflation_factor: Optional[float] = None, csv_target_filesize: Optional[int] = None, csv_inflation_factor: Optional[float] = None, shuffle_aggregation_default_partitions: Optional[int] = None) → DaftContext

Globally sets configuration parameters that control various aspects of Daft execution. These values are used when a DataFrame is executed (e.g. by calls to write_*, collect(), or show()).

Parameters:
  • config – A PyDaftExecutionConfig object to use as the base config before applying the other kwargs. Defaults to None, which means the current config is used.

  • scan_tasks_min_size_bytes – Minimum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will make Daft perform more merging of files into a single partition before yielding, which leads to bigger but fewer partitions. (Defaults to 96 MiB)

  • scan_tasks_max_size_bytes – Maximum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will increase the upper bound of the size of merged ScanTasks, which leads to bigger but fewer partitions. (Defaults to 384 MiB)

  • broadcast_join_size_bytes_threshold – If one side of a join is smaller than this threshold, a broadcast join will be used. Default is 10 MiB.

  • parquet_split_row_groups_max_files – Maximum number of files to read for which row group splitting should happen. (Defaults to 10)

  • sort_merge_join_sort_with_aligned_boundaries – Whether to use a specialized algorithm for sorting both sides of a sort-merge join such that they have aligned boundaries. This can lead to a faster merge-join at the cost of more skewed sorted join inputs, increasing the risk of OOMs.

  • sample_size_for_sort – Number of elements to sample from each partition when running a sort. Defaults to 20.

  • num_preview_rows – Number of rows to display when showing a DataFrame preview. Defaults to 8.

  • parquet_target_filesize – Target file size when writing out Parquet files. Defaults to 512 MB.

  • parquet_target_row_group_size – Target row group size when writing out Parquet files. Defaults to 128 MB.

  • parquet_inflation_factor – Inflation factor of Parquet files (ratio of in-memory size to file size). Defaults to 3.0.

  • csv_target_filesize – Target file size when writing out CSV files. Defaults to 512 MB.

  • csv_inflation_factor – Inflation factor of CSV files (ratio of in-memory size to file size). Defaults to 0.5.

  • shuffle_aggregation_default_partitions – Minimum number of partitions to create when performing aggregations. Defaults to 200, unless the input has fewer than 200 partitions.
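A minimal usage sketch of the call described above; the specific values are illustrative, not tuning recommendations:

```python
import daft

# Illustrative overrides; all parameters not passed keep their current values.
MIB = 1024 * 1024

daft.set_execution_config(
    scan_tasks_max_size_bytes=512 * MIB,  # allow merged ScanTasks up to 512 MiB
    num_preview_rows=16,                  # preview 16 rows instead of the default 8
)
```

Because the settings are global, they affect every subsequent write_*, collect(), or show() call in the process.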