daft.set_execution_config

set_execution_config(config: Optional[PyDaftExecutionConfig] = None, scan_tasks_min_size_bytes: Optional[int] = None, scan_tasks_max_size_bytes: Optional[int] = None, broadcast_join_size_bytes_threshold: Optional[int] = None, parquet_split_row_groups_max_files: Optional[int] = None, sort_merge_join_sort_with_aligned_boundaries: Optional[bool] = None, hash_join_partition_size_leniency: Optional[float] = None, sample_size_for_sort: Optional[int] = None, num_preview_rows: Optional[int] = None, parquet_target_filesize: Optional[int] = None, parquet_target_row_group_size: Optional[int] = None, parquet_inflation_factor: Optional[float] = None, csv_target_filesize: Optional[int] = None, csv_inflation_factor: Optional[float] = None, shuffle_aggregation_default_partitions: Optional[int] = None, read_sql_partition_size_bytes: Optional[int] = None, enable_aqe: Optional[bool] = None, enable_native_executor: Optional[bool] = None, default_morsel_size: Optional[int] = None, shuffle_algorithm: Optional[str] = None, pre_shuffle_merge_threshold: Optional[int] = None, enable_ray_tracing: Optional[bool] = None) → DaftContext

Globally sets configuration parameters that control various aspects of Daft execution. These values are used when a DataFrame is executed (e.g. calls to write_*, collect(), or show()).

Parameters:
  • config – A PyDaftExecutionConfig object to set the config to, before applying other kwargs. Defaults to None, which means the current config is used.

  • scan_tasks_min_size_bytes – Minimum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will make Daft perform more merging of files into a single partition before yielding, which leads to bigger but fewer partitions. (Defaults to 96 MiB)

  • scan_tasks_max_size_bytes – Maximum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will increase the upper bound of the size of merged ScanTasks, which leads to bigger but fewer partitions. (Defaults to 384 MiB)

  • broadcast_join_size_bytes_threshold – If one side of a join is smaller than this threshold, a broadcast join will be used. Default is 10 MiB.

  • parquet_split_row_groups_max_files – Maximum number of files being read for which row-group splitting should happen. (Defaults to 10)

  • sort_merge_join_sort_with_aligned_boundaries – Whether to use a specialized algorithm for sorting both sides of a sort-merge join such that they have aligned boundaries. This can lead to a faster merge-join at the cost of more skewed sorted join inputs, increasing the risk of OOMs.

  • hash_join_partition_size_leniency – If the left side of a hash join is already correctly partitioned and the right side isn’t, and the ratio between the left and right size is at least this value, then the right side is repartitioned to have an equal number of partitions as the left. Defaults to 0.5.

  • sample_size_for_sort – Number of elements to sample from each partition when running a sort. Default is 20.

  • num_preview_rows – Number of rows to display when showing a DataFrame preview. Default is 8.

  • parquet_target_filesize – Target file size when writing out Parquet files. Defaults to 512 MB.

  • parquet_target_row_group_size – Target row group size when writing out Parquet files. Defaults to 128 MB.

  • parquet_inflation_factor – Inflation factor of Parquet files (in-memory size / file size ratio). Defaults to 3.0.

  • csv_target_filesize – Target file size when writing out CSV files. Defaults to 512 MB.

  • csv_inflation_factor – Inflation factor of CSV files (in-memory size / file size ratio). Defaults to 0.5.

  • shuffle_aggregation_default_partitions – Maximum number of partitions to create when performing aggregations. Defaults to 200, unless the number of input partitions is less than 200.

  • read_sql_partition_size_bytes – Target size of a partition when reading from SQL databases. Defaults to 512 MB.

  • enable_aqe – Enables Adaptive Query Execution. Defaults to False.

  • enable_native_executor – Enables the native executor. Defaults to False.

  • default_morsel_size – Default size of morsels used for the new local executor. Defaults to 131072 rows.

  • shuffle_algorithm – The shuffle algorithm to use. Defaults to “map_reduce”. The other option is “pre_shuffle_merge”.

  • pre_shuffle_merge_threshold – Memory threshold in bytes for pre-shuffle merge. Defaults to 1 GB.

  • enable_ray_tracing – Enable tracing for Ray. Accessible in /tmp/ray/session_latest/logs/daft after the run completes. Defaults to False.