daft.set_execution_config#
- set_execution_config(config: Optional[PyDaftExecutionConfig] = None, scan_tasks_min_size_bytes: Optional[int] = None, scan_tasks_max_size_bytes: Optional[int] = None, broadcast_join_size_bytes_threshold: Optional[int] = None, parquet_split_row_groups_max_files: Optional[int] = None, sort_merge_join_sort_with_aligned_boundaries: Optional[bool] = None, hash_join_partition_size_leniency: Optional[float] = None, sample_size_for_sort: Optional[int] = None, num_preview_rows: Optional[int] = None, parquet_target_filesize: Optional[int] = None, parquet_target_row_group_size: Optional[int] = None, parquet_inflation_factor: Optional[float] = None, csv_target_filesize: Optional[int] = None, csv_inflation_factor: Optional[float] = None, shuffle_aggregation_default_partitions: Optional[int] = None, read_sql_partition_size_bytes: Optional[int] = None, enable_aqe: Optional[bool] = None, enable_native_executor: Optional[bool] = None, default_morsel_size: Optional[int] = None, shuffle_algorithm: Optional[str] = None, pre_shuffle_merge_threshold: Optional[int] = None, enable_ray_tracing: Optional[bool] = None) DaftContext [source]#
Globally sets configuration parameters that control various aspects of Daft execution. These configuration values are used when a DataFrame is executed (e.g. calls to write_*, collect(), or show()).
- Parameters:
config – A PyDaftExecutionConfig object to set the config to, before applying other kwargs. Defaults to None, which indicates that the current config should be used.
scan_tasks_min_size_bytes – Minimum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will make Daft perform more merging of files into a single partition before yielding, which leads to bigger but fewer partitions. (Defaults to 96 MiB)
scan_tasks_max_size_bytes – Maximum size in bytes when merging ScanTasks when reading files from storage. Increasing this value will increase the upper bound of the size of merged ScanTasks, which leads to bigger but fewer partitions. (Defaults to 384 MiB)
broadcast_join_size_bytes_threshold – If one side of a join is smaller than this threshold, a broadcast join will be used. Default is 10 MiB.
parquet_split_row_groups_max_files – Maximum number of files being read for which row-group splitting should be applied. (Defaults to 10)
sort_merge_join_sort_with_aligned_boundaries – Whether to use a specialized algorithm for sorting both sides of a sort-merge join such that they have aligned boundaries. This can lead to a faster merge-join at the cost of more skewed sorted join inputs, increasing the risk of OOMs.
hash_join_partition_size_leniency – If the left side of a hash join is already correctly partitioned and the right side isn’t, and the ratio between the left and right size is at least this value, then the right side is repartitioned to have an equal number of partitions as the left. Defaults to 0.5.
sample_size_for_sort – Number of elements to sample from each partition when running a sort. Defaults to 20.
num_preview_rows – Number of rows to show in a dataframe preview. Defaults to 8.
parquet_target_filesize – Target file size when writing out Parquet files. Defaults to 512 MB.
parquet_target_row_group_size – Target row group size when writing out Parquet files. Defaults to 128 MB.
parquet_inflation_factor – Inflation factor of Parquet files (in-memory size / file size ratio). Defaults to 3.0.
csv_target_filesize – Target file size when writing out CSV files. Defaults to 512 MB.
csv_inflation_factor – Inflation factor of CSV files (in-memory size / file size ratio). Defaults to 0.5.
shuffle_aggregation_default_partitions – Maximum number of partitions to create when performing aggregations. Defaults to 200, unless the number of input partitions is less than 200.
read_sql_partition_size_bytes – Target size of a partition when reading from SQL databases. Defaults to 512 MB.
enable_aqe – Enables Adaptive Query Execution. Defaults to False.
enable_native_executor – Enables the native executor. Defaults to False.
default_morsel_size – Default size of morsels used for the new local executor. Defaults to 131072 rows.
shuffle_algorithm – The shuffle algorithm to use. Defaults to “map_reduce”; the other option is “pre_shuffle_merge”.
pre_shuffle_merge_threshold – Memory threshold in bytes for pre-shuffle merge. Defaults to 1 GB.
enable_ray_tracing – Enable tracing for Ray. Traces are accessible in /tmp/ray/session_latest/logs/daft after the run completes. Defaults to False.
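As a minimal sketch of how this function is typically called, the snippet below sets a few of the parameters documented above before executing a DataFrame. The specific values chosen here are illustrative, not recommendations, and the example assumes Daft is installed:

```python
import daft

# Globally override a handful of execution parameters; any parameter
# left as None (i.e. not passed) keeps its current value.
ctx = daft.set_execution_config(
    scan_tasks_min_size_bytes=96 * 1024 * 1024,            # merge scan tasks up to 96 MiB
    broadcast_join_size_bytes_threshold=10 * 1024 * 1024,  # broadcast joins under 10 MiB
    num_preview_rows=8,                                    # rows shown by df.show()
    enable_aqe=False,                                      # leave Adaptive Query Execution off
)

# Subsequent executions (collect(), show(), write_*) use the new config.
df = daft.from_pydict({"a": [1, 2, 3]})
df.collect()
```

The returned DaftContext reflects the updated global configuration; because the settings are global, calling this function affects all DataFrames executed afterwards in the same process.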