daft.DataFrame.iter_partitions

DataFrame.iter_partitions(results_buffer_size: Union[int, None, Literal['num_cpus']] = 'num_cpus') → Iterator[Union[MicroPartition, ray.ObjectRef[MicroPartition]]]

Begin executing this dataframe and return an iterator over the partitions.

Each partition is returned as a MicroPartition object (when using the local Python runner backend) or as a Ray ObjectRef to a MicroPartition (when using the Ray runner backend).

Note

A quick note on configuring asynchronous/parallel execution using results_buffer_size.

The results_buffer_size kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.

  • Increasing this value means the iterator will consume more memory and CPU resources, but have higher throughput.

  • Decreasing this value means the iterator will consume less memory and CPU resources, but have lower throughput.

  • Setting this value to None means the iterator will consume as many resources as it deems appropriate per iteration.

The default value is the total number of CPUs available on the current machine.

Parameters:

results_buffer_size – how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).
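As a minimal sketch of tuning the buffer (the column values and partition counts here are arbitrary), a small results_buffer_size keeps fewer completed partitions buffered ahead of the consumer, while None lets Daft decide per iteration:

>>> import daft
>>>
>>> df = daft.from_pydict({"x": list(range(100))}).into_partitions(8)
>>> # Buffer at most 2 completed partitions ahead of the consumer,
>>> # trading some throughput for a lower peak memory footprint.
>>> for part in df.iter_partitions(results_buffer_size=2):
...     pass  # consume each MicroPartition as it becomes available
>>>
>>> # Let Daft decide how much to buffer per iteration.
>>> for part in df.iter_partitions(results_buffer_size=None):
...     pass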

>>> import daft
>>>
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
>>> for part in df.iter_partitions():
...     print(part)
MicroPartition with 2 rows:
TableState: Loaded. 1 tables
╭───────┬──────╮
│ foo   ┆ bar  │
│ ---   ┆ ---  │
│ Int64 ┆ Utf8 │
╞═══════╪══════╡
│ 1     ┆ a    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2     ┆ b    │
╰───────┴──────╯


Statistics: missing

MicroPartition with 1 rows:
TableState: Loaded. 1 tables
╭───────┬──────╮
│ foo   ┆ bar  │
│ ---   ┆ ---  │
│ Int64 ┆ Utf8 │
╞═══════╪══════╡
│ 3     ┆ c    │
╰───────┴──────╯


Statistics: missing
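When the Ray runner backend is active (for example, selected with daft.context.set_runner_ray()), each yielded item is a Ray ObjectRef rather than a local MicroPartition. A minimal sketch, assuming Ray is installed and the Ray runner has already been selected:

>>> import ray
>>> import daft
>>>
>>> # Assumes the Ray runner is active, so iter_partitions yields ObjectRefs.
>>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(2)
>>> for ref in df.iter_partitions():
...     part = ray.get(ref)  # materialize the MicroPartition on the driver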