daft.DataFrame.iter_rows#
- DataFrame.iter_rows(results_buffer_size: Union[int, None, Literal['num_cpus']] = 'num_cpus', column_format: Literal['python', 'arrow'] = 'python') Iterator[Dict[str, Any]] [source]#
Return an iterator of rows for this dataframe.
Each row will be a Python dictionary of the form { “key” : value, … }. If you are instead looking to iterate over entire partitions of data, see:
df.iter_partitions()
.By default, Daft will convert the columns to Python lists for easy consumption. Datatypes with Python equivalents will be converted accordingly, e.g. timestamps to datetime, tensors to numpy arrays. For nested data such as List or Struct arrays, however, this can be expensive. You may wish to set
column_format
to “arrow” such that the nested data is returned as Arrow scalars.Note
A quick note on configuring asynchronous/parallel execution using
results_buffer_size
.The
results_buffer_size
kwarg controls how many results Daft will allow to be in the buffer while iterating. Once this buffer is filled, Daft will not run any more work until some partition is consumed from the buffer.Increasing this value means the iterator will consume more memory and CPU resources but have higher throughput
Decreasing this value means the iterator will consume lower memory and CPU resources, but have lower throughput
Setting this value to
None
means the iterator will consume as much resources as it deems appropriate per-iteration
The default value is the total number of CPUs available on the current machine.
Example
>>> import daft >>> >>> df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}) >>> for row in df.iter_rows(): ... print(row) {'foo': 1, 'bar': 'a'} {'foo': 2, 'bar': 'b'} {'foo': 3, 'bar': 'c'}
- Parameters:
results_buffer_size – how many partitions to allow in the results buffer (defaults to the total number of CPUs available on the machine).
column_format – the format of the columns to iterate over. One of “python” or “arrow”. Defaults to “python”.
See also
df.iter_partitions()
: iterator over entire partitions instead of single rows