daft.Expression.url.download
- Expression.url.download(max_connections: int = 32, on_error: Literal['raise', 'null'] = 'raise', io_config: IOConfig | None = None, use_native_downloader: bool = True) → Expression
Treats each string as a URL and downloads its contents into a bytes column.
Note
If you observe excessive S3 errors (such as timeouts, DNS errors, or slowdown errors) during URL downloads, consider reducing max_connections (defaults to 32) to lower the load you are placing on your S3 servers.
Alternatively, if you are running on machines with fewer cores but very high network bandwidth, you can increase max_connections to get higher throughput through additional parallelism.
- Parameters:
max_connections – The maximum number of connections to use per thread for downloading URLs. Defaults to 32.
on_error – Behavior when a URL download error is encountered: “raise” to raise the error immediately, or “null” to log the error and fall back to a Null value. Defaults to “raise”.
io_config – IOConfig to use when accessing remote storage. Note that the S3Config’s max_connections parameter will be overridden by the max_connections value passed in as a kwarg.
use_native_downloader (bool) – Use the native downloader rather than the Python-based one. Defaults to True.
- Returns:
a Binary expression containing the bytes contents of each URL, or None where an error occurred during download (with on_error set to “null”)
- Return type:
Expression
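Example
The following is a minimal sketch of how the parameters above might be combined, assuming a Daft version whose signature matches the one documented here. The helper name download_column, the column names "urls" and "data", and the default of 8 connections are illustrative choices, not part of the Daft API.

```python
def download_column(urls, max_connections=8, io_config=None):
    """Download each URL in `urls` and return the results as a list.

    Hypothetical helper for illustration: entries are bytes on success
    and None where a download failed (because on_error="null").
    """
    # Imported lazily so the helper can be defined without Daft installed.
    import daft

    # Build a single-column DataFrame of URL strings.
    df = daft.from_pydict({"urls": list(urls)})

    # Add a bytes column holding the downloaded contents.
    # A lower max_connections reduces load on the remote store;
    # on_error="null" logs failures and yields Null instead of raising.
    df = df.with_column(
        "data",
        df["urls"].url.download(
            max_connections=max_connections,
            on_error="null",
            io_config=io_config,
        ),
    )

    # Materialize the result and return the downloaded column.
    return df.to_pydict()["data"]
```

Calling download_column with a list of URL strings returns one entry per input URL. Passing on_error="null" is a deliberate choice for this sketch: a single bad URL does not abort the whole batch, at the cost of having to check for None afterwards.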