daft.Expression.url.download

Expression.url.download(max_connections: int = 32, on_error: Literal['raise'] | Literal['null'] = 'raise', io_config: IOConfig | None = None, use_native_downloader: bool = True) -> Expression

Treats each string as a URL and downloads its contents, returning them as a bytes column.

Note

If you are observing excessive S3 issues (such as timeouts, DNS errors or slowdown errors) during URL downloads, you may wish to reduce the value of max_connections (defaults to 32) to reduce the amount of load you are placing on your S3 servers.

Alternatively, if you are running on machines with few cores but very high network bandwidth, you can increase max_connections to get higher throughput through additional parallelism.

Parameters:
  • max_connections – The maximum number of connections to use per thread for downloading URLs. Defaults to 32.

  • on_error – Behavior when a URL download error is encountered: “raise” to raise the error immediately, or “null” to log the error and fall back to a Null value. Defaults to “raise”.

  • io_config – IOConfig to use when accessing remote storage. Note that the S3Config’s max_connections parameter is overridden by the max_connections value passed to this method.

  • use_native_downloader (bool) – Use the native downloader rather than the Python-based one. Defaults to True.

Returns:

A Binary expression containing the bytes downloaded from each URL, or None for a URL whose download failed when on_error is “null”.

Return type:

Expression