daft.read_warc

read_warc(path: Union[str, List[str]], io_config: Optional[IOConfig] = None, file_path_column: Optional[str] = None, _multithreaded_io: Optional[bool] = None) -> DataFrame

Creates a DataFrame from WARC or gzipped WARC file(s). This is an experimental feature and the API may change in the future.

Example

>>> df = daft.read_warc("/path/to/file.warc")
>>> df = daft.read_warc("/path/to/directory")
>>> df = daft.read_warc("/path/to/files-*.warc")
>>> df = daft.read_warc("s3://path/to/files-*.warc")
>>> df = daft.read_warc("gs://path/to/files-*.warc")
Parameters:
  • path (Union[str, List[str]]) – Path to the WARC file(s) to read. This can be a single file, a directory, or a wildcard pattern; a list of such paths is also accepted.

  • io_config (Optional[IOConfig]) – Config to be used with the native downloader

  • file_path_column (Optional[str]) – Include the source path(s) as a column with this name. Defaults to None.

  • _multithreaded_io (Optional[bool]) – Whether to use multithreading for IO threads. Setting this to False can be helpful in reducing the amount of system resources (number of connections and thread contention) when running in the Ray runner. Defaults to None, which will let Daft decide based on the runner it is currently using.
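As an illustration of how these parameters combine, the sketch below wraps read_warc in a small helper. The helper name read_public_warc and the column name "source_path" are hypothetical, and anonymous S3 access is assumed only for the sake of the example; adjust the IOConfig to match your own storage and credentials.

```python
def read_public_warc(path, path_column="source_path"):
    """Read WARC file(s) with anonymous S3 access and keep each row's source path.

    Hypothetical helper for illustration; not part of Daft.
    """
    # Imported lazily so that defining this helper does not itself require Daft.
    import daft
    from daft.io import IOConfig, S3Config

    # Assumption: the data is publicly readable, so no credentials are supplied.
    io_config = IOConfig(s3=S3Config(anonymous=True))

    # file_path_column adds a column (named by path_column) holding the source path.
    return daft.read_warc(path, io_config=io_config, file_path_column=path_column)
```

Calling read_public_warc("s3://path/to/files-*.warc") would then return a DataFrame that, in addition to the WARC columns, carries the source path of each record in the "source_path" column.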

Returns:

A parsed DataFrame with mandatory metadata columns (“WARC-Record-ID”, “WARC-Type”, “WARC-Date”, “Content-Length”), one optional metadata column (“WARC-Identified-Payload-Type”), one column “warc_content” with the raw byte content of the WARC record, and one column “warc_headers” with the remaining headers of the WARC record stored as a JSON string.

Return type:

DataFrame
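
To make the column layout described under Returns more concrete, here is a minimal standard-library sketch that maps a single uncompressed WARC record onto the same column names. It is not Daft's implementation: the function split_warc_record and the sample record are hypothetical, and the value types used here (strings, plus an integer for Content-Length) are an assumption rather than the dtypes Daft produces.

```python
import json

# Headers that the Returns description lists as dedicated columns.
MANDATORY = ("WARC-Record-ID", "WARC-Type", "WARC-Date", "Content-Length")
OPTIONAL = ("WARC-Identified-Payload-Type",)


def split_warc_record(raw: bytes) -> dict:
    """Illustrative only: split one uncompressed WARC record into a row dict."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")

    # lines[0] is the version line (e.g. "WARC/1.0"); the rest are "Name: value".
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    row = {name: headers.pop(name) for name in MANDATORY}
    row["Content-Length"] = int(row["Content-Length"])
    for name in OPTIONAL:
        row[name] = headers.pop(name, None)  # None when the header is absent

    # Content-Length gives the number of payload bytes that follow the headers.
    row["warc_content"] = rest[: row["Content-Length"]]
    # Whatever headers remain are serialized together as one JSON string.
    row["warc_headers"] = json.dumps(headers)
    return row


# A hand-written sample record (hypothetical data).
body = b"hello warc"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Date: 2024-01-01T00:00:00Z\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)
row = split_warc_record(record)
```

In this sketch the sample record has no WARC-Identified-Payload-Type header, so that entry is None, and WARC-Target-URI, which has no dedicated column, ends up inside the "warc_headers" JSON string.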