daft.read_sql

read_sql(sql: str, conn: Union[Callable[[], Connection], str], partition_col: Optional[str] = None, num_partitions: Optional[int] = None, partition_bound_strategy: str = 'min-max', disable_pushdowns_to_sql: bool = False, infer_schema: bool = True, infer_schema_length: int = 10, schema: Optional[Dict[str, DataType]] = None) → DataFrame

Create a DataFrame from the results of a SQL query.

Parameters:
  • sql (str) – SQL query to execute

  • conn (Union[Callable[[], Connection], str]) – SQLAlchemy connection factory or database URL

  • partition_col (Optional[str]) – Column to partition the data by, defaults to None

  • num_partitions (Optional[int]) – Number of partitions to read the data into, defaults to None, which lets Daft determine the number of partitions. If specified, partition_col must also be specified.

  • partition_bound_strategy (str) – Strategy to determine partition bounds, either “min-max” or “percentile”, defaults to “min-max”

  • disable_pushdowns_to_sql (bool) – Whether to disable pushdowns to the SQL query, defaults to False

  • infer_schema (bool) – Whether to turn on schema inference, defaults to True. If set to False, the schema parameter must be provided.

  • infer_schema_length (int) – The number of rows to scan when inferring the schema, defaults to 10. If infer_schema is False, this parameter is ignored. Note that if Daft is able to use ConnectorX to infer the schema, this parameter is also ignored, as ConnectorX is an Arrow-backed driver.

  • schema (Optional[Dict[str, DataType]]) – A mapping of column names to datatypes. If infer_schema is False, this schema is used as the definitive schema for the data; otherwise it is used as a schema hint that is applied after the schema is inferred. This can be useful when the types can be determined more precisely than inference allows (e.g., if a column can be declared as a fixed-sized list rather than a list). See the schema example below.

Returns:

DataFrame containing the results of the query

Return type:

DataFrame

Note

  1. Supported dialects:

    Daft uses SQLGlot to build and translate SQL queries between dialects. For a list of supported dialects, see SQLGlot’s dialect documentation.

  2. Partitioning:

    When partition_col is specified, the function partitions the query based on that column. You can define num_partitions or leave it to Daft to decide. Daft uses the partition_bound_strategy parameter to determine how partition bounds are computed:

      • min-max: Daft calculates the minimum and maximum values of the specified column, then partitions the query using equal ranges between them.

      • percentile: Daft calculates the specified column’s percentiles via a PERCENTILE_DISC function to determine partitions (e.g., for num_partitions=3, it uses the 33rd and 66th percentiles). See the percentile example below.

  3. Execution:

    Daft executes SQL queries using ConnectorX or SQLAlchemy, preferring ConnectorX unless a SQLAlchemy connection factory is specified or the database dialect is unsupported by ConnectorX.

  4. Pushdowns:

    Daft pushes down operations such as filters, projections, and limits into the SQL query when possible. You can disable pushdowns by setting disable_pushdowns_to_sql=True, which executes the SQL query as-is (see the pushdown example below).

Example

Read data from a SQL query and a database URL:

>>> df = daft.read_sql("SELECT * FROM my_table", "sqlite:///my_database.db")

Read data from a SQL query and a SQLAlchemy connection factory:

>>> import sqlalchemy
>>> def create_conn():
...     return sqlalchemy.create_engine("sqlite:///my_database.db").connect()
>>> df = daft.read_sql("SELECT * FROM my_table", create_conn)

Read data from a SQL query and partition the data by a column:

>>> df = daft.read_sql(
...     "SELECT * FROM my_table",
...     "sqlite:///my_database.db",
...     partition_col="id"
... )

Read data from a SQL query and partition the data into 3 partitions:

>>> df = daft.read_sql(
...     "SELECT * FROM my_table",
...     "sqlite:///my_database.db",
...     partition_col="id",
...     num_partitions=3
... )
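
Read data and partition it with the percentile strategy (an illustrative sketch using the same example URL as above; in practice, the database dialect must support the PERCENTILE_DISC function that this strategy relies on):

>>> df = daft.read_sql(
...     "SELECT * FROM my_table",
...     "sqlite:///my_database.db",
...     partition_col="id",
...     num_partitions=3,
...     partition_bound_strategy="percentile"
... )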
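Read data with an explicit schema instead of schema inference (a minimal sketch; it assumes my_table has an integer id column and a string name column):

>>> df = daft.read_sql(
...     "SELECT * FROM my_table",
...     "sqlite:///my_database.db",
...     infer_schema=False,
...     schema={"id": daft.DataType.int64(), "name": daft.DataType.string()}
... )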
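Read data with pushdowns disabled, so Daft executes the SQL query exactly as written:

>>> df = daft.read_sql(
...     "SELECT * FROM my_table",
...     "sqlite:///my_database.db",
...     disable_pushdowns_to_sql=True
... )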