Unity Catalog#
Unity Catalog is an open-sourced catalog developed by Databricks. Users of Unity Catalog are able to work with data assets such as tables (Parquet, CSV, Iceberg, Delta), volumes (storing raw files), functions and models.
To use Daft with the Unity Catalog, you will need to install Daft with the unity
option specified like so:
pip install daft[unity]
Warning
These APIs are in beta and may be subject to change as the Unity Catalog continues to be developed.
Connecting to the Unity Catalog#
Daft includes an abstraction for the Unity Catalog.
from daft.unity_catalog import UnityCatalog
unity = UnityCatalog(
endpoint="https://<databricks_workspace_id>.cloud.databricks.com",
# Authentication can be retrieved from your provider of Unity Catalog
token="my-token",
)
# See all available catalogs
print(unity.list_catalogs())
# See available schemas in a given catalog
print(unity.list_schemas("my_catalog_name"))
# See available tables in a given schema
print(unity.list_tables("my_catalog_name.my_schema_name"))
Loading a Daft Dataframe from a Delta Lake table in Unity Catalog#
unity_table = unity.load_table("my_catalog_name.my_schema_name.my_table_name")
df = daft.read_deltalake(unity_table)
df.show()
Any subsequent filter operations on the Daft df
DataFrame object will be correctly optimized to take advantage of DeltaLake features:
# Filter which takes advantage of partition pruning capabilities of Delta Lake
df = df.where(df["partition_key"] < 1000)
df.show()
See also Delta Lake for more information about how to work with the Delta Lake tables provided by the Unity Catalog.
Roadmap#
-
Volumes integration for reading objects from volumes (e.g. images and documents)
-
Unity Iceberg integration for reading tables using the Iceberg interface instead of the Delta Lake interface
Please make issues on the Daft repository if you have any use-cases that Daft does not currently cover!