Querying Databricks Iceberg tables using Presto engine
About this task
You can query remote Databricks Iceberg tables using the IBM® watsonx.data Presto engine by registering Databricks as a custom data source for zero-copy data federation.
- Presto connects to Databricks through the Iceberg REST Catalog API interface, not Unity Catalog API directly.
- Presto does not currently support vended-credentials for Databricks integration.
- You must configure explicit storage credentials (AWS S3 and Azure Data Lake Storage Gen2) to access the underlying data files.
- Presto supports Iceberg tables only; Delta Lake tables are not supported.
- For general information about Databricks Unity Catalog integration, see Integrating Databricks Unity Catalog in watsonx.data.
Before you begin
Complete the prerequisites outlined in Integrating Databricks Unity Catalog in watsonx.data, including:
- Databricks workspace with Unity Catalog enabled
- Iceberg tables created in Databricks Unity Catalog
- Personal Access Token with
unity-catalogAPI scope - Workspace URL:
https://<workspace-instance>.cloud.databricks.com - Unity Catalog permissions configured
- Provisioned Presto engine in watsonx.data
- Iceberg REST Catalog endpoint:
https://<workspace-instance>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest - Catalog name: Name of the Unity Catalog containing Iceberg tables
- Access credentials for the external storage location (AWS S3 or Azure)
Use of vended credentials is currently not supported. Hence, Presto requires explicit storage credentials.
Procedure
-
Register Databricks as a custom data source
-
In the watsonx.data console, click Infrastructure manager.
-
Click Add component > Add data source.
-
Select Custom as the data source type.
-
Enter a display name (e.g.,
databricks_iceberg). -
In the Properties section, add the following properties:
For AWS S3:
connector.name=iceberg iceberg.catalog.type=rest iceberg.rest.uri=https://<workspace-instance>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest iceberg.rest.auth.type=OAUTH2 iceberg.rest.auth.oauth2.token=<databricks-access-token> iceberg.catalog.warehouse=<catalog-name> hive.s3.aws-access-key=<aws-access-key> hive.s3.aws-secret-key=<aws-secret-key>For Azure Data Lake Storage Gen2:
connector.name=iceberg iceberg.catalog.type=rest iceberg.rest.uri=https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest iceberg.rest.auth.type=OAUTH2 iceberg.rest.auth.oauth2.token=<databricks-access-token> iceberg.catalog.warehouse=<catalog-name> hive.azure.abfs-storage-account=<storage-account-name> hive.azure.abfs-access-key=<storage-account-key>Replace the placeholders in the 2 storage properties as follows:
<workspace-url>: Your Databricks workspace URL<databricks-access-token>: Your Databricks personal access token<catalog-name>: The Unity Catalog name containing your Iceberg tables- Storage credential placeholders with your actual credentials
-
Click Create.
-
-
Create a catalog for the data source
- Click Data manager > Catalogs.
- Click Create catalog.
- Select Iceberg as the catalog type.
- Enter a catalog name (e.g.,
databricks_catalog). - In the Data source field, select the custom data source you created (
databricks_iceberg). - Click Create.
-
Associate the catalog with Presto engine
- Click Infrastructure manager.
- Select your Presto engine from the list.
- Click Associate catalog.
- Select the catalog you created (
databricks_catalog). - Click Associate.
-
Query Databricks Iceberg tables
-
Click Query workspace.
-
Select your Presto engine from the engine dropdown.
-
Run queries against your remote Databricks Iceberg tables using fully qualified table names:
-- List available schemas (namespaces) SHOW SCHEMAS IN databricks_catalog; -- List tables in a schema SHOW TABLES IN databricks_catalog.<schema_name>; -- Describe table structure DESCRIBE databricks_catalog.<schema_name>.<table_name>; -- Query data SELECT * FROM databricks_catalog.<schema_name>.<table_name> LIMIT 10; -- Get row count SELECT COUNT(*) FROM databricks_catalog.<schema_name>.<table_name>; -- Filter and aggregate SELECT column1, COUNT(*) as count, AVG(column2) as avg_value FROM databricks_catalog.<schema_name>.<table_name> WHERE column3 > 100 GROUP BY column1 ORDER BY count DESC;
-
Results
You can now query Iceberg tables from Databricks Unity Catalog using Presto. The queries execute directly on the data in the external storage location without copying data into watsonx.data.
Example queries and outputs
List schemas in a catalog:
SHOW SCHEMAS IN databricks_catalog;
Output:
Schema
--------------------------
default
delta_share_demo
feb14schema
information_schema
mrmadira_external_schema
tpcdsdbiceberg_10tb_partitioned_uc
List tables in a schema:
SHOW TABLES IN databricks_catalog.feb14schema;
Output:
Table
-----------------
avengers
cims_test_result
iceberg_orders
mrmadira_csv_table
Describe table structure:
DESCRIBE databricks_catalog.feb14schema.iceberg_orders;
Output:
Column | Type | Extra | Comment
-------------|-----------------------|-------|--------
order_id | bigint | |
customer_id | bigint | |
order_ts | timestamp(6) | |
total_amt | decimal(12,2) | |
Query a table:
SELECT * FROM databricks_catalog.feb14schema.iceberg_orders LIMIT 10;
Output:
order_id | customer_id | order_ts | total_amt
---------|-------------|---------------------|----------
22222 | 22222 | 2022-01-01 00:00:00 | 100.00