Querying Databricks Iceberg tables using Presto engine

About this task

You can query remote Databricks Iceberg tables using the IBM® watsonx.data Presto engine by registering Databricks as a custom data source for zero-copy data federation.

  • Presto connects to Databricks through the Iceberg REST Catalog API interface, not Unity Catalog API directly.
  • Presto does not currently support vended-credentials for Databricks integration.
  • You must configure explicit storage credentials (AWS S3 and Azure Data Lake Storage Gen2) to access the underlying data files.
  • Presto supports Iceberg tables only; Delta Lake tables are not supported.
  • For general information about Databricks Unity Catalog integration, see Integrating Databricks Unity Catalog in watsonx.data.

Before you begin

Complete the prerequisites outlined in Integrating Databricks Unity Catalog in watsonx.data, including:

  • Databricks workspace with Unity Catalog enabled
  • Iceberg tables created in Databricks Unity Catalog
  • Personal Access Token with unity-catalog API scope
  • Workspace URL: https://<workspace-instance>.cloud.databricks.com
  • Unity Catalog permissions configured
  • Provisioned Presto engine in watsonx.data
  • Iceberg REST Catalog endpoint: https://<workspace-instance>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest
  • Catalog name: Name of the Unity Catalog containing Iceberg tables
  • Access credentials for the external storage location (AWS S3 or Azure)

Use of vended credentials is currently not supported. Hence, Presto requires explicit storage credentials.

Procedure

  1. Register Databricks as a custom data source

    1. In the watsonx.data console, click Infrastructure manager.

    2. Click Add component > Add data source.

    3. Select Custom as the data source type.

    4. Enter a display name (e.g., databricks_iceberg).

    5. In the Properties section, add the following properties:

      For AWS S3:

      connector.name=iceberg
      iceberg.catalog.type=rest
      iceberg.rest.uri=https://<workspace-instance>.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest
      iceberg.rest.auth.type=OAUTH2
      iceberg.rest.auth.oauth2.token=<databricks-access-token>
      iceberg.catalog.warehouse=<catalog-name>
      hive.s3.aws-access-key=<aws-access-key>
      hive.s3.aws-secret-key=<aws-secret-key>
      

      For Azure Data Lake Storage Gen2:

      connector.name=iceberg
      iceberg.catalog.type=rest
      iceberg.rest.uri=https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest
      iceberg.rest.auth.type=OAUTH2
      iceberg.rest.auth.oauth2.token=<databricks-access-token>
      iceberg.catalog.warehouse=<catalog-name>
      hive.azure.abfs-storage-account=<storage-account-name>
      hive.azure.abfs-access-key=<storage-account-key>
      

      Replace the placeholders in the 2 storage properties as follows:

      • <workspace-url>: Your Databricks workspace URL
      • <databricks-access-token>: Your Databricks personal access token
      • <catalog-name>: The Unity Catalog name containing your Iceberg tables
      • Storage credential placeholders with your actual credentials
    6. Click Create.

  2. Create a catalog for the data source

    1. Click Data manager > Catalogs.
    2. Click Create catalog.
    3. Select Iceberg as the catalog type.
    4. Enter a catalog name (e.g., databricks_catalog).
    5. In the Data source field, select the custom data source you created (databricks_iceberg).
    6. Click Create.
  3. Associate the catalog with Presto engine

    1. Click Infrastructure manager.
    2. Select your Presto engine from the list.
    3. Click Associate catalog.
    4. Select the catalog you created (databricks_catalog).
    5. Click Associate.
  4. Query Databricks Iceberg tables

    1. Click Query workspace.

    2. Select your Presto engine from the engine dropdown.

    3. Run queries against your remote Databricks Iceberg tables using fully qualified table names:

      -- List available schemas (namespaces)
      SHOW SCHEMAS IN databricks_catalog;
      
      -- List tables in a schema
      SHOW TABLES IN databricks_catalog.<schema_name>;
      
      -- Describe table structure
      DESCRIBE databricks_catalog.<schema_name>.<table_name>;
      
      -- Query data
      SELECT * FROM databricks_catalog.<schema_name>.<table_name> LIMIT 10;
      
      -- Get row count
      SELECT COUNT(*) FROM databricks_catalog.<schema_name>.<table_name>;
      
      -- Filter and aggregate
      SELECT
          column1,
          COUNT(*) as count,
          AVG(column2) as avg_value
      FROM databricks_catalog.<schema_name>.<table_name>
      WHERE column3 > 100
      GROUP BY column1
      ORDER BY count DESC;
      

Results

You can now query Iceberg tables from Databricks Unity Catalog using Presto. The queries execute directly on the data in the external storage location without copying data into watsonx.data.

Example queries and outputs

List schemas in a catalog:

SHOW SCHEMAS IN databricks_catalog;

Output:

Schema
--------------------------
default
delta_share_demo
feb14schema
information_schema
mrmadira_external_schema
tpcdsdbiceberg_10tb_partitioned_uc

List tables in a schema:

SHOW TABLES IN databricks_catalog.feb14schema;

Output:

Table
-----------------
avengers
cims_test_result
iceberg_orders
mrmadira_csv_table

Describe table structure:

DESCRIBE databricks_catalog.feb14schema.iceberg_orders;

Output:

Column       | Type                  | Extra | Comment
-------------|-----------------------|-------|--------
order_id     | bigint               |       |
customer_id  | bigint               |       |
order_ts     | timestamp(6)         |       |
total_amt    | decimal(12,2)        |       |

Query a table:

SELECT * FROM databricks_catalog.feb14schema.iceberg_orders LIMIT 10;

Output:

order_id | customer_id | order_ts            | total_amt
---------|-------------|---------------------|----------
22222    | 22222       | 2022-01-01 00:00:00 | 100.00

Related information