Querying Databricks Unity Catalog using Spark engine
About this task
You can query remote Databricks Unity Catalog tables using the IBM® watsonx.data Spark engine through zero-copy data federation. This capability allows Spark to query both Delta Lake and Iceberg tables stored in Unity Catalog without requiring data movement or duplication.
This document demonstrates how to use the Unity Catalog Iceberg REST API endpoint (/unity-catalog/iceberg-rest) to query tables.
It first illustrates how to query a managed Iceberg table in Databricks that uses an external storage location, from watsonx.data Spark. The examples cover both:
- Explicit storage credentials
- Vended credentials
The document then explains how to query a managed Delta table that uses Unity Catalog-managed (native) storage. This is achieved by reading the Delta table as an Iceberg table through the Iceberg REST API, using vended credentials.
For more information about Databricks Unity Catalog integration, see Integrating Databricks Unity Catalog in watsonx.data.
Before you begin
Complete the prerequisites outlined in Integrating Databricks Unity Catalog in watsonx.data, including:
- Databricks workspace with Unity Catalog enabled
- OAuth credentials or Personal Access Token
- Unity Catalog permissions configured
- Storage credentials (AWS S3, Azure, or GCS)
- Provisioned Spark engine (version 3.5 or later)
Procedure
-
Create a Spark session configuration with your Unity Catalog connection details to enable zero-copy querying.
from pyspark.sql import SparkSession WORKSPACE_URL = "https://<workspace-id>.cloud.databricks.com" UNITY_CATALOG_URL = f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest" CATALOG_NAME = "your_catalog_name" SCHEMA_NAME = "your_schema_name" TABLE_NAME = "your_table_name" OAUTH_CLIENT_ID = "your_oauth_client_id" OAUTH_CLIENT_SECRET = "your_oauth_client_secret" # Configure explicit storage credentials only when not using credential vending # AWS_ACCESS_KEY_ID = "your_aws_access_key" # AWS_SECRET_ACCESS_KEY = "your_aws_secret_key" spark = SparkSession.builder \ .appName("Iceberg-REST-UC") \ .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0," "org.apache.iceberg:iceberg-aws-bundle:1.5.0") \ .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.uri", UNITY_CATALOG_URL) \ .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME) \ .config(f"spark.sql.catalog.{CATALOG_NAME}.scope", "all-apis") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.type", "oauth2") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.oauth2-server-uri", f"{WORKSPACE_URL}/oidc/v1/token") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.credential", f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}") \ # Use the following configurations to enable credential vending. # Do not configure explicit storage credentials when using this option. .config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials") \ .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.iceberg.aws.s3.S3V4RestSignerClient") \ # Alternatively, configure explicit S3 credentials (disable the above settings if used). # .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID) \ # .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) \ .getOrCreate()Replace placeholder values
Update the following placeholders in the script below:
<workspace-id>: Your Databricks workspace identifier. Example workspace URLs:https://dbc-a1b2345c-d6e7.cloud.databricks.com/https://adb-1234567890123456.7.azuredatabricks.net/
your_catalog_name: Name of your Unity Catalogyour_schema_name: Name of your schemayour_table_name: Name of your tableyour_oauth_client_id: OAuth client IDyour_oauth_client_secret: OAuth client secret
If you plan to use explicit storage credentials, replace the following:
your_aws_access_key: AWS access key (for Amazon S3 storage)your_aws_secret_key: AWS secret key (for Amazon S3 storage)
The OAuth client ID and secret can be obtained from Service Principals in Databricks. For more information, see Integrating Databricks Unity Catalog in watsonx.data.
-
Choose a submission method
You can query remote Unity Catalog tables using one of three methods:
Querying methods Method Best for Documentation SparkLab (VS Code) Interactive development, testing, debugging VS Code Development Environment REST API Automation, CI/CD pipelines, programmatic control Submitting Spark Application by using REST API CPDCTL CLI Command-line submissions, shell scripts, DevOps workflows Submitting Spark Application by using CPDCTL -
Query remote Iceberg tables by using SparkLab
-
Open SparkLab in watsonx.data.
-
Create a new PySpark script.
-
Add the following code:
from pyspark.sql import SparkSession WORKSPACE_URL = "https://<workspace-id>.cloud.databricks.com" UNITY_CATALOG_URL = f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest" CATALOG_NAME = "your_catalog_name" SCHEMA_NAME = "your_schema_name" TABLE_NAME = "your_table_name" OAUTH_CLIENT_ID = "your_client_id" OAUTH_CLIENT_SECRET = "your_client_secret" # Configure explicit storage credentials only when not using credential vending # AWS_ACCESS_KEY_ID = "your_access_key" # AWS_SECRET_ACCESS_KEY = "your_secret_key" spark = SparkSession.builder \ .appName("Iceberg-REST-UC") \ .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0," "org.apache.iceberg:iceberg-aws-bundle:1.5.0") \ .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \ .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.type", "rest") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.uri", UNITY_CATALOG_URL) \ .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", CATALOG_NAME) \ .config(f"spark.sql.catalog.{CATALOG_NAME}.scope", "all-apis") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.type", "oauth2") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.oauth2-server-uri", f"{WORKSPACE_URL}/oidc/v1/token") \ .config(f"spark.sql.catalog.{CATALOG_NAME}.credential", f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}") \ # Use the following configurations to enable credential vending. # Do not configure explicit storage credentials when using this option. .config(f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation", "vended-credentials") \ .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.iceberg.aws.s3.S3V4RestSignerClient") \ # Alternatively, configure explicit S3 credentials (disable the above settings if used). # .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID) \ # .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY) \ .getOrCreate() print(f"Namespaces in catalog: {CATALOG_NAME}") spark.sql(f"SHOW NAMESPACES IN {CATALOG_NAME}").show(truncate=False) print(f"Tables in namespace: {CATALOG_NAME}.{SCHEMA_NAME}") spark.sql(f"SHOW TABLES IN {CATALOG_NAME}.{SCHEMA_NAME}").show(truncate=False) spark.sql(f"DESCRIBE EXTENDED {CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}").show(truncate=False) full_table_name = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}" print(f"Querying: {full_table_name}") df = spark.sql(f"SELECT * FROM {full_table_name}") df.printSchema() df.show(20, truncate=False) count = df.count() print(f"Total records: {count}") spark.stop()
-
-
Run the script.
-
Query managed Delta tables with Unity Catalog-managed storage.
You can query managed Delta tables that use Unity Catalog-managed (native) storage from watsonx.data Spark by reading them as Iceberg tables through the Iceberg REST API.
-
To enable this capability, you must first configure the Delta table in Databricks to support the Uniform format.
Ensure that you have the required permissions to modify table properties and run maintenance operations on the target table in Databricks.
-
Run the following commands in the Databricks SQL Editor:
ALTER TABLE spark_deegandh.test_schema.sensor_readings SET TBLPROPERTIES ( 'delta.enableDeletionVectors' = 'false' ); REORG TABLE spark_deegandh.test_schema.sensor_readings APPLY (PURGE); ALTER TABLE spark_deegandh.test_schema.sensor_readings SET TBLPROPERTIES ( 'delta.enableIcebergCompatV2' = 'true', 'delta.columnMapping.mode' = 'name', 'delta.universalFormat.enabledFormats' = 'iceberg' ); SHOW TBLPROPERTIES spark_deegandh.test_schema.sensor_readings;
After completing these steps, the Delta table is enabled for Uniform and can be accessed as an Iceberg table through the Unity Catalog Iceberg REST API.
You should see the following properties in the output:
delta.universalFormat.enabledFormats = icebergdelta.enableIcebergCompatV2 = true
-
You can then query the Delta table from watsonx.data Spark by using the same script, leveraging credential vending through the Unity Catalog Iceberg REST API endpoint (
/unity-catalog/iceberg-rest), as described in the previous step.Deletion vectors must be disabled and fully purged before enabling Iceberg compatibility. If this step is skipped, the configuration will fail.
For more details about Uniform and creating Delta tables with Uniform (universal format) enabled, see the official Databricks documentation: Delta Uniform: Universal Format for Lakehouse Interoperability.
-
Results
You can now query remote data from Databricks Unity Catalog without copying data. The queries execute directly against the data in the external storage location.
Example output for Iceberg table
Namespaces in catalog: spark_deegandh
+---------------------------+
|namespace |
+---------------------------+
|default |
|delta_share_demo |
|feb14schema |
|information_schema |
|mrmadira_external_schema |
+---------------------------+
Tables in namespace: spark_deegandh.feb14schema:
+-----------+--------------------+-----------+
|namespace |tableName |isTemporary|
+-----------+--------------------+-----------+
|feb14schema|avengers |false |
|feb14schema|cims_test_result |false |
|feb14schema|iceberg_orders |false |
|feb14schema|mrmadira_csv_table |false |
+-----------+--------------------+-----------+
Querying: spark_deegandh.feb14schema.iceberg_orders:
Schema:
root
|-- order_id: long (nullable = true)
|-- customer_id: long (nullable = true)
|-- order_ts: timestamp (nullable = true)
|-- total_amt: decimal(12,2) (nullable = true)
Data (first 20 rows):
+--------+-----------+-------------------+---------+
|order_id|customer_id|order_ts |total_amt|
+--------+-----------+-------------------+---------+
|22222 |22222 |2022-01-01 00:00:00|100.00 |
+--------+-----------+-------------------+---------+
Total records: 1
Example output for the Delta table
Query:
spark.sql(f"SHOW NAMESPACES IN {CATALOG_NAME}").show(truncate=False)
Output:
Namespaces in catalog: spark_deegandh
+----------------------------------+
|namespace |
+----------------------------------+
|default |
|delta_share_demo |
|feb14schema |
|information_schema |
|mrmadira_external_schema |
|test_schema |
|tpcdsdbiceberg_10tb_partitioned_uc|
+----------------------------------+
Query:
spark.sql(f"SHOW TABLES IN {CATALOG_NAME}.{SCHEMA_NAME}").show(truncate=False)
Output:
Tables in namespace: spark_deegandh.test_schema:
+-----------+---------------+-----------+
|namespace |tableName |isTemporary|
+-----------+---------------+-----------+
|test_schema|sensor_readings|false |
+-----------+---------------+-----------+
Query:
df = spark.sql(f"SELECT * FROM {full_table_name}")
df.printSchema()
df.show(20, truncate=False)
count = df.count()
print(f"Total records: {count}")
Output:
Querying: spark_deegandh.test_schema.sensor_readings:
Schema:
root
|-- sensor_id: integer (nullable = true)
|-- location: string (nullable = true)
|-- temperature: double (nullable = true)
|-- humidity: double (nullable = true)
|-- battery_level: double (nullable = true)
|-- reading_timestamp: string (nullable = true)
|-- is_active: boolean (nullable = true)
Data (first 20 rows):
+---------+-----------+-----------+--------+-------------+-------------------+---------+
|sensor_id|location |temperature|humidity|battery_level|reading_timestamp |is_active|
+---------+-----------+-----------+--------+-------------+-------------------+---------+
|1001 |Warehouse-A|24.5 |60.2 |89.5 |2026-04-01 10:00:00|true |
|1002 |Warehouse-B|26.1 |58.0 |76.3 |2026-04-01 10:05:00|true |
|1003 |Warehouse-C|22.8 |65.4 |54.2 |2026-04-01 10:10:00|false |
|1004 |Warehouse-A|25.0 |61.5 |88.0 |2026-04-01 10:15:00|true |
|1005 |Warehouse-B|27.3 |55.2 |70.1 |2026-04-01 10:20:00|true |
|1006 |Warehouse-C|23.5 |66.8 |49.9 |2026-04-01 10:25:00|false |
|1007 |Warehouse-A|24.8 |62.1 |85.7 |2026-04-01 10:30:00|true |
|1008 |Warehouse-B|26.9 |57.6 |72.4 |2026-04-01 10:35:00|true |
|1009 |Warehouse-C|22.2 |67.0 |45.3 |2026-04-01 10:40:00|false |
|1010 |Warehouse-A|25.3 |60.9 |87.2 |2026-04-01 10:45:00|true |
+---------+-----------+-----------+--------+-------------+-------------------+---------+
Total records: 10