Setting up Cloudera integration with Presto engine
About this task
You can configure IBM® watsonx.data to query Hive tables stored in Cloudera HDFS using the Presto engine through zero-copy data federation. This guide covers the setup process for both Kerberos and non-Kerberos authentication methods.
For general information about Cloudera integration, see Integrating Cloudera in watsonx.data.
Before you begin
Ensure that the following prerequisites are met before proceeding.
Cloudera requirements:
- An active Cloudera cluster with HDFS
- Access to Cloudera Query Workspace
- Network connectivity between watsonx.data and Cloudera cluster
- HDFS NameNode hostname and port
- Hive Metastore URI (format: thrift://<metastore-host>:<port>)
- Download the required configuration files:
  - Log in to Cloudera Manager.
  - Navigate to Clusters.
  - Select HDFS from the list.
  - Click on Actions.
  - Choose Download Client Configuration from the dropdown menu.
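The connection details listed above can be sanity-checked before they are entered in watsonx.data. The following is a minimal Python sketch and not part of the product: `parse_endpoint` and `is_reachable` are hypothetical helper names, and a check run from a workstation only approximates the network path that watsonx.data itself uses.

```python
import socket
from urllib.parse import urlsplit


def parse_endpoint(uri, expected_scheme):
    """Split a URI such as thrift://host:9083 into (host, port).

    Raises ValueError if the scheme is unexpected or the host or port is missing.
    """
    parts = urlsplit(uri)
    if parts.scheme != expected_scheme:
        raise ValueError(f"expected {expected_scheme}:// but got {uri!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError(f"host and port are both required in {uri!r}")
    return parts.hostname, parts.port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_reachable(*parse_endpoint("thrift://metastore.example.com:9083", "thrift"))` reports whether the metastore port accepts connections from the machine that runs the script.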
Authentication requirements:
Choose one of the following authentication methods:
- Non-Kerberos authentication:
  - HDFS configuration files (core-site.xml, hdfs-site.xml)
  - HDFS user with appropriate permissions
- Kerberos authentication:
  - Kerberos principal and keytab file
  - Kerberos configuration file (krb5.conf)
  - HDFS configuration files (core-site.xml, hdfs-site.xml)
  - Access to Cloudera cluster's Kerberos realm
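Which authentication path applies can usually be read from the downloaded client configuration. The sketch below assumes the standard Hadoop properties `fs.defaultFS` and `hadoop.security.authentication` in core-site.xml (Hadoop treats a missing authentication property as `simple`). The function names are hypothetical and the sample XML is illustrative, not taken from a real cluster.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return the <property> name/value pairs of a Hadoop *-site.xml file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def summarize_core_site(xml_text):
    """Extract the default file system URI and the authentication mode."""
    props = read_hadoop_properties(xml_text)
    return {
        "default_fs": props.get("fs.defaultFS"),
        # Hadoop falls back to "simple" (non-Kerberos) when the property is absent.
        "authentication": props.get("hadoop.security.authentication", "simple"),
    }


# Illustrative content only; a real core-site.xml contains many more properties.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
"""

print(summarize_core_site(SAMPLE_CORE_SITE))
```

An `authentication` value of `kerberos` suggests that the Kerberos prerequisites apply, and `default_fs` gives a candidate for the NameNode hostname and port.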
watsonx.data requirements:
- A provisioned watsonx.data Presto engine
- Access to Infrastructure Manager with appropriate permissions
Procedure
- Create Hive tables directly in Cloudera using the Hue editor before querying them in watsonx.data.
  - Log in to the Cloudera Hue interface.
  - Navigate to the Hive editor from the left menu.
  - Create a new database or use an existing one.
    CREATE DATABASE IF NOT EXISTS <database_name>;
    USE <database_name>;
  - Create a Hive table with the desired schema.
    CREATE TABLE <database_name>.<table_name> (
      id INT,
      name STRING,
      department STRING,
      salary DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION '/user/hive/warehouse/<database_name>.db.<table_name>';
  - Insert data into the table.
    INSERT INTO <database_name>.<table_name> VALUES
      (1, 'John Doe', 'IT', 75000),
      (2, 'Jane Smith', 'HR', 65000),
      (3, 'Bob Johnson', 'Finance', 80000);
  - Query the table to verify the inserted data.
    SELECT * FROM <database_name>.<table_name>;
- Create an HDFS storage component in watsonx.data.
  - Log in to the watsonx.data console.
  - Navigate to Infrastructure Manager from the left sidebar.
  - Click Add Component in the top-right corner.
  - Select HDFS as the Component Type.
  - Provide a Display Name for the component.
  - Enter the HDFS URI in the format hdfs://<namenode-host>:<port>
    - Example: hdfs://namenode.example.com:8020
  - Enter the Hive Metastore URI in the format thrift://<metastore-host>:<port>
    - Example: thrift://metastore.example.com:9083
  - Select the Authentication Type:
    - For Non-Kerberos authentication: Select Non-Kerberos and provide the HDFS User (e.g., hdfs, hive).
    - For Kerberos authentication: Select Kerberos and provide:
      - Kerberos Principal in the format <principal>@<REALM> (Example: hive/hostname@EXAMPLE.COM)
      - Kerberos Realm (e.g., EXAMPLE.COM)
  - Upload configuration files:
    - For Non-Kerberos authentication:
      - Click Upload Configuration Files.
      - Upload core-site.xml.
      - Upload hdfs-site.xml.
      - Verify files are uploaded successfully.
    - For Kerberos authentication:
      - Upload the following files in order:
        - core-site.xml: HDFS core configuration
        - hdfs-site.xml: HDFS site configuration
        - krb5.conf: Kerberos configuration
        - Keytab File: Kerberos authentication credentials
      - Ensure all four files are uploaded successfully before proceeding.
    Ensure all required files are uploaded successfully before proceeding.
  - Click Test Connection to validate the connection.
  - Select Apache Hive as the Catalog Type.
  - Provide a Catalog Name (e.g., sparkcatalog for non-Kerberos or <<catalog>> for Kerberos).
    While creating the component, the only available option under Associated catalog is Apache Hive.
  - Click Create or Save.
  - Wait for component creation to complete (may take longer for Kerberos setup).
  - Verify that the component appears in the Infrastructure Manager list with Active status.
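The upload requirements in this step differ by authentication type, so a small pre-flight check can catch a missing file or a malformed principal before the component is created. This is a hedged sketch rather than a product feature: the function names are hypothetical, and the keytab file name is an assumption supplied by the caller because this guide does not specify one.

```python
from pathlib import Path

# Files named in the upload step, in the order the step lists them.
COMMON_FILES = ("core-site.xml", "hdfs-site.xml")
KERBEROS_ONLY_FILES = ("krb5.conf",)


def required_files(auth_type, keytab_name=None):
    """List the files to upload for "non-kerberos" or "kerberos" setups."""
    if auth_type == "non-kerberos":
        return list(COMMON_FILES)
    if auth_type == "kerberos":
        if not keytab_name:
            raise ValueError("a keytab file name is required for Kerberos")
        return [*COMMON_FILES, *KERBEROS_ONLY_FILES, keytab_name]
    raise ValueError(f"unknown authentication type: {auth_type!r}")


def missing_files(directory, auth_type, keytab_name=None):
    """Return the required files that are absent from the given directory."""
    base = Path(directory)
    return [
        name
        for name in required_files(auth_type, keytab_name)
        if not (base / name).is_file()
    ]


def split_principal(principal):
    """Split <principal>@<REALM> into (principal, realm)."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"expected <principal>@<REALM>, got {principal!r}")
    return name, realm
```

An empty list from `missing_files` means every file the step asks for is present in the directory. `split_principal` can be used to confirm that the realm part of the principal matches the value entered in the Kerberos Realm field.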
- Associate the catalog with the Presto engine.
  - In Infrastructure Manager, go to the Engines section.
  - Select your Presto engine.
  - Click Associate Catalogs or Manage Catalogs.
  - Find your catalog in the available catalogs list.
  - Click Add or toggle the switch to enable it.
  - Wait for the association to complete.
  - Verify that the catalog appears in the engine's catalog list.
    An engine restart may be required for the Kerberos configuration to take effect.
- Verify catalog activation.
  - Go to Query Workspace in watsonx.data.
  - Run the following query:
    SHOW CATALOGS;
    The expected output includes your catalog name along with system and other catalogs.
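The SHOW CATALOGS check can also be scripted. The sketch below is deliberately client-agnostic: it assumes the query result has already been fetched as a list of single-column rows (the shape a Python DB-API cursor typically returns), and `catalog_is_listed` is a hypothetical helper name. How the rows are fetched depends on the Presto client in use and is not shown; the example rows are made up.

```python
def catalog_is_listed(rows, catalog_name):
    """Return True if catalog_name appears in SHOW CATALOGS output rows.

    The comparison ignores case and surrounding whitespace as a convenience.
    """
    listed = {str(row[0]).strip().lower() for row in rows if row}
    return catalog_name.strip().lower() in listed


# Hypothetical rows, shaped like the output of SHOW CATALOGS.
example_rows = [("system",), ("sparkcatalog",), ("tpch",)]
print(catalog_is_listed(example_rows, "sparkcatalog"))
```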
Results
The setup enables direct, zero-copy querying of Hive tables from Cloudera HDFS in watsonx.data, eliminating the need for data replication. Both Kerberos and non-Kerberos authentication methods are supported.
After completing the setup, you can query your Cloudera tables. For information about querying operations, see Querying Cloudera tables using Presto engine.