Integrating Cloudera with watsonx.data

You can integrate Cloudera with IBM® watsonx.data to enable zero-copy querying of remote data. Cloudera provides an enterprise data platform that enables organizations to manage, process, and analyze data across hybrid and multi-cloud environments.

By integrating Cloudera with watsonx.data, you can query Hive tables stored in Cloudera HDFS without copying data, enabling seamless data federation across your data landscape.

How it works

  1. Create Hive tables in Cloudera using the Hue editor.
  2. Configure an HDFS storage component in watsonx.data.
  3. Associate the catalog with your Presto engine.
  4. Query the remote tables using the watsonx.data Presto engine without copying data.
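
As a rough illustration of step 4, the following Python sketch builds a read-only Presto query that addresses a Hive table by its fully qualified name (catalog, schema, table). The helper names `quote_identifier` and `build_select` and the names `cloudera_hive`, `sales`, and `orders` are hypothetical; substitute the catalog that you associated with your Presto engine and your own schema and table.

```python
def quote_identifier(name: str) -> str:
    """Quote a Presto identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a SELECT over catalog.schema.table with a row limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    qualified = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    return f"SELECT * FROM {qualified} LIMIT {limit}"


# Hypothetical catalog, schema, and table names, for illustration only.
query = build_select("cloudera_hive", "sales", "orders")
print(query)
```

The resulting statement can be submitted through whichever Presto client you already use; the sketch only assembles the SQL text and does not connect to watsonx.data.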

Architecture overview

The integration works through the following components:

  1. Cloudera HDFS - Distributed file system storing Hive table data
  2. Hive Metastore - Centralized metadata repository for table definitions
  3. watsonx.data Presto engine - Query engine that executes queries
  4. HDFS storage component - Bridge between watsonx.data and Cloudera

Supported table and storage formats

  • Table format - Hive tables
  • Storage formats - Parquet, ORC, Avro, and text files

Key features

  • Zero-copy data access
  • Support for both Kerberos and non-Kerberos authentication
  • Query federation through watsonx.data
  • Integration with watsonx.data Presto engine
  • Direct access to Hive Metastore

Important limitations

  • Only the Presto engine is supported for querying Cloudera tables
  • Tables are read-only from watsonx.data
  • UPDATE and DELETE operations are not supported when querying Hive tables through watsonx.data
  • Data modifications must be performed directly in Cloudera
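
Because the tables are read-only from watsonx.data, a client application might choose to reject write statements before submitting them. The following Python sketch shows one hypothetical way to do that. The function name `is_read_only` and the keyword list are assumptions for illustration and are not part of watsonx.data; a first-keyword check is also not a full SQL parser, so treat it as a convenience guard rather than an enforcement mechanism.

```python
# Statement keywords treated as read-only in this sketch
# (an assumption for illustration, not an official list).
READ_ONLY_KEYWORDS = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN"}


def is_read_only(statement: str) -> bool:
    """Return True if the statement's first keyword is in READ_ONLY_KEYWORDS."""
    words = statement.strip().split(None, 1)
    if not words:
        # An empty statement is not something to submit.
        return False
    return words[0].upper() in READ_ONLY_KEYWORDS


print(is_read_only("SELECT * FROM orders"))
print(is_read_only("DELETE FROM orders WHERE id = 1"))
```

Statements that fail the check, such as UPDATE or DELETE, would need to be run directly in Cloudera instead.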

Security considerations

Authentication:

  • Non-Kerberos: Suitable for development and testing environments with basic HDFS user authentication
  • Kerberos: Recommended for production environments with enterprise-grade security

Data access:

  • All queries execute with the permissions of the authenticated user or principal
  • HDFS enforces file-level security policies
  • Storage credentials must have appropriate read permissions on HDFS locations

Network security:

  • Ensure network connectivity between watsonx.data and the Cloudera cluster
  • Configure firewall rules to allow traffic on required ports (HDFS NameNode, Hive Metastore)
  • For Kerberos, ensure connectivity to the KDC (Key Distribution Center)
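
The connectivity requirements above can be spot-checked with a small script before you configure the storage component. The following Python sketch attempts a TCP connection to each endpoint. The host names are hypothetical, and the ports (8020 for the HDFS NameNode, 9083 for the Hive Metastore, 88 for the KDC) are common defaults that may differ in your cluster, so confirm them with your Cloudera administrator. A successful TCP connection shows only that the port is reachable from the machine running the script, not that authentication will succeed or that watsonx.data itself can reach the endpoint.

```python
import socket

# Hypothetical hosts; ports are common defaults and may differ in your cluster.
ENDPOINTS = {
    "HDFS NameNode": ("namenode.example.com", 8020),
    "Hive Metastore": ("metastore.example.com", 9083),
    "Kerberos KDC": ("kdc.example.com", 88),
}


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints: dict, timeout: float = 3.0) -> dict:
    """Map each endpoint label to whether it is reachable over TCP."""
    return {
        label: can_connect(host, port, timeout)
        for label, (host, port) in endpoints.items()
    }
```

Calling `check_endpoints(ENDPOINTS)` returns a dictionary that maps each label to `True` or `False`. For the result to be meaningful, run the script from a host whose network path matches that of your watsonx.data deployment.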
