Integrating Cloudera in watsonx.data
You can integrate Cloudera with IBM® watsonx.data to enable zero-copy querying of remote data. Cloudera provides an enterprise data platform that enables organizations to manage, process, and analyze data across hybrid and multi-cloud environments.
By integrating Cloudera with watsonx.data, you can query Hive tables stored in Cloudera HDFS without copying data, enabling seamless data federation across your data landscape.
How it works
- Create Hive tables in Cloudera using the Hue editor.
- Configure an HDFS storage component in watsonx.data.
- Associate the catalog with your Presto engine.
- Query the remote tables using the watsonx.data Presto engine without copying data.
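The final step can be sketched in Python. This is a minimal illustration, not the product's API: it only builds the SQL text that would be submitted to the Presto engine, using Presto's catalog.schema.table naming. The names cloudera_hive, sales, and orders are hypothetical placeholders; substitute the catalog you associated with the engine and your own Hive schema and table.

```python
import re

# Accept only simple identifiers so the generated SQL stays predictable.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Return a double-quoted catalog.schema.table name for Presto."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENT.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(f'"{part}"' for part in parts)


def build_select(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a read-only preview query against a remote Hive table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {qualified_table(catalog, schema, table)} LIMIT {int(limit)}"


if __name__ == "__main__":
    # Hypothetical catalog, schema, and table names.
    print(build_select("cloudera_hive", "sales", "orders"))
```

The resulting statement can be run from any client connected to your Presto engine; the data stays in Cloudera HDFS and is read in place.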
Architecture overview
The integration works through the following components:
- Cloudera HDFS - Distributed file system storing Hive table data
- Hive Metastore - Centralized metadata repository for table definitions
- watsonx.data Presto engine - Query engine that executes queries
- HDFS Storage Component - Bridge between watsonx.data and Cloudera
Supported table and storage formats
- Table format - Hive tables
- Storage formats - Parquet, ORC, Avro, and text files
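As an illustration of how the formats listed above map to Hive DDL, the following sketch generates a CREATE TABLE statement that could be pasted into the Hue editor. The STORED AS keywords are standard Hive syntax; the helper name create_table_ddl, the table name, and the column definitions are hypothetical and chosen only for this example.

```python
# Map the formats named in this topic to Hive STORED AS keywords.
STORED_AS = {
    "parquet": "PARQUET",
    "orc": "ORC",
    "avro": "AVRO",
    "text": "TEXTFILE",
}


def create_table_ddl(table: str, columns: dict, fmt: str) -> str:
    """Return a Hive CREATE TABLE statement for one of the listed formats."""
    key = fmt.strip().lower()
    if key not in STORED_AS:
        raise ValueError(f"unsupported storage format: {fmt!r}")
    if not columns:
        raise ValueError("at least one column is required")
    cols = ", ".join(f"{name} {col_type}" for name, col_type in columns.items())
    return f"CREATE TABLE {table} ({cols}) STORED AS {STORED_AS[key]}"


if __name__ == "__main__":
    # Hypothetical table and columns.
    print(create_table_ddl("sales.orders", {"id": "INT", "customer": "STRING"}, "parquet"))
```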
Key features
- Zero-copy data access
- Support for both Kerberos and non-Kerberos authentication
- Query federation through watsonx.data
- Integration with watsonx.data Presto engine
- Direct access to Hive Metastore
Important limitations
- Only the Presto engine is supported for querying Cloudera tables
- Tables are read-only from watsonx.data
- UPDATE and DELETE operations are not supported when querying Hive tables through watsonx.data
- Data modifications must be performed directly in Cloudera
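A client application might catch unsupported statements before they reach the engine. The sketch below is one possible guard, not part of watsonx.data: it rejects statements that begin with UPDATE or DELETE, the two operations named above. The function name ensure_supported is hypothetical, and because the tables are read-only, other write statements may also fail even though this simple check does not cover them.

```python
import re

# Matches statements whose first keyword is UPDATE or DELETE.
_UNSUPPORTED = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)


def ensure_supported(sql: str) -> str:
    """Return sql unchanged, or raise if it starts with UPDATE or DELETE."""
    match = _UNSUPPORTED.match(sql)
    if match:
        keyword = match.group(1).upper()
        raise ValueError(
            f"{keyword} is not supported on Cloudera Hive tables through "
            "watsonx.data; modify the data directly in Cloudera"
        )
    return sql


if __name__ == "__main__":
    print(ensure_supported("SELECT count(*) FROM orders"))
```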
Security considerations
Authentication:
- Non-Kerberos: Suitable for development and testing environments with basic HDFS user authentication
- Kerberos: Recommended for production environments with enterprise-grade security
Data access:
- All queries execute with the permissions of the authenticated user or principal
- HDFS enforces file-level security policies
- Storage credentials must have appropriate read permissions on HDFS locations
Network security:
- Ensure network connectivity between watsonx.data and the Cloudera cluster
- Configure firewall rules to allow traffic on the required ports (HDFS NameNode, Hive Metastore)
- For Kerberos, ensure connectivity to the KDC (Key Distribution Center)
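The connectivity checks above can be approximated with a short script run from a host on the same network path as watsonx.data. This is a rough sketch using only the Python standard library; it tests that a TCP connection can be opened and says nothing about authentication or authorization. The helper names are hypothetical, and the ports shown in the comment (8020 for the NameNode, 9083 for the Hive Metastore, 88 for the KDC) are common defaults assumed for illustration, so confirm the actual values for your cluster.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints: dict, timeout: float = 3.0) -> dict:
    """Map each endpoint name to its reachability as a boolean."""
    return {
        name: port_open(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


# Example input with hypothetical host names and assumed default ports:
# check_endpoints({
#     "HDFS NameNode": ("namenode.example.com", 8020),
#     "Hive Metastore": ("metastore.example.com", 9083),
#     "KDC": ("kdc.example.com", 88),
# })
```

Any endpoint reported as False points to a firewall rule, routing, or DNS issue to resolve before configuring the storage component.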