Remote lakehouse access
Remote lakehouse platforms enable continuous data ingestion and processing, allowing organizations to capture, store, and analyze data as it flows through their systems. These platforms act as a central nervous system for real-time data, connecting various data sources and making shared data immediately available for analytics.
IBM® watsonx.data integrates with leading third-party lakehouse platforms to provide seamless access to remote data without copying or moving it. These integrations enable you to query remote data using familiar SQL interfaces and powerful compute engines.
How remote lakehouse integrations work
Third-party lakehouse platforms automatically convert remote data into query-ready table formats such as Apache Iceberg. This zero-copy approach, often called "data federation" or "query federation," eliminates the need for traditional data pipelines by:
- Capturing shared data from various sources (applications, IoT devices, databases, etc.)
- Materializing data into open table formats in cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage)
- Maintaining tables automatically with schema evolution, compaction, and optimization
- Exposing data through REST catalog endpoints for external query engines
watsonx.data compute engines connect directly to these remote tables, enabling zero-copy analytics without data duplication or movement.
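As a rough illustration of the last step in the list above: platforms that follow the Apache Iceberg REST catalog protocol serve table metadata at predictable paths. The sketch below builds the load-table URL in the shape defined by the Iceberg REST catalog specification. The host name, namespace, and table name are hypothetical, and a given platform may require a prefix segment or differ in other details.

```python
from urllib.parse import quote


def load_table_url(base_uri: str, namespace: list[str], table: str, prefix: str = "") -> str:
    """Build the Iceberg REST catalog URL that returns a table's metadata."""
    # Multi-level namespaces are joined with the unit separator (0x1F),
    # which is then percent-encoded as %1F.
    encoded_namespace = quote("\x1f".join(namespace), safe="")
    parts = [base_uri.rstrip("/"), "v1"]
    if prefix:
        # Optional prefix segment, inserted as given (assumed already URL-safe).
        parts.append(prefix.strip("/"))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/".join(parts)


# Hypothetical endpoint and table names, for illustration only.
print(load_table_url("https://catalog.example.com/api", ["sales", "events"], "orders"))
```

An engine that speaks the same protocol issues a GET request to a URL of this shape, reads the table metadata that comes back, and then reads the data files directly from cloud storage.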
Key capabilities
When you integrate third-party lakehouse platforms with watsonx.data, you can:
- Query remote data in real time: Access the latest data as it arrives, with minimal latency
- Eliminate data copying and ETL complexity: Remove the need for custom data pipelines and transformation jobs
- Use familiar SQL interfaces: Query remote lakehouse data using standard SQL through Spark or Presto engines
- Leverage open table formats: Work with industry-standard formats like Apache Iceberg and Delta Lake
- Maintain data governance: Apply watsonx.data security and governance policies to remote data
- Scale independently: Separate storage and compute for flexible scaling and cost optimization
- Preserve data lineage: Track data from source to analytics with built-in metadata management
Integration architecture
Remote lakehouse platforms materialize remote data into open table formats in cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage). watsonx.data engines connect to these tables through REST catalog endpoints, enabling direct querying without data duplication.
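To make the connection path more concrete, the following minimal sketch assembles the catalog properties that open-source Apache Iceberg documents for attaching a Spark session to a REST catalog. It is an assumption that a watsonx.data Spark engine accepts these properties in this form; the product may configure the catalog through its own console or API instead. The catalog name, URI, warehouse, and token values are hypothetical.

```python
def iceberg_rest_catalog_conf(name: str, uri: str, warehouse: str, token: str = "") -> dict[str, str]:
    """Return Spark properties that register an Iceberg REST catalog under `name`."""
    key = f"spark.sql.catalog.{name}"
    conf = {
        # Enables Iceberg's SQL extensions in Spark.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog and selects the REST implementation.
        key: "org.apache.iceberg.spark.SparkCatalog",
        f"{key}.type": "rest",
        f"{key}.uri": uri,
        f"{key}.warehouse": warehouse,
    }
    if token:
        # Bearer token passed to the REST catalog; omit if the platform uses another auth scheme.
        conf[f"{key}.token"] = token
    return conf


# Hypothetical values, for illustration only.
conf = iceberg_rest_catalog_conf("remote_lake", "https://catalog.example.com/api", "analytics_wh")
for prop, value in conf.items():
    print(f"--conf {prop}={value}")
```

With a catalog registered this way, tables are addressed in SQL as catalog.namespace.table, for example remote_lake.sales.orders.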
Storage options
Third-party lakehouse platforms typically offer two storage models:
- Platform-managed storage: The lakehouse platform automatically provisions and manages cloud storage, simplifying setup and maintenance.
- Customer-managed storage: You provide your own cloud storage (AWS S3, Azure Blob Storage, or Google Cloud Storage), maintaining full control over data location, access policies, and lifecycle management.
watsonx.data supports both options, though capabilities vary by engine.
Choosing an engine
- Spark engine: Supports both platform-managed and customer-managed storage, with full Iceberg feature support.
- Presto engine: Supports customer-managed storage (AWS S3, Azure Blob Storage, Google Cloud Storage), with some limitations on platform-managed storage.
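The guidance above can be expressed as a small lookup, shown here as an illustrative sketch only. The support labels "supported" and "limited" are shorthand introduced for this example, not product terminology, and the specific Presto limitations are not enumerated here.

```python
# Support levels transcribed from the engine descriptions above.
SUPPORT = {
    ("spark", "platform-managed"): "supported",
    ("spark", "customer-managed"): "supported",
    ("presto", "platform-managed"): "limited",
    ("presto", "customer-managed"): "supported",
}


def engines_for(storage_model: str) -> list[str]:
    """Return the engines that have no noted limitations for a storage model."""
    return sorted(
        engine
        for (engine, model), level in SUPPORT.items()
        if model == storage_model and level == "supported"
    )


print(engines_for("platform-managed"))
print(engines_for("customer-managed"))
```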