Accessing data in external data platforms
External data platforms enable continuous data ingestion and processing, allowing organizations to capture, store, and analyze data as it flows through their systems. These platforms act as a central nervous system for real-time data, connecting various data sources and making their data immediately available for analytics.
IBM® watsonx.data integrates with leading external data platforms to provide seamless access to remote data without copying or moving it. These integrations enable you to query remote data using familiar SQL interfaces and powerful compute engines.
How external data platform integrations work
External data platforms automatically convert remote data into query-ready table formats such as Apache Iceberg. This zero-copy approach, often called "data federation" or "query federation," eliminates the need for traditional data pipelines by:
- Capturing data from various sources (applications, IoT devices, databases, etc.)
- Materializing data into open table formats in cloud storage (AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage)
- Maintaining tables automatically with schema evolution, compaction, and optimization
- Exposing data through REST catalog endpoints for external query engines
watsonx.data compute engines connect directly to these remote tables, enabling zero-copy analytics without data duplication or movement.
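As a rough illustration of the last step above, the sketch below builds the URLs a query engine would request from a REST catalog. It assumes the platform exposes an endpoint that follows the Apache Iceberg REST catalog specification; the base URL, namespace, table, and prefix values are hypothetical placeholders, and authentication is left out.

```python
from typing import List
from urllib.parse import quote

# Hypothetical endpoint; substitute the REST catalog URL your data platform publishes.
CATALOG_BASE_URL = "https://catalog.example.com/api"


def config_url(base_url: str) -> str:
    """URL of the catalog configuration route defined by the Iceberg REST spec."""
    return f"{base_url.rstrip('/')}/v1/config"


def load_table_url(base_url: str, namespace: List[str], table: str, prefix: str = "") -> str:
    """URL of the load-table route: /v1/{prefix}/namespaces/{namespace}/tables/{table}."""
    # Multi-level namespaces are joined with the unit separator (0x1F), then URL-encoded.
    encoded_namespace = quote("\x1f".join(namespace), safe="")
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        # Some catalogs return an optional prefix from the configuration route.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/".join(parts)


if __name__ == "__main__":
    print(config_url(CATALOG_BASE_URL))
    print(load_table_url(CATALOG_BASE_URL, ["sales"], "orders"))
```

In practice the engine issues these requests for you; the sketch only shows which catalog routes sit behind a table reference.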
Key capabilities
When you integrate external data platforms with watsonx.data, you can:
- Query remote data in real time: Access the latest data as it arrives, with minimal latency
- Eliminate data copying and ETL complexity: Remove the need for custom data pipelines and transformation jobs
- Use familiar SQL interfaces: Query external data platforms using standard SQL through Spark or Presto engines
- Leverage open table formats: Work with industry-standard formats like Apache Iceberg and Delta Lake
- Maintain data governance: Apply watsonx.data security and governance policies to remote data
- Scale independently: Separate storage and compute for flexible scaling and cost optimization
- Preserve data lineage: Track data from source to analytics with built-in metadata management
Integration architecture
External data platforms materialize remote data into table formats in cloud storage (AWS S3, Azure Blob, Google Cloud Storage). watsonx.data engines connect to these tables through REST catalog endpoints, enabling direct querying without data duplication.
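To make the connection path more concrete, here is a minimal sketch of the Spark properties that register an Iceberg REST catalog, using the property names documented by Apache Iceberg for Spark. The catalog name `remote`, the URI, and the warehouse value are hypothetical, credentials are omitted, and the exact settings a watsonx.data Spark engine expects may differ.

```python
from typing import Dict, Optional


def iceberg_rest_catalog_conf(
    catalog_name: str, rest_uri: str, warehouse: Optional[str] = None
) -> Dict[str, str]:
    """Build Spark properties that register an Iceberg catalog backed by a REST endpoint."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    conf = {
        # Enables Iceberg SQL extensions in the Spark session.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog name and points it at the REST catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
    }
    if warehouse is not None:
        # Optional warehouse identifier or location, if the catalog requires one.
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for key, value in iceberg_rest_catalog_conf("remote", "https://catalog.example.com/api").items():
        print(f"{key}={value}")
```

Each key and value pair could be passed to `SparkSession.builder.config(key, value)`, after which a table would be addressed as `remote.<namespace>.<table>` in SQL.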
Storage options
External data platforms typically offer two storage models:
- Platform-managed storage: The data platform automatically provisions and manages cloud storage, simplifying setup and maintenance.
- Customer-managed storage: You provide your own cloud storage (AWS S3, Azure Blob, or Google Cloud Storage), maintaining full control over data location, access policies, and lifecycle management.
watsonx.data supports both options, though specific engine capabilities may vary.
Choosing an engine
- Spark engine: Supports both platform-managed and customer-managed storage, with full Iceberg feature support.
- Presto engine: Supports customer-managed storage (AWS S3, Azure Blob, Google Cloud Storage); support for platform-managed storage has some limitations.
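The engine guidance above can be captured in a small lookup, shown here as a sketch. The support levels are a simplified reading of this section only; the labels `supported` and `limited` and the function name are illustrative, not part of any watsonx.data API.

```python
from typing import List, Tuple

# Simplified support matrix taken from the engine descriptions in this section.
SUPPORT = {
    "spark": {"platform-managed": "supported", "customer-managed": "supported"},
    "presto": {"platform-managed": "limited", "customer-managed": "supported"},
}


def engines_for(storage_model: str) -> List[Tuple[str, str]]:
    """Return (engine, support level) pairs for the given storage model."""
    if storage_model not in ("platform-managed", "customer-managed"):
        raise ValueError(f"unknown storage model: {storage_model!r}")
    return [(engine, levels[storage_model]) for engine, levels in SUPPORT.items()]


if __name__ == "__main__":
    for model in ("platform-managed", "customer-managed"):
        print(model, engines_for(model))
```

For platform-managed storage the lookup flags Presto as limited, which is the case where checking the engine-specific documentation matters most.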