Connecting and Querying across multiple data sources

This tutorial guides you through the process of using federated queries to analyze sales data for a fictional Great Retail Company, which is stored in multiple locations.

Introduction

Data federation overview

Data federation is a software process that enables several databases to work together. It allows you to mix data from multiple sources to get insights. It allows you to access all of your data across numerous dispersed databases with a single query.

Presto (Java) engine's data federation capability

The Presto (Java) engine's federated query functions allows organizations to effortlessly mix data from several sources, including current databases and new data in IBM® watsonx.data. By leveraging the capability of watsonx.data to integrate with Presto (Java), your business can now seamlessly combine and analyze data across various sources, gaining deeper insights. This streamlined approach not only enhances operational efficiency but also empowers decision-makers with timely and accurate data-driven insights.

Use Case Scenario

Analyzing purchasing methods across multiple data sources: The objective of this use case is to analyze the purchasing methods associated with the largest orders. The sales data is available in Db2. A portion of this data is moved to Iceberg tables within watsonx.data. The sales data is now in two places - Db2 and watsonx.data and you need to perform a Presto (Java) query from both Db2 and Iceberg to analyze the data, aiming to identify the purchasing method that is linked to the largest orders.

The following video provides a visual method to learn the concepts and tasks in this documentation.

Objective

Registering your Db2 data source with watsonx.data
Moving part of sales data from Db2 to watsonx.data
Running query to retrieve insights

Before you begin

This tutorial requires:

Subscription of watsonx.data on cloud.
Availability of Presto (Java) engine
Db2 database with GOSALESDW data
Credentials of Db2 database

Registering your Db2 data source

To register your Db2 data source, see IBM Db2. Use the following details when you register the Db2 data source.

Database name : Enter the database name as BLUDB.
Hostname : Enter the hostname as db2w-sucqakq.us-south.db2w.cloud.ibm.com
Username : db2inst
Password : Usertutorials1!
Port : 50001

Associating Db2 with Presto (Java) engine

After you register the Db2 database, you must associate the catalog with the Presto (Java) engine. For more information, see Associating a catalog with an engine.

Copying data from Db2 database to Iceberg

After you associate the catalog with the engine, copy data (a single table) from Db2 to Iceberg. To do that, complete the following steps:

From the navigation menu, select Data manager. Create a schema inside Iceberg_data catalog. For more information on how to create a schema, see Creating schema.
From the navigation menu, select Query Workspace.
Write a query to copy the data fromGOSALESDW table present in the Db2 database and create a new table (here SLS_SALES_FACT) inside Iceberg catalog.

Example query:
```
create table "iceberg_data"."wxgosalesdw"."sls_sales_fact" as select * from "db2catalog". "GOSALESDW"."SLS_SALES_FACT";
```
Click the Run on starter button to run the query.
Refresh Iceberg_data catalog to view the new table SLS_SALES_FACT.

Data federation

Now, the sales data is split between Db2 and Iceberg catalogs. You can run query from both Db2 and Iceberg to analyze the data and generate insights about the purchasing methods that are associated with the largest orders. To do that:

From the Query Workspace, run the following query to understand which purchasing method is associated with the largest orders.

You can use the following sample query to determine which purchasing method is associated with the largest orders. The query accesses five tables, one of which is in watsonx.data object storage (iceberg_data), and other four are in Db2 (db2catalog).

Example query:

select pll.product_line_en as product,

md.order_method_en as order_method,

sum(sf.quantity) as total

from

db2catalog.GOSALESDW.SLS_ORDER_METHOD_DIM as md,

db2catalog.GOSALESDW.SLS_PRODUCT_DIM as pd,

db2catalog.GOSALESDW.SLS_PRODUCT_LINE_LOOKUP as pll,

db2catalog.GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP as pbl,

iceberg_data.wxgosalesdw.sls_sales_fact as sf

where

pd.product_key = sf.product_key

and md.order_method_key = sf.order_method_key

and pll.product_line_code = pd.product_line_code

and pbl.product_brand_code = pd.product_brand_code

group by pll.product_line_en, md.order_method_en

order by product, order_method;
Click the Run on button to run the query.
Select the Result set or Details tab to view the result.
Click the Explain link. The Explain page opens, which displays the visual explain output for the query. You can scroll through the visual explain output to view the five ScanProject leaf nodes (here, five tables are used in the query) in the tree. These correspond to the five tables being read.

With these capabilities, your enterprise can drive smarter operations, optimize purchasing methods, and ultimately improve overall business performance.