IBM Cloud Docs
Merging and querying data

Merging and querying data

You might keep only the most recent data locally in a database and use data lakes as your long term storage. With NPSaaS, you can seamlessly query both local and remote data without having to load the remote data into a database first.

Before you begin

In the examples, the publicly available New York taxi trip record data for yellow taxis in January 2021 and 2022 is used. To follow this example, make sure that the data is in an accessible S3 bucket.

1. Create an external data source.

External datasources allow an administrator to grant access to S3 without providing the keys directly to a user.

a) Set ENABLE_EXTERNAL_DATASOURCE.

set ENABLE_EXTERNAL_DATASOURCE = 1;

b) Create an external data source.

create EXTERNAL DATASOURCE 'DATA SOURCE' on 'REMOTE SOURCE'
using (
 ACCESSKEYID 'ACCESS KEY ID' SECRETACCESSKEY 'SECRET ACCESS KEY' BUCKET 'BUCKET' REGION 'REGION'
);

Example:

create EXTERNAL DATASOURCE EXAMPLEDATALAKE on AWSS3 
using (
 ACCESSKEYID 'XXXX' SECRETACCESSKEY 'XXXX' BUCKET 'exampledatalakebucket' REGION 'US-EAST-1'
);

For more information, see CREATE EXTERNAL DATASOURCE command.

2. Identify the data from NPSaaS to merge and compare.

In this example, data that was loaded into NPSaaS (YELLOW_TAXI_JANUARY_2022) is compared with data from a data like (YELLOW_TAXI_JANUARY_2021).

To follow this example, ensure that YELLOW_TAXI_JANUARY_2022 is in your NPSaaS database.

a) Create an external table for the data that you want to load (YELLOW_TAXI_JANUARY_2022).

Ensure that you have the necessary privileges as described in Privileges for creating external tables.

create EXTERNAL table 'TABLE NAME' on 'DATA SOURCE'
using ( 
  DATAOBJECT ('DATA OBJECT') FORMAT 'PARQUET' 
);

Example:

create EXTERNAL table YELLOW_TAXI_JANUARY_2022 on EXAMPLEDATALAKE
using ( 
  DATAOBJECT ('/yellow_tripdata_2022-01.parquet') FORMAT 'PARQUET' 
);

b) Load the data (YELLOW_TAXI_JANUARY_2022) into NPSaaS.

create table 'TABLE NAME LOADED' as
   select
       * 
   from
       'TABLE';

Example:

create TABLE YELLOW_TAXI_JANUARY_2022_LOADED as
   select
       * 
   from
       YELLOW_TAXI_JANUARY_2022;

INSERT 0 2463931

3. Create an external table for the data from a data lake.

Create an external table that accesses the yellow taxi data from January 2021.

Ensure that you have the necessary privileges as described in Privileges for creating external tables.

create EXTERNAL table 'TABLE NAME' on 'DATA SOURCE'
using ( 
  DATAOBJECT ('DATA OBJECT') FORMAT 'PARQUET' 
);

The DATAOBJECT argument must reference a single file in the parquet format. If you want to query from multiple parquet files, you must create additional external tables.

Example:

create EXTERNAL table YELLOW_TAXI_JANUARY_2021 on EXAMPLEDATABASE 
using ( 
  DATAOBJECT ('/yellow_tripdata_2021-01.parquet') FORMAT 'PARQUET' 
);

4. Query the remote and loaded data.

Now, you can query both the local 2022 data that was loaded and the 2021 data from a data lake all in the same query.

The parquet column names are case sensitive. You must use double quotation marks ("") when you are querying specific columns.

  • To identify which year had the most passengers, run:

    select
     (
         select
             sum("passenger_count")
         from
             YELLOW_TAXI_JANUARY_2022_LOADED) as "passengers 2022",
     (
         select
             sum("passenger_count")
         from
             YELLOW_TAXI_JANUARY_2021) as "passengers 2021";
    

    Output:

    passengers 2022  | passengers 2021
    -----------------+-----------------
          3324167    |     1794615
    (1 row)
    
  • To compare how many passengers travelled between 1:00 AM and 6:00 PM in 2021 and 2022, run:

    select
     (
         select
             sum("passenger_count")
         from
             YELLOW_TAXI_JANUARY_2022_LOADED
         where
             "tpep_pickup_datetime"::time > '1:00am'
             and "tpep_pickup_datetime"::time < '6:00am') as "overnight passengers 2022",
     (
         select
             sum("passenger_count")
         from
             YELLOW_TAXI_JANUARY_2021
         where
             "tpep_pickup_datetime"::time > '1:00am'
             and "tpep_pickup_datetime"::time < '6:00am') as "overnight passengers 2021"; 
    

    Output:

    overnight passengers 2022  | overnight passengers 2021
    ---------------------------+---------------------------
    163058                     | 33469
    (1 row)