IBM Cloud Docs
Querying data from data lakes

Querying data from data lakes

Before you begin

In the examples, the publicly available New York taxi trip record data for yellow taxis in January 2022 is used. To follow this example, make sure that the data is in an accessible S3 bucket.

1. Create an external data source.

External datasources allow an administrator to grant access to S3 without providing the keys directly to a user.

a) Set ENABLE_EXTERNAL_DATASOURCE.

set ENABLE_EXTERNAL_DATASOURCE = 1;

b) Create an external data source.

create EXTERNAL DATASOURCE 'DATA SOURCE' on 'REMOTE SOURCE'
using (
    ACCESSKEYID 'ACCESS KEY ID' SECRETACCESSKEY 'SECRET ACCESS KEY' BUCKET 'BUCKET' REGION 'REGION'
);

Example:

create EXTERNAL DATASOURCE EXAMPLEDATALAKE on AWSS3 
using (
    ACCESSKEYID 'XXXX' SECRETACCESSKEY 'XXXX' BUCKET 'exampledatalakebucket' REGION 'US-EAST-1'
);

For more information, see CREATE EXTERNAL DATASOURCE command.

2. Create an external table.

After you created an external data source, you can create an external table that accesses the yellow taxi data from January 2022.

Ensure that you have the necessary privileges as described in Privileges for creating external tables.

create EXTERNAL table 'TABLE NAME' on 'DATA SOURCE'
using ( 
    DATAOBJECT ('DATA OBJECT') FORMAT 'PARQUET' 
);

The DATAOBJECT argument must reference a single file in the parquet format. If you want to query from multiple parquet files, you must create additional external tables.

Example:

create EXTERNAL table YELLOW_TAXI_JANUARY_2022 on EXAMPLEDATALAKE 
using ( 
    DATAOBJECT ('/yellow_tripdata_2022-01.parquet') FORMAT 'PARQUET' 
);

3. Query your data.

You can query external parquet format tables like you would any other NPSaaS table without having to load the data into the database.

The parquet column names are case sensitive. You must use double quotation marks ("") when you are querying specific columns.

  • To identify the total number of passengers that travelled by taxis in New York in January 2022, run:

    select
        sum("passenger_count") 
    from YELLOW_TAXI_JANUARY_2022;
    

    Output:

    SUM
    -----
    3324167
    (1 row)
    
  • To identify the vendor that had the most passengers between 1:00 AM and 6:00 AM, run:

    select
        "VendorID",
        sum("passenger_count") as "passengers"
    from
        YELLOW_TAXI_JANUARY_2022
    where
        "tpep_pickup_datetime"::time > '1:00am'
        and "tpep_pickup_datetime"::time < '6:00am'
    group by
        "VendorID"
    order by
        "passengers" desc;
    

    Output:

    VendorID| passengers
    --------|----------
    2       | 122251
    1       | 40807
    6       |
    5       |
    (4 rows)
    

    You do not have to load whole tables into NPSaaS. parquet is a columnar format so the NPSaaS engine can query a subset of columns without having to transfer the entire table over the internet. This way, if you work with large tables, you can significantly reduce ingress traffic and achieve faster load times. The query engine always uses only the columns from a parquet table that are needed.