IBM Cloud Docs
Ingesting data from data lakes

Ingesting data from data lakes

If you plan on regularly querying your data, load it into NPSaaS first to get the best performance benefits.

Before you begin

In the examples, the publicly available New York taxi trip record data for yellow taxis in January 2022 is used. To follow this example, make sure that the data is in an accessible S3 bucket.

1. Create an external data source.

External datasources allow an administrator to grant access to S3 without providing the keys directly to a user.

a) Set ENABLE_EXTERNAL_DATASOURCE.

set ENABLE_EXTERNAL_DATASOURCE = 1;

b) Create an external data source.

create EXTERNAL DATASOURCE 'DATA SOURCE' on 'REMOTE SOURCE'
using (
 ACCESSKEYID 'ACCESS KEY ID' SECRETACCESSKEY 'SECRET ACCESS KEY' BUCKET 'BUCKET' REGION 'REGION'
);

Example:

create EXTERNAL DATASOURCE EXAMPLEDATALAKE on AWSS3 
using (
 ACCESSKEYID 'XXXX' SECRETACCESSKEY 'XXXX' BUCKET 'exampledatalakebucket' REGION 'US-EAST-1'
);

For more information, see CREATE EXTERNAL DATASOURCE command.

2. Create an external table for the data from a data lake.

After you created an external data source, you can create an external table that accesses the yellow taxi data from January 2022.

Ensure that you have the necessary privileges as described in Privileges for creating external tables.

create EXTERNAL table 'TABLE NAME' on 'DATA SOURCE'
using ( 
  DATAOBJECT ('DATA OBJECT') FORMAT 'PARQUET' 
);

The DATAOBJECT argument must reference a single file in the parquet format. If you want to query from multiple parquet files, you must create additional external tables.

Example:

create EXTERNAL table YELLOW_TAXI_JANUARY_2022 on EXAMPLEDATALAKE
using ( 
  DATAOBJECT ('/yellow_tripdata_2022-01.parquet') FORMAT 'PARQUET' 
);

3. Load the data into NPSaaS.

To load data from the data lake into a NPSaaS table, run CREATE TABLE AS SELECT from the external table that you want to load.

create table 'TABLE NAME LOADED' as
select
    * 
from
     'TABLE';

Example:

create table YELLOW_TAXI_JANUARY_2022_LOADED as
select
    * 
from
        YELLOW_TAXI_JANUARY_2022;

INSERT 0 2463931

4. Query the loaded data.

Now, you can query the loaded data by using the improved read/write performance, zonemaps, etc.

The parquet column names are case sensitive. You must use double quotation marks ("") when you are querying specific columns.

  • To identify the passenger count, run:

    select
       sum("passenger_count")
    from YELLOW_TAXI_JANUARY_2022_LOADED;
    

    Output:

    SUM
    -------
    3324167
    (1 row)
    

    You do not have to load whole tables into NPSaaS. parquet is a columnar format so the NPSaaS engine can load a subset of columns without having to transfer the entire table over the internet. This way, if you work with large tables, you can significantly reduce ingress traffic and achieve faster load times. The query engine always uses only the columns from a parquet table that are needed.