Querying data from data lakes
Before you begin
In the examples, the publicly available New York taxi trip record data {: external} for yellow taxis in January 2022 is used. To follow this example, make sure that the data is in an accessible S3 bucket.
To access S3 files, you need to have an AWS account with proper permissions to provide your access key ID and secret access key.
AWS S3 example
1. Create an external data source.
External datasources allow an administrator to grant access to S3 without providing the keys directly to a user.
Data source creation:
a) Set ENABLE_EXTERNAL_DATASOURCE.
set ENABLE_EXTERNAL_DATASOURCE = 1;
b) Create an external data source.
create EXTERNAL DATASOURCE 'DATA SOURCE' on 'REMOTE SOURCE'
using (
ACCESSKEYID 'ACCESS KEY ID' SECRETACCESSKEY 'SECRET ACCESS KEY' BUCKET 'BUCKET' REGION 'REGION'
);
Example:
create EXTERNAL DATASOURCE AWS_TAXI_DATASET on AWSS3
using (
ACCESSKEYID '...' SECRETACCESSKEY '...' BUCKET 'nyc-tlc' REGION 'us-east-1'
);
For more information, see CREATE EXTERNAL DATASOURCE command.
2. Create an external table.
After you created an external data source, you can create an external table that accesses the yellow taxi data from January 2022.
Ensure that you have the necessary privileges as described in Privileges for creating external tables.
create EXTERNAL TABLE 'TABLE NAME' on 'DATA SOURCE'
using (
DATAOBJECT ('DATA OBJECT') FORMAT 'PARQUET'
);
The DATAOBJECT
argument must reference a single file in the parquet
format. If you want to query from multiple parquet
files, you must create more external tables.
Example:
create EXTERNAL TABLE YELLOW_TAXI_JANUARY_2022 on AWS_TAXI_DATASET
using (
DATAOBJECT ('/trip data/yellow_tripdata_2022-01.parquet') FORMAT 'PARQUET'
);
3. Query your data.
You can query external parquet format tables like any other NPSaaS table without having to load the data into the database.
The parquet column names are case-insensitive unless it introduces a collision in column names.
-
To identify the total number of passengers that traveled by taxis in New York in January 2022, run:
select sum(passenger_count) from YELLOW_TAXI_JANUARY_2022;
Output:
SUM ----- 3324167 (1 row)
-
To identify the vendor that had the most passengers between 1:00 AM and 6:00 AM, run:
select VendorID, sum("passenger_count") as passengers from YELLOW_TAXI_JANUARY_2022 where tpep_pickup_datetime::time > '1:00am' and tpep_pickup_datetime::time < '6:00am' group by VendorID order by passengers desc;
Output:
VendorID| passengers --------|---------- 2 | 122251 1 | 40807 6 | 5 | (4 rows)
You do not have to load whole tables into NPSaaS. parquet is a columnar format so the NPSaaS engine can query a subset of columns without having to transfer the entire table over the internet. This way, if you work with large tables, you can significantly reduce ingress traffic and achieve faster load times. The query engine always uses only the columns from a parquet table that are needed.
Azure BLOB example
Set ENABLE_AZURE_DATALAKE_SUPPORT.
set ENABLE_AZURE_DATALAKE_SUPPORT = true;
1. Create an external data source.
create EXTERNAL DATASOURCE 'DATA SOURCE' on 'REMOTE SOURCE'
using (
ACCOUNT 'ACCOUNT NAME' ACCOUNTKEY 'SECRET ACCESS KEY' CONTAINER 'CONTAINER NAME'
);
ACCOUNTKEY might be omitted for anonymous access.
create EXTERNAL DATASOURCE AZURE_TAXI_DATASET on AZUREBLOB
using (
ACCOUNT 'azureopendatastorage' CONTAINER 'nyctlc'
);
For more information, see CREATE EXTERNAL DATASOURCE command.
2. Create an external table.
After you created an external data source, you can create an external table that accesses the green taxi data from January 2018.
Ensure that you have the necessary privileges as described in Privileges for creating external tables.
Example:
create EXTERNAL table GREEN_TAXI_JANUARY_2018 on AZURE_TAXI_DATASET
using (
DATAOBJECT ('/green/puYear=2018/puMonth=1/part-00036-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2606-1.c000.snappy.parquet') FORMAT 'PARQUET'
);
3. Query your data.
You can query external parquet format tables like any other NPSaaS table without having to load the data into the database.
The parquet column names are case-insensitive unless it introduces a collision in column names.
-
To identify the total number of passengers that traveled by taxis in New York in January 2018, run:
select sum(passengercount) from GREEN_TAXI_JANUARY_2018;
Output:
SUM ----- 1081283 (1 row)
-
To identify the vendor that had the most passengers between 1:00 AM and 6:00 AM, run:
select VendorID, sum(passengercount) as passengers from GREEN_TAXI_JANUARY_2018 where lpeppickupdatetime::time > '1:00am' and lpeppickupdatetime::time < '6:00am' group by VendorID order by passengers desc;
Output:
VendorID| passengers --------|---------- 2 | 122251 1 | 40807 6 | 5 | (4 rows)