Querying data from watsonx.data
Before you begin
In the examples, the publicly available New York taxi trip record data {: external} for yellow taxis in January 2021 and 2022 is used. To follow this example, make sure that the data is in an accessible S3 bucket and the table was loaded into watsonx.data into an Apache Iceberg table in the Hive Metastore server (HMS).
1. Create a database by using the required metastoreuri.
          External datasources allow an administrator to grant access to S3/Azure without providing the keys directly to a user.
Example (S3):
LOCALDB.ADMIN(ADMIN)=> create database mylake with metastoreuri 'thrift://mymetastoreserverhostname:9083' catalogtype 'hive' on awss3 using ( ACCESSKEYID 'xxxx' SECRETACCESSKEY 'xxxx' BUCKET 'example-bucket' REGION 'us-east-1');
NOTICE:  589 tables from the datalake are available in MYLAKE
Example (Azure):
create database mylake with metastoreuri 'thrift://mymetastoreserverhostname:9083' catalogtype 'hive' on azureblob using (ACCOUNT 'xxxx' KEY 'xxxx' CONTAINER 'example_container');
NOTICE:  589 tables from the datalake are available in MYLAKE
See CREATE EXTERNAL DATASOURCE for more details on datasource options and authentication options for Azure.
Only a single datasource can be assigned to a database, if you require another datasource, you must create another datalake database.
2. Connect to the database.
LOCALDB.ADMIN(ADMIN)=> \c mylake
You are now connected to mylake database.
3. List the available schemas.
MYLAKE.NETEZZA_SCHEMA(ADMIN)=> show schema;
By default, you get connected to a reserved schema called NETEZZA_SCHEMA.
Output:
DATABASE |                   SCHEMA                   | OWNER
----------+--------------------------------------------+-------
MYLAKE   | DEFAULT                                    | ADMIN
MYLAKE   | DEFINITION_SCHEMA                          | ADMIN
MYLAKE   | DEMO                                       | ADMIN
MYLAKE   | INFORMATION_SCHEMA                         | ADMIN
MYLAKE   | NETEZZA_SCHEMA                             | ADMIN
MYLAKE   | TAXIDATA                                   | ADMIN
MYLAKE   | TEST                                       | ADMIN
(7 rows)
4. From your schemas list, set the schema you want to connect to.
MYLAKE.NETEZZA_SCHEMA(ADMIN)=> set schema taxidata;
SET SCHEMA
You can also query your data by using a full path SELECT * from mydb.myschema.mytable.
5. List the available tables.
MYLAKE.TAXIDATA(ADMIN)=> show table;
Output:
DATABASE |  SCHEMA  |          TABLE           |      TYPE      | OWNER
----------+----------+--------------------------+----------------+-------
MYLAKE   | TAXIDATA | YELLOW_TAXI_JANUARY_2022 | DATALAKE TABLE | ADMIN
MYLAKE   | TAXIDATA | YELLOW_TAXI_JANUARY_2021 | DATALAKE TABLE | ADMIN
(2 rows)
6. Select * from the required table.
MYLAKE.TAXIDATA(ADMIN)=> select * from YELLOW_TAXI_JANUARY_2021 limit 1;
Output:
VENDORID | TPEP_PICKUP_DATETIME | TPEP_DROPOFF_DATETIME | PASSENGER_COUNT | TRIP_DISTANCE | RATECODEID | STORE_AND_FWD_FLAG | PULOCATIONID | DO
LOCATIONID | PAYMENT_TYPE | FARE_AMOUNT | EXTRA | MTA_TAX | TIP_AMOUNT | TOLLS_AMOUNT | IMPROVEMENT_SURCHARGE | TOTAL_AMOUNT | CONGESTION_SURCHARGE |AIRPORT_FEE
----------+----------------------+-----------------------+-----------------+---------------+------------+--------------------+--------------+---
-----------+--------------+-------------+-------+---------+------------+--------------+-----------------------+--------------+------------------
----+-------------
        1 | 2021-01-01 00:30:10  | 2021-01-01 00:36:12   |               1 |           2.1 |          1 | N                  |          142 |
        43 |            2 |           8 |     3 |     0.5 |          0 |            0 |                   0.3 |         11.8 |
2.5 |
(1 row)
- 
              To identify the total number of passengers that travelled by taxis in New York in January 2022, run: MYLAKE.TAXIDATA(ADMIN)=> select sum(PASSENGER_COUNT) FROM YELLOW_TAXI_JANUARY_2022;Output: SUM --------- 3324167 (1 row)
- 
              To identify the vendor that had the most passengers between 1:00 AM and 6:00 AM, run: MYLAKE.TAXIDATA(ADMIN)=> SELECT VENDORID, SUM(PASSENGER_COUNT) as "passengers" FROM YELLOW_TAXI_JANUARY_2022 WHERE TPEP_PICKUP_DATETIME::time > '1:00am'AND "TPEP_PICKUP_DATETIME"::time < '6:00am' GROUP BY VENDORID;Output: VENDORID | passengers ----------+------------ 2 | 122251 1 | 40807 6 | 5 | (4 rows)