Data ingestion, Time travel, and Table rollback

This tutorial guides you through the process of ingesting data in watsonx.data, performing time travel, and table roll back operations.

Data ingestion

Data ingestion is the process of importing and loading data into watsonx.data. You can perform data ingestion in watsonx.data in the following methods:

Ingesting data by using Create table from file option in the Data manager page.
Ingesting data by using SQL queries.
Ingesting data by using IBM-lh command-line tool.
Ingesting data by using Spark engine.

For more information, see Data ingestion.

Time travel

The Time travel capability in Apache Iceberg tables allows you to fetch and inspect the table history. It uses table snapshots to support the time travel functions. A table snapshot represents the state of a table at a particular point in time. Iceberg generates the table snapshot record when the table is created or modified (insert, update, delete). You can use this snapshot record to query to see the table details then.

Hive tables do not support the time travel feature.

Rollback: The table rollback feature allows you to roll back a table to a previous point in time by using table snapshots.

Use Case Scenario

Consider you are a Data Engineer for your company and you are tasked with managing Car records in the company database. Upload the existing data records in the storage. Add a new set of data to the existing table. After review, you see that the new set of data added is not appropriate and decide to rollback the table to a previous state.

The following video provides a visual method to learn the concepts and tasks in this documentation.

Objective

Create a table and load the sample data by using Spark engine to a table.
Load more records into the cars table.
Retrieve table records by using table snapshots.
Perform table rollback operations.

Before you begin

Download the cars.csv file from (https://ibm.box.com/v/data-cars-csv).
You must have a Lite plan instance provisioned.
The Lite plan instance must include Spark and Presto engines in running state.
Associate a catalog to both Presto and Spark engines. Here, Apache Iceberg catalog.

Procedure

Creating table and loading data

The section guides you through the procedure to create a table that is named cars and ingest data present cars.csv file by using Spark engine.

Log in to watsonx.data console.
From the navigation menu, select Data manager.
Click Ingest Data and select Local System. For more information on how to ingest data, see Ingesting data from local system.
Click Browse data tab. Refresh the Iceberg catalog to view the new schema and table with data. You can view the Table columns, Time travel, Data sample, and DDL tabs.
Click Time Travel. You can view that the table has 406 records.

Loading more records to the table

Perform multiple transactions (two transactions) to update the cars table. In the first transaction, ingest two other car records for the model TESLA and in the second transaction, ingest one record for the modelKia Optima.

To do that:

Go to Query workspace.
Select Presto engine.

Use the following SQL query to insert the first set of data into the cars table.

INSERT INTO iceberg_data.automobiles.cars VALUES
('Tesla Model S', 102, 0, 0, 670, 4766, 3.1, 2023, 'USA'),
('Tesla Model 3', 134, 0, 0, 283, 3582, 5.8, 2023, 'USA');

Run the query. The data is added and the total record is 408.

Use the following SQL query to insert the second set of data into the cars table.

INSERT INTO iceberg_data.automobiles.cars VALUES
('Kia Optima', 27, 4, 2457, 185, 3200, 8.5, 2023, 'South Korea');

Run the query. The data is added and the total record becomes 409.
Go to Data manager.
Select the table cars. From the cars section, select the Time travel tab. You can view the snapshot records (with SnapshotID) corresponding to the data transactions.

Retrieving table records by using table snapshots.

This section of the tutorial guides you through the procedure to retrieve table records by using table snapshots. You run queries to retrieve data for auditing and regulatory purposes.

Go to Query workspace.
Run the following query to retrieve the snapshot records corresponding to the three insert transactions performed.
```
SELECT * FROM iceberg_data.automobiles."cars$snapshots" ORDER BY committed_at;
```
The result set include three snapshot records with snapshot_id, date, time, and time zone. In this tutorial, the SnapshotID, date, time for each transaction are referred as: <Tran1Time>, <Tran1SnapshotID>, <Tran2Time>, <Tran2SnapshotID>, <Tran3Time>, and <Tran3SnapshotID>.

You can also view the snapshot records from Data manager > cars table > Time travel tab.
Provide the snapshot ID or time in the following queries and run the queries to retrieve the table information at that point.

Sample scenarios
Scenario	Query	Result
Retrieve data from the initial state by using snapshot ID.	`SELECT * FROM iceberg_data.automobiles.cars FOR VERSION AS OF <Tran1SnapshotID> ORDER BY Snapshot ID;`	Result section displays 406 records.
Retrieve data from the initial state by using timestamp.	`SELECT * FROM iceberg_data.automobiles.cars FOR TIMESTAMP AS OF CAST('<Tran1Time>' AS TIMESTAMP WITH TIME ZONE) ORDER BY Snapshot ID;`	The result section displays 406 records.
Retrieve data post-second transaction by using snapshot ID.	`SELECT * FROM iceberg_data.automobiles.cars FOR VERSION AS OF <Tran2SnapshotID> ORDER BY Snapshot ID;`	The result section displays 408 records.
Retrieve data post-third transaction by using timestamp.	`SELECT * FROM iceberg_data.automobiles.cars FOR TIMESTAMP AS OF CAST('<Tran3Time>' AS TIMESTAMP WITH TIME ZONE) ORDER BY Snapshot ID;`	The result section displays 409 records.
Retrieve data from a point in time between the second and third transactions.	`SELECT * FROM iceberg_data.automobiles.cars FOR TIMESTAMP AS OF CAST('Choose-a-time-between-<Tran2Time>-and-<Tran3Time>' AS TIMESTAMP WITH TIME ZONE) ORDER BY Snapshot ID;`	The result section displays 408 records.
Retrieve data from a time before the table was created (expected to fail).	`SELECT * FROM iceberg_data.automobiles.cars FOR TIMESTAMP AS OF CAST('2024-01-01 00:00:00.000 UTC' AS TIMESTAMP WITH TIME ZONE) ORDER BY Snapshot ID;`	No records were found.

Performing table rollback

watsonx.data allows you to rollback a table to an earlier point in time by using snapshots.

To rollback the table data to an earlier snapshot, do the following:

Go to Data manager.
Select the table cars. From the cars section, select the Time travel tab. You can view the snapshot records (with SnapshotID).
Click the overflow menu at the end of the row for the second snapshot (the second snapshot includes 408 records) and click Rollback. A Confirm rollback window opens. Click Rollback. The data rollback is successful. Now, the table must have 408 records.
To verify the rollback, run the following query to check whether the rows corresponding to the third transaction is removed.
- Sample query to verify that the table does not include the record 'Kia Optima'.
```
SELECT * FROM iceberg_data.automobiles.cars WHERE car IN ('Kia Optima', 'Tesla Model S', 'Tesla Model 3');
```
The result lists the data records corresponding to 'Tesla Model S' and 'Tesla Model 3' only. The table does not have the record corresponding to 'Kia Optima' (third transaction) as the table is now rollback to the second snapshot.
- Sample query to count the total number of rows in the table.
```
SELECT COUNT(*) FROM iceberg_data.automobiles.cars;
```
You must see a count of 408 rows, which matches the count when you performed the second transaction in the table. The table is back to its previous (second transaction) state.