Preparing to ingest data
This topic guides you through manually and efficiently ingesting data from external object storage into IBM® watsonx.data for querying. IBM Storage Ceph, IBM Cloud Object Storage (COS), AWS S3, and MinIO are supported as object storage buckets.
Parquet and CSV are the supported file types.
Parquet files can be ingested directly for optimal performance; CSV files require a staging directory, where they are converted to Parquet format.
Before you begin
This tutorial requires:
- An S3 folder must be created with the data files to ingest in it. The source folder must contain either all Parquet files or all CSV files. The best way to create an S3 folder is by using the AWS CLI, which avoids the hidden 0-byte files that can cause ingestion issues (see the sketch after this list). For detailed information on S3 folder creation, refer to Organizing objects in the Amazon S3 console by using folders.
- A staging folder must be specified for CSV files, for individual file ingestion (Parquet or CSV), and for local Parquet folders. A staging folder is not required when you ingest all files in an S3 folder (source folder ingestion), unless the Parquet files in the folder have type differences between them or the TIME data type is involved.
- For an ingestion job through the CLI, the staging bucket must be the same bucket that is associated with the Hive catalog. Staging is possible only in the Hive catalog.
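The following is a minimal sketch of how you might verify a source folder with the AWS CLI before ingestion. The bucket and prefix names are placeholders; it assumes the AWS CLI is installed and configured with credentials for the source bucket.

```
# List the source prefix recursively and check the size column:
# a 0-byte object named after the folder itself (for example, "sales_data/")
# is a console-created placeholder that can cause ingestion issues.
aws s3 ls s3://my-bucket/sales_data/ --recursive
```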
About this task
Scenario: You have a collection of data files in an S3 folder that you need to ingest into watsonx.data so that you can run SQL queries on the data in your object storage bucket.
The objectives of this tutorial are as follows:
- Creating infrastructure within the watsonx.data service.
- Establishing a connection with the customer data storage.
- Querying data from the storage.
You can use Spark ingestion to ingest data.
For detailed information on the usage of different parameters, see Options and parameters supported in ibm-lh tool. For ingesting data files into watsonx.data by using the Spark CLI, commands, and the configuration file, see Spark ingestion through ibm-lh tool command line, Creating an ingestion job by using commands, and Creating an ingestion job by using the configuration file.
Procedure
Ingesting Parquet or CSV files from an S3 folder
In this scenario, you have a collection of Parquet or CSV files in an S3 folder that you need to ingest into watsonx.data.
1. Prepare the source S3 folder:
   - Use the AWS CLI to copy the Parquet or CSV files into a common S3 folder, as shown in the following sketch. Avoid creating empty folders through the console to prevent hidden 0-byte files.
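For example, the following AWS CLI command copies a local set of Parquet files into a common S3 prefix. The local path, bucket, and prefix names are hypothetical. Because cp --recursive creates the prefix implicitly, no empty placeholder object is written.

```
# Copy only the Parquet files from a local directory into one S3 prefix.
aws s3 cp ./sales_data/ s3://my-bucket/sales_data/ --recursive \
    --exclude "*" --include "*.parquet"
```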
2. Specify the staging directory (for CLI ingestion):
   - Provide the staging-location parameter to designate a staging directory that is used to convert CSV files, and Parquet files in specific cases, to Parquet format. The ingest tool creates the directory if it does not exist. The command sketch in step 4 shows this parameter in context.
   See Staging location for more details.
3. Create a schema file to specify CSV file properties:
   - Provide the schema parameter to specify CSV file properties such as the field delimiter, line delimiter, escape character, encoding, and whether a header exists in the CSV file.
   See Schema file specifications for more details.
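The schema file is a small configuration file. The following is an illustrative sketch only; the key names and values shown are assumptions, so verify the exact syntax in Schema file specifications before use.

```
{
    "(csv_only) field_delimiter": ",",
    "(csv_only) line_delimiter": "\n",
    "(csv_only) escape_character": "\\",
    "(csv_only) encoding": "utf-8",
    "(csv_only) header": "true"
}
```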
4. Initiate server-mode ingestion:
   - Use the CLI in server mode to start the ingestion process.
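A server-mode ingestion command might look like the following sketch. The endpoint, bucket, table, and file names are placeholders, and the flag set shown is an assumption; verify the exact options in Options and parameters supported in ibm-lh tool.

```
# Ingest all CSV files from a source S3 folder into a target table,
# staging the converted Parquet files in the bucket that is associated
# with the Hive catalog.
ibm-lh data-copy \
    --source-data-files s3://my-bucket/sales_data/ \
    --target-tables iceberg_data.sales_schema.sales \
    --ingestion-engine-endpoint "hostname=<host>,port=<port>" \
    --staging-location s3://hive-bucket/staging/ \
    --schema /tmp/schema.cfg \
    --create-if-not-exist
```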
5. CSV or Parquet to Parquet conversion:
   - The ingest tool converts the CSV files, or the specific Parquet files, into Parquet format and stores them in the staging directory.
Results
This approach:
- Optimizes data transfer performance.
- Simplifies the ingestion process.
- Provides clear troubleshooting information if errors occur.