About data ingestion
Data ingestion is the process of importing and loading data into IBM® watsonx.data. From the watsonx.data user interface (UI), you can use the Ingest data module on the Data manager page to load data securely and easily. Alternatively, you can ingest local or remote data files to create tables by using the Create table from file option.
When you ingest a data file into watsonx.data, the table schema is auto-discovered from the source file and inferred when a query is run. All files in a single ingestion job must have the same format type and the same schema.
Data ingestion has the following requirements and behaviors:
- Schema evolution is not supported.
- The target table must be an Iceberg format table.
- IBM Storage Ceph, IBM Cloud Object Storage (COS), AWS S3, and MinIO object storage are supported.
- The pathStyleAccess property for object storage is not supported.
- Parquet, CSV, JSON, ORC, and AVRO file formats are supported as source data files.
- For local ingestion, the cumulative size of the files must not exceed 500 MB.
- Parquet, JSON, AVRO, and ORC files that are larger than 2 MB cannot be previewed, but they are still ingested successfully.
- JSON files with complex nested objects and arrays cannot be previewed in the UI.
- Complex JSON files are ingested as-is, with arrays stored as table entries, which is not ideal for data visualization and analysis.
- Keys within JSON files must be enclosed in quotation marks so that they can be parsed and interpreted correctly, as shown in the example after this list.
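For illustration, a minimal JSON record that satisfies the quoted-key requirement might look like the following; the field names are hypothetical. Note that the nested orders array would be ingested as-is and stored as a table entry, and a file of such records cannot be previewed in the UI.

```json
{
  "customer_id": 1001,
  "name": "Jane Doe",
  "orders": [
    { "order_id": "A-1", "amount": 25.5 }
  ]
}
```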
Loading or ingesting data through the CLI
You can run an ingestion job in watsonx.data with the ibm-lh tool. To run ingestion jobs through the CLI, pull the tool from the ibm-lh-client package and install it on your local system. For details and instructions on installing the ibm-lh-client package and using the ibm-lh tool for ingestion, see Installing ibm-lh-client and Setting up the ibm-lh command-line utility.
The ibm-lh tool supports the following features:
- Auto-discovery of the schema based on the source file or target table.
- Advanced table configuration options for CSV files:
- Delimiter
- Header
- File encoding
- Line delimiter
- Escape characters
- Ingestion of a single file, multiple files, or a single folder (without subfolders) of S3 or local Parquet files.
- Ingestion of a single file, multiple files, or a single folder (without subfolders) of S3 or local CSV files.
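As a sketch of what a CLI ingestion job can look like, the following command ingests a folder of Parquet files from S3 into an Iceberg table. The bucket, table name, and endpoint values are placeholders, and flag names can vary between releases, so verify them against the ibm-lh documentation for your version:

```sh
# Placeholder bucket, table name, and engine endpoint; confirm the
# exact flag names against your installed ibm-lh-client version.
ibm-lh data-copy \
  --source-data-files "s3://my-bucket/sales/" \
  --target-tables "my_iceberg_catalog.sales_schema.orders" \
  --ingestion-engine-endpoint "hostname=localhost,port=8080" \
  --create-if-not-exist
```

Because the schema is auto-discovered from the source files, a target table that does not already exist is created with the inferred schema.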