About data ingestion
Data ingestion is the process of importing and loading data into IBM® watsonx.data. You can use the Create table option on the Data manager page to create tables from local or external data files.
When you ingest a data file into watsonx.data, the table schema is inferred and generated when a query is run. Data ingestion in watsonx.data supports the CSV and Parquet formats. All files in an ingestion must have the same format and the same schema. watsonx.data auto-discovers the schema from the source file being ingested.
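The same-format, same-schema requirement can be checked locally before an ingestion job is started. The following is a minimal sketch and not part of watsonx.data: it assumes CSV sources with a header row, and the helper names `read_header` and `csv_headers_match` are hypothetical. It compares only column names and their order, not data types.

```python
import csv
from pathlib import Path


def read_header(path, delimiter=",", encoding="utf-8"):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding=encoding) as handle:
        return next(csv.reader(handle, delimiter=delimiter), [])


def csv_headers_match(paths, delimiter=",", encoding="utf-8"):
    """Return True when all files share one extension and one header row.

    Hypothetical pre-flight helper; watsonx.data performs its own
    schema discovery during ingestion.
    """
    paths = [Path(p) for p in paths]
    if not paths:
        return True
    # Approximate "same format" by requiring a single file extension.
    if len({p.suffix.lower() for p in paths}) != 1:
        return False
    first = read_header(paths[0], delimiter, encoding)
    return all(read_header(p, delimiter, encoding) == first for p in paths[1:])
```

A call such as `csv_headers_match(["jan.csv", "feb.csv"])` (file names are illustrative) returns `False` as soon as one header differs, which is a cheap signal that the files should not be ingested together.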
The following are some of the requirements and limitations of the ibm-lh tool:
- Schema evolution is not supported.
- Target table must be an Iceberg format table.
- Partitioning is not supported.
- IBM Storage Ceph, IBM Cloud Object Storage (COS), AWS S3, and MinIO object storage are supported.
- pathStyleAccess property for object storage is not supported.
- Only Parquet and CSV file formats are supported as source data files.
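Because only Parquet and CSV sources are accepted, it can help to screen files before they are passed to an ingestion job. The sketch below is illustrative only: `detect_source_format` is a hypothetical helper, the format is taken from the file extension, and the Parquet check relies on the `PAR1` marker that unencrypted Parquet files carry at both ends. It does not validate the file contents beyond that.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # unencrypted Parquet files begin and end with these bytes
MIN_PARQUET_SIZE = 12    # leading magic + 4-byte footer length + trailing magic


def detect_source_format(path):
    """Return "parquet" or "csv" for a supported file, else raise ValueError."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".parquet":
        if path.stat().st_size < MIN_PARQUET_SIZE:
            raise ValueError(f"{path} is too small to be a Parquet file")
        with open(path, "rb") as handle:
            head = handle.read(4)
            handle.seek(-4, 2)  # jump to the last four bytes of the file
            tail = handle.read(4)
        if head != PARQUET_MAGIC or tail != PARQUET_MAGIC:
            raise ValueError(f"{path} does not look like a Parquet file")
        return "parquet"
    raise ValueError(f"{path}: only Parquet and CSV sources are supported")
```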
Loading or ingesting data through the CLI
An ingestion job in watsonx.data can be run with the ibm-lh tool. To run the ingestion job through the CLI, the tool must be pulled from the ibm-lh-client package and installed on the local system. For details and instructions on installing the ibm-lh-client package and using the ibm-lh tool for ingestion, see Installing ibm-lh-client and Setting up the ibm-lh command-line utility.
The ibm-lh tool supports the following features:
- Auto-discovery of schema based on the source file or target table.
- Advanced table configuration options for the CSV files:
  - Delimiter
  - Header
  - File encoding
  - Line delimiter
  - Escape characters
- Ingestion of a single file, multiple files, or a single folder (no subfolders) of Parquet files from S3 or local storage.
- Ingestion of a single file, multiple files, or a single folder (no subfolders) of CSV files from S3 or local storage.
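To make the CSV configuration options listed above concrete, the following sketch shows how the delimiter, header, file encoding, line delimiter, and escape character each affect how raw bytes become rows. It uses Python's standard `csv` module rather than the ibm-lh tool; the function `parse_csv` and its parameter names are hypothetical and do not correspond to ibm-lh option names. Quoted fields that contain the line delimiter are not handled.

```python
import csv


def parse_csv(raw, delimiter=",", has_header=True, encoding="utf-8",
              line_delimiter="\n", escape_char="\\"):
    """Decode and parse CSV bytes; return (column_names, rows)."""
    # File encoding: controls how bytes are turned into text.
    text = raw.decode(encoding)
    # Line delimiter: controls where one record ends; empty lines are skipped.
    lines = [line for line in text.split(line_delimiter) if line]
    # Delimiter splits fields; the escape character lets a field contain it.
    rows = list(csv.reader(lines, delimiter=delimiter, escapechar=escape_char))
    # Header: when present, the first row supplies the column names.
    if has_header and rows:
        return rows[0], rows[1:]
    width = len(rows[0]) if rows else 0
    return [f"col{i}" for i in range(width)], rows
```

For example, a Latin-1 file that uses `|` as the delimiter and `\r\n` as the line delimiter would be read with `parse_csv(data, delimiter="|", encoding="latin-1", line_delimiter="\r\n")`. If any of these settings does not match the file, the resulting column names or values are wrong, which is why they need to be set correctly before ingestion.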