Ingestion options and parameters supported in ibm-lh utility

This topic lists all the options and parameters that can be used with the ibm-lh utility for ingestion. It does not provide instructions to perform ingestion.

If you are looking for instructions to ingest data, see the following topics:

The following ingestion modes are supported:

PRESTO
SPARK_LEGACY
SPARK

SPARK is the default mode.

Parameters and variables

The ibm-lh utility supports various parameters and variables that can be invoked by the ibm-lh data-copy command. The following table lists the parameters and the corresponding details.

Command line options and variables
Parameter	Description	Declaration	Modes of ingestion
`create-if-not-exist`	Create target table if it does not exist.	`--create-if-not-exist`	`PRESTO` and `SPARK_LEGACY`
`dbpassword`	Database password that is used to do ingestion. This is a mandatory parameter to run an ingestion job unless the default user is used.	`--dbpassword <DBPASSWORD>`	`PRESTO`
`dbuser`	Database username that is used to do ingestion. This is a mandatory parameter to run an ingestion job unless the default user is used.	`--dbuser <DBUSER>`	`PRESTO`
`debug`	Debug the logs of ingestion jobs. The short command for this parameter is `-d`.	`--debug`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`engine-id`	Engine id of Spark engine when using REST API based `SPARK` ingestion. The short command for this parameter is `-e`.	`--engine-id <spark-enginename>`	`SPARK`
`escape-char`	CSV escape property character. Default value is `\\`.	`--escape-char <escape_character_value>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`encoding`	CSV encoding property character. Default value is `utf-8`.	`--encoding <encoding_value>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`field-delimiter`	CSV file field delimiter value. Default value is `,`.	`--field-delimiter <field_delimiter_value>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`header`	Mandatory parameter for CSV files with or without a header. Default value is `true`.	`--header <true/false>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`ingest-config`	Configuration file for data migration	`--ingest-config <INGEST_CONFIGFILE>`	`PRESTO` and `SPARK_LEGACY`
`ingestion-engine-endpoint`	Endpoint of ingestion engine. hostname=`<hostname>`, port=`<port>`. This is a mandatory parameter to run an ingestion job.	`--ingestion-engine-endpoint <INGESTION_ENGINE_ENDPOINT>`	`PRESTO` and `SPARK_LEGACY`
`instance-id`	Identifies unique instances. In a SaaS environment, the CRN is the instance id. The short command for this parameter is `-i`.	`--instance-id <instance-CRN>`	`SPARK`
`job-id`	Job id is generated when REST API or UI-based ingestion is initiated. This job id is used in getting the status of ingestion job. This parameter is used only with `ibm-lh get-status` command in the interactive mode of ingestion. The short command for this parameter is `-j`.	`ibm-lh get-status --job-id <Job id>`	`SPARK`
`all-jobs`	This all-jobs parameter gives the history of all ingestion jobs. This parameter is used only with `ibm-lh get-status` command in the interactive mode of ingestion.	`ibm-lh get-status --all-jobs`	`SPARK`
`line-delimiter`	CSV file line delimiter value. Default value is `\n`.	`--line-delimiter <line_delimiter_value>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`log-directory`	This option is used to specify the location of log files. See Log directory.	`--ingest-config <ingest_config_file> --log-directory <directory_path>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`partition-by`	Supports the functions for year, month, day, and hour for timestamp in the `partition-by` list. If a target table already exists or the `create-if-not-exist` parameter is not specified, `partition-by` does not affect the data. The `create-if-not-exist` parameter is no longer supported for `SPARK`.	`ibm-lh data-copy --partition-by "<columnname1>, <columnname2>"`	`SPARK_LEGACY` and `SPARK`
`password`	Password of the user connecting to the instance. In SaaS, the API key of the instance is used. The short command for this parameter is `-pw`.	`--password <apikey>`	`SPARK`
`schema`	Schema file that includes CSV specifications, and more. For more details, see Schema file specifications.	`--schema </path/to/schemaconfig/file>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`source-data-files`	Data files or folders for data migration. File name ending with `/` is considered a folder. Single or multiple files can be used. This is a mandatory parameter to run an ingestion job. Example: `<file1_path>,<file2_path>,<folder1_path>`. File names are case-sensitive. The short command for this parameter is `-s`.	`--source-data-files <SOURCE_DATA_FILE>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`staging-location`	Location where CSV files and in some circumstances parquet files are staged, see Staging location. This is a mandatory parameter to run an ingestion job.	`--staging-location <STAGING_LOCATION>`	`PRESTO`
`staging-hive-catalog`	If the default catalog for staging is not used, use this parameter to specify the name of the Hive catalog that is configured in watsonx.data. The default catalog is `hive_data`.	`--staging-hive-catalog <catalog_name>`	`PRESTO`
`staging-hive-schema`	The schema name associated with the staging hive catalog for ingestion. Create and pass in a custom schema name by using this parameter. Default schema: `lhingest_staging_schema`. If schema is created as default, this parameter is not required.	`--staging-hive-schema <schema_name>`	`PRESTO`
`sync-status`	This parameter is used in REST API based ingestion. The default value is `false`. When this parameter is set to `true`, the `ibm-lh data-copy` tool waits and polls continuously for the status after an ingestion job is submitted.	`--sync-status <true/false>`	`SPARK`
`system-config`	This parameter is used to specify system-related parameters. For more information, see System config.	`--system-config <path/to/system/configfile>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`target-catalog-uri`	Target catalog uri	`--target-catalog-uri <TARGET_CATALOG_URI>`	`SPARK_LEGACY`
`target-table`	Data migration target table. `<catalog>.<schema>.<table1>`. This is a mandatory parameter to run an ingestion job. Example: `<iceberg.demo.customer1>`. The short command for this parameter is `-t`. For more information, see Target table.	`--target-table <TARGET_TABLE>`	`PRESTO`, `SPARK_LEGACY`, and `SPARK`
`trust-store-path`	Path of the truststore to access the ingestion engine. This is used to establish SSL connections. This parameter is mandatory for non-root user.	`--trust-store-path <TRUST_STORE_PATH>`	`PRESTO` and `SPARK_LEGACY`
`trust-store-password`	Password of truststore to access the ingestion engine. This is used to establish SSL connections. This parameter is mandatory for non-root user.	`--trust-store-password <TRUST_STORE_PASSWORD>`	`PRESTO` and `SPARK_LEGACY`
`user`	Username of the user connecting to the instance. The short command for this parameter is `-u`.	`--user <username>`	`SPARK`
`url`	Base url of the location of watsonx.data cluster. The short command for this parameter is `-w`.	`--url <url>`	`SPARK`
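
The "Modes of ingestion" column can be used as a lookup to check a set of options before a job is submitted. The following is a minimal Python sketch of that idea, not part of ibm-lh: the `SUPPORTED_MODES` mapping copies only a few rows of the table, and the `unsupported_options` helper is a hypothetical name introduced for illustration.

```python
# Illustrative only: a partial copy of the "Modes of ingestion" column.
# The mapping and helper names are hypothetical and not part of ibm-lh.
SUPPORTED_MODES = {
    "create-if-not-exist": {"PRESTO", "SPARK_LEGACY"},
    "dbuser": {"PRESTO"},
    "dbpassword": {"PRESTO"},
    "engine-id": {"SPARK"},
    "staging-location": {"PRESTO"},
    "partition-by": {"SPARK_LEGACY", "SPARK"},
    "source-data-files": {"PRESTO", "SPARK_LEGACY", "SPARK"},
    "target-table": {"PRESTO", "SPARK_LEGACY", "SPARK"},
}


def unsupported_options(mode, options):
    """Return the options that the table does not list for the given mode.

    Options missing from SUPPORTED_MODES are skipped because this sketch
    copies only part of the table.
    """
    return sorted(
        name
        for name in options
        if name in SUPPORTED_MODES and mode not in SUPPORTED_MODES[name]
    )
```

For example, checking `staging-location` and `engine-id` against `SPARK` reports `staging-location`, because the table lists that parameter for `PRESTO` only.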

The following parameters are listed in separate sections because they have more details that cannot be accommodated in the table.

System config

The system-config parameter refers to a file and is used to specify system-related parameters.

For the command line, the parameter is declared as follows:

--system-config /path/to/systemconfig/file

The format of the system config parameter is as follows:

[system-config]
<param_name1>:<param_val>
<param_name2>:<param_val>
<param_name3>:<param_val>
...

Currently, only the memory-limit parameter is supported. This parameter specifies the maximum memory in watsonx.data that an ingestion job can use. The default value for memory-limit is 500M. The limit can be specified in bytes or with a K, M, or G suffix. The system-config is applicable for PRESTO, SPARK_LEGACY, and SPARK ingestion modes.

The following are some examples of how the memory-limit parameter can be specified in the system-config file.

[system-config]
memory-limit:500M

[system-config]
memory-limit:5000K

[system-config]
memory-limit:1G

[system-config]
memory-limit:10000000 #This is in bytes

The memory-limit parameter is applicable for PRESTO ingestion mode.
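
To make the accepted value forms concrete, the following Python sketch reads a memory-limit entry from system-config text and converts it to bytes. It is an illustration only, not the utility's own parser. The function names are hypothetical, and the sketch assumes that K, M, and G are 1024-based multiples, which the text above does not state.

```python
import re

# Assumption: K, M, and G are binary multiples (1024-based).
# The documentation does not say whether the utility uses 1000 or 1024.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def memory_limit_to_bytes(value):
    """Convert a memory-limit value such as '500M' or '10000000' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value)
    if match is None:
        raise ValueError(f"unsupported memory-limit value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def read_memory_limit(config_text, default="500M"):
    """Find memory-limit in system-config text; fall back to the default."""
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("memory-limit:"):
            return line.split(":", 1)[1].strip()
    return default
```

When the file has no memory-limit entry, `read_memory_limit` returns the documented default of 500M.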

Staging location

This parameter is applicable for PRESTO ingestion mode.

The staging location is used for:

CSV file or folder ingestion
Local Parquet file or folder ingestion
S3 Parquet file ingestion
In some circumstances, when the source file or the files in the S3 Parquet folder contain special column types, such as TIME, or are associated with different column types

For ingestion job through CLI, the staging bucket must be the same bucket that is associated with the Hive catalog. Staging is possible only in the Hive catalog.

The internal MinIO buckets (iceberg-data, hive-data, wxd-milvus, wxd-system) and their associated catalogs cannot be used for staging, as their endpoints are not externally accessible. Users can use their own storage buckets that are exposed and accessible by external connections.
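
The restriction on internal buckets can be checked before a job is submitted. The following Python sketch is a hypothetical pre-check, not part of ibm-lh. It assumes that the staging location is given in the s3://bucket/path form, and the helper names are introduced only for illustration.

```python
from urllib.parse import urlsplit

# Internal MinIO buckets named in the text above; not usable for staging.
INTERNAL_BUCKETS = {"iceberg-data", "hive-data", "wxd-milvus", "wxd-system"}


def staging_bucket(staging_location):
    """Return the bucket name from an s3://bucket/path staging location."""
    parts = urlsplit(staging_location)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(
            f"expected an s3://bucket/path location: {staging_location!r}"
        )
    return parts.netloc


def check_staging_location(staging_location):
    """Raise ValueError if the location points at an internal MinIO bucket."""
    bucket = staging_bucket(staging_location)
    if bucket in INTERNAL_BUCKETS:
        raise ValueError(
            f"bucket {bucket!r} is internal and cannot be used for staging"
        )
    return bucket
```

The check covers only the bucket name. Whether the bucket is associated with the Hive catalog still has to be confirmed in watsonx.data.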

Schema file specification

The schema parameter points to the schema file. The schema file can be used to specify CSV file properties such as field delimiter, line delimiter, escape character, encoding and whether header exists in the CSV file. This parameter is applicable for PRESTO, SPARK_LEGACY, and SPARK ingestion modes.

The following is the schema file specification:

[CSV]
DELIMITER:<delim> #default ','

#LINE_DELIMITER:
#A single char delimiter other than ' '(blank), need not be enclosed in quotes.
#Must be enclosed in quotes if it is one of:  '\n' for newline, '\t' for TAB, ' ' for space.
LINE_DELIMITER:<line_delim> #default '\n'

HEADER:<true|false> #default 'true'
#HEADER is a mandatory entry within schema file.

#single character value
ESCAPECHAR:<escape_char>   #default '\\'

#Encoding (Example:"utf-8")
ENCODING:<encoding>    #default None

The encoding values supported by Presto ingestion are directly dependent on encoding values supported by Python and the encoding values supported by Spark ingestion are directly dependent on encoding values supported by Java.

All values except HEADER must be enclosed in single quotation marks.

The following is an example of schema specification:

$ more /tmp/schema.cfg
[CSV]
DELIMITER:','
HEADER:false
LINE_DELIMITER:'\n'
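
A schema file like the preceding example can also be generated from a script. The following Python sketch is one possible way to render it. The `render_schema` function is a hypothetical helper, not part of ibm-lh, and it assumes that every value except HEADER is written in single quotation marks.

```python
def render_schema(delimiter=",", header=True, line_delimiter="\\n",
                  escape_char=None, encoding=None):
    """Render a [CSV] schema file body; optional entries are omitted when None.

    line_delimiter is passed as the two-character text \\n (backslash, n),
    because the schema file stores the escape sequence, not a real newline.
    """
    lines = [
        "[CSV]",
        f"DELIMITER:'{delimiter}'",
        # HEADER is mandatory and is the one value written without quotes.
        f"HEADER:{'true' if header else 'false'}",
        f"LINE_DELIMITER:'{line_delimiter}'",
    ]
    if escape_char is not None:
        lines.append(f"ESCAPECHAR:'{escape_char}'")
    if encoding is not None:
        lines.append(f"ENCODING:'{encoding}'")
    return "\n".join(lines) + "\n"
```

Calling `render_schema(header=False)` produces the same three entries as the /tmp/schema.cfg example. The result can be written to a file and passed with `--schema`.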

Log directory

The ingest log files are generated in the log directory. By default, the ingest log file is generated as /tmp/ingest.log. By using the --log-directory parameter, you can specify a new location for ingest log files. A separate log file is created for each ingest command invocation. The new log file name is in the format ingest_<timestamp>_<pid>.log. The log directory must exist before invocation of the ibm-lh ingest tool.

This parameter is applicable only in the command-line option for PRESTO, SPARK_LEGACY, and SPARK ingestion modes.

Example when using the command line:

ibm-lh data-copy --source-data-files s3://cust-bucket/warehouse/a_source_file1.csv,s3://cust-bucket/warehouse/a_source_file2.csv \
--staging-location s3://cust-bucket/warehouse/staging/ \
--target-table iceberg_target_catalog.ice_schema.cust_tab1 \
--ingestion-engine-endpoint "hostname=localhost,port=8080" \
--create-if-not-exist \
--log-directory /tmp/mylogs

Example when using a config file:

ibm-lh data-copy --ingest-config ext.cfg --log-directory /tmp/mylogs
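
Because the log directory must exist before the tool is invoked, a wrapper script can create it while assembling the command. The following Python sketch shows one way to do that. The `build_data_copy_command` helper is hypothetical. It only builds an argument list from options documented on this page and does not run ibm-lh.

```python
import os


def build_data_copy_command(source_files, target_table, log_directory,
                            staging_location=None, endpoint=None,
                            create_if_not_exist=False):
    """Build an ibm-lh data-copy argument list from documented options."""
    # The log directory must exist before the tool is invoked.
    os.makedirs(log_directory, exist_ok=True)
    argv = [
        "ibm-lh", "data-copy",
        # Multiple files or folders are passed as one comma-separated value.
        "--source-data-files", ",".join(source_files),
        "--target-table", target_table,
    ]
    if staging_location:
        argv += ["--staging-location", staging_location]
    if endpoint:
        argv += ["--ingestion-engine-endpoint", endpoint]
    if create_if_not_exist:
        argv.append("--create-if-not-exist")
    argv += ["--log-directory", log_directory]
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)` on a machine where ibm-lh is installed.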

Target table

The ability to handle special characters in table and schema names for ingestion is constrained by the underlying engines (Presto, Legacy Spark, Spark) and the special characters they support. Not every special character is accepted by every engine, so consult the documentation of the engine that you use for the special characters that it supports.

The SQL identifier of the target table for data migration is <catalog>.<schema>.<table>. Use double quotation marks " or backticks ` to escape parts with special characters.

Examples:

ibm-lh data-copy --target-table 'catalog."schema 2.0"."my table!"'

ibm-lh data-copy --target-table 'catalog.`schema 2.0`.`my table!`'

ibm-lh data-copy --target-table catalog.'"schema 2.0"'.'"my table!"'

ibm-lh data-copy --target-table "catalog.\`schema 2.0\`.\`my table!\`"

ibm-lh data-copy --target-table catalog.\"schema\ 2.0\".\"my\ table!\"

Both double quotation marks " and backticks ` are accepted, but quotation mark styles cannot be mixed. To include a literal quotation mark inside an identifier, double the quoting character (for example, "" or ``).
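
The quoting rules can be applied programmatically when table names come from a script. The following Python sketch is one way to do that. The helper names are hypothetical, and the pattern that decides which parts need no quoting is an assumption of this sketch, not a rule taken from any engine.

```python
import re
import shlex

# Assumption: parts made only of letters, digits, and underscores need no quoting.
_PLAIN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_part(part, quote='"'):
    """Quote one identifier part if needed, doubling embedded quote characters."""
    if quote not in ('"', "`"):
        raise ValueError("quote must be a double quotation mark or a backtick")
    if _PLAIN.fullmatch(part):
        return part
    return quote + part.replace(quote, quote * 2) + quote


def target_table(catalog, schema, table, quote='"'):
    """Join the parts as <catalog>.<schema>.<table> using one quote style."""
    return ".".join(quote_part(p, quote) for p in (catalog, schema, table))


def shell_argument(identifier):
    """Quote the whole identifier for a POSIX shell command line."""
    return shlex.quote(identifier)
```

With the inputs `catalog`, `schema 2.0`, and `my table!`, `shell_argument(target_table(...))` gives the single-quoted form used in the first example. Using one `quote` value per call keeps the two quotation mark styles from being mixed.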

Starting with watsonx.data version 2.0.0, target-tables is deprecated and target-table must be used.