Additional Information about ingestion command usage and special cases
This topic provides guidance on using the cpdctl wx-data ingestion create command for various ingestion scenarios, including folder ingestion, ad hoc ingestion, and ingestion from databases or snapshots. It highlights special cases, edge conditions, and practical examples to help you perform data ingestion effectively.
All the examples in this topic build on the command help output described in the "How to use wx-data command --help (-h)" section.
Scenario 1: Basic ingestion examples
The following are some examples of basic ingestion commands:
- Ingestion from registered storage with the Spark engine

cpdctl wx-data ingestion create \
--source-data-files s3://bucketcos/titanic-parquet.txt \
--engine-id spark690 \
--target-table iceberg_data.schema1.cli_table5
- Lite ingestion from registered storage

cpdctl wx-data ingestion create \
--source-data-files s3://bucketcos/jsonFile.json \
--engine-id lite-ingestion \
--target-table iceberg_data.schema1.cli_table2
- Ingestion from a registered database with the Spark engine

cpdctl wx-data ingestion create \
--database-id postgresql241 \
--database-schema Tm_Lh_Engine \
--database-table admission \
--engine-id spark66 \
--target-table iceberg_data.schema1.cli_table6 \
--sync-status
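The basic invocations differ only in the source file, engine, and target table, so they are easy to script. The following is a minimal sketch that assembles such commands from the pieces shown above; the bucket, engine id, and catalog/schema names are placeholders taken from the examples, and the commands are printed rather than executed so the sketch is safe to run without a watsonx.data instance:

```shell
# Sketch: assemble a basic ingestion command for one source file.
# Flag names come from the examples above; bucket, engine id, and
# catalog/schema names are hypothetical placeholders.
build_ingest_cmd() {
  file="$1"
  table="$2"
  printf 'cpdctl wx-data ingestion create --source-data-files s3://bucketcos/%s --engine-id spark690 --target-table iceberg_data.schema1.%s\n' \
    "$file" "$table"
}

# Print (not execute) one command per file/table pair.
build_ingest_cmd titanic-parquet.txt cli_table5
build_ingest_cmd jsonFile.json cli_table2
```

Echoing the assembled command first makes it easy to review the exact flags before removing the indirection and running cpdctl directly.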
Scenario 2: Folder ingestion
- Supported engine: Folder ingestion is supported only by the Spark engine.
- Required parameter: You must specify --source-file-type when ingesting from a folder.

Example:

cpdctl wx-data ingestion create \
--source-data-files s3://bucketcos/csv_folder \
--source-file-type csv \
--target-table iceberg_data.cpdctl_test.test1 \
--engine-id spark66 \
--job-id cli-test2
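Because --source-file-type must match the files inside the folder, a small helper can derive the value from a sample file name before the command is built. This is only a sketch, under the assumption that every file in the folder shares one extension; the mapping below covers just a few common types:

```shell
# Sketch: map a sample file name to a --source-file-type value.
# Assumes all files in the folder have the same extension.
detect_source_file_type() {
  case "$1" in
    *.csv)     echo csv ;;
    *.json)    echo json ;;
    *.parquet) echo parquet ;;
    *.avro)    echo avro ;;
    *)         echo "unsupported extension: $1" >&2; return 1 ;;
  esac
}

ftype=$(detect_source_file_type sample.csv)
echo "detected source file type: $ftype"
```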
Scenario 3: Ad hoc ingestion (without registered storage)
You can perform ingestion without registering storage by providing credentials directly through CLI parameters. S3 and ADLS storage credentials can be passed either through the --storage-details argument or through the corresponding individual arguments.
- Ingestion using S3 storage

Example using --storage-details:

cpdctl wx-data ingestion create \
--source-data-files s3://bucketcos/titanic-parquet.txt \
--storage-details '{"secret_key":"*****","endpoint":"https://s3.us-west.cloud-object-storage.test.appdomain.cloud","type":"ibm_cos","access_key":"*****","name":"bucketcos","region":"us-south"}' \
--engine-id lite-ingestion \
--target-table iceberg_data.schema1.cli_table1
Example using individual storage arguments:

cpdctl wx-data ingestion create \
--source-data-files s3://bucketcos/userdata5.avro \
--storage-access-key ****** \
--storage-endpoint https://s3.us-west.cloud-object-storage.test.appdomain.cloud \
--storage-name bucketcos \
--storage-region us-south \
--storage-secret-key ****** \
--storage-type ibm_cos \
--engine-id lite-ingestion \
--target-table iceberg_data.schema1.cli_table1 \
--target-write-mode overwrite
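Typing the --storage-details JSON inline leaves secrets in shell history. A minimal sketch of an alternative, assuming hypothetical environment variable names and reusing the endpoint and JSON keys from the S3 example above:

```shell
# Sketch: build the --storage-details JSON from environment variables so
# credentials are not typed inline. Variable names are hypothetical;
# the JSON keys match the S3 example in this topic.
S3_ACCESS_KEY="${S3_ACCESS_KEY:-example-access-key}"
S3_SECRET_KEY="${S3_SECRET_KEY:-example-secret-key}"
storage_details=$(printf '{"secret_key":"%s","endpoint":"%s","type":"ibm_cos","access_key":"%s","name":"%s","region":"%s"}' \
  "$S3_SECRET_KEY" \
  "https://s3.us-west.cloud-object-storage.test.appdomain.cloud" \
  "$S3_ACCESS_KEY" \
  "bucketcos" \
  "us-south")
echo "$storage_details"
# The value can then be passed as: --storage-details "$storage_details"
```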
- Ingestion using ADLS gen1 storage

Example:

cpdctl wx-data ingestion create \
--source-data-files wasbs://lhcasblob2@lhcastest2.blob.core.windows.net/ingest_data_folder/employees_new_comma.orc \
--storage-details '{"name":"lhcasblob2-lhcastest2","endpoint":"wasbs://lhcasblob2@lhcastest2.blob.core.windows.net","type":"adls_gen1","access_key":"*******","container_name":"lhcasblob2","account_name":"lhcastest2"}' \
--engine-id lite-ingestion \
--target-table iceberg_data.schema1.cli_table3
- Ingestion using ADLS gen2 storage

Example:

cpdctl wx-data ingestion create \
--source-data-files abfss://pyspark@sparkadlsiae.dfs.core.windows.net/ingest_data_folder/iris.parquet \
--storage-details '{"name":"pyspark-sparkadlsiae","endpoint":"abfss://pyspark@sparkadlsiae.dfs.core.windows.net","type":"adls_gen2","application_id":"*****","directory_id":"*****","secret_key":"*******","container_name":"pyspark","account_name":"sparkadlsiae"}' \
--engine-id spark66 \
--target-table iceberg_data.schema1.cli_table4 \
--sync-status
- Ingestion using a database source

You can ingest data from databases either by using registered database IDs or by providing connection details directly.

Example with direct connection details:

cpdctl wx-data ingestion create \
--engine-id spark66 \
--database-type netezza \
--database-name conopsdb \
--database-host 9.46.64.138 \
--database-isssl=true \
--database-port 5480 \
--database-user-id **** \
--database-password ***** \
--database-schema TM_LAKEHOUSE_ENGINE \
--database-table STUDENTS \
--database-certificate "$(cat /Users/shibilrahmanp/Documents/db_certs/netezza.pem)" \
--target-table iceberg_data.schema1.cli_table10 \
--target-write-mode overwrite \
--sync-status
- --database-certificate accepts the certificate as a string. To pass a certificate file instead, use: --database-certificate "$(cat </path/to/certificate.pem>)"
- --database-isssl=true is required for SSL-enabled databases, and the certificate must be provided through the --database-certificate parameter.
- For Oracle, the --database-connection-mode (either sid or service_name) and --database-connection-mode-value parameters are required.
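Since --database-certificate takes the certificate text itself, a quick sanity check that the file exists and looks like PEM can save a failed ingestion job. A sketch, using a temporary file with placeholder content instead of a real certificate:

```shell
# Sketch: read a certificate into a variable and check it looks like PEM
# before passing it to --database-certificate. The file content below is
# a placeholder, not a real certificate.
cert_file=$(mktemp)
printf -- '-----BEGIN CERTIFICATE-----\nMIIB...placeholder...\n-----END CERTIFICATE-----\n' > "$cert_file"

cert=$(cat "$cert_file")
case "$cert" in
  "-----BEGIN CERTIFICATE-----"*)
    echo "certificate looks like PEM" ;;
  *)
    echo "not a PEM certificate: $cert_file" >&2
    exit 1 ;;
esac
rm -f "$cert_file"
# The variable can then be passed as: --database-certificate "$cert"
```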
Scenario 4: Ingestion using Iceberg snapshot-id with the Spark engine
Example:
cpdctl wx-data ingestion create \
--instance-id 1735472262311515 \
--iceberg-catalog sample_iceberg_catalog \
--iceberg-schema sample_iceberg_schema \
--iceberg-snapshot-id 7823318841638214979 \
--iceberg-table sample_iceberg_table \
--iceberg-warehouse sample_iceberg_warehouse \
--target-table sample_catalog.sample_schema.sample_table \
--engine-id spark266 \
--storage-name iceberg-data
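An Iceberg snapshot id such as the one above is a plain integer; validating the value before submitting the job avoids a rejected request caused by a copy-paste error. A minimal sketch:

```shell
# Sketch: check that an Iceberg snapshot id is a non-empty string of
# digits before passing it to --iceberg-snapshot-id. The id is taken
# from the example above.
snapshot_id=7823318841638214979
case "$snapshot_id" in
  ''|*[!0-9]*) echo "invalid snapshot id: $snapshot_id" >&2; exit 1 ;;
  *)           echo "snapshot id ok: $snapshot_id" ;;
esac
```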
Additional Help
To explore more options and flags supported by the ingestion command, run the following:
cpdctl wx-data ingestion create --help