Additional information about ingestion command usage and special cases

This topic provides guidance on using the cpdctl wx-data ingestion create command for various ingestion scenarios, including folder ingestion, ad hoc ingestion, and ingestion from databases or Iceberg snapshots. It highlights special cases, edge conditions, and practical examples to help users perform data ingestion effectively.

All the examples in this topic are based on the output of the help command, as explained in the "How to use wx-data command --help (-h)" section.

Scenario 1: Basic ingestion examples

The following are some examples of basic ingestion commands:

  1. Ingestion with registered storage using the Spark engine

    cpdctl wx-data ingestion create  \
    --source-data-files s3://bucketcos/titanic-parquet.txt \
    --engine-id spark690 \
    --target-table iceberg_data.schema1.cli_table5
    
  2. Lite ingestion with registered storage

    cpdctl wx-data ingestion create  \
    --source-data-files s3://bucketcos/jsonFile.json \
    --engine-id lite-ingestion \
    --target-table iceberg_data.schema1.cli_table2
    
  3. Ingestion with registered database using the Spark engine

    cpdctl wx-data ingestion create  \
    --database-id postgresql241 \
    --database-schema Tm_Lh_Engine \
    --database-table admission \
    --engine-id spark66 \
    --target-table iceberg_data.schema1.cli_table6 \
    --sync-status
    

Scenario 2: Folder ingestion

  • Supported engine: Folder ingestion is only supported by the Spark engine.

  • Required parameter: Users must specify the --source-file-type parameter when ingesting from a folder.

    Example:

    cpdctl wx-data ingestion create \
    --source-data-files s3://bucketcos/csv_folder \
    --source-file-type csv \
    --target-table iceberg_data.cpdctl_test.test1 \
    --engine-id spark66 \
    --job-id cli-test2
    

Scenario 3: Ad hoc ingestion (without registered storage)

Users can perform ingestion without registering storage by providing credentials directly through CLI parameters.

S3 and ADLS storage credentials can be provided either through the --storage-details argument or through the corresponding individual arguments.

  1. Ingestion using S3 storage

    Example using --storage-details:

    cpdctl wx-data ingestion create \
    --source-data-files s3://bucketcos/titanic-parquet.txt \
    --storage-details '{"secret_key":"*****","endpoint":"https://s3.us-west.cloud-object-storage.test.appdomain.cloud","type":"ibm_cos", "access_key":"*****","name":"bucketcos", "region":"us-south"}' \
    --engine-id lite-ingestion \
    --target-table iceberg_data.schema1.cli_table1
    

    Example using individual storage arguments:

    cpdctl wx-data ingestion create \
    --source-data-files s3://bucketcos/userdata5.avro \
    --storage-access-key ****** \
    --storage-endpoint https://s3.us-west.cloud-object-storage.test.appdomain.cloud \
    --storage-name bucketcos \
    --storage-region us-south \
    --storage-secret-key ****** \
    --storage-type ibm_cos \
    --engine-id lite-ingestion \
    --target-table iceberg_data.schema1.cli_table1 \
    --target-write-mode overwrite
    
  2. Ingestion using ADLS Gen1 storage

    Example:

    cpdctl wx-data ingestion create  \
    --source-data-files wasbs://lhcasblob2@lhcastest2.blob.core.windows.net/ingest_data_folder/employees_new_comma.orc \
    --storage-details '{"name":"lhcasblob2-lhcastest2", "endpoint":"wasbs://lhcasblob2@lhcastest2.blob.core.windows.net", "type":"adls_gen1", "access_key":"*******", "container_name":"lhcasblob2", "account_name":"lhcastest2"}' \
    --engine-id lite-ingestion \
    --target-table iceberg_data.schema1.cli_table3
    
  3. Ingestion using ADLS Gen2 storage

    Example:

    cpdctl wx-data ingestion create  \
    --source-data-files abfss://pyspark@sparkadlsiae.dfs.core.windows.net/ingest_data_folder/iris.parquet \
    --storage-details '{"name":"pyspark-sparkadlsiae", "endpoint":"abfss://pyspark@sparkadlsiae.dfs.core.windows.net", "type":"adls_gen2", "application_id":"*****", "directory_id":"*****", "secret_key":"*******", "container_name":"pyspark", "account_name":"sparkadlsiae"}' \
    --engine-id spark66 \
    --target-table iceberg_data.schema1.cli_table4 \
    --sync-status
    
  4. Ingestion using a database source

    Users can ingest data from databases either by using registered database IDs or by providing connection details directly.

    Example with direct connection details:

    cpdctl wx-data ingestion create  \
    --engine-id spark66 \
    --database-type netezza \
    --database-name conopsdb \
    --database-host 9.46.64.138 \
    --database-isssl=true \
    --database-port 5480 \
    --database-user-id **** \
    --database-password ***** \
    --database-schema TM_LAKEHOUSE_ENGINE \
    --database-table STUDENTS \
    --database-certificate "$(cat </path/to/certificate.pem>)" \
    --target-table iceberg_data.schema1.cli_table10 \
    --target-write-mode overwrite \
    --sync-status
    
    • --database-certificate accepts the certificate contents as a string. To read the certificate from a file, use: --database-certificate "$(cat </path/to/certificate.pem>)".
    • --database-isssl=true is required for SSL-enabled databases, and the certificate must be provided through the --database-certificate parameter.
    • For Oracle, the --database-connection-mode (either sid or service_name) and --database-connection-mode-value parameters are required, as shown in the sketch after this list.
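
    The following is a minimal sketch of an Oracle ingestion command that combines these parameters with the connection flags shown in the Netezza example. The engine ID, database name, host, port, schema, table, and connection-mode value are placeholder assumptions; substitute values from your own environment.

    cpdctl wx-data ingestion create \
    --engine-id spark66 \
    --database-type oracle \
    --database-name ORCLPDB \
    --database-host <oracle-host> \
    --database-port 1521 \
    --database-user-id **** \
    --database-password ***** \
    --database-connection-mode service_name \
    --database-connection-mode-value ORCLPDB \
    --database-schema SALES \
    --database-table ORDERS \
    --target-table iceberg_data.schema1.cli_table11 \
    --sync-status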

Scenario 4: Ingestion using Iceberg snapshot-id with Spark engine

Example:

cpdctl wx-data ingestion create \
--instance-id 1735472262311515 \
--iceberg-catalog sample_iceberg_catalog \
--iceberg-schema sample_iceberg_schema \
--iceberg-snapshot-id 7823318841638214979 \
--iceberg-table sample_iceberg_table  \
--iceberg-warehouse sample_iceberg_warehouse \
--target-table sample_catalog.sample_schema.sample_table \
--engine-id spark266 \
--storage-name iceberg-data

Additional help

To explore more options and flags that the ingestion command supports, run the following command:

cpdctl wx-data ingestion create --help