Ingesting data from remote storage

You can ingest data from remote storage systems into IBM® watsonx.data by using the Spark ingestion UI. This flow supports cloud storage platforms such as Amazon S3, IBM Cloud Object Storage, and MinIO.

Before you begin

  • Review the prerequisites for using the Spark ingestion UI.
  • Ensure that the source storage system is accessible from the watsonx.data environment.
  • Ensure that you have the appropriate credentials and permissions to access the storage system.

Supported storage systems

  • Amazon S3
  • IBM Cloud Object Storage
  • Azure Data Lake Storage (ADLS)
  • Google Cloud Storage (GCS)
  • MinIO
  • S3-compatible storage systems

Supported file formats

  • CSV (Comma-Separated Values)
  • TXT
  • Parquet
  • JSON (JavaScript Object Notation)
  • Avro
  • ORC (Optimized Row Columnar)

Procedure

  1. Log in to the watsonx.data console.
  2. From the navigation menu, select Data manager.
  3. Click Ingest data.
  4. Select Storages as the ingestion flow.
  5. From the Select storage dropdown, choose an existing storage connection, or click Add + to add a new storage connection.
  6. If you are adding a new storage connection, select the Storage type and enter the connection details as described in Add Storage.
  7. The file selection interface has two tabs:
    • All files: Displays all files in the selected storage bucket.
    • Selected files: Displays only the files that you selected for ingestion.
  8. Browse the bucket contents to locate and select the files you want to ingest.
  9. Switch to the Selected files tab to review your selections.
  10. Click Next to proceed to file details configuration.

You can select multiple files for batch ingestion. All selected files must have the same format and schema.
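
The same-format requirement can be checked before you open the UI. The following Python sketch is one way to do that, assuming that the format can be guessed from the file extension. The extension mapping, the function names, and the sample object keys are hypothetical; this topic does not describe how the UI itself detects formats.

```python
from pathlib import PurePosixPath

# Assumed extension-to-format mapping, for illustration only. The UI's own
# detection logic is not described in this topic.
EXTENSION_TO_FORMAT = {
    ".csv": "CSV",
    ".txt": "TXT",
    ".parquet": "Parquet",
    ".json": "JSON",
    ".avro": "Avro",
    ".orc": "ORC",
}


def guess_format(object_key: str) -> str:
    """Return the supported format implied by the file extension."""
    suffix = PurePosixPath(object_key).suffix.lower()
    if suffix not in EXTENSION_TO_FORMAT:
        raise ValueError(f"Unsupported or unknown file type: {object_key}")
    return EXTENSION_TO_FORMAT[suffix]


def check_same_format(object_keys: list[str]) -> str:
    """Return the single format shared by all keys, or raise ValueError."""
    if not object_keys:
        raise ValueError("No files selected")
    formats = {guess_format(key) for key in object_keys}
    if len(formats) != 1:
        raise ValueError(f"Mixed formats selected: {sorted(formats)}")
    return formats.pop()


# Hypothetical object keys; the extension check is case-insensitive.
selection = ["sales/2024/jan.csv", "sales/2024/feb.CSV"]
print(check_same_format(selection))
```

This check covers the format only. Whether the files also share a schema still has to be confirmed separately, for example in the data preview.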

  1. Review the detected file format. If it is incorrect, select the correct format from the File format list.
  2. Configure format-specific options:

For CSV and TXT files:

  • Delimiter: Specify the delimiter character (default: comma).
  • Header: Select whether the first row contains column headers.
  • Infer schema: Enable to automatically detect column data types.
  • Quote character: Specify the character used for quoting values (default: double quote).
  • Escape character: Specify the character used for escaping special characters (default: backslash).
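
To see how these options interact, you can try them on a small sample locally. The following sketch uses Python's standard csv module, not the ingestion engine, so it only approximates the behavior. The sample data and the infer_type helper are hypothetical, and the type inference shown is a naive stand-in for whatever the Infer schema option does.

```python
import csv
import io

# Sample data: semicolon-delimited, with a header row, a quoted value that
# contains the delimiter, and backslash-escaped quotes inside a quoted value.
raw = 'id;name;comment\n1;"Doe; Jane";"said \\"hi\\""\n2;Smith;plain\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter=";",      # Delimiter option
    quotechar='"',      # Quote character option
    escapechar="\\",    # Escape character option
    doublequote=False,  # quotes inside values are escaped, not doubled
)
rows = list(reader)

# Header option enabled: the first row supplies the column names.
header, records = rows[0], rows[1:]


def infer_type(values):
    """Naive inference: integer if every value parses as int, then double, else string."""
    for caster, type_name in ((int, "integer"), (float, "double")):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue
        return type_name
    return "string"


# Transpose the records into columns and infer one type per column.
columns = list(zip(*records))
schema = {name: infer_type(column) for name, column in zip(header, columns)}

print(header)
print(records)
print(schema)
```

With the wrong delimiter or quote character, the first record would split into a different number of fields. That kind of problem is what the data preview is meant to reveal.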

For JSON files:

  • Multi-line: Enable if each JSON record spans multiple lines.
  • Infer schema: Enable to automatically detect the schema from the JSON structure.
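
The following sketch illustrates the two file layouts that the Multi-line option distinguishes. It assumes the usual contrast between one record per line and records that span several lines; the exact behavior of the UI option is not specified in this topic. The sample data and function names are hypothetical, and the code uses only Python's standard json module.

```python
import json

# Layout 1: one complete JSON record per line (Multi-line disabled).
single_line = '{"id": 1, "name": "Jane"}\n{"id": 2, "name": "Smith"}\n'

# Layout 2: records that span several lines (Multi-line enabled).
multi_line = """[
  {
    "id": 1,
    "name": "Jane"
  },
  {
    "id": 2,
    "name": "Smith"
  }
]
"""


def parse_single_line(text):
    """Parse each non-empty line as an independent JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_multi_line(text):
    """Parse the whole text as one JSON document and return a list of records."""
    document = json.loads(text)
    return document if isinstance(document, list) else [document]


# Both layouts carry the same records when parsed with the matching mode.
print(parse_single_line(single_line) == parse_multi_line(multi_line))

# Parsing the multi-line layout line by line fails on the first line.
try:
    parse_single_line(multi_line)
    line_mode_failed = False
except json.JSONDecodeError:
    line_mode_failed = True
print(line_mode_failed)
```

If the preview shows parse errors or a single malformed column for a JSON file, a mismatch between the file layout and this option is a likely cause to check first.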

For Parquet, Avro, and ORC files:

  • The schema is automatically detected from the file metadata.

  3. Click Preview data to view a sample of the data with the current configuration.
  4. Verify that the data is parsed correctly. If it is not, adjust the configuration options.
  5. Click Next to proceed to target table configuration.
  6. Configure the target table. See Configuring target table settings in the parent topic.
  7. Configure the job details. See Configuring job details in the parent topic.
  8. Review the ingestion configuration summary.
  9. Verify that all settings are correct.
  10. Click Submit to start the ingestion job.

Results

After the ingestion job completes successfully, the data from the remote storage files is loaded into the target table. The source files remain in the remote storage bucket and are not modified.

Related information