Ingesting data by using Spark

You can ingest data into IBM® watsonx.data by using IBM Analytics Engine (Spark) through the web console.

Before you begin

To enable your Spark application to work with the watsonx.data catalog and storage, you must have the Metastore admin role. Without the Metastore admin privilege, you cannot ingest data into storage by using the native Spark engine. For more information about the Spark configuration, see Working with the watsonx.data catalog and storage.
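
As an illustration only, the following is a minimal PySpark sketch of the kind of session configuration such an application might use. The catalog name ("lakehouse"), metastore URI, storage endpoint, and credentials are placeholders, not verified watsonx.data values; take the real values from your instance details.

    from pyspark.sql import SparkSession

    # Minimal sketch: register an Iceberg catalog backed by the watsonx.data
    # metastore and point S3A at the object storage that holds the table data.
    # Assumes the Iceberg Spark runtime is on the classpath; every value in
    # angle brackets is a placeholder.
    spark = (
        SparkSession.builder.appName("wxd-spark-app")
        .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lakehouse.type", "hive")
        .config("spark.sql.catalog.lakehouse.uri", "thrift://<metastore-host>:<port>")
        .config("spark.hadoop.fs.s3a.endpoint", "https://<storage-endpoint>")
        .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
        .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
        .getOrCreate()
    )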

  • You must have the Administrator role and privileges in the catalog to perform ingestion through the web console.
  • Add and register IBM Analytics Engine (Spark). See Registering an engine.
  • Add buckets for the source data files and target catalog. See Adding a bucket-catalog pair.
  • Optionally, you can create a schema in the catalog for the target table. See Creating schemas.
  • Optionally, you can also create a target table in the schema. See Creating tables.
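
If you prefer to prepare the target schema and table from Spark itself rather than through the console, the following is a minimal Spark SQL sketch against the session configured above; "lakehouse", "demo_schema", and "demo_table" are placeholder names.

    # Sketch: create the target schema and an Iceberg table through Spark SQL.
    # "lakehouse" is the placeholder catalog name from the earlier session sketch.
    spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse.demo_schema")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lakehouse.demo_schema.demo_table (
            id   BIGINT,
            name STRING
        ) USING iceberg
    """)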

Ingesting data

  1. Log in to IBM® watsonx.data console.

  2. From the navigation menu, select Data manager.

  3. Select the Ingestion jobs tab and click Create ingestion job. The Ingest data window opens with an auto-generated job ID.

  4. If required, modify the auto-generated ingestion job ID in the Enter job ID field.

  5. Select a registered IBM Analytics Engine (Spark) from the Select engine list.

  6. Select a size (Small, Medium, Large, or Custom) for the Spark engine resource configuration based on the volume of source data to be ingested. To customize the configuration, select Custom and specify your own Spark driver cores, executor cores, and memory resources. (The sketch after this procedure shows the equivalent Spark properties.)

  7. In the Select file(s) tab, click Select remote files.

  8. From the Bucket drop-down, select the bucket from where you want to ingest the data.

  9. Select the required file type based on the source data. The available options are CSV and Parquet.

  10. From the source directory, select the source data files to be ingested and click Next.

    For CSV files, you can configure the Header, Encoding, Escape character, Field delimiter, and Line delimiter options. (The sketch after this procedure shows the equivalent Spark read options.)

  11. View the selected files and the corresponding file previews in the File(s) selected and File preview tabs. File preview shows the first 10 rows of the selected source file.

  12. In the Target tab, select the target catalog from the Select catalog list.

  13. Select one of the schema options:

    1. Existing schema: To ingest source data into an existing schema. Corresponding target schemas are listed in the Select schema drop-down.
    2. New schema: Enter the target schema name in Schema name to create a new schema from the source data.
  14. Select the corresponding Target table options based on the selection in step 13.

    1. Existing table: To ingest source data into an existing table. Corresponding target tables are listed in the Select table drop-down.
    2. New table: Enter the target table name in Table name to create a new table from the source data.
  15. Click Next.

  16. Validate the details on the summary page, and click Ingest.
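
For reference, the following is a minimal PySpark sketch of an equivalent programmatic ingestion, assuming a session configured as described in Before you begin. The resource properties correspond to the Custom engine size in step 6, the read options to the CSV configuration in step 10, and the bucket, schema, and table names are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("wxd-ingestion-job")
        # Custom resource configuration (step 6); the values are illustrative.
        .config("spark.driver.cores", "2")
        .config("spark.driver.memory", "4g")
        .config("spark.executor.cores", "2")
        .config("spark.executor.memory", "4g")
        .getOrCreate()
    )

    # Read the source CSV files (step 10). These options correspond to the
    # Header, Encoding, Escape character, Field delimiter, and Line delimiter
    # settings in the console.
    df = (
        spark.read
        .option("header", "true")
        .option("encoding", "UTF-8")
        .option("escape", '"')
        .option("sep", ",")
        .option("lineSep", "\n")
        .csv("s3a://<source-bucket>/<source-directory>/")
    )

    # Write to the target table (steps 12 to 14). createOrReplace() creates a
    # new table; to ingest into an existing table, use
    # df.writeTo(...).append() instead.
    df.writeTo("lakehouse.<schema>.<table>").using("iceberg").createOrReplace()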

Limitations

The following limitations apply to Spark ingestion:

  • Spark ingestion supports source data files only from object storage buckets. Local files are not supported.
  • The default buckets in watsonx.data are not exposed to the Spark engine. Hence, iceberg-bucket and hive-bucket are not supported as the source or target. You can use your own MinIO or S3-compatible buckets that are exposed to and accessible by the Spark engine, as shown in the sketch that follows.
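
As an illustration of the last point, the following sketch uses Hadoop S3A per-bucket properties to expose a self-managed MinIO or other S3-compatible bucket to a Spark session; the bucket name, endpoint, and credentials are placeholders.

    from pyspark.sql import SparkSession

    # Sketch: per-bucket S3A settings for a self-managed, S3-compatible bucket.
    # "my-bucket" and all endpoint and credential values are placeholders.
    spark = (
        SparkSession.builder.appName("wxd-custom-bucket")
        .config("spark.hadoop.fs.s3a.bucket.my-bucket.endpoint", "https://<minio-host>:9000")
        .config("spark.hadoop.fs.s3a.bucket.my-bucket.access.key", "<access-key>")
        .config("spark.hadoop.fs.s3a.bucket.my-bucket.secret.key", "<secret-key>")
        # MinIO deployments typically require path-style request URLs.
        .config("spark.hadoop.fs.s3a.bucket.my-bucket.path.style.access", "true")
        .getOrCreate()
    )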