Running Spark notebook from Watson Studio on Cloud Pak for Data
Applies to: Spark engine, Gluten accelerated Spark engine
This topic provides the procedure to run a sample Spark application by using Watson Studio notebooks. The notebook resides in a Watson Studio project that is available on an IBM Cloud Pak for Data (CPD) cluster.
You can download and run the Spark use case sample in Watson Studio to explore the following functions in watsonx.data (illustrated in the sketch after this list):
- Accessing tables
- Loading data
- Modifying schema
- Performing table maintenance activities
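For a sense of what these functions involve, here is a minimal sketch in Spark SQL, assuming a configured SparkSession named spark as in the notebook. The catalog, schema, and table names (lakehouse, demo_schema, customers) are hypothetical placeholders, not the names used in the sample notebook; expire_snapshots is a standard Iceberg maintenance procedure.

```python
# Hypothetical cells illustrating the four use-case areas. Catalog, schema,
# and table names are placeholders, not the sample notebook's actual names.

# Accessing tables
spark.sql("SHOW TABLES IN lakehouse.demo_schema").show()

# Loading data
spark.sql("""
    INSERT INTO lakehouse.demo_schema.customers
    VALUES (1, 'Alice'), (2, 'Bob')
""")

# Modifying schema
spark.sql("ALTER TABLE lakehouse.demo_schema.customers ADD COLUMNS (email STRING)")

# Table maintenance: expire old Iceberg snapshots
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'demo_schema.customers',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```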
Watson Studio provides sample notebooks that let you run small pieces of code that process your data and immediately view the results of your computation. The notebook includes a sample use case that users can readily download and start working on.
Prerequisites
- Install Watson Studio on the CPD cluster.
- Retrieve the watsonx.data credentials. Get the following information from watsonx.data (the configuration sketch after this list shows where these values plug in):
  - <wxd_hms_endpoint>: Thrift endpoint. For example, thrift://81823aaf-8a88-4bee-a0a1-6e76a42dc833.cfjag3sf0s5o87astjo0.databases.appdomain.cloud:32683. To get the details, log in to your watsonx.data instance and click the Iceberg data catalog in the Infrastructure manager. In the Details tab, copy the Metastore host, which is your <wxd_hms_endpoint>.
  - <wxd_hms_username>: The default value is ibmlhapikey.
  - <wxd_hms_password>: Hive Metastore (HMS) password. Get the password from the watsonx.data administrator.
- Source bucket details: If you bring your own Jupyter notebook, you need the following details of the source bucket where your data resides:
  - <source_bucket_endpoint>: Endpoint of the source bucket. For example, for a source bucket in the Dallas region, the endpoint is s3.direct.us-south.cloud-object-storage.appdomain.cloud. Use the public endpoint.
  - <source_bucket_access_key>: Access key of the source bucket.
  - <source_bucket_secret_key>: Secret key of the source bucket.
- Download the sample notebook.
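For orientation, the following is a minimal sketch of what a notebook configuration cell that consumes these values typically looks like. The property keys follow common Spark, Hive Metastore, and Hadoop S3A conventions; the sample notebook's Configuring IBM Analytics Engine section may use different keys, so treat this as illustrative only. <source_bucket_name> stands for the name of your source bucket.

```python
from pyspark.sql import SparkSession

# All <angle-bracket> values are the placeholders from the prerequisites;
# replace them with the credentials you collected.
spark = (
    SparkSession.builder.appName("watsonx-data-sample")
    # Hive Metastore (HMS) endpoint of the watsonx.data instance
    .config("spark.hive.metastore.uris", "<wxd_hms_endpoint>")
    .config("spark.hive.metastore.use.SSL", "true")
    .config("spark.hive.metastore.client.auth.mode", "PLAIN")
    .config("spark.hive.metastore.client.plain.username", "<wxd_hms_username>")
    .config("spark.hive.metastore.client.plain.password", "<wxd_hms_password>")
    # S3-compatible source bucket where your data resides
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.endpoint",
            "<source_bucket_endpoint>")
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.access.key",
            "<source_bucket_access_key>")
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.secret.key",
            "<source_bucket_secret_key>")
    .getOrCreate()
)
```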
Procedure
To run the Spark sample notebook, complete the following steps:
1. Log in to your Watson Studio account on the IBM Cloud Pak for Data cluster.
2. Create a project. For more information, see Creating a project.
3. Select the project and add a Jupyter Notebook:
   - Click New Assets to create a Jupyter Notebook asset. The New Assets page opens. For more information, see Creating notebooks.
   - Click Code editors.
   - Search for and select the Jupyter Notebook editor. The New notebook page opens.
   - Specify the following details:
     - Name: Type the name of the notebook.
     - Select the Spark runtime. It must be Spark 3.4 with Python 3.10 or 3.11. For other supported Spark versions, see Supported Spark version.
4. Upload and run the IBM published Spark notebook:
   - From the left pane, click Local file.
   - In the Notebook file field, drag the IBM Spark notebook file (provided by IBM) from your local computer.
   - Update the watsonx.data credentials, source bucket, and catalog bucket details in the Configuring IBM Analytics Engine section of the notebook.
5. Click Create. The uploaded notebook opens.
6. Step through the notebook cell by cell by pressing Shift+Enter, or run the entire notebook by clicking Run All from the menu. A quick sanity check you can run after the configuration cell is sketched below.
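As a quick, hypothetical sanity check after the configuration cell has run, a cell like the following confirms that the session can reach the watsonx.data metastore. The schema name is a placeholder, not one defined by the sample notebook.

```python
# Confirm the session can reach the watsonx.data metastore by listing
# the schemas it exposes.
spark.sql("SHOW SCHEMAS").show()

# List tables in a schema (demo_schema is a placeholder name).
spark.sql("SHOW TABLES IN demo_schema").show()
```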