Running Spark notebook from Watson Studio on Cloud Pak for Data
Applies to: Spark engine, Gluten accelerated Spark engine
This topic provides the procedure to run a sample Spark application by using Watson Studio notebooks. The notebook resides in a Watson Studio project that is available on an IBM Cloud Pak for Data (CPD) cluster.
You can download and run the Spark use case sample in Watson Studio to explore the following functions in watsonx.data (illustrated in the sketch after this list):
- Accessing tables
- Loading data
- Modifying schema
- Performing table maintenance activities
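For a sense of what these functions involve, here is a minimal sketch in Spark SQL, assuming a configured SparkSession named spark as in the notebook. The catalog, schema, and table names (lakehouse, demo_schema, customers) are hypothetical placeholders, not the names used in the sample notebook; expire_snapshots is a standard Iceberg maintenance procedure.

```python
# Hypothetical cells illustrating the four use-case areas. Catalog, schema,
# and table names are placeholders, not the sample notebook's actual names.

# Accessing tables
spark.sql("SHOW TABLES IN lakehouse.demo_schema").show()

# Loading data
spark.sql("""
    INSERT INTO lakehouse.demo_schema.customers
    VALUES (1, 'Alice'), (2, 'Bob')
""")

# Modifying schema
spark.sql("ALTER TABLE lakehouse.demo_schema.customers ADD COLUMNS (email STRING)")

# Table maintenance: expire old Iceberg snapshots
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'demo_schema.customers',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```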
Watson Studio provides sample notebooks that let you run small pieces of code that process your data and immediately view the results of your computation. The notebook includes a sample use case that users can readily download and start working on.
Prerequisites
- Install Watson Studio on the CPD cluster.
- Retrieve the watsonx.data credentials. Get the following information from watsonx.data (the configuration sketch after this list shows where these values plug in):
  - <wxd_hms_endpoint>: Thrift endpoint. For example, thrift://81823aaf-8a88-4bee-a0a1-6e76a42dc833.cfjag3sf0s5o87astjo0.databases.appdomain.cloud:32683. To get the details, log in to your watsonx.data instance and click the Iceberg data catalog in the Infrastructure manager. In the Details tab, copy the Metastore host, which is your <wxd_hms_endpoint>.
  - <wxd_hms_username>: The default value is ibmlhapikey.
  - <wxd_hms_password>: Hive Metastore (HMS) password. Get the password from the watsonx.data administrator.
- Source bucket details: If you bring your own Jupyter notebook, you need the following details of the source bucket where your data resides:
  - <source_bucket_endpoint>: Endpoint of the source bucket. For example, for a source bucket in the Dallas region, the endpoint is s3.direct.us-south.cloud-object-storage.appdomain.cloud. Use the public endpoint.
  - <source_bucket_access_key>: Access key of the source bucket.
  - <source_bucket_secret_key>: Secret key of the source bucket.
- Download the sample notebook.
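For orientation, the following is a minimal sketch of what a notebook configuration cell that consumes these values typically looks like. The property keys follow common Spark, Hive Metastore, and Hadoop S3A conventions; the sample notebook's Configuring IBM Analytics Engine section may use different keys, so treat this as illustrative only. <source_bucket_name> stands for the name of your source bucket.

```python
from pyspark.sql import SparkSession

# All <angle-bracket> values are the placeholders from the prerequisites;
# replace them with the credentials you collected.
spark = (
    SparkSession.builder.appName("watsonx-data-sample")
    # Hive Metastore (HMS) endpoint of the watsonx.data instance
    .config("spark.hive.metastore.uris", "<wxd_hms_endpoint>")
    .config("spark.hive.metastore.use.SSL", "true")
    .config("spark.hive.metastore.client.auth.mode", "PLAIN")
    .config("spark.hive.metastore.client.plain.username", "<wxd_hms_username>")
    .config("spark.hive.metastore.client.plain.password", "<wxd_hms_password>")
    # S3-compatible source bucket where your data resides
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.endpoint",
            "<source_bucket_endpoint>")
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.access.key",
            "<source_bucket_access_key>")
    .config("spark.hadoop.fs.s3a.bucket.<source_bucket_name>.secret.key",
            "<source_bucket_secret_key>")
    .getOrCreate()
)
```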
Procedure
To run the Spark sample notebook, complete the following steps:
1. Log in to your Watson Studio account on the IBM Cloud Pak for Data cluster.
2. Create a project. For more information, see Creating a project.
3. Select the project and add a Jupyter Notebook:
   - Click New Assets to create a Jupyter Notebook asset. The New Assets page opens. For more information, see Creating notebooks.
   - Click Code editors.
   - Search for and select the Jupyter Notebook editor. The New notebook page opens.
   - Specify the following details:
     - Name: Type the name of the notebook.
     - Select the Spark runtime. It must be Spark 3.4 with Python 3.10 or 3.11. For other supported Spark versions, see Supported Spark version.
4. Upload and run the IBM published Spark notebook:
   - From the left pane, click Local file.
   - In the Notebook file field, drag the IBM Spark notebook file (provided by IBM) from your local computer.
   - Update the watsonx.data credentials, source bucket, and catalog bucket details in the Configuring IBM Analytics Engine section of the notebook.
5. Click Create. The uploaded notebook opens.
6. Step through the notebook cell by cell by pressing Shift+Enter, or run the entire notebook by clicking Run All from the menu. A quick sanity check you can run after the configuration cell is sketched below.
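As a quick, hypothetical sanity check after the configuration cell has run, a cell like the following confirms that the session can reach the watsonx.data metastore. The schema name is a placeholder, not one defined by the sample notebook.

```python
# Confirm the session can reach the watsonx.data metastore by listing
# the schemas it exposes.
spark.sql("SHOW SCHEMAS").show()

# List tables in a schema (demo_schema is a placeholder name).
spark.sql("SHOW TABLES IN demo_schema").show()
```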