Provisioning Gluten accelerated Spark engine

Gluten accelerated Spark engine is an optimized, high-performance engine in watsonx.data. The Spark engine uses Gluten to offload SQL execution to Velox, an open source execution engine implemented in C++. This accelerates Spark SQL computation and reduces the cost of running workloads.

Features of Gluten accelerated Spark engine

Supports the Apache Parquet and Apache Avro file formats.
Improved table scan performance.
Accelerates larger SQL queries with joins and aggregation.
Supports Delta, Hudi, Iceberg and Hive catalogs.

Provisioning Gluten accelerated Spark engine

IBM watsonx.data allows you to provision a Gluten accelerated Spark engine to run complex, large-scale workloads. Gluten performs best when it runs on larger hardware configurations.

You can use either of the following methods to provision a Gluten accelerated Spark engine:

Provisioning through Console
Provisioning through API

Prerequisites

You must have a subscription to watsonx.data on Cloud.
To use Gluten accelerated Spark engine, you must contact the IBM Support team to enable this feature in your watsonx.data instance.
<engine-home-bucket>: You must create a storage in watsonx.data that is associated with your Gluten accelerated Spark engine and stores its logs.

Provisioning through Console

To add a Gluten accelerated Spark engine, complete the following steps.

Log in to the watsonx.data console.
From the navigation menu, select Infrastructure manager.
Click Add component, select IBM Spark, and click Next.
In the Add component - IBM Spark page, from the Type section, select Gluten accelerated Spark engine.

In the Add component - IBM Spark page, configure the following details:

a. In the Add component - Gluten accelerated Spark engine window, enter the Display name for your Gluten accelerated Spark engine.
c. Configure the following details:

Field	Description
Default Spark version	Select the Spark runtime version to use for processing the applications. Gluten accelerated Spark engine supports version 3.4.
Engine home bucket	Select the registered Cloud Object Storage bucket from the list to store the Spark events and logs that are generated while running Spark applications. Note: Do not select the IBM-managed bucket as the Spark engine home. If you select an IBM-managed bucket, you cannot access it to view the logs. For more information, see Before you begin.
Reserve capacity	Select the Node Type. Enter the number of nodes in the No of nodes field.
Associated catalogs (optional)	Select the catalogs that must be associated with the engine.

Click Create. The engine is provisioned and is displayed on the Infrastructure manager page.
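
The console fields above can also be captured as a small pre-flight check before you provision an engine. The following Python sketch is illustrative only: the class name GlutenEngineSettings and its validate method are hypothetical and are not part of watsonx.data. The only rule taken from this page is that the engine supports Spark version 3.4; the remaining checks (non-empty names, at least one node) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# The only Spark runtime version this page lists for the Gluten accelerated engine.
SUPPORTED_SPARK_VERSION = "3.4"


@dataclass
class GlutenEngineSettings:
    """Hypothetical container mirroring the console fields; not a watsonx.data API."""

    display_name: str
    engine_home_bucket: str
    node_type: str
    number_of_nodes: int
    default_spark_version: str = SUPPORTED_SPARK_VERSION
    associated_catalogs: List[str] = field(default_factory=list)  # optional field

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings look usable."""
        problems = []
        if not self.display_name.strip():
            problems.append("Display name is required.")
        if self.default_spark_version != SUPPORTED_SPARK_VERSION:
            problems.append("Gluten accelerated Spark engine supports version 3.4.")
        if not self.engine_home_bucket.strip():
            problems.append("Engine home bucket is required.")
        if not self.node_type.strip():
            problems.append("Node type is required.")
        if self.number_of_nodes < 1:  # assumption: at least one node is needed
            problems.append("Number of nodes must be at least 1.")
        return problems
```

Collecting all problems in a list, rather than raising on the first one, lets you report every missing field in a single pass.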

Provisioning through API

Use the following curl command to create a Gluten accelerated Spark engine.

curl -X POST "https://<region>.lakehouse.cloud.ibm.com/lakehouse/api/v2/spark_engines" \
  -H "content-type: application/json" \
  -H "accept: application/json" \
  -H "AuthInstanceId: <CRN>" \
  -d '{
    "description": "",
    "engine_details": {
        "default_version": "3.4",
        "scale_config": {
            "node_type": "small",
            "number_of_nodes": 1
        },
        "engine_home_bucket_name": "<engine-home-bucket>"
    },
    "engine_display_name": "<gluten_engine_name>",
    "associated_catalogs": [
        "<catalog_name>"
    ],
    "origin": "native",
    "type": "gluten"
}'

Parameter values:

<region>: The region where the watsonx.data instance is available.
<CRN>: The watsonx.data instance CRN. You can retrieve the CRN from the watsonx.data information page.
<engine-home-bucket>: The name of the storage bucket that stores the Spark events and logs, which you use to monitor and debug Spark applications.
<gluten_engine_name>: Specify a name for the Gluten accelerated Spark engine.
<catalog_name>: Specify the name of the catalog that you use. Gluten accelerated Spark engine supports Iceberg, Hudi, Delta, and Hive catalogs.
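
As a rough equivalent of the curl command, the following Python sketch assembles the same request with the standard library. The function name build_create_engine_request and its arguments are hypothetical; the URL, headers, and payload fields are taken from the command in this section. The sketch only builds the request and does not send it, and any authentication beyond the AuthInstanceId header is not covered here.

```python
import json
import urllib.request
from typing import List


def build_create_engine_request(
    region: str,
    crn: str,
    engine_home_bucket: str,
    engine_name: str,
    catalogs: List[str],
    node_type: str = "small",
    number_of_nodes: int = 1,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates the engine."""
    payload = {
        "description": "",
        "engine_details": {
            "default_version": "3.4",
            "scale_config": {
                "node_type": node_type,
                "number_of_nodes": number_of_nodes,
            },
            "engine_home_bucket_name": engine_home_bucket,
        },
        "engine_display_name": engine_name,
        "associated_catalogs": list(catalogs),
        "origin": "native",
        "type": "gluten",
    }
    url = f"https://{region}.lakehouse.cloud.ibm.com/lakehouse/api/v2/spark_engines"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "accept": "application/json",
            "AuthInstanceId": crn,
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the request. Inspecting its full_url and data attributes first is a simple way to confirm that every placeholder was filled in as intended.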