Spark application REST API

The IBM Analytics Engine serverless plan provides REST APIs to submit and manage Spark applications. This topic covers the following tasks:

  1. Get the required credentials and set permissions.
  2. Submit the Spark application.
  3. Retrieve the state of a submitted Spark application.
  4. Retrieve the details of a submitted Spark application.
  5. Stop a running Spark application.

For a description of the available APIs, see the IBM Analytics Engine REST APIs for the serverless plan.

The following sections in this topic show samples for each of the Spark application management APIs.

Required credentials and permissions

Before you can submit a Spark application, you need to get authentication credentials and set the correct permissions on the Analytics Engine serverless instance.

  1. You need the GUID of the service instance you noted down when you provisioned the instance. If you didn't make a note of the GUID, see Retrieving the GUID of a serverless instance.
  2. You must have the correct permissions to perform the required operations. See User permissions.
  3. The Spark application REST APIs use IAM-based authentication and authorization; a sketch for retrieving an IAM access token programmatically follows this list.
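
If you want to retrieve the access token programmatically, the following Python sketch exchanges an IBM Cloud API key for an IAM access token by calling the IAM identity endpoint. The use of the requests library and the variable names are assumptions for this example; see Retrieving IAM access tokens for the authoritative steps.

# A minimal sketch, assuming the requests library is installed and
# <api_key> is replaced with your IBM Cloud API key.
import requests

IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def get_iam_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for an IAM access token."""
    response = requests.post(
        IAM_TOKEN_URL,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        data={
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["access_token"]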

Submitting a Spark application

Analytics Engine Serverless provides you with a REST interface to submit Spark applications. The payload passed to the REST API maps to various command-line arguments supported by the spark-submit command. See Parameters for submitting Spark applications for more details.

When you submit a Spark application, you need to reference the application file. To help you get started quickly and learn how to use the Analytics Engine serverless Spark APIs, this section begins with an example that uses pre-bundled Spark application files that are referenced in the submit application API payload. The subsequent section shows you how to run applications that are stored in an Object Storage bucket.

Referencing pre-bundled files

The provided sample application shows you how to reference a .py word count application file and a data file in a job payload.

To learn how to quickly get started using pre-bundled sample application files:

  1. Generate an IAM token if you haven’t already done so. See Retrieving IAM access tokens.
  2. Export the token into a variable:
    export token=<token generated>
    
  3. Prepare the payload JSON file. For example, submit-spark-quick-start-app.json:
    {
      "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"]
      }
    }
    
  4. Submit the Spark application:
    curl -X POST https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<instance_id>/spark_applications --header "Authorization: Bearer $token" -H "content-type: application/json"  -d @submit-spark-quick-start-app.json
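
If you prefer to call the API from a script, the following Python sketch submits the same payload with the requests library. The instance ID, region endpoint, and token are placeholders that you must replace; the use of requests and the long client-side timeout are assumptions for this example.

# A minimal sketch, assuming the requests library is installed and the
# placeholders below are replaced with your own values.
import requests

INSTANCE_ID = "<instance_id>"   # GUID of your Analytics Engine serverless instance
TOKEN = "<token generated>"     # IAM access token
BASE_URL = f"https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/{INSTANCE_ID}"

payload = {
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": ["/opt/ibm/spark/examples/src/main/resources/people.txt"],
    }
}

response = requests.post(
    f"{BASE_URL}/spark_applications",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=120,  # submission can take around a minute
)
response.raise_for_status()
print(response.json())  # contains the application "id" and its initial "state"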
    

Referencing files from an Object Storage bucket

To reference your Spark application file from an Object Storage bucket, you need to create a bucket, add the file to the bucket and then reference the file from your payload JSON file.

The endpoint to your IBM Cloud Object Storage instance in the payload JSON file should be a direct endpoint, as shown in the samples in this topic. Direct endpoints provide better performance than public endpoints and do not incur charges for any outgoing or incoming bandwidth.

To submit a Spark application:

  1. Create a bucket for your application file. See Bucket operations for details on creating buckets.

  2. Add the application file to the newly created bucket. See Upload an object for adding your application file to the bucket.

  3. Generate an IAM token if you haven’t already done so. See Retrieving IAM access tokens.

  4. Export the token into a variable:

    export token=<token generated>
    
  5. Prepare the payload JSON file. For example, submit-spark-app.json:

    {
      "application_details": {
         "application": "cos://<application-bucket-name>.<cos-reference-name>/my_spark_application.py",
         "arguments": ["arg1", "arg2"],
         "conf": {
            "spark.hadoop.fs.cos.<cos-reference-name>.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud",
            "spark.hadoop.fs.cos.<cos-reference-name>.access.key": "<access_key>",
            "spark.hadoop.fs.cos.<cos-reference-name>.secret.key": "<secret_key>",
            "spark.app.name": "MySparkApp"
         }
      }
    }
    

    Note:

    • You can pass Spark application configuration values through the "conf" section in the payload. See Parameters for submitting Spark applications for more details.
    • <cos-reference-name> in the "conf" section of the sample payload is any name that you give to your IBM Cloud Object Storage instance; use the same name in the cos:// URL of the "application" parameter. See Understanding the Object Storage credentials.
    • It might take approximately a minute to submit the Spark application. Make sure that you set a sufficient timeout in the client code.
    • Make a note of the "id" returned in the response. You need this value to perform operations such as getting the state of the application, retrieving the details of the application, or deleting the application.
  6. Submit the Spark application:

    curl -X POST https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<instance_id>/spark_applications --header "Authorization: Bearer $token" -H "content-type: application/json"  -d @submit-spark-app.json
    

    Sample response:

    {
      "id": "87e63712-a823-4aa1-9f6e-7291d4e5a113",
      "state": "accepted"
    }
    
  7. If forward logging was enabled for your instance, you can view the application output in the platform logs that are forwarded to IBM Log Analysis. For details, see Configuring and viewing logs.

Passing the Spark configuration to an application

You can use the "conf" section in the payload to pass the Spark application configuration. If you specified Spark configurations at the instance level, those are inherited by the Spark applications run on the instance, but can be overridden at the time a Spark application is submitted by including the "conf" section in the payload.

See Spark configuration in Analytics Engine Serverless.
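
As a rough illustration, the following Python dictionary sketches an "application_details" section that overrides a few instance-level settings for a single submission. The specific keys appear in the parameter table below; the memory, executor count, and application name values are assumptions for the example.

# Illustrative only: "conf" values set here override any instance-level
# defaults for this one application submission.
application_details = {
    "application": "cos://<application-bucket-name>.<cos-reference-name>/my_spark_application.py",
    "conf": {
        "spark.driver.memory": "4G",     # overrides the instance-level driver memory
        "spark.executor.memory": "4G",   # overrides the instance-level executor memory
        "ae.spark.executor.count": "2",  # overrides the instance-level executor count
        "spark.app.name": "MySparkAppWithOverrides",
    },
}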

Parameters for submitting Spark applications

The following table lists the mapping between the spark-submit command parameters and their equivalents that are passed in the "application_details" section of the Spark application submission REST API payload.

Table 1. Mapping between the spark-submit command parameters and their equivalents passed in the payload
spark-submit command parameter | Payload to the Analytics Engine Spark submission REST API
<application binary passed as spark-submit command parameter> | application_details -> application
<application-arguments> | application_details -> arguments
class | application_details -> class
jars | application_details -> jars
name | application_details -> name or application_details -> conf -> spark.app.name
packages | application_details -> packages
repositories | application_details -> repositories
files | application_details -> files
archives | application_details -> archives
driver-cores | application_details -> conf -> spark.driver.cores
driver-memory | application_details -> conf -> spark.driver.memory
driver-java-options | application_details -> conf -> spark.driver.defaultJavaOptions
driver-library-path | application_details -> conf -> spark.driver.extraLibraryPath
driver-class-path | application_details -> conf -> spark.driver.extraClassPath
executor-cores | application_details -> conf -> spark.executor.cores
executor-memory | application_details -> conf -> spark.executor.memory
num-executors | application_details -> conf -> ae.spark.executor.count
py-files | application_details -> conf -> spark.submit.pyFiles
<environment-variables> | application_details -> env -> {"key1" : "value1", "key2" : "value2", ...}
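
For example, a spark-submit call that sets a main class, driver memory, and executor cores maps roughly to the following "application_details" section, sketched here as a Python dictionary. The class name, memory, and core values are hypothetical placeholders; see the IBM Analytics Engine REST API reference for the exact value formats.

# Hypothetical example of how common spark-submit options map into the payload.
application_details = {
    "application": "cos://<application-bucket-name>.<cos-reference-name>/my-spark-app.jar",
    "class": "com.example.MySparkApp",   # --class
    "arguments": ["arg1", "arg2"],       # application arguments
    "conf": {
        "spark.driver.memory": "2G",     # --driver-memory
        "spark.executor.cores": "2",     # --executor-cores
        "ae.spark.executor.count": "2",  # --num-executors
    },
    "env": {"STAGE": "test"},            # environment variables
}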

Getting the state of a submitted application

To get the state of a submitted application, enter:

curl -X GET https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<instance_id>/spark_applications/<application_id>/state --header "Authorization: Bearer $token"

Sample response:

{
    "id": "a9a6f328-56d8-4923-8042-97652fff2af3",
    "state": "finished",
    "start_time": "2020-11-25T14:14:31.311+0000",
    "finish_time": "2020-11-25T14:30:43.625+0000"
}
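
If you script against this endpoint, you can poll until the application reaches a terminal state. The following Python sketch assumes the requests library, the BASE_URL and token conventions from the submission sketch above, and a set of terminal states based on the states shown in this topic.

# A minimal polling sketch; the terminal-state list is an assumption.
import time

import requests

def wait_for_application(base_url: str, token: str, application_id: str, poll_seconds: int = 30) -> str:
    """Poll the application state until it reaches a terminal state, then return it."""
    terminal_states = {"finished", "failed", "stopped"}
    while True:
        response = requests.get(
            f"{base_url}/spark_applications/{application_id}/state",
            headers={"Authorization": f"Bearer {token}"},
            timeout=60,
        )
        response.raise_for_status()
        state = response.json()["state"]
        if state in terminal_states:
            return state
        time.sleep(poll_seconds)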

Getting the details of a submitted application

To get the details of a submitted application, enter:

curl -X GET https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<instance_id>/spark_applications/<application_id> --header "Authorization: Bearer $token"

Sample response:

{
  "id": "ecd608d5-xxxx-xxxx-xxxx-08e27456xxxx",
  "spark_application_id": "null",
  "application_details": {
      "application": "cos://sbn-test-bucket-serverless-1.mycosservice/my_spark_application.py",
      "conf": {
          "spark.hadoop.fs.cos.mycosservice.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud",
          "spark.hadoop.fs.cos.mycosservice.access.key": "xxxx",
          "spark.app.name": "MySparkApp",
          "spark.hadoop.fs.cos.mycosservice.secret.key": "xxxx"
      },
      "arguments": [
          "arg1",
          "arg2"
      ]
  },
  "state": "failed",
    "submission_time": "2021-11-30T18:29:21+0000"
}

Stopping a submitted application

To stop a submitted application, run the following:

curl -X DELETE https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<instance_id>/spark_applications/<application_id> --header "Authorization: Bearer $token"

If the request is successful, the API returns 204 - No Content and the state of the application is set to stopped.

This API is idempotent. If you attempt to stop an application that already completed or was already stopped, the API still returns 204.

You can use this API to stop an application in the following states: accepted, waiting, submitted, and running.
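
A minimal Python sketch of the stop call, assuming the requests library and the BASE_URL and token conventions used in the earlier sketches:

# Stop a submitted application and check for the 204 response.
import requests

def stop_application(base_url: str, token: str, application_id: str) -> bool:
    """Return True if the stop request succeeded (HTTP 204 - No Content)."""
    response = requests.delete(
        f"{base_url}/spark_applications/{application_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    return response.status_code == 204  # idempotent: 204 even if already stopped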

Passing the runtime Spark version when submitting an application

You can use the "runtime" section under "application_details" in the payload JSON script to pass the Spark runtime version when submitting an application. The Spark version passed through the "runtime" section overrides the default runtime Spark version set at the instance level. To learn more about the default runtime version, see Default Spark runtime.

Example of the "runtime" section to run an application in Spark 3.3:

{
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": [
            "/opt/ibm/spark/examples/src/main/resources/people.txt"
        ],
        "runtime": {
            "spark_version": "3.3"
        }
    }
}

Using environment variables

When submitting an application, you can use the "env" section under "application_details" in the payload JSON file to pass environment-specific information that the application needs, for example the data sets to use or any secret values.

Example of the "env" section in the payload:

{
    "application_details": {
        "application": "cos://<application-bucket-name>.<cos-reference-name>/my_spark_application.py",
        "arguments": ["arg1", "arg2"],
        "conf": {
            "spark.hadoop.fs.cos.<cos-reference-name>.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud",
            "spark.hadoop.fs.cos.<cos-reference-name>.access.key": "<access_key>",
            "spark.hadoop.fs.cos.<cos-reference-name>.secret.key": "<secret_key>",
            "spark.app.name": "MySparkApp"
            },
        "env": {
            "key1": "value1",
            "key2": "value2",
            "key3": "value3"
            }
        }
}

Environment variables that are set through "application_details" > "env", as described here, are accessible to both the driver and the executor code.

Environment variables can also be set through the "spark.executorEnv.[EnvironmentVariableName]" configuration property (application_details > conf). Variables set this way are accessible only to the tasks that run on the executors, not to the driver.
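
For example, an executor-only variable set through the "conf" section might look like the following Python dictionary; the variable name matches the TESTENV1 variable used in the sample application below, and the value is a placeholder.

# Illustrative only: TESTENV1 set this way is visible to executor tasks,
# but not to the driver process.
application_details = {
    "application": "cos://<application-bucket-name>.<cos-reference-name>/my_spark_application.py",
    "conf": {
        "spark.executorEnv.TESTENV1": "test-value",
    },
}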

Example of a PySpark application that accesses the passed environment variables by using the "os.getenv" call:

from pyspark.sql import SparkSession
import os

def init_spark():
  spark = SparkSession.builder.appName("spark-env-test").getOrCreate()
  sc = spark.sparkContext
  return spark,sc

def returnExecutorEnv(x):
    # Attempt to access environment variable from a task running on executor
    return os.getenv("TESTENV1")

def main():
  spark,sc = init_spark()

  # dummy dataframe
  df=spark.createDataFrame([("1","one")])
  df.show()
  df.rdd.map(lambda x: (x[0],returnExecutorEnv(x[0]))).toDF().show()
  # Attempt to access environment variable on driver
  print (os.getenv("TESTENV1"))
  spark.stop()

if __name__ == '__main__':
  main()

Running a Spark application with a non-default language version

The Spark runtime supports Spark applications written in the following languages:

  • Scala
  • Python
  • R

A Spark runtime version comes with a default language version for each of these languages. IBM adds support for new language versions and removes existing ones to keep the runtime free from security vulnerabilities. Whenever a new language version is introduced, you are given time to transition your workloads. You can test your workload against a specific language version by passing an environment variable that points to the language version that the application requires.

Sample Python code:

{
    "application_details": {
        "application": "/opt/ibm/spark/examples/src/main/python/wordcount.py",
        "arguments": [
            "/opt/ibm/spark/examples/src/main/resources/people.txt"
        ],
        "env": {
            "RUNTIME_PYTHON_ENV": "python310"
        }
    }
}

Sample R code:

{
    "application_details": {
        "env": {
            "RUNTIME_R_ENV": "r42"
        },
        "application": "/opt/ibm/spark/examples/src/main/r/dataframe.R"
    }
}

Learn more

When managing your Spark applications, follow the recommended Best practices.