Connecting Apache Spark with Data Engine

IBM Cloud® Data Engine is deprecated. As of 18 February 2024, you can't create new instances, and access to free instances will be removed. Existing Standard plan instances are supported until 18 January 2025. Any instances that still exist on that date will be deleted.

The Data Engine catalog provides an interface that is compatible with the Apache Hive metastore. This unified metadata repository enables any Big Data engine, such as Apache Spark, to use Data Engine as its metastore. The same table and view definitions can be created once and used from any connected engine. Each instance of Data Engine exports its catalog as a database called default.
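
For example, after a Spark session is connected to Data Engine (as shown in the sections below), the tables of your instance appear under the default database:

# minimal sketch; assumes a Spark session that is already connected
# to Data Engine as described in the following sections
spark.sql('show tables from default').show(truncate=False)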

Catalog usage within Data Engine

Within Data Engine, the catalog can be used in both read and write mode. Access is seamless, and no configuration steps are needed.
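
For illustration, a table definition can be registered with a DDL statement that you run against your instance, for example through the ibmcloudsql Python client. This is a minimal sketch; the bucket URLs and the table layout are placeholders to adapt:

import ibmcloudsql

# placeholders: use your own API key, instance CRN, and bucket URLs
sql_client = ibmcloudsql.SQLQuery('<yourAPIkey>', '<yourDataengineCRN>',
                                  target_cos_url='cos://us-geo/<your-bucket>/results/')
sql_client.logon()

# the definition is stored once in the catalog and becomes visible
# to every engine that connects to this instance
sql_client.run_sql("""
    CREATE TABLE my_table (id INT, name STRING)
    USING CSV
    LOCATION cos://us-geo/<your-bucket>/data/
""")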

Connecting Apache Spark with Data Engine

When you use the Hive metastore compatible interface, access is limited to read-only operations. Existing tables and views can be queried, but not modified.

The steps vary depending on where you want to connect to your catalog from.

Usage within Watson Studio Notebooks

Watson Studio already includes the compatible Hive metastore client and a convenience library that configures the connection to the Hive metastore. Set the required variables (the instance CRN and an API key) and call the helper function to connect to the Hive metastore:

from dataengine import SparkSessionWithDataengine

# replace the CRN and the API key with the values for your instance
crn = 'yourDataengineCRN'
apikey = 'yourAPIkey'

# the helper function returns a session builder that is equipped with the correct configuration
session_builder = SparkSessionWithDataengine.enableDataengine(crn, apikey, "public")
spark = session_builder.appName("Spark DataEngine integration test").getOrCreate()

Display your tables and run an SQL statement:

spark.sql('show tables').show(truncate=False)

# replace yourTable with a valid table. Do not use the sample tables, because you do not have access to their data!
spark.sql('select * from yourTable').show()

Usage with IBM Analytics Engine

The Hive metastore client is also already included in IBM® Analytics Engine, along with the convenience library that configures the connection to the Hive metastore. The following example shows a Python Spark batch job that lists tables:

import sys
from dataengine import SparkSessionWithDataengine

if __name__ == '__main__':
    crn = sys.argv[1]
    apikey = sys.argv[2]

    print(" Start SparkSessionWithDataengine example")
    session_builder = SparkSessionWithDataengine.enableDataengine(crn, apikey, "public")

    print(" Setup IBM Cloud Object Storage access")
    spark = session_builder.appName("AnalyticEngine DataEngine integration") \
          .config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem") \
          .config("fs.stocator.scheme.list", "cos") \
          .config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient") \
          .getOrCreate()

    print(" Got a spark session, listing all tables")
    spark.sql('show tables').show()

    spark.stop()

Prepare a JSON file to start that program, as in the following example (listTablesExample.json):

{
  "application_details": {
     "application": "cos://<your-bucket>.<cos-region>/listTablesExample.py",
     "arguments": ["<Data-Engine-instance-CRN>", "<API-key-to-access-data-engine-instance>"],
     "conf": {
        "ae.spark.executor.count":"1",
        "ae.spark.autoscale.enable":"false",
        "spark.hadoop.fs.cos.<cos-region>.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud",
        "spark.hadoop.fs.cos.<cos-region>.iam.api.key": "<API-key-to-access-python-file>",
        "spark.app.name": "DataEngineHiveAccess"
     }
  }
}

Start the application by using a curl command, as in the following example:

curl -X POST https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/<GUID of Analytic Engine>/spark_applications --header "Authorization: Bearer $TOKEN" -H "content-type: application/json"  -d @listTablesExample.json
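
The command assumes that the TOKEN environment variable holds a valid IAM bearer token. For example, you can obtain one by exchanging your API key at the IAM token endpoint and exporting the access_token field of the JSON response:

curl -X POST "https://iam.cloud.ibm.com/identity/token" \
  --header "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode "grant_type=urn:ibm:params:oauth:grant-type:apikey" \
  --data-urlencode "apikey=<your-API-key>"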

Apache Spark Data Engine integration

For self-hosted Apache Spark installations, use the following instructions.

  1. Ensure that Stocator is installed according to the instructions provided.

  2. Download the Hive-compatible client with the provided instructions.

  3. Either download the provided convenience libraries or, if you don't want to use them, set the following options on the SparkSession builder yourself:

    from pyspark.sql import SparkSession

    # parentheses allow comments between the chained .config() calls
    spark = (
        SparkSession.builder.appName('Python-App')
        .config("spark.sql.pyspark.jvmStacktrace.enabled", True)
        .config("spark.hive.metastore.truststore.path", "file:///opt/ibm/jdk/jre/lib/security/cacerts")
        # to access IBM Cloud Object Storage, ensure that Stocator is available
        .config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
        .config("fs.stocator.scheme.list", "cos")
        .config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
        .config("fs.stocator.cos.scheme", "cos")
        # register the Cloud Object Storage endpoints that your application uses; add endpoints for all buckets
        .config("spark.hadoop.fs.cos.us-geo.endpoint", "https://s3.us.cloud-object-storage.appdomain.cloud")
        .config("spark.hadoop.fs.cos.us-geo.iam.endpoint", "https://iam.cloud.ibm.com/identity/token")
        .config("spark.hadoop.fs.cos.us-geo.iam.api.key", '<YourAPIkey>')
        .config("spark.sql.hive.metastore.version", "3.0")
        # directory where the Hive client has been placed
        .config("spark.sql.hive.metastore.jars", "/tmp/dataengine/*")
        # use the endpoint for your region from Table 1
        .config("spark.hive.metastore.uris", "thrift://catalog.<region>.dataengine.cloud.ibm.com:9083")
        .config("spark.hive.metastore.use.SSL", "true")
        .config("spark.hive.metastore.truststore.password", "changeit")
        .config("spark.hive.metastore.client.auth.mode", "PLAIN")
        .config("spark.hive.metastore.client.plain.username", '<YourDataengineCRN>')
        .config("spark.hive.metastore.client.plain.password", '<YourAPIkey>')
        .config("spark.hive.execution.engine", "spark")
        .config("spark.hive.stats.autogather", "false")
        .config("spark.sql.warehouse.dir", "file:///tmp")
        # only spark is allowed as the default catalog
        .config("metastore.catalog.default", "spark")
        .enableHiveSupport()
        .getOrCreate()
    )
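
After the session is created, verify the connection by listing the tables of your Data Engine catalog:

spark.sql('show tables').show(truncate=False)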
    

Apache Hive metastore version 3.1.2 compatible client

Download the Hive-compatible client and place it in a directory of your Apache Spark cluster that is not on the classpath. This step is necessary because the client is loaded into an isolated classloader to avoid version conflicts. The examples assume that the files are placed in /tmp/dataengine; if you use a different folder, adjust the examples accordingly.
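
For example (the file names are illustrative; use the files that you downloaded):

mkdir -p /tmp/dataengine
cp <download-directory>/*.jar /tmp/dataengine/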

The client differs from the Hive version 3.1.2 release by enhancements that add support for TLS and for authentication through IBM Cloud® Identity and Access Management. As the user, specify the instance CRN; as the password, specify a valid API key with access to your Data Engine instance. Find the endpoint to use in the following table.

Table 1. Region endpoints
Region     Endpoint
us-south   thrift://catalog.us.dataengine.cloud.ibm.com:9083
eu-de      thrift://catalog.eu-de.dataengine.cloud.ibm.com:9083

Convenience libraries to configure Spark

While the Data Engine catalog is compatible with the Hive metastore and can be used like any other external Hive metastore server, an SDK is provided to minimize the steps that are needed to configure Apache Spark. The SDK simplifies connecting to the Hive metastore and to the IBM Cloud Object Storage buckets that are referenced by tables or views.

If you use Python, download both the Scala and the Python SDK and place them in a folder that is on the classpath of your Apache Spark cluster. If you use Scala, the Scala SDK is sufficient.
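
Alternatively, a self-hosted installation can pass the libraries at submit time instead of placing them on the classpath manually. A sketch, with placeholder file names:

spark-submit \
  --jars /path/to/<scala-sdk>.jar \
  --py-files /path/to/<python-sdk>.zip \
  your_application.py <yourDataengineCRN> <yourAPIkey>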

An example of how to use the Python helper can be found in the Watson Studio Notebooks section.

Use the following example to get started with IBM® Analytics Engine (IAE) or with Spark runtimes in Watson Studio when you use Scala. Submit the application by using a notebook or the spark-submit command:

package com.ibm.cloud.dataengine
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession.{Builder => SessionBuilder}
// the SDK add-on provides the enableDataengine method for the session builder
import SparkSessionBuilderAddOn._

object SparkSessionBuilderHMSConfigTest {
  // expects the Data Engine instance CRN as args(0) and an API key as args(1)
  def main(args: Array[String]) = {
    val spark = SparkSession
      .builder()
      .appName("Spark DataEngine integration")
      .enableDataengine(args(0), args(1), "public")
      .config("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
      .config("fs.stocator.scheme.list", "cos")
      .config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
      .config("fs.stocator.cos.scheme", "cos")
      .getOrCreate()
    println("Got a spark session, listing all tables")
    val sqlDf = spark.sql("SHOW TABLES")
    sqlDf.show()
  }
}
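
To run the compiled application on IBM Analytics Engine, submit it through the same v3 applications API that is shown earlier. The following payload is a sketch that assumes the application was packaged as a jar and uploaded to your bucket:

{
  "application_details": {
     "application": "cos://<your-bucket>.<cos-region>/dataengine-integration.jar",
     "class": "com.ibm.cloud.dataengine.SparkSessionBuilderHMSConfigTest",
     "arguments": ["<Data-Engine-instance-CRN>", "<API-key-to-access-data-engine-instance>"],
     "conf": {
        "spark.hadoop.fs.cos.<cos-region>.endpoint": "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud",
        "spark.hadoop.fs.cos.<cos-region>.iam.api.key": "<API-key-to-access-jar-file>"
     }
  }
}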

Troubleshooting error messages

If you receive the following error message when you run an SQL statement, check which of the possible causes applies.

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

The following possible causes might exist:

  • The CRN is invalid, or the service instance does not exist.
  • The API key is invalid.
  • The Data Engine instance uses the Lite plan.

If you receive the following error message, check the possible cause that follows it.

Py4JJavaError: An error occurred while calling o8820.sql.
: java.util.concurrent.ExecutionException: com.ibm.stocator.fs.common.exception.ConfigurationParseException: Configuration parse exception: Access KEY is empty. Please provide valid access key

The following possible cause might exist:

  • Tables that are based on the sample data provided by Data Engine do not work for a SELECT statement, because you do not have access to the underlying data. Use data that is stored in your own Cloud Object Storage bucket instead.