Working with Spark SQL and an external metastore

Spark SQL uses a Hive metastore to manage the metadata of your applications' tables, including their columns and partition information.
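
For example, when a Spark application creates a table, it is the metastore that records the table's name and schema. A minimal sketch (the application and table names are illustrative):

```python
from pyspark.sql import SparkSession

# Enable Hive support so that Spark SQL stores table metadata
# (names, columns, partitions) in the Hive metastore.
spark = SparkSession.builder \
    .appName("metastore-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# CREATE TABLE registers the table's name, columns, and types
# in the metastore.
spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)")

# List the tables the metastore knows about.
spark.sql("SHOW TABLES").show()
```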

By default, the database that powers this metastore is an embedded Derby instance that comes with the Spark cluster. You can choose to externalize this metastore database to an external data store, such as an IBM Cloud Databases for PostgreSQL instance or an IBM Cloud Data Engine (previously SQL Query) instance.
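
As a rough illustration, the standard Hive JDBC connection properties can be passed to Spark to point the metastore at a PostgreSQL database. This is a sketch, not the definitive IBM Analytics Engine configuration: the host, port, database name, and credentials are placeholders, and the exact settings your instance expects may differ.

```python
from pyspark.sql import SparkSession

# Sketch: point the Hive metastore at an external PostgreSQL
# database instead of the embedded Derby instance.
# All <angle-bracket> values are placeholders.
spark = SparkSession.builder \
    .appName("external-metastore") \
    .enableHiveSupport() \
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:postgresql://<pg-host>:<port>/<database>") \
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "org.postgresql.Driver") \
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "<username>") \
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<password>") \
    .getOrCreate()
```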

Placing your metadata outside of the Spark cluster enables you to reference the same tables in different applications across your IBM Analytics Engine instances. In combination with storing your data in IBM Cloud Object Storage, this persists both data and metadata beyond the lifetime of any single cluster and allows you to work with the data seamlessly across different Spark workloads.
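
For instance, a table can keep its data files in a Cloud Object Storage bucket while its definition lives in the external metastore. A sketch, continuing with the `spark` session configured above; the bucket and service names are placeholders:

```python
# Create a table whose data lives in IBM Cloud Object Storage
# while its metadata lives in the external metastore.
# <bucket> and <service-name> are placeholders; the cos:// scheme
# is provided by the Stocator connector in IBM Analytics Engine.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)
    USING parquet
    LOCATION 'cos://<bucket>.<service-name>/customers'
""")
```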

Enabling and testing an external metastore with IBM Analytics Engine

To enable and test an external metastore with IBM Analytics Engine, you need to perform the following steps:

  1. Create a metastore to store the metadata. You can choose to provision either an IBM Cloud Databases for PostgreSQL instance or an IBM Cloud Data Engine (previously SQL Query) instance.
  2. Configure IBM Analytics Engine to work with the database instance.
  3. Create a table in one Spark application and then access it from another Spark application (see the sketch after this list).
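
The following sketch illustrates step 3, assuming both applications run against the same external metastore configuration shown earlier; the table name and values are illustrative:

```python
# Application A: create and populate a table. The metadata goes to
# the external metastore, the data to Cloud Object Storage.
spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) USING parquet")
spark.sql("INSERT INTO sales VALUES (1, 9.99), (2, 19.98)")

# Application B: a separate Spark application, even on another
# IBM Analytics Engine instance, configured against the same
# metastore can query the table without re-declaring its schema.
spark.sql("SELECT COUNT(*) FROM sales").show()
```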