Best practices

Use the following set of recommended guidelines when provisioning and managing your serverless instances and when running Spark applications.

Use separate IBM Analytics Engine service instances for your development and production environments

This is a general best practice. By creating separate IBM Analytics Engine instances for different environments, you can test any configuration and code changes before applying them to the production instance.
Upgrade to the latest Spark version

As open source Spark versions are released, they are made available in IBM Analytics Engine after a time interval required for internal testing. Watch for announcements of new Spark versions in the Release Notes section and upgrade the runtime of your instance to move your applications to the latest Spark runtime. Older runtimes are deprecated and eventually removed as newer versions are released. Make sure you test your applications on the new runtime before making changes on the production instances.
Grant role-based access

Grant role-based access to all users on the IBM Analytics Engine instances based on their requirements. For example, only your automation team should have permissions to submit applications because it has access to secrets, and your DevOps team should only be able to see the list of all applications and their states.
Choose the right IBM Cloud Object Storage configuration
  • Disaster Recovery (DR) Resiliency: You should use the IBM Cloud Object Storage Cross Regional resiliency option that backs up your data across several different cities in a region. In contrast, the Regional resiliency option backs up data in a single data center.
  • Encryption: IBM Cloud Object Storage comes with default built-in encryption. You can also configure Object Storage to work with the BYOK Key Protect service.
  • Service credentials: By default, IBM Cloud Object Storage uses IAM-style credentials. If you want to work with AWS-style credentials, you need to use the "Include HMAC Credential" option as described in Service credentials.
  • Direct endpoints for IBM Cloud Object Storage: Always use direct endpoints for connectivity to the IBM Cloud Object Storage instance. This applies to the IBM Cloud Object Storage home instance as well as to endpoints used from your applications (whether in your code or passed as configuration parameters at the instance or application level). Direct endpoints provide better performance than public endpoints and do not incur charges for any outgoing or incoming bandwidth.
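For illustration, here is a minimal PySpark sketch that points a Stocator service name at a direct endpoint. The service name mycos, the bucket name, and the HMAC keys are placeholders; substitute values from your own service credentials.

```python
from pyspark.sql import SparkSession

# Minimal sketch: configure a Stocator service name ("mycos" is a
# placeholder) to use a direct COS endpoint with HMAC credentials.
spark = (
    SparkSession.builder.appName("cos-direct-endpoint")
    .config("spark.hadoop.fs.cos.mycos.endpoint",
            "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud")
    .config("spark.hadoop.fs.cos.mycos.access.key", "<HMAC_ACCESS_KEY>")
    .config("spark.hadoop.fs.cos.mycos.secret.key", "<HMAC_SECRET_KEY>")
    .getOrCreate()
)

# Objects are then addressed as cos://<bucket>.<service-name>/<object>.
df = spark.read.parquet("cos://my-data-bucket.mycos/path/to/data")
df.show(5)
```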
Use private endpoints for the external Hive metastore

If you are using Spark SQL and want to use an external metastore, such as IBM Cloud Databases for PostgreSQL, as your Hive metastore, you must use the private endpoint for the database connection for better performance and cost savings.
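As a minimal sketch, assuming the metastore schema is already initialized and the PostgreSQL JDBC driver is on the classpath, the connection can be pointed at the private endpoint through standard Hive metastore properties. The hostname, port, database name, and credentials below are placeholders taken from your service credentials.

```python
from pyspark.sql import SparkSession

# Minimal sketch: point the Hive metastore connection at a Databases for
# PostgreSQL *private* endpoint. All connection values are placeholders;
# the PostgreSQL JDBC driver must be available on the classpath
# (for example via spark.jars.packages).
spark = (
    SparkSession.builder.appName("external-hive-metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "org.postgresql.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:postgresql://<host>.private.databases.appdomain.cloud:<port>/<db>?sslmode=verify-full")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "<user>")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "<password>")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
```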
Running applications with resource overcommitment

There is a quota associated with each Analytics Engine Serverless instance. When applications are submitted on an instance, they are allocated resources from the instance quota. If an application requests resources beyond the available quota, the application will either not start or will run with less than the requested resources, which might result in the application running slower than expected or, in some cases, in the application failing. You should always monitor the current resource consumption on an instance to ensure that your applications are running comfortably within the given limits. You can adjust the limits through a support ticket if required.
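As a hedged sketch of such monitoring, the Analytics Engine v3 REST API exposes the instance's current resource consumption; the region host, instance GUID, and IAM token below are placeholders, and you should verify the endpoint against the current API reference.

```python
import requests

# Minimal sketch: poll the current resource consumption of an instance
# through the Analytics Engine v3 REST API. Host, GUID, and token are
# placeholders.
API_HOST = "https://api.us-south.ae.cloud.ibm.com"
INSTANCE_GUID = "<instance-guid>"
IAM_TOKEN = "<iam-access-token>"

resp = requests.get(
    f"{API_HOST}/v3/analytics_engines/{INSTANCE_GUID}/current_resource_consumption",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # reports the cores and memory currently in use
```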
Static allocation of resources versus autoscaling

When you submit applications, you can specify the number of executors upfront (static allocation) or use the autoscaling option (dynamic allocation). Before you decide between the two, you might want to run a few benchmarking tests with representative data sets using both static allocation and autoscaling to find the right configuration (see the sketch after this list). General considerations:

  • If you know the amount of resources (cores and memory) required by your application and it doesn't vary across different stages of the application run, allocate static resources for better performance.
  • If you want to optimize resource utilization, opt for autoscaling of executors, where executors are allocated based on the application's actual demand. Note that autoscaling might introduce a slight delay while executors are added or removed.
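A minimal sketch of the two alternatives, expressed as standard open-source Spark properties in the conf section of the submit payload; the application path and all values are placeholders, and any service-specific autoscaling keys should be checked in the Analytics Engine documentation.

```python
# Minimal sketch: two alternative "conf" sections for the
# application-submit payload, using standard Spark properties.

# Static allocation: a fixed number of executors for the whole run.
static_conf = {
    "spark.executor.instances": "4",
    "spark.executor.cores": "2",
    "spark.executor.memory": "4G",
}

# Autoscaling (dynamic allocation): executors scale with demand.
autoscale_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
    # Needed in Spark 3.x when no external shuffle service is present.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}

application_details = {
    "application": "cos://my-app-bucket.mycos/my_app.py",
    "conf": autoscale_conf,  # or static_conf
}
```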
Enable and fine-tune log forwarding
  • Enable log forwarding for your service instance to help you troubleshoot, track progress, and view the output of your applications. Note that log forwarding incurs a cost based on the quantity of logs forwarded or retained in the IBM Log Analysis instance. Decide on the optimal settings based on your use case and needs.
  • When you enable log forwarding with the default API settings, only the driver logs are enabled. If you need executor logs as well, for example, if there are errors that you would see only on executors, you need to customize logging to enable executor logging too (see the sketch after this list). Executor logs can become very large, so balance the amount of logs that get forwarded to your logging instance against the information you need for troubleshooting.
  • Follow the best practices of IBM Log Analysis when choosing the right configuration and search techniques. For example, you might want to configure the IBM Log Analysis instance plan for a 7-day search with archival of logs to IBM Cloud Object Storage to save on costs. Also refer to the IBM Log Analysis documentation for techniques on searching for the logs of interest based on keywords, a point in time, and so on.
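A minimal sketch of enabling executor logs through the log forwarding configuration endpoint of the Analytics Engine v3 API; the field names and source values follow the API reference but should be verified, and the host, GUID, and token are placeholders.

```python
import requests

# Minimal sketch: enable log forwarding for driver *and* executor logs.
# Verify field names and allowed sources against the current v3 API
# reference; host, GUID, and token are placeholders.
API_HOST = "https://api.us-south.ae.cloud.ibm.com"
INSTANCE_GUID = "<instance-guid>"
IAM_TOKEN = "<iam-access-token>"

resp = requests.put(
    f"{API_HOST}/v3/analytics_engines/{INSTANCE_GUID}/log_forwarding_config",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    json={
        "enabled": True,
        # Drop "spark-executor" to reduce the forwarded volume and cost.
        "sources": ["spark-driver", "spark-executor"],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```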
Customize your service instance
  • You might need to customize your service instance to bring in Python or conda packages that are not preinstalled, or to bring in files (certificates or configuration files) that must be made available to Spark applications. Based on your needs, customize your instance using library sets and use these library sets when submitting applications.
  • The size of your library set has a bearing on the application startup time and the executor startup time (when you autoscale applications). Also note that there is an upper limit of 2 GB on the size of a library set. If different applications need different libraries, it is better to use separate library sets so that each can be specified individually when the application is submitted.
  • Use customization only to bring in files that cannot be brought in by the application details parameters. See Parameters for submitting Spark applications. Use the standard spark-submit equivalent parameter options, such as the files, jars, packages, and pyFiles options, if they fit your use case (see the sketch after this list). Only if you need files that don't fit into any of these categories, for example a self-signed certificate, a JAAS configuration file, or a .so file, should you use the "customization for file download" option.
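As a minimal sketch, the standard spark-submit equivalents can be expressed as ordinary Spark properties in the submit payload; all paths and package coordinates are placeholders, and the commented-out library-set key is an assumption to check against the customization documentation.

```python
# Minimal sketch: prefer standard spark-submit equivalents in "conf"
# before reaching for a customization. The keys below are standard
# open-source Spark properties; paths and coordinates are placeholders.
application_details = {
    "application": "cos://my-app-bucket.mycos/my_app.py",
    "conf": {
        # Equivalent to --py-files, --files, and --packages:
        "spark.submit.pyFiles": "cos://my-app-bucket.mycos/deps.zip",
        "spark.files": "cos://my-app-bucket.mycos/app.conf",
        "spark.jars.packages": "org.postgresql:postgresql:42.7.3",
        # For files that fit none of the above (for example a
        # self-signed certificate), reference a library set built via
        # the customization flow instead. The key name is an assumption;
        # check the customization docs for your runtime:
        # "ae.spark.librarysets": "my-library-set",
    },
}
```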
Apply filters when retrieving the list of applications

When you need to retrieve the list of applications, either in the UI or by using the API or CLI, apply the appropriate filters and retrieve only the set that you need.
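For illustration, a minimal sketch that lists only running applications through the v3 API; the state query parameter and the response shape are assumptions to verify against the API reference, and the host, GUID, and token are placeholders.

```python
import requests

# Minimal sketch: list only running applications instead of fetching
# everything. Verify the "state" filter and response field names
# against the v3 API reference; host, GUID, and token are placeholders.
API_HOST = "https://api.us-south.ae.cloud.ibm.com"
INSTANCE_GUID = "<instance-guid>"
IAM_TOKEN = "<iam-access-token>"

resp = requests.get(
    f"{API_HOST}/v3/analytics_engines/{INSTANCE_GUID}/spark_applications",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    params={"state": "running"},
    timeout=30,
)
resp.raise_for_status()
for app in resp.json().get("applications", []):
    print(app.get("id"), app.get("state"))
```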
Use other services or tools for supporting functions

Apart from an IBM Log Analysis and an IBM Cloud Object Storage instance, and depending on your use case, you might want to use other supporting tools and services. For instance, you can use Apache Airflow (managed by you) for orchestrating, scheduling, and automating your applications. You can also use IBM Secrets Manager to store the secrets required for your applications and have your automation scripts read the secrets from Secrets Manager before submitting your applications. You can even get creative with your application arguments, passing a token that lets your application read the required secrets from Secrets Manager directly.
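As a hedged sketch of the Secrets Manager pattern, assuming a v2 REST endpoint and placeholder instance URL, secret ID, and token:

```python
import requests

# Minimal sketch: fetch a secret from IBM Secrets Manager before
# submitting an application, so credentials never live in scripts.
# Instance URL, secret ID, and token are placeholders; verify the
# endpoint shape against the Secrets Manager API docs.
SM_URL = "https://<instance-id>.us-south.secrets-manager.appdomain.cloud"
SECRET_ID = "<secret-id>"
IAM_TOKEN = "<iam-access-token>"

resp = requests.get(
    f"{SM_URL}/api/v2/secrets/{SECRET_ID}",
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
secret = resp.json()
# Inject the secret material into the submit payload instead of
# hardcoding it in the application or automation script.
```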
Use instances in alternate regions for backup and disaster recovery

Currently, IBM Analytics Engine Serverless instances can be created in two regions, namely Dallas (us-south) and Frankfurt (eu-de). Although it is advisable to create your instances in the same region where your data is located, it is always useful to create a backup instance in an alternate region with the same set of configurations as your primary instance, in case the primary instance becomes unavailable or unusable. Your automation should enable switching application submissions between the two regions if required.
Use separate buckets and service credentials for application files, data files, and home instance

Use the "separation of concerns" principle to distinguish the access between different resources.

  • Do not store data or application files in the home instance bucket.
  • Use separate buckets for data and application files.
  • Use separate access credentials (IAM key based) with restricted access for the bucket that holds your application files and the bucket that contains your data.
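A minimal sketch of the bucket separation, assuming HMAC credentials and placeholder service names (appsvc, datasvc), endpoints, and keys:

```python
from pyspark.sql import SparkSession

# Minimal sketch: register two Stocator service names, each with
# credentials scoped to a single bucket. Service names, endpoint,
# and keys are placeholders.
spark = (
    SparkSession.builder.appName("separate-cos-credentials")
    .config("spark.hadoop.fs.cos.appsvc.endpoint",
            "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud")
    .config("spark.hadoop.fs.cos.appsvc.access.key", "<APP_BUCKET_ACCESS_KEY>")
    .config("spark.hadoop.fs.cos.appsvc.secret.key", "<APP_BUCKET_SECRET_KEY>")
    .config("spark.hadoop.fs.cos.datasvc.endpoint",
            "https://s3.direct.us-south.cloud-object-storage.appdomain.cloud")
    .config("spark.hadoop.fs.cos.datasvc.access.key", "<DATA_BUCKET_ACCESS_KEY>")
    .config("spark.hadoop.fs.cos.datasvc.secret.key", "<DATA_BUCKET_SECRET_KEY>")
    .getOrCreate()
)

# Application files resolve through appsvc, data through datasvc.
df = spark.read.csv("cos://my-data-bucket.datasvc/input.csv", header=True)
```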
Applications must complete within 72 hours

There is a limit on the number of hours an application or kernel can run. For security and compliance patching, all runtimes that run for more than 72 hours are stopped. If you have a large application, break it into smaller chunks that run within 72 hours. If you run Spark streaming applications, make sure that you configure checkpoints and have monitoring in place to restart your applications if they are stopped.
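A minimal Structured Streaming sketch with a checkpoint location in Object Storage so a restarted application resumes where it left off; the paths and the mycos service name are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a streaming query that checkpoints to COS so it can
# resume cleanly after being stopped and restarted. Paths and the
# "mycos" service name are placeholders.
spark = SparkSession.builder.appName("streaming-with-checkpoint").getOrCreate()

# The built-in "rate" source stands in for your real stream source.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.writeStream.format("parquet")
    .option("path", "cos://my-data-bucket.mycos/events/")
    .option("checkpointLocation",
            "cos://my-data-bucket.mycos/checkpoints/events/")
    .start()
)
query.awaitTermination()
```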
Start and stop the Spark history server only when needed

Always stop the Spark history server when you no longer need it. Keep in mind that the Spark history server consumes CPU and memory resources continuously while it is running.
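A minimal sketch of starting and stopping the history server through the v3 API; the endpoint path should be verified against the API reference, and the host, GUID, and token are placeholders.

```python
import requests

# Minimal sketch: start the Spark history server on demand and stop it
# when done. Verify the endpoint against the v3 API reference; host,
# GUID, and token are placeholders.
API_HOST = "https://api.us-south.ae.cloud.ibm.com"
INSTANCE_GUID = "<instance-guid>"
IAM_TOKEN = "<iam-access-token>"
HEADERS = {"Authorization": f"Bearer {IAM_TOKEN}"}
URL = f"{API_HOST}/v3/analytics_engines/{INSTANCE_GUID}/spark_history_server"

requests.post(URL, headers=HEADERS, timeout=30).raise_for_status()    # start
# ... inspect completed applications in the history server UI ...
requests.delete(URL, headers=HEADERS, timeout=30).raise_for_status()  # stop
```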