Understanding high availability and disaster recovery for Databases for MongoDB

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure whose impact exceeds what the high availability design can handle.

Databases for MongoDB is a regional service that fulfills the defined Service Level Objectives (SLO) with the Standard and Enterprise plans. For more information, see Service Level Agreement (SLA). For more information about the available IBM Cloud regions and data centers for Databases for MongoDB, see Service and infrastructure availability by location.

High availability architecture

Databases for MongoDB provides replication, failover, and high-availability features to protect your databases and data from infrastructure maintenance, upgrades, and some failures. Each deployment contains a cluster with three data members: one primary and two secondaries. The two secondary members are kept up to date using asynchronous replication. A distributed consensus mechanism maintains cluster state and handles failovers. If the primary becomes unavailable, the replica set elects a secondary as the new primary and continues normal operation. The old primary rejoins the set when it is available again. The primary and secondary members are always placed in different zones of a multi-zone region (MZR). If a zone failure causes a member to fail, the replacement member is created in a surviving zone.

High availability features

Databases for MongoDB supports the following high availability features.

High availability features
Feature	Description
Automatic failover	Standard on all clusters and resilient against a zone or single member failure.
Member count	Minimum of three members. The default is a Standard three-member deployment. A three-member cluster automatically recovers from a single instance or zone failure (with data loss up to the lag threshold).
Asynchronous replication	Secondaries replicate the primary's operations and apply the operations to their data sets asynchronously. By having the secondaries' data sets reflect the primary's data set, the replica set can continue to function despite the failure of one or more members.
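
The single-failure tolerance described in the table above follows from majority-based elections. The following is a minimal sketch of that arithmetic, assuming that electing a primary requires a strict majority of voting members, as in MongoDB replica sets. The function names are illustrative and are not part of any service API.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is more than half of the voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def tolerated_failures(voting_members: int) -> int:
    """Members that can be lost while the rest can still elect a primary."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    # Three members: two votes form a majority, so one member may fail.
    print(majority(3), tolerated_failures(3))
```

Under this assumption, a three-member cluster keeps a majority after losing one member, but not after losing two.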

Disaster recovery architecture

The general strategy for disaster recovery is to create a new database and restore data into it. The contents of the new database can come from a backup of the source database that was created before the disaster. Alternatively, on the Enterprise plan, a new database can be created with the point-in-time restore feature, provided that the production database is still available.

Disaster recovery features

Databases for MongoDB supports the following disaster recovery features.

Disaster recovery features
Feature	Description	Consideration
Backup restore	Create database from previously created backup; see Managing Cloud Databases backups.	New connection strings for the restored database must be referenced throughout the workload.
Point-in-time restore	Create database from the live production database using point-in-time recovery.	This is only possible on the Enterprise plan, and only if the active database is available and the recovery point objective (RPO) for the disaster falls within the supported window. It is not useful if the production cluster is unavailable. New connection strings for the restored database must be referenced throughout the workload.
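
Whether a point-in-time restore can serve a given incident depends on the target time falling inside the supported window. The following sketch shows that check in isolation. The window length is not stated here, so it is passed in as a parameter, and the seven-day value in the example is only a placeholder; consult the service documentation for the real value.

```python
from datetime import datetime, timedelta, timezone


def in_restore_window(target: datetime, now: datetime, window: timedelta) -> bool:
    """Return True if `target` lies between `now - window` and `now`."""
    earliest = now - window
    return earliest <= target <= now


if __name__ == "__main__":
    now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
    window = timedelta(days=7)  # placeholder, not a documented value
    target = datetime(2024, 6, 8, 9, 30, tzinfo=timezone.utc)
    print(in_restore_window(target, now, window))
```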

Planning for disaster recovery

The disaster recovery steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

Failure scenarios and resolutions
Failure	Resolution
Hardware failure (single point)	IBM provides a database that is resilient to a single point of hardware failure within a zone. No configuration is required.
Zone failure	Automatic failover. The database members are distributed between zones. Configuring three members will provide additional resiliency to multiple zone failures.
Data corruption	Backup restore or point-in-time restore. Use the restored database in production, or use it as source data to correct the corruption in the original database.
Regional failure	Backup restore. Use the restored database in production.

Application-level high-availability

Applications that communicate over networks and cloud services are subject to transient connection failures. Design your applications to retry connections when errors are caused by a temporary loss of connectivity to your deployment or to IBM Cloud.

Your applications must be designed to handle temporary interruptions to the database, implement error handling for failed database commands, and implement retry logic to recover from a temporary interruption.
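
The retry logic described above can be sketched as a small wrapper that uses exponential backoff with jitter. This is a minimal illustration rather than a prescribed implementation: the helper name `with_retry`, its parameters, and the default delays are assumptions, and the built-in `ConnectionError` stands in for whatever transient error types your driver raises (for example, `pymongo.errors.AutoReconnect` when using PyMongo).

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient errors with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads out retries from many clients.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only errors listed in `retry_on` are retried, so command errors that are not transient still fail immediately. Retrying is only safe for operations that are idempotent or otherwise safe to repeat.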

Several minutes of database unavailability or connection interruption are not expected. Open a support case with details if you have periods longer than a minute with no connectivity so we can investigate.

Your responsibilities for HA and DR

The following information can help you create and continuously practice your plan for HA and DR.

When restoring a database from backups or using point-in-time restore, a new database is created with new connection strings. Existing workloads and processes must be adjusted to consume the new connection strings. Promoting a read replica to a cluster will have a similar impact, although existing read-only portions of the workload will not be impacted.
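
Because a restore always produces new connection strings, it helps to keep the connection string out of application code so that switching to the restored database is a configuration change. The following sketch reads it from an environment variable. The variable name `MONGODB_URL` and the helper name are hypothetical.

```python
import os
from typing import Mapping, Optional


def mongodb_url(env: Optional[Mapping[str, str]] = None, var: str = "MONGODB_URL") -> str:
    """Return the MongoDB connection string from configuration, or fail loudly."""
    env = os.environ if env is None else env
    url = env.get(var, "").strip()
    if not url:
        raise RuntimeError(f"{var} is not set; cannot locate the database")
    if not url.startswith(("mongodb://", "mongodb+srv://")):
        raise ValueError(f"{var} does not look like a MongoDB connection string")
    return url
```

After a restore, updating the variable and restarting the workload points every consumer at the new database without a code change.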

A recovered database may also need the same customer-created dependencies as the original database. Make sure that the following services, and any others that you depend on, exist in the recovery region:

IBM® Key Protect for IBM Cloud®
Hyper Protect Crypto Services

Remember that deleting a database also deletes its associated backups. However, deleted databases may be recoverable within a limited timeframe. Refer to the FAQ backups documentation for specific details on database recovery procedures.

It is not possible to copy backups off IBM Cloud, so consider using database-specific tools to create additional backups. Such backups may be required to recover from a malicious database deletion that is followed by a reclamation-delete of the database. Careful management of IAM access to databases can help reduce exposure to this problem.

The following checklist associated with each feature can help you to create and practice your plan.

Backup restore
- Verify that backups are available at the desired frequency to meet RPO requirements. Backup frequency is documented in Managing Cloud Databases backups. If the criticality and size of the database allow, consider a script that creates additional on-demand backups to improve RPO, for example by using IBM Cloud® Code Engine - Working with the Periodic timer (cron) event producer.
- There are some restrictions on database restore regions. Verify that your restore goals can be achieved by reading Managing Cloud Databases backups.
- Verify that the retention period of the backups meets your requirements.
- Schedule test restores regularly to verify that the actual restore times meet the defined RTO. Remember that database size significantly impacts restore time. Consider strategies to minimize restore times, such as breaking down large databases into smaller, more manageable units and purging unused data.
- Verify the Key Protect service.
Point-in-time restore
- Verify the procedures covered earlier.
- Verify that the desired restore point is within the supported point-in-time window.
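
The first backup check in the list above can be partly automated. The following sketch assumes that you have already collected backup completion times from the backup listing; the function names and sample timestamps are hypothetical. It computes the largest gap between consecutive backups, including the gap from the newest backup to the present, and compares it with the RPO.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def largest_backup_gap(backup_times: Iterable[datetime], now: datetime) -> timedelta:
    """Longest interval without a backup, up to and including `now`."""
    times = sorted(backup_times)
    if not times:
        raise ValueError("no backups found")
    points = times + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: Iterable[datetime], now: datetime, rpo: timedelta) -> bool:
    """True if no gap between backups exceeds the recovery point objective."""
    return largest_backup_gap(backup_times, now) <= rpo


if __name__ == "__main__":
    # Sample data for illustration only.
    backups = [
        datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 6, 2, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 6, 2, 6, 0, tzinfo=timezone.utc),
    ]
    now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)
    print(largest_backup_gap(backups, now), meets_rpo(backups, now, timedelta(hours=12)))
```

A similar comparison of measured restore duration against the RTO can be recorded after each test restore.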

For more information on responsibility ownership between the customer and IBM Cloud for using Databases for MongoDB, see Shared responsibilities for Cloud Databases.

Stay informed: IBM notifications

Updates affecting customer workloads are communicated through IBM Cloud notifications. To stay informed about planned maintenance, announcements, and release notes related to this service, refer to Monitoring notifications and status. In addition, regularly review the Version policy for the latest updates on End-of-Life versions and dates.