Understanding high availability and disaster recovery for IBM Cloudant
High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More generally, it is the ability of a service or workload to withstand failures and continue providing processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement and covers both planned and unplanned events, such as maintenance, failures, and disasters.
Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure, such as service disruption. Examples include a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. In these cases, the impact exceeds what the high availability design can handle.
IBM® Cloudant® for IBM Cloud® is a highly available global service designed for availability during a zonal outage. IBM Cloudant is designed to meet the Service Level Objectives (SLO) with the Standard plan.
For more information about the available region and data center locations, see Service and infrastructure availability by location.
High availability architecture
IBM Cloudant provides replication, failover, and high-availability features to protect your databases and data from infrastructure maintenance, upgrades, and some failures. Each deployment contains a cluster of three nodes spread across three availability zones in a region. All data is distributed across multiple shards, and each shard is replicated three times, with each replica stored on a different node. The replicas are kept up to date using eventually consistent replication. A distributed consensus mechanism is used to maintain cluster state and handle failovers. If a node is unavailable, requests are routed to a different node that holds the shard replicas, ensuring service and data availability. The unavailable node rejoins the cluster when it becomes available again. If a zone failure results in a member failing, a new replica is created in a surviving zone.
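The routing behavior described above can be illustrated with a small toy model. This is only a sketch for building intuition, not IBM Cloudant's actual placement or routing algorithm; the node names, zone names, and the functions `place_replicas`, `nodes_in_zone`, and `route` are all hypothetical.

```python
# Toy model: three nodes in three zones, every shard stored on three
# different nodes. All names here are hypothetical illustrations.
NODES = {"node-a": "zone-1", "node-b": "zone-2", "node-c": "zone-3"}


def place_replicas(shards, nodes, copies=3):
    """Assign each shard to `copies` distinct nodes (round-robin start)."""
    names = sorted(nodes)
    if copies > len(names):
        raise ValueError("not enough nodes for the requested replica count")
    return {
        shard: [names[(i + k) % len(names)] for k in range(copies)]
        for i, shard in enumerate(shards)
    }


def nodes_in_zone(nodes, zone):
    """Return the set of nodes that live in the given zone."""
    return {name for name, node_zone in nodes.items() if node_zone == zone}


def route(shard, placement, unavailable):
    """Pick the first replica holder for `shard` that is still available."""
    for node in placement[shard]:
        if node not in unavailable:
            return node
    raise RuntimeError(f"no available replica for shard {shard!r}")


if __name__ == "__main__":
    placement = place_replicas(["shard-0", "shard-1", "shard-2"], NODES)
    failed = nodes_in_zone(NODES, "zone-1")  # simulate a zonal outage
    for shard in placement:
        print(shard, "->", route(shard, placement, failed))
```

In this model, losing one zone removes one replica of every shard, and requests are still served from the two surviving nodes.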
High availability features
IBM Cloudant supports the following high availability features:
Feature | Description | Consideration |
---|---|---|
Automatic failover | Standard on all clusters and resilient against a zone or single member failure. | |
Node count | Three-node deployment by default. A three-member cluster automatically recovers from a single instance or zone failure (with data loss up to the lag threshold), ensuring service availability. | |
Shard replication | Shards are replicated three times, ensuring data availability. Each shard replica is placed on a different node. | |
Cross-region high availability | IBM Cloudant enables cross-region data redundancy and failover. | Optional. Customers can configure cross-region high availability. |
Disaster recovery architecture
Although data is stored redundantly within an IBM Cloudant cluster, it's important to consider extra backup measures. IBM Cloudant provides a supported tool for snapshot backup and restore. The tool is called CouchBackup, and is open source. For more information, see Introducing CouchBackup.
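As a rough illustration of a snapshot backup and restore, the sketch below wraps the `couchbackup` and `couchrestore` command-line tools from Python. It assumes that the CouchBackup CLI is installed and on the `PATH`, that the `--url` and `--db` options are used to select the instance and database, and that credentials are supplied separately. The helper function names, URLs, and file paths are hypothetical; check the CouchBackup documentation for authentication options and restore requirements.

```python
import subprocess


def backup_command(url, db):
    """Build the argument list for a couchbackup run (backup goes to stdout)."""
    return ["couchbackup", "--url", url, "--db", db]


def restore_command(url, db):
    """Build the argument list for a couchrestore run (backup read from stdin)."""
    return ["couchrestore", "--url", url, "--db", db]


def run_backup(url, db, out_path):
    """Write a backup of `db` to the file at `out_path`."""
    with open(out_path, "wb") as out:
        subprocess.run(backup_command(url, db), stdout=out, check=True)


def run_restore(url, db, in_path):
    """Restore the backup file at `in_path` into the database `db`."""
    with open(in_path, "rb") as src:
        subprocess.run(restore_command(url, db), stdin=src, check=True)


if __name__ == "__main__":
    # Hypothetical instance URL and database name; nothing is executed here.
    print(" ".join(backup_command("https://example.cloudant.example", "orders")))
```

Scheduling `run_backup` regularly and practicing `run_restore` against a test instance is one way to exercise the backup part of a DR plan.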
You can also set up bidirectional continuous replication to another IBM Cloudant instance, in either an active-passive or an active-active configuration. For more information, see replication setup.
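Bidirectional continuous replication amounts to two one-way replications, one in each direction. The sketch below builds the two replication documents that would be written to each side's `_replicator` database. The instance URLs, database name, document IDs, and function names are hypothetical, and authentication details are deliberately left out, so treat this as an outline rather than a complete setup.

```python
import json


def replication_doc(doc_id, source_url, target_url, continuous=True):
    """Build one replication document for the _replicator database."""
    return {
        "_id": doc_id,
        "source": source_url,
        "target": target_url,
        "continuous": continuous,
    }


def bidirectional_pair(url_a, url_b, db):
    """Return the A-to-B and B-to-A documents for database `db`."""
    db_a = f"{url_a.rstrip('/')}/{db}"
    db_b = f"{url_b.rstrip('/')}/{db}"
    return [
        replication_doc(f"{db}-a-to-b", db_a, db_b),
        replication_doc(f"{db}-b-to-a", db_b, db_a),
    ]


if __name__ == "__main__":
    # Hypothetical instance URLs in two different regions.
    docs = bidirectional_pair(
        "https://instance-a.example", "https://instance-b.example", "orders"
    )
    print(json.dumps(docs, indent=2))
```

The same pair of documents serves both configurations: in active-passive, applications write to only one instance until a failover, while in active-active they write to both.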
Disaster recovery features
IBM Cloudant supports the following disaster recovery features:
Feature | Description | Consideration |
---|---|---|
Backup restore | Restore a database from a previously created backup; see IBM Cloudant backup and recovery. | Open source tool. Customer configured. |
Cross-region failover | IBM Cloudant replication feature helps you build a flexible disaster recovery capability. | Customer configured. |
Live synchronization | IBM Cloudant bidirectional active-active replication feature helps you build a flexible disaster recovery capability. | Customer configured. |
Planning for DR
The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.
Failure | Resolution |
---|---|
Hardware failure (single point) | IBM Cloudant provides a database that is resilient to a single point of hardware failure within a zone. No customer configuration required. |
Zone failure | Automatic failover: the database members are distributed between zones, and a three-member configuration provides additional resiliency to multiple zone failures. Shard replicas: the database is distributed in multiple shards, with triplicate replication across the members. |
Data corruption | Backup restore. Use the restored database in production, or use it as source data to correct the corruption in the affected database. |
Regional failure | Backup restore (use the restored database in production) or cross-region replication. |
Your responsibilities for HA and DR
It is your responsibility to continuously test your plan for HA and DR.
Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
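A minimal sketch of client-side retry logic is shown below, assuming exponential backoff between attempts. The function name `call_with_retries`, its parameters, and the default exception types are illustrative choices, not part of any IBM Cloudant SDK; adjust `retry_on` to the errors that your HTTP client actually raises for transient failures.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_read():
        # Hypothetical operation that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary network interruption")
        return "document body"

    print(call_with_retries(flaky_read, base_delay=0.1))
```

Capping the delay and limiting the number of attempts keeps a prolonged outage from stalling the application indefinitely. Only retry operations that are safe to repeat.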
For more information about responsibility ownership between you and IBM Cloud for IBM Cloudant, see Understanding your responsibilities when you use IBM Cloudant.
Stay informed: IBM notifications
Updates and changes that affect customer workloads are communicated through IBM Cloud notifications. For more information about planned maintenance, announcements, and release notes that impact this service, see Monitoring notifications and status.
How IBM maintains services
To stay up to date, see Service changes and deprecations and Release notes for IBM Cloudant.