Understanding high availability and disaster recovery for IBM Cloudant

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More broadly, it is the ability of a service or workload to withstand failures and continue providing processing capability according to a predefined service level. For services, availability is defined in the Service Level Agreement and covers both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state after a rare, major incident or wide-scale failure. Examples include a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. In these cases, the impact exceeds the ability of the high availability design to handle it.

IBM® Cloudant® for IBM Cloud® is a highly available global service that is designed to remain available during a zonal outage. IBM Cloudant is designed to meet the Service Level Objectives (SLO) with the Standard plan.

For more information about the available region and data center locations, see Service and infrastructure availability by location.

High availability architecture

IBM Cloudant provides replication, failover, and high-availability features to protect your databases and data from infrastructure maintenance, upgrades, and some failures. Each deployment contains a cluster of three nodes spread across three availability zones in a region. All data is distributed across multiple shards, and each shard is stored in triplicate, with one replica on each of the three nodes. The replicas are kept up to date through eventually consistent replication. A distributed consensus mechanism maintains cluster state and handles failovers. If a node is unavailable, requests are routed to a different node that holds the shard replicas, which preserves service and data availability. The unavailable node rejoins the cluster when it becomes available again. If a zone failure causes a member to fail, a new replica is created in a surviving zone.
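The routing behavior described above can be pictured with a small toy model. This is only an illustrative sketch, not IBM Cloudant's internal implementation: the `Node` class, the node and zone names, and the `route` function are all hypothetical, and the model ignores consensus and replica rebuilding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member living in one availability zone."""
    name: str
    zone: str
    up: bool = True
    shards: set = field(default_factory=set)


def build_cluster(shard_count=4):
    """Create three nodes in three zones; every shard gets a replica on each node."""
    nodes = [Node("node1", "zone-1"), Node("node2", "zone-2"), Node("node3", "zone-3")]
    for shard in range(shard_count):
        for node in nodes:  # three replicas per shard, one per node
            node.shards.add(shard)
    return nodes


def route(nodes, shard):
    """Return the first available node that holds a replica of the shard."""
    for node in nodes:
        if node.up and shard in node.shards:
            return node
    raise RuntimeError(f"no available replica for shard {shard}")


nodes = build_cluster()
nodes[0].up = False              # simulate an outage of zone-1
print(route(nodes, 2).name)      # node2
```

Because every shard has a replica in each zone, losing one node in this model still leaves two nodes that can answer any request.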

High availability features

IBM Cloudant supports the following high availability features:

HA features for IBM Cloudant
Feature	Description	Consideration
Automatic failover	Standard on all clusters and resilient against a zone or single member failure.
Node count	Three-node deployment by default. A three-member cluster automatically recovers from a single instance or zone failure (with data loss up to the lag threshold), ensuring service availability.
Shard replication	Shards are replicated three times, ensuring data availability. Each shard replica is placed on a different node.
Cross-region high availability	IBM Cloudant enables cross-region data redundancy and failover.	Optional. Customers can configure cross-region high availability.

Disaster recovery architecture

Although data is stored redundantly within an IBM Cloudant cluster, it's important to consider extra backup measures. IBM Cloudant provides a supported tool for snapshot backup and restore. The tool is called CouchBackup, and is open source. For more information, see Introducing CouchBackup.
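As a rough illustration of a snapshot backup and restore, the sketch below wraps the `couchbackup` and `couchrestore` command-line tools from Python. It assumes the tools are installed and on the `PATH`, that `couchbackup` writes the backup to standard output, and that `couchrestore` reads it from standard input. The URL, database name, and file path are placeholders, and authentication is deliberately left out. Check the CouchBackup documentation for the options that apply to your account, including any requirement that the restore target is a new, empty database.

```python
import subprocess


def backup_command(url, db):
    """Build the argument list for a couchbackup run (credentials omitted)."""
    return ["couchbackup", "--url", url, "--db", db]


def restore_command(url, db):
    """Build the argument list for a couchrestore run (credentials omitted)."""
    return ["couchrestore", "--url", url, "--db", db]


def run_backup(url, db, backup_path):
    """Stream the backup that couchbackup writes to stdout into a local file."""
    with open(backup_path, "wb") as out:
        subprocess.run(backup_command(url, db), stdout=out, check=True)


def run_restore(url, db, backup_path):
    """Feed a previously created backup file to couchrestore on stdin."""
    with open(backup_path, "rb") as src:
        subprocess.run(restore_command(url, db), stdin=src, check=True)


# Placeholder values for illustration only.
print(" ".join(backup_command("https://account.example", "orders")))
```

Restoring into a scratch database on a schedule is one way to confirm that the backups are usable before they are needed.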

You can also set up bidirectional continuous replication to another IBM Cloudant instance, in either an active-passive or an active-active configuration. For more information, see replication setup.
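A minimal sketch of what a bidirectional setup involves: one replication document per direction, each with `source`, `target`, and `continuous` fields, which would be written to the `_replicator` database of the instance that runs the replication. The instance URLs, the database name, and the document IDs below are hypothetical, and authentication details are omitted.

```python
import json


def replication_doc(doc_id, source_url, target_url, continuous=True):
    """Build one replication document for a single direction."""
    return {
        "_id": doc_id,
        "source": source_url,
        "target": target_url,
        "continuous": continuous,
    }


def bidirectional_docs(url_a, url_b, db):
    """Return the two documents needed for replication in both directions."""
    a = f"{url_a.rstrip('/')}/{db}"
    b = f"{url_b.rstrip('/')}/{db}"
    return [
        replication_doc(f"{db}-a-to-b", a, b),
        replication_doc(f"{db}-b-to-a", b, a),
    ]


# Placeholder instance URLs for two regions.
docs = bidirectional_docs("https://region-a.example", "https://region-b.example", "orders")
print(json.dumps(docs, indent=2))
```

In an active-passive design, applications write to only one side while both directions stay configured, so roles can be swapped after a failover. In an active-active design, applications write to both sides and rely on conflict handling.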

Disaster recovery features

IBM Cloudant supports the following disaster recovery features:

DR features for IBM Cloudant
Feature	Description	Consideration
Backup restore	Restore a database from a previously created backup; see IBM Cloudant backup and recovery.	Open source tool. Customer configured.
Cross-region failover	IBM Cloudant replication feature helps you build a flexible disaster recovery capability.	Customer configured.
Live synchronization	IBM Cloudant bidirectional active-active replication feature helps you build a flexible disaster recovery capability.	Customer configured.
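Because cross-region failover is customer configured, the application or its traffic layer needs some way to decide which instance to use. The sketch below shows one possible approach under simple assumptions: an ordered list of instance URLs and a health probe that treats an HTTP 200 response as healthy. The function names and URLs are hypothetical, and a real probe would normally authenticate and might check more than reachability.

```python
from urllib.error import URLError
from urllib.request import urlopen


def http_probe(base_url, timeout=3.0):
    """Return True if the endpoint answers a plain GET with HTTP 200."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False


def pick_endpoint(endpoints, probe=http_probe):
    """Return the first endpoint, in preference order, that passes the probe."""
    for url in endpoints:
        if probe(url):
            return url
    raise RuntimeError("no healthy endpoint available")


# Demonstration with a fake probe: the primary region is treated as down.
regions = ["https://primary.example", "https://secondary.example"]
chosen = pick_endpoint(regions, probe=lambda url: "secondary" in url)
print(chosen)  # https://secondary.example
```

Passing the probe as a parameter keeps the selection logic testable without network access.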

Planning for DR

The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

DR scenarios for IBM Cloudant
Failure	Resolution
Hardware failure (single point)	IBM Cloudant provides a database that is resilient to a single point of hardware failure within a zone. No customer configuration is required.
Zone failure	Automatic failover: the database members are distributed between zones, and the three configured members provide additional resiliency to multiple zone failures. Shard replicas: the database is distributed in multiple shards with triplicate replication on the members.
Data corruption	Backup restore. Use the restored database in production, or use it as source data to correct the corruption in the original database.
Regional failure	Backup restore. Use the restored database in production. Cross-region replication.

Your responsibilities for HA and DR

It is your responsibility to continuously test your plan for HA and DR.

Interruptions in network connectivity and short periods of unavailability of a service might occur. It is your responsibility to make sure that application source code includes client availability retry logic to maintain high availability of the application.
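One common way to add retry logic is exponential backoff with jitter. The helper below is a minimal sketch of that pattern rather than an official IBM Cloudant client feature: the function name, the default limits, and the choice of which exceptions count as transient are all assumptions to adapt to your client library.

```python
import random
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay cap on each attempt, then pick a random wait below it.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Demonstration with a stand-in operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network interruption")
    return {"ok": True}


result = call_with_retry(flaky_read, sleep=lambda seconds: None)
print(result, calls["count"])  # {'ok': True} 3
```

Retries are safest for idempotent requests, such as reads. For writes, consider whether a repeated request could create duplicates or conflicts.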

For more information about responsibility ownership between you and IBM Cloud for IBM Cloudant, see Understanding your responsibilities when you use IBM Cloudant.

Stay informed: IBM notifications

Updates and changes that affect customer workloads are communicated through IBM Cloud notifications. For more information about planned maintenance, announcements, and release notes that impact this service, see Monitoring notifications and status.

How IBM maintains services

Stay updated with Service changes and deprecations and Release notes of IBM Cloudant.