Configuring IBM Cloudant for cross-region disaster recovery

The IBM Cloudant disaster recovery guide explains that one way to enable disaster recovery is to use IBM Cloudant replication to create redundancy across regions.

For more information, see how to retrieve replication scheduler documents and monitor replication status.

You can configure replication in IBM® Cloudant® for IBM Cloud® by using an "active-active" or "active-passive" topology across data centers.

The following diagram shows a typical configuration that uses two IBM Cloudant accounts, one in each region:

Figure: Example active-active architecture. The diagram shows a customer, a global load balancer/DNS, and two data centers. Each data center contains a load balancer, three app servers, and a database, and the two databases replicate to each other.

Remember these important facts:

Within each data center, IBM Cloudant already offers high availability by storing data in triplicate across three servers.
Replication occurs at the database level rather than the account level, and must be explicitly configured.
IBM Cloudant doesn't provide any Service Level Agreements (SLAs) or certainties about replication latency. IBM Cloudant doesn't monitor individual replications. Plan your own strategy for detecting failed replications and restarting them.
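
One possible starting point for detecting failed replications is to poll the replication scheduler and flag jobs whose state needs attention. The following Python sketch assumes that the parsed body of `GET /_scheduler/docs` is a JSON object with a `docs` array whose entries carry `doc_id` and `state` fields. The function name, the second sample document ID, and the choice of which states count as unhealthy are illustrative assumptions, not an IBM Cloudant recommendation.

```python
# States treated as needing attention. This set is an assumption; adjust it
# to match your own policy.
UNHEALTHY_STATES = {"crashing", "error", "failed"}


def find_unhealthy(scheduler_response):
    """Return (doc_id, state) pairs for replications that need attention.

    `scheduler_response` is the parsed JSON body of GET /_scheduler/docs.
    """
    return [
        (doc.get("doc_id"), doc.get("state"))
        for doc in scheduler_response.get("docs", [])
        if doc.get("state") in UNHEALTHY_STATES
    ]


# Hand-written sample data in the assumed response shape.
sample = {
    "docs": [
        {"doc_id": "mydb-myaccount-dc1-to-myaccount-dc2", "state": "running"},
        {"doc_id": "hypothetical-second-replication", "state": "crashing"},
    ]
}
unhealthy = find_unhealthy(sample)
```

A monitoring job might run a check like this on a schedule for each account and raise an alert, or re-create the replication document, when the returned list is not empty.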

Before you begin an active-active deployment

For an active-active deployment, a strategy for managing conflicts must be in place, so be sure to understand how replication and conflicts work before you consider this architecture.

Go to the IBM Cloud Support portal if you need help with how to model data to handle conflicts effectively.

Overview

In the following material, a bidirectional replication is created. This configuration allows two databases to work in an active-active topology.

The configuration assumes that you have two accounts in different regions:

myaccount-dc1.cloudant.com
myaccount-dc2.cloudant.com

After these accounts are created, follow these steps:

Create a pair of peer databases within the accounts.
Set up API keys to use for the replications between these databases.
Grant appropriate permissions.
Set up replications.
Test that replications are working as expected.
Configure application and infrastructure for either active-active or active-passive use of the databases.

Step 1. Create your databases

In each account, create the database that you want to replicate.

In this example, a database that is called mydb is created.

The names that are used for the databases in this example aren't important, but using the same name is clearer.

curl "https://myaccount-dc1.cloudant.com/mydb" -XPUT -u myaccount-dc1
curl "https://myaccount-dc2.cloudant.com/mydb" -XPUT -u myaccount-dc2

Step 2. Create an API key for your replications

It's a good idea to use an API key for continuous replications. The advantage is that if your primary account details change, for example after a password reset, your replications can continue unchanged.

API keys aren't tied to a single account. This means that a single API key can be created, then granted suitable database permissions for both accounts.

For example, the following command requests an API key for the account myaccount-dc1:

curl -XPOST "https://myaccount-dc1.cloudant.com/_api/v2/api_keys" -u myaccount-dc1

A successful response is similar to the following abbreviated example:

{
  "password": "YPN...Tfi",
  "ok": true,
  "key": "ble...igl"
}

Take careful note of the password. It isn't possible to retrieve the password later.

Step 3. Grant access permission

Give the API key permission to read from and write to both databases.

If you also need to replicate indexes, assign admin permissions.

Use the IBM Cloudant Dashboard, or see the authorization information for details of how to grant permissions programmatically.
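
When you grant permissions programmatically, the existing permissions usually need to be preserved. The following Python sketch assumes that the database security document has a top-level `cloudant` object that maps each user name or API key to a list of roles, that the role names are `_reader` and `_writer`, and that writing the document replaces it as a whole. Check these assumptions against the authorization information before you rely on them. The helper name `grant_roles` is hypothetical.

```python
import copy


def grant_roles(security_doc, api_key, roles):
    """Return a copy of `security_doc` with `roles` added for `api_key`.

    Existing entries are preserved, on the assumption that writing a
    security document replaces the previous one as a whole.
    """
    updated = copy.deepcopy(security_doc)
    permissions = updated.setdefault("cloudant", {})
    current = permissions.get(api_key, [])
    # Keep the existing roles first, then append new ones without duplicates.
    permissions[api_key] = current + [r for r in roles if r not in current]
    return updated


# Hypothetical existing document that grants nothing to anonymous users.
existing = {"cloudant": {"nobody": []}}
new_doc = grant_roles(existing, "ble...igl", ["_reader", "_writer"])
```

The merged document would then be written back to the security endpoint of the database in each account, because the same API key needs access on both sides.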

Step 4. Set up replications

Replications in IBM Cloudant are always uni-directional: from one database to another database. To replicate both ways between two databases, two replications are required, one for each direction.

In each account, create a replication that uses the API key that you created earlier.

First, create a replication from database myaccount-dc1.cloudant.com/mydb to database myaccount-dc2.cloudant.com/mydb.

curl -XPOST "https://myaccount-dc1.cloudant.com/_replicator" \
	-u myaccount-dc1 \
	-H "Content-Type: application/json" \
	-d '{
	  "_id": "mydb-myaccount-dc1-to-myaccount-dc2",
	  "source": {
	    "auth": {
	      "basic": {
	        "username": "ble...igl",
	        "password": "YPN...Tfi"
	      }
	    },
	    "url": "https://myaccount-dc1.cloudant.com/mydb"
	  },
	  "target": {
	    "auth": {
	      "basic": {
	        "username": "ble...igl",
	        "password": "YPN...Tfi"
	      }
	    },
	    "url": "https://myaccount-dc2.cloudant.com/mydb"
	  },
	  "continuous": true
}'

Next, create a replication from database myaccount-dc2.cloudant.com/mydb to database myaccount-dc1.cloudant.com/mydb.

curl -XPOST "https://myaccount-dc2.cloudant.com/_replicator" \
	-u myaccount-dc2 \
	-H "Content-Type: application/json" \
	-d '{
	  "_id": "mydb-myaccount-dc2-to-myaccount-dc1",
	  "source": {
	    "auth": {
	      "basic": {
	        "username": "ble...igl",
	        "password": "YPN...Tfi"
	      }
	    },
	    "url": "https://myaccount-dc2.cloudant.com/mydb"
	  },
	  "target": {
	    "auth": {
	      "basic": {
	        "username": "ble...igl",
	        "password": "YPN...Tfi"
	      }
	    },
	    "url": "https://myaccount-dc1.cloudant.com/mydb"
	  },
	  "continuous": true
}'

If this step fails because the _replicator database doesn't exist, create it.
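
Because the two replication documents differ only in direction, they can be generated from one helper to reduce copy-and-paste mistakes. The following Python sketch builds both documents in the same shape as the requests in this step. The function names `endpoint` and `replication_doc` are hypothetical, and the credentials are the abbreviated placeholders from the earlier example.

```python
import json


def endpoint(host, db, key, password):
    """Describe one side of a replication: credentials plus database URL."""
    return {
        "auth": {"basic": {"username": key, "password": password}},
        "url": f"https://{host}/{db}",
    }


def replication_doc(source_host, target_host, db, key, password):
    """Build a continuous, one-way replication document."""
    source_name = source_host.split(".")[0]
    target_name = target_host.split(".")[0]
    return {
        "_id": f"{db}-{source_name}-to-{target_name}",
        "source": endpoint(source_host, db, key, password),
        "target": endpoint(target_host, db, key, password),
        "continuous": True,
    }


DC1 = "myaccount-dc1.cloudant.com"
DC2 = "myaccount-dc2.cloudant.com"

# One document per direction; each is posted to its source account.
docs = [
    replication_doc(DC1, DC2, "mydb", "ble...igl", "YPN...Tfi"),
    replication_doc(DC2, DC1, "mydb", "ble...igl", "YPN...Tfi"),
]
bodies = [json.dumps(doc, indent=2) for doc in docs]
```

Each entry in `bodies` could be used as the request body for the `_replicator` database of the corresponding source account.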

Step 5. Test your replication

Test the replication processes by creating, modifying, and deleting documents in either database.

After each change in the database, check that you can also see that change in the other database.
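
Replication is not instantaneous, so an automated test usually needs to poll for a change rather than check once. The following Python sketch shows a generic polling helper. The name `wait_for` and its parameters are hypothetical. In a real test, the `check` callable would request the document from the other database and return `True` when it is found; here a stand-in function is used so that the example runs without network access.

```python
import time


def wait_for(check, timeout=30.0, interval=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in for "is the document visible in the other database yet?".
# It reports success on the third attempt.
attempts = []


def fake_check():
    attempts.append(1)
    return len(attempts) >= 3


replicated = wait_for(fake_check, timeout=5.0, interval=0.0)
```

The timeout that you choose doubles as a rough statement of the replication delay that your application can tolerate.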

Step 6. Configure your application

The databases are set up to remain synchronized with each other.

The next decision is whether to use the databases in an active-active or active-passive manner.

Active-active

In an active-active configuration, different application instances can write to different databases.

For example, application "A" might write to database myaccount-dc1.cloudant.com/mydb, while application "B" might write to database myaccount-dc2.cloudant.com/mydb.

This configuration offers several benefits:

Load can be spread over several accounts.
You can configure applications to access the account that offers the lowest latency, which is not always the geographically closest one.

An application can be set up to communicate with the "nearest" IBM Cloudant account. For applications hosted in DC1, it's appropriate to set their IBM Cloudant URL to "https://myaccount-dc1.cloudant.com/mydb". Similarly, for applications that are hosted in DC2, you would set their IBM Cloudant URL to "https://myaccount-dc2.cloudant.com/mydb".

Active-passive

In an active-passive configuration, all instances of an application are configured to use a primary database. However, the application can fail over to the backup database if circumstances make it necessary. The failover might be implemented within the application logic itself, by using a load balancer, or by some other means.

A simple test of whether a failover is required would be to use the main database endpoint as a "heartbeat". For example, a simple GET request that is sent to the main database endpoint normally returns details about the database. If no response is received, it might indicate that a failover is necessary.
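
The heartbeat idea can be sketched in Python as follows. The function names, the five-second timeout, and the rule that anything other than an HTTP 200 response counts as a failure are assumptions for illustration. Authentication headers are passed in by the caller, and a production check would normally require several failed attempts before it triggers a failover.

```python
import urllib.error
import urllib.request


def is_alive(url, headers=None, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200 within `timeout`."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, connection failures, and timeouts.
        return False


def pick_database(primary, backup, probe=is_alive):
    """Return the primary URL while it responds, otherwise the backup."""
    return primary if probe(primary) else backup


PRIMARY = "https://myaccount-dc1.cloudant.com/mydb"
BACKUP = "https://myaccount-dc2.cloudant.com/mydb"

# A stand-in probe simulates an unreachable primary without network access.
chosen = pick_database(PRIMARY, BACKUP, probe=lambda url: False)
```

Passing the probe as a parameter keeps the selection logic testable and lets you swap in a stricter health check later.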

Other configurations

You might consider other hybrid approaches for your configuration.

For example, in a "Write-Primary, Read-Replica" configuration, all writes go to one database, but the read load is distributed among the replicas.

Step 7. Next steps

Consider monitoring the replications between the databases. Use the data to determine whether your configuration might be optimized further.
Consider how your design documents and indexes are deployed and updated. You might find it more efficient to automate these tasks.

Failing over between IBM Cloudant regions

Typically, the process of managing a failover between regions or data centers is handled higher up within your application stack, for example, by changing the application server failover configuration or by load balancing.

IBM Cloudant doesn't provide a facility for you to explicitly manage failover or reroute requests between regions. This constraint is partly for technical reasons, and partly because the conditions under which a failover might happen tend to be application-specific. For example, you might want to force a failover in response to a custom performance metric.

However, if you decide that you need the ability to manage failover, consider the following possible options:

Put your own HTTP proxy in front of IBM Cloudant. Configure your application to talk to the proxy rather than the IBM Cloudant instance. This configuration means that the task of changing the IBM Cloudant instances that are used by applications can be handled through a modification to the proxy configuration rather than a modification to the application settings. Many proxies can balance the load, based on user-defined health checks.
Use a global load balancer such as IBM Cloud® Internet Services to route to IBM Cloudant. This option requires a CNAME definition that routes to different IBM Cloudant accounts, based on a health check or latency rule.

Recovering from failover

If a single IBM Cloudant instance is unreachable, avoid redirecting traffic back to it as soon as it becomes reachable again. The reason is that some time is required for intensive tasks such as synchronizing the database state from any peers, and ensuring that indexes are up to date.

It's helpful to have a mechanism for monitoring these tasks to help decide when a database is in a suitable state to service your production traffic.

As a guide, a typical list of checks to apply includes:

Replications
Indexes

If you implement rerouting for requests or failover based on a health test, you might want to incorporate corresponding checks to avoid premature rerouting back to a service instance that is still recovering.
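
One way to avoid premature rerouting is to require several consecutive successful health checks before an instance is treated as ready again. The following Python sketch illustrates that idea. The class name `RecoveryGate` and the default of three passes are assumptions, and the health check itself (replications and indexes) is supplied by the caller.

```python
class RecoveryGate:
    """Report an instance as ready only after N consecutive healthy checks."""

    def __init__(self, required_passes=3):
        self.required_passes = required_passes
        self.passes = 0

    def record(self, healthy):
        """Record one check result and return True once the instance is ready."""
        # Any failed check resets the streak to zero.
        self.passes = self.passes + 1 if healthy else 0
        return self.passes >= self.required_passes


gate = RecoveryGate(required_passes=3)
results = [gate.record(h) for h in [True, True, False, True, True, True]]
# results == [False, False, False, False, False, True]
```

In this run the single failed check resets the count, so the instance is reported as ready only after the final three successes.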

Replications

Are any replications in an error state?
Do any replications need restarting?
How many pending changes are still waiting for replication into the database?

For more information, see how to retrieve replication scheduler documents and monitor replication status.

If a database is being changed continuously, the number of pending changes is unlikely to be zero. You must decide what threshold is acceptable, and what represents an error state.
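
A threshold check of this kind might look like the following Python sketch. It assumes that each entry in the `docs` array of the replication scheduler response can carry an `info` object with a `changes_pending` count. That field might be absent for jobs that are not running, so a missing value is treated as lagging here. The function names, the sample data, and the threshold of 100 are illustrative assumptions.

```python
def replication_backlog(scheduler_response):
    """Map each replication doc_id to its reported pending change count.

    Jobs that report no `changes_pending` value are mapped to None.
    """
    backlog = {}
    for doc in scheduler_response.get("docs", []):
        info = doc.get("info")
        pending = info.get("changes_pending") if isinstance(info, dict) else None
        backlog[doc.get("doc_id")] = pending
    return backlog


def over_threshold(backlog, max_pending):
    """Return doc_ids whose backlog is unknown or exceeds `max_pending`."""
    return [
        doc_id for doc_id, pending in backlog.items()
        if pending is None or pending > max_pending
    ]


# Hand-written sample data in the assumed response shape.
sample = {
    "docs": [
        {"doc_id": "mydb-myaccount-dc2-to-myaccount-dc1",
         "state": "running", "info": {"changes_pending": 12}},
        {"doc_id": "hypothetical-busy-replication",
         "state": "running", "info": {"changes_pending": 5000}},
    ]
}
lagging = over_threshold(replication_backlog(sample), max_pending=100)
```

Treating an unknown backlog as lagging is a conservative choice: it keeps traffic away from a recovering instance until the replication reports a usable figure.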

Indexes

Are the indexes sufficiently up to date? Verify that indexes are updated by using the active tasks endpoint.
Test the level of "index readiness" by sending a query to the index and checking whether it returns within an acceptable time.
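
Both index checks can be sketched in Python. The first function assumes that the active tasks endpoint returns a JSON array in which index builds have a `type` of `indexer` or `search_indexer` and carry `database`, `design_document`, and `progress` fields. The second function times any callable, such as a query against the index. All names and the sample data are hypothetical.

```python
import time


def pending_index_builds(active_tasks, database_name):
    """Return (design_document, progress) for index builds on a database."""
    builds = []
    for task in active_tasks:
        if task.get("type") not in ("indexer", "search_indexer"):
            continue
        # The task's database field may name a shard rather than the plain
        # database, so this sketch matches by substring. That is imprecise:
        # "mydb" would also match a database called "mydb2".
        if database_name in task.get("database", ""):
            builds.append((task.get("design_document"), task.get("progress")))
    return builds


def timed(call, clock=time.monotonic):
    """Run `call()` and return (result, elapsed_seconds)."""
    start = clock()
    result = call()
    return result, clock() - start


# Hand-written sample data in the assumed response shape.
sample_tasks = [
    {"type": "indexer",
     "database": "shards/00000000-3fffffff/myaccount-dc1/mydb.1500000000",
     "design_document": "_design/hypothetical", "progress": 42},
    {"type": "replication", "doc_id": "mydb-myaccount-dc1-to-myaccount-dc2"},
]
builds = pending_index_builds(sample_tasks, "mydb")

# Stand-in for a real index query, so the example runs without network access.
result, elapsed = timed(lambda: "ok")
```

An empty list from `pending_index_builds`, together with a query time under your own limit, is one possible definition of "sufficiently up to date".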