Advanced replication

This section covers advanced replication concepts and tasks, including the following:

Maintaining your replication database
Scheduling and monitoring replications
Authenticating during replication

You might also find it helpful to review details of the underlying replication protocol, and review the API reference documentation.

Replication database maintenance

A replication database must be monitored like any other database. Without regular database maintenance, you might accumulate invalid documents that were caused by interruptions to the replication process. Having many invalid documents can result in an excess load on your cluster when the replicator process is restarted by IBM® Cloudant® for IBM Cloud® operations.

To maintain a replication database, remove old documents. You can remove old documents by determining their age and deleting them if they're no longer needed.
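The clean-up step above can be sketched as a small helper. This is an illustration under assumptions rather than an official procedure: the function name `buildPurgeRequest` is hypothetical, and it assumes each finished replication document records its state and a timestamp in fields such as `_replication_state` and `_replication_state_time`. The result is a request body for the `_bulk_docs` endpoint that marks the selected documents as deleted.

```javascript
// Hypothetical helper: pick finished replication documents older than a
// cutoff and build a _bulk_docs body that deletes them.
// Assumption: documents carry _replication_state and a parsable timestamp
// in the field named by `timeField`.
function buildPurgeRequest(docs, maxAgeMs, now = Date.now(), timeField = "_replication_state_time") {
  const finished = ["completed", "failed"];
  const stale = docs.filter((doc) => {
    if (doc._id.startsWith("_design/")) return false; // keep design documents
    if (!finished.includes(doc._replication_state)) return false; // keep active jobs
    const stamp = Date.parse(doc[timeField]);
    // Skip documents without a parsable timestamp rather than guess their age.
    return !Number.isNaN(stamp) && now - stamp > maxAgeMs;
  });
  return {
    docs: stale.map((doc) => ({ _id: doc._id, _rev: doc._rev, _deleted: true })),
  };
}
```

Restricting the selection to finished documents is a deliberate choice in this sketch, because deleting the document of a running replication would also stop that replication.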

The replication scheduler

The new IBM Cloudant Replication Scheduler provides a number of improvements and enhancements when compared with the previous IBM Cloudant replication mechanism.

In particular, network usage during replication is more efficient. The scheduler accounts for the current load for individual database nodes within a cluster when it determines the allocation of replication tasks.

Finally, the state of a replication is now more detailed, and consists of seven distinct states:

initializing - The replication was added to the scheduler, but isn't yet initialized or scheduled to run. This state occurs when a new or updated replication document is stored within the _replicator database.
error - The replication can't be turned into a job. This state can have several causes. For example, the replication is filtered, but the filter code could not be fetched from the source database.
pending - The replication job is scheduled to run, but isn't yet running.
running - The replication job is running.
crashing - A temporary error occurred that affects the replication job. The job is automatically retried later.
completed - The replication job completed. This state doesn't apply to continuous replications.
failed - The replication job failed. The failure is permanent. This state means that no further attempt is made to replicate by using this replication task. The failure might be caused in several different ways, for example, if the source or target URLs aren't valid.

The transition between these states is illustrated in the following diagram:

Figure: Replication Scheduler states

The scheduler introduces two new endpoints:

/_scheduler/docs
/_scheduler/jobs

You can manage and determine replication status more quickly and easily by using these endpoints.

The typical process for using the replication scheduler to manage and monitor replications is as follows:

Create a replication document that describes the needed replication, and store the document in the _replicator database.
Monitor the status of the replication by using the /_scheduler/docs endpoint.
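The two steps above can be sketched as follows. The helper names, the polling interval, and the injected `getJson` function are assumptions made for illustration. The sketch also assumes that the status of a single replication document is available under `/_scheduler/docs/_replicator/$DOCID` and that the response contains a `state` field.

```javascript
// States after which the scheduler makes no further attempts, per the list above.
const TERMINAL_STATES = new Set(["completed", "failed"]);

// Build the two requests of the typical workflow. Nothing is sent here.
function replicationRequests(accountUrl, docId, replicationDoc) {
  const base = accountUrl.replace(/\/+$/, ""); // drop trailing slashes
  const id = encodeURIComponent(docId);
  return {
    create: {
      method: "PUT",
      url: `${base}/_replicator/${id}`,
      body: JSON.stringify(replicationDoc),
    },
    monitor: {
      method: "GET",
      url: `${base}/_scheduler/docs/_replicator/${id}`,
    },
  };
}

// Poll the monitor URL until a terminal state is seen.
// `getJson` is an injected function (for example, a thin wrapper around
// fetch) that resolves to the parsed JSON body of a GET request.
async function waitForReplication(getJson, monitorUrl, { intervalMs = 5000, maxPolls = 60 } = {}) {
  for (let poll = 0; poll < maxPolls; poll++) {
    const { state } = await getJson(monitorUrl);
    if (TERMINAL_STATES.has(state)) return state;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null; // gave up before a terminal state was seen
}
```

A continuous replication never reaches the completed state, so for those jobs a caller would more likely check for running or crashing than wait for a terminal state.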

Authentication during replication

In any production application, security of the source and target databases is essential. In order for replication to continue, authentication is necessary to access the databases. Checkpoints for replication are enabled by default, which means that replication requires write access to the source database.

To enable authentication during replication, supply a username and password for each database, as in the auth object of the following example. The replication process uses the supplied values for HTTP Basic Authentication.

See the following example of specifying username and password values for accessing source and target databases during replication:

{
  "source": {
    "url": "https://example.com/db",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  },
  "target": {
    "url": "https://$ACCOUNT.cloudant.com/db",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  }
}

For IAM credentials, use the example below to authenticate with an IAM API key:

{
  "source": {
    "url": "https://example.com/db",
    "auth": {
      "iam": {
        "apikey": "$APIKEY"
      }
    }
  },
  "target": {
    "url": "https://$ACCOUNT.cloudant.com/db",
    "auth": {
      "iam": {
        "apikey": "$APIKEY"
      }
    }
  }
}
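Both authentication styles shown above follow the same shape, so a replication document can be assembled programmatically. The following is a minimal sketch. The helper names `endpoint` and `replicationDoc` are hypothetical, and the credential values are placeholders that you would load from your own configuration.

```javascript
// Hypothetical helper: wrap a database URL with either IAM or basic credentials,
// matching the "auth" layouts used in the examples above.
function endpoint(url, credentials) {
  if (credentials.apikey !== undefined) {
    return { url, auth: { iam: { apikey: credentials.apikey } } };
  }
  if (credentials.username !== undefined && credentials.password !== undefined) {
    return {
      url,
      auth: { basic: { username: credentials.username, password: credentials.password } },
    };
  }
  throw new Error("Expected either { apikey } or { username, password }");
}

// Combine a source and a target with any extra replication options.
function replicationDoc(source, target, options = {}) {
  return { source, target, ...options };
}

const exampleDoc = replicationDoc(
  endpoint("https://example.com/db", { username: "$USERNAME", password: "$PASSWORD" }),
  endpoint("https://$ACCOUNT.cloudant.com/db", { apikey: "$APIKEY" })
);
```

Nothing in the helper prevents mixing the two styles, as here, where the source uses basic authentication and the target uses an IAM API key.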

Filtered replication

Sometimes you don't want to transfer all documents from source to target. To choose which documents to transfer, include one or more filter functions in a design document on the source. You can then tell the replicator to use these filter functions.

Filtering documents during replication is similar to the process of filtering the _changes feed.

A filter function takes two arguments:

The document to be replicated.
The replication request.

A filter function returns a true or false value. If the result is true, the document is replicated.

To set up filtering, use the selector field whenever possible. When you use the selector field, you can specify a filter without having to replicate the entire database. This method makes filtering faster and causes less load on IBM Cloudant. For more information, see the selector field documentation.

See the following example of a filter function:

function(doc, req) {
	return !!(doc.type && doc.type == "foo");
}

Filters are stored under the topmost filters key of the design document.

See the following example of storing a filter function in a design document:

{
	"_id": "_design/myddoc",
	"filters": {
		"myfilter": "function goes here"
	}
}
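Because the design document stores the filter as a string, one convenient approach is to write the filter as ordinary JavaScript, check it locally, and then serialize it. The sketch below assumes a hypothetical helper named `buildDesignDoc`. It relies on `Function.prototype.toString`, which returns the source text of the function.

```javascript
// The filter from the example above, written as an anonymous function
// expression so that its source text can be stored as-is.
const myfilter = function (doc, req) {
  return !!(doc.type && doc.type == "foo");
};

// Hypothetical helper: turn a map of filter functions into a design document.
function buildDesignDoc(name, filters) {
  const serialized = {};
  for (const [filterName, fn] of Object.entries(filters)) {
    serialized[filterName] = fn.toString(); // store the source as a string
  }
  return { _id: `_design/${name}`, filters: serialized };
}

const ddoc = buildDesignDoc("myddoc", { myfilter });

// Local sanity check before uploading the design document.
console.log(myfilter({ type: "foo" }, {})); // true
console.log(myfilter({ type: "bar" }, {})); // false
```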

A filtered replication is started by using a JSON statement that identifies the following items:

The source database.
The target database.
The name of the filter that is stored under the filters key of the design document.

See example JSON for starting a filtered replication:

{
  "source": {
    "url": "https://example.org/example-database",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  },
  "target": {
    "url": "https://$ACCOUNT.cloudant.com/example-database",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  },
  "filter": "myddoc/myfilter"
}

Arguments can be supplied to the filter function by including key: value pairs in the query_params field of the invocation.

See example JSON for starting a filtered replication with supplied parameters:

{
  "source": {
    "url": "https://example.org/example-database",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  },
  "target": {
    "url": "https://$ACCOUNT.cloudant.com/example-database",
    "auth": {
      "basic": {
        "username": "$USERNAME",
        "password": "$PASSWORD"
      }
    }
  },
  "filter": "myddoc/myfilter",
  "query_params": {
    "key": "value"
  }
}
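On the filter side, the supplied parameters can be read from the request argument. The sketch below assumes that the values from query_params are exposed to the filter as `req.query`. The filter name `byType` and the `type` parameter are made up for this illustration.

```javascript
// Hypothetical parameterized filter: replicate only documents whose type
// matches the value passed in query_params.
// Assumption: query_params are available to the filter as req.query.
const byType = function (doc, req) {
  return doc.type === req.query.type;
};

// The matching fragment of a replication document (source and target omitted).
const filterOptions = {
  filter: "myddoc/byType",
  query_params: { type: "foo" },
};

// Local check that mimics how the filter would be invoked.
const fakeRequest = { query: filterOptions.query_params };
console.log(byType({ _id: "a", type: "foo" }, fakeRequest)); // true
console.log(byType({ _id: "b", type: "bar" }, fakeRequest)); // false
```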

The selector option provides performance benefits when compared with using the filter option. Use the selector option whenever possible. For more information, see the selector documentation.
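As a rough illustration of the selector alternative, the earlier filter that keeps documents of type "foo" could be expressed declaratively. The sketch below assumes that the replication document accepts a selector field in the usual IBM Cloudant Query syntax. The helper names are hypothetical, and the local matcher handles only top-level field equality, a small subset of what selectors can express.

```javascript
// Hypothetical helper: a replication document that uses a selector
// instead of a filter function.
function selectorReplication(source, target, selector) {
  return { source, target, selector };
}

// Simplified local matcher for selectors made only of top-level
// "field: value" equality checks. Real selectors support far more.
function matchesEqualitySelector(doc, selector) {
  return Object.entries(selector).every(([field, value]) => doc[field] === value);
}

const selectorDoc = selectorReplication(
  { url: "https://example.org/example-database" },
  { url: "https://$ACCOUNT.cloudant.com/example-database" },
  { type: "foo" }
);

console.log(matchesEqualitySelector({ type: "foo" }, selectorDoc.selector)); // true
console.log(matchesEqualitySelector({ type: "bar" }, selectorDoc.selector)); // false
```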

Eliminating conflicts by using replication

Use the winning_revs_only: true option to replicate winning document revisions only. These revisions are the revisions that would be returned by the GET $ACCOUNT/$DATABASE/$DOCID API endpoint by default, or appear in the _changes feed with the default parameters.

{
	"source": {
	  "url": "https://example.org/example-database",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"target": {
	  "url": "https://$ACCOUNT.cloudant.com/example-database",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"winning_revs_only": true
}

Replication with this mode discards conflicting revisions, so it might be one way to remove conflicts through replication.

Replication IDs and checkpoint IDs that are generated by winning_revs_only: true replications differ from the IDs that are generated by default replications. As a result, you can first replicate the winning revisions, and later backfill the remaining revisions with a regular replication job.

The winning_revs_only: true option can be combined with filters or other options like continuous: true or create_target: true.
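The two-phase approach described above (winning revisions first, a regular backfill later) can be sketched as a pair of replication documents. The helper name and the document IDs are hypothetical and chosen only for this illustration.

```javascript
// Hypothetical helper: build one replication document that copies winning
// revisions only, and a second, regular one that backfills everything else.
function twoPhaseReplication(source, target, idPrefix = "example") {
  return [
    { _id: `${idPrefix}-winners`, source, target, winning_revs_only: true },
    { _id: `${idPrefix}-backfill`, source, target },
  ];
}

const [winnersDoc, backfillDoc] = twoPhaseReplication(
  { url: "https://example.org/example-database" },
  { url: "https://$ACCOUNT.cloudant.com/example-database" }
);
```

The second document would be stored only if the conflicting revisions are still wanted on the target. Skipping it leaves the target with winning revisions only.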

Named document replication

Sometimes you want to replicate only specific documents. For simple replications of this kind, you don't need to write a filter function. Instead, add the list of document IDs as an array in the doc_ids field.

See the following example replication of specific documents:

{
	"source": {
	  "url": "https://example.org/example-database",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"target": {
	  "url": "https://127.0.0.1:5984/example-database",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"doc_ids": ["foo", "bar", "baz"]
}

The `user_ctx` property and delegations

Replication documents can have a custom user_ctx property. This property defines the user context under which a replication runs.

An older way of triggering replications, by making a POST to the /_replicate/ endpoint, didn't need the user_ctx property. The reason is that at the moment of triggering the replication, all the necessary information about the authenticated user is available.

By contrast, the replicator database is a regular database. The information about the authenticated user is only present at the moment that the replication document is written to the database. In other words, the replicator database implementation is similar to a _changes feed consumption application, with ?include_docs=true set.

For replication, this implementation difference means that for nonadmin users, a user_ctx property that includes the user's name and a subset of their roles must be defined in the replication document. This requirement is addressed by a validation function present in the default design document of the replicator database. The function validates each document update. This validation function also ensures that a nonadmin user can't set a username property in the user_ctx property that doesn't correspond to the correct username. The same principle also applies for roles.

See the following example delegated replication document:

{
	"_id": "my_rep",
	"source": {
	  "url": "https://$SERVER.com:5984/foo",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"target": {
	  "url": "https://$ACCOUNT.cloudant.com/bar",
	  "auth": {
	    "basic": {
	      "username": "$USERNAME",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"continuous":  true,
	"user_ctx": {
		"name": "joe",
		"roles": ["erlanger", "researcher"]
	}
}

For admins, the user_ctx property is optional. If the property is missing, the value defaults to a user context with the name null and an empty list of roles.

The empty list of roles means that design documents aren't written to local targets during replication. If you want to write design documents to local targets, then a user context with the _admin role must be set explicitly.

Also, for admins, the user_ctx property can be used to trigger a replication for another user. This user context is passed to local target database document validation functions.

The user_ctx property applies to local endpoints only.

In summary, the user_ctx property is optional for admins and mandatory for regular (nonadmin) users. When the roles property of user_ctx is missing, it defaults to the empty list [ ].
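The rules summarized above can be expressed as a small check. This sketch is not the validation function that ships with the replicator database. It is a simplified illustration with a hypothetical name (`checkUserCtx`) and a hypothetical return shape.

```javascript
// Simplified illustration of the user_ctx rules described above.
// `requester` is the authenticated user: { name, roles }.
function checkUserCtx(requester, replicationDoc) {
  // For admins, user_ctx is optional and may name another user.
  if (requester.roles.includes("_admin")) return { ok: true };

  const ctx = replicationDoc.user_ctx;
  if (!ctx) {
    return { ok: false, reason: "user_ctx is mandatory for nonadmin users" };
  }
  if (ctx.name !== requester.name) {
    return { ok: false, reason: "user_ctx.name must match the authenticated user" };
  }
  const roles = ctx.roles || []; // a missing roles property defaults to []
  const extra = roles.filter((role) => !requester.roles.includes(role));
  if (extra.length > 0) {
    return { ok: false, reason: `roles not held by the user: ${extra.join(", ")}` };
  }
  return { ok: true };
}
```

For the delegated replication document shown earlier, the check passes for a user named joe who holds at least the erlanger and researcher roles.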

Performance-related options

Several performance-related options can be set for a replication, by including them in the replication document.

use_bulk_get: Set to true to improve replication job performance. A replication then fetches documents in batches from the source. The increased replication rate can consume the capacity of the source and target accounts more quickly, which can cause concurrent nonreplication requests on those accounts to receive more 429 HTTP errors.
connection_timeout: The maximum period of inactivity for a connection in milliseconds. If a connection is idle for this period, its current request is retried. Default value is 30000 milliseconds (30 seconds).
http_connections: The maximum number of HTTP connections per replication. For push replications, the effective number of HTTP connections that are used is min(worker_processes + 1, http_connections). For pull replications, the effective number of connections that are used corresponds to this parameter's value. Default value is 20.
retries_per_request: The maximum number of retries per request. Before a retry, the replicator waits for a short period before it repeats the request. This period doubles between each consecutive retry, and never goes beyond 5 minutes. The minimum value before the first retry is 0.25 seconds. The default value is 10 retries.
socket_options: A list of options to pass to the connection sockets. The available options can be found in the documentation for the Erlang function setopts of the inet module. Default value is [{keepalive, true},{nodelay, false}].
worker_batch_size: Worker processes run batches of replication tasks, where the batch size is defined by this parameter. The size corresponds to the number of _changes feed rows. Larger values for the batch size might result in better performance. Smaller values mean that checkpointing is done more frequently. Default value is 500.
worker_processes: The number of processes the replicator uses in each replication task to transfer documents from the source to the target database. Higher values might produce better throughput because of greater parallelism in network and disk IO activities, but this improvement comes at the cost of requiring more memory and potentially CPU time. Default value is 4.
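Two of the options above involve a little arithmetic, which the following sketch works through by using the defaults and limits stated in the list. The function names are hypothetical, and the code only mirrors the description. It is not the replicator's actual implementation.

```javascript
// Effective number of HTTP connections, per the http_connections description.
function effectiveHttpConnections(direction, { worker_processes = 4, http_connections = 20 } = {}) {
  return direction === "push"
    ? Math.min(worker_processes + 1, http_connections)
    : http_connections;
}

// Waiting periods (in seconds) before each retry: the first wait is
// 0.25 seconds, each wait doubles, and none exceeds 5 minutes (300 seconds).
function retryDelaysSeconds(retries = 10, firstDelay = 0.25, cap = 300) {
  const delays = [];
  let delay = firstDelay;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(delay, cap));
    delay *= 2;
  }
  return delays;
}

console.log(effectiveHttpConnections("push")); // 5
console.log(effectiveHttpConnections("pull")); // 20
console.log(retryDelaysSeconds());
// [ 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128 ]
```

With the default of 10 retries, the waits add up to a little over four minutes, and the 5-minute cap only comes into play from the twelfth retry onward.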

See the following example that includes performance options in a replication document:

{
	"source": {
	  "url": "https://example.com/example-database",
	  "auth": {
	    "basic": {
	      "username": "$ACCOUNT1",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"target": {
	  "url": "https://example.org/example-database",
	  "auth": {
	    "basic": {
	      "username": "$ACCOUNT2",
	      "password": "$PASSWORD"
	    }
	  }
	},
	"connection_timeout": 60000,
	"retries_per_request": 20,
	"http_connections": 30
}

The effect of large attachments

Having large numbers of attachments on documents might cause an adverse effect on replication performance.

For more information about the effect of attachments on replication performance, see Performance considerations.

Avoiding the `/_replicate` endpoint

Use the _replicator scheduler instead of the /_replicate endpoint.

If a problem occurs during replication, such as a stall, timeout, or application crash, a replication that is defined within the _replicator database is automatically restarted by the system. However, if you define a replication by sending a request to the /_replicate endpoint, it can't be restarted by the system if a problem occurs because the replication request doesn't persist. Replications that are defined in the _replicator database are easier to monitor.
