Understanding high availability and disaster recovery for Databases for Redis

High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. More formally, it is the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.

Disaster recovery (DR) is the process of recovering the service instance to a working state. It covers the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. The impact exceeds the ability of the high availability design to handle it.

Databases for Redis is a regional service that fulfills the defined Service Level Objectives (SLO) with the Standard plan.

For more information about the available IBM Cloud regions and data centers for Databases for Redis, see Service and infrastructure availability by location.

High availability architecture

Databases for Redis provides replication, failover, and high-availability features to protect your databases and data from infrastructure maintenance, upgrades, and some failures. Deployments contain a cluster with two data members in a primary plus replica configuration. The replica is kept up to date using asynchronous replication. High availability is monitored and managed with three Redis sentinels.

By default, data persistence is enabled on all deployments and your data is written to disk. Databases for Redis uses a combination of RDB snapshots and AOF (Append Only File) to persist data to disk. The interval for Databases for Redis to write to disk (fsync) is set to once every second to balance durability and performance.

You can turn off data persistence, which is useful for configuring Redis as a cache.

High availability features

Databases for Redis supports the following high availability features:

High availability features
Feature	Description	Consideration
Automatic failover	Standard on all clusters and resilient against a zone or single member failure
Member count	Minimum - 2 members. Default is a Standard two member cluster in a primary and replica configuration. A two-member cluster will automatically recover from a single instance or zone failure (with data loss up to the lag threshold).	Three Sentinel nodes to monitor the health of the cluster and coordinate failovers.
Asynchronous replication	Enables replication from primary to replica without blocking the write path, ensuring high availability with low latency. Refer to Asynchronous replication below.	May result in data loss during failover due to replication lag (RPO > 0). Not suitable where strict data durability is required.

Asynchronous replication for Databases for Redis

By default, Databases for Redis uses asynchronous replication, where the primary node does not wait for the replica to acknowledge writes. This ensures low latency and high throughput, making Databases for Redis ideal for caching and performance-sensitive workloads. However, in the event of a primary failure, replication lag can lead to data loss, as the replica may not have received the most recent writes.

Databases for Redis replication is designed for high availability, not strict durability. A failover is automatically triggered if the primary becomes unreachable, promoting the replica to leader. Because replication is asynchronous, some committed writes may be lost during this process. This replication lag defines the Recovery Point Objective (RPO) of Databases for Redis deployments.

To reduce data loss risk, Databases for Redis supports persistence mechanisms like RDB snapshots and AOF (Append Only File), which write data to disk independently of the replication process. Configure these carefully based on workload requirements.

Asynchronous replication in Databases for Redis ensures fast performance but does not eliminate the possibility of data loss during failover events. It is recommended for workloads where speed and availability outweigh strict data consistency.
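
One rough way to gauge how much data is exposed by replication lag is to compare the primary's replication offset with the offset each replica has reached. The following is a minimal sketch, not an official tool: it parses text in the format of the Redis INFO replication section. The function name replication_lag_bytes and the sample text are illustrative, and it assumes that your deployment and user are permitted to run INFO.

```python
def replication_lag_bytes(info_text: str) -> dict[str, int]:
    """Return how many bytes each replica is behind the primary.

    Expects text in the format of the Redis `INFO replication` section,
    as reported by the primary.
    """
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value

    primary_offset = int(fields["master_repl_offset"])
    lags = {}
    for key, value in fields.items():
        # Replica lines look like:
        #   slave0:ip=...,port=...,state=online,offset=...,lag=...
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(item.split("=", 1) for item in value.split(","))
            lags[key] = primary_offset - int(attrs["offset"])
    return lags


# Illustrative sample only; real values come from your own deployment.
SAMPLE_INFO = """\
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
master_repl_offset:1250
"""

lags = replication_lag_bytes(SAMPLE_INFO)
print(lags)
```

A nonzero value means that writes acknowledged by the primary have not yet reached the replica and could be lost if a failover happened at that moment.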

Disaster recovery architecture

The general strategy for disaster recovery is to create a new database, as in the backup restore feature described below. The new database can be populated from a backup of the source database that was created before the disaster.

Disaster recovery features

Databases for Redis supports the following disaster recovery features:

Disaster recovery features
Feature	Description	Consideration
Backup restore	Create database from previously created backup; see Managing Cloud Databases backups.	New connection strings for the restored database must be referenced throughout the workload.

Planning for disaster recovery

The disaster recovery steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.

Failure scenarios and resolutions
Failure	Resolution
Hardware failure (single point)	IBM provides a database that is resilient to a single point of hardware failure within a zone. No customer configuration required.
Zone failure	Automatic failover. The database members are distributed between zones.
Data corruption	Backup restore. Use the restored database in production, or use it as source data to correct the corruption in the original database.

Application-level high-availability

Applications that communicate over networks and cloud services are subject to transient connection failures. You want to design your applications to retry connections when errors are caused by a temporary loss in connectivity to your deployment or to IBM Cloud.

Because Databases for Redis is a managed service, regular updates and database maintenance occur as part of normal operations. This can occasionally cause short intervals where your database is unavailable. It can also trigger a graceful failover, after which your application must retry and reconnect. It takes a short time for the database to determine which member is a replica and which is the leader, so you might also see a short connection interruption. Failovers generally take less than 30 seconds.

Your applications must be designed to handle temporary interruptions to the database, implement error handling for failed database commands, and implement retry logic to recover from a temporary interruption. Use ioredis, node-redis, or any other client package of your choice to ensure continuity of your application. For more information, see the Error detection and handling with Redis blog post.
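
The retry logic described above can be sketched as a small wrapper with exponential backoff and jitter. This is an illustration under stated assumptions rather than a recommended implementation: the name call_with_retry, the delay values, and the simulated flaky_ping operation are all hypothetical, and the 30-second budget simply mirrors the typical failover time mentioned earlier. A real Redis client raises its own exception types, which you would pass in through the retryable parameter. Many clients also ship built-in retry options that may be preferable.

```python
import random
import time


def call_with_retry(operation, retryable=(ConnectionError, TimeoutError),
                    max_wait=30.0, base_delay=0.1, max_delay=5.0,
                    sleep=time.sleep):
    """Call operation(), retrying on transient errors with backoff.

    Gives up and re-raises the last error once the total time spent
    sleeping would exceed max_wait seconds.
    """
    waited = 0.0
    attempt = 0
    while True:
        try:
            return operation()
        except retryable:
            # Exponential backoff, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait between half the cap and the full cap so that
            # many clients do not reconnect at the same instant.
            delay = cap / 2 + random.uniform(0, cap / 2)
            if waited + delay > max_wait:
                raise
            sleep(delay)
            waited += delay
            attempt += 1


# Simulated command that fails twice, as it might during a failover.
attempts = {"count": 0}


def flaky_ping():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated failover in progress")
    return "PONG"


# A no-op sleep keeps the demonstration fast.
result = call_with_retry(flaky_ping, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Only retry commands that are safe to repeat. A write that timed out may already have been applied on the primary.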

Several minutes of database unavailability or connection interruption are not expected. Open a support case with details if you have periods longer than a minute with no connectivity so we can investigate.

Connection limits

Databases for Redis sets a maximum of 10,000 concurrent connections per deployment. This limit ensures performance stability and resource management within your Redis environment. However, not all 10,000 connections are available to clients — a portion is reserved internally for operations that maintain the state and integrity of the deployment. After reaching the connection limit, any attempts at starting a new connection result in an error. For more information, see Managing Redis connections.
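
When many application instances share one deployment, it can help to budget connections up front. The sketch below divides the documented 10,000-connection limit across instances after setting aside headroom. The reserved value of 500 is purely an assumption for illustration, because the number of connections reserved internally is not stated here, and the function name pool_size_per_instance is hypothetical.

```python
MAX_CONNECTIONS = 10_000  # documented per-deployment limit


def pool_size_per_instance(app_instances: int, reserved: int = 500,
                           max_connections: int = MAX_CONNECTIONS) -> int:
    """Return the largest per-instance pool size that stays under the limit.

    reserved is an assumed headroom for internal operations; the real
    figure is not documented in this section.
    """
    if app_instances < 1:
        raise ValueError("app_instances must be at least 1")
    available = max_connections - reserved
    if available < app_instances:
        raise ValueError("not enough connections for every instance")
    return available // app_instances


size = pool_size_per_instance(app_instances=40)
print(size)
```

Setting each client's maximum pool size at or below this value keeps the combined total under the limit even when every instance is at peak load.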

Your responsibilities for HA and DR

The following information can help you create and continuously practice your plan for HA and DR.

When restoring a database from backups or using point-in-time restore, a new database is created with new connection strings. Existing workloads and processes must be adjusted to consume the new connection strings.
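
One way to make the switch to new connection strings less disruptive is to keep the connection string in configuration rather than in code. The following is a minimal sketch that reads it from an environment variable and splits it into its parts. The variable name REDIS_URL, the function name load_redis_endpoint, and the example host and credentials are all hypothetical; substitute the values from your own deployment.

```python
import os
from urllib.parse import unquote, urlsplit


def load_redis_endpoint(env_var: str = "REDIS_URL") -> dict:
    """Read a Redis connection string from the environment and parse it."""
    parts = urlsplit(os.environ[env_var])
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 6379,
        # The rediss scheme indicates a TLS connection.
        "tls": parts.scheme == "rediss",
        "username": parts.username,
        # Passwords in URLs are percent-encoded, so decode them.
        "password": unquote(parts.password) if parts.password else None,
    }


# Hypothetical value; after a restore, only this setting needs to change.
os.environ["REDIS_URL"] = (
    "rediss://admin:example-password@restored-host.example.com:31234/0"
)
endpoint = load_redis_endpoint()
print(endpoint["host"], endpoint["port"], endpoint["tls"])
```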

A recovered database may also need the same customer-created dependencies as the original database. Ensure that these and other services exist in the recovered region:

IBM® Key Protect for IBM Cloud®
Hyper Protect Crypto Services

Remember that deleting a database also deletes its associated backups. However, deleted databases may be recoverable within a limited timeframe. For more information, see Backups FAQ.

It is not possible to copy backups off the IBM Cloud, so consider using the database-specific tools for additional backups. Such additional backups may be required to recover from a malicious database deletion followed by a reclamation-delete of the database. Careful management of IAM access to databases can help reduce exposure to this problem.

The following checklist associated with each feature can help you create and practice your plan.

Backup restore
- Verify that backups are available at the desired frequency to meet RPO requirements. Managing Cloud Databases backups documents backup frequency.
- There are some restrictions on database restore regions. Verify that your restore goals can be achieved by reading Managing Cloud Databases backups.
- Verify that the retention period of the backups meets your requirements.
- Schedule test restores regularly to verify that the actual restore times meet the defined RTO. Remember that database size significantly impacts restore time. Consider strategies to minimize restore times, such as breaking down large databases into smaller, more manageable units and purging unused data.
- Verify the Key Protect service.
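
The first checklist item can be partly automated. As a rough sketch, the function below takes a list of backup completion times and reports the longest interval during which no backup existed, which is the worst-case data loss if you had to restore from backups alone. The function name worst_case_backup_gap, the timestamps, and the 24-hour RPO are made up for illustration; how you obtain the real backup times depends on your tooling.

```python
from datetime import datetime, timedelta, timezone


def worst_case_backup_gap(backup_times: list[datetime],
                          now: datetime) -> timedelta:
    """Return the longest gap between consecutive backups, including
    the gap between the most recent backup and now."""
    if not backup_times:
        raise ValueError("no backups found")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


# Illustrative data: one backup per day at midnight UTC.
backups = [
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    datetime(2024, 1, 3, tzinfo=timezone.utc),
]
now = datetime(2024, 1, 3, 18, 0, tzinfo=timezone.utc)

rpo = timedelta(hours=24)  # hypothetical requirement
gap = worst_case_backup_gap(backups, now)
print(gap, gap <= rpo)
```

A similar check can time each test restore and compare the elapsed time against the RTO.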

To find out more about responsibility ownership between the customer and IBM Cloud for using Databases for Redis, see Shared responsibilities for Cloud Databases.

Stay informed: IBM notifications

Updates affecting customer workloads are communicated through IBM Cloud notifications. To stay informed about planned maintenance, announcements, and release notes related to this service, refer to the Monitoring notifications and status page. In addition, regularly review the Version policy page for the latest updates on End-of-Life versions and dates.