High availability

IBM Cloud® Databases for PostgreSQL is a managed cloud database service that is fully integrated into the IBM Cloud environment. The database, storage, and supporting infrastructure all run in IBM Cloud.

Databases for PostgreSQL provides replication, fail-over, and high-availability features to protect your databases and data from infrastructure maintenance, upgrades, and failures. Deployments contain a cluster with two data members, a leader and a replica. Both members contain a copy of your data by using asynchronous replication, with a distributed consensus mechanism to maintain cluster state and handle failovers. If the leader becomes unreachable, the cluster initiates a failover, and the replica is promoted to leader. The replica rejoins the cluster and your cluster continues to operate normally.

Databases for PostgreSQL will, at times, do controlled switchovers under normal operation. These switchovers are no-data-loss events that result in resets of active connections. There is a period of up to 15 seconds where reconnections can fail. At times, unplanned failovers might occur due to unforeseen events on the operating environment. These can take up to 45 seconds, but can be less. In both cases, potential exists for the downtime to be longer.

You can extend high-availability further by adding PostgreSQL members to the instance, for greater in-region redundancy, or by provisioning read-only replicas for cross-regional failover or read offloading.

Databases for PostgreSQL is designed and built to provide a robust, resilient, and performant Database as a Service offering. Review the PostgreSQL documentation on replication techniques to understand the constraints and tradeoffs that are associated with the asynchronous replication strategy that is deployed by default with Databases for PostgreSQL.

In scenarios where a database becomes critically unhealthy, such as a server crash on the leader, Databases for PostgreSQL attempts a failover. This auto failover capability is capped at 16 MB of data lag from leader to follower (a few rows of data once accounting for more PostgreSQL data overhead) and is not performed if the lag threshold is exceeded. If the potential for 16 MB of data loss is intolerable for the application, horizontally scale your Databases for PostgreSQL instance to three members and configure Databases for PostgreSQL to use a synchronous replication strategy on a per user or per database basis.

Synchronous replication

By default, streaming replication is asynchronous. If the primary server crashes, some transactions that were committed might not have been replicated to the standby server, causing data loss. Cloud Databases ensures that data loss is kept to a minimum substantial data loss; however, synchronous replication offers the ability to confirm that all changes were made by a transaction have been transferred to a synchronous member, ensuring consistency across a cluster. This consistency come from confirming that writes are written to a secondary before returning to the connecting client with a success. For variables regarding synchronous replication, see synchronous_commit on the Changing Configuration page.

Synchronous replication brings replica availability into the primary write path. If there is no replica to acknowledge the write, the write hangs until a replica is available. This requires at least three members to function reliably, as synchronous replication is not supported on two-member deployments. You must horizontally scale to at least three members before synchronous replication can be enabled.

While unlikely, it is possible that more than one replica might become unavailable simultaneously. If this happens, the primary database will not be able to complete any writes until a replica comes back online, effectively blocking all write traffic to your database. When you are deciding to use synchronous replication, weigh the relative costs and benefits of higher data durability versus potential availability issues.

Employing synchronous replication negatively impacts the performance of the database. Typically, a performant and effective way to employ this feature is by using it only on specific databases or users that must have the highest degree of data durability available.

Application-level high-availability

Applications that communicate over networks and cloud services are subject to transient connection failures. You want to design your applications to retry connections when errors are caused by a temporary loss in connectivity to your deployment or to IBM Cloud.

Because Databases for PostgreSQL is a managed service, regular updates and database maintenance occurs as part of normal operations. This can occasionally cause short intervals where your database is unavailable. It can also cause the database to trigger a graceful fail-over, retry, and reconnect. It takes a short time for the database to determine which member is a replica and which is the leader, so you might also see a short connection interruption. Failovers generally take less than 30 seconds.

Your applications have to be designed to handle temporary interruptions to the database, implement error handling for failed database commands, and implement retry logic to recover from a temporary interruption.

Several minutes of database unavailability or connection interruption are not expected. Open a support case with details if you have time periods longer than a minute with no connectivity so we can investigate.

Connection limits

Databases for PostgreSQL sets the maximum number of connections to your PostgreSQL database to 115. 15 connections are reserved for the superuser to maintain the state and integrity of your database, and 100 connections are available for you and your applications. After the connection limit is reached, any attempts at starting a new connection results in an error. To prevent overwhelming your deployment with connections, use connection pooling, or scale your deployment and increase its connection limit. For more information, see the Managing PostgreSQL connections page.

High availability, disaster recovery, and SLA resources

Databases for PostgreSQL deployments conform to the IBM Cloud Databases High availability, Disaster recovery, and Service Level Agreement (SLA) information and terms.