High availability (HA)

Attention: IBM Blockchain Platform SaaS Edition has been replaced by IBM Support for Hyperledger Fabric and is no longer supported after July 31, 2023. Customers were directed to migrate their networks by July 31, 2023. After this date, IBM Blockchain Platform SaaS networks that are not migrated to IBM Support for Hyperledger Fabric are at risk for potential security vulnerabilities. A migration tool is provided from your console, and the disruption to your network is minimal. See Migrating to IBM Support for Hyperledger Fabric for details.

Use the built-in Kubernetes features along with IBM® Blockchain Platform component deployment strategies to make your blockchain networks more highly available and protect your network from downtime when a failure occurs in your cluster.

Target audience: This topic is designed for architects and system administrators who are responsible for planning and configuring IBM Blockchain Platform on IBM Cloud.

High availability is a core discipline in an IT infrastructure to keep your apps up and running, even after a partial or full site failure. The main purpose of high availability is to eliminate potential points of failure in an IT infrastructure. For example, you can prepare for the failure of one system by adding redundancy and setting up failover mechanisms.

You can achieve high availability on different levels in your IT infrastructure and within different layers of your cluster. The level of availability that is right for you depends on several factors, such as your business requirements, the Service Level Agreements that you have with your organizations, and the cost of redundancy.

Before proceeding, we recommend that you learn more about HA on Kubernetes by reviewing High availability for the IBM Cloud Kubernetes service or OpenShift cluster.

Then, use this topic for blockchain-specific HA guidance along with the recommendations in the platform-specific topic above.

Overview of potential points of failure in IBM Blockchain Platform

The IBM Blockchain Platform architecture is designed to ensure reliability, low-latency processing, and maximum uptime of the service. However, failures can happen. IBM Blockchain Platform provides several approaches for adding availability to your cluster through redundancy and anti-affinity policies. Anti-affinity ensures that blockchain components of the same type and organization are deployed across different worker nodes. By adding redundancy across your blockchain network, you can reduce the impact of failures and avoid downtime.

To achieve maximum high availability, it is recommended that you build redundancy by provisioning peers and orderers in Kubernetes clusters in multiple regions. When the components are spread across regions and the blockchain ledger is distributed across those components, a failure in any single region does not impact the processing of transactions. CAs are less critical for daily transaction processing: after all of the users are registered and enrolled with the CA, it is not needed again until the next time registration or enrollment is required.

Peer considerations

HA for peers means always having redundant peers, that is, at least two peers available for each organization on the same channel, to process requests from client applications. Multiple peers can be deployed to a single worker node, or spread across worker nodes, zones (if you are using IBM Cloud), or even regions. Whenever you deploy multiple peers and join them to the same channel, the peers act as HA pairs because the channel and its data are automatically synchronized across all peers in the channel. By design, a blockchain network is meant to have multiple organizations that transact on the same channels. Therefore, the common deployment model is that, for any given channel, each organization has redundant peers spread across several organizations' clusters, all synchronizing data with each other. Each organization can have a peer in its own cluster in any region.

For even more robust HA coverage, you can stand up multiple clusters in multiple regions and deploy peers in all of them. However, if high performance is desired, take care when distributing peers to ensure that the latency and bandwidth between them are sufficient to achieve your performance targets.

Anchor peers on a channel facilitate the cross-organization communication that is required for private data, gossip (a protocol for secure, reliable, and scalable communication of information in a network by passing messages among peers), and service discovery to work. If only one anchor peer exists on a channel and that peer becomes unavailable, the organizations are no longer connected and cross-organization gossip is no longer possible. Therefore, when you create redundant peers for an organization, be sure to add redundant anchor peers on the channel as well.

Finally, your peer redundancy strategy needs to take into account your smart contract endorsement policies so that you always have enough peers available to satisfy the endorsement policy requirements. An endorsement policy defines which peer nodes on a channel must execute transactions for a specific chaincode application and the required combination of endorsements; for example, a policy can require that a transaction be endorsed by a minimum number of endorsing peers, a minimum percentage of endorsing peers, or all endorsing peers that are assigned to a specific chaincode application. If an endorsement policy requires a specific number of endorsements, your peer HA strategy needs to ensure that this number of peers is always available. Alternatively, if the endorsement policy requires a MAJORITY of peers to endorse transactions, you need to ensure that a majority of the peers are always available in order for transactions to continue to be processed.
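
The following sketch models that last point in Python. It is illustrative only: the helper names and the simple organization count are assumptions, not part of any Fabric or IBM Blockchain Platform API.

```python
# Illustrative sketch: can the peers that are currently reachable still satisfy
# an endorsement policy? The function names and inputs are hypothetical.

def majority_threshold(total_orgs: int) -> int:
    """Smallest number of organizations that constitutes a majority."""
    return total_orgs // 2 + 1

def can_endorse(orgs_with_available_peer: int, required_orgs: int) -> bool:
    """True if enough organizations still have at least one endorsing peer online."""
    return orgs_with_available_peer >= required_orgs

# Example: a channel with 3 endorsing organizations and a MAJORITY policy.
required = majority_threshold(3)        # 2 organizations must endorse
print(can_endorse(3, required))         # True  - all organizations up
print(can_endorse(2, required))         # True  - one organization down
print(can_endorse(1, required))         # False - transactions stall
```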

Ordering service considerations

IBM Blockchain Platform is built on Hyperledger Fabric v2.2.10, which includes the Raft ordering service (the service that provides a shared communication channel to clients and peers for the broadcast of messages that contain transactions). Raft is a crash fault tolerant (CFT) ordering service based on an implementation of the Raft protocol. By design, Raft ordering nodes automatically synchronize data between themselves by using Raft-based consensus. In IBM Blockchain Platform, an organization network operator can choose to stand up either a single-node Raft-based orderer, with no HA, or five ordering nodes in a single region or across multiple regions that are automatically configured for HA through Raft.
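
A minimal sketch of the Raft quorum arithmetic that makes the five-node service highly available; these helpers are for reasoning only and are not part of any Fabric or IBM Blockchain Platform API.

```python
def quorum(ordering_nodes: int) -> int:
    """Majority of ordering nodes that must stay up to keep ordering transactions."""
    return ordering_nodes // 2 + 1

def tolerable_failures(ordering_nodes: int) -> int:
    """How many ordering nodes can fail before the quorum is lost."""
    return ordering_nodes - quorum(ordering_nodes)

for n in (1, 5):
    print(f"{n} ordering node(s): quorum={quorum(n)}, can lose {tolerable_failures(n)}")
# 1 ordering node(s): quorum=1, can lose 0  -> no HA
# 5 ordering node(s): quorum=3, can lose 2
```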

Certificate Authority (CA) considerations

Because CA (Certificate Authority) nodes, the trusted authorities that issue digital certificates and typically verify the identity of the individuals who are granted them, process user registration and enrollment requests for your blockchain network, they are typically not high-throughput nodes. However, if HA is important for your CA strategy, you can configure your nodes for maximum availability. Kubernetes provides built-in HA by immediately restarting the pod if the CA becomes unavailable. However, if you cannot tolerate any downtime for your CA, you can configure replica sets for your CA. A replica set is a Kubernetes mechanism that is used to guarantee the availability of the pod. The use of replica sets ensures that multiple replicas of the pod are running at all times. If one of the CA replicas becomes unavailable, Kubernetes immediately switches to another CA replica, so there is no downtime waiting for a CA to restart.
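
If you want to confirm how many CA replicas are configured and ready, a minimal sketch using the official kubernetes Python client follows. The deployment name, namespace, and the assumption that the CA is backed by a Kubernetes Deployment are illustrative; substitute the values from your own cluster.

```python
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running inside a pod
apps = client.AppsV1Api()

# Assumed names: adjust "org1-ca" and "ibp-namespace" to match your environment.
dep = apps.read_namespaced_deployment(name="org1-ca", namespace="ibp-namespace")
print(f"configured replicas: {dep.spec.replicas}, ready replicas: {dep.status.ready_replicas}")
```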

HA Checklist

The following table contains a list of options to consider as you plan for increasing degrees of HA.

Table 1. Comparison of deployment scenarios to increase your network HA
| HA option | Single node | Single cluster with multiple nodes | Multizone | Multiple clusters across regions |
| --- | --- | --- | --- | --- |
| Redundant peers | ✓ | ✓ | ✓ | ✓ |
| Redundant anchor peers on a channel | ✓ | ✓ | ✓ | ✓ |
| Anti-affinity (peers) | | ✓ | | |
| Multi-zone (peers) | | | ✓ | |
| Raft ordering service | ✓ | ✓ | ✓ | ✓ |
| Ordering nodes | ✓ | ✓ + anti-affinity | ✓ + anti-affinity | ✓ |
| CA replica sets | | ✓ | ✓ | |
| Anti-affinity (CA replica sets) | | ✓ | ✓ | |
| Development or Test environment | ✓ | ✓ | | |
| Production environment | | | ✓ | ✓ |

The default configuration for a standard Kubernetes cluster on IBM Cloud is a 4 CPU x 16 GB RAM cluster that includes three zones with three worker nodes each. You can scale up or down by selecting a smaller or larger configuration, according to your needs.

The IBM Blockchain Platform deployer attempts to spread peers, ordering nodes, and CA replica sets across different worker nodes, but cannot guarantee this placement because of resource limitations. You can also use the IBM Blockchain Platform APIs or the blockchain console to deploy peers or ordering nodes to specific zones to ensure that they are resilient to a zone failure. For more information, see Multizone HA.

Potential points of failure

IBM Blockchain Platform offers several approaches to add more availability to your network by adding redundancy and using anti-affinity policies. Review the following diagrams to learn more about potential points of failure and how to eliminate them. You can select a model based on your application criticality, service levels, and costs. As a general rule, implement the redundancy that is needed to meet your service levels. All of these scenarios must be weighed against the cost of implementing greater resiliency.

Single region HA

Figure 1. Blockchain HA single region options

  1. Component failure.

    Single-zone cluster:

    Every time that you deploy a blockchain component, such as a peer, ordering node, or CA, a new pod is created for the component in a worker node. Containers and pods are, by design, short-lived and can fail unexpectedly. For example, a container or pod might crash if an error occurs in your component. So, to make your node highly available, you must ensure that you have enough instances of it to handle the workload plus extra instances in case of a failure.

    Peers: How many peers are required? In a production scenario, the recommendation is to deploy three peers from the same organization to each channel. This configuration allows one peer to go down (for example, during a maintenance cycle) and still maintain two highly available peers. Therefore, to compensate for a peer failure, and for the most basic level of HA, you can achieve peer redundancy by simply deploying three peers per organization on a channel on your worker node. Note that you need to ensure that you have adequate resources available on your node to support these components.

    Ordering service: As mentioned above, the HA ordering service is based on Raft and contains five ordering nodes by default. Because the system can sustain the loss of nodes, including leader nodes, as long as a majority of ordering nodes (what is known as a "quorum") remains, Raft is said to be "crash fault tolerant" (CFT). In other words, if you have five nodes in a channel, you can lose two nodes (leaving three remaining nodes). When you deploy an ordering service from the console, choose the five node service for HA.

    CA: You can configure replica sets for your CA; they are represented as the shaded CA boxes in the diagram above. Replica sets guarantee that if the CA node goes down, a CA replica immediately begins processing requests. You must provision an instance of a PostgreSQL database if you plan to use CA replica sets. See these instructions for more information about how to configure CA replica sets.

    This scenario uses redundant peers, ordering nodes, and CAs on a single worker node, which protects against component failure, but cannot protect from node failure. Therefore, it is only suitable for development and testing purposes.

  2. Worker node failure.

    Single-zone cluster with multiple worker nodes and anti-affinity:

    A worker node is a VM that runs on physical hardware. Worker node failures include hardware outages, such as power, cooling, or networking, and issues on the VM itself. You can account for a worker node failure by setting up multiple worker nodes when you provision your cluster. When blockchain components are distributed across multiple worker nodes, you are protected from a worker node failure.

    Peers: The IBM Blockchain Platform deployer's anti-affinity policy distributes redundant peers, that is, peers from the same organization, across the worker nodes in the cluster (a short verification sketch follows this list).

    Ordering service: Whenever you deploy a Raft ordering service, the five ordering nodes are automatically distributed across the worker nodes in your cluster, using the anti-affinity policy and based on resource availability on the nodes.

    CAs: Like peers and ordering nodes, if replica sets are chosen for a CA, an anti-affinity policy automatically distributes the CA replica sets across worker nodes in the cluster, based on resource availability.

    This scenario uses redundant peers, ordering nodes, and CA replica sets across multiple worker nodes in a single cluster or zone, which protects against node failure, but cannot protect from a cluster or zone failure. Therefore, it is not recommended for production.
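
The following sketch, assuming the official kubernetes Python client and an illustrative label selector and namespace (not fixed IBM Blockchain Platform values), groups component pods by worker node so you can see whether the anti-affinity policy actually spread them out.

```python
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Assumed namespace and label selector: replace with the values used in your cluster.
pods = core.list_namespaced_pod(
    namespace="ibp-namespace",
    label_selector="app.kubernetes.io/instance=org1",
)

pods_per_node = defaultdict(list)
for pod in pods.items:
    pods_per_node[pod.spec.node_name].append(pod.metadata.name)

for node, names in pods_per_node.items():
    print(node, "->", ", ".join(names))
# If anti-affinity took effect (and resources allowed it), redundant components
# from the same organization should be listed under different worker nodes.
```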

Multizone HA

Figure 2. Blockchain HA multizone options

Zone failure.

Multizone clusters with multiple worker nodes and anti-affinity:

Think of a zone as a data center. A zone failure affects all physical compute hosts and NFS storage. Failures include power, cooling, networking, or storage outages, and natural disasters, like flooding, earthquakes, and hurricanes. To protect against a zone failure, you must have clusters in at least two different zones that are load balanced by an external load balancer. By default, when you deploy a Kubernetes cluster in IBM Cloud, the cluster is configured with multizone support, including three zones, although you can choose two zones.

A single zone is sufficient for a development and test environment if you can tolerate a zone outage. To leverage the HA benefits of multiple zones, ensure that multiple zones are selected when you provision your cluster. Two zones are better than one, but three are recommended for HA to increase the likelihood that the remaining zones can absorb the workload of any single zone failure. When redundant peers from the same organization and channel, and ordering nodes, are spread across multiple zones, a failure in any one zone should not affect the ability of the network to process transactions because the workload shifts to the blockchain nodes in the other zones.

You can use the IBM Blockchain Platform console to specify the zone where a CA, peer, or ordering node is created. When you deploy a CA, peer, or ordering service (or a single ordering node), check the Advanced deployment option that is labeled Deployment zone selection to see the list of zones that are currently configured for your Kubernetes cluster.

If you are deploying a CA, peer, or ordering service, you can either select a zone from the list of zones available to your cluster or let your Kubernetes cluster decide for you by keeping the default selection. For a five-node ordering service, the nodes are distributed into multiple zones by default, depending on the relative space available in each zone. You can also distribute a five-node ordering service yourself by clearing the default option that lets the zones be chosen for you and then placing the nodes into the zones that you have available. If you are deploying a redundant node (that is, another peer when you already have one), it is a best practice to deploy this node into a different zone. You can check which zone the other node was deployed to by opening the node's tile and looking under Node location. Alternatively, you can use the APIs to deploy a peer or orderer to a specific zone. For more information about how to do this with the APIs, see Creating a node within a specific zone.
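
As a hedged sketch of the API route, the request below creates a peer in a named zone with the console REST API. The endpoint path, payload fields (particularly zone), and bearer-token authentication shown here are assumptions drawn from the console API reference; see Creating a node within a specific zone for the authoritative request body.

```python
import requests

CONSOLE_URL = "https://<your-console-url>"   # placeholder
IAM_TOKEN = "<iam-access-token>"             # placeholder

payload = {
    "display_name": "Org1 Peer 2",
    "msp_id": "Org1MSP",
    "zone": "dal12",                         # target zone for this node
    # crypto, resources, and storage sections omitted for brevity
}

resp = requests.post(
    f"{CONSOLE_URL}/ak/api/v3/kubernetes/components/fabric-peer",
    json=payload,
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```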

The CA zone selection is only available when the default database type SQLite is used and your cluster is configured with multiple zones.

Other CA considerations: If you have multiple zones configured for your Kubernetes cluster, when you create a new CA with a PostgreSQL database and replica sets, an anti-affinity policy ensures that the CA replica sets are automatically configured across the zones. Replica sets are represented as shaded CA boxes in the diagram above. Adequate resources must exist in the other zones in order for the anti-affinity policy to be used.

Multizone-capable storage:

Before you deploy any nodes, you have the additional option to configure your Kubernetes cluster to use multizone-capable storage as your persistent storage. Without multizone-capable storage, if an entire zone goes down, any blockchain nodes in that zone cannot automatically come up in another zone because their associated persistent storage is unavailable in the failed zone. When multizone-capable storage is configured, if a zone failure occurs, peers and ordering nodes can come up in another zone with their associated storage intact, ensuring high availability. To leverage this capability with IBM Blockchain Platform, you need to configure your cluster to use SDS (Portworx) storage. To learn more about multizone-capable storage, see the comparison of persistent storage options for multizone clusters on OpenShift or IBM Cloud Kubernetes service. When you deploy a peer, ordering service, or ordering node, select the advanced deployment option labeled Deployment zone selection and then select Across all zones.
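
If you want to check whether the cluster already offers a Portworx-backed (SDS) storage class, a minimal sketch with the kubernetes Python client follows. The provisioner strings it looks for are common Portworx values listed as assumptions; confirm them against your own storage configuration.

```python
from kubernetes import client, config

config.load_kube_config()
storage = client.StorageV1Api()

# Assumed Portworx provisioner names (in-tree and CSI); adjust if yours differ.
portworx_provisioners = {"kubernetes.io/portworx-volume", "pxd.portworx.com"}

for sc in storage.list_storage_class().items:
    tag = " (Portworx/SDS)" if sc.provisioner in portworx_provisioners else ""
    print(f"{sc.metadata.name}: {sc.provisioner}{tag}")
```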

This scenario uses redundant peers, ordering nodes, and CAs across multiple worker nodes and multiple zones, which protects against zone failure, but does not protect from the unlikely failure of an entire region.

If you choose to use a multizone configuration for CAs, peers, or ordering nodes, then you are responsible for configuring the storage for each zone and setting the node affinity to zones.

Multi-region HA

This scenario offers the highest level of HA possible.

Figure 3. Blockchain HA multi-region options

Region failure.

Multi-region clusters with multiple worker nodes and anti-affinity:

The likelihood of a full regional failure is low. However, to account for this failure, you can set up multiple clusters in different regions, where each cluster has its own linked console. If an entire region fails, redundant peers or ordering nodes in the other regions can service the workload. For production environments, configuring your blockchain peers and ordering nodes across multiple regions provides the maximum HA coverage available.

This scenario uses redundant peers and ordering nodes across multiple worker nodes in multiple regions, which provides the highest degree of HA. This approach is also a recommended scenario for a production network when your resiliency requirements merit the investment. The five ordering nodes are spread across three clusters in a 2-1-2 pattern, meaning two nodes in Region 1, one node in Region 2, and two nodes in Region 3. This configuration allows any single region, or all of the ordering nodes in a region, to go down while still maintaining a quorum of nodes in the Raft cluster.
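
The arithmetic behind the 2-1-2 layout, sketched below for illustration only (the region names are placeholders): with five Raft nodes the quorum is three, so losing every ordering node in any one region still leaves a quorum.

```python
NODES_PER_REGION = {"region-1": 2, "region-2": 1, "region-3": 2}
TOTAL = sum(NODES_PER_REGION.values())   # 5 ordering nodes
QUORUM = TOTAL // 2 + 1                  # 3 nodes must remain

for region, lost in NODES_PER_REGION.items():
    remaining = TOTAL - lost
    status = "quorum kept" if remaining >= QUORUM else "quorum lost"
    print(f"lose {region} ({lost} node(s)) -> {remaining} remain: {status}")
```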

See the topics on setting up multi-region HA deployments for steps to configure your IBM Blockchain Platform peers and ordering nodes across multiple regions.

Disaster recovery (DR)

The most important step for disaster recovery is to configure your environment across multiple regions, as documented in the previous section. A multi-region configuration ensures that your environment can handle a data center or regional disaster with zero downtime or data loss. However, it does not protect against data corruption or accidental deletion. To protect against those events, you might also want to take periodic backups.

It is recommended that you regularly back up the storage associated with every deployed component. Because the ledger is shared across all the peers and ordering nodes, taking regular backups is critical. For example, if incorrect data is accidentally propagated to a peer's ledger or if data is mistakenly deleted, the incorrect data might spread to the ledgers of other peers. This would require restoring the ledgers of all the peers from an established backup point to ensure synchronicity. You can decide how often to perform the backups based on your recovery needs, but a general guideline is to take daily backups.

All nodes must be stopped in order to ensure a reliable backup.

Table 2. Backup recommendations for storage
| Storage solution provider | Guidance |
| --- | --- |
| IBM Cloud storage solution | You can leverage the capability provided by IBM Cloud Kubernetes service or OpenShift. |
| Portworx | While a snapshot capability is available for taking backups, the nodes must be stopped in order to get a reliable backup. |

When you need to restore from a backup, it must be restored on every component across your network.

If you are using CA replica sets and your PostgreSQL database resides in IBM Cloud, backups are included in the service. See the topic on Managing Backups for more information. Otherwise, you need to work with your third-party PostgreSQL database provider to manage the database backups according to your DR needs.

Backup and recovery

For information about backing up your components and how to recover corrupted components or networks, see Backing up and restoring components and networks.