Resiliency in IBM Cloud

In information technology (IT), resiliency is broadly defined as the ability of an organization or solution to maintain essential systems and applications and to recover from disruptions. Resiliency in IBM Cloud focuses on the perspective of IBM clients, their solution planners, architects, and builders and the resilient solutions that they create on the IBM Cloud platform.

Resiliency and availability often go hand in hand. The more resilient a service or workload is, the more available it will typically be. For example, assuming an identical operating environment, a workload that demonstrates 99.9% availability in a given month might be considered less resilient than a workload that demonstrates 99.99% availability. Achieving higher resiliency, and by extension higher availability, requires greater redundancy of components.
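To make the difference between those availability figures concrete, the following minimal sketch converts an availability percentage into the downtime it allows. It assumes a 30-day month, and the function name `allowed_downtime_minutes` is only illustrative.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, over a period of `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per 30-day month")
```

Under that assumption, 99.9% availability allows roughly 43 minutes of downtime in the month, while 99.99% allows a little over 4 minutes.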

Consider the following scenario: You are an avid cyclist who frequently takes long bike rides out in the countryside. While the majority of your rides run smoothly and without issue, there is always the possibility of something going wrong. For example, maybe you run over a sharp object and you get a puncture - what happens then? If you’re unprepared, you could have a long walk home. But if you’ve already considered the possibility, you might have fitted tires with higher puncture resistance ahead of time. To be even more resilient, you might take a puncture repair kit and pump along on your ride. Repairing a puncture on the side of the road could take some time, though. So, you decide to take a spare inner tube instead. The solution costs more, but it increases your bike’s availability: replacing the tube gets you riding again more quickly than fixing the old one. You could always go one step further and invest in tubeless tires that can self-heal smaller punctures on the go without rider intervention. While typically more expensive, the solution provides much higher resilience to punctures and ensures less downtime, assuming there is not an abnormal level of damage. To be resilient to massive damage, you might decide to take an entire spare tire with you … and so on. Taken to the extreme, you might decide that you need to be fully resilient to any failure that could occur to avoid disruption to your ride, so you employ the solution of professional riders and have a support car follow you. The support car might have a complete set of spare parts, multiple spare bikes, and a mechanic. While this solution will ensure near-zero downtime, it’s also not proportional to the issues that most of us encounter on an average bike ride. In other words, there comes a point at which resilience, downtime, and cost reach an equilibrium and the solution meets our needs.

The following guide provides an overview of basic concepts and capabilities, including high availability, disaster recovery, cyber resiliency, and IBM Cloud assurances in these areas. You can find everything that you need to help you design, plan, test, and support operational resiliency regulations for your IBM Cloud solutions. Get a general view of IBM Cloud resiliency capabilities and best practices, which you can use with your solutions, and find IBM Cloud service-specific capabilities in each service's own documentation.

Notice that in the cycling scenario, the more resilient to failure (in this case, a puncture) you want to be, the more the solution changes. And the more resilient you become to prevent potential downtime, the more the solution costs.

Just as in the cycling scenario, resiliency requirements and capabilities for IT workloads must be considered according to how critical the application or solution is for an organization, balanced against the cost of the resiliency solution and the financial or reputational loss that could be incurred through downtime.
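One way to reason about that balance is to add the cost of each resiliency option to the loss that its expected downtime would cause, and then compare the totals. The sketch below is only an illustration: the option names, costs, downtime hours, and loss-per-hour values are all hypothetical and are not IBM Cloud figures or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    annual_cost: float              # cost of the resiliency solution (hypothetical)
    expected_downtime_hours: float  # expected downtime per year (hypothetical)


def total_annual_cost(option: Option, loss_per_hour: float) -> float:
    """Solution cost plus the loss incurred through the expected downtime."""
    return option.annual_cost + option.expected_downtime_hours * loss_per_hour


def cheapest_option(options: list[Option], loss_per_hour: float) -> Option:
    """Pick the option with the lowest combined cost for a given loss per hour."""
    return min(options, key=lambda o: total_annual_cost(o, loss_per_hour))


# Hypothetical options, loosely mirroring "basic", "better", and "best" resiliency.
OPTIONS = [
    Option("single instance", annual_cost=1_000, expected_downtime_hours=40),
    Option("two zones", annual_cost=3_000, expected_downtime_hours=4),
    Option("three zones", annual_cost=5_000, expected_downtime_hours=0.5),
]

if __name__ == "__main__":
    for loss_per_hour in (10, 100, 10_000):
        best = cheapest_option(OPTIONS, loss_per_hour)
        print(f"loss per hour {loss_per_hour}: {best.name}")
```

With these made-up numbers, the least resilient option wins when downtime is cheap, and the most resilient option wins when each hour of downtime is very costly. A real assessment would also need to account for reputational loss, which is harder to quantify.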

To help customers design resilience into their workloads, IBM Cloud publishes service level objectives (SLOs) for each service at IBM Cloud service level objectives. That page lists the target SLO for each service as an availability target in a highly available configuration. The underlying architecture of IBM Cloud and its services provides a highly available platform for your workloads, but it is important that workloads are deployed accordingly, across multiple zones in a region. For example, the availability target for IBM Cloud VPC is listed as 99.999% in the compute services section of our SLO. However, for a workload to take full advantage of that SLO, it must be deployed in a highly available fashion in each of the three zones in a given multi-zone region. In the VPC example, taking full advantage of the 99.999% SLO requires, at a minimum, three virtual server instances (one in each zone) and a load balancer. Perhaps that cost is higher than your budget. To reduce the cost, you could use two virtual server instances in two zones and a load balancer, but this reduces the resiliency too. A single virtual server instance with no load balancer costs less still, but the availability that you can expect is reduced further.
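The effect of adding or removing zones can be illustrated with the standard formula for redundant components. The sketch below is a simplified model, not the way IBM Cloud derives its SLOs: it assumes that instances fail independently, that the workload is available when at least one instance is up, and that the load balancer never fails. The per-instance availability of 99% is a hypothetical value chosen only to keep the arithmetic readable.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of redundant instances, assuming independent failures.

    The workload counts as available when at least one instance is up, so it
    is unavailable only when every instance is down at the same time.
    """
    if instances < 1:
        raise ValueError("at least one instance is required")
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return 1 - (1 - instance_availability) ** instances


# Hypothetical per-instance availability; not an IBM Cloud SLO or SLA figure.
HYPOTHETICAL_INSTANCE_AVAILABILITY = 0.99

if __name__ == "__main__":
    for zones in (1, 2, 3):
        result = combined_availability(HYPOTHETICAL_INSTANCE_AVAILABILITY, zones)
        print(f"{zones} instance(s), one per zone: {result:.4%}")
```

Under these assumptions, each additional zone removes roughly two more "nines" of downtime. Real deployments see smaller gains because failures are rarely fully independent and because shared components, such as the load balancer, have their own availability.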

It's important to remember that SLOs are objectives and not guarantees. In practice, this means that while our underlying architecture has been built with the objective of 99.999% availability, there are still events that could derail this, however small their likelihood. Because SLOs do not offer guarantees, they are also not a contractual device. However, IBM Cloud recognizes that customers pay for a service and expect that service to be available. Just like other cloud providers, we publish another set of availability figures, which are known as Service Level Agreements (SLAs). When you use an IBM Cloud service, the SLA describes the minimum availability that you can expect from that service. If that minimum is breached, customers become entitled to service credits, as set out in the SLA. The SLA also sets out certain terms that may include the deployment model for a workload and the resulting credit entitlement based on how the workload has been deployed. Again, it is important that you are familiar with the SLA for the IBM Cloud components that you use and how the SLA might apply in different workload deployment scenarios.

The key message here is that IBM Cloud operates a shared responsibility model in which, in general terms, IBM Cloud is responsible for the resilience and recovery of the cloud, whereas customers are responsible for the resiliency and recovery of their workloads. This solution guide also includes an introduction to shared responsibilities.

Ensuring that your workloads are resilient and recoverable might sound daunting. However, this guide provides an overview of the basic concepts and capabilities of IBM Cloud that you need to understand to be successful. This includes high availability, disaster recovery, cyber resiliency, and the IBM Cloud assurances in these areas. In this guide, you can find everything that you need to help you design, plan, test, and support operational resiliency regulations for your IBM Cloud solutions. As you read, you will get a general view of IBM Cloud resiliency capabilities and best practices. You can use this knowledge as you design, build, and deploy your solutions. In addition, the documentation for each IBM Cloud service has more information about service-specific capabilities and requirements.