Best practices for resiliency

Follow best practices for resiliency on IBM Cloud to help ensure that your workloads are highly available and that you can recover from a disaster.

Have a plan

Working through any crisis is stressful and disasters are no different. During a disaster, you are likely trying to restart business-critical services and might be under pressure to do so quickly, which might lead to mistakes. In the unlikely event of a disaster, having a clearly defined plan helps your business recover in a predictable way, which helps alleviate stress and reduces mistakes. For more information about creating a disaster recovery plan, see Planning for disaster recovery.

Determine priorities

When you create a disaster recovery plan, make sure that your organization endorses, understands, and can fund the plan. When you get broader stakeholder buy-in, you can capture the business's priorities if a disaster occurs. This way, the plan is both comprehensive and proportionate to business expectations. Review the plan with your organization regularly to capture changes in priorities and update the plan as needed.

Understand what qualifies as a disaster

To enact the disaster recovery plan, you need to be sure that what you are experiencing is a disaster. Clearly express what qualifies as a disaster in the plan; otherwise, you risk false alarms or inactivity when a disaster occurs. Define several different scenarios that constitute a disaster, how the business can react to them, and how quickly. Also consider the scope of impact. If one component of a workload suffers a disaster, is the response different from the response when everything fails?

Communication is key

In a disaster situation, effective communication between teams is important. In your plan, clearly state which individuals can call a disaster. Define how they communicate with the organization and the people who are required to enact the disaster recovery plan. Clearly state channels of communication, such as by phone or email, to help ensure that important messages are not missed. Think about backup channels of communication in case primary channels are affected. The plan might also prescribe certain meetings that must take place to capture situation updates.

Test your plan

Reacting to a real disaster isn't the ideal time to test your plan for the first time. Regularly test your plan to help ensure that it works and you understand how long it might take to enact the plan. Where other personnel are involved, testing your plan allows them to better understand their roles. Following the test, make sure that you incorporate any lessons that are learned into the plan. For more information about testing your disaster recovery plan, see Disaster Recovery Testing.

Understand your responsibilities

Each IBM Cloud service has a roles and responsibilities matrix that defines IBM®'s responsibilities, customer responsibilities, and shared responsibilities, including responsibilities related to backup, recovery, and disasters. Make sure that you fully understand the ownership of responsibilities for each of the services that you use, since they determine the actions to successfully recover your service instances and help you plan.

Consider data resiliency and data residency requirements

Data resiliency refers to the ability to access, maintain, or quickly recover data if failures or disasters occur. It is related to the concepts of high availability, disaster recovery, and cyber resiliency. For more information about data resiliency, see the IBM Well-Architected Framework.

Another important aspect to consider is data residency and any restrictions or requirements on your data's physical location, not just for production environments but also for backup and recovery.

Understanding data residency in IBM Cloud

IBM Cloud's global network of locations provides you with the flexibility of choosing where you want to run your workloads. Review IBM Cloud's Service rollout policy for guidelines on when to expect and how to request that a service is available in a particular region.

When you provision an instance of a regional and zonal service, you select a region to deploy the instance in accordance with your geographic requirements. IBM Cloud helps ensure that content that is provided by you and your workload, as defined in the Cloud Services Agreement, is stored and processed locally in the selected region location. For a complete list of the locations where IBM Cloud services are available, see Service and infrastructure availability by location.

IBM Cloud services support saving encrypted backups of the customer content within the location where the regional or zonal service is located for recovery if data corruption or a major data center disaster occurs.

IBM Cloud stores and processes client metadata, including client business contact and account usage information (as defined in the IBM Cloud Service Agreement), where the control planes of the regional and global services are located.

For a complete list of the data attributes that each IBM Cloud service processes and stores, see the API and SDK reference library.

All data in transit is encrypted. IBM Cloud supports only TLS 1.2 and 1.3. TLS 1.1 and lower are explicitly disabled to prevent rollback to a vulnerable version of the protocol.
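As a client-side illustration of the same policy, the following Python sketch builds a TLS context that refuses to negotiate anything below TLS 1.2. It uses only the standard library ssl module. The function name make_tls_context is hypothetical and is not part of any IBM Cloud SDK.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Return a client TLS context that rejects TLS 1.1 and lower.

    Hypothetical helper for illustration; not an IBM Cloud API.
    """
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse to negotiate any protocol version below TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
```

Setting the minimum version on the client as well means that a misconfigured or impersonated endpoint cannot negotiate the connection down to an older protocol version.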

Processes and procedures for IBM Cloud data privacy processing are documented within the IBM Cloud Data Processing Addendum (DPA). This DPA and its applicable DPA Exhibits apply to the Processing of Personal Data by IBM Cloud on behalf of Client (Client Personal Data). The processing of Personal Data is subject to the General Data Protection Regulation 2016/679 (GDPR). It is also subject to any other data protection laws that are identified at Data Protection Laws in order to provide services (Services) according to the Agreement between Client and IBM Cloud. The IBM Cloud DPA can be found at Data Processing Addendum.

In addition to the DPA, the cloud services provide DPA exhibits that can be found on the IBM Cloud Terms site.

Design HA and DR into your workloads

When you design cloud-deployed workloads, think about high availability and disaster recovery as part of the requirement-gathering stage. By understanding early in the design process what the resiliency qualities of the workload need to be, you can make decisions that influence the architecture and make recovery easier. For example, if you understand the recovery time objective for a workload, you can decide how to deploy a workload to meet that objective by using the features of available services. Similarly, if you understand the recovery point objective, you can make better decisions about data, how to back it up, or how to replicate it. Designing this in at an early stage also allows the business to better understand running costs for the workload.
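To make the recovery point objective concrete, consider periodic backups. Assuming that each backup captures the data as it was when the backup started, the worst-case data loss is roughly the interval between backups plus the time that a backup takes to complete. The following Python sketch checks a backup schedule against an RPO under that assumption. The function names and the numbers are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Estimate the maximum data loss for a periodic backup schedule.

    Assumption: a backup reflects the data at the moment it starts, and a
    failure can occur just before the next backup finishes.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """Return True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Illustrative values: hourly backups that take 10 minutes to complete.
hourly = timedelta(hours=1)
ten_minutes = timedelta(minutes=10)
fits_one_hour_rpo = meets_rpo(hourly, ten_minutes, rpo=timedelta(hours=1))
fits_four_hour_rpo = meets_rpo(hourly, ten_minutes, rpo=timedelta(hours=4))
```

In this example, hourly backups do not satisfy a one-hour RPO because the worst-case loss is 70 minutes, but they do satisfy a four-hour RPO. A tighter RPO would call for more frequent backups or for replication.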

Consider how you can develop application code to make switching to a disaster recovery site easier. For example, avoid hardcoding connection strings or other configurations that might change as a result of connecting to alternative resources.
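One common way to avoid hardcoded connection details is to read them from the environment at startup, so that the same code can point at resources in the DR site without being rebuilt. The following Python sketch shows the idea. The variable names DB_HOST, DB_PORT, and DB_NAME, the default values, and the example hostname are assumptions made for illustration.

```python
import os
from typing import Mapping


def load_database_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read connection settings from the environment instead of the code.

    The variable names and defaults here are hypothetical examples.
    """
    return {
        # Required setting: a missing value raises KeyError at startup.
        "host": env["DB_HOST"],
        # Optional settings with illustrative defaults.
        "port": int(env.get("DB_PORT", "5432")),
        "name": env.get("DB_NAME", "appdb"),
    }


# During failover, only the environment changes, not the application code.
dr_config = load_database_config({"DB_HOST": "db.dr.example.com"})
```

Failing fast on a missing required value surfaces a configuration mistake during deployment rather than during the first request in the DR site.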

Choose the right tools

Think of IBM Cloud as a toolbox with a set of tools or services that you can use to deploy and run workloads. To use any tool properly, you need to understand what it can and can't do, and choose the right one. If you try to use a tool for a job that it wasn't designed for, something can go wrong. When you design your resiliency plan, understand in as much detail as possible the services that your workload uses, what their capabilities are, and their limits. If a service can't meet your RTO or RPO, consider other services or tools that might help close the gap. Also consider whether the objectives that you set are realistic, or whether you are introducing undue complexity and cost into your solution for little real gain.

Take backups before you make changes

Running IT systems is never risk-free, and introducing change into your workload environment is a point when risk increases. Have a change release plan and a backout plan to manage changes that you make to your environment. As one of the first steps of any change release plan, include taking backups of data and configurations. If something goes wrong during or shortly after the release, you can recover from your backups.
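As a minimal sketch of the "back up first" step for a single configuration file, the following Python code copies the file to a timestamped backup before a change and restores it if the change has to be backed out. The function names, the .bak naming scheme, and the file paths are hypothetical. Real change release plans typically also cover databases and service configurations by using the backup features of each service.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def backup_before_change(config_path: Path, backup_dir: Path) -> Path:
    """Copy a file to a timestamped backup and return the backup path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup_path = backup_dir / f"{config_path.name}.{stamp}.bak"
    # copy2 preserves file metadata such as modification time.
    shutil.copy2(config_path, backup_path)
    return backup_path


def restore_backup(backup_path: Path, config_path: Path) -> None:
    """Back out a change by restoring the file from its backup."""
    shutil.copy2(backup_path, config_path)
```

A release script would call backup_before_change as its first step, keep the returned path, and call restore_backup with that path as part of the backout plan.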

Create custom images

Create a custom image from a boot volume and use it as the golden image, with preinstalled applications and configurations, to reduce the provisioning time for IBM Cloud® Virtual Servers for Virtual Private Cloud instances in the DR site. The boot volume must be attached to a stopped virtual server instance (VSI) to create the custom image.

Use hostnames for subnets

Use hostnames and DNS instead of IP addresses to minimize the changes that are required to redeploy an application in the DR site, particularly with VPC instances, including VSIs. Subnets are zone-specific and do not extend across zones. New IP addresses are assigned to new service instances, which can break existing configurations such as security rules and application configuration files.
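The following Python sketch illustrates the application side of this advice: the code stores only a hostname and resolves it whenever it needs to connect, so a DNS update during failover is picked up without a configuration change. It uses the standard library socket module. The function name resolve_endpoint is hypothetical, and localhost stands in for a real service hostname.

```python
import socket
from typing import List


def resolve_endpoint(hostname: str, port: int) -> List[str]:
    """Resolve a hostname to its current IP addresses at connect time.

    Hypothetical helper: the application keeps the hostname, never the IP.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


# "localhost" is a placeholder for a service hostname managed in DNS.
addresses = resolve_endpoint("localhost", 443)
```

Because the lookup happens at connect time, repointing the DNS record to the DR site redirects new connections after cached records expire, so keep DNS time-to-live values in mind when you estimate recovery time.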

Configure key management services for disaster recovery

For Key Protect, configure the service in the primary site with failover in the DR region to enable automatic rerouting of Key Protect requests if a regional service outage occurs. As part of your disaster recovery procedures, create scripts to update the Virtual Private Endpoint (VPE) settings that are used to access the Key Protect service, specifically the Internet Protocol (IP) address.

For Hyper Protect Crypto Services (HPCS), configure a failover crypto unit in the DR region. Before a disaster happens, initialize and configure the failover crypto units the same way as the operational crypto units so that they are available if a regional outage occurs.

Stay updated with IBM Cloud notifications

If a disaster occurs that affects an IBM Cloud service or region, you receive notifications from IBM Cloud in your account or by email. To sign up for notifications on your account, see View notifications. You can also check the IBM Cloud Status Overview page.