Understanding high availability and disaster recovery for IBM Cloud Logs Routing
High availability (HA) is the ability for a service to remain operational and accessible in the presence of unexpected failures. In the IBM Cloud glossary, high availability is defined as the ability of a service or workload to withstand failures and continue providing processing capability according to some predefined service level. For services, availability is defined in the Service Level Agreement. Availability includes both planned and unplanned events, such as maintenance, failures, and disasters.
Disaster recovery (DR) is the process of recovering the service instance to a working state. In the IBM Cloud glossary, disaster recovery is defined as the ability of a service or workload to recover from rare, major incidents and wide-scale failures, such as service disruption. This includes a physical disaster that affects an entire region, corruption of a database, or the loss of a service contributing to a workload. The impact exceeds the ability of the high availability design to handle it.
IBM Cloud Logs Routing is a highly available, multi-tenant, regional service. You can find the available region and data center locations in the Locations documentation. As a regional service, IBM Cloud Logs Routing fulfills the defined Service Level Objectives (SLO). The SLO is not a warranty and IBM will not issue credits for failure to meet an objective.
High availability architecture
An availability zone is a logically and physically isolated location within an IBM Cloud region where your data is processed and hosted.
- An availability zone has independent power, cooling, and network infrastructures that are isolated from other zones to strengthen fault tolerance by avoiding single points of failure between zones.
- An availability zone offers high bandwidth and low inter-zone latency within a region.
A region (location) is a geographically and physically separate group of one or more availability zones with independent electrical and network infrastructures that are isolated from other regions.
- Regions are designed to remove shared single points of failure with other regions and to provide low inter-zone latency within the region.
- Each region has 3 different data centers (DC) for redundancy.
High availability features
IBM Cloud Logs Routing supports the following high availability features:
Feature | Description |
---|---|
Multi-zone region deployment | IBM Cloud Logs Routing is deployed into multi-zone regions (MZRs) only, and within an MZR, the data plane cluster spans all three zones, ensuring that the loss of a zone does not impact service availability. |
Liveness / readiness monitoring | All microservices are monitored via Kubernetes liveness and readiness probes. |
Figure: Multi-zone region deployment for IBM Cloud Logs Routing
Disaster recovery architecture
The strategy for disaster recovery is to continuously maintain application data stored away from client data, so that the application can be restarted in an alternate region with the data that has been backed up. Recovery under these conditions expects user read / write activity to be restored within 8 hours of plan invocation without client intervention.
The new data is moved from one region to another region using cross-region IBM Cloud Object Storage (COS) buckets along with continuous replication of data across sites.
Disaster recovery features
IBM Cloud Logs Routing supports the following disaster recovery features:
Feature | Description | Consideration |
---|---|---|
Recovery center | Alternate site for storing and running the application, separate from the primary data center | No client action required. |
Database backup | A backup of current service metadata is stored off site | Data is continuously streamed to the backup. Backup locations for each region can be found here |
Planning for DR
The DR steps must be practiced regularly. As you build your plan, consider the following failure scenarios and resolutions.
Much of the preparation for disaster recovery for IBM Cloud Logs Routing is completed when you properly onboard to the service. Beyond that, you are responsible for monitoring communications from the service as they arrive, because further assistance might be needed when re-establishing normal operations, such as network connectivity for upstream data.
Failure | Resolution | Consideration |
---|---|---|
Hardware failure (single point) | IBM provides a database that is resilient to a single point of hardware failure within a zone. No configuration is required. | None needed. Covered by MZR deployment. |
Zone failure | IBM Cloud Logs Routing uses multi-zone region deployment that is resilient to the failure of a zone. No configuration is required. | None needed. Covered by MZR deployment. |
Data corruption | In case of data corruption, the database will be rolled back to the last stable state available in the backup site. Configuration required when onboarding. For more information, see Your responsibilities for HA and DR. | None needed. Covered by service rollback plan. |
Regional failure | When a regional failure occurs, a second site in an alternate region is maintained with synchronized data that will handle the workload. Configuration is required when onboarding. For more information, see Your responsibilities for HA and DR. | Backup locations for each region can be found here |
Your responsibilities for HA and DR
IBM Cloud has business continuity plans in place to provide for the recovery of services within hours if a disaster occurs. Business continuity is the capability of a business to withstand outages and to operate mission-critical services normally and without interruption in accordance with predefined service-level agreements. You are responsible for your data backup and associated recovery of your content once the data has been delivered to its defined target.
To establish cross-region high availability and protect your data from regional disaster, you must configure a cross-region IBM Cloud Logs Routing target to send your data to, because IBM Cloud Logs Routing tenants can deliver logs only to targets within their own region. Learn more.
When IBM Cloud Logs Routing recovers in the region that is down, your configuration is restored.
The following checklist associated with each feature can help you create and practice your plan.
- Subscribe to and follow platform notifications and announcements
- Ensure preferences are set to receive emails about platform notifications.
- Monitor the IBM Cloud status page for general announcements.
For more information about your responsibilities when you are using IBM Cloud Logs Routing, see Shared responsibilities for IBM Cloud Logs Routing.
Recovery time objective (RTO) and recovery point objective (RPO)
IBM Cloud Logs Routing provides ways to protect your data and restore service functions. Business continuity plans are in place to achieve the targeted recovery point objective (RPO) and recovery time objective (RTO) for the service. In disaster recovery planning, the RPO is the point in time to which data is restored, measured as the time (seconds, minutes, hours) between the recovered instance and the point of disaster. The RTO is the duration of time for a business process to be restored after a disaster. The following table outlines the targets for IBM Cloud Logs Routing.
Disaster recovery objective | Target Value |
---|---|
RPO | Within 24 hours |
RTO | Within 24 hours |
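As a rough illustration of how the two targets in the table are measured, the following Python sketch compares one hypothetical incident against them. The function name, the timestamps, and the idea of checking a single incident this way are assumptions made for the example. They are not part of the service or of any IBM tooling.

```python
from datetime import datetime, timedelta

# Targets taken from the table above. All names here are hypothetical.
RPO_TARGET = timedelta(hours=24)
RTO_TARGET = timedelta(hours=24)


def meets_objectives(last_backup, disaster, restored):
    """Return (rpo_met, rto_met) for a single incident."""
    # RPO: data written between the last good backup and the disaster is lost.
    data_loss_window = disaster - last_backup
    # RTO: time between the disaster and the service being usable again.
    downtime = restored - disaster
    return data_loss_window <= RPO_TARGET, downtime <= RTO_TARGET


# Example incident with made-up timestamps.
disaster = datetime(2024, 1, 10, 12, 0)
print(meets_objectives(
    last_backup=disaster - timedelta(hours=3),   # 3 hours of data at risk
    disaster=disaster,
    restored=disaster + timedelta(hours=30),     # 30 hours of downtime
))
```

In this made-up incident the RPO target is met and the RTO target is missed, which is why the two objectives are tracked separately.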
Change management
Change management includes tasks such as upgrades, configuration changes, and deletion.
It is recommended that you grant users and processes the IAM roles and actions with the least privilege required for their work. See Managing IAM access for IBM Cloud Logs Routing.
Major, minor, and patch version updates for IBM Cloud Logs Routing interfaces are handled by the IBM Cloud Logs Routing service team.
How IBM® helps ensure disaster recovery
IBM Cloud Logs Routing is a highly available, regional service.
A multizone region (MZR) consists of 3 or more availability zones that are independent from each other to ensure that single failure events affect only a single zone.
By default, IBM Cloud Logs Routing is deployed across 3 zones. The zones are set up in an active/active/active configuration:
- Each zone is located in a different data center in the region.
- The data in each zone is automatically replicated to the other zones with low latency. You do not need to do anything to enable the replication.
- The service is designed to withstand a single zone failure with no interruption.
- If all the data centers in a location fail, IBM Cloud Logs Routing becomes unavailable in that location.
- For more information about the regions where IBM Cloud Logs Routing is available, see Locations.
The MZR architecture offers automatic failover between zones within the region, and high availability for an IBM Cloud Logs Routing deployment within a region.
The following table lists the high-availability (HA) status for the regions (locations) where the IBM Cloud Logs Routing service is available:
Geography | Region | EU-Supported | HA Status |
---|---|---|---|
Asia Pacific | Osaka (jp-osa) | NO | MZR |
Asia Pacific | Sydney (au-syd) | NO | MZR |
Asia Pacific | Tokyo (jp-tok) | NO | MZR |
Europe | Frankfurt (eu-de) | YES | MZR |
Europe | London (eu-gb) | NO | MZR |
Europe | Madrid (eu-es) | YES | MZR |
North America | Dallas (us-south) | N/A | MZR |
North America | Toronto (ca-tor) | N/A | MZR |
North America | Washington, D.C. (us-east) | N/A | MZR |
South America | Sao Paulo (br-sao) | N/A | MZR |
Where:
- A geography is a geographic area or larger political body that contains one or more regions.
- A region is a defined geographic territory. A region might be a specific postal code area, a town, a city, a state, a group of states, or even a group of countries. A region contains multiple availability zones to meet local access, low latency, and security requirements for the region.
- N/A means that the feature is not applicable to that geography.
- MZR means multi-zone region. Learn more.
For more information about service availability within regions and data centers, see Service and infrastructure availability by location.
The data that is managed by IBM Cloud Logs Routing in a region is kept in the data centers near that region.
IBM Cloud Logs Routing data includes information about the targets where logging events are delivered for tenants that are onboarded to the region. A target is a resource where logging events are collected.
IBM Cloud Logs Routing regularly backs up the data in each region:
- Regular backups are done daily and retained for 30 days.
- Continuous incremental backups are kept for the last 7 days.
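To make the two retention windows concrete, here is a minimal Python sketch of how they might combine when choosing a restore point. It is only an illustration: the function name, the return values, and the inclusive window boundaries are assumptions, and it does not describe the service's actual restore logic, which is operated by IBM.

```python
from datetime import date, timedelta

# Retention windows from the list above.
DAILY_RETENTION = timedelta(days=30)        # daily backups retained for 30 days
INCREMENTAL_RETENTION = timedelta(days=7)   # incremental backups kept for 7 days


def restore_option(target: date, today: date) -> str:
    """Return which kind of backup could cover `target` (hypothetical helper)."""
    age = today - target
    if age < timedelta(0):
        return "none"          # a future date cannot be restored
    if age <= INCREMENTAL_RETENTION:
        return "incremental"   # within the last 7 days
    if age <= DAILY_RETENTION:
        return "daily"         # older than 7 days, within 30 days
    return "none"              # outside both retention windows


today = date(2024, 3, 31)
for target in (date(2024, 3, 28), date(2024, 3, 11), date(2024, 1, 1)):
    print(target, restore_option(target, today))
```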
IBM Cloud Logs Routing data is replicated across multiple regions. Regular backups are stored across multiple regions and can be restored to other regions.
How IBM recovers from zone failures
In case of zone failure, IBM Cloud will resolve the zone outage. Since the data plane spans across all three zones in a region, there will be no impact to service availability, and the global load balancer will resume sending data to the restored zone. There will be no need for customer action at this time.
How IBM recovers from regional failures
If the regional state is corrupted, the service is restored to the state of the last internal backup, which is continuously streamed to an alternate data site in a cross-region IBM Cloud Object Storage bucket managed by the service. If backup data has been corrupted, there is a potential for 24 hours' worth of data loss. These backups are not available for customer-managed disaster recovery.
If IBM can’t restore the service instance, the customer must take steps to restore access to IBM Cloud Logs Routing as described in Your responsibilities for HA and DR.
How IBM maintains services
All upgrades follow the IBM service best practices and have a recovery plan and rollback process in place. Regular upgrades for new features and maintenance occur as part of normal operations. Such maintenance can occasionally cause short interruptions that are handled by client retry logic. Changes are rolled out sequentially, region by region and zone by zone within a region. Updates are backed out at the first sign of a defect.
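The retry logic mentioned above is not specified in this documentation. One common approach is exponential backoff with jitter, sketched below in Python. The `send` callable, the `send_with_retry` name, and all parameter defaults are hypothetical, and treating `ConnectionError` as the retryable failure is an assumption. A real client would retry on whatever transient errors its transport reports.

```python
import random
import time


def send_with_retry(send, payload, max_attempts=5, base_delay=0.5,
                    max_delay=30.0, sleep=time.sleep):
    """Call send(payload), retrying transient failures with backoff.

    `send` is a hypothetical callable that raises ConnectionError when the
    endpoint is briefly unavailable, for example during maintenance.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Double the delay cap on each attempt, bounded by max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time up to the cap so that many
            # clients do not all retry at the same moment.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable so that the function can be exercised in tests without waiting for real delays.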
Complex changes are enabled and disabled with feature flags to control exposure.
Changes that impact customer workloads are detailed in notifications. For more information, see monitoring notifications and status for planned maintenance, announcements, and release notes that impact this service.