Configuring SAP HANA scale-up system replication in a SUSE Linux Enterprise High Availability Extension cluster
The following information describes the configuration of a SUSE Linux Enterprise Server (SLES) High Availability Extension (HAE) cluster for managing SAP HANA Scale-Up System Replication. The cluster uses virtual server instances in IBM® Power® Virtual Server as cluster nodes.
The instructions describe how to automate SAP HANA Scale-Up System Replication for a single database deployment in a performance-optimized scenario on a SLES HA Extension cluster.
This information is intended for architects and specialists who are planning a high-availability deployment of SAP HANA on Power Virtual Server.
Before you begin
Review the general requirements, product documentation, support articles, and SAP notes listed in Implementing high availability for SAP applications on IBM Power Virtual Server References.
Prerequisites
- A SUSE High Availability cluster is deployed on two virtual server instances in Power Virtual Server.
- Install and set up the SLES High Availability Extension cluster according to Implementing a SUSE Linux Enterprise Server high availability cluster.
- Configure and verify fencing as described in the preceding document.
- The virtual server instances need to fulfill hardware and resource requirements for the SAP HANA systems in scope. Follow the guidelines in Planning your deployment.
- The hostnames of the virtual server instances must meet the SAP HANA requirement.
- SAP HANA is installed on both virtual server instances and SAP HANA System Replication is configured. Installing SAP HANA and setting up HANA System Replication are not specific to the Power Virtual Server environment; follow the standard installation and setup procedures.
- A valid SUSE Linux Enterprise Server for SAP Applications license is required to enable the repositories that you need to install SAP HANA and the resource agents for HA configurations.
- See the Prerequisites chapter in the SUSE Linux Enterprise Server for SAP applications guide.
Configuring SAP HANA System Replication in a SLES HA Extension cluster on IBM Power Virtual Server
The instructions are based on the SUSE product documentation and articles that are listed in Implementing high availability for SAP applications on IBM Power Virtual Server References.
Preparing environment variables
To simplify the setup, prepare the following environment variables for root on both nodes. These environment variables are used with later operating system commands in this information.
On both nodes, set the following environment variables.
# General settings
export SID=<SID> # SAP HANA System ID (uppercase)
export sid=<sid> # SAP HANA System ID (lowercase)
export INSTNO=<INSTNO> # SAP HANA instance number
# Cluster node 1
export NODE1=<HOSTNAME_1> # Virtual server instance hostname
export DC1="Site1" # HANA System Replication site name
# Cluster node 2
export NODE2=<HOSTNAME_2> # Virtual server instance hostname
export DC2="Site2" # HANA System Replication site name
# Single zone
export VIP=<IP address> # SAP HANA System Replication cluster virtual IP address
Setting extra environment variables for implementing a single zone
Review the information in Reserving virtual IP addresses and reserve a virtual IP address for the SAP HANA System Replication cluster. Set the VIP environment variable to the reserved IP address.
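As a concrete illustration, the exports might look like the following sketch. Every value here is hypothetical and must be replaced with your own system ID, instance number, hostnames, site names, and reserved IP address.

```shell
# Hypothetical example values for illustration only.
export SID=HA1              # SAP HANA System ID (uppercase)
export sid=ha1              # SAP HANA System ID (lowercase)
export INSTNO=00            # SAP HANA instance number
export NODE1=hana-vsi-1     # hostname of cluster node 1
export DC1="SiteA"          # replication site name of node 1
export NODE2=hana-vsi-2     # hostname of cluster node 2
export DC2="SiteB"          # replication site name of node 2
export VIP=10.51.0.10       # reserved virtual IP address

# Sanity check: the lowercase SID must match the uppercase SID folded down.
if [ "${sid}" = "$(printf '%s' "${SID}" | tr '[:upper:]' '[:lower:]')" ]; then
  echo "SID variables consistent"
else
  echo "SID variables mismatch" >&2
fi
```

Because later commands interpolate these variables, a quick consistency check like the one above helps catch typos before any cluster resources are created.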
Installing SAP HANA resource agents
The SAPHana resource agent and the SAPHanaTopology resource agent are part of the SLES for SAP Applications distribution.
To install the resource agents, make sure that the package yast2-sap-ha is installed, as described in Setting up an SAP HANA cluster, and follow the steps to configure the HANA cluster by using yast2.
For scale-out scenarios, follow the Installing additional Software section of the SAP HANA System Replication Scale-Up - Performance-Optimized Scenario guide.
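If you want to confirm the packages from the command line first, a quick sketch follows. The package names are assumptions based on SLES for SAP Applications: SAPHanaSR provides the SAPHana and SAPHanaTopology agents, and yast2-sap-ha drives the guided setup.

```shell
# Check whether the HA resource agent packages are installed; prints one
# status line per package, or a note if rpm is not available.
check_ha_packages() {
  if ! command -v rpm >/dev/null 2>&1; then
    echo "rpm not available on this system"
    return 0
  fi
  for pkg in "$@"; do
    if rpm -q "$pkg" >/dev/null 2>&1; then
      echo "$pkg installed"
    else
      echo "$pkg missing"
    fi
  done
}

check_ha_packages SAPHanaSR yast2-sap-ha
```

Missing packages can be installed with zypper from the SLES for SAP Applications repositories that your license enables.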
Starting the SAP HANA system
Start SAP HANA and verify that HANA System Replication is active. For more information, see Checking System Replication Status Details.
On both nodes, run the following commands.
sudo -i -u ${sid}adm -- HDB start
sudo -i -u ${sid}adm -- <<EOT
hdbnsutil -sr_state
HDBSettings.sh systemReplicationStatus.py
EOT
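Besides its printed report, systemReplicationStatus.py signals the overall state through its exit code. The following sketch maps the return codes to readable states; the mapping below follows the convention documented for the script (verify it against your HANA release).

```shell
# Sketch: map the exit code of systemReplicationStatus.py to a readable state.
# Documented return-code convention: 15=active, 14=syncing, 13=initializing,
# 12=unknown, 11=error, 10=no system replication.
replication_state() {
  case "$1" in
    15) echo "ACTIVE" ;;
    14) echo "SYNCING" ;;
    13) echo "INITIALIZING" ;;
    12) echo "UNKNOWN" ;;
    11) echo "ERROR" ;;
    10) echo "NONE" ;;
     *) echo "UNEXPECTED RC $1" ;;
  esac
}

# Example with a literal return code; on a live system you would run the
# script as ${sid}adm first and pass in its exit code "$?".
replication_state 15
```

A state of ACTIVE on the primary indicates that replication to the secondary is established and in sync.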
Enabling the SAP HANA srConnectionChanged() hook
Recent versions of SAP HANA provide hooks so that SAP HANA can send out notifications for certain events. For more information, see Implementing a HA/DR Provider.
The srConnectionChanged() hook improves the ability of the cluster to detect a HANA System Replication status change that requires an action from the cluster. The goal is to prevent data loss and corruption by preventing accidental takeovers.
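For orientation, registering the hook adds an HA/DR provider section to the SAP HANA global.ini. The following is a typical sketch based on the SUSE setup guides; the provider path depends on the installed SAPHanaSR package version, and the trace entry is optional.

```ini
[ha_dr_provider_SAPHanaSR]
provider = SAPHanaSR
path = /usr/share/SAPHanaSR
execution_order = 1

[trace]
ha_dr_saphanasr = info
```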
Activating the srConnectionChanged() hook on all SAP HANA instances
- Stop the cluster.
On NODE1, run the following command.
crm cluster stop --all
Then, follow the steps that are described in Setting up SAP HANA HA/DR providers.
- Verify that the hook functions.
- Restart both HANA instances and verify that the hook script works as expected.
- Perform an action to trigger the hook, such as stopping a HANA instance.
- Check whether the hook logged anything in the trace files.
On both nodes, run the following commands.
Stop the HANA instance.
sudo -i -u ${sid}adm -- HDB stop
Start the HANA instance.
sudo -i -u ${sid}adm -- HDB start
Check that the hook logged messages to the trace files.
sudo -i -u ${sid}adm -- sh -c 'grep "ha_dr_SAPHanaSR.*crm_attribute" $DIR_INSTANCE/$VTHOSTNAME/trace/nameserver_* | cut -d" " -f2,3,5,17'
After you verify that the hooks function, you can restart the HA cluster.
- Start the cluster.
On NODE1, run the following commands.
Start the cluster.
crm cluster start --all
Check the status of the cluster.
crm status --full
Configuring general cluster properties
To avoid resource failover during initial testing and post-production, set the following default values for the resource-stickiness and migration-threshold parameters.
These steps are described in Configuring the cluster.
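In crm configure syntax, the resulting resource defaults might look like the following fragment. The values shown are the illustrative defaults used in the SUSE guides, not mandatory settings; tune them to your own failover policy.

```
rsc_defaults rsc-options: \
  resource-stickiness=1000 \
  migration-threshold=5000
```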
IBM Power10 systems provide an integrated hardware watchdog timer that is enabled by default. The Configuring the cluster description suggests softdog as a software watchdog timer fallback. Use the more reliable IBM Power10 hardware watchdog timer instead.
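To confirm that a watchdog device is exposed on your instance, a quick check follows; it assumes the device appears under the usual /dev/watchdog name.

```shell
# Check for a watchdog character device; the Power hardware watchdog is
# typically exposed as /dev/watchdog when enabled.
if [ -c /dev/watchdog ]; then
  echo "watchdog device present"
else
  echo "no watchdog device found"
fi
```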
Testing SAP HANA System Replication cluster
It is vital to thoroughly test the cluster configuration to make sure that the cluster is working correctly. The following information provides a few sample failover test scenarios. It's not a complete list of test scenarios.
For example, the description of each test case includes the following information.
- Component that is being tested
- Description of the test
- Prerequisites and the cluster state before you start the failover test
- Test procedure
- Expected behavior and results
- Recovery procedure
Test 1 - Testing a failure of the primary database instance
Use the following information to test the failure of the primary database instance.
Test 1 - Description
Simulate a crash of the primary HANA database instance that is running on NODE1.
Test 1 - Prerequisites
- A functional two-node SLES HA Extension cluster for HANA system replication.
- Both cluster nodes are active.
- Cluster is started on NODE1 and NODE2.
- Cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=false.
- Check SAP HANA System Replication status:
- Primary SAP HANA database is running on NODE1
- Secondary SAP HANA database is running on NODE2
- HANA System Replication is activated and in sync
A variation of Test 1 is described in Test cases for semi-automation.
Test 1 - Test procedure
Use the following command to run Test 1.
Crash the SAP HANA primary by sending a SIGKILL signal as the user ${sid}adm.
On NODE1, run the following command.
sudo -i -u ${sid}adm -- HDB kill-9
Test 1 - Expected behavior
You can expect the following behavior from the test.
- SAP HANA primary instance on NODE1 crashes.
- The cluster detects the stopped primary HANA database and marks the resource as failed.
- The cluster promotes the secondary HANA database on NODE2 to take over as the new primary.
- The cluster releases the virtual IP address on NODE1, and acquires it on the new primary on NODE2.
- If an application, such as SAP NetWeaver, is connected to a tenant database of SAP HANA, the application automatically reconnects to the new primary.
Test 1 - Recovery procedure
Because the cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=false, the cluster doesn't restart the failed HANA database and doesn't register it against the new primary. This means that the status on the new primary (NODE2) also shows the secondary in status 'CONNECTION TIMEOUT'.
To reregister the previous primary as a new secondary, use the following commands.
On NODE1, run the following command.
sudo -i -u ${sid}adm -- <<EOT
hdbnsutil -sr_register \
--name=${DC1} \
--remoteHost=${NODE2} \
--remoteInstance=${INSTNO} \
--replicationMode=sync \
--operationMode=logreplay \
--online
EOT
Verify the system replication status by using the following command.
sudo -i -u ${sid}adm -- <<EOT
hdbnsutil -sr_state
HDBSettings.sh systemReplicationStatus.py
EOT
After the manual register and resource refresh, the new secondary instance restarts and shows a synced (SOK) status.
On NODE1, run the following command.
crm resource refresh SAPHana_${SID}_${INSTNO}
Test 2 - Testing a failure of the node that is running the primary database
Use the following information to test the failure of the node that is running the primary database.
Test 2 - Description
Simulate a crash of the node that is running the primary HANA database.
Test 2 - Prerequisites
See the following prerequisites before you perform Test 2.
- You need a functional two-node SLES HA Extension cluster for HANA system replication.
- Make sure that both nodes are active.
- Confirm that the cluster is started on NODE1 and NODE2.
- Check SAP HANA System Replication status.
- Primary SAP HANA database is running on NODE2.
- Secondary SAP HANA database is running on NODE1.
- HANA System Replication is activated and in sync.
Test 2 - Preparation
Make sure that the cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=true.
On NODE1, run the following commands.
crm resource update SAPHana_${SID}_${INSTNO} AUTOMATED_REGISTER=true
crm resource config SAPHana_${SID}_${INSTNO}
Test 2 - Test procedure
Crash the primary on NODE2 by triggering a kernel crash through the sysrq interface.
On NODE2, run the following command.
sync; echo c > /proc/sysrq-trigger
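Before running the crash trigger, it can be worth confirming that the kernel sysrq interface is enabled; a value of 0 disables all sysrq functions. The check below only reads the current setting.

```shell
# Read the current sysrq setting; 0 means all sysrq functions are disabled,
# so "echo c > /proc/sysrq-trigger" would have no effect.
val=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo "unavailable")
echo "kernel.sysrq = ${val}"
```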
Test 2 - Expected behavior
You can expect the following behavior from the test.
- NODE2 shuts down.
- The cluster detects the failed node and sets its state to OFFLINE.
- The cluster promotes the secondary HANA database on NODE1 to take over as the new primary.
- The cluster acquires the virtual IP address on NODE1 on the new primary.
- If an application, such as SAP NetWeaver, is connected to a tenant database of SAP HANA, the application automatically reconnects to the new primary.
Test 2 - Recovery procedure
Use the following information to recover from Test 2.
- Log in to the IBM Cloud® Console and start the NODE2 instance.
- Wait until NODE2 is available again, then restart the cluster framework.
- On NODE2, run the following commands.
crm cluster start
crm status --full
As the cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=true, SAP HANA restarts when NODE2 rejoins the cluster and the former primary reregisters as a secondary.
Test 3 - Testing a failure of the secondary database instance
Use the following information to test the failure of the secondary database instance.
Test 3 - Description
Simulate a crash of the secondary HANA database.
Test 3 - Prerequisites
See the following prerequisites before you perform Test 3.
- A functional two-node SLES HA Extension cluster for HANA system replication.
- Both nodes are active.
- Cluster is started on NODE1 and NODE2.
- Cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=true.
- Check SAP HANA System Replication status.
- Primary SAP HANA database is running on NODE1.
- Secondary SAP HANA database is running on NODE2.
- HANA System Replication is activated and in sync.
Test 3 - Test procedure
Crash the SAP HANA secondary by sending a SIGKILL signal as the user ${sid}adm.
On NODE2, run the following command.
sudo -i -u ${sid}adm -- HDB kill-9
Test 3 - Expected behavior
You can expect the following behavior from the test.
- SAP HANA secondary on NODE2 crashes.
- The cluster detects the stopped secondary HANA database and marks the resource as failed.
- The cluster restarts the secondary HANA database.
- The cluster detects that the system replication is in sync again.
Test 3 - Recovery procedure
Use the following information to recover from Test 3.
- Wait until the secondary HANA instance starts and syncs again (SOK), then clean up the failed resource actions as shown in crm status.
- On NODE2, run the following commands.
crm resource refresh SAPHana_${SID}_${INSTNO}
crm status --full
Test 4 - Testing a manual move of an SAP HANA resource to another node
Use the following information to test the manual move of an SAP HANA resource to another node.
Test 4 - Description
Use cluster commands to move the primary instance to the other node for maintenance purposes.
Test 4 - Prerequisites
See the following prerequisites before you perform Test 4.
- A functional two-node SLES HA Extension cluster for HANA system replication.
- Both nodes are active.
- Cluster is started on NODE1 and NODE2.
- Cluster resource SAPHana_${SID}_${INSTNO} is configured with AUTOMATED_REGISTER=true.
- Check SAP HANA System Replication status:
- Primary SAP HANA database is running on NODE1
- Secondary SAP HANA database is running on NODE2
- HANA System Replication is activated and in sync
Test 4 - Test procedure
Move the SAP HANA primary to the other node by using the crm resource move command.
On NODE1, run the following command.
crm resource move SAPHana_${SID}_${INSTNO}-clone
Test 4 - Expected behavior
You can expect the following behavior from the test.
- The cluster creates location constraints to move the resource.
- The cluster triggers a takeover to the secondary HANA database.
- If an application, such as SAP NetWeaver, is connected to a tenant database of SAP HANA, the application automatically reconnects to the new primary.
Test 4 - Recovery procedure
Use the following information to recover from Test 4.
The automatically created location constraints must be removed to allow automatic failover in the future.
Wait until the primary HANA instance is active and remove the constraints.
The cluster registers and starts the HANA database as a new secondary instance.
On NODE1, run the following commands.
crm constraint
crm resource clear SAPHana_${SID}_${INSTNO}-clone
crm constraint
crm status --full