Backing up and restoring data for IBM Cloud Pak for Data


You can back up and restore the data that is associated with your installation in IBM Cloud Pak for Data.

The following table lists the upgrade paths that are supported by the scripts.

Upgrade paths supported by scripts:

| Version in use | Version that you can upgrade to |
|----------------|---------------------------------|
| 5.0.x | 5.1.x |
| 4.8.x | 5.0.x or 5.1.x |
| 4.7.x | 4.8.x or 5.0.x |
| 4.6.x | 4.7.x or 4.8.x |
| 4.5.x | 4.6.x or 4.7.x |
| 4.0.x | 4.5.x or 4.6.x |

Simpler ways to complete the upgrade are described in the upgrade topics for your deployment.

If you are upgrading from 4.6.4 or earlier, you must first upgrade to 4.6.5 before you upgrade to the latest release.

The primary data storage is a PostgreSQL database.

Choose one of the following ways to manage the backup of your data:

  • Kubernetes CronJob: Use the $INSTANCE-store-cronjob cron job that is provided for you.
  • backupPG.sh script: Use the backupPG.sh bash script.
  • pg_dump tool: Run the pg_dump tool on each cluster directly. This is a manual option that gives you control over the process.

When you back up data with one of these procedures before you upgrade from one version to another, the workspace IDs of your skills are preserved, but the service instance IDs and credentials change.

Before you begin

  • When you create a backup by using this procedure, the backup includes all of the assistants and skills from all of the service instances. It can include skills and assistants to which you do not have access.
  • The access permissions information of the original service instances is not stored in the backup. This means that the original access rights, which determine who can see each service instance, are not preserved.
  • You cannot use this procedure to back up the data that is returned by the search integration. Data that is retrieved by the search integration comes from a data collection in a Discovery instance. See the Discovery documentation to find out how to back up its data.
  • If you back up and restore or otherwise change the Discovery service that your search integration connects to, then you cannot restore the search integration, but must re-create it. When you set up a search integration, you map sections of the assistant's response to fields in a data collection that is hosted by an instance of Discovery on the same cluster. If the Discovery instance changes, your mapping to it is broken. If your Discovery service does not change, then the search integration can continue to connect to the data collection.
  • The tool that restores the data clears the current database before it restores the backup. Therefore, if you might need to revert to the current database, create a backup of it first.
  • The target IBM Cloud Pak for Data cluster where you restore the data must have the same number of provisioned service instances as the environment from which you back up the database. To verify in the IBM Cloud Pak for Data web client, select Services from the main navigation menu, select Instances, and then open the Provisioned instances tab. If more than one user created instances, ask each of those users to log in and check the number of instances that they created, and then add up the totals. Not even an administrative user can see instances that were created by other users in the web client user interface.

Backing up data by using the CronJob

A CronJob named $INSTANCE-store-cronjob is created and enabled for you automatically when you deploy the service. A CronJob is a type of Kubernetes controller. A CronJob creates Jobs on a repeating schedule. For more information, see CronJob in the Kubernetes documentation.

The store CronJob creates the $INSTANCE-backup-job-$TIMESTAMP jobs. Each job deletes old logs and backs up the store PostgreSQL database by using the pg_dump tool, which writes the database contents to stdout so that they can be saved to a file. The backups are stored in a persistent volume claim (PVC) named $INSTANCE-store-db-backup-pvc.
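
For example, you can inspect these resources with standard commands (a minimal sketch, assuming ${INSTANCE} is set to your instance name):

  oc get cronjob ${INSTANCE}-store-cronjob
  oc get jobs | grep ${INSTANCE}-backup-job
  oc get pvc ${INSTANCE}-store-db-backup-pvc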

You are responsible for moving the backup to a more secure location after its initial creation, preferably a location that can be accessed outside of the cluster and where the backups cannot be deleted easily. Do this for all environments, especially for production clusters.

The following table lists the configuration values that control the backup cron job. After the service is deployed, you can edit these settings by using the oc edit cronjob $INSTANCE-store-cronjob command.

Cron job variables:

| Variable | Description | Default value |
|----------|-------------|---------------|
| store.backup.suspend | If True, the cron job does not create any backup jobs. | False |
| store.backup.schedule | The time of day at which to run the backup jobs, specified as a cron expression: {minute} {hour} {day} {month} {day-of-week}, where {day-of-week} is 0=Sunday, 1=Monday, and so on. The default schedule runs every day at 11 PM. | 0 23 * * * |
| store.backup.history.jobs.success | The number of successful jobs to keep. | 30 |
| store.backup.history.jobs.failed | The number of failed jobs to keep in the job logs. | 10 |
| store.backup.history.files.weekly_backup_day | The day of the week that is designated as the weekly backup day. 0=Sunday, 1=Monday, and so on. | 0 |
| store.backup.history.files.keep_weekly | The number of backups to keep that were taken on weekly_backup_day. | 4 |
| store.backup.history.files.keep_daily | The number of backups to keep that were taken on the other days of the week. | 6 |
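
For example, a minimal sketch that suspends the backup jobs or changes the schedule by patching the CronJob directly. Note that the operator might reconcile direct edits back to the operand settings, so prefer oc edit where it applies:

  oc patch cronjob ${INSTANCE}-store-cronjob --type merge -p '{"spec":{"suspend":true}}'
  oc patch cronjob ${INSTANCE}-store-cronjob --type merge -p '{"spec":{"schedule":"0 1 * * *"}}'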

Accessing backed-up files from Portworx

To access the backup files from Portworx, complete the following steps:

  1. Get the name of the persistent volume that is used for the PostgreSQL backup:

    oc get pv |grep $INSTANCE-store
    

    This command returns the name of the persistent volume where the store backup is located, such as pvc-d2b7aa93-3602-4617-acea-e05baba94de3. The name is referred to later in this procedure as $pv_name.

  2. Find nodes where Portworx is running:

    oc get pods -n kube-system -o wide -l name=portworx-api
    
  3. Log in as the core user to one of the nodes where Portworx is running:

    ssh core@<node hostname>
    sudo su -
    
  4. Make sure that the persistent volume is in a detached state and that no store backups are scheduled to occur during the time that you plan to transfer the backup files.

    Remember, backups occur daily at 11 PM (in the time zone that is configured for the nodes) unless you change the schedule by editing the value of the store.backup.schedule configuration parameter. You can run the oc get cronjobs command to check the current schedule for the $INSTANCE-store-cronjob job. In the following command, $pv_name is the name of the persistent volume that you found in the first step of this task:

    pxctl volume inspect $pv_name |head -40
    
  5. Attach the persistent volume to the host:

    pxctl host attach $pv_name
    
  6. Create a folder where you want to mount the volume:

    mkdir /var/lib/osd/mounts/voldir
    
  7. Mount the volume:

    pxctl host mount $pv_name --path /var/lib/osd/mounts/voldir
    
  8. Change the directory to /var/lib/osd/mounts/voldir and transfer the backup files to a secure location. Then, exit the directory and unmount the volume:

    pxctl host unmount --path /var/lib/osd/mounts/voldir $pv_name
    
  9. Detach the volume from the host:

    pxctl host detach $pv_name
    
  10. Make sure that the volume is in the detached state. Otherwise, subsequent backups fail:

    pxctl volume inspect $pv_name |head -40
    

Accessing backed-up files from Red Hat OpenShift Container Storage

To access the backup files from Red Hat OpenShift Container Storage (OCS), complete the following steps:

  1. Create a volume snapshot of the persistent volume claim that is used for the PostgreSQL backup:

    cat <<EOF | oc apply -f -
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: wa-backup-snapshot
    spec:
      source:
        persistentVolumeClaimName: ${INSTANCE_NAME}-store-db-backup-pvc
      volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
    EOF
    
  2. Create a persistent volume claim from the volume snapshot:

    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: wa-backup-snapshot-pvc
    spec:
      storageClassName: ocs-storagecluster-ceph-rbd
      accessModes:
      - ReadWriteOnce
      volumeMode: Filesystem
      dataSource:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: wa-backup-snapshot
      resources:
        requests:
          storage: 1Gi     
    EOF
    
  3. Create a pod to access the persistent volume claim:

    cat <<EOF | oc apply -f -
    kind: Pod
    apiVersion: v1
    metadata:
      name: wa-retrieve-backup
    spec:
      volumes:
        - name: backup-snapshot-pvc
          persistentVolumeClaim:
            claimName: wa-backup-snapshot-pvc
      containers:
        - name: retrieve-backup-container
          image: cp.icr.io/cp/watson-assistant/conan-tools:20210630-0901-signed@sha256:e6bee20736bd88116f8dac96d3417afdfad477af21702217f8e6321a99190278
          command: ['sh', '-c', 'echo The pod is running && sleep 360000']
          volumeMounts:
            - mountPath: "/watson_data"
              name: backup-snapshot-pvc
    EOF
    
  4. If you do not know the name of the backup file that you want to extract and are unable to check the most recent backup cron job, run the following command:

    oc exec -it wa-retrieve-backup -- ls /watson_data
    
  5. Transfer the backup files to a secure location:

    kubectl cp wa-retrieve-backup:/watson_data/${FILENAME} ${SECURE_LOCAL_DIRECTORY}/${FILENAME}
    
  6. Run the following commands to clean up the resources that you created to retrieve the files:

    oc delete pod wa-retrieve-backup
    oc delete pvc wa-backup-snapshot-pvc
    oc delete volumesnapshot wa-backup-snapshot
    

Extracting a PostgreSQL backup by using a debug pod

To extract a PostgreSQL backup by using a debug pod, complete the following steps:

  1. Get the name of the store cronjob pod:

    export STORE_CRONJOB_POD=`oc get pods -l component=store-cronjob --no-headers | awk 'NR==1{print $1}'`
    
  2. View the list of available store backups to identify the most recent backup:

    oc debug ${STORE_CRONJOB_POD}
    ls /store-backups/
    

    In the list of store backups, use the timestamps in the file names to identify the latest backup.
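
    For example, inside the debug pod, a quick way to show the newest backup first (a minimal sketch, assuming timestamped file names):

    ls -t /store-backups/ | head -n 1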

  3. While the debug pod from Step 2 remains active, open a separate terminal session and set the STORE_CRONJOB_POD variable to the name of the store cronjob pod that is returned in Step 1:

    export STORE_CRONJOB_POD=`oc get pods -l component=store-cronjob --no-headers | awk 'NR==1{print $1}'`
    
  4. Set the STORE_DUMP_FILE variable to the name of the most recent store.dump_YYYYMMDD-TIME file from Step 2:

    export STORE_DUMP_FILE=store.dump_YYYYMMDD-TIME
    
  5. Copy the store.dump_YYYYMMDD-TIME file to a directory in a secure location on your system:

    oc cp ${STORE_CRONJOB_POD}-debug:/store-backups/${STORE_DUMP_FILE} ${STORE_DUMP_FILE}
    

    Verify that you copied the store.dump_YYYYMMDD-TIME file to the correct directory by running the ls command.
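
    For example:

    ls -l ${STORE_DUMP_FILE}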

Backing up data by using the script

Note: You cannot back up data by using the script in watsonx Assistant for IBM Cloud Pak® for Data 4.6.3 or later.

The backupPG.sh script gathers the pod name and credentials for one of your PostgreSQL pods. Then, the backupPG.sh script uses the PostgreSQL pod to run the pg_dump command.

To back up data by using the provided script, complete the following steps:

  1. Download the backupPG.sh script.

    Go to GitHub and open the directory for your version to find the file.

    If the backupPG.sh script doesn't exist in the directory for your version, back up your data by using the Kubernetes CronJob or the pg_dump tool.

  2. Log in to the Red Hat OpenShift project namespace where you installed the product.

  3. Run the script:

    ./backupPG.sh --instance ${INSTANCE} > ${BACKUP_DIR}
    

    Replace the following values in the command:

    • ${BACKUP_DIR}: Specify the path of the file where you want to write the backup data, including a backup directory. For example, /bu/backup-file-name.dump writes the backup to a file in a directory named bu. The directory must already exist. See the example after this list.
    • --instance ${INSTANCE}: Select the specific instance to be backed up.
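
    For example, a minimal sketch with a hypothetical instance name wa:

    mkdir -p bu
    ./backupPG.sh --instance wa > bu/backup-file-name.dump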

If you prefer to back up data by using the PostgreSQL tool directly, you can complete the procedure to back up data manually.

Backing up data manually

Complete the steps in this procedure to back up your data by using the PostgreSQL tool directly.

To back up your data, complete these steps:

  1. Fetch a running PostgreSQL pod:

    Only for version 5.1.0 or greater, use:

    oc get pods -l app=${INSTANCE}-postgres-16 -o jsonpath="{.items[0].metadata.name}"
    

    For versions below 5.1.0, use:

    oc get pods -l app=${INSTANCE}-postgres -o jsonpath="{.items[0].metadata.name}"
    

    Replace ${INSTANCE} with the instance of the deployment that you want to back up.

  2. Perform the following two steps only if you have version 5.0.0, or version 4.8.5 or earlier:

    a. Fetch the store VCAP secret name:

    oc get secrets -l component=store,app.kubernetes.io/instance=${INSTANCE} -o=custom-columns=NAME:.metadata.name | grep store-vcap
    

    b. Fetch the PostgreSQL connection values, which you pass to the command that you run in step 4. You must have jq installed. A consolidated sketch that captures all four values follows this list.

    • To get the database:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.database'
      
    • To get the hostname:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.host'
      
    • To get the username:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.username'
      
    • To get the password:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.password'
      
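    For example, a minimal sketch that captures all four values at once (assuming that $VCAP_SECRET_NAME holds the secret name from step 2a):

      CREDS=$(oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq '.["user-provided"][]|.credentials')
      DATABASE=$(echo "$CREDS" | jq --raw-output '.database')
      HOSTNAME=$(echo "$CREDS" | jq --raw-output '.host')
      USERNAME=$(echo "$CREDS" | jq --raw-output '.username')
      PASSWORD=$(echo "$CREDS" | jq --raw-output '.password')
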
  3. Perform the following two steps only if you have version 4.8.6 or later, or version 5.0.1 or later:

    a. Fetch the store connection secret name:

    oc get secrets -l component=store-subsystem,app.kubernetes.io/instance=${INSTANCE} -o=custom-columns=NAME:.metadata.name | grep store-datastore-connection
    

    b. Fetch the PostgreSQL connection values, which you pass to the command that you run in step 4. You must have jq installed. The consolidated sketch from step 2 applies here too, with the store_vcap_services key in place of vcap_services.

    • To get the database:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.database'
      
    • To get the hostname:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.host'
      
    • To get the username:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.username'
      
    • To get the password:

      oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.password'
      
  4. Run the following command:

    oc exec $KEEPER_POD -- bash -c "export PGPASSWORD='$PASSWORD' && pg_dump -Fc -h $HOSTNAME -d $DATABASE -U $USERNAME" > ${BACKUP_DIR}
    

    The following list describes the arguments. You retrieved the values for some of these parameters in step 2 or step 3:

    Only for version 5.1.0 or greater:

    • $KEEPER_POD: Any PostgreSQL 16 pod in your instance.

    For versions below 5.1.0:

    • $KEEPER_POD: Any PostgreSQL pod in your instance.

    For all versions:

    • ${BACKUP_DIR}: Specify the path of the file where you want to write the backup data, including a backup directory. For example, /bu/backup-file-name.dump writes the backup to a file in a directory named bu. The directory must already exist.
    • $DATABASE: The store database name that you retrieved from the secret in step 2 or step 3.
    • $HOSTNAME: The hostname that you retrieved from the secret in step 2 or step 3.
    • $USERNAME: The username that you retrieved from the secret in step 2 or step 3.
    • $PASSWORD: The password that you retrieved from the secret in step 2 or step 3.

    To see more information about the pg_dump command, you can run this command:

    oc exec -it ${KEEPER_POD} -- pg_dump --help
    
  5. Take a backup of the secret that contains the encryption key. Skip this step if the secret is not available in your release.

    oc get secret -l service=conversation,app=$INSTANCE-auth-encryption
    oc get secret $INSTANCE-auth-encryption -o yaml > auth-encryption-secret.yaml
    

Restoring data

IBM created a restore tool that is named pgmig. The tool restores your database backup by adding it to a database that you choose. It also upgrades the schema to the one that is associated with the version of the product where you restore the data. Before the tool adds the backed-up data, it removes the data for all instances in the current service deployment, including any spare instances.

Prerequisite: Set up the auth-encryption secret that you backed up earlier:

oc apply -f auth-encryption-secret.yaml
oc get secret -l service=conversation,app=$INSTANCE-auth-encryption

  1. Install the target IBM Cloud Pak for Data cluster to which you want to restore the data.

    From the web client for the target cluster, create one service instance for each service instance that was backed up on the old cluster. The target IBM Cloud Pak for Data cluster must have the same number of instances as the environment where you backed up the database.

  2. Back up the current database before you replace it with the backed-up database.

    The tool clears the current database before it restores the backup. So, if you might need to revert to the current database, be sure to create a backup of it first.

  3. Go to the backup directory that you specified in the ${BACKUP_DIR} parameter in the previous procedure.

  4. Run the following commands to download the pgmig tool from the GitHub Watson Developer Cloud Community repository.

    In the first command, update <WA_VERSION> to the version that you want to restore. For example, update <WA_VERSION> to 4.6.0 if you want to restore 4.6.0.

    wget https://github.com/watson-developer-cloud/community/raw/master/watson-assistant/data/<WA_VERSION>/pgmig
    chmod 755 pgmig
    
  5. Create the following two configuration files and store them in the same backup directory. For details, see Creating the resourceController.yaml file and Creating the postgres.yaml file.

  6. Get the secret that contains the PostgreSQL CA certificate:

    Only for version 5.1.0 or greater, use:

    oc get secret ${INSTANCE}-postgres-16-ca -o jsonpath='{.data.ca\.crt}' | base64 -d | tee ${BACKUP_DIR}/ca.crt | openssl x509 -noout -text
    

    For version below 5.1.0, use:

    oc get secret ${INSTANCE}-postgres-ca -o jsonpath='{.data.ca\.crt}' | base64 -d | tee ${BACKUP_DIR}/ca.crt | openssl x509 -noout -text
    
    • Replace ${INSTANCE} with the name of the instance that you are restoring.
    • Replace ${BACKUP_DIR} with the directory where the postgres.yaml and resourceController.yaml files are located.
  7. Copy the files that you downloaded and created in the previous steps to any existing directory on a PostgreSQL pod.

    a. Only for version 5.1.0 or greater:

    Run the following command to find PostgreSQL pods:

      oc get pods | grep ${INSTANCE}-postgres-16
    

    b. For versions below 5.1.0:

    Run the following command to find PostgreSQL pods:

      oc get pods | grep ${INSTANCE}-postgres
    

    c. The files that you must copy are pgmig, postgres.yaml, resourceController.yaml, ca.crt (the secret file that is generated in step 6), and the file that you created for your downloaded data. Run the following commands to copy the files.

    If you are restoring data to a stand-alone IBM Cloud Pak for Data cluster, then replace all references to oc with kubectl in these sample commands.

      oc exec -it ${POSTGRES_POD} -- mkdir /controller/tmp
      oc exec -it ${POSTGRES_POD} -- mkdir /controller/tmp/bu
      oc rsync ${BACKUP_DIR}/ ${POSTGRES_POD}:/controller/tmp/bu/
    
    • Replace ${POSTGRES_POD} with the name of one of the PostgreSQL pods from the previous step.
  8. Stop the store deployment. First, scale down the operator so that it does not revert your changes, and then list the store deployment:

    oc scale deploy ibm-watson-assistant-operator -n ${OPERATOR_NS} --replicas=0
    oc get deployments -l component=store
    

    Make a note of how many replicas are in the store deployment, and then scale the store deployment down to 0 replicas:

    oc scale deployment ${STORE_DEPLOYMENT} --replicas=0
    
  9. Initiate the execution of a remote command in the PostgreSQL pod:

    oc exec -it ${POSTGRES_POD} -- /bin/bash
    
  10. Run the pgmig tool.

    Only for version 5.1.0 or greater. Export the environment variables before you run the tool so that they take effect:

    cd /controller/tmp/bu
    export PG_CA_FILE=/controller/tmp/bu/ca.crt
    export ENABLE_ICP=true
    ./pgmig --resourceController resourceController.yaml --target postgres.yaml --source <backup-file-name.dump>
    
  11. Run the pgmig tool.

    For versions below 5.1.0:

    cd /controller/tmp/bu
    export PG_CA_FILE=/controller/tmp/bu/ca.crt
    ./pgmig --resourceController resourceController.yaml --target postgres.yaml --source <backup-file-name.dump>
    
    • Replace <backup-file-name.dump> with the name of the file that you created for your downloaded data.

    For more command options, see PostgreSQL migration tool details.

    As the script runs, you are prompted for information that includes the instance on the target cluster to which to add the backed-up data. The data on the instance you specify is removed and replaced. If there are multiple instances in the backup, you are prompted multiple times to specify the target instance information.

  12. Scale the store deployment back up:

    oc scale deployment ${STORE_DEPLOYMENT} --replicas=${ORIGINAL_NUMBER_OF_REPLICAS}
    oc scale deploy ibm-watson-assistant-operator -n ${OPERATOR_NS} --replicas=1
    

    You might need to wait a few minutes before the data that you restored is visible from the web interface.

  13. After you restore the data, you must retrain the backend model. For more information, see Retraining your backend model.

Creating the resourceController.yaml file

The resourceController.yaml file contains details about the new environment where you are adding the backed-up data. Add the following information to the file:

accessTokens: 
  - value
  - value2
host: localhost
port: 5000

To add the values that are required but currently missing from the file, complete the following steps:

  1. To get the accessTokens values, gather the bearer tokens for the service instances:

    • Log in to the IBM Cloud Pak for Data web client.
    • From the main IBM Cloud Pak for Data web client navigation menu, select My instances.
    • On the Provisioned instances tab, click your instance.
    • In the Access information of the instance, find the Bearer token. Copy the token and paste it into the accessTokens list.

    A bearer token for an instance can access all instances that are owned by the user. Therefore, if a single user owns all of the instances, then only one bearer token is required.

    If the service has multiple instances, each owned by a different user, then you must gather bearer tokens for each user who owns an instance. You can list multiple bearer token values in the accessTokens section.

  2. To get the host information, you need details for the pod that hosts the UI component:

    oc describe pod -l component=ui
    

    Look for the section that says RESOURCE_CONTROLLER_URL: https://${release-name}-addon-assistant-gateway-svc.zen:5000/api/ibmcloud/resource-controller.

    For example, you can use a command like this to find it:

    oc describe pod -l component=ui | grep RESOURCE_CONTROLLER_URL
    

    Copy the host that is specified in the RESOURCE_CONTROLLER_URL. The host value is the RESOURCE_CONTROLLER_URL value, excluding the protocol at the beginning and everything from the port to the end of the value. For example, for the previous example, the host is ${release-name}-addon-assistant-gateway-svc.zen.

  3. To get the port information, again check the RESOURCE_CONTROLLER_URL entry. The port is specified after <host>: in the URL. In this sample URL, the port is 5000.

  4. Paste the values that you discovered into the YAML file and save it.
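
For reference, a minimal sketch that extracts the host and port values described in steps 2 and 3 (assuming the URL has the shape shown in the example):

  URL=$(oc describe pod -l component=ui | grep -m1 RESOURCE_CONTROLLER_URL | awk '{print $2}')
  HOST=$(echo "$URL" | sed -E 's#^[a-z]+://([^:/]+).*#\1#')
  PORT=$(echo "$URL" | sed -E 's#^[a-z]+://[^:/]+:([0-9]+).*#\1#')
  echo "host: $HOST"
  echo "port: $PORT"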

Creating the postgres.yaml file

The postgres.yaml file contains details about the PostgreSQL pods in your target environment (the environment where you restore the data). Add the following information to the file:

host: localhost
port: 5432
database: store
username: user
su_username: admin
su_password: password

To add the values that are required but currently missing from the file, complete the following steps:

  1. For version 4.8.6 or later, or version 5.0.1 or later:

    To get information about the host, you must get the Store datastore connection strings secret.

    oc get secret ${INSTANCE}-store-datastore-connection-strings -o jsonpath='{.data.store_vcap_services}' | base64 -d
    
    

    For version 5.0.0, or version 4.8.5 or earlier:

    To get information about the host, you must get the Store VCAP secret.

    oc get secret ${INSTANCE}-store-vcap -o jsonpath='{.data.vcap_services}' | base64 -d
    

    The get command returns information about the Redis and PostgreSQL databases. Look for the segment of JSON code for the PostgreSQL database, which is named pgservice. It looks like this:

    {
      "user-provided":[
        {
          "name": "pgservice",
          "label": "user-provided",
          "credentials":
          {
            "host": "${INSTANCE}-rw",
            "port": 5432,
            "database": "conversation_pprd_${INSTANCE}",
            "username": "${dbadmin}",
            "password": "${password}"
          }
        }
      ]
    }
    
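    For example, a minimal one-liner that extracts just the pgservice credentials block (for the versions that use the store VCAP secret; assumes jq is installed):

    oc get secret ${INSTANCE}-store-vcap -o jsonpath='{.data.vcap_services}' | base64 -d | jq '.["user-provided"][] | select(.name=="pgservice") | .credentials'
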
  2. Copy the values for user-provided credentials (host, port, database, username, and password).

    You can specify the same values that were returned for username and password as the su_username and su_password values.

    The updated file looks something like this:

    Only for version 5.1.0 or greater:

    host: wa_inst-postgres-16-rw
    port: 5432
    database: conversation_pprd_wa_inst
    username: dbadmin
    su_username: dbadmin
    su_password: mypassword
    

    For versions below 5.1.0:

    host: wa_inst-postgres-rw
    port: 5432
    database: conversation_pprd_wa_inst
    username: dbadmin
    su_username: dbadmin
    su_password: mypassword
    
  3. Save the postgres.yaml file.

PostgreSQL migration tool details

The following table lists the arguments that are supported by the pgmig tool:

| Argument | Description |
|----------|-------------|
| -h, --help | Command usage |
| -f, --force | Erase data if present in the target Store |
| -s, --source string | Backup file name |
| -r, --resourceController string | Resource Controller configuration file name |
| -t, --target string | Target PostgreSQL server configuration file name |
| -m, --mapping string | Service instance-mapping configuration file name (optional) |
| --testRCConnection | Test the connection for the Resource Controller, then exit |
| --testPGConnection | Test the connection for the PostgreSQL server, then exit |
| -v, --version | Get the build version |

The mapping configuration file

After you run the script and specify the mappings when prompted, the tool generates a file that is named enteredMapping.yaml in the current directory. This file reflects the mapping of the old cluster details to the new cluster based on the interactive inputs that were provided while the script was running.

For example, the YAML file contains values like this:

instance-mappings:
  00000000-0000-0000-0000-001570184978: 00000000-0000-0000-0000-001570194490

Where the first value (00000000-0000-0000-0000-001570184978) is the instance ID in the database backup and the second value (00000000-0000-0000-0000-001570194490) is the ID of a provisioned instance in the service on the system.

You can pass this file to the script for subsequent runs of the script in the same environment. Or you can edit it for use in other back up and restore operations. The mapping file is optional. If it is not provided, the tool prompts you for the mapping details based on information you provide in the YAML files.
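
For example, a subsequent run that reuses the generated mapping file might look like this (assuming the file names from the earlier steps):

    ./pgmig --resourceController resourceController.yaml --target postgres.yaml --source <backup-file-name.dump> --mapping enteredMapping.yaml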

Retraining your backend model

Depending on the number of models in your assistant, you can use one of the following options to retrain your backend model:

Retrain your backend model manually

When you open a dialog skill after a change in the training data, training starts automatically. Give the skill time to retrain on the restored data; training usually takes less than 10 minutes. The training process requires at least one node with 4 CPUs that can be dedicated to training. Therefore, open restored assistants and skills during low-traffic periods, and open them one at a time. If the assistant or dialog skill does not respond, modify the workspace (for example, add an intent and then remove it) to trigger training, and then check that the skill responds correctly.

Auto-retrain your backend model

When you have a large number of models to retrain, you can use the auto-retrain-all job to train the backend model. To learn more about the auto-retrain-all job and its implementation, refer to the following topics:

Before you begin

Before you run the auto-retrain-all job, you must ensure that the PostgreSQL database and Cloud Object Storage, which store your action and dialog skills along with their snapshots, are active and not corrupted. In addition, you must ensure that your assistants do not receive or send any data while the auto-retrain-all job runs.

Planning

To get a good estimate of the duration of the auto-retrain-all job, you can use the calculate_autoretrain_all_job_duration.sh script:

  #!/bin/bash

  # Estimate the duration (in seconds) of the auto-retrain-all job for a given
  # number of training pods, by using the formula:
  #   workspaces * 60 / (pods * 2) + workspaces * 2
  calculate_duration() {
    local training_pods="$1"
    DURATION=$(( NUM_OF_WORKSPACES_TO_TRAIN * 60 / (training_pods * 2) + NUM_OF_WORKSPACES_TO_TRAIN * 2 ))
  }

  # Count the workspaces that are registered in etcd. The pod name (wa-etcd-0) and
  # the namespace (cpd) are examples; adjust them to match your deployment.
  NUM_OF_WORKSPACES_TO_TRAIN=$(oc exec wa-etcd-0 -n cpd -- bash -c '
  password="$( cat /var/run/credentials/pass.key )"
  etcdctl_user="root:$password"
  export ETCDCTL_USER="$etcdctl_user"

  ETCDCTL_API=3 etcdctl --cert=/etc/etcdtls/operator/etcd-tls/etcd-client.crt --key=/etc/etcdtls/operator/etcd-tls/etcd-client.key --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt --endpoints=https://$(hostname).${CLUSTER_NAME}.cpd.svc.cluster.local:2379 get --prefix /bluegoat/voyager-nlu/voyager-nlu-slot-wa/workspaces/ --keys-only | sed '/^$/d' | wc -l')

  echo "Number of workspaces to train: $NUM_OF_WORKSPACES_TO_TRAIN"

  calculate_duration 5
  DURATION_5=$DURATION

  calculate_duration 10
  DURATION_10=$DURATION

  calculate_duration 15
  DURATION_15=$DURATION

  echo "Approximate duration of the auto-retrain-all job if you have 5 training pods: $DURATION_5 seconds"
  echo "Approximate duration of the auto-retrain-all job if you have 10 training pods: $DURATION_10 seconds"
  echo "Approximate duration of the auto-retrain-all job if you have 15 training pods: $DURATION_15 seconds"

In addition, after you get the estimated duration, you can plan to speed up the auto-retrain-all job. For more information, see the Speeding up the auto-retrain-all job topic.

Procedure

To retrain your backend model by using the auto-retrain-all job, complete the following steps:

Set up the environment variables for the auto-retrain-all job

Set up the following environment variables before you run the auto-retrain-all job:

  1. Set the AUTO_RETRAIN environment variable to false to disable any existing auto-retrain job:

      export AUTO_RETRAIN="false"
    
  2. To set the BATCH_RETRAIN_ALL_SIZE environment variable, multiply the number of available training replicas (CLU_TRAINING_REPLICAS) by 2, based on the assumption that each model takes approximately 30 seconds to train. Use the following command to set BATCH_RETRAIN_ALL_SIZE:

      export BATCH_RETRAIN_ALL_SIZE=$(($(oc get deploy ${INSTANCE}-clu-training --template='{{index .spec.replicas}}') * 2))
    
  3. Set WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL to (60-${BATCH_RETRAIN_ALL_SIZE}):

      export WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL=$((60-${BATCH_RETRAIN_ALL_SIZE}))
    
  4. Set WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL to 1:

      export WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL=1
    
  5. Set AUTO_RETRAIN_ALL_CRON_SCHEDULE to the time that you want to run the auto-retrain-all job:

      export AUTO_RETRAIN_ALL_CRON_SCHEDULE=<value of cron schedule>
    

    For example, you can give a value such as "0 40 19 11 3 ? 2024", which is in the following format:

    (Seconds) (Minutes) (Hours) (Day of Month) (Month) (Day of Week) (Year)

    You must set the time in the UTC time zone.

  6. Set AUTO_RETRAIN_ALL_ENABLED to true:

      export AUTO_RETRAIN_ALL_ENABLED="true"
    
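For example, with 10 training replicas, BATCH_RETRAIN_ALL_SIZE is 10 * 2 = 20 and WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL is 60 - 20 = 40 seconds.
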
Run the auto-retrain-all job
  1. To run the auto-retrain-all job, use the following command:

        export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
        export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
    
        cat <<EOF | oc apply -f -
        apiVersion: assistant.watson.ibm.com/v1
        kind: TemporaryPatch
        metadata:
          name: ${INSTANCE}-store-admin-env-vars
          namespace: ${PROJECT_CPD_INST_OPERANDS}
        spec:
          apiVersion: assistant.watson.ibm.com/v1
          kind: WatsonAssistantStore
          name: ${INSTANCE}
          patchType: patchStrategicMerge
          patch:
            store-admin:
              deployment:
                spec:
                  template:
                    spec:
                      containers:
                      - name: store-admin
                        env:
                        - name: AUTO_RETRAIN
                          value: "${AUTO_RETRAIN}"
                        - name: AUTO_RETRAIN_ALL_CRON_SCHEDULE
                          value: "${AUTO_RETRAIN_ALL_CRON_SCHEDULE}"
                        - name: AUTO_RETRAIN_ALL_ENABLED
                          value: "${AUTO_RETRAIN_ALL_ENABLED}"
                        - name: BATCH_RETRAIN_ALL_SIZE
                          value: "${BATCH_RETRAIN_ALL_SIZE}"
                        - name: WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL
                          value: "${WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL}"
                        - name: WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL
                          value: "${WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL}"
        EOF
    
  2. After the auto-retrain-all job completes, you must disable the auto-retrain-all flag and re-enable the auto-retrain flag by using the following commands:

      oc patch temporarypatch ${INSTANCE}-store-admin-env-vars -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
      oc delete temporarypatch ${INSTANCE}-store-admin-env-vars -n ${PROJECT_CPD_INST_OPERANDS}
      oc patch watsonassistantstore/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
    
Validate the auto-retrain-all job

You can validate the successful completion of the auto-retrain-all job by comparing the number of Affected workspaces found with the Retrained Total count in the store-admin service log. To get these counts, run the following command:

  oc logs $(oc get pod -l component=store-admin --no-headers |awk '{print $1}') | grep "\[RETRAIN-ALL-SUMMARY\] Affected workspaces found"

If the auto-retrain-all job is successful, the Retrained Total count equals the number of Affected workspaces found. If the difference between the two counts is small, the auto-retrain-all job still completes successfully by training the remaining models in the background. However, if there is a large difference between Retrained Total and Affected workspaces found, look at the store-admin logs to analyze the issue and consider speeding up the auto-retrain-all job.

Speeding up the auto-retrain-all job

The duration of the auto-retrain-all job depends on the number of models to train. To speed up the training process, you must scale the number of CLU_TRAINING_REPLICAS and its dependent replicas. For example, if you scale CLU_TRAINING_REPLICAS to x, scale the dependent replicas per the following calculation:

  • TFMM_REPLICAS to 0.5x
  • DRAGONFLY_CLU_MM_REPLICAS to 0.3x
  • CLU_EMBEDDING_REPLICAS to 0.2x
  • CLU_TRITON_SERVING_REPLICAS to 0.2x.

If a calculated replica count is a decimal number, round it up to the next whole number. For example, if TFMM_REPLICAS works out to 2.4, round it up to 3. A worked example follows this paragraph.
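
A worked sketch with x = 8 training replicas:

  export CLU_TRAINING_REPLICAS=8
  export TFMM_REPLICAS=4                # 0.5 * 8 = 4
  export DRAGONFLY_CLU_MM_REPLICAS=3    # 0.3 * 8 = 2.4, rounded up to 3
  export CLU_EMBEDDING_REPLICAS=2       # 0.2 * 8 = 1.6, rounded up to 2
  export CLU_TRITON_SERVING_REPLICAS=2  # 0.2 * 8 = 1.6, rounded up to 2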

Use the following steps to scale the replicas:

  1. Register the values of the number of replicas per your calculation:

      export CLU_TRAINING_REPLICAS=<value from calculation>
      export TFMM_REPLICAS=<value from calculation>
      export DRAGONFLY_CLU_MM_REPLICAS=<value from calculation>
      export CLU_EMBEDDING_REPLICAS=<value from calculation>
      export CLU_TRITON_SERVING_REPLICAS=<value from calculation>
    
  2. Increase the number of REPLICAS by using the following command:

      export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
      export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`
    
      cat <<EOF | oc apply -f -
      apiVersion: assistant.watson.ibm.com/v1
      kind: TemporaryPatch
      metadata:
        name: ${INSTANCE}-clu-training-replicas
        namespace: ${PROJECT_CPD_INST_OPERANDS}
      spec:
        apiVersion: assistant.watson.ibm.com/v1
        kind: WatsonAssistantCluTraining
        name: $INSTANCE
        patchType: patchStrategicMerge
        patch:
          clu-training:
            deployment:
              training:
                spec:
                  replicas: ${CLU_TRAINING_REPLICAS}
      EOF
    
      cat <<EOF | oc apply -f -
      apiVersion: assistant.watson.ibm.com/v1
      kind: TemporaryPatch
      metadata:
        name: ${INSTANCE}-clu-runtime-replicas
        namespace: ${PROJECT_CPD_INST_OPERANDS}
      spec:
        apiVersion: assistant.watson.ibm.com/v1
        kind: WatsonAssistantCluRuntime
        name: ${INSTANCE}
        patchType: patchStrategicMerge
        patch:
          tfmm:
            deployment:
              spec:
                replicas: ${TFMM_REPLICAS}
          dragonfly-clu-mm:
            deployment:
              spec:
                replicas: ${DRAGONFLY_CLU_MM_REPLICAS}
      EOF
    
      cat <<EOF | oc apply -f -
      apiVersion: assistant.watson.ibm.com/v1
      kind: TemporaryPatch
      metadata:
        name: ${INSTANCE}-clu-replicas
        namespace: ${PROJECT_CPD_INST_OPERANDS}
      spec:
        apiVersion: assistant.watson.ibm.com/v1
        kind: WatsonAssistantClu
        name: ${INSTANCE}
        patchType: patchStrategicMerge
        patch:
          clu-embedding:
            deployment:
              spec:
                replicas: ${CLU_EMBEDDING_REPLICAS}
          clu-triton-serving:
            deployment:
              spec:
                replicas: ${CLU_TRITON_SERVING_REPLICAS}
      EOF
    
  3. After you complete the auto-retrain-all job, you must revert the number of REPLICAS to the original numbers:

     oc patch temporarypatch ${INSTANCE}-clu-training-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
     oc patch temporarypatch ${INSTANCE}-clu-runtime-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
     oc patch temporarypatch ${INSTANCE}-clu-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
    
     oc delete temporarypatch ${INSTANCE}-clu-training-replicas -n ${PROJECT_CPD_INST_OPERANDS}
     oc delete temporarypatch ${INSTANCE}-clu-runtime-replicas -n ${PROJECT_CPD_INST_OPERANDS}
     oc delete temporarypatch ${INSTANCE}-clu-replicas -n ${PROJECT_CPD_INST_OPERANDS}
    
     oc patch watsonassistantclutraining/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
     oc patch watsonassistantcluruntime/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
     oc patch watsonassistantclu/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
     oc patch watsonassistantclutraining/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge
     oc patch watsonassistantcluruntime/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge
     oc patch watsonassistantclu/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge