Backing up and restoring data for IBM Cloud Pak for Data
You can back up and restore the data that is associated with your installation in IBM Cloud Pak for Data.
The following table lists the upgrade paths that are supported by the scripts.
| Version in use | Version that you can upgrade to |
|---|---|
| 5.0.x | 5.1.x |
| 4.8.x | 5.0.x or 5.1.x |
| 4.7.x | 4.8.x or 5.0.x |
| 4.6.x | 4.7.x or 4.8.x |
| 4.5.x | 4.6.x or 4.7.x |
| 4.0.x | 4.5.x or 4.6.x |
Simpler ways to complete the upgrade are described in the following topics:
- Upgrading watsonx Assistant to Version 5.1.x
- Upgrading watsonx Assistant to Version 5.0.x
- Upgrading watsonx Assistant to Version 4.8.x
- Upgrading watsonx Assistant to Version 4.7.x
- Upgrading watsonx Assistant to Version 4.6.x
If you are upgrading from 4.6.4 or earlier to the latest version, you must upgrade to 4.6.5 before you upgrade to the latest release.
The primary data storage is a PostgreSQL database.
Choose one of the following ways to manage the backup of data:

- Kubernetes CronJob: Use the `$INSTANCE-store-cronjob` cron job that is provided for you.
- backupPG.sh script: Use the `backupPG.sh` bash script.
- pg_dump tool: Run the `pg_dump` tool on each cluster directly. This manual option gives you control over the process.
When you back up data with one of these procedures before you upgrade from one version to another, the workspace IDs of your skills are preserved, but the service instance IDs and credentials change.
Before you begin
- When you create a backup by using this procedure, the backup includes all of the assistants and skills from all of the service instances. It can include skills and assistants to which you do not have access.
- The access permission information for the original service instances is not stored in the backup. This means that the original access rights, which determine who can and cannot see a service instance, are not preserved.
- You cannot use this procedure to back up the data that is returned by the search integration. Data that is retrieved by the search integration comes from a data collection in a Discovery instance. See the Discovery documentation to find out how to back up its data.
- If you back up and restore or otherwise change the Discovery service that your search integration connects to, then you cannot restore the search integration, but must re-create it. When you set up a search integration, you map sections of the assistant's response to fields in a data collection that is hosted by an instance of Discovery on the same cluster. If the Discovery instance changes, your mapping to it is broken. If your Discovery service does not change, then the search integration can continue to connect to the data collection.
- The tool that restores the data clears the current database before it restores the backup. Therefore, if you might need to revert to the current database, create a backup of it first.
- The target IBM Cloud Pak for Data cluster where you restore the data must have the same number of provisioned service instances as the environment from which you back up the database. To verify in the IBM Cloud Pak for Data web client, select Services from the main navigation menu, select Instances, and then open the Provisioned instances tab. If more than one user created instances, then ask the other users who created instances to log in and check the number that they created. You can then add up the total sum of instances for your deployment. Not even an administrative user can see instances that were created by others from the web client user interface.
Backing up data by using the CronJob
A CronJob named `$INSTANCE-store-cronjob` is created and enabled for you automatically when you deploy the service. A CronJob is a type of Kubernetes controller that creates Jobs on a repeating schedule. For more information, see CronJob in the Kubernetes documentation.

The store CronJob creates `$INSTANCE-backup-job-$TIMESTAMP` jobs. Each `$INSTANCE-backup-job-$TIMESTAMP` job deletes old logs and runs a backup of the store PostgreSQL database. The backup is created with the PostgreSQL `pg_dump` tool, which writes the database contents to `stdout` so that they can be saved to a file. The resulting backups are stored in a persistent volume claim (PVC) named `$INSTANCE-store-db-backup-pvc`.
You are responsible for moving the backup to a more secure location after its initial creation, preferably a location that can be accessed outside of the cluster and where the backups cannot be deleted easily. Ensure that this happens for all environments, especially for production clusters.
The following table lists the configuration values that control the backup cron job. You can edit these settings after the service is deployed by using the `oc edit cronjob $INSTANCE-store-cronjob` command.
| Variable | Description | Default value |
|---|---|---|
| store.backup.suspend | If True, the cron job does not create any backup jobs. | False |
| store.backup.schedule | Specifies the time of day at which to run the backup jobs. Specify the schedule by using a cron expression in the form {minute} {hour} {day} {month} {day-of-week}, where {day-of-week} is specified as 0=Sunday, 1=Monday, and so on. The default schedule is to run every day at 11 PM. | 0 23 * * * |
| store.backup.history.jobs.success | The number of successful jobs to keep. | 30 |
| store.backup.history.jobs.failed | The number of failed jobs to keep in the job logs. | 10 |
| store.backup.history.files.weekly_backup_day | The day of the week that is designated as the weekly backup day. 0=Sunday, 1=Monday, and so on. | 0 |
| store.backup.history.files.keep_weekly | The number of backups to keep that were taken on weekly_backup_day. | 4 |
| store.backup.history.files.keep_daily | The number of backups to keep that were taken on all the other days of the week. | 6 |
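For example, to run the backup at 2 AM every Saturday instead of the default daily 11 PM run, you could change the schedule field of the CronJob directly. The following is a minimal sketch that assumes an instance named `wa`; if the operator reconciles the CronJob, the `store.backup.schedule` setting described above remains the supported way to control the schedule.

```sh
# Sketch: switch the store backup to 02:00 every Saturday (cron: minute hour day month day-of-week).
# "wa" is a placeholder instance name; adjust the CronJob name for your deployment.
oc patch cronjob wa-store-cronjob --type merge -p '{"spec":{"schedule":"0 2 * * 6"}}'

# Confirm the current schedule.
oc get cronjob wa-store-cronjob
```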
Accessing backed-up files from Portworx
To access the backup files from Portworx, complete the following steps:
1. Get the name of the persistent volume that is used for the PostgreSQL backup:

   ```
   oc get pv | grep $INSTANCE-store
   ```

   This command returns the name of the persistent volume where the store backup is located, such as `pvc-d2b7aa93-3602-4617-acea-e05baba94de3`. The name is referred to later in this procedure as `$pv_name`.

2. Find the nodes where Portworx is running:

   ```
   oc get pods -n kube-system -o wide -l name=portworx-api
   ```

3. Log in as the core user to one of the nodes where Portworx is running:

   ```
   ssh core@<node hostname>
   sudo su -
   ```

4. Make sure that the persistent volume is in a detached state and that no store backups are scheduled to occur during the time that you plan to transfer the backup files.

   Remember, backups occur daily at 11 PM (in the time zone that is configured for the nodes) unless you change the schedule by editing the value of the `store.backup.schedule` configuration parameter. You can run the `oc get cronjobs` command to check the current schedule for the `$INSTANCE-store-cronjob` job. In the following command, `$pv_name` is the name of the persistent volume that you discovered in the first step of this task:

   ```
   pxctl volume inspect $pv_name | head -40
   ```

5. Attach the persistent volume to the host:

   ```
   pxctl host attach $pv_name
   ```

6. Create a folder where you want to mount the volume:

   ```
   mkdir /var/lib/osd/mounts/voldir
   ```

7. Mount the volume:

   ```
   pxctl host mount $pv_name --path /var/lib/osd/mounts/voldir
   ```

8. Change to the `/var/lib/osd/mounts/voldir` directory and transfer the backup files to a secure location (for example, by using scp, as sketched after these steps). Afterward, exit the directory and unmount the volume:

   ```
   pxctl host unmount --path /var/lib/osd/mounts/voldir $pv_name
   ```

9. Detach the volume from the host:

   ```
   pxctl host detach $pv_name
   ```

10. Make sure that the volume is in the detached state. Otherwise, subsequent backups fail:

    ```
    pxctl volume inspect $pv_name | head -40
    ```
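If you need a way to move the files off the node, the following is a minimal sketch that uses `scp`; the backup file name and destination host are placeholders.

```sh
# Sketch: copy a backup file from the mounted Portworx volume to an external host.
# "store.dump_YYYYMMDD-TIME" and "user@backup-host" are placeholders.
cd /var/lib/osd/mounts/voldir
scp store.dump_YYYYMMDD-TIME user@backup-host:/backups/
```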
Accessing backed-up files from Red Hat OpenShift Container Storage
To access the backup files from Red Hat OpenShift Container Storage (OCS), complete the following steps:
1. Create a volume snapshot of the persistent volume claim that is used for the PostgreSQL backup:

   ```
   cat <<EOF | oc apply -f -
   apiVersion: snapshot.storage.k8s.io/v1
   kind: VolumeSnapshot
   metadata:
     name: wa-backup-snapshot
   spec:
     source:
       persistentVolumeClaimName: ${INSTANCE_NAME}-store-db-backup-pvc
     volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
   EOF
   ```

2. Create a persistent volume claim from the volume snapshot:

   ```
   cat <<EOF | oc apply -f -
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: wa-backup-snapshot-pvc
   spec:
     storageClassName: ocs-storagecluster-ceph-rbd
     accessModes:
       - ReadWriteOnce
     volumeMode: Filesystem
     dataSource:
       apiGroup: snapshot.storage.k8s.io
       kind: VolumeSnapshot
       name: wa-backup-snapshot
     resources:
       requests:
         storage: 1Gi
   EOF
   ```

3. Create a pod to access the persistent volume claim:

   ```
   cat <<EOF | oc apply -f -
   kind: Pod
   apiVersion: v1
   metadata:
     name: wa-retrieve-backup
   spec:
     volumes:
       - name: backup-snapshot-pvc
         persistentVolumeClaim:
           claimName: wa-backup-snapshot-pvc
     containers:
       - name: retrieve-backup-container
         image: cp.icr.io/cp/watson-assistant/conan-tools:20210630-0901-signed@sha256:e6bee20736bd88116f8dac96d3417afdfad477af21702217f8e6321a99190278
         command: ['sh', '-c', 'echo The pod is running && sleep 360000']
         volumeMounts:
           - mountPath: "/watson_data"
             name: backup-snapshot-pvc
   EOF
   ```

4. If you do not know the name of the backup file that you want to extract and are unable to check the most recent backup cron job, run the following command:

   ```
   oc exec -it wa-retrieve-backup -- ls /watson_data
   ```

5. Transfer the backup files to a secure location:

   ```
   kubectl cp wa-retrieve-backup:/watson_data/${FILENAME} ${SECURE_LOCAL_DIRECTORY}/${FILENAME}
   ```

6. Run the following commands to clean up the resources that you created to retrieve the files:

   ```
   oc delete pod wa-retrieve-backup
   oc delete pvc wa-backup-snapshot-pvc
   oc delete volumesnapshot wa-backup-snapshot
   ```
Extracting PostgreSQL backup by using a debug pod
To extract a PostgreSQL backup by using a debug pod, complete the following steps:

1. Get the name of the store cronjob pod:

   ```
   export STORE_CRONJOB_POD=`oc get pods -l component=store-cronjob --no-headers | awk 'NR==1{print $1}'`
   ```

2. Start a debug pod and view the list of available store backups to identify the most recent backup:

   ```
   oc debug ${STORE_CRONJOB_POD}
   ls /store-backups/
   ```

   In the list of store backups, you can identify the latest backup by its timestamp.

3. While the debug pod from Step 2 remains active, in a separate terminal session, set the `STORE_CRONJOB_POD` variable to match the name of the store cronjob pod that is returned in Step 1:

   ```
   export STORE_CRONJOB_POD=`oc get pods -l component=store-cronjob --no-headers | awk 'NR==1{print $1}'`
   ```

4. Set the `STORE_DUMP_FILE` variable to the name of the most recent `store.dump_YYYYMMDD-TIME` file from Step 2:

   ```
   export STORE_DUMP_FILE=store.dump_YYYYMMDD-TIME
   ```

5. Copy the `store.dump_YYYYMMDD-TIME` file to a directory in a secure location on your system:

   ```
   oc cp ${STORE_CRONJOB_POD}-debug:/store-backups/${STORE_DUMP_FILE} ${STORE_DUMP_FILE}
   ```

   Verify that you copied the `store.dump_YYYYMMDD-TIME` file to the right directory by running the `ls` command.
Backing up data by using the script
Note: You cannot back up data by using the script in watsonx Assistant for IBM Cloud Pak® for Data 4.6.3 or later.
The `backupPG.sh` script gathers the pod name and credentials for one of your PostgreSQL pods. Then, the `backupPG.sh` script uses the PostgreSQL pod to run the `pg_dump` command.
To back up data by using the provided script, complete the following steps:
1. Download the `backupPG.sh` script.

   Go to GitHub, and find the directory for your version to find the file.

   If the `backupPG.sh` script doesn't exist in the directory for your version, back up your data by using the Kubernetes CronJob or the `pg_dump` tool.

2. Log in to the Red Hat OpenShift project namespace where you installed the product.

3. Run the script (a complete usage example follows these steps):

   ```
   ./backupPG.sh --instance ${INSTANCE} > ${BACKUP_DIR}
   ```

   Replace the following values in the command:

   - `${BACKUP_DIR}`: Specify a file where you want to write the downloaded data. Be sure to specify a backup directory in which to store the file. For example, `/bu/backup-file-name.dump` creates a backup directory named `bu`.
   - `--instance ${INSTANCE}`: Select the specific instance to be backed up.
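For example, assuming an instance named `wa` (a placeholder), a complete invocation might look like this:

```sh
# Sketch: back up the "wa" instance into a new /bu backup directory.
mkdir -p /bu
./backupPG.sh --instance wa > /bu/wa-store-backup.dump
```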
If you prefer to back up data by using the PostgreSQL tool directly, you can complete the procedure to back up data manually.
Backing up data manually
Complete the steps in this procedure to back up your data by using the PostgreSQL tool directly.
To back up your data, complete these steps:
1. Fetch a running PostgreSQL pod.

   Only for version 5.1.0 or greater, use:

   ```
   oc get pods -l app=${INSTANCE}-postgres-16 -o jsonpath="{.items[0].metadata.name}"
   ```

   For versions below 5.1.0, use:

   ```
   oc get pods -l app=${INSTANCE}-postgres -o jsonpath="{.items[0].metadata.name}"
   ```

   Replace `${INSTANCE}` with the instance of the deployment that you want to back up.
2. Perform the following two steps only if you have version 5.0.0 or 4.8.5 and earlier:

   a. Fetch the store VCAP secret name:

   ```
   oc get secrets -l component=store,app.kubernetes.io/instance=${INSTANCE} -o=custom-columns=NAME:.metadata.name | grep store-vcap
   ```

   b. Fetch the PostgreSQL connection values. You will pass these values to the command that you run in the next step. You must have `jq` installed.

   - To get the database:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.database'
     ```

   - To get the hostname:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.host'
     ```

   - To get the username:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.username'
     ```

   - To get the password:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.password'
     ```
3. Perform the following two steps only if you have version 4.8.6 or 5.0.1 and later:

   a. Fetch the store connection secret name:

   ```
   oc get secrets -l component=store-subsystem,app.kubernetes.io/instance=${INSTANCE} -o=custom-columns=NAME:.metadata.name | grep store-datastore-connection
   ```

   b. Fetch the PostgreSQL connection values. You will pass these values to the command that you run in the next step. You must have `jq` installed.

   - To get the database:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.database'
     ```

   - To get the hostname:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.host'
     ```

   - To get the username:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.username'
     ```

   - To get the password:

     ```
     oc get secret $VCAP_SECRET_NAME -o jsonpath="{.data.store_vcap_services}" | base64 --decode | jq --raw-output '.["user-provided"][]|.credentials|.password'
     ```
4. Run the following command:

   ```
   oc exec $KEEPER_POD -- bash -c "export PGPASSWORD='$PASSWORD' && pg_dump -Fc -h $HOSTNAME -d $DATABASE -U $USERNAME" > ${BACKUP_DIR}
   ```

   The following lists describe the arguments. You retrieved the values for some of these parameters in the previous steps.

   Only for version 5.1.0 or greater:

   - `$KEEPER_POD`: Any PostgreSQL 16 pod in your instance.

   For versions below 5.1.0:

   - `$KEEPER_POD`: Any PostgreSQL pod in your instance.

   For all versions:

   - `${BACKUP_DIR}`: Specify a file where you want to write the downloaded data. Be sure to specify a backup directory in which to store the file. For example, `/bu/backup-file-name.dump` creates a backup directory named `bu`.
   - `$DATABASE`: The store database name that was retrieved from the store secret in step 2 or step 3.
   - `$HOSTNAME`: The hostname that was retrieved from the store secret in step 2 or step 3.
   - `$USERNAME`: The username that was retrieved from the store secret in step 2 or step 3.
   - `$PASSWORD`: The password that was retrieved from the store secret in step 2 or step 3.

   To see more information about the `pg_dump` command, you can run this command:

   ```
   oc exec -it ${KEEPER_POD} -- pg_dump --help
   ```

   A quick way to verify the resulting dump file is sketched after these steps.
5. Take a backup of the secret that contains the encryption key. Skip this step if the following secret is not available in your release:

   ```
   oc get secret -l service=conversation,app=$INSTANCE-auth-encryption
   oc get secret $INSTANCE-auth-encryption -o yaml > auth-encryption-secret.yaml
   ```
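Optionally, you can sanity-check the dump before you move it to secure storage. Because `pg_dump -Fc` produces a custom-format archive, `pg_restore --list` can read its table of contents; the file name below is a placeholder.

```sh
# Sketch: list the contents of the custom-format dump to confirm that it is readable.
pg_restore --list /bu/backup-file-name.dump | head
```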
Restoring data
IBM created a restore tool called `pgmig`. The tool restores your database backup by adding it to a database that you choose. It also upgrades the schema to the one that is associated with the version of the product where you restore the data. Before the tool adds the backed-up data, it removes the data for all instances in the current service deployment, so any spares are also removed.
Prerequisite: Set up the `auth-encryption-secret` that you backed up earlier:

```
oc apply -f auth-encryption-secret.yaml
oc get secret -l service=conversation,app=$INSTANCE-auth-encryption
```
1. Install the target IBM Cloud Pak for Data cluster to which you want to restore the data.

   From the web client for the target cluster, create one service instance for each service instance that was backed up on the old cluster. The target IBM Cloud Pak for Data cluster must have the same number of instances as there were in the environment where you backed up the database.
2. Back up the current database before you replace it with the backed-up database.

   The tool clears the current database before it restores the backup. So, if you might need to revert to the current database, be sure to create a backup of it first.
3. Go to the backup directory that you specified in the `${BACKUP_DIR}` parameter in the previous procedure.

4. Run the following commands to download the `pgmig` tool from the GitHub Watson Developer Cloud Community repository.

   In the first command, update `<WA_VERSION>` to the version that you want to restore. For example, update `<WA_VERSION>` to `4.6.0` if you want to restore 4.6.0.

   ```
   wget https://github.com/watson-developer-cloud/community/raw/master/watson-assistant/data/<WA_VERSION>/pgmig
   chmod 755 pgmig
   ```
5. Create the following two configuration files and store them in the same backup directory:

   - `resourceController.yaml`: The Resource Controller file keeps a list of all provisioned instances. See Creating the resourceController.yaml file.
   - `postgres.yaml`: The PostgreSQL file lists details for the target PostgreSQL pods. See Creating the postgres.yaml file.

6. Get the secret:

   Only for version 5.1.0 or greater, use:

   ```
   oc get secret ${INSTANCE}-postgres-16-ca -o jsonpath='{.data.ca\.crt}' | base64 -d | tee ${BACKUP_DIR}/ca.crt | openssl x509 -noout -text
   ```

   For versions below 5.1.0, use:

   ```
   oc get secret ${INSTANCE}-postgres-ca -o jsonpath='{.data.ca\.crt}' | base64 -d | tee ${BACKUP_DIR}/ca.crt | openssl x509 -noout -text
   ```

   - Replace `${INSTANCE}` with the name of the instance that you want to back up.
   - Replace `${BACKUP_DIR}` with the directory where the `postgres.yaml` and `resourceController.yaml` files are located.
7. Copy the files that you downloaded and created in the previous steps to any existing directory on a PostgreSQL pod.

   a. Only for version 5.1.0 or greater, run the following command to find PostgreSQL pods:

   ```
   oc get pods | grep ${INSTANCE}-postgres-16
   ```

   b. For versions below 5.1.0, run the following command to find PostgreSQL pods:

   ```
   oc get pods | grep ${INSTANCE}-postgres
   ```

   c. The files that you must copy are `pgmig`, `postgres.yaml`, `resourceController.yaml`, `ca.crt` (the secret file that is generated in step 6), and the file that you created for your downloaded data. Run the following commands to copy the files.

   If you are restoring data to a stand-alone IBM Cloud Pak for Data cluster, then replace all references to `oc` with `kubectl` in these sample commands.

   ```
   oc exec -it ${POSTGRES_POD} -- mkdir /controller/tmp
   oc exec -it ${POSTGRES_POD} -- mkdir /controller/tmp/bu
   oc rsync ${BACKUP_DIR}/ ${POSTGRES_POD}:/controller/tmp/bu/
   ```

   - Replace `${POSTGRES_POD}` with the name of one of the PostgreSQL pods from the previous step.
8. Stop the store deployment by scaling the store deployment down to 0 replicas:

   ```
   oc scale deploy ibm-watson-assistant-operator -n ${OPERATOR_NS} --replicas=0
   oc get deployments -l component=store
   ```

   Make a note of how many replicas there are in the store deployment, and then scale it down:

   ```
   oc scale deployment ${STORE_DEPLOYMENT} --replicas=0
   ```
9. Start a remote shell in the PostgreSQL pod:

   ```
   oc exec -it ${POSTGRES_POD} -- /bin/bash
   ```
10. Run the `pgmig` tool.

    Only for version 5.1.0 or greater:

    ```
    cd /controller/tmp/bu
    export PG_CA_FILE=/controller/tmp/bu/ca.crt
    ./pgmig --resourceController resourceController.yaml --target postgres.yaml --source <backup-file-name.dump>
    export ENABLE_ICP=true
    ```

    For versions below 5.1.0:

    ```
    cd /controller/tmp/bu
    export PG_CA_FILE=/controller/tmp/bu/ca.crt
    ./pgmig --resourceController resourceController.yaml --target postgres.yaml --source <backup-file-name.dump>
    ```

    - Replace `<backup-file-name.dump>` with the name of the file that you created for your downloaded data.

    For more command options, see PostgreSQL migration tool details.

    As the script runs, you are prompted for information that includes the instance on the target cluster to which to add the backed-up data. The data on the instance that you specify is removed and replaced. If there are multiple instances in the backup, you are prompted multiple times to specify the target instance information.
11. Scale the store deployment back up:

    ```
    oc scale deployment ${STORE_DEPLOYMENT} --replicas=${ORIGINAL_NUMBER_OF_REPLICAS}
    oc scale deploy ibm-watson-assistant-operator -n ${OPERATOR_NS} --replicas=1
    ```

    You might need to wait a few minutes before the data that you restored is visible from the web interface.
12. After you restore the data, you must retrain the backend model. For more information about retraining your backend model, see Retraining your backend model.
Creating the resourceController.yaml file
The resourceController.yaml file contains details about the new environment where you are adding the backed-up data. Add the following information to the file:
```
accessTokens:
  - value
  - value2
host: localhost
port: 5000
```
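For reference, a filled-in version of the file might look like the following sketch. The token, host, and port shown here are hypothetical placeholders; the steps that follow explain where to find your own values.

```yaml
# Hypothetical example only; substitute your own bearer token(s), host, and port.
accessTokens:
  - eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...   # bearer token copied from the web client
host: wa-addon-assistant-gateway-svc.zen       # host portion of RESOURCE_CONTROLLER_URL
port: 5000                                     # port portion of RESOURCE_CONTROLLER_URL
```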
To add the values that are required but currently missing from the file, complete the following steps:
1. To get the `accessTokens` values list, you need to get a list of bearer tokens for the service instances.

   - Log in to the IBM Cloud Pak for Data web client.
   - From the main IBM Cloud Pak for Data web client navigation menu, select My instances.
   - On the Provisioned instances tab, click your instance.
   - In the Access information of the instance, find the Bearer token. Copy the token and paste it into the `accessTokens` list.

   A bearer token for an instance can access all instances that are owned by the user. Therefore, if a single user owns all of the instances, then only one bearer token is required.

   If the service has multiple instances, each owned by a different user, then you must gather bearer tokens for each user who owns an instance. You can list multiple bearer token values in the `accessTokens` section.

2. To get the host information, you need details for the pod that hosts the UI component:

   ```
   oc describe pod -l component=ui
   ```

   Look for the section that says `RESOURCE_CONTROLLER_URL: https://${release-name}-addon-assistant-gateway-svc.zen:5000/api/ibmcloud/resource-controller`.

   For example, you can use a command like this to find it:

   ```
   oc describe pod -l component=ui | grep RESOURCE_CONTROLLER_URL
   ```

   Copy the host that is specified in the `RESOURCE_CONTROLLER_URL`. The host value is the `RESOURCE_CONTROLLER_URL` value, excluding the protocol at the beginning and everything from the port to the end of the value. For the previous example, the host is `${release-name}-addon-assistant-gateway-svc.zen`.

3. To get the port information, again check the `RESOURCE_CONTROLLER_URL` entry. The port is specified after `<host>:` in the URL. In this sample URL, the port is `5000`.

4. Paste the values that you discovered into the YAML file and save it.
Creating the postgres.yaml file
The postgres.yaml file contains details about the PostgreSQL pods in your target environment (the environment where you restore the data). Add the following information to the file:
```
host: localhost
port: 5432
database: store
username: user
su_username: admin
su_password: password
```
To add the values that are required but currently missing from the file, complete the following steps:
1. For version 4.8.6 or 5.0.1 and later, to get information about the `host`, you must get the Store datastore connection strings secret:

   ```
   oc get secret ${INSTANCE}-store-datastore-connection-strings -o jsonpath='{.data.store_vcap_services}' | base64 -d
   ```

   For version 5.0.0 or 4.8.5 and earlier, to get information about the `host`, you must get the Store VCAP secret:

   ```
   oc get secret ${INSTANCE}-store-vcap -o jsonpath='{.data.vcap_services}' | base64 -d
   ```

   The `get` command returns information about the Redis and PostgreSQL databases. Look for the segment of JSON code for the PostgreSQL database, named `pgservice`. It looks like this:

   ```
   {
     "user-provided":[
       {
         "name": "pgservice",
         "label": "user-provided",
         "credentials": {
           "host": "${INSTANCE}-rw",
           "port": 5432,
           "database": "conversation_pprd_${INSTANCE}",
           "username": "${dbadmin}",
           "password": "${password}"
         }
       }
     ],
   }
   ```

2. Copy the values for the user-provided credentials (`host`, `port`, `database`, `username`, and `password`).

   You can specify the same values that were returned for `username` and `password` as the `su_username` and `su_password` values.

   The updated file looks something like this:

   Only for version 5.1.0 or greater:

   ```
   host: wa_inst-postgres-16-rw
   port: 5432
   database: conversation_pprd_wa_inst
   username: dbadmin
   su_username: dbadmin
   su_password: mypassword
   ```

   For versions below 5.1.0:

   ```
   host: wa_inst-postgres-rw
   port: 5432
   database: conversation_pprd_wa_inst
   username: dbadmin
   su_username: dbadmin
   su_password: mypassword
   ```

3. Save the `postgres.yaml` file.
PostgreSQL migration tool details
The following table lists the arguments that are supported by the `pgmig` tool:

| Argument | Description |
|---|---|
| -h, --help | Command usage |
| -f, --force | Erase data if present in the target Store |
| -s, --source string | Backup file name |
| -r, --resourceController string | Resource Controller configuration file name |
| -t, --target string | Target PostgreSQL server configuration file name |
| -m, --mapping string | Service instance-mapping configuration file name (optional) |
| --testRCConnection | Test the connection for Resource Controller, then exit |
| --testPGConnection | Test the connection for PostgreSQL server, then exit |
| -v, --version | Get build version |
The mapping configuration file
After you run the script and specify the mappings when prompted, the tool generates a file that is named `enteredMapping.yaml` in the current directory. This file reflects the mapping of the old cluster details to the new cluster based on the interactive inputs that were provided while the script was running.
For example, the YAML file contains values like this:
```
instance-mappings:
  00000000-0000-0000-0000-001570184978: 00000000-0000-0000-0000-001570194490
```
Where the first value (`00000000-0000-0000-0000-001570184978`) is the instance ID in the database backup, and the second value (`00000000-0000-0000-0000-001570194490`) is the ID of a provisioned instance in the service on the system.
You can pass this file to the script for subsequent runs in the same environment. Or you can edit it for use in other backup and restore operations. The mapping file is optional. If it is not provided, the tool prompts you for the mapping details based on information that you provide in the YAML files.
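For example, a rerun that first verifies connectivity and then reuses the generated mapping file might look like the following sketch; the exact flag combinations you need can vary, and the dump file name is a placeholder.

```sh
# Sketch: test connections, then restore with a previously generated mapping file.
./pgmig --resourceController resourceController.yaml --testRCConnection
./pgmig --target postgres.yaml --testPGConnection
./pgmig --resourceController resourceController.yaml --target postgres.yaml \
        --source <backup-file-name.dump> --mapping enteredMapping.yaml
```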
Retraining your backend model
Depending on the number of models in your assistant, you can use one of the following options to retrain your backend model:
Retrain your backend model manually
When you open a dialog skill after a change in the training data, training is initiated automatically. Give the skill time to retrain on the restored data; training usually takes less than 10 minutes. The process of training a machine learning model requires at least one node to have 4 CPUs that can be dedicated to training. Therefore, open restored assistants and skills during low-traffic periods and open them one at a time. If the assistant or dialog skill does not respond, modify the workspace (for example, add an intent and then remove it), and then check that training completes.
Auto-retrain your backend model
When you have a large number of models to retrain, you can use the auto-retrain-all job to train the backend model. To learn more about the auto-retrain-all job and its implementation, refer to the following topics:
Before you begin
Before you begin the auto-retrain-all job, you must ensure that the PostgreSQL database and Cloud Object Storage, which store your action and dialog skills along with their snapshots, are active and not corrupted. In addition, you must ensure that your assistants do not receive or send any data during the auto-retrain-all job.
Planning
To get a good estimate of the duration that is required to complete the auto-retrain-all job, you can use the `calculate_autoretrain_all_job_duration.sh` script:

```sh
#!/bin/bash

calculate_duration() {
  local input_variable="$1"
  DURATION=$(("$NUM_OF_WORKSPACES_TO_TRAIN"*60 / (input_variable * 2) + "$NUM_OF_WORKSPACES_TO_TRAIN" * 2))
}

NUM_OF_WORKSPACES_TO_TRAIN=$(oc exec wa-etcd-0 -n cpd -- bash -c '
password="$( cat /var/run/credentials/pass.key )"
etcdctl_user="root:$password"
export ETCDCTL_USER="$etcdctl_user"
ETCDCTL_API=3 etcdctl --cert=/etc/etcdtls/operator/etcd-tls/etcd-client.crt --key=/etc/etcdtls/operator/etcd-tls/etcd-client.key --cacert=/etc/etcdtls/operator/etcd-tls/etcd-client-ca.crt --endpoints=https://$(hostname).${CLUSTER_NAME}.cpd.svc.cluster.local:2379 get --prefix /bluegoat/voyager-nlu/voyager-nlu-slot-wa/workspaces/ --keys-only | sed '/^$/d' | wc -l')

echo "Number of workspaces to train $NUM_OF_WORKSPACES_TO_TRAIN"

calculate_duration 5
DURATION_5=$DURATION
calculate_duration 10
DURATION_10=$DURATION
calculate_duration 15
DURATION_15=$DURATION

echo "Approximate duration of the auto retrain all job if you have 5 Training pods: $DURATION_5 seconds"
echo "Approximate duration of the auto retrain all job if you have 10 Training pods: $DURATION_10 seconds"
echo "Approximate duration of the auto retrain all job if you have 15 Training pods: $DURATION_15 seconds"
```
In addition, after you have the duration estimate, you can plan to speed up the auto-retrain-all job. For more information about speeding up the auto-retrain-all job, see the Speeding up the auto-retrain-all job topic.
Procedure
To retrain your backend model by using the auto-retrain-all job, complete the following steps:
- Set up the environment variables for the auto-retrain-all job
- Run the auto-retrain-all job
- Validate the auto-retrain-all job
Set up the environment variables for the auto-retrain-all job
Set up the following environment variables before you run the auto-retrain-all job:
1. Set the `AUTO_RETRAIN` environment variable to `false` to disable any existing auto-retrain job:

   ```
   export AUTO_RETRAIN="false"
   ```

2. To set up the `BATCH_RETRAIN_ALL_SIZE` environment variable, multiply the number of available training replicas, `CLU_TRAINING_REPLICAS`, by 2, based on the assumption that each model takes approximately 30 seconds to train. Use the following command to set up `BATCH_RETRAIN_ALL_SIZE` (a worked example with concrete numbers follows these steps):

   ```
   export BATCH_RETRAIN_ALL_SIZE=$(($(oc get deploy ${INSTANCE}-clu-training --template='{{index .spec.replicas}}') * 2))
   ```

3. Set `WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL` to `(60-${BATCH_RETRAIN_ALL_SIZE})`:

   ```
   export WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL=$((60-${BATCH_RETRAIN_ALL_SIZE}))
   ```

4. Set `WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL` to 1:

   ```
   export WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL=1
   ```

5. Set `AUTO_RETRAIN_ALL_CRON_SCHEDULE` to the time that you want to run the auto-retrain-all job:

   ```
   export AUTO_RETRAIN_ALL_CRON_SCHEDULE=<value of cron schedule>
   ```

   For example, you can give a value such as `"0 40 19 11 3 ? 2024"`, which is in the following format:

   ```
   (Seconds) (Minutes) (Hours) (Day of Month) (Month) (Day of Week) (Year)
   ```

   You must set the time in the UTC time zone.

6. Set `AUTO_RETRAIN_ALL_ENABLED` to true:

   ```
   export AUTO_RETRAIN_ALL_ENABLED="true"
   ```
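As a worked example, assuming the `${INSTANCE}-clu-training` deployment currently has 5 replicas (a hypothetical value), the settings above resolve to:

```sh
# Worked example for 5 training replicas (hypothetical).
export BATCH_RETRAIN_ALL_SIZE=10                                      # 5 * 2
export WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL=50  # 60 - 10
export WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL=1
```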
Run the auto-retrain-all job
1. To run the auto-retrain-all job, use the following command:

   ```
   export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
   export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`

   cat <<EOF | oc apply -f -
   apiVersion: assistant.watson.ibm.com/v1
   kind: TemporaryPatch
   metadata:
     name: ${INSTANCE}-store-admin-env-vars
     namespace: ${PROJECT_CPD_INST_OPERANDS}
   spec:
     apiVersion: assistant.watson.ibm.com/v1
     kind: WatsonAssistantStore
     name: ${INSTANCE}
     patchType: patchStrategicMerge
     patch:
       store-admin:
         deployment:
           spec:
             template:
               spec:
                 containers:
                   - name: store-admin
                     env:
                       - name: AUTO_RETRAIN
                         value: "${AUTO_RETRAIN}"
                       - name: AUTO_RETRAIN_ALL_CRON_SCHEDULE
                         value: "${AUTO_RETRAIN_ALL_CRON_SCHEDULE}"
                       - name: AUTO_RETRAIN_ALL_ENABLED
                         value: "${AUTO_RETRAIN_ALL_ENABLED}"
                       - name: BATCH_RETRAIN_ALL_SIZE
                         value: "${BATCH_RETRAIN_ALL_SIZE}"
                       - name: WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL
                         value: "${WAIT_TIME_BETWEEN_BATCH_RETRAIN_IN_SECONDS_FOR_RETRAIN_ALL}"
                       - name: WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL
                         value: "${WAIT_TIME_BETWEEN_TRAININGS_FOR_RETRAIN_ALL}"
   EOF
   ```
2. After you complete the auto-retrain-all job, you must disable the auto-retrain-all flag and enable the auto-retrain flag by using the following commands:

   ```
   oc patch temporarypatch ${INSTANCE}-store-admin-env-vars -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
   oc delete temporarypatch ${INSTANCE}-store-admin-env-vars -n ${PROJECT_CPD_INST_OPERANDS}
   oc patch watsonassistantstore/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
   ```
Validate the auto-retrain-all job
You can validate the successful completion of the auto-retrain-all job by comparing the number of `Affected workspaces found` with the `Retrained Total` count in the store-admin service log. To get the number of `Affected workspaces found` and the `Retrained Total`, run the following command:

```
oc logs $(oc get pod -l component=store-admin --no-headers |awk '{print $1}') | grep "\[RETRAIN-ALL-SUMMARY\] Affected workspaces found"
```
If the auto-retrain-all job is successful, the `Retrained Total` count equals the number of `Affected workspaces found`. If the difference between the `Retrained Total` and `Affected workspaces found` counts is small, the auto-retrain-all job still completes successfully by training the remaining models in the background. However, if there is a big difference between `Retrained Total` and `Affected workspaces found`, you must look at the store-admin logs to analyze the issue and consider speeding up the auto-retrain-all job.
Speeding up the auto-retrain-all job
The duration to complete the auto-retrain-all job depends on the number of models to train. Therefore, to speed up the training process, you must scale the number of `CLU_TRAINING_REPLICAS` and its dependencies. For example, if you scale the number of `CLU_TRAINING_REPLICAS` to `x`, you must scale the number of dependent replicas per the following calculation:
- `TFMM_REPLICAS` to 0.5x
- `DRAGONFLY_CLU_MM_REPLICAS` to 0.3x
- `CLU_EMBEDDING_REPLICAS` to 0.2x
- `CLU_TRITON_SERVING_REPLICAS` to 0.2x
If a calculation results in a decimal number, round up to the next whole number. For example, if the calculated number of `TFMM_REPLICAS` is 2.4, round up the value to 3.
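For example, if you scale `CLU_TRAINING_REPLICAS` to 10 (a hypothetical value), the dependent replica counts work out as follows:

```sh
# Worked example for CLU_TRAINING_REPLICAS=10 (hypothetical values).
export CLU_TRAINING_REPLICAS=10
export TFMM_REPLICAS=5                # 0.5 x 10
export DRAGONFLY_CLU_MM_REPLICAS=3    # 0.3 x 10
export CLU_EMBEDDING_REPLICAS=2       # 0.2 x 10
export CLU_TRITON_SERVING_REPLICAS=2  # 0.2 x 10
```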
Use the following steps to scale the number of replicas:
1. Register the values of the number of replicas per your calculation:

   ```
   export CLU_TRAINING_REPLICAS=<value from calculation>
   export TFMM_REPLICAS=<value from calculation>
   export DRAGONFLY_CLU_MM_REPLICAS=<value from calculation>
   export CLU_EMBEDDING_REPLICAS=<value from calculation>
   export CLU_TRITON_SERVING_REPLICAS=<value from calculation>
   ```
2. Increase the number of `REPLICAS` by using the following commands:

   ```
   export PROJECT_CPD_INST_OPERANDS=<namespace where Cloud Pak for Data and Assistant is installed>
   export INSTANCE=`oc get wa -n ${PROJECT_CPD_INST_OPERANDS} |grep -v NAME| awk '{print $1}'`

   cat <<EOF | oc apply -f -
   apiVersion: assistant.watson.ibm.com/v1
   kind: TemporaryPatch
   metadata:
     name: ${INSTANCE}-clu-training-replicas
     namespace: ${PROJECT_CPD_INST_OPERANDS}
   spec:
     apiVersion: assistant.watson.ibm.com/v1
     kind: WatsonAssistantCluTraining
     name: $INSTANCE
     patchType: patchStrategicMerge
     patch:
       clu-training:
         deployment:
           training:
             spec:
               replicas: ${CLU_TRAINING_REPLICAS}
   EOF

   cat <<EOF | oc apply -f -
   apiVersion: assistant.watson.ibm.com/v1
   kind: TemporaryPatch
   metadata:
     name: ${INSTANCE}-clu-runtime-replicas
     namespace: ${PROJECT_CPD_INST_OPERANDS}
   spec:
     apiVersion: assistant.watson.ibm.com/v1
     kind: WatsonAssistantCluRuntime
     name: ${INSTANCE}
     patchType: patchStrategicMerge
     patch:
       tfmm:
         deployment:
           spec:
             replicas: ${TFMM_REPLICAS}
       dragonfly-clu-mm:
         deployment:
           spec:
             replicas: ${DRAGONFLY_CLU_MM_REPLICAS}
   EOF

   cat <<EOF | oc apply -f -
   apiVersion: assistant.watson.ibm.com/v1
   kind: TemporaryPatch
   metadata:
     name: ${INSTANCE}-clu-replicas
     namespace: ${PROJECT_CPD_INST_OPERANDS}
   spec:
     apiVersion: assistant.watson.ibm.com/v1
     kind: WatsonAssistantClu
     name: ${INSTANCE}
     patchType: patchStrategicMerge
     patch:
       clu-embedding:
         deployment:
           spec:
             replicas: ${CLU_EMBEDDING_REPLICAS}
       clu-triton-serving:
         deployment:
           spec:
             replicas: ${CLU_TRITON_SERVING_REPLICAS}
   EOF
   ```
3. After you complete the auto-retrain-all job, you must revert the number of `REPLICAS` to the original numbers:

   ```
   oc patch temporarypatch ${INSTANCE}-clu-training-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
   oc patch temporarypatch ${INSTANCE}-clu-runtime-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
   oc patch temporarypatch ${INSTANCE}-clu-replicas -p '{"metadata":{"finalizers":[]}}' --type=merge -n ${PROJECT_CPD_INST_OPERANDS}
   oc delete temporarypatch ${INSTANCE}-clu-training-replicas -n ${PROJECT_CPD_INST_OPERANDS}
   oc delete temporarypatch ${INSTANCE}-clu-runtime-replicas -n ${PROJECT_CPD_INST_OPERANDS}
   oc delete temporarypatch ${INSTANCE}-clu-replicas -n ${PROJECT_CPD_INST_OPERANDS}
   oc patch watsonassistantclutraining/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
   oc patch watsonassistantcluruntime/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
   oc patch watsonassistantclu/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oppy.ibm.com/temporary-patches\":null}}}" --type=merge
   oc patch watsonassistantclutraining/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge
   oc patch watsonassistantcluruntime/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge
   oc patch watsonassistantclu/${INSTANCE} -p "{\"metadata\":{\"annotations\":{\"oper8.org/temporary-patches\":null}}}" --type=merge
   ```