
Backing up and restoring data in Cloud Pak for Data

Use the following procedures to back up and restore data in your IBM Watson® Discovery for IBM Cloud Pak® for Data instance.

This information applies only to installed deployments of IBM Cloud Pak for Data.

You use the same set of backup and restore scripts to back up and restore data in any of the supported upgrade paths. The backup script stores the version number of the service with data to back up from the existing deployment. The restore script detects the version of the service that is installed on the new IBM Cloud Pak for Data deployment, and then follows the appropriate steps to restore data to the detected version.

The following table lists the upgrade paths that are supported by the scripts.

Supported upgrade paths
Version in use    Version that you can upgrade to
4.8.x             4.8.x
4.7.x             4.8.x
4.6.x             4.8.x
4.5.x             4.8.x
4.0.x             4.8.x (except 4.8.0)

If you use the IBM Cloud Pak for Data Red Hat OpenShift APIs for Data Protection (OADP) backup and restore utility to back up and restore an entire cluster offline, a few extra steps are required. For more information, see Using OADP to back up a cluster where Discovery is installed. For information about online OADP backup and restore, see Cloud Pak for Data online backup and restore.

If you are upgrading from 4.6.x to 4.8.x, a simpler way to complete the upgrade is described in the Watson Discovery upgrade documentation. Similarly, you can upgrade in place from one refresh to a later refresh of the same release, as described in the following topics:

  • Upgrading Watson Discovery from Version 4.8.x to a later 4.8 refresh
  • Upgrading Watson Discovery from Version 4.7.x to a later 4.7 refresh
  • Upgrading Watson Discovery from Version 4.6.x to a later 4.6 refresh
  • Upgrading Watson Discovery to the latest Version 4.5 refresh
  • Upgrading Watson Discovery to a newer 4.0 refresh

Process overview

At a high level, the process includes the following steps:

  1. Back up your Discovery data by using the backup script.
  2. Install the latest version of IBM Cloud Pak for Data.
  3. Install the latest version of the Discovery service on the cluster.
  4. Restore the backed-up Discovery data by using the restore script.

Backup and restore limitations

You cannot migrate the following data:

  • Dictionary suggestions models. These models are created when you build a dictionary. The dictionary is included in the backup, but the term suggestions model is not. Reprocess the migrated collections to enable dictionary term suggestions.
  • Curations. Curations are a beta feature, so you cannot back up, restore, or migrate them.

You can back up and restore some data by using the backup and restore scripts, but you must back up and restore other data manually. The following data must be backed up manually:

  • Local file system folders and documents that you can crawl by using the Local file system data source.

The following updates are made when your collections are restored:

  • Any collection that contains documents that were added by uploading data is automatically recrawled and reindexed when it is restored. These documents are assigned new document ID numbers in the restored collection.
  • Collections that were used in Content Mining projects are automatically recrawled and reindexed when they are restored. Only documents that were added by uploading data are assigned new document ID numbers in the restored collections.

Backup and restore methods

You can back up and restore your instance of Discovery manually or by using scripts.

You must have administrative access to the Discovery instance on your Discovery cluster (where the data to be backed up is stored) and administrative access to the new instance (where the data will be restored).

The backup and restore scripts complete many operations and can take a long time to run. To avoid interruptions, run the scripts with a tool such as nohup, which keeps a process running even if your terminal session times out or disconnects.
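
For example, you can run the backup script (described in the next section) under nohup. The backup and log file names in this sketch are placeholders:

    # Run the backup in the background; nohup keeps it running if the session ends.
    nohup ./all-backup-restore.sh backup -f my_discovery.backup > backup.log 2>&1 &
    # Follow the progress in the log file.
    tail -f backup.log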

Using the backup scripts

Because changes to the data stored in IBM Watson® Discovery during a backup can cause the backup to become corrupted and unusable, no in-flight requests are allowed during the backup period.

An in-flight request is any IBM Watson® Discovery action that processes data, including the following actions:

  • Source crawl (scheduled or unscheduled)
  • Ingesting documents
  • Training a trained query model

The amount of storage that is available on the node where you run the backup script must be at least 3 times as large as the largest backup file in the data store that you plan to back up. If your data store is large, consider using a persistent volume claim instead of relying on the node's ephemeral storage. For more information, see Configuring jobs to use PVC.
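
You can check how much space is available in your working directory before you start by running a standard disk-usage command:

    # Show the free space on the file system that contains the current directory.
    df -h .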

Complete the following steps to back up IBM Watson® Discovery data by using the backup scripts:

  1. Enter the following command to set the current namespace where your Discovery instance is deployed:

    oc project <namespace>
    
  2. Get the backup script from the GitHub repository.

    You need all of the files in the repository to complete a backup and restore. Follow the instructions in GitHub Help to clone or download a compressed file of the repository.
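
    For example, you can clone the repository with git. The placeholders stand for the repository URL and directory name that are given in the Discovery documentation:

    # Clone the scripts repository, then change to the directory that contains the scripts.
    git clone <repository-url>
    cd <repository-directory>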

  3. Make each script an executable file by running the following command:

    chmod +x <name-of-script>
    

    Replace <name-of-script> with the name of the script.
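
    For example, you can make all of the shell scripts in the current directory executable at once:

    # Mark every shell script in the current directory as executable.
    chmod +x *.sh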

  4. Run the all-backup-restore.sh script.

    ./all-backup-restore.sh backup [ -f backup_file_name ] [--pvc]
    

    The -f backup_file_name parameter is optional. The name watson_discovery_<timestamp>.backup is used if you don't specify a name.

    The --pvc parameter is optional. For more information about when to use it, see Configuring jobs to use PVC. By default, the backup and restore scripts create a tmp directory in the current directory that the script uses for extracting or compressing backup files.

    If you run into issues with the backup, rerun the backup command and include the --use-job parameter. This parameter instructs the backup script to use a Kubernetes job to back up Elasticsearch and MinIO in addition to Postgres, which uses a Kubernetes job by default. If the data in Elasticsearch and MinIO is large and ephemeral storage is insufficient, also include the --pvc option. The script then uses the persistent volume claim that you specify with the --pvc option, instead of emptyDir ephemeral storage, as the temporary working directory for the job.
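
    For example, the following command reruns the backup with a Kubernetes job; the backup file name is a placeholder:

    # Rerun the backup; --use-job moves the Elasticsearch and MinIO backup work
    # into a Kubernetes job instead of running it inside the service pods.
    ./all-backup-restore.sh backup -f watson_discovery.backup --use-job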

Extracting files from the backup archive file

The scripts generate an archive file that contains the backup files for the service's data stores, such as Postgres, Elasticsearch, and MinIO.

You can extract files from the archive file by running the following command:

    tar xvf <backup_file_name>
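
To check the file sizes without extracting everything, you can list the archive contents first:

    # List the files in the archive, including their sizes.
    tar tvf <backup_file_name>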

Configuring jobs to use PVC

The backup and restore process uses Kubernetes jobs. By default, the jobs use ephemeral volumes, which are temporary storage mounts on the pod that use the local storage of a node. In rare cases, the ephemeral storage is not large enough. You can optionally instruct the job to mount a persistent volume claim (PVC) on its pod and use it for storing the backup data. To do so, specify the --pvc option when you run the script. Otherwise, the scripts use the Kubernetes emptyDir volume.

In most cases, you don't need to use a persistent volume. If you choose to use one, the volume must be at least 3 times as large as the largest backup file in the data store. The size of a data store's backup file depends on usage. After you create a backup, you can extract files from the archive file to check the file sizes.

Also, you must have at least 2 times as much disk space available on the local system as the size of the data store, because the archive of the data is split and then recombined to prevent issues that might otherwise occur when you copy large files from the cluster node to the local system.
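
If you need to create a PVC for the jobs to use, the following sketch creates one with oc. The name and size are assumptions that you must adapt to your cluster, and the cluster's default storage class is used:

    # Create a PVC in the current namespace; pass its name to the --pvc option.
    cat <<EOF | oc apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: discovery-backup-pvc    # hypothetical name
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi            # at least 3 times the largest backup file
    EOF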

Mapping multitenant clusters

When you restore data that was backed up from a version earlier than 4.0.6 to any later release and the backed-up deployment had more than one instance of the service provisioned, an extra step is required. You must create a JSON file that maps the service instance IDs between the backed-up cluster and the cluster where the data is being restored.

This mapping step is not required if the instance IDs did not change between the backup and restore steps. For example, you can skip this step if you are restoring data to the same cluster where it was backed up or if you are restoring data to a brand new cluster that has no Discovery instances.

To create a mapping, complete the following steps:

  1. Extract the mapping template file from the backup archive file.

    tar xf <backup_file_name> tmp/instance_mapping.json -O > <mapping_file_name> 
    
  2. Make a list of the names and instance IDs of the service instances that are provisioned to the cluster where the data is being restored.

    The instance ID is part of the URL that is specified in the instance summary page. From the IBM Cloud Pak for Data web client main menu, expand Services, and then click Instances. Find your instance, and then click it to open its summary page. Scroll to the Access information section of the page, and look for the instance ID in the URL field.

    For example, https://<host_name>/wd/<namespace>-wd/instances/<instance_id>/api.

    Repeat this step to make a note of the instance ID for every instance that is provisioned.

  3. Edit the mapping file.

    Add the instance IDs for the destination service instances that you listed in the previous step. The following snippet is an example of a mapping file.

    {
      "instance_mappings": [
        {
          "display_name": "discovery-1",
          "source_instance_id": "1644822491506334",
          "dest_instance_id": "<new_instance_id>"
        },
        {
          "display_name": "discovery-2",
          "source_instance_id": "1644822552830325",
          "dest_instance_id": "<new_instance_id>"
        }
      ]
    }
    

When you run the restore script, include the optional --mapping parameter to apply this mapping file when the data is restored.
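
For example, assuming that the --mapping parameter accepts the path to the mapping file, the restore command might look like the following sketch:

    # Restore the backup and remap instance IDs by using the edited mapping file.
    ./all-backup-restore.sh restore -f watson_discovery.backup --mapping instance_mapping.json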

Backing up data manually

Manually back up data that is not backed up by using the scripts.

To manually back up your data from an instance of Discovery, complete the following steps:

  1. Enter the following command to log on to your Discovery cluster:

    oc login https://<OpenShift administrative console URL> \
    -u <cluster administrator username> -p <password>
    
  2. Enter the following command to switch to the proper namespace:

    oc project <discovery-install namespace>
    
  3. Find the name of a crawler pod by running the following command:

    oc get pods | grep crawler

  4. Copy the crawled local file system data from the crawler pod to your backup directory by running the following command:

    oc cp <crawler pod>:/mnt <path-to-backup-directory>
    
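
    For example, with a hypothetical crawler pod name, the command might look like the following sketch:

    # Copy the contents of /mnt from the crawler pod to a local backup directory.
    # Replace the pod name with the name that you found in the previous step.
    oc cp core-discovery-crawler-0:/mnt ./crawler-mnt-backup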

Using the restore scripts

If you are restoring data from a version earlier than 4.0.6 and you are restoring a multitenant cluster to a multitenant cluster, you must take an extra step before you begin. For more information, see Mapping multitenant clusters.

Complete the following steps to restore data in IBM Watson® Discovery by using the restore scripts:

  1. Enter the following command to set the current namespace where your Discovery instance is deployed:

    oc project <namespace>
    
  2. If you haven't already, get the restore script from the GitHub repository.

    You need all of the files in the repository to complete a backup and restore. Follow the instructions in GitHub Help to clone or download a compressed file of the repository.

  3. Make each script an executable file by running the following command:

    chmod +x <name-of-script>
    

    Replace <name-of-script> with the name of the script.

  4. Restore the data from the backup file on your local system to the new Discovery deployment by running the following command:

    ./all-backup-restore.sh restore -f backup_file_name [--pvc] [--mapping]
    

    The --pvc parameter is optional. For more information about when to use it, see Configuring jobs to use PVC.

    The --mapping parameter is optional. For more information about when to use it, see Mapping multitenant clusters.

    By default, the backup and restore scripts create a tmp directory in the current directory, which the script uses for extracting or compressing backup files. If you used the --use-job parameter when you backed up the data, specify it again when you restore the data. This parameter instructs the restore script to use a Kubernetes job to restore Elasticsearch and MinIO.

    The gateway, ingestion, orchestrator, Hadoop worker, and controller pods restart automatically.
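
    For example, you can keep a long-running restore alive with nohup, as with the backup. The file names in this sketch are placeholders:

    # Run the restore in the background and log its output.
    nohup ./all-backup-restore.sh restore -f watson_discovery.backup > restore.log 2>&1 &
    tail -f restore.log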

Restoring data manually

Manually restore data that cannot be restored by using the script.

To manually restore your data from an instance of Discovery, complete the following steps:

  1. Enter the following command to log on to your Discovery cluster:

    oc login https://<OpenShift administrative console URL> \
    -u <cluster administrator username> -p <password>
    
  2. Enter the following command to switch to the proper namespace:

    oc project <discovery-install namespace>
    
  3. Find the name of a crawler pod by running the following command:

    oc get pods | grep crawler

  4. Copy the backed-up data from your backup directory into the crawler pod by running the following command:

    oc cp <path-to-backup-directory> <crawler pod>:/mnt
    

Using OADP to back up a cluster where Discovery is installed

If you plan to back up and restore an entire IBM Cloud Pak for Data instance offline by using the IBM Cloud Pak for Data Red Hat OpenShift APIs for Data Protection (OADP) backup and restore utility, you must complete some additional steps in the correct order for the utility to work properly when Discovery is present. For more information, see Cloud Pak for Data offline backup and restore (OADP utility).

Backing up a cluster offline

To take an offline backup of a cluster, complete the following steps:

  1. Run the Discovery backup script.

  2. Use the OADP backup utility to back up the cluster.

Restoring a cluster offline

To restore a cluster offline, complete the following steps:

  1. Use the OADP backup utility to restore the cluster.

  2. Uninstall Discovery, and then install Discovery again on the restored cluster.

    The re-installation is required because the utility does not always reinstall Discovery correctly.

  3. Run the Discovery restore script to restore your data.