Troubleshooting known issues for IBM Cloud Pak for Data
Get help with solving issues that you might encounter with Watson Assistant on IBM Cloud Pak for Data.
4.5.x
Pod RESTARTS count stays at 0 after a 4.5.x upgrade even though a few assistant pods are restarting
- Problem: After you upgrade Watson Assistant, the pod RESTARTS count stays at 0 even though certain assistant pods are restarting.
- Cause: During the upgrade, custom resources that are owned by Watson Assistant for the certificates.certmanager.k8s.io CRD are deleted by a script that runs in the background. Sometimes this deletion script completes before the assistant operator is upgraded. In that case, the old assistant operator might re-create custom resources for the certificates.certmanager.k8s.io CRD. The leftover custom resources can cause the certificate manager to continuously regenerate some certificate secrets, which makes some assistant pods restart repeatedly.
- Solution: After you set the INSTANCE (normally wa) and PROJECT_CPD_INSTANCE environment variables, run the following script to delete the leftover custom resources for the certificates.certmanager.k8s.io CRD:
for i in `oc get certificates.certmanager.k8s.io -l icpdsupport/addOnId=assistant --namespace ${PROJECT_CPD_INSTANCE} | grep "${INSTANCE}-" | awk '{print $1}'`; do oc delete certificates.certmanager.k8s.io $i --namespace ${PROJECT_CPD_INSTANCE}; done
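To confirm that the cleanup worked, you can check that no leftover certificate custom resources remain for the instance and that the restart loop stops. A minimal check, assuming the same INSTANCE and PROJECT_CPD_INSTANCE variables are still set and that the assistant pods carry the icpdsupport/addOnId=assistant label that is used above:
# List any certificate CRs that still belong to the assistant instance;
# empty output means the cleanup succeeded.
oc get certificates.certmanager.k8s.io -l icpdsupport/addOnId=assistant --namespace ${PROJECT_CPD_INSTANCE} | grep "${INSTANCE}-"
# Watch the assistant pods; the RESTARTS column should stop increasing.
oc get pods -l icpdsupport/addOnId=assistant --namespace ${PROJECT_CPD_INSTANCE} -w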
4.5.0
Data Governor not healthy after installation
After Watson Assistant is installed, the dataexhausttenant custom resource that is named wa-data-governor-ibm-data-governor-data-exhaust-internal gets stuck in the Topics phase. When this happens, errors in the Data Governor pods report that the service does not exist.
- Get the status of the wa-data-governor custom resource:
oc get DataExhaust
- Wait for the wa-data-governor custom resource to be in the Completed phase:
NAME               STATUS      VERSION   COMPLETED
wa-data-governor   Completed   master    1s
- Pause the reconciliation of the wa-data-governor custom resource:
oc patch dataexhaust wa-data-governor -p '{"metadata":{"annotations":{"pause-reconciliation":"true"}}}' --type merge
- Apply the fix to the dataexhausttenant custom resource:
oc patch dataexhausttenant wa-data-governor-ibm-data-governor-data-exhaust-internal -p '{"spec":{"topics":{"data":{"replicas": 1}}}}' --type merge
- Wait for the Data Governor pods to stop failing. You can restart the admin pods to speed up this process.
- Resume the reconciliation of the wa-data-governor custom resource:
oc patch dataexhaust wa-data-governor --type=json -p='[{"op": "remove", "path": "/metadata/annotations/pause-reconciliation"}]'
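After reconciliation resumes, you can confirm that the custom resources report the Completed phase again and that the Data Governor pods stay up. A quick check, assuming the default resource names used in the steps above:
# Both custom resources should eventually report a Completed status.
oc get dataexhaust wa-data-governor
oc get dataexhausttenant wa-data-governor-ibm-data-governor-data-exhaust-internal
# The Data Governor pods should be Running without new restarts.
oc get pods | grep data-governor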
RabbitMQ gets stuck in a loop after several installation attempts
After an initial installation or upgrade failure and repeated retry attempts, the common services RabbitMQ operator pod can get into a CrashLoopBackOff state. For example, the log might include the following types of messages:
"error":"failed to upgrade release: post-upgrade hooks failed: warning:
Hook post-upgrade ibm-rabbitmq/templates/rabbitmq-backup-labeling-job.yaml
failed: jobs.batch "{%name}-ibm-rabbitmq-backup-label" already exists"
Resources for the ibm-rabbitmq-operator.v1.0.11 component must be removed before a new installation or upgrade is started. If too many attempts occur in succession, remaining resources can cause new installations to fail.
- Delete the RabbitMQ backup label job from the previous installation or upgrade attempt. Look for the name of the job in the logs, or use the lookup command after these steps. The name ends in -ibm-rabbitmq-backup-label:
oc delete job {%name}-ibm-rabbitmq-backup-label -n ${PROJECT_CPD_INSTANCE}
- Check that the pod returns a Ready state:
oc get pods -n ibm-common-services | grep ibm-rabbitmq
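If you cannot find the exact job name in the logs, you can also list the leftover backup-label jobs directly. A possible lookup, assuming the job from the failed attempt still exists in the instance project:
# The NAME column shows the full {%name}-ibm-rabbitmq-backup-label value to delete.
oc get jobs -n ${PROJECT_CPD_INSTANCE} | grep ibm-rabbitmq-backup-label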
Preparing to install a size large deployment
If you specify large for the watson_assistant_size option when you install Watson Assistant, the installation fails to complete successfully.
Before you install a size large deployment of Watson Assistant, apply the following fix. The fix uses wa as the name of the Watson Assistant instance and cpd as the namespace where Watson Assistant is installed. These values are set in environment variables. Before you run the command, update the INSTANCE variable with the name of your instance, and update the NAMESPACE variable with the namespace where your instance is installed:
INSTANCE=wa ; \
NAMESPACE=cpd ; \
#DRY_RUN="--dry-run=client --output=yaml" # To apply the changes, change to an empty string
DRY_RUN=""
cat <<EOF | tee wa-network-policy-base.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    oppy.ibm.com/internal-name: infra.networkpolicy
  labels:
    app: ${INSTANCE}-network-policy
    app.kubernetes.io/instance: ${INSTANCE}
    app.kubernetes.io/managed-by: Ansible
    app.kubernetes.io/name: watson-assistant
    component: network-policy
    icpdsupport/addOnId: assistant
    icpdsupport/app: ${INSTANCE}-network-policy
    icpdsupport/ignore-on-nd-backup: "true"
    icpdsupport/serviceInstanceId: inst-1
    service: conversation
    slot: ${INSTANCE}
    tenant: PRIVATE
    velero.io/exclude-from-backup: "true"
  name: ${INSTANCE}-network-policy
  namespace: ${NAMESPACE}
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          service: conversation
          slot: ${INSTANCE}
    - podSelector:
        matchLabels:
          slot: global
    - podSelector:
        matchLabels:
          component: watson-gateway
    - podSelector:
        matchLabels:
          component: dvt
    - podSelector:
        matchLabels:
          dwf_service: ${INSTANCE}-clu
          network-policy: allow-egress
    - podSelector:
        matchLabels:
          app: 0020-zen-base
    - namespaceSelector:
        matchLabels:
          ns: ${NAMESPACE}
      podSelector:
        matchLabels:
          app: 0020-zen-base
    - podSelector:
        matchLabels:
          component: ibm-nginx
    - namespaceSelector:
        matchLabels:
          ns: ${NAMESPACE}
      podSelector:
        matchLabels:
          component: ibm-nginx
    - namespaceSelector:
        matchLabels:
          assistant.watson.ibm.com/role: operator
      podSelector:
        matchLabels:
          release: assistant-operator
    - namespaceSelector:
        matchLabels:
          assistant.watson.ibm.com/role: operator
      podSelector:
        matchLabels:
          app: watson-assistant-operator
    - namespaceSelector:
        matchLabels:
          assistant.watson.ibm.com/role: operator
      podSelector:
        matchLabels:
          app.kubernetes.io/instance: watson-assistant-operator
    - namespaceSelector:
        matchLabels:
          assistant.watson.ibm.com/role: operator
      podSelector:
        matchLabels:
          app.kubernetes.io/instance: ibm-etcd-operator-release
    - namespaceSelector:
        matchLabels:
          assistant.watson.ibm.com/role: operator
      podSelector:
        matchLabels:
          app.kubernetes.io/instance: ibm-etcd-operator
  podSelector:
    matchLabels:
      service: conversation
      slot: ${INSTANCE}
  policyTypes:
  - Ingress
EOF
for MICROSERVICE in analytics clu-embedding clu-serving clu-training create-slot-job data-governor dialog dragonfly-clu-mm ed es-store etcd integrations master nlu recommends sireg-ubi-ja-tok-20160902 sireg-ubi-ko-tok-20181109 spellchecker-mm store store-admin store-cronjob store-sync store_db_creator store_db_schema_updater system-entities tfmm ui ${INSTANCE}-redis webhooks-connector
do
  # Change name and add component to selector
  # Apply to the cluster
  cat wa-network-policy-base.yaml | \
    oc patch --dry-run=client --output=yaml -f - --type=merge --patch "{
      \"metadata\": {\"name\": \"${INSTANCE}-network-policy-$( echo $MICROSERVICE | tr _ -)\"},
      \"spec\": {\"podSelector\":{\"matchLabels\":{\"component\": \"${MICROSERVICE}\"}}}
    }" |
    oc apply -f - ${DRY_RUN}
done
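To confirm that a policy was created for each microservice, you can list the network policies in the instance project. A quick check, assuming the same INSTANCE and NAMESPACE values that you set before running the script:
# One entry per microservice in the loop, for example wa-network-policy-store.
oc get networkpolicy -n ${NAMESPACE} | grep "${INSTANCE}-network-policy-"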
Fixing a size large installation
Apply this fix if you installed a size large deployment and your installation fails to complete successfully. In some cases, pods aren't able to communicate with other pods, and the Transmission Control Protocol (TCP) connections can't be established.
To confirm whether you are affected by this issue, run the following command:
oc logs --selector app=sdn --namespace openshift-sdn --container sdn | grep "Ignoring NetworkPolicy"
If you are affected, you see output similar to the following example:
W0624 12:58:21.407901 2480 networkpolicy.go:484] Ignoring NetworkPolicy cpd/wa-network-policy because it generates
If you encounter this error, apply the following fix to resolve the issue. The fix uses wa as the name of the Watson Assistant instance. This value is set in an environment variable. Before you run the command, update the INSTANCE variable with the name of your instance:
INSTANCE=wa ; \
#DRY_RUN="--dry-run=client --output=yaml" # To apply the changes, change to an empty string
DRY_RUN=""
for MICROSERVICE in analytics clu-embedding clu-serving clu-training create-slot-job data-governor dialog dragonfly-clu-mm ed es-store etcd integrations master nlu recommends sireg-ubi-ja-tok-20160902 sireg-ubi-ko-tok-20181109 spellchecker-mm store store-admin store-cronjob store-sync store_db_creator store_db_schema_updater system-entities tfmm ui ${INSTANCE}-redis webhooks-connector
do
  # Get original networking policy
  # Clean up metadata fields to get the resource applied by Watson Assistant
  # Change name and add component to selector
  # Apply to the cluster
  oc get networkpolicy $INSTANCE-network-policy --output yaml | \
    oc patch --dry-run=client --output=yaml -f - --type=json --patch='[
      {"op":"remove", "path":"/metadata/creationTimestamp"},
      {"op":"remove", "path":"/metadata/generation"},
      {"op":"remove", "path":"/metadata/resourceVersion"},
      {"op":"remove", "path":"/metadata/uid"},
      {"op":"remove", "path":"/metadata/annotations/kubectl.kubernetes.io~1last-applied-configuration"},
      {"op":"remove", "path":"/metadata/ownerReferences"}
    ]' | \
    oc patch --dry-run=client --output=yaml -f - --type=merge --patch "{
      \"metadata\": {\"name\": \"${INSTANCE}-network-policy-$( echo $MICROSERVICE | tr _ -)\"},
      \"spec\": {\"podSelector\":{\"matchLabels\":{\"component\": \"${MICROSERVICE}\"}}}
    }" |
    oc apply -f - ${DRY_RUN}
done
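After the per-microservice policies are applied, you can repeat the earlier log check to confirm that the SDN does not ignore the new policies. A possible re-check, assuming the same openshift-sdn project as above:
# No "Ignoring NetworkPolicy" warnings should appear for the new per-microservice policies.
oc logs --selector app=sdn --namespace openshift-sdn --container sdn | grep "Ignoring NetworkPolicy" | grep "${INSTANCE}-network-policy-"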
Unable to scale the size of Redis pods
When you scale the deployment size of Watson Assistant, the Redis pods do not scale correctly and don't match the new size of the deployment (small, medium, or large). This problem is a known issue in Redis operator v1.5.1.
When you scale the size of your deployment, you must delete the Redis custom resource. The Redis operator automatically re-creates the custom resource with the correct size and pods.
To delete and re-create the Redis custom resource:
- Source the environment variables from the script:
source ./cpd_vars.sh
If you don't have the script that defines the environment variables, see Setting up installation environment variables.
- Export the name of the Redis custom resource to an environment variable:
export REDIS_CR_NAME=`oc get redissentinels.redis.databases.cloud.ibm.com -l icpdsupport/addOnId=assistant -n ${PROJECT_CPD_INSTANCE} | grep -v NAME | awk '{print $1}'`
- Delete the Redis custom resource:
oc delete redissentinels.redis.databases.cloud.ibm.com ${REDIS_CR_NAME} -n ${PROJECT_CPD_INSTANCE}
It might take approximately 5 minutes for the custom resource to be re-created.
- Verify that Redis is running:
oc get redissentinels.redis.databases.cloud.ibm.com -n ${PROJECT_CPD_INSTANCE}
- Export the name of your instance as an environment variable:
export INSTANCE=`oc get wa -n ${PROJECT_CPD_INSTANCE} | grep -v NAME | awk '{print $1}'`
- Delete the Redis analytics secret:
oc delete secrets ${INSTANCE}-analytics-redis
- Delete the ${INSTANCE}-analytics deployment pods. One way to do this is shown after these steps.
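For the last step, one way to delete the analytics pods is by name. A sketch, assuming the pods are prefixed with the ${INSTANCE}-analytics deployment name; the deployment re-creates the pods automatically:
# Delete every pod whose name starts with the analytics deployment name.
oc delete -n ${PROJECT_CPD_INSTANCE} $(oc get pods -n ${PROJECT_CPD_INSTANCE} -o name | grep "${INSTANCE}-analytics")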
4.0.x
Data Governor error causes deployment failure
The following fix applies to 4.0.0 through 4.0.8. In some cases, the deployment is stuck and pods are not coming up because of an issue with the interaction between the Events operator and the Data Governor custom resource (CR).
Complete the following steps to determine whether you are impacted by this issue and, if necessary, apply the patch to resolve it:
- To determine whether you are impacted, run the following command to see whether the CR was applied successfully:
oc get dataexhaust wa-data-governor -n $OPERAND_NS -o yaml
If you do not receive any error, then you do not need to apply the patch. If you receive an error similar to the following example, complete the next step to apply the patch:
message: 'Failed to create object: b''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: replace operation does not apply: doc is missing path: /metadata/labels/icpdsupport/serviceInstanceId: missing value","reason":"InternalError","details":{"causes":[{"message":"replace operation does not apply: doc is missing path: /metadata/labels/icpdsupport/serviceInstanceId: missing value"}]},"code":500}\n'''
reason: Failed
status: "True"
type: Failure
ibmDataGovernorService: InProgress
- From your operand namespace, run the following command to apply the patch. In the command, wa is used as the name of the instance. Replace this value with the name of your instance:
cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-data-governor
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistant
  name: wa # Replace wa with the name of your Watson Assistant instance
  patch:
    data-governor:
      dataexhaust:
        spec:
          additionalLabels:
            icpdsupport/serviceInstanceId: inst-1
      kafkauser:
        metadata:
          labels:
            icpdsupport/serviceInstanceId: inst-1
  patchType: patchStrategicMerge
EOF
Wait about 15 minutes for the changes to take effect.
- Validate that the patch was applied successfully:
oc get dataexhaust wa-data-governor -n $OPERAND_NS -o yaml
The patch was applied successfully when the value of serviceInstanceId is inst-1:
spec:
  additionalLabels:
    icpdsupport/serviceInstanceId: inst-1
Security context constraint permission errors
The following fix applies to 4.0.0 through 4.0.5. If a cluster has a security context constraint (SCC) that takes precedence over the restricted SCC and has different permissions than the restricted SCC, then 4.0.0 through 4.0.5 installations might fail with permission errors. For example, the update-schema-store-db-job job reports errors similar to the following example:
oc logs wa-4.0.2-update-schema-store-db-job-bpsdr postgres-is-prepared
Waiting until postgres is running and responding
psql: error: could not read root certificate file "/tls/ca.crt": Permission denied
- The basic command to postgres failed (retry in 5 sec)
psql: error: could not read root certificate file "/tls/ca.crt": Permission denied
- The basic command to postgres failed (retry in 5 sec)
psql: error: could not read root certificate file "/tls/ca.crt": Permission denied
- The basic command to postgres failed (retry in 5 sec)
..
..
Other pods might have similar permission errors. If you look at the SCCs of the pods, you can see that they are not restricted. For example, if you run the oc describe pod wa-etcd-0 | grep scc command, you get output similar to the following example:
openshift.io/scc: fsgroup-scc
To fix this issue, raise the priority of the restricted SCC so that it takes precedence:
- Run the following command:
oc edit scc restricted
- Change the priority from null to 1.
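If you prefer a non-interactive change instead of oc edit, the same priority update can be made with a patch. A sketch, assuming your cluster policy allows modifying the restricted SCC directly:
# Set the priority of the restricted SCC to 1 (priority is a top-level SCC field).
oc patch scc restricted --type=merge -p '{"priority": 1}'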
Now, new pods default back to the expected restricted SCC. When you run the oc describe pod wa-etcd-0 | grep scc command, you get output similar to the following example:
openshift.io/scc: restricted
Unable to collect logs with a webhook
The following fix applies to all versions of Watson Assistant 4.0.x. If you're unable to collect logs with a webhook, it might be because you are using a webhook that connects to a server that is using a self-signed certificate. If so, complete the following steps to import the certificate into the keystore so that you can collect logs with a webhook:
- Log in to the cluster and run oc project cpd-instance, where cpd-instance is the namespace where the instance is located.
- Run the following command. In the command, replace INSTANCE_NAME with the name of your instance and replace CUSTOM_CERTIFICATE with your Base64-encoded custom certificate key:
INSTANCE="INSTANCE_NAME" # Replace INSTANCE_NAME with the name of the Watson Assistant instance
CERT="CUSTOM_CERTIFICATE" # Replace CUSTOM_CERTIFICATE with the custom certificate key
cat <<EOF | oc apply -f -
apiVersion: v1
data:
  ca_cert: ${CERT}
kind: Secret
metadata:
  name: ${INSTANCE}-custom-webhooks-cert
type: Opaque
---
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: ${INSTANCE}-add-custom-webhooks-cert
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistantStore
  name: ${INSTANCE}
  patchType: patchStrategicMerge
  patch:
    webhooks-connector:
      deployment:
        spec:
          template:
            spec:
              containers:
              - name: webhooks-connector
                env:
                - name: CERTIFICATES_IMPORT_LIST
                  value: /etc/secrets/kafka/ca.pem:kafka_ca,/etc/secrets/custom/ca.pem:custom_ca
                volumeMounts:
                - mountPath: /etc/secrets/custom
                  name: custom-cert
                  readOnly: true
              volumes:
              - name: custom-cert
                secret:
                  defaultMode: 420
                  items:
                  - key: ca_cert
                    path: ca.pem
                  secretName: ${INSTANCE}-custom-webhooks-cert
EOF
- Wait approximately 10 minutes for the wa-webhooks-connector pod to restart. This pod restarts automatically.
- After the pod restarts, check the logs by running the following command. In the command, replace XXXX with the suffix of the wa-webhooks-connector pod:
oc logs wa-webhooks-connector-XXXX # Replace XXXX with the suffix of the wa-webhooks-connector pod
After you run this command, you should see two lines similar to the following example at the beginning of the log:
Certificate was added to keystore
Certificate was added to keystore
When you see these two lines, the custom certificate was properly imported into the keystore.
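If you don't want to look up the pod suffix manually, you can resolve the full pod name first. A possible shortcut, assuming a single wa-webhooks-connector pod exists in the current project:
# Resolve the full pod name, then read its log.
WEBHOOKS_POD=$(oc get pods --output custom-columns=NAME:.metadata.name --no-headers | grep "webhooks-connector" | head -n 1)
oc logs ${WEBHOOKS_POD}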
4.0.5
Install Redis if foundational services version is higher than 3.14.1
If you are installing the Redis operator with an IBM Cloud Pak foundational services version higher than 3.14.1, the Redis operator might get stuck in Pending status. If you have an air-gapped cluster, complete the steps in the Air-gapped cluster section to resolve this issue. If you are using the IBM Entitled Registry, complete the steps in the IBM Entitled Registry section to resolve this issue.
Air-gapped cluster
- Check the status of the Redis operator:
oc get opreq common-service-redis -n ibm-common-services -o jsonpath='{.status.phase} {"\n"}'
- If the Redis operand request is stuck in Pending status, delete the operand request:
oc delete opreq watson-assistant-redis -n ibm-common-services
- Set up your environment to download the CASE packages.
  - Create the directories where you want to store the CASE packages:
mkdir -p $HOME/offline/cpd
mkdir -p $HOME/offline/cpfs
  - Set the following environment variables:
export CASE_REPO_PATH=https://github.com/IBM/cloud-pak/raw/master/repo/case
export OFFLINEDIR=$HOME/offline/cpd
export OFFLINEDIR_CPFS=$HOME/offline/cpfs
- Download the Redis operator and IBM Cloud Pak® for Data platform operator CASE packages:
cloudctl case save \
  --repo ${CASE_REPO_PATH} \
  --case ibm-cloud-databases-redis \
  --version 1.4.5 \
  --outputdir $OFFLINEDIR
cloudctl case save \
  --repo ${CASE_REPO_PATH} \
  --case ibm-cp-datacore \
  --version 2.0.10 \
  --outputdir ${OFFLINEDIR} \
  --no-dependency
- Create the Redis catalog source:
cloudctl case launch \
  --case ${OFFLINEDIR}/ibm-cloud-databases-redis-1.4.5.tgz \
  --inventory redisOperator \
  --action install-catalog \
  --namespace openshift-marketplace \
  --args "--registry icr.io --inputDir ${OFFLINEDIR} --recursive"
- Set the environment variables for your registry credentials:
export PRIVATE_REGISTRY_USER=username
export PRIVATE_REGISTRY_PASSWORD=password
export PRIVATE_REGISTRY={registry-info}
- Run the following command to store the credentials:
cloudctl case launch \
  --case ${OFFLINEDIR}/ibm-cp-datacore-2.0.10.tgz \
  --inventory cpdPlatformOperator \
  --action configure-creds-airgap \
  --args "--registry ${PRIVATE_REGISTRY} --user ${PRIVATE_REGISTRY_USER} --pass ${PRIVATE_REGISTRY_PASSWORD}"
- Mirror the images:
export USE_SKOPEO=true
cloudctl case launch \
  --case ${OFFLINEDIR}/ibm-cp-datacore-2.0.10.tgz \
  --inventory cpdPlatformOperator \
  --action mirror-images \
  --args "--registry ${PRIVATE_REGISTRY} --user ${PRIVATE_REGISTRY_USER} --pass ${PRIVATE_REGISTRY_PASSWORD} --inputDir ${OFFLINEDIR}"
- Create the Redis subscription.
  - Export the project that contains the IBM Cloud Pak® for Data operator:
export OPERATOR_NS=ibm-common-services|cpd-operators # Select the project that contains the Cloud Pak for Data operator
  - Create the subscription:
cat <<EOF | oc apply --namespace $OPERATOR_NS -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-cloud-databases-redis-operator
spec:
  name: ibm-cloud-databases-redis-operator
  source: ibm-cloud-databases-redis-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
IBM Entitled Registry
- Check the status of the Redis operator:
oc get opreq common-service-redis -n ibm-common-services -o jsonpath='{.status.phase} {"\n"}'
- If the Redis operand request is stuck in Pending status, delete the operand request:
oc delete opreq watson-assistant-redis -n ibm-common-services
- Create the Redis subscription.
  - Export the project that contains the IBM Cloud Pak® for Data operator:
export OPERATOR_NS=ibm-common-services|cpd-operators # Select the project that contains the Cloud Pak for Data operator
  - Create the subscription. Choose one of the following two subscriptions, depending on how you are using the IBM Entitled Registry:
    - IBM Entitled Registry from the ibm-operator-catalog:
cat <<EOF | oc apply --namespace $OPERATOR_NS -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-cloud-databases-redis-operator
spec:
  name: ibm-cloud-databases-redis-operator
  source: ibm-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
    - IBM Entitled Registry with catalog sources that pull specific versions of the images:
cat <<EOF | oc apply --namespace $OPERATOR_NS -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-cloud-databases-redis-operator
spec:
  name: ibm-cloud-databases-redis-operator
  source: ibm-cloud-databases-redis-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
- Validate that the operator was successfully created.
  - Run the following command to confirm that the subscription was applied:
oc get sub -n $OPERATOR_NS ibm-cloud-databases-redis-operator -o jsonpath='{.status.installedCSV} {"\n"}'
  - Run the following command to confirm that the operator is installed:
oc get pod -n $OPERATOR_NS -l app.kubernetes.io/name=ibm-cloud-databases-redis-operator \
  -o jsonpath='{.items[0].status.phase} {"\n"}'
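As an additional check, you can confirm that the Redis operator cluster service version (CSV) reaches the Succeeded phase. A quick check, assuming the subscription was created in $OPERATOR_NS as above:
# The PHASE column for the Redis operator CSV should show Succeeded.
oc get csv -n $OPERATOR_NS | grep ibm-cloud-databases-redis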
4.0.4
Integrations image problem on air-gapped installations
If your installation is air-gapped, your integrations image fails to properly start.
If the installation uses the IBM Entitled Registry to pull images, complete the steps in the IBM Entitled Registry section. If the installation uses a private Docker registry to pull images, complete the steps in the Private Docker registry section.
IBM Entitled Registry
If your installation uses the IBM Entitled Registry to pull images, complete the following steps to add an override entry to the CR:
- Get the name of your instance by running the following command:
oc get wa
- Edit and save the CR.
  - Run the following command to edit the CR. In the command, replace INSTANCE_NAME with the name of the instance:
oc edit wa INSTANCE_NAME
  - Edit the CR by adding the following lines:
appConfigOverrides:
  container_images:
    integrations:
      image: cp.icr.io/cp/watson-assistant/servicedesk-integration
      tag: 20220106-143142-0ea3fbf7-wa_icp_4.0.5-signed@sha256:7078fdba4ab0b69dbb93f47836fd9fcb7cfb12f103662fef0d9d1058d2553910
- Wait for the operator to pick up the change and start a new integrations pod. This might take up to 10 minutes.
- After the new integrations pod starts, the old pod terminates. When the new pod starts, the server starts locally and the log looks similar to the following example:
oc logs -f ${INTEGRATIONS_POD}
[2022-01-07T01:33:13.609] [OPTIMIZED] db.redis.RedisManager - Redis trying to connect. counter# 1
[2022-01-07T01:33:13.628] [OPTIMIZED] db.redis.RedisManager - Redis connected
[2022-01-07T01:33:13.629] [OPTIMIZED] db.redis.RedisManager - Redis is ready to serve!
[2022-01-07T01:33:14.614] [OPTIMIZED] Server - Server started at: https://localhost:9449
Private Docker registry
If your installation uses a private Docker registry to pull images, complete the following steps to download and push the new integrations image to your private Docker registry and add an override entry to the CR:
- Edit the CSV file to add the new integrations image.
  - Run the following command to open the CSV file:
vi $OFFLINEDIR/ibm-watson-assistant-4.0.4-images.csv
  - Add the following line to the CSV file immediately after the existing integrations image:
cp.icr.io,cp/watson-assistant/servicedesk-integration,20220106-143142-0ea3fbf7-wa_icp_4.0.5-signed,sha256:7078fdba4ab0b69dbb93f47836fd9fcb7cfb12f103662fef0d9d1058d2553910,IMAGE,linux,x86_64,"",0,CASE,"",ibm_wa_4_0_0;ibm_wa_4_0_2;ibm_wa_4_0_4;vLatest
- Mirror the image again by using the commands that you used to download and push all the images, for example:
cloudctl case launch \
  --case ${OFFLINEDIR}/ibm-cp-datacore-2.0.9.tgz \
  --inventory cpdPlatformOperator \
  --action configure-creds-airgap \
  --args "--registry cp.icr.io --user cp --pass $PRD_ENTITLED_REGISTRY_APIKEY --inputDir ${OFFLINEDIR}"
- Get the name of your instance by running the following command:
oc get wa
- Edit and save the CR.
  - Run the following command to edit the CR. In the command, replace INSTANCE_NAME with the name of the instance:
oc edit wa INSTANCE_NAME
  - Edit the CR by adding the following lines:
appConfigOverrides:
  container_images:
    integrations:
      image: cp.icr.io/cp/watson-assistant/servicedesk-integration
      tag: 20220106-143142-0ea3fbf7-wa_icp_4.0.5-signed@sha256:7078fdba4ab0b69dbb93f47836fd9fcb7cfb12f103662fef0d9d1058d2553910
- Wait for the operator to pick up the change and start a new integrations pod. This might take up to 10 minutes.
- After the new integrations pod starts, the old pod terminates. When the new pod starts, the server starts locally and the log looks similar to the following example:
oc logs -f ${INTEGRATIONS_POD}
[2022-01-07T01:33:13.609] [OPTIMIZED] db.redis.RedisManager - Redis trying to connect. counter# 1
[2022-01-07T01:33:13.628] [OPTIMIZED] db.redis.RedisManager - Redis connected
[2022-01-07T01:33:13.629] [OPTIMIZED] db.redis.RedisManager - Redis is ready to serve!
[2022-01-07T01:33:14.614] [OPTIMIZED] Server - Server started at: https://localhost:9449
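Whether you used the IBM Entitled Registry or a private Docker registry, you can also confirm that the new integrations pod runs the overridden image. A possible check, assuming INTEGRATIONS_POD is set to the name of the new pod as in the log example above:
# The image reference should end with the sha256 digest from the override entry.
oc get pod ${INTEGRATIONS_POD} -o jsonpath='{.spec.containers[*].image}{"\n"}'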
4.0.0
Install Watson Assistant 4.0.0 with EDB version 1.8
Complete this task only if you need a fresh installation of Watson Assistant 4.0.0. Do not complete this task on existing clusters with data. Completing this task on an existing cluster with data results in data loss.
If you upgraded to EDB version 1.8 and need a new installation of Watson Assistant 4.0.0, complete the following steps. In the following steps, wa is used as the name of the custom resource. Replace this value with the name of your custom resource:
- First, apply the following patch:
cat <<EOF | oc apply -f -
apiVersion: assistant.watson.ibm.com/v1
kind: TemporaryPatch
metadata:
  name: wa-postgres-180-hotfix
spec:
  apiVersion: assistant.watson.ibm.com/v1
  kind: WatsonAssistantStore
  name: wa # Specify the name of your custom resource
  patchType: patchStrategicMerge
  patch:
    postgres:
      postgres:
        spec:
          bootstrap:
            initdb:
              options:
              - "--encoding=UTF8"
              - "--locale=en_US.UTF-8"
              - "--auth-host=scram-sha-256"
EOF
- Run the following command to check that the temporary patch is applied to the WatsonAssistantStore custom resource. It might take up to 10 minutes for the patch to be applied after the store custom resource is updated:
oc get WatsonAssistantStore wa -o jsonpath='{.metadata.annotations.oppy\.ibm\.com/temporary-patches}' ; echo
If the patch is applied, the command returns output similar to the following example:
{"wa-postgres-180-hotfix": {"timestamp": "2021-09-23T15:48:12.071497", "api_version": "assistant.watson.ibm.com/v1"}}
- Run the following command to delete the Postgres instance:
oc delete clusters.postgresql.k8s.enterprisedb.io wa-postgres
- Wait 10 minutes. Then, run the following command to check the Postgres instance:
oc get pods | grep wa-postgres
When the Postgres instance is running properly, the command returns output similar to the following example:
wa-postgres-1   1/1   Running   0   37m
wa-postgres-2   1/1   Running   0   36m
wa-postgres-3   1/1   Running   0   36m
- Run the following command to check that the jobs that initialize the Postgres database completed successfully:
oc get jobs wa-create-slot-store-db-job wa-4.0.0-update-schema-store-db-job
When the jobs complete successfully, the command returns output similar to the following example:
NAME                                  COMPLETIONS   DURATION   AGE
wa-create-slot-store-db-job           1/1           3s         31m
wa-4.0.0-update-schema-store-db-job   1/1           13s        31m
If the jobs don't complete successfully, then they timed out and need to be re-created. Delete the jobs by running the oc delete jobs wa-create-slot-store-db-job wa-4.0.0-update-schema-store-db-job command. The jobs are re-created by the operator after 10 minutes.
For information about upgrading from Watson Assistant 4.0.0 to Watson Assistant 4.0.2, see Upgrading Watson Assistant to a newer 4.0 refresh.
1.5.0
Search skill not working because of custom certificate
The search skill, which is the integration with the IBM Watson® Discovery service, might not work if you configured a custom TLS certificate in IBM Cloud Pak® for Data. If the custom certificate is not signed by a well-known certificate authority (CA), the search skill does not work as expected. You might also see errors in the Try out section of the search skill.
Validate the issue
First, check the logs of the search skill pods to confirm whether this issue applies to you.
- Run the following command to list the search skill pods:
oc get pods -l component=skill-search
- Run the following command to check the logs for the following exception:
oc logs -l component=skill-search | grep "IBMCertPathBuilderException"
The error looks similar to the following example:
{"level":"ERROR","logger":"wa-skills","message":"Search skill exception","exception":[{"message":"com.ibm.jsse2.util.h: PKIX path building failed: com.ibm.security.cert.IBMCertPathBuilderException: unable to find valid certification path to requested target","name":"SSLHandshakeException"}]}
If you see this error, continue to follow the steps to apply the fix.
Apply the fix
To fix the search skill, you inject the CA that signed your TLS certificate into the Java truststore that is used by the search skill. The search skill pods are then able to validate your certificate and communicate with the IBM Watson® Discovery service.
- First, get your certificate. You might already have this certificate, but in these steps you can retrieve the certificate directly from the cluster.
  - Run the following command to check that the secret exists:
oc get secret external-tls-secret
  - Run the following command to retrieve the certificate chain from the secret:
oc get secret external-tls-secret --output jsonpath='{.data.cert\.crt}' | base64 -d | tee ingress_cert_chain.crt
- Extract the CA certificate. The ingress_cert_chain.crt file typically contains multiple certificates. The last certificate in the file is usually your CA certificate.
  - Copy the last certificate block, which begins with -----BEGIN CERTIFICATE----- and ends with -----END CERTIFICATE-----.
  - Save this certificate in the ingress_ca.crt file. When you save the ingress_ca.crt file, the -----BEGIN CERTIFICATE----- line must be the first line of the file, and the -----END CERTIFICATE----- line must be the last line of the file.
- Retrieve the truststore that is used by the search skill pods.
  - Run the following command to list the search skill pods:
oc get pods -l component=skill-search
  - Run the following command to set the SEARCH_SKILL_POD environment variable with the search skill pod name:
SEARCH_SKILL_POD="$(oc get pods -l component=skill-search --output custom-columns=NAME:.metadata.name --no-headers | head -n 1)"
  - Run the following command to see the selected pod:
echo "Selected search skill pod: ${SEARCH_SKILL_POD}"
  - Retrieve the truststore file. The cacerts file is the default truststore that is used by Java. It contains the list of the certificate authorities that Java trusts by default. Run the following command to copy the binary cacerts file from the pod into your current directory:
oc cp ${SEARCH_SKILL_POD}:/opt/ibm/java/jre/lib/security/cacerts cacerts
- Run the following command to inject the ingress_ca.crt file into the cacerts file:
keytool -import -trustcacerts -keystore cacerts -storepass changeit -alias customer_ca -file ingress_ca.crt
You can run the keytool -list -keystore cacerts -storepass changeit | grep customer_ca -A 1 command to check that your CA certificate is included in the cacerts file.
- Run the following command to create the configmap that contains the updated cacerts file:
oc create configmap watson-assistant-skill-cacerts --from-file=cacerts
Because the cacerts file is binary, the output of the oc describe configmap watson-assistant-skill-cacerts command shows an empty data section. To check whether the updated cacerts file is present in the configmap, run the oc get configmap watson-assistant-skill-cacerts --output yaml command.
- Override the cacerts file in the search skill pods. In this step, you configure the operator to override the cacerts file in the search skill pods with the updated cacerts file. In the following example file, the instance is called watson-assistant---wa. Replace this value with the name of your instance:
cat <<EOF | oc apply -f -
kind: TemporaryPatch
apiVersion: com.ibm.oppy/v1
metadata:
  name: watson-assistant---wa-skill-cert
spec:
  apiVersion: com.ibm.watson.watson-assistant/v1
  kind: WatsonAssistantSkillSearch
  name: "watson-assistant---wa" # Replace this with the name of your Watson Assistant instance
  patchType: patchStrategicMerge
  patch:
    "skill-search":
      deployment:
        spec:
          template:
            spec:
              volumes:
              - name: updated-cacerts
                configMap:
                  name: watson-assistant-skill-cacerts
                  defaultMode: 420
              containers:
              - name: skill-search
                volumeMounts:
                - name: updated-cacerts
                  mountPath: /opt/ibm/java/jre/lib/security/cacerts
                  subPath: cacerts
EOF
- Wait until new search skill pods are created. It might take up to 10 minutes before the updates take effect.
- Check that the search skill feature is working as expected.
Disable Horizontal Pod Autoscaling and set a maximum number of master pods
Horizontal Pod Autoscaling (HPA) is enabled automatically. As a result, the number of replicas changes dynamically in the range of 1 to 10 replicas. You can disable HPA if you want to limit the maximum number of master pods or if you're concerned about master pods being created and deleted too frequently.
- Disable HPA for the master microservice by running the following command. In these steps, substitute your instance name for the INSTANCE_NAME variable:
oc patch wa ${INSTANCE_NAME} --type='json' --patch='[{"op": "add", "path": "/appConfigOverrides/clu", "value":{"master":{"autoscaling":{"enabled":false}}}}]'
- Wait until the information propagates into the operator:
sleep 600
- Run the following command to remove HPA for the master microservice:
oc delete hpa ${INSTANCE_NAME}-master
- Wait for about 30 seconds:
sleep 30
- Scale down the master microservice to the number of replicas that you want. In the following example, the master microservice is scaled down to two replicas:
oc scale deploy ${INSTANCE_NAME}-master --replicas=2
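To confirm the change, you can check that the HPA is gone and that the master deployment stays at the replica count you set. A quick check, assuming the same INSTANCE_NAME variable:
# This command should now report that the HPA is not found.
oc get hpa ${INSTANCE_NAME}-master
# The deployment should report the replica count that you scaled to.
oc get deploy ${INSTANCE_NAME}-master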
cpd-watson-assistant-1.5.0-patch-1
cpd-watson-assistant-1.5.0-patch-1 is available for installations of version 1.5.0.
This patch includes:
- Configuration changes for FIPS compatibility
- A fix for a blank screen issue that occurred when selecting multiple items in the UI
- An update to resolve a localStorage issue when developing in Google Chrome
- A fix for a scrolling issue
Resizing the Redis statefulset memory and cpu values after applying patch 1 for 1.5.0
Here are steps to resize Redis statefulset memory and cpu values after applying cpd-watson-assistant-1.5.0-patch-1.
- Use oc get wa to see your instance name:
oc get wa
NAME                       VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   AGE
watson-assistant---wa-qa   1.5.0     True    Stable        False      Stable           18/18      18/18      11h
- Export your instance name as a variable that you can use in each step, for example:
export INSTANCENAME=watson-assistant---wa-qa
- Change the updateStrategy in both Redis statefulsets to type RollingUpdate:
oc patch statefulset c-$INSTANCENAME-redis-m -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
oc patch statefulset c-$INSTANCENAME-redis-s -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
- Update the Redis statefulsets with the resized cpu and memory values:
  - Member CPU:
oc patch statefulset c-$INSTANCENAME-redis-m --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/requests/cpu", "value":"50m"}]'
  - Member memory:
oc patch statefulset c-$INSTANCENAME-redis-m --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/requests/memory", "value":"256Mi"}]'
  - Sentinel CPU:
oc patch statefulset c-$INSTANCENAME-redis-s --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/requests/cpu", "value":"50m"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/requests/cpu", "value":"50m"}]'
  - Sentinel memory:
oc patch statefulset c-$INSTANCENAME-redis-s --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/limits/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/1/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/2/resources/requests/memory", "value":"256Mi"},{"op": "replace", "path": "/spec/template/spec/containers/3/resources/requests/memory", "value":"256Mi"}]'
- Confirm that the Redis member and sentinel pods have the new memory and cpu values, for example:
oc describe pod c-$INSTANCENAME-redis-m-0 | grep cpu
oc describe pod c-$INSTANCENAME-redis-m-0 | grep memory
oc describe pod c-$INSTANCENAME-redis-s-0 | grep cpu
oc describe pod c-$INSTANCENAME-redis-s-0 | grep memory
The results should look like these examples:
oc describe sts c-$INSTANCENAME-redis-m | grep cpu
{"m":{"db":{"limits":{"cpu":"4","memory":"256Mi"},"requests":{"cpu":"25m","memory":"256Mi"}},"mgmt":{"limits":{"cpu":"2","memory":"100Mi"}...
cpu: 4
cpu: 50m
cpu: 2
cpu: 50m
cpu: 2
cpu: 50m
cpu: 2
cpu: 50m
oc describe sts c-$INSTANCENAME-redis-m | grep memory
{"m":{"db":{"limits":{"cpu":"4","memory":"256Mi"},"requests":{"cpu":"25m","memory":"256Mi"}},"mgmt":{"limits":{"cpu":"2","memory":"100Mi"}...
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
oc describe pod c-$INSTANCENAME-redis-s-0 | grep cpu
{"m":{"db":{"limits":{"cpu":"4","memory":"256Mi"},"requests":{"cpu":"25m","memory":"256Mi"}},"mgmt":{"limits":{"cpu":"2","memory":"100Mi"}...
cpu: 2
cpu: 50m
cpu: 2
cpu: 50m
cpu: 2
cpu: 50m
cpu: 2
cpu: 50m
oc describe pod c-$INSTANCENAME-redis-s-0 | grep memory
{"m":{"db":{"limits":{"cpu":"4","memory":"256Mi"},"requests":{"cpu":"25m","memory":"256Mi"}},"mgmt":{"limits":{"cpu":"2","memory":"100Mi"}...
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
memory: 256Mi
- Change the updateStrategy in both Redis statefulsets back to type OnDelete:
oc patch statefulset c-$INSTANCENAME-redis-m -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
oc patch statefulset c-$INSTANCENAME-redis-s -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
Delete the pdb (poddisruptionbudgets) when changing the deployment from medium to small
Whenever the deployment size is changed from medium to small, a manual step is required to delete the poddisruptionbudgets that are created for medium instances.
Run the following command, replacing <instance-name> with the name of your CR instance and <namespace-name> with the name of the namespace where the instance resides:
oc delete pdb -l icpdsupport/addOnId=assistant,component!=etcd,ibmevents.ibm.com/kind!=Kafka,app.kubernetes.io/instance=<instance-name> -n <namespace-name>
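To verify that the pod disruption budgets were removed, you can list any that still carry the assistant add-on label. A quick check, assuming the same instance and namespace values; note that the etcd and Kafka budgets are excluded from the delete command and are expected to remain:
# Only the etcd and Kafka pod disruption budgets should still be listed for the instance.
oc get pdb -l icpdsupport/addOnId=assistant -n <namespace-name>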