How do I troubleshoot confidential containers?

Review the following diagnostic steps and common issues. Problems with confidential containers can be the result of a misconfiguration during setup.

To start troubleshooting, run the following commands to gather as much data as you can about your confidential containers.

  1. Gather information about the operator.

    oc get csv -n openshift-sandboxed-containers-operator
    
    oc describe csv -n openshift-sandboxed-containers-operator
    
    oc get all -n openshift-sandboxed-containers-operator
    
  2. Retrieve logs and events from all pods related to the DaemonSets.

    oc describe pod/osc-caa-ds-<random string> -n openshift-sandboxed-containers-operator
    
    oc logs pod/osc-caa-ds-<random string> -n openshift-sandboxed-containers-operator
    
    oc describe pod/osc-config-sync-install-<random string> -n openshift-sandboxed-containers-operator
    
    oc logs pod/osc-config-sync-install-<random string> -n openshift-sandboxed-containers-operator
    
    oc describe pod/osc-rpm-install-<random string> -n openshift-sandboxed-containers-operator
    
    oc logs pod/osc-rpm-install-<random string> -n openshift-sandboxed-containers-operator
    
  3. Gather information about the pods.

    a. Gather information about the controller manager.

    oc describe pod/controller-manager-<random string> -n openshift-sandboxed-containers-operator
    
    oc logs pod/controller-manager-<random string> -n openshift-sandboxed-containers-operator
    

    b. Gather logs and details for a specific pod.

    oc logs pod/<random string>
    
    oc describe pod/<random string>
    

    c. Gather information about the openshift-sandboxed-containers-operator-bundle.

    oc logs pod/trikprot-openshift-sandboxed-containers-operator-bundle-<version>
    
    oc describe pod/trikprot-openshift-sandboxed-containers-operator-bundle-<version>
    
  4. Gather information about the ConfigMaps, secrets, and custom resources.

    a. Gather information about the feature gates.

    oc get configmap/osc-feature-gates -n openshift-sandboxed-containers-operator -o yaml
    

    b. Gather information about the peer pods.

    oc get configmap/peer-pods-cm -n openshift-sandboxed-containers-operator -o yaml
    

    c. Gather information about the secrets.

    oc get secret/auth-json-secret -n openshift-sandboxed-containers-operator
    
    oc get secret/peer-pods-secret -n openshift-sandboxed-containers-operator
    

    d. Gather information about the KataConfig.

    oc get kataconfigs.kataconfiguration.openshift.io/kata-runtime-settings -n openshift-sandboxed-containers-operator -o yaml
    

    e. Gather information about the Custom Resource Definitions.

    oc get crd/peerpods.confidentialcontainers.org
    
    oc get crd/kataconfigs.kataconfiguration.openshift.io
    
  5. Check peer pods capacity and limits.

    a. Check the current peer pods limit across all worker nodes.

    oc get nodes -o json | jq -r '[.items[] | select (.status.allocatable["kata.peerpods.io/vm"] != null)| .status.allocatable["kata.peerpods.io/vm"] | tonumber] | add'
    

    b. Check the allocated resources on each worker node.

    for n in $(oc get nodes -o name); do
      echo "=== $n ==="
      oc describe "$n" | sed -n '/Allocated resources:/,/Events:/p'
    done
    

    c. Count the peer pods across all namespaces, that is, the pods that use the kata-remote runtime class.

    oc get pods -A -o json | jq '.items[] | select(.spec.runtimeClassName == "kata-remote") | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l
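
Because the pod name suffixes in steps 2 and 3 are generated, it can be convenient to loop over whatever pods exist in the operator namespace instead of copying each name by hand. The following is a minimal sketch, not an official collection tool. The function name `gather_osc_pod_data` and the default output directory `osc-debug` are illustrative assumptions.

```shell
# Sketch: save `oc describe` and `oc logs` output for every pod in the
# operator namespace. The function name and output directory are illustrative.
NS="openshift-sandboxed-containers-operator"

gather_osc_pod_data() {
  osc_out_dir="${1:-osc-debug}"
  mkdir -p "$osc_out_dir" || return 1
  # `oc get pods -o name` prints one "pod/<name>" entry per line.
  for osc_pod in $(oc get pods -n "$NS" -o name); do
    osc_name="${osc_pod#pod/}"
    oc describe "$osc_pod" -n "$NS" > "$osc_out_dir/$osc_name.describe.txt" 2>&1
    # --all-containers collects logs from every container in the pod.
    oc logs "$osc_pod" -n "$NS" --all-containers > "$osc_out_dir/$osc_name.log" 2>&1
  done
}
```

Run `gather_osc_pod_data ./osc-debug` while logged in to the cluster, then review the files that it writes for each pod.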
    

Common issues and resolutions

Insufficient kata.peerpods.io/vm error

You might see an error like the following when scheduling peer pods:

Warning FailedScheduling 0/30 nodes are available: 9 Insufficient kata.peerpods.io/vm. preemption: 0/30 nodes are available: 9 No preemption victims found for incoming pod.

This error indicates that you have reached the PEERPODS_LIMIT_PER_NODE limit on your worker nodes. The default limit is 10 peer pods per worker node.
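
Because the limit applies per node, a per-node breakdown of peer pods can show which nodes are full. The following sketch assumes that peer pods are the pods that use the `kata-remote` runtime class. The function name `peer_pods_per_node` is illustrative. Pods are counted regardless of phase once they are assigned to a node.

```shell
# Sketch: count pods that use the kata-remote runtime class on each node.
peer_pods_per_node() {
  oc get pods -A --no-headers \
    -o custom-columns='NODE:.spec.nodeName,RUNTIME:.spec.runtimeClassName' |
    # Skip pods that are not yet assigned to a node (shown as <none>).
    awk '$2 == "kata-remote" && $1 != "<none>" { count[$1]++ }
         END { for (node in count) print node, count[node] }' |
    sort
}
```

Each output line contains a node name followed by the number of peer pods on that node, which you can compare with the PEERPODS_LIMIT_PER_NODE value.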

To resolve this issue:

  1. Verify the current limit on each worker node. The output lists the allocatable and capacity values for kata.peerpods.io/vm.

    oc get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable["kata.peerpods.io/vm"], capacity: .status.capacity["kata.peerpods.io/vm"]}'
    
  2. Increase the PEERPODS_LIMIT_PER_NODE value in the peer-pods-cm ConfigMap. For more information, see Creating confidential containers.

    oc -n openshift-sandboxed-containers-operator patch cm peer-pods-cm \
      --type merge \
      -p '{"data":{"PEERPODS_LIMIT_PER_NODE":"24"}}'
    
  3. Restart the Cloud API Adapter DaemonSet.

    oc -n openshift-sandboxed-containers-operator rollout restart daemonset/osc-caa-ds
    
  4. Verify the new limit is applied.

    oc get nodes -o json | jq -r '[.items[] | select (.status.allocatable["kata.peerpods.io/vm"] != null)| .status.allocatable["kata.peerpods.io/vm"] | tonumber] | add'
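
If you script steps 3 and 4, you might want to wait for the restart to finish before you check the limit. The following sketch uses `oc rollout status` for that purpose. The function name `restart_caa_and_wait` and the default timeout of 5 minutes are illustrative assumptions.

```shell
# Sketch: restart the osc-caa-ds DaemonSet and wait for the rollout to finish.
NS="openshift-sandboxed-containers-operator"

restart_caa_and_wait() {
  caa_timeout="${1:-5m}"
  oc -n "$NS" rollout restart daemonset/osc-caa-ds || return 1
  # `rollout status` blocks until the rollout completes or the timeout expires.
  oc -n "$NS" rollout status daemonset/osc-caa-ds --timeout="$caa_timeout"
}
```

The function returns a nonzero exit code if the rollout does not complete within the timeout, so you can stop a script before the verification step.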
    

For more information about peer pods limits and capacity planning, see How many peer pods can I run per worker node?

IAM authentication error after upgrading to OSC Operator 1.12.1

After upgrading to OpenShift Sandboxed Containers Operator version 1.12.1, you might see an error similar to the following in the Cloud API Adapter (CAA) logs:

cloud-api-adaptor: cluster error with:
 Unauthorized
further details:
 {
    "StatusCode": 401,
    "Result": {
        "code": "A0007",
        "description": "You do not have the correct permissions to perform this action..."
    }
}

Version 1.12.1 introduced a new requirement to automatically fetch the cluster's security group from the IBM Cloud IKS cluster service API. When you use IBMCLOUD_IAM_PROFILE_ID for authentication (compute resource identity), the IAM profile might not have the necessary permissions to query the cluster service API.
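
To confirm which authentication setting is in use, you can list the key names in the peer-pods-secret secret without printing their values. The following sketch assumes that IBMCLOUD_IAM_PROFILE_ID or IBMCLOUD_API_KEY is stored as a key in that secret. The function names are illustrative. If both keys are present, the sketch reports the IAM profile.

```shell
NS="openshift-sandboxed-containers-operator"

# Print only the key names stored in peer-pods-secret, never the values.
peer_pods_secret_keys() {
  oc get secret peer-pods-secret -n "$NS" \
    -o go-template='{{range $key, $value := .data}}{{$key}}{{"\n"}}{{end}}'
}

# Report which authentication key is present (assumed key names, see above).
caa_auth_mode() {
  caa_keys=$(peer_pods_secret_keys) || return 1
  if printf '%s\n' "$caa_keys" | grep -qx 'IBMCLOUD_IAM_PROFILE_ID'; then
    echo "iam-profile"
  elif printf '%s\n' "$caa_keys" | grep -qx 'IBMCLOUD_API_KEY'; then
    echo "api-key"
  else
    echo "unknown"
  fi
}
```

If `caa_auth_mode` prints `iam-profile`, the options that follow are likely to apply to your cluster.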

Choose one of the following options.

Grant additional IAM permissions (recommended)

Update the IAM profile to include permissions for the IKS cluster service API, specifically the ability to call GetClusterTypeSecurityGroups(). Contact your IBM Cloud administrator to add the required permissions.

Explicitly set the security group ID

Configure the IBMCLOUD_VPC_SG_ID environment variable in the peer-pods-cm ConfigMap to bypass the automatic cluster security group lookup. Then restart the Cloud API Adapter DaemonSet.

  1. Patch the ConfigMap with your security group ID.

    oc -n openshift-sandboxed-containers-operator patch cm peer-pods-cm \
      --type merge \
      -p '{"data":{"IBMCLOUD_VPC_SG_ID":"<your-security-group-id>"}}'
    
  2. Restart the Cloud API Adapter DaemonSet.

    oc -n openshift-sandboxed-containers-operator rollout restart daemonset/osc-caa-ds
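
To confirm that the ConfigMap contains the value, you can read the key back. The following is a minimal sketch. The function name `check_sg_id_set` is illustrative.

```shell
NS="openshift-sandboxed-containers-operator"

# Sketch: print IBMCLOUD_VPC_SG_ID from peer-pods-cm, or fail if it is empty.
check_sg_id_set() {
  sg_id=$(oc get configmap peer-pods-cm -n "$NS" \
    -o jsonpath='{.data.IBMCLOUD_VPC_SG_ID}') || return 1
  if [ -n "$sg_id" ]; then
    echo "IBMCLOUD_VPC_SG_ID=$sg_id"
  else
    echo "IBMCLOUD_VPC_SG_ID is not set" >&2
    return 1
  fi
}
```

A nonzero exit code means that the key is missing or empty, in which case you can repeat the patch in step 1.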
    
Switch to API key authentication

Switch from IBMCLOUD_IAM_PROFILE_ID to IBMCLOUD_API_KEY authentication. API key authentication uses a service ID with explicit IAM policies, which you can scope to include the required cluster service permissions. Update the peer-pods-secret secret with your API key instead of the IAM profile ID.

For more information about the underlying change, see the upstream cloud-api-adaptor commit dde66055.

Insufficient CPU error

You might see an error like the following when scheduling peer pods:

Warning FailedScheduling 0/3 nodes are available: 3 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.

This error indicates that your worker nodes do not have enough CPU resources available. Each peer pod consumes approximately 250m CPU on the worker node for the Kubernetes pod construct, even though the actual workload runs in a separate virtual server instance (VSI).
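
As a rough planning aid, you can divide the CPU that is still unrequested on a node by the approximate cost of each peer pod. The following sketch only does that arithmetic. The function name `max_peer_pods_by_cpu` is illustrative, and the 250m default is the approximate figure from the previous paragraph, not a guaranteed value.

```shell
# Sketch: estimate how many peer pods fit into a node's unrequested CPU.
# Arguments: free CPU in millicores, optional per-pod cost (default 250).
max_peer_pods_by_cpu() {
  free_millicores="$1"
  per_pod_millicores="${2:-250}"
  # Integer division rounds down, so partial pods are not counted.
  echo $(( ${free_millicores} / ${per_pod_millicores} ))
}
```

For example, `max_peer_pods_by_cpu 3500` prints 14. If the result is lower than your PEERPODS_LIMIT_PER_NODE value, CPU is likely to run out before the peer pods limit is reached.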

To resolve this issue:

  1. Check the CPU allocation on your worker nodes.

    for n in $(oc get nodes -o name); do
      echo "=== $n ==="
      oc describe "$n" | sed -n '/Allocated resources:/,/Events:/p'
    done
    
  2. Choose one of the following options:

    • Add more worker nodes to your cluster
    • Use worker nodes with more vCPUs
    • Reduce the PEERPODS_LIMIT_PER_NODE value to match your worker node capacity
    • Remove other workloads from the worker nodes to free up CPU resources