Why can't I SSH into my worker node?

This information applies to clusters on both Virtual Private Cloud and Classic infrastructure.

You can't access your worker node by using an SSH connection.

SSH by password is unavailable on the worker nodes.

To run actions on every worker node, use a Kubernetes DaemonSet, or use jobs for one-time actions.
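
For example, the following DaemonSet manifest is a minimal sketch of how to run an action on every worker node. The name, namespace, and command are placeholders, and the image is the IBM Cloud Container Registry image that is used in the debugging examples later in this topic; substitute the image and action that you need. Save the manifest to a file and apply it with oc apply -f <file>.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-task
      namespace: default
    spec:
      selector:
        matchLabels:
          name: node-task
      template:
        metadata:
          labels:
            name: node-task
        spec:
          tolerations:
          - operator: "Exists"
          containers:
          - name: node-task
            image: us.icr.io/armada-master/network-alpine:latest
            # $(NODE_NAME) is expanded by Kubernetes from the env var below.
            command: ["/bin/sh", "-c", "echo Running on $(NODE_NAME) && sleep 20d"]
            env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName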

To get host access to worker nodes for debugging and troubleshooting purposes, review the following options.

Debugging by using oc debug

Use the oc debug node command to deploy a pod with a privileged securityContext to a worker node that you want to troubleshoot.

The debug pod is deployed with an interactive shell so that you can access the worker node immediately after the pod is created. For more information about how the oc debug node command works, see this Red Hat blog post.

  1. Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.

    oc get nodes -o wide
    
  2. Create a debug pod that has host access. When the pod is created, the pod's interactive shell is automatically opened. If the oc debug node command fails, continue to option 2.

    oc debug node/<NODE_NAME>
    

    If the oc debug node/<NODE_NAME> command fails, there might be security group, ACL, or firewall rules that prevent the default container image from being pulled. Try the command again with the --image=us.icr.io/armada-master/network-alpine:latest option, which uses an image from IBM Cloud Container Registry that is accessible over the private network. For an example, see the sketch after this list.

  3. Run debug commands to help you gather information and troubleshoot issues. Commands that you might use to debug, such as tcpdump, curl, ip, ifconfig, nc, ping, and ps, are already available in the shell. You can also install other tools, such as mtr and conntrack, by running yum install <tool>.
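
    The following lines are a sketch of an example debug session. The target IP address and interface name are placeholders, and which checks are useful depends on the issue that you're troubleshooting.

    # If the default debug image can't be pulled, retry with the IBM Cloud Container Registry image (step 2).
    oc debug node/<NODE_NAME> --image=us.icr.io/armada-master/network-alpine:latest

    # Inside the debug shell, gather information about the worker node.
    ip addr
    ping -c 3 <TARGET_IP>
    tcpdump -i eth0 -c 10

    # With the default debug image, install additional tools with yum.
    yum install mtr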

Debugging by using kubectl exec

If you are unable to use the oc debug node command, you can create an Alpine pod with a privileged securityContext and use the kubectl exec command to run debug commands from the pod's interactive shell.

  1. Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.

    oc get nodes -o wide
    
  2. Export the name in an environment variable.

    export NODE=<NODE_NAME>
    
  3. Create a debug pod on the worker node. The Alpine-based container image here is used as an example. If the worker node doesn't have public network access, you can maintain a copy of the image for debugging in your own IBM Cloud Container Registry repository or build a customized image with other tools to fit your needs.

    kubectl apply -f - << EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: debug-${NODE}
      namespace: default
    spec:
      tolerations:
      - operator: "Exists"
      hostNetwork: true
      containers:
      - args: ["-c", "sleep 20d"]
        command: ["/bin/sh"]
        image: us.icr.io/armada-master/network-alpine:latest
        imagePullPolicy: Always
        name: debug
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /host
          name: host-volume
      volumes:
      - name: host-volume
        hostPath:
          path: /
      nodeSelector:
        kubernetes.io/hostname: ${NODE}
      restartPolicy: Never
    EOF
    
  4. Log in to the debug pod. The pod's interactive shell is automatically opened. If the kubectl exec command fails, continue to option 3.

    kubectl exec -it debug-${NODE} -- sh
    

    To get logs or other files from a worker node, use the oc cp command in the following format. The following example gets the /var/log/messages file from the host file system of the worker node.

    oc cp default/debug-${NODE}:/host/var/log/messages ./messages
    

    Get the following logs to look for issues on the worker node.

    /var/log/messages 
    /var/log/kubelet.log
    /var/log/crio.log
    /var/log/calico/cni/cni.log
    
  5. Run debug commands to help you gather information and troubleshoot issues. Commands that you might use to debug, such as dig, tcpdump, mtr, curl, ip, ifconfig, nc, ping, and ps, are already available in the shell. You can also install other tools, such as conntrack, by running apk add <tool>. For example, to add conntrack, run apk add conntrack-tools. See the example session after this list.

  6. Delete the host access pod that you created for debugging.

    kubectl delete pod debug-${NODE}
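
    The following lines sketch an example session inside the debug pod from step 4. The host file system is mounted at /host, so the worker node's log files are available under that path. The specific files and tools are examples.

    # List host logs through the /host mount and check recent kubelet messages.
    ls /host/var/log
    tail -n 50 /host/var/log/kubelet.log

    # Install an extra tool and inspect the connection tracking table.
    apk add conntrack-tools
    conntrack -L | head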
    

Debugging by enabling root SSH access on a worker node

If you are unable to use the oc debug node or kubectl exec commands, for example if the VPN connection between the cluster master and worker nodes is down, you can create a pod that enables root SSH access and copies a public SSH key to the worker node.

Allowing root SSH access is a security risk. Only allow SSH access when it is required and no other option is available to troubleshoot worker node issues. When you finish troubleshooting, be sure to follow the steps in the Cleaning up after debugging section to disable SSH access.

  1. Choose an existing public SSH key, or create a new one as shown in the following example.

    ssh-keygen -f /tmp/id_rsa_cluster_worker -t rsa -b 4096 -C temp-worker-ssh-key -P ''
    ls /tmp
    
    id_rsa_cluster_worker id_rsa_cluster_worker.pub
    
    cat /tmp/id_rsa_cluster_worker.pub
    
  2. Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.

    oc get nodes -o wide
    
  3. Create the following YAML file for a debug pod, and save the file as enable-ssh.yaml. Replace <NODE_NAME> with the worker node name and replace the example value for SSH_PUBLIC_KEY with your public SSH key. The Alpine-based container image here is used as an example. If the worker node doesn't have public network access, you can keep a copy of the image for debugging in your own IBM Cloud Container Registry repository or build a customized image with other tools to fit your needs.

    apiVersion: v1
    kind: Pod
    metadata:
      name: enable-ssh-<NODE_NAME>
      labels:
        name: enable-ssh
    spec:
      tolerations:
      - operator: "Exists"
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
      - image: us.icr.io/armada-master/network-alpine:latest
        env:
        - name: SSH_PUBLIC_KEY
          value: "<ssh-rsa AAA...ZZZ temp-worker-ssh-key>"
        args: ["-c", "echo $(SSH_PUBLIC_KEY) | tee -a /root/.ssh/authorized_keys && sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/g' /host/etc/ssh/sshd_config && sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/g' /host/etc/ssh/sshd_config.d/40-rhcos-defaults.conf || true && killall -1 sshd || yes n | ssh-keygen -f /host/etc/ssh/ssh_host_rsa_key -t rsa -b 4096 -C temp-server-ssh-key -P '' && while true; do sleep 86400; done"]
        command: ["/bin/sh"]
        name: enable-ssh
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /host
          name: host-volume
        - mountPath: /root/.ssh
          name: ssh-volume
      volumes:
      - name: host-volume
        hostPath:
          path: /
      - name: ssh-volume
        hostPath:
          path: /root/.ssh
      nodeSelector:
        kubernetes.io/hostname: <NODE_NAME>
      restartPolicy: Never
    
  4. Create the pod in your cluster. When this pod is created, your public key is added to the worker node and SSH is configured to allow root SSH login.

    oc apply -f enable-ssh.yaml
    
  5. Use the private or public network to access the worker node by using your SSH key.

SSH into the worker on the private network

Choose an existing server instance that has access to the same private network as the worker node, or create a new one. For VPC clusters, the virtual server instance must exist in the same VPC as the worker node.

For classic clusters, the device can access the worker node from any private VLAN if a Virtual Router Function (VRF) or VLAN spanning is enabled. Otherwise, the device must exist on the same private VLAN as the worker node.

  1. Copy the SSH private key that you created in step 1 from your local machine to this server instance.

    scp <SSH_private_key_location> <user@host>:~/.ssh/id_rsa_worker_private
    
  2. SSH into the server instance.

  3. Set the correct permissions for using the SSH private key that you copied.

    chmod 400 ~/.ssh/id_rsa_worker_private
    
  4. Use the private key to SSH into the worker node that you found in step 2.

    ssh -i ~/.ssh/id_rsa_worker_private root@<WORKER_PRIVATE_IP>
    

SSH into the worker node on the public network

Debug classic clusters that are connected to a public VLAN by logging in to your worker nodes.

  1. Install and configure the Calico CLI, and set the context for your cluster to run Calico commands.

  2. Create a Calico global network policy that is named ssh-open to allow inbound SSH traffic on port 22.

    calicoctl apply -f - <<EOF
    apiVersion: projectcalico.org/v3
    kind: GlobalNetworkPolicy
    metadata:
      name: ssh-open
    spec:
      selector: ibm.role == 'worker_public'
      ingress:
      - action: Allow
        protocol: TCP
        destination:
          ports:
          - 22
      order: 1500
    EOF
    
  3. Get the public IP address of your worker node.

    oc get nodes -o wide
    
  4. SSH into the worker node via its public IP address.

    ssh -i <SSH_private_key_location> root@<WORKER_PUBLIC_IP>
    
  5. Run debug commands to help you gather information and troubleshoot issues, such as ip, ifconfig, ping, ps, and curl. You can also install other tools that might not be installed by default, such as tcpdump or nc, by running yum install <tool>.
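
    For example, after you log in, you might run checks such as the following. The interface name is a placeholder; adjust the commands to the issue that you're investigating.

    # Check interfaces and key processes on the worker node.
    ip addr
    ps aux | grep -E 'kubelet|crio'

    # Install tcpdump and capture a short packet trace.
    yum install tcpdump
    tcpdump -i eth0 -c 20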

Cleaning up after debugging

After you finish debugging, clean up resources to disable SSH access.

  1. Delete the SSH enablement pod.

    oc delete pod enable-ssh-<NODE_NAME>
    
  2. If you accessed the worker node via the public network, delete the Calico policy so that port 22 is blocked again.

    calicoctl delete gnp ssh-open [-c <path_to_calicoctl_cfg>/calicoctl.cfg]
    
  3. Reload your classic worker node or replace your VPC worker node so that the original SSH configuration is used and the SSH key that you added is removed.
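
    For example, you can reload or replace the worker node with the IBM Cloud CLI. The cluster and worker values are placeholders; confirm the exact options for your CLI plug-in version.

    # Classic cluster: reload the worker node.
    ibmcloud oc worker reload --cluster <CLUSTER_NAME_OR_ID> --worker <WORKER_ID>

    # VPC cluster: replace the worker node.
    ibmcloud oc worker replace --cluster <CLUSTER_NAME_OR_ID> --worker <WORKER_ID>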