Why can't I SSH into my worker node?
Applies to: Virtual Private Cloud and Classic infrastructure
You can't access your worker node by using an SSH connection.
SSH by password is unavailable on the worker nodes.
To run actions on every worker node, use a Kubernetes DaemonSet, or use jobs for one-time actions.
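For example, a DaemonSet along the following lines runs a command on every worker node. This is a minimal sketch: the name node-task and the echo command are placeholders for your own action, and the image shown is the same IBM Cloud Container Registry image that is used elsewhere in this topic.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-task        # placeholder name for this example
  namespace: default
spec:
  selector:
    matchLabels:
      app: node-task
  template:
    metadata:
      labels:
        app: node-task
    spec:
      tolerations:
      - operator: "Exists"    # tolerate taints so a pod is scheduled on every node
      containers:
      - name: node-task
        image: us.icr.io/armada-master/network-alpine:latest
        # Replace the echo with the action that you want to run on each node.
        command: ["/bin/sh", "-c", "echo Hello from this worker node && while true; do sleep 86400; done"]
Apply the file with kubectl apply -f, and delete the DaemonSet when the task is done.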
To get host access to worker nodes for debugging and troubleshooting purposes, review the following options.
Debugging by using oc debug
Use the oc debug node command to deploy a pod with a privileged securityContext to a worker node that you want to troubleshoot. The debug pod is deployed with an interactive shell so that you can access the worker node immediately after the pod is created. For more information about how the oc debug node command works, see this Red Hat blog post.
- Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.
oc get nodes -o wide
- Create a debug pod that has host access. When the pod is created, the pod's interactive shell is automatically opened. If the oc debug node command fails, continue to option 2.
oc debug node/<NODE_NAME>
If the oc debug node/<NODE_NAME> command fails, there might be security group, ACL, or firewall rules that prevent the default container image from being pulled. Try the command again with the --image=us.icr.io/armada-master/network-alpine:latest option, which uses an image from IBM Cloud Container Registry that is accessible over the private network.
oc debug node/<NODE_NAME> --image=us.icr.io/armada-master/network-alpine:latest
- Run debug commands to help you gather information and troubleshoot issues. Commands that you might use to debug, such as tcpdump, curl, ip, ifconfig, nc, ping, and ps, are already available in the shell. You can also install other tools, such as mtr and conntrack, by running yum install <tool>.
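For example, a debugging session in the oc debug shell might look like the following sketch. The interface name eth0 and the target IP address are placeholders; substitute values from your own environment.
# Review the worker's interfaces and routes
ip addr
ip route
# Check connectivity to another host (placeholder address)
ping -c 3 <TARGET_IP>
# Capture a few packets on an interface (eth0 is an example name)
tcpdump -i eth0 -c 20 port 443
# Install an extra tool if you need it
yum install mtr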
Debugging by using kubectl exec
If you are unable to use the oc debug node command, you can create an Alpine pod with a privileged securityContext and use the kubectl exec command to run debug commands from the pod's interactive shell.
- Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.
oc get nodes -o wide
- Export the name in an environment variable.
export NODE=<NODE_NAME>
- Create a debug pod on the worker node. The Docker Alpine image here is used as an example. If the worker node doesn't have public network access, you can maintain a copy of the image for debugging in your own ICR repository or build a customized image with other tools to fit your needs.
kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: debug-${NODE}
  namespace: default
spec:
  tolerations:
  - operator: "Exists"
  hostNetwork: true
  containers:
  - args: ["-c", "sleep 20d"]
    command: ["/bin/sh"]
    image: us.icr.io/armada-master/network-alpine:latest
    imagePullPolicy: Always
    name: debug
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /host
      name: host-volume
  volumes:
  - name: host-volume
    hostPath:
      path: /
  nodeSelector:
    kubernetes.io/hostname: ${NODE}
  restartPolicy: Never
EOF
- Log in to the debug pod. The pod's interactive shell is automatically opened. If the kubectl exec command fails, continue to option 3.
kubectl exec -it debug-${NODE} -- sh
To get logs or other files from a worker node, use the oc cp command in the following format. The following example gets the /var/log/messages file from the host file system of the worker node.
oc cp default/debug-${NODE}:/host/var/log/messages ./messages
Get the following logs to look for issues on the worker node.
/var/log/messages
/var/log/kubelet.log
/var/log/crio.log
/var/log/calico/cni/cni.log
- Run debug commands to help you gather information and troubleshoot issues. Commands that you might use to debug, such as dig, tcpdump, mtr, curl, ip, ifconfig, nc, ping, and ps, are already available in the shell. You can also install other tools, such as conntrack, by running apk add <tool>. For example, to add conntrack, run apk add conntrack-tools. For a sample session, see the sketch after these steps.
- Delete the host access pod that you created for debugging.
kubectl delete pod debug-${NODE}
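As a sketch, a session inside the Alpine debug pod from the earlier steps might look like the following. The host name and target address are placeholders, and the /host path comes from the volume mount in the pod spec from step 3.
# Check DNS resolution and outbound connectivity from the worker's network namespace
dig <HOSTNAME>
curl -v https://<HOSTNAME>
# Inspect worker node logs through the /host mount
tail -n 50 /host/var/log/kubelet.log
# Install an extra tool if you need it
apk add conntrack-tools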
Debugging by enabling root SSH access on a worker node
If you are unable to use the oc debug node or kubectl exec commands, such as if the VPN connection between the cluster master and worker nodes is down, you can create a pod that enables root SSH access and copies a public SSH key to the worker node for SSH access.
Allowing root SSH access is a security risk. Only allow SSH access when it is required and no other option is available to troubleshoot worker node issues. When you finish troubleshooting, be sure to follow the steps in the Cleaning up after debugging section to disable SSH access.
- Choose an existing public SSH key or create a new one.
ssh-keygen -f /tmp/id_rsa_cluster_worker -t rsa -b 4096 -C temp-worker-ssh-key -P ''
ls /tmp
id_rsa_cluster_worker id_rsa_cluster_worker.pub
cat /tmp/id_rsa_cluster_worker.pub
- Get the name of the worker node that you want to access. For CoreOS worker nodes, the name is the hostname of the worker. For all other worker nodes, the worker node name is the private IP address.
oc get nodes -o wide
- Create the following YAML file for a debug pod, and save the file as enable-ssh.yaml. Replace <NODE_NAME> with the worker node name, and replace the example value for SSH_PUBLIC_KEY with your public SSH key. The Docker Alpine image here is used as an example. If the worker node doesn't have public network access, you can keep a copy of the image for debugging in your own ICR repository or build a customized image with other tools to fit your needs.
apiVersion: v1
kind: Pod
metadata:
  name: enable-ssh-<NODE_NAME>
  labels:
    name: enable-ssh
spec:
  tolerations:
  - operator: "Exists"
  hostNetwork: true
  hostPID: true
  hostIPC: true
  containers:
  - image: us.icr.io/armada-master/network-alpine:latest
    env:
    - name: SSH_PUBLIC_KEY
      value: "<ssh-rsa AAA...ZZZ temp-worker-ssh-key>"
    args: ["-c", "echo $(SSH_PUBLIC_KEY) | tee -a /root/.ssh/authorized_keys && sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/g' /host/etc/ssh/sshd_config && sed -i 's/^#*PermitRootLogin.*/PermitRootLogin yes/g' /host/etc/ssh/sshd_config.d/40-rhcos-defaults.conf || true && killall -1 sshd || yes n | ssh-keygen -f /host/etc/ssh/ssh_host_rsa_key -t rsa -b 4096 -C temp-server-ssh-key -P '' && while true; do sleep 86400; done"]
    command: ["/bin/sh"]
    name: enable-ssh
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /host
      name: host-volume
    - mountPath: /root/.ssh
      name: ssh-volume
  volumes:
  - name: host-volume
    hostPath:
      path: /
  - name: ssh-volume
    hostPath:
      path: /root/.ssh
  nodeSelector:
    kubernetes.io/hostname: <NODE_NAME>
  restartPolicy: Never
- Create the pod in your cluster. When this pod is created, your public key is added to the worker node and SSH is configured to allow root SSH login.
oc apply -f enable-ssh.yaml
- Use the private or public network to access the worker node by using your SSH key.
SSH into the worker on the private network
- Create a new server instance, or choose an existing one, that has access to the same private network as the worker node. For VPC clusters, the virtual server instance must exist in the same VPC as the worker node. For classic clusters, the device can access the worker node from any private VLAN if a Virtual Router Function (VRF) or VLAN spanning is enabled. Otherwise, the device must exist on the same private VLAN as the worker node.
- Copy the SSH private key that you created in step 1 from your local machine to this server instance.
scp <SSH_private_key_location> <user@host>:/.ssh/id_rsa_worker_private
- SSH into the server instance.
- Set the correct permissions for using the SSH private key that you copied.
chmod 400 ~/.ssh/id_rsa_worker_private
- Use the private key to SSH into the worker node that you found in step 2.
ssh -i ~/.ssh/id_rsa_worker_private root@<WORKER_PRIVATE_IP>
SSH into the worker node on the public network
Debug classic clusters that are connected to a public VLAN by logging in to your worker nodes.
- Install and configure the Calico CLI, and set the context for your cluster to run Calico commands.
- Create a Calico global network policy that is named ssh-open to allow inbound SSH traffic on port 22.
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: ssh-open
spec:
  selector: ibm.role == 'worker_public'
  ingress:
  - action: Allow
    protocol: TCP
    destination:
      ports:
      - 22
  order: 1500
EOF
- Get the public IP address of your worker node.
oc get nodes -o wide
- SSH into the worker node via its public IP address.
ssh -i <SSH_private_key_location> root@<WORKER_PUBLIC_IP>
- Run debug commands to help you gather information and troubleshoot issues, such as ip, ifconfig, ping, ps, and curl. You can also install other tools that might not be installed by default, such as tcpdump or nc, by running yum install <tool>.
Cleaning up after debugging
After you finish debugging, clean up resources to disable SSH access.
- Delete the SSH enablement pod.
oc delete pod enable-ssh-<NODE_NAME>
- If you accessed the worker node via the public network, delete the Calico policy so that port 22 is blocked again.
calicoctl delete gnp ssh-open [-c <path_to_calicoctl_cfg>/calicoctl.cfg]
- Reload your classic worker node or replace your VPC worker node so that the original SSH configuration is used and the SSH key that you added is removed.
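For example, assuming that the IBM Cloud CLI and its Kubernetes Service plug-in are installed, commands along these lines reload or replace the worker node. Check the --help output for the exact flags in your CLI version.
# Classic clusters: reload the worker node
ibmcloud oc worker reload --cluster <CLUSTER_NAME_OR_ID> --worker <WORKER_ID>
# VPC clusters: replace the worker node
ibmcloud oc worker replace --cluster <CLUSTER_NAME_OR_ID> --worker <WORKER_ID>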