Tackling OpenShift cluster recovery
Running an OpenShift cluster in a home lab on budget-friendly, legacy hardware is a fantastic way to dive into enterprise-grade Kubernetes. But what happens when a power outage (or a hasty power shutdown, in my case) leaves your cluster in disarray? What happens if nodes are stuck in NotReady status and the Ingress VIP is unreachable? Non-graceful shutdowns happen. This blog post walks you through a recovery procedure based on real-world experience with an OpenShift cluster. Whether you’re dealing with an oc login command that can no longer reach the cluster or struggling to restore the Ingress VIP, this guide uses the kubeconfig file and a handful of practical steps to get your cluster back online.
Prerequisites for OpenShift cluster recovery
Before diving in, ensure you have the following:
- kubeconfig: The kubeconfig file from your cluster installation.
- SSH access: Access to a control plane node as the core user.
- oc CLI tool: Installed on your local machine and on the control plane node.
- Cluster details:
  - API VIP
  - Ingress VIP
  - Cluster name
  - Base domain
  - kubeadmin credentials
These prerequisites are critical for recovering the cluster and performing administrative tasks, especially when external access via the Ingress VIP is down.
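Before moving on, it only takes a moment to confirm the oc client works and the kubeconfig file is where you expect it (the path below is just an example; adjust it to wherever you saved yours):
oc version --client
ls -l ~/kubeconfig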
Step 1: SSH into a control plane node
Start by connecting to a control plane node to bypass external network issues:
ssh core@<hostname>
Why This Works: The core user on Red Hat CoreOS (RHCOS) has sudo privileges, allowing you to run oc commands directly on the node. This is a lifesaver when the Ingress VIP is unreachable, as it provides local access to the cluster’s API server.
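Once connected, a quick sanity check is to confirm the node itself came back cleanly; on RHCOS both the kubelet and CRI-O run as standard systemd units, so their status is a good first signal:
sudo systemctl status kubelet crio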
Step 2: Transfer the kubeconfig file
Copy the kubeconfig file from your local machine to the control plane node:
scp ~/kubeconfig core@<hostname>:/var/home/core/kubeconfig
Why This Works: The kubeconfig file contains kubeadmin credentials, enabling cluster-admin access to the API server.
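Because this file grants cluster-admin access, it is worth tightening its permissions on the node right after the copy (a small precaution, not strictly required for the recovery itself):
chmod 600 /var/home/core/kubeconfig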
Step 3: Configure the KUBECONFIG environment variable
Set the KUBECONFIG variable to point to the transferred kubeconfig file:
export KUBECONFIG=/var/home/core/kubeconfig
Why This Works: This ensures the oc tool uses the correct kubeconfig file, pointing to the API VIP with proper authentication tokens. It’s a simple but essential step for consistent cluster access.
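If you expect to reconnect to the node several times during the recovery, you can make the variable persistent for the core user instead of exporting it in every session (optional, and assumes the default bash login shell):
echo 'export KUBECONFIG=/var/home/core/kubeconfig' >> ~/.bashrc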
Step 4: Verify cluster access with node status
Check node health to confirm administrative access:
oc get nodes
Why This Works: This command validates your kubeconfig file and highlights nodes in NotReady status. NotReady nodes often result from pending Certificate Signing Requests (CSRs) or kubelet issues post-power failure.
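If a node stays NotReady, describing it usually points at the cause; the Conditions and Events sections will typically mention certificate or kubelet problems (replace the placeholder with the affected node’s name):
oc describe node <node_name>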
Step 5: Approve pending Certificate Signing Requests (CSRs)
Resolve NotReady nodes by approving pending CSRs:
oc get csr
Approve all pending CSRs:
oc get csr -o name | xargs oc adm certificate approve
Why This Works: After a power outage, kubelets may fail to authenticate due to expired or pending CSRs. Approving them allows nodes to reconnect to the API server. Re-run oc get csr after a few minutes to ensure no new requests are pending.
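If you would rather approve only the CSRs that are actually pending instead of piping every CSR name through the approver, a filtered variant like this works too (the go-template selects CSRs that have no status yet, i.e. the unapproved ones):
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve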
Step 6: Monitor node readiness
Watch nodes transition to Ready status:
watch oc get nodes
Why This Works: Node readiness confirms that kubelets are communicating with the API server, and operators like the Machine Config Operator have applied configurations. This also enables the Ingress Operator to bind the Ingress VIP.
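In parallel, watching the Machine Config Pools (mcp is the short name for machineconfigpools) gives a good signal that the nodes have settled; they should eventually report as updated and not degraded:
watch oc get mcp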
Step 7: Restore the Ingress VIP and the cluster login
Verify the Ingress VIP and test cluster login from your local machine:
nc -zv <node_ip_address> 443
Log in to the cluster:
oc login https://api.<cluster_name>.<domain_name>:6443 --username=<username> --password=<password>
Why This Works: A successful connection to the node on port 443 indicates that the Ingress Operator and Keepalived have restored the Ingress VIP and the router is serving traffic again. The oc login command uses OAuth, which relies on the Ingress VIP to reach the OAuth endpoint.
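You can also confirm that the router pods behind the Ingress VIP are running before attempting the login; they live in the openshift-ingress namespace:
oc -n openshift-ingress get pods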
Step 8: Confirm cluster health
Check the status of cluster operators:
oc get clusteroperators
Expected Outcome: All operators, including ingress, should show Available=True, Progressing=False, and Degraded=False.
Why This Works: This confirms the cluster is fully operational, with all nodes, operators, and the Ingress VIP functioning correctly. Your home lab OpenShift cluster is now back online!
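As a final spot check, you can filter the operator list down to anything that is not in the healthy Available=True / Progressing=False / Degraded=False state; after a clean recovery this should return nothing but the header line (a quick grep heuristic, not an official health check):
oc get clusteroperators | grep -v 'True.*False.*False'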
