Tackling OpenShift cluster recovery
Running an OpenShift cluster in a home lab on budget-friendly, legacy hardware is a fantastic way to dive into enterprise-grade Kubernetes. But what happens when a power outage (or a hasty power shutdown, in my case) leaves your cluster in disarray? What happens if nodes are stuck in NotReady status and the Ingress VIP is unreachable? Non-graceful shutdowns happen. This blog post walks you through a recovery procedure based on real-world experience with an OpenShift cluster. Whether you’re dealing with an oc login command that can no longer reach the cluster or struggling to restore the Ingress VIP, this guide uses the kubeconfig file and a handful of practical steps to get your cluster back online.
Prerequisites for OpenShift cluster recovery
Before diving in, ensure you have the following:
- kubeconfig: The kubeconfig file from your cluster installation.
- SSH access: Access to a control plane node as the core user.
- oc CLI tool: Installed on your local machine and on the control plane node.
- Cluster details:
  - API VIP
  - Ingress VIP
  - Cluster name
  - Base domain
  - kubeadmin credentials
These prerequisites are critical for recovering the cluster and performing administrative tasks, especially when external access via the Ingress VIP is down.
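Before moving on, it only takes a moment to confirm the oc client works and the kubeconfig file is where you expect it (the path below is just an example; adjust it to wherever you saved yours):
oc version --client
ls -l ~/kubeconfig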
Step 1: SSH into a control plane node
Start by connecting to a control plane node to bypass external network issues:
ssh core@<hostname>
Why This Works: The core user on Red Hat CoreOS (RHCOS) has sudo privileges, allowing you to run oc commands directly on the node. This is a lifesaver when the Ingress VIP is unreachable, as it provides local access to the cluster’s API server.
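Once connected, a quick sanity check is to confirm the node itself came back cleanly; on RHCOS both the kubelet and CRI-O run as standard systemd units, so their status is a good first signal:
sudo systemctl status kubelet crio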
Step 2: Transfer the kubeconfig file
Copy the kubeconfig file from your local machine to the control plane node:
scp ~/kubeconfig core@<hostname>:/var/home/core/kubeconfig
Why This Works: The kubeconfig file contains kubeadmin credentials, enabling cluster-admin access to the API server.
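Because this file grants cluster-admin access, it is worth tightening its permissions on the node right after the copy (a small precaution, not strictly required for the recovery itself):
chmod 600 /var/home/core/kubeconfig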
Step 3: Configure the KUBECONFIG environment variable
Set the KUBECONFIG variable to point to the transferred kubeconfig file:
export KUBECONFIG=/var/home/core/kubeconfig
Why This Works: This ensures the oc tool uses the correct kubeconfig file, pointing to the API VIP with proper authentication tokens. It’s a simple but essential step for consistent cluster access.
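If you expect to reconnect to the node several times during the recovery, you can make the variable persistent for the core user instead of exporting it in every session (optional, and assumes the default bash login shell):
echo 'export KUBECONFIG=/var/home/core/kubeconfig' >> ~/.bashrc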
Step 4: Verify cluster access with node status
Check node health to confirm administrative access:
oc get nodes
Why This Works: This command validates your kubeconfig file and highlights nodes in NotReady status. NotReady nodes often result from pending Certificate Signing Requests (CSRs) or kubelet issues post-power failure.
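If a node stays NotReady, describing it usually points at the cause; the Conditions and Events sections will typically mention certificate or kubelet problems (replace the placeholder with the affected node’s name):
oc describe node <node_name>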
Step 5: Approve pending Certificate Signing Requests (CSRs)
Resolve NotReady nodes by approving pending CSRs:
oc get csr
Approve all pending CSRs:
oc get csr -o name | xargs oc adm certificate approve
Why This Works: After a power outage, kubelets may fail to authenticate due to expired or pending CSRs. Approving them allows nodes to reconnect to the API server. Re-run oc get csr after a few minutes to ensure no new requests are pending.
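If you would rather approve only the CSRs that are actually pending instead of piping every CSR name through the approver, a filtered variant like this works too (the go-template selects CSRs that have no status yet, i.e. the unapproved ones):
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve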
Step 6: Monitor node readiness
Watch nodes transition to Ready status:
watch oc get nodes
Why This Works: Node readiness confirms that kubelets are communicating with the API server, and operators like the Machine Config Operator have applied configurations. This also enables the Ingress Operator to bind the Ingress VIP.
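In parallel, watching the Machine Config Pools (mcp is the short name for machineconfigpools) gives a good signal that the nodes have settled; they should eventually report as updated and not degraded:
watch oc get mcp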
Step 7: Restore the Ingress VIP and the cluster login
Verify the Ingress VIP and test cluster login from your local machine:
nc -zv <node_ip_address> 443
Log in to the cluster:
oc login https://api.<cluster_name>.<domain_name>:6443 --username=<username> --password=<password>
Why This Works: A successful connection to the node on port 443 indicates that the Ingress Operator and Keepalived have restored the Ingress VIP and the router is serving traffic again. The oc login command uses OAuth, which relies on the Ingress VIP to reach the OAuth endpoint.
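You can also confirm that the router pods behind the Ingress VIP are running before attempting the login; they live in the openshift-ingress namespace:
oc -n openshift-ingress get pods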
Step 8: Confirm cluster health
Check the status of cluster operators:
oc get clusteroperators
Expected Outcome: All operators, including ingress, should show Available=True, Progressing=False, and Degraded=False.
Why This Works: This confirms the cluster is fully operational, with all nodes, operators, and the Ingress VIP functioning correctly. Your home lab OpenShift cluster is now back online!
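As a final spot check, you can filter the operator list down to anything that is not in the healthy Available=True / Progressing=False / Degraded=False state; after a clean recovery this should return nothing but the header line (a quick grep heuristic, not an official health check):
oc get clusteroperators | grep -v 'True.*False.*False'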
