Maintenance Guide

This guide provides instructions for regular maintenance tasks necessary to ensure the smooth and secure operation of the system.

Evacuating Nodes for Maintenance

When you need to perform maintenance on a node, you will need to evacuate the node to ensure that no workloads are running on it. Depending on the type of node you are evacuating, you will need to use different commands.

Control Plane Node

To evacuate a control plane node, you will need to drain it. Draining marks the node as unschedulable and evicts the workloads running on it so that they are rescheduled on other nodes in the cluster. To drain a control plane node, run the following command against the node you want to drain:

$ kubectl drain <node-name> --ignore-daemonsets --delete-local-data

In the example above, you would replace <node-name> with the name of the node you want to drain (on newer kubectl releases, the --delete-local-data flag has been renamed to --delete-emptydir-data). Once this process is complete, you can safely perform maintenance on the node.
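
Before you start the maintenance, you can confirm that the node was cordoned and drained; a quick check, assuming kubectl is configured against the cluster:

$ kubectl get node <node-name>
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

The node's STATUS column should include SchedulingDisabled, and only DaemonSet-managed pods should remain on the node.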

When you are done with the maintenance, you can uncordon the node by running the following command:

$ kubectl uncordon <node-name>

Compute Node

In order to evacuate a compute node, you will need to start by disabling the OpenStack compute service on the node. This will prevent new workloads from being scheduled on the node. To disable the OpenStack compute service, run the following command against the node you want to evacuate:

$ openstack compute service set --disable <node-name> nova-compute

In the example above, you would replace <node-name> with the name of the node you want to evacuate. Once the OpenStack compute service has been disabled, you will need to evacuate all the virtual machines running on the node. To do this, run the following command:

$ nova host-evacuate-live <node-name>

This command will live migrate all the virtual machines running on the node to other nodes in the cluster.

Note

Using the legacy nova client is generally not recommended; however, the nova host-evacuate-live command is not available in the openstack client (see bug 2055552).

You can monitor the progress of this operation by checking whether any VMs remain on the node with the following command:

$ openstack server list --host <node-name>

Once all the virtual machines have been evacuated, you can safely perform maintenance on the node. When you are done with the maintenance, you can re-enable the OpenStack compute service by running the following command:

$ openstack compute service set --enable <node-name> nova-compute
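
If you want to confirm the change, you can list the compute service for that host; the Status column should show enabled again (it shows disabled while the node is being evacuated):

$ openstack compute service list --host <node-name> --service nova-compute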

Note

Once you re-enable the compute service, the node will start accepting new VMs, but the VMs that were migrated away will not be moved back automatically. If you want them to run on the node again, you will need to migrate them back manually.
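
If you do want to move a specific VM back, one option is to live migrate it to the node by name; a sketch, assuming a recent openstack client that supports the --live-migration and --host options, where <server-id> is the VM to move:

$ openstack server migrate --live-migration --host <node-name> <server-id>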

Renewing Certificates

The certificates used by the Kubernetes cluster are valid for one year. They are automatically renewed when the cluster is upgraded to a new version of Kubernetes. However, if you are running the same version of Kubernetes for more than a year, you will need to manually renew the certificates.
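
If you want to check how much validity remains before (or after) renewing, kubeadm can print the expiry date of each certificate; run this on a control plane node:

$ sudo kubeadm certs check-expiration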

To renew the certificates, run the following command on each one of your control plane nodes:

$ sudo kubeadm certs renew all

Once the certificates have been renewed, you will need to restart the Kubernetes control plane components so that they pick up the new certificates. Do this one control plane node at a time by running the following command on each node:

$ ps auxf | egrep '(kube-(apiserver|controller-manager|scheduler)|etcd)' | awk '{ print $2 }' | xargs sudo kill
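
To confirm that the restarted components are serving the renewed certificates, one option is to inspect the certificate presented by the API server; a quick check, assuming openssl is available and the API server listens on the default port 6443:

$ echo | openssl s_client -connect <node-name>:6443 2>/dev/null | openssl x509 -noout -enddate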

Changing Controller Network Addresses

Changing the IP address of an existing controller node is a complex operation that affects multiple critical components of the Atmosphere deployment. The recommended approach is to remove and redeploy the controller node rather than attempting an in-place IP change.

Components affected by address changes

When a controller node’s IP address changes, the following components are directly impacted:

  • etcd cluster: The etcd cluster stores its member list with specific IP addresses. Changing an IP requires careful reconfiguration of the cluster membership (see the example after this list for how to inspect the current member list).

  • Ceph monitors: If Ceph monitors are running on controller nodes, they maintain a monitor map with specific IP addresses that must remain consistent across the cluster.

  • Kubernetes API server: The API server advertises its address to other components, and its certificates may be tied to specific IP addresses.

  • Virtual IP: The virtual IP configuration depends on the underlying node IP addresses for proper fail-over behavior.

  • DNS resolution: The fully qualified domain names (FQDNs) in the inventory must resolve to the correct IP addresses.

  • Network policies and firewall rules: Any security policies that reference specific controller IP addresses require updates.
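
If you want to see where these addresses are currently recorded before planning any change, you can dump the etcd member list and the Ceph monitor map; read-only checks, assuming a kubeadm-style stacked etcd with its certificates under /etc/kubernetes/pki/etcd (etcdctl can also be run inside the etcd static pod) and the Ceph CLI available on a controller:

$ sudo etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list -w table
$ sudo ceph mon dump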

Alternative approach for multiple controllers

If you need to change IP addresses for multiple controllers, consider deploying new controllers with the correct IP addresses first, then removing the old ones. This approach maintains higher availability throughout the process:

  1. Deploy new controller nodes with the desired IP addresses

  2. Wait for them to fully join the cluster and synchronize (a quick health check is shown after this list)

  3. Remove the old controller nodes one at a time
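
Between steps 2 and 3, one way to confirm the cluster is healthy before removing an old controller is to check the node status and, if the Ceph monitors run on the controllers, the Ceph health; minimal checks, assuming kubectl access and the Ceph CLI on a controller:

$ kubectl get nodes -o wide
$ sudo ceph -s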

Warning

Never attempt to change IP addresses on all controllers simultaneously. This will result in a complete control plane outage and potential data loss.

Important considerations

  • Always maintain quorum in the etcd cluster (majority of nodes operational); see the health check after this list

  • Test this procedure in a non-production environment first

  • Have backups of critical data before proceeding

  • Consider the impact on any external systems that may reference controller IP addresses

  • Some components may cache old IP addresses and require restart
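
For the first point, you can verify quorum at any time by checking the health of every etcd endpoint; the same assumptions apply as for the member-list example above:

$ sudo etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --cluster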