Before jumping into the topic of this blog, let’s set the context. To keep your Linux systems up to date, you traditionally install the unattended-upgrades package on your Ubuntu servers, triggered every week or so. By default, this piece of software downloads all required package updates from the core and security repositories and applies them. The thing is, some of those updates require a reboot of the server. And if that server happens to be a Kubernetes node, a reboot, which is pretty brutal from a Kubernetes standpoint, can mess up your workloads. You may be familiar with this story.

An animal called Kured

In the world of DevOps, where continuous development and deployment are essential, ensuring the smooth operation of a Kubernetes cluster is crucial. One key aspect is managing Kubernetes node reboots efficiently and safely. This is where Kured (KUbernetes REboot Daemon) steps in. Kured is a Kubernetes DaemonSet that automates node reboots based on indications from the underlying operating system’s package manager. In this blog, we will explore how Kured significantly reduces downtime in DevOps projects by simplifying the management of node reboots.

A CNCF sandbox project, Kured is a small set of Kubernetes objects. And like any good tool, you can install it with Helm for an easy setup (yes, also through plain manifest files if you like to play).

Our sandbox environment is composed of two AWS EC2 instances running an RKE2-backed Kubernetes cluster. If you are not familiar with RKE2, it is a fully conformant Kubernetes distribution with a specific focus on security. RKE stands for Rancher Kubernetes Engine, and RKE2 is a CNCF-certified Kubernetes distribution. Interested in more RKE2 content? My colleague Kevin posted some time ago the procedure to set up node autoscaling in RKE2.

ubuntu@k8s-node-master:~$ kubectl get nodes
NAME              STATUS   ROLES                       AGE     VERSION
k8s-node-master   Ready    control-plane,etcd,master   5h49m   v1.26.12+rke2r1
k8s-node-worker   Ready    <none>                      5h27m   v1.26.12+rke2r1

Here, we are dealing with quite a simple cluster: one control-plane node and one worker node. Adding Kured to your cluster is pretty straightforward, thanks to Helm.

ubuntu@k8s-node-master:~$ helm repo add kubereboot https://kubereboot.github.io/charts

ubuntu@k8s-node-master:~$ helm search repo kubereboot
NAME                    CHART VERSION   APP VERSION     DESCRIPTION           
kubereboot/kured        5.4.2           1.15.0          A Helm chart for kured

The Helm repository is added first; a helm search shows the latest available chart and app versions. But before installing, you can quickly have a look at the configurable values. There are dozens of fields you can customize, and a standard helm show values kubereboot/kured should do the trick! I’m referring here to the values.yaml file from GitHub, but the output of the helm command should be similar, depending on the version you’ve selected.

Pay attention to the following:

configuration.startTime and configuration.endTime: the time window during which reboots are allowed.
configuration.rebootDays: reboots are only performed on the selected days.
configuration.period: how often Kured checks whether a reboot is needed.
configuration.forceReboot: what Kured should do if the drain times out.
configuration.drainGracePeriod, configuration.drainPodSelector, configuration.drainTimeout: drain-related parameters. With these, you define a grace period (remember kubectl delete pods --force --grace-period=0 😀), the list of pods to drain, and a timeout after which the drain is canceled.
configuration.blockingPodSelector: if matching pods are present on the node, the reboot is canceled.
configuration.concurrency: useful for very large clusters. Defaults to 1, meaning only one node is rebooted at a time.
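To make these keys concrete, here is a minimal override file putting some of them together. This is only a sketch: the key names come from the chart’s values.yaml, while the actual values (a weekend, early-morning maintenance window) and the accepted day/duration formats are illustrative, so double-check them against helm show values for your chart version.

```shell
# Hypothetical override file for the keys discussed above; the values
# (weekend maintenance window, hourly check) are illustrative only.
cat > kured-overrides.yaml <<'EOF'
configuration:
  startTime: "02:00"
  endTime: "05:00"
  rebootDays: ["sat", "sun"]
  period: "1h"
  concurrency: 1
EOF
# Then pass the file to Helm at install time (not run here):
# helm install kured kubereboot/kured -n kured --create-namespace -f kured-overrides.yaml
```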

As stated in the documentation, Kured regularly checks for the presence of the file /var/run/reboot-required. Created by the package manager, this file comes along with /var/run/reboot-required.pkgs, which lists the packages that requested the reboot. The name and location can be tuned in the Helm chart values, but for Debian/Ubuntu-like Linux distributions, no modification should be necessary.
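At its core, that check is nothing more than a file-existence test. Here is a rough approximation in plain shell, using a stand-in path under /tmp for this sketch (the real daemon tests the host file mounted inside its pod, e.g. /sentinel/reboot-required):

```shell
# Rough approximation of Kured's periodic sentinel check.
# Stand-in path for this sketch; the real daemon tests the mounted host file.
SENTINEL=/tmp/reboot-required
rm -f "$SENTINEL"
test -f "$SENTINEL" && echo "Reboot required" || echo "Reboot not required"
touch "$SENTINEL"   # simulate the package manager flagging a pending reboot
test -f "$SENTINEL" && echo "Reboot required" || echo "Reboot not required"
```

The first check prints "Reboot not required", the second "Reboot required": exactly the two log lines we will see from the Kured pods below.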

Let’s start!

I created my cluster a while ago, so it surely needs some system upgrades. Let’s install Kured and see how it behaves.

First, we need to get the values file and update it.

ubuntu@k8s-node-master:~$ helm show values kubereboot/kured > values.yaml
ubuntu@k8s-node-master:~$ ls -l
total 8
-rw-rw-r-- 1 ubuntu ubuntu 6412 Mar  4 09:56 values.yaml

I’m going to update the file to set the configuration.period key. For the purpose of my tests, I don’t want to wait for an hour 😉 so I will shorten it.
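For the record, the edit boils down to a one-liner. This sketch works on a stub file rather than the real exported values.yaml, and it assumes the period key defaults to 1h in the chart version at hand; adjust the pattern to whatever helm show values actually printed for you:

```shell
# Sketch: the stub mimics the relevant part of the exported values.yaml
# (assumed default of 1h); the sed call shortens the check period to 1m.
printf 'configuration:\n  period: 1h\n' > values-stub.yaml
sed -i 's/period: 1h/period: 1m/' values-stub.yaml
grep 'period' values-stub.yaml
```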

Installing Kured is just a matter of helm install.

ubuntu@k8s-node-master:~$ helm install kured kubereboot/kured -n kured --create-namespace -f values.yaml  
NAME: kured
LAST DEPLOYED: Mon Mar  4 10:03:56 2024
NAMESPACE: kured
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Kured will check for /var/run/reboot-required, and reboot nodes when needed.

See https://github.com/kubereboot/kured/ for details.

Kured is installed, on the two nodes:

ubuntu@k8s-node-master:~$ kubectl get pods -n kured -o wide
NAME          READY   STATUS    RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES
kured-65l69   1/1     Running   0          60s   10.42.1.7    k8s-node-worker   <none>           <none>
kured-6z7bf   1/1     Running   0          60s   10.42.0.21   k8s-node-master   <none>           <none>

Let’s have a look at the logs; they’re quite informative:

ubuntu@k8s-node-master:~$ kubectl logs -n kured kured-65l69
time="2024-03-04T10:04:02Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-03-04T10:04:02Z" level=info msg="Kubernetes Reboot Daemon: 1.15.0"
time="2024-03-04T10:04:02Z" level=info msg="Node ID: k8s-node-worker"
time="2024-03-04T10:04:02Z" level=info msg="Lock Annotation: kured/kured:weave.works/kured-node-lock"
time="2024-03-04T10:04:02Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-03-04T10:04:02Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-03-04T10:04:02Z" level=info msg="PreferNoSchedule taint: "
time="2024-03-04T10:04:02Z" level=info msg="Blocking Pod Selectors: []"
time="2024-03-04T10:04:02Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
time="2024-03-04T10:04:02Z" level=info msg="Reboot check command: [test -f /sentinel/reboot-required] every 1m0s"
time="2024-03-04T10:04:02Z" level=info msg="Concurrency: 1"
time="2024-03-04T10:04:02Z" level=info msg="Reboot method: command"
time="2024-03-04T10:04:02Z" level=info msg="Reboot signal: 39"
time="2024-03-04T10:05:50Z" level=info msg="Reboot not required"
time="2024-03-04T10:06:50Z" level=info msg="Reboot not required"
time="2024-03-04T10:07:50Z" level=info msg="Reboot not required"
time="2024-03-04T10:08:50Z" level=info msg="Reboot not required"
time="2024-03-04T10:09:50Z" level=info msg="Reboot not required"
time="2024-03-04T10:10:50Z" level=info msg="Reboot not required"

Logs are coming from the pod scheduled on the worker. A quick look shows that I’ve set the check to run every minute, all day, every day. It is definitely not a production-ready setting, but good enough for our tests!

Can we introduce some workloads?

For lack of inspiration, I’m reusing the examples from the Minikube tutorial on kubernetes.io: the Deployment first, and then the Service, exposed as NodePort for this test.

ubuntu@k8s-node-master:~$ kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.39 -- /agnhost netexec --http-port=8080
deployment.apps/hello-node created
ubuntu@k8s-node-master:~$ kubectl get deployments
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
hello-node   1/1     1            1           8s
ubuntu@k8s-node-master:~$ kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP          NODE              NOMINATED NODE   READINESS GATES
hello-node-7b87cd5f68-lzlxk   1/1     Running   0          16s   10.42.1.8   k8s-node-worker   <none>           <none>
ubuntu@k8s-node-master:~$ kubectl expose deployment hello-node --type=NodePort --port=8080
service/hello-node exposed
ubuntu@k8s-node-master:~$ kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
hello-node   NodePort    10.43.187.123   <none>        8080:30608/TCP   5s
kubernetes   ClusterIP   10.43.0.1       <none>        443/TCP          26d

Let’s validate our deployment. We are using a NodePort Service, so I’m using the IP of my worker.

ubuntu@k8s-node-master:~$ curl 192.168.7.16:30608
NOW: 2024-03-04 10:26:13.925835071 +0000 UTC m=+213.118628418

It worked (luckily)!

Kured in action

Now it’s time for some system upgrades on the worker.

ubuntu@k8s-node-worker:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

We are running Ubuntu 22.04.4 LTS, so as usual, it’s a matter of apt update and apt dist-upgrade.

ubuntu@k8s-node-worker:~$ sudo apt update
Hit:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease  
Get:2 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]  
Get:3 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]  
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]  
...
Fetched 9330 kB in 2s (4796 kB/s)                                
Reading package lists... Done  
Building dependency tree... Done  
Reading state information... Done  
79 packages can be upgraded. Run 'apt list --upgradable' to see them.  
ubuntu@k8s-node-worker:~$ sudo apt dist-upgrade    
Reading package lists... Done  
Building dependency tree... Done  
Reading state information... Done  
Calculating upgrade... Done  
The following NEW packages will be installed:  
 linux-aws-6.5-headers-6.5.0-1014 linux-headers-6.5.0-1014-aws linux-image-6.5.0-1014-aws linux-modules-6.5.0-1014-aws  
...
Processing triggers for dbus (1.12.20-2ubuntu4.1) ...
Processing triggers for linux-image-6.5.0-1014-aws (6.5.0-1014.14~22.04.1) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-6.5.0-1014-aws
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'

The output is truncated, but the upgrade brought a new Linux kernel. That’s convenient, because it will require a reboot! Let’s look at the files mentioned earlier:

ubuntu@k8s-node-worker:~$ cat /var/run/reboot-required  
*** System restart required ***  
ubuntu@k8s-node-worker:~$ cat /var/run/reboot-required.pkgs    
libc6  
linux-image-6.5.0-1014-aws  
linux-base

All good: the /var/run/reboot-required file will trigger the whole process from Kured. Let’s be quick; things are about to get busy. Here is what we should expect:

  • Kured will drain the node and terminate the pods,
  • Pods will be rescheduled on available nodes,
  • The node will be rebooted,
  • Kured will be restarted on the rebooted node,
  • and the node will be uncordoned.

And all of this takes place while you are staring at your screen 🙂

ubuntu@k8s-node-master:~$ kubectl get pods -o wide && kubectl get nodes -o wide
NAME                          READY   STATUS        RESTARTS   AGE   IP          NODE              NOMINATED NODE   READINESS GATES
hello-node-7b87cd5f68-lzlxk   1/1     Terminating   0          27m   10.42.1.8   k8s-node-worker   <none>           <none>
NAME              STATUS                     ROLES                       AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
k8s-node-master   Ready                      control-plane,etcd,master   26d   v1.26.12+rke2r1   192.168.4.194   <none>        Ubuntu 22.04.3 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2
k8s-node-worker   Ready,SchedulingDisabled   <none>                      26d   v1.26.12+rke2r1   192.168.7.16    <none>        Ubuntu 22.04.4 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2

Kured took notice of the file and initiated the process. The drain is ongoing; the pod is terminating.

ubuntu@k8s-node-master:~$ kubectl get pods -o wide && kubectl get nodes -o wide  
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES  
hello-node-7b87cd5f68-7l9jk   1/1     Running   0          25s   10.42.0.22   k8s-node-master   <none>           <none>  
NAME              STATUS                     ROLES                       AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME  
k8s-node-master   Ready                      control-plane,etcd,master   26d   v1.26.12+rke2r1   192.168.4.194   <none>        Ubuntu 22.04.3 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2  
k8s-node-worker   NotReady,SchedulingDisabled   <none>                      26d   v1.26.12+rke2r1   192.168.7.16    <none>        Ubuntu 22.04.4 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2

Now the node is marked as NotReady by the control plane; the reboot is ongoing. The pod has also been rescheduled on the control-plane node. Yes, this is uncommon for vanilla Kubernetes, but on RKE2, by default, control-plane nodes can handle user workloads! Let’s wait a bit for the worker to reboot and run our checks again.

ubuntu@k8s-node-master:~$ kubectl get pods -o wide && kubectl get nodes -o wide  
NAME                          READY   STATUS    RESTARTS   AGE     IP           NODE              NOMINATED NODE   READINESS GATES  
hello-node-7b87cd5f68-7l9jk   1/1     Running   0          4m14s   10.42.0.22   k8s-node-master   <none>           <none>  
NAME              STATUS   ROLES                       AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME  
k8s-node-master   Ready    control-plane,etcd,master   26d   v1.26.12+rke2r1   192.168.4.194   <none>        Ubuntu 22.04.3 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2  
k8s-node-worker   Ready    <none>                      26d   v1.26.12+rke2r1   192.168.7.16    <none>        Ubuntu 22.04.4 LTS   6.5.0-1014-aws   containerd://1.7.11-k3s2

Nice! The node is back and properly uncordoned. We can check the logs of the Kured pod running on the newly rebooted node.

ubuntu@k8s-node-master:~$ kubectl get pods -n kured -o wide
NAME          READY   STATUS    RESTARTS      AGE   IP           NODE              NOMINATED NODE   READINESS GATES
kured-65l69   1/1     Running   1 (93s ago)   49m   10.42.1.10   k8s-node-worker              
kured-6z7bf   1/1     Running   0             49m   10.42.0.21   k8s-node-master              

ubuntu@k8s-node-master:~$ kubectl logs -n kured kured-65l69
time="2024-03-04T10:52:27Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2024-03-04T10:52:27Z" level=info msg="Kubernetes Reboot Daemon: 1.15.0"
time="2024-03-04T10:52:27Z" level=info msg="Node ID: k8s-node-worker"
time="2024-03-04T10:52:27Z" level=info msg="Lock Annotation: kured/kured:weave.works/kured-node-lock"
time="2024-03-04T10:52:27Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2024-03-04T10:52:27Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2024-03-04T10:52:27Z" level=info msg="PreferNoSchedule taint: "
time="2024-03-04T10:52:27Z" level=info msg="Blocking Pod Selectors: []"
time="2024-03-04T10:52:27Z" level=info msg="Reboot schedule: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
time="2024-03-04T10:52:27Z" level=info msg="Reboot check command: [test -f /sentinel/reboot-required] every 1m0s"
time="2024-03-04T10:52:27Z" level=info msg="Concurrency: 1"
time="2024-03-04T10:52:27Z" level=info msg="Reboot method: command"
time="2024-03-04T10:52:27Z" level=info msg="Reboot signal: 39"
time="2024-03-04T10:53:42Z" level=info msg="Holding lock"
time="2024-03-04T10:53:42Z" level=info msg="Uncordoning node k8s-node-worker"
time="2024-03-04T10:53:42Z" level=info msg="Releasing lock"

A few words to conclude

Kured isn’t a new product; its very first official release on GitHub dates back to 2017. Through a very simple example, we can already see the potential of this tool, which could easily be integrated into our platforms. We certainly need to explore its behavior in large-scale environments, but Kured seems well suited for such use cases! And that means fewer headaches when patching our nodes. I wish I’d found this tool sooner. Stay tuned for further discoveries with this tool!