While working on a project for one of our customers, I investigated in detail how MetalLB works. This Kubernetes cluster is on-premises only and uses the Nginx Ingress Controller to process incoming L7 and L4 traffic. If you want to take a deep dive into this technology, then buckle up!

Nginx Ingress Controller Service

When we want to give outside access to several services of a Kubernetes cluster, the best practice is to use an ingress and to create a service of type LoadBalancer for processing those ingress rules.

The Nginx Ingress Controller is one of the options for this task, and once it is set up you'll see something like this:

$ kubectl get svc -A|grep -i loadbalancer
nginx-controller     ingress-nginx-ingress-nginx    LoadBalancer   172.10.10.10    10.0.0.220   443:31529/TCP            4d3h

This is our Nginx LoadBalancer service, which uses the external IP address 10.0.0.220. That IP address is provided by MetalLB.
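
For reference, such a service is normally created by the ingress controller's Helm chart or manifests rather than by hand, but a minimal sketch of the same kind of object would look roughly like this (the selector label is hypothetical; the name and namespace are taken from the output above):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-ingress-nginx
  namespace: nginx-controller
spec:
  type: LoadBalancer                         # MetalLB assigns the external IP
  selector:
    app.kubernetes.io/name: ingress-nginx    # hypothetical pod label
  ports:
    - name: https
      port: 443
      targetPort: 443
EOF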

How MetalLB works

MetalLB has two components that run as pods: one Controller and several Speakers. The Controller is responsible for assigning the external IP address to the LoadBalancer service we have just seen above. There is one Speaker pod per node in the cluster, and the Speakers influence how the traffic is routed to the cluster.

MetalLB can use two routing modes: L2 and BGP. L2 is the one we are using, and in this mode the elected Speaker is responsible for the external IP address (which is therefore a floating IP shared between the Speakers) and responds to ARP requests for it. The Speaker pod is a host-networked pod, so it uses the IP address of the node it runs on, and that node then routes the traffic. Knowing that, I wanted to look under the hood and check whether we could find information about the elected Speaker and the associated node used to route the traffic. I also wanted to know whether there was any load balancing at play and how it worked.
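
For context, the L2 setup itself is only a pool of addresses plus an L2 advertisement. A minimal sketch with a recent, CRD-based MetalLB release might look like this (older releases use a ConfigMap instead; the pool name and address range are made up for the example, and the namespace is the one used on this cluster):

$ kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool              # hypothetical name
  namespace: metallb-controller
spec:
  addresses:
    - 10.0.0.220-10.0.0.230       # range containing our external IP
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2                # hypothetical name
  namespace: metallb-controller
spec:
  ipAddressPools:
    - default-pool
EOF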

Searching for L2 information

Finding the MAC address of the external IP address

The pod logs of the Speakers don't reveal much except that there is some activity; no details are logged.
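
For the record, here is roughly how I looked at them (the namespace and pod name come from the speaker listing further below; yours will differ):

$ kubectl logs -n metallb-controller metallb-speaker-4w8wr | tail -20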

The first step is to connect to one of the cluster nodes and run the following:

$ arp -a|grep 10.0.0.220
$ ping 10.0.0.220
PING 10.0.0.220 (10.0.0.220) 56(84) bytes of data.
^C
--- 10.0.0.220 ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10267ms

$ arp -a|grep 10.0.0.220
metallb-ep.domain (10.0.0.220) at 00:00:00:00:79:e9 [ether] on ens192

This node had no ARP entry for our external IP address, so by pinging it we could force that entry to be created. The MetalLB documentation, in its troubleshooting section, doesn't recommend ping but advises using curl instead; in my case both worked. By running a tcpdump in another terminal on this node at the same time, we could see the ARP request and reply:

# tcpdump -i ens192 -vv arp
dropped privs to tcpdump
tcpdump: listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
08:39:13.799663 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has metallb-ep.domain tell node1, length 28
08:39:13.800144 ARP, Ethernet (len 6), IPv4 (len 4), Reply metallb-ep.domain is-at 00:00:00:00:79:e9 (oui Unknown), length 46

From both the ARP table and the tcpdump trace we can see that the MAC address 00:00:00:00:79:e9 is associated with our external IP address. We know that this MAC address belongs to the node hosting the elected Speaker, so let's find which node that is.
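
As a quick aside, the curl-based check advised by the MetalLB troubleshooting guide is just as simple and populates the ARP cache the same way; assuming the HTTPS port 443 exposed above, it could look like:

$ curl -ks -o /dev/null https://10.0.0.220/
$ arp -a|grep 10.0.0.220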

Finding which node interface is used in the ARP reply

At this stage you may want to automate that interface search across all the nodes, or use information already collected by a monitoring tool (see the small sketch after the manual search below). In my case I had only 3 nodes with a metallb-speaker pod, so I did the search manually:

$ kubectl get po -A -owide |grep -i metal|grep -i speaker
metallb-controller   metallb-speaker-4w8wr                                     1/1     Running   0             4d5h   10.0.0.1   node1   <none>           <none>
metallb-controller   metallb-speaker-6x9c4                                     1/1     Running   0             4d5h   10.0.0.2   node2   <none>           <none>
metallb-controller   metallb-speaker-n4h8n                                     1/1     Running   0             18m    10.0.0.3   node3   <none>           <none>

Then ssh to each of them:

[node1 ~]$ ifconfig ens192
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
        ether 00:00:00:00:79:e9  txqueuelen 1000  (Ethernet)

Sometimes life is easy: The first node is the chosen one! I now know that all my traffic is routed through that node and that the elected Speaker is the pod on that node too.
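
With more nodes, that manual search would get tedious. A rough sketch of how it could be automated, assuming SSH access to the nodes and the same ens192 interface name everywhere:

$ for n in node1 node2 node3; do echo -n "$n: "; ssh "$n" cat /sys/class/net/ens192/address; done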

Load Balancing anyone?

OK, we could find which Speaker/Node routes our traffic, but can we see whether another Speaker/Node gets chosen? And when? And how does it work? I'm glad you asked: I did the dirty work for you!

I'm first going to use arping from a node that hosts no metallb-speaker pod at all. This goes one step further than what is advised in the MetalLB troubleshooting section.

[node4 ~]$ arping -I ens192 10.0.0.220
ARPING 10.0.0.220 from 10.0.0.4 ens192
Unicast reply from 10.0.0.220 [00:00:00:00:79:E9]  1.033ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.051ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.228ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.059ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.089ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.177ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.067ms
^C

[node4 ~]$ arping -I ens192 10.0.0.220
ARPING 10.0.0.220 from 10.0.0.4 ens192
Unicast reply from 10.0.0.220 [00:00:00:00:79:FB]  1.046ms
Unicast reply from 10.0.0.220 [00:00:00:00:79:E9]  1.064ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.145ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.170ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.140ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.037ms
Unicast reply from 10.0.0.220 [00:00:00:00:7A:06]  1.113ms
^C

The behavior observed above was consistent across all my tests: the first 3 replies come from the MAC addresses of the nodes hosting the 3 metallb-speaker pods, and then a single one keeps responding. I understand that one to be the elected Speaker. When you run arping again, the first 3 replies come in a different order and another Speaker may keep responding, which looks like a load balancing mechanism.

However, that load balancing is observed with arping, which is designed to trigger ARP requests. When a node communicates with the IP address 10.0.0.220, it first looks into its ARP table, which holds the mapping between an IP address and a MAC address; only if there is no entry is an ARP request triggered. In my case there was already an entry:

[node4 ~]# date; arp -a|grep 10.0.0.220
Tue Oct 11 14:27:01 CEST 2022
metallb-ep.domain (10.0.0.220) at 00:00:00:00:7a:06 [ether] on ens192
[node4 ~]# date; arp -a|grep 10.0.0.220
Tue Oct 11 14:27:50 CEST 2022
metallb-ep.domain (10.0.0.220) at 00:00:00:00:79:e9 [ether] on ens192
[node4 ~]# date; arp -a|grep 10.0.0.220
Tue Oct 11 14:29:21 CEST 2022
metallb-ep.domain (10.0.0.220) at 00:00:00:00:7a:06 [ether] on ens192

On node4 we observe that the MAC address associated with our external IP address changes regularly to point to a different Speaker/Node. That also looks like load balancing to me.
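
Instead of re-running the command by hand, you can also watch the entry flip between the Speaker MAC addresses, for example with:

$ watch -n 5 'arp -a | grep 10.0.0.220'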

Let’s sniff the ARP traffic on that node to see what’s going on:

14:38:42.759258 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has metallb-ep.domain (Broadcast) tell metallb-ep.domain, length 46
14:38:42.759317 ARP, Ethernet (len 6), IPv4 (len 4), Reply metallb-ep.domain is-at 00:00:00:00:79:fb (oui Unknown), length 46
14:38:42.759323 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has metallb-ep.domain (Broadcast) tell metallb-ep.domain, length 46
14:38:42.759392 ARP, Ethernet (len 6), IPv4 (len 4), Reply metallb-ep.domain is-at 00:00:00:00:7a:06 (oui Unknown), length 46
14:38:42.760373 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has metallb-ep.domain (Broadcast) tell metallb-ep.domain, length 46
14:38:42.760393 ARP, Ethernet (len 6), IPv4 (len 4), Reply metallb-ep.domain is-at 00:00:00:00:79:e9 (oui Unknown), length 46

There are 3 broadcast ARP requests (sent by each Speaker, I guess) and their corresponding replies, and they are sent every 15 seconds or so. When you observe the ARP table at the same time, you see it change, so those replies act like some kind of gratuitous ARP that updates the MAC address associated with the external IP address. A different node is then used for that IP address, which still looks very much like load balancing to me.

Last but not least

Since I had gone that far and could experiment, DevOps style, with a cluster that is not in production, I finally wanted to know what the effects of killing a metallb-speaker pod are.
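
Concretely, killing a Speaker just means deleting its pod; the DaemonSet recreates it right away (the pod name comes from the listing earlier):

$ kubectl delete pod -n metallb-controller metallb-speaker-4w8wr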

What I've observed is that if I kill the Speaker pod hosted on the node whose MAC address is in my ARP table (so the elected Speaker), then a bunch of broadcast ARP requests as shown above is sent immediately. However, my node doesn't necessarily change its ARP table entry. The pod comes back up in less than 15s, so apparently the same Speaker can be elected again. I've read that a failover takes around 10s of recovery time, which fits what I've observed.
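
If you want to put your own number on that recovery time, a crude probe loop against the exposed port while deleting the elected Speaker gives a rough idea (a simple sketch, assuming the HTTPS port 443 seen earlier; failed probes print 000):

$ while true; do echo -n "$(date +%T) "; curl -ks -o /dev/null -m 1 -w '%{http_code}\n' https://10.0.0.220/; sleep 1; done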

Conclusion

With L2 mode, only one node at a time is used to route the traffic. I had read that load balancing was not possible in this mode. However, even if it is not the best way to achieve it, I've observed a load balancing mechanism, not only when the elected Speaker is deleted but also at a regular interval of a few tens of seconds.

I haven't found any explanation of those internals anywhere on the net, so consider this my contribution, to build upon and go further on that topic.

Note that we have other, older clusters where the MAC address of the Speaker/Node doesn't change, so the load balancing observation described in this blog doesn't apply there. The MetalLB version in those clusters is older than the one I've tested, so this load balancing behaviour may be a more recent improvement; I'll have to investigate further.

Note also that the IP addresses, Ethernet addresses and hostnames have been anonymized.