We recently came across an interesting issue in one of our customers’ Kubernetes clusters that is worth sharing. We installed Cloudbees in that cluster and everything looked fine, with all pods up and running. However, we couldn’t connect to the Cloudbees dashboard. We use Cilium as the CNI for this cluster, and we will see how it is involved in this issue and how it helps with troubleshooting.

Spotting a networking issue

The usual way to start troubleshooting is to follow the traffic path from the command line using curl. We can then generate HTTP/HTTPS traffic and evaluate the responses. So from one node in the cluster we tried to reach the Ingress host, then the service IP address and finally one of the Pods of this service. At each step we couldn’t reach the Cloudbees webserver from that node.

Then we tried to reach the Pod from the node it is hosted on. Yep, that worked. So it looked like it didn’t work for traffic coming from another node. This smelled like a networking issue, so it was time to use our Cilium toolbox to confirm it.
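The hop-by-hop check described above can be sketched as a small shell helper. This is only an illustration: probe_hops is a hypothetical name, and the Ingress and Service URLs in the usage comment are placeholders, since only the Pod address appears in this post.

```shell
# Hypothetical helper: try each URL in order and report whether curl could
# complete an HTTP(S) exchange with it. Each argument is one hop on the path.
probe_hops() {
  for url in "$@"; do
    # -s: no progress output, -o /dev/null: discard the response body,
    # --max-time 5: give up after 5 seconds instead of hanging on dropped packets.
    # Any HTTP response (even an error page) counts as "reached the webserver".
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}

# Example usage (INGRESS_HOST, SERVICE_IP and PORT are placeholders):
# probe_hops "https://INGRESS_HOST/" "http://SERVICE_IP:PORT/" "http://172.20.4.209:2080/"
```

Running the same helper once from a remote node and once from the node hosting the Pod makes the difference between the two cases visible at a glance.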

Cilium comes by default with some very useful tools and especially the ability to monitor traffic to its endpoints (which are the pods). So, from another node we generated HTTP traffic toward that Pod with the command:

curl -v 172.20.4.209:2080

At the same time, we monitored all the traffic seen by Cilium and filtered on the IP address of the Pod:

$ kubectl exec -it -n kube-system cilium-5v2fq -- cilium monitor|grep 172.20.4.209
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
Policy verdict log: flow 0x0 local EP ID 33, remote ID remote-node, proto 6, ingress, action deny, match none, 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 33, file bpf_lxc.c line 1916, , identity remote-node->19848: 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
Policy verdict log: flow 0x0 local EP ID 33, remote ID remote-node, proto 6, ingress, action deny, match none, 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 33, file bpf_lxc.c line 1916, , identity remote-node->19848: 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
Policy verdict log: flow 0x0 local EP ID 33, remote ID remote-node, proto 6, ingress, action deny, match none, 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 33, file bpf_lxc.c line 1916, , identity remote-node->19848: 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
Policy verdict log: flow 0x0 local EP ID 33, remote ID remote-node, proto 6, ingress, action deny, match none, 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 33, file bpf_lxc.c line 1916, , identity remote-node->19848: 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
Policy verdict log: flow 0x0 local EP ID 33, remote ID remote-node, proto 6, ingress, action deny, match none, 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN
xx drop (Policy denied) flow 0x0 to endpoint 33, file bpf_lxc.c line 1916, , identity remote-node->19848: 172.20.0.206:33382 -> 172.20.4.209:2080 tcp SYN

We could immediately spot the networking issue by seeing the TCP SYN packet being dropped due to a Policy.
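When the monitor output is long, it can help to condense it. The following is a minimal sketch, assuming the drop lines keep the format shown above; summarize_policy_drops is a hypothetical helper name, and monitor.log is an assumed capture file.

```shell
# Hypothetical helper: read `cilium monitor` output on stdin and print, for each
# distinct "source -> destination" pair, how many "Policy denied" drops were seen.
summarize_policy_drops() {
  grep -F 'drop (Policy denied)' |
    grep -oE '[0-9.]+:[0-9]+ -> [0-9.]+:[0-9]+' |
    sort | uniq -c
}

# Example usage: capture the monitor output for a while into a file (assumed
# name: monitor.log), then summarise it once the capture has stopped:
# summarize_policy_drops < monitor.log
```

With the five drop lines shown above as input, this would print a single line with a count of 5 for 172.20.0.206:33382 -> 172.20.4.209:2080. The monitor command also accepts a --type drop filter, which narrows the stream to drop events before any grep is needed.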

On the path to the root cause

That is great, but now we needed to understand what this policy was and why it was applied. We could get information about that endpoint by using another cilium command:

$ kubectl exec -it -n kube-system cilium-5v2fq -- cilium endpoint list
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                         IPv6   IPv4           STATUS
           ENFORCEMENT        ENFORCEMENT
33         Enabled            Disabled          19848      k8s:app=flow-web                                                                           172.20.4.209   ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=cloudbees                                  
                                                           k8s:io.cilium.k8s.policy.cluster=default                                                                  
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default                                                           
                                                           k8s:io.kubernetes.pod.namespace=cloudbees                                                                 
                                                           k8s:release=cloudbees

We see that policy enforcement is enabled for ingress traffic on this endpoint, and that is why our traffic is dropped. The Cilium policy documentation explains that, in the default enforcement mode, ingress enforcement is switched on for an endpoint as soon as a network policy with an ingress rule selects it. OK, but wait a minute: we don’t use network policies in our Kubernetes cluster, so where does this rule come from?

Also, all other endpoints in our cluster have policy enforcement disabled, so why is it different with Cloudbees?
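To see quickly which endpoints differ from the rest, the endpoint list can be filtered. This is a rough sketch that assumes the column layout shown above (ingress enforcement in the second column, IPv4 in the second-to-last column); enforced_endpoints is a hypothetical helper name.

```shell
# Hypothetical helper: read `cilium endpoint list` output on stdin and print
# "ENDPOINT IDENTITY IPv4" for every endpoint with ingress enforcement Enabled.
# Header lines and label continuation lines are skipped because their first
# field is not a numeric endpoint ID.
enforced_endpoints() {
  awk '$1 ~ /^[0-9]+$/ && $2 == "Enabled" { print $1, $4, $(NF-1) }'
}

# Example usage against the agent used above:
# kubectl exec -n kube-system cilium-5v2fq -c cilium-agent -- cilium endpoint list | enforced_endpoints
```

For the output shown above, this would print 33 19848 172.20.4.209, which is the only endpoint on that node with ingress enforcement enabled.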

Workaround

In order to make the Cloudbees dashboard available while we continued looking for the root cause, we tried the following actions without success:

  • Deletion of the Pod
  • Rollout restart of the cilium-agent DaemonSet
  • Rollout restart of the cilium-operator Deployment
  • Uninstall and Reinstall Cilium with our Ansible scripts

We then understood that this behavior came from the configuration itself, and that we needed to look at the default values used for the Cloudbees installation. To not apply any enforcement at all, we changed the value of the enforcement mode from “default” to “never” in our Cloudbees installation script:

policyEnforcementMode: "never"

With this value, even if there are policies they will not be enforced.

We ran the installation of Cloudbees with this parameter and indeed we could now access our dashboard and the ingress Enforcement was Disabled for our endpoint. That’s great but now we wanted to get to the bottom of it and understand where this policy came from.

Getting closer to the root cause

As it is Cilium that enforces the policy, we looked at which policy it sees with the following command:

$ kubectl exec -it -n kube-system cilium-4x9kp -- cilium policy get

We could see the ingress policy enforced by Cilium but it is easier to read the network policies directly from Kubernetes with the following command:

$ kubectl get netpol -A
NAMESPACE   NAME                POD-SELECTOR                                                    AGE
cloudbees   dois-policy         app=flow-devopsinsight,release=cloudbees                        2d16h
cloudbees   repository-policy   app=flow-repository,release=cloudbees                           2d16h
cloudbees   server-policy       app=flow-server,release=cloudbees                               2d16h
cloudbees   web-policy          app=flow-web,release=cloudbees                                  2d16h
cloudbees   zookeeper-policy    mode=private,ownerApp=cloudbees-flow,role=cluster-coordinator   2d16h

This confirms that there are network policies but only Cloudbees has some in our cluster. Indeed those are installed with Cloudbees! It is becoming clearer now.

The root cause

The Holy Grail of any troubleshooting work is finding the root cause of the issue. We were wondering why this network policy was not working. After all, an enforced network policy is not necessarily a bad thing, but here it seemed to drop all the traffic. Let’s find out why and reach our Grail.

Let’s first have a look at the details of this network policy that drops our traffic:

$ kubectl describe netpol web-policy -n cloudbees
Name:         web-policy
Namespace:    cloudbees
Created on:   2022-10-21 15:32:32 +0200 CEST
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: cloudbees
              meta.helm.sh/release-namespace: cloudbees
Spec:
  PodSelector:     app=flow-web,release=cloudbees
  Allowing ingress traffic:
    To Port: 2080/TCP
    To Port: 2443/TCP
    From:
      IPBlock:
        CIDR: 0.0.0.0/0
        Except:
    From:
      PodSelector: app=flow-bound-agent,release=cloudbees
    From:
      PodSelector: app=flow-server,release=cloudbees
  Not affecting egress traffic
  Policy Types: Ingress

At first sight everything looks fine: all traffic (0.0.0.0/0), which according to the Kubernetes documentation usually means traffic external to the cluster, is allowed as ingress, and so is the traffic from the selected Pods in the cloudbees namespace. When we connect to the Cloudbees dashboard, the traffic doesn’t come from one of those Pods, so it should be allowed by the 0.0.0.0/0 rule. That is the theory, but in reality we’ve seen that the traffic is dropped. Let’s dig deeper.

The Cilium documentation gives us some useful information about network policy in the section “IP/CIDR based”: “CIDR rules do not apply to traffic where both sides of the connection are either managed by Cilium or use an IP belonging to a node in the cluster (including host networking pods)”. Could it be we are in this case?

The first thing is to check which node hosts the Cloudbees dashboard Pod:

$ kubectl get po -A -o wide|grep -i flow-web
cloudbees               flow-web-5677549db-7q4l8                                          1/1     Running                0                3d14h   172.20.4.209   node175   <none>           <none>

Then we check the Cilium Pod on that same node:

$ kubectl get po -A -o wide|grep -i cilium|grep node175
kube-system             cilium-jkrsn                                                      1/1     Running                0                  3d9h    10.162.56.53   node175   <none>           <none>
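The two lookups above can also be combined into one step. This is a sketch under an assumption: it relies on the Cilium agent Pods carrying the label k8s-app=cilium, which is common but should be verified in your own cluster. cilium_pod_for is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of the Cilium agent Pod running on the
# same node as a given workload Pod.
#   $1: namespace of the workload Pod
#   $2: name of the workload Pod
cilium_pod_for() {
  # Ask Kubernetes which node the workload Pod is scheduled on.
  node=$(kubectl get pod -n "$1" "$2" -o jsonpath='{.spec.nodeName}') || return 1
  # Select the agent Pod on that node (assumes the label k8s-app=cilium).
  kubectl get pods -n kube-system -l k8s-app=cilium \
    --field-selector "spec.nodeName=$node" \
    -o jsonpath='{.items[0].metadata.name}'
}

# Example usage:
# cilium_pod_for cloudbees flow-web-5677549db-7q4l8
```

Using a field selector instead of grep avoids false matches when a node name happens to be a substring of another name.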

Because it runs on the same node, this Cilium Pod can monitor the traffic to the dashboard Pod. So we used the same monitoring command as previously:

$ kubectl exec -it -n kube-system cilium-jkrsn -- cilium monitor|grep 172.20.4.209
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
-> endpoint 33 flow 0x0 , identity 41842->19848 state new ifindex lxc90f196162357 orig-ip 172.20.2.216: 172.20.2.216:44430 -> 172.20.4.209:2080 tcp SYN
-> overlay flow 0x8be577d7 , identity 19848->unknown state reply ifindex cilium_vxlan orig-ip 0.0.0.0: 172.20.4.209:2080 -> 172.20.2.216:44430 tcp SYN, ACK
-> endpoint 33 flow 0x0 , identity 41842->19848 state established ifindex lxc90f196162357 orig-ip 172.20.2.216: 172.20.2.216:44430 -> 172.20.4.209:2080 tcp ACK

First we can confirm that the traffic is not dropped this time and that the source IP address is 172.20.2.216. We have to check to which Pod or Node it belongs:

$ kubectl get po -A -owide|grep 172.20.2.216
nginx-controller        ingress-nginx-ingress-nginx-7b598d66b4-dgfzh                      1/1     Running                0                  3d21h   172.20.2.216   node176   <none>           <none>

So this is the IP address of our nginx-controller Pod and this is managed by the Cilium Pod hosted on the same node as we can see:

$ kubectl exec -it -n kube-system cilium-rfpwh -- cilium endpoint list|grep 172.20.2.216
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init)
1963       Disabled           Disabled          41842      k8s:app.kubernetes.io/component=controller                                                             172.20.2.216   ready

So that’s it! We are in the case described in the Cilium documentation: both of our endpoints are managed by Cilium, so it does not apply the 0.0.0.0/0 rule of the network policy and drops the traffic, as there are no other matching rules. No mercy!

Sum up & Next Steps

Congratulations if you have read this far!

So our Cloudbees installation included network policies. By default, Cilium enforces network policies when they exist, and so it discarded the traffic in our case because no rule matched in our environment.

Now that we have found out that the Cloudbees installation comes with some network policies, we will have to check whether we need them and whether we can choose not to install them. If we decide to keep them, we will also need to make them work in our environment.
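One possible way to make the policies work, sketched here as an untested assumption rather than the fix we applied, is to add a second network policy that selects the ingress controller by labels instead of by CIDR. The policy name web-allow-ingress-controller and the helper name ingress_controller_policy are hypothetical. The namespace (nginx-controller) and the Pod label (app.kubernetes.io/component=controller) are taken from the outputs above and may need more specific selectors in another cluster.

```shell
# Hypothetical helper: print an additional NetworkPolicy for the flow-web Pods.
# Network policies are additive, so this one complements web-policy instead of
# replacing it.
ingress_controller_policy() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow-ingress-controller   # hypothetical name
  namespace: cloudbees
spec:
  podSelector:
    matchLabels:
      app: flow-web
      release: cloudbees
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Both selectors sit in the same list item, so a source must match both.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: nginx-controller
          podSelector:
            matchLabels:
              app.kubernetes.io/component: controller
      ports:
        - protocol: TCP
          port: 2080
        - protocol: TCP
          port: 2443
EOF
}

# Review the output first, then apply it with:
# ingress_controller_policy | kubectl apply -f -
```

A label-based rule like this is evaluated for Cilium-managed endpoints, unlike the 0.0.0.0/0 rule. It would not cover traffic sent from the cluster nodes themselves, such as the curl test from a remote node.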

When you know the solution, it all looks pretty obvious, but it took us several hours to get the full picture. Troubleshooting in a Kubernetes cluster is often very challenging, which is why it is so interesting! By sharing our experience, I hope to save you time if you face a similar issue.

If you too want to take your Kubernetes skills to the next level, check out our Training course given by our Kubernetes virtuoso!