Networking (Flannel) issues¶
Cross-node pod connectivity issues¶
It can happen that pods running on different worker nodes cannot reach each other.
To check where each pod is running:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kubernetes-bootcamp-6c5cfd894b-mnq5f 1/1 Running 0 32m 10.111.37.154 pa1-r2-s01 <none> <none>
kubernetes-bootcamp-6c5cfd894b-qd2tb 1/1 Running 0 32m 10.111.12.208 pa1-r3-gpu01 <none> <none>
kubernetes-bootcamp-6c5cfd894b-t2g2n 1/1 Running 0 47m 10.111.15.202 pa1-r3-s14 <none> <none>
Log in to one node and try to ping the pods on the others. Note that the pod IPs are on different subnets (10.111.37.0/24, 10.111.12.0/24, etc.): each subnet belongs to a specific node.
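A quick way to see which node-level subnet a pod belongs to is plain shell string manipulation (runnable anywhere; the IP below is taken from the listing above):

```shell
# Flannel assigns each node a /24 lease, so a pod's node subnet is simply
# its IP with the last octet replaced by 0.
pod_ip=10.111.37.154            # pod on pa1-r2-s01, from the listing above
node_subnet="${pod_ip%.*}.0/24" # strip the last octet, append .0/24
echo "$node_subnet"             # -> 10.111.37.0/24
```

If a ping between pods on different subnets fails while pods on the same node can reach each other, the problem is almost certainly in the flannel overlay rather than in the pods themselves.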
Log in to the worker nodes and check flannel, restarting it if necessary:
$ juju run --application kubernetes-worker "sudo service flannel status"
$ juju run --application kubernetes-worker "sudo service flannel restart"
This should be enough to solve the connectivity issues.
Flannel plugin disappears in /opt/cni/bin¶
It happened to us that, after replacing the Juju relation between kubernetes-worker and flannel (with juju remove-relation followed by juju add-relation), the containers created on the worker failed with the following error:
network: failed to find plugin "flannel" in path [/opt/cni/bin]
The cause is probably a bug in the flannel charm: it removes the CNI flannel plugin when the relation is removed, but does not reinstall it when the relation is added back.
We solved the issue by issuing:
juju upgrade-charm kubernetes-worker
(rolling back to the previous revision afterwards, if needed). Indeed, it is the kubernetes-worker charm that installs the plugin in the first place!
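After the upgrade, it is worth verifying that the plugin is actually back. A small sketch of the check (the function name is ours; the path is the one from the error message above):

```shell
# check_flannel_plugin DIR: report whether the CNI flannel binary is present
# and executable in DIR (defaults to /opt/cni/bin, the path from the error).
check_flannel_plugin() {
  dir="${1:-/opt/cni/bin}"
  if [ -x "$dir/flannel" ]; then
    echo "flannel plugin present in $dir"
  else
    echo "flannel plugin missing in $dir"
  fi
}
check_flannel_plugin
```

On the cluster, the same check can be run on every worker at once with juju run --application kubernetes-worker "ls /opt/cni/bin".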
Configure flannel through etcdctl¶
Example session:
$ juju ssh etcd/6 # choose the etcd leader
$ etcdctl ls /coreos.com/network
/coreos.com/network/config
/coreos.com/network/subnets
$ etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/10.111.71.0-24
/coreos.com/network/subnets/10.111.82.0-24
/coreos.com/network/subnets/10.111.24.0-24
/coreos.com/network/subnets/10.111.11.0-24
/coreos.com/network/subnets/10.111.39.0-24
$ etcdctl get /coreos.com/network/config
{"Network": "10.111.0.0/16", "Backend": {"Type": "vxlan"}}
# specify explicitly that the subnets should be /24 with SubnetLen
$ etcdctl set /coreos.com/network/config '{"Network": "10.111.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}'
{"Network": "10.111.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}
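SubnetLen: 24 carves per-node /24 leases out of the /16 Network, i.e. up to 256 node leases. Since a malformed value in this key breaks flannel cluster-wide, it can be worth validating the JSON locally before writing it (a sketch; it assumes python3 is available on the machine where you run etcdctl):

```shell
# Number of /24 node leases available inside a /16 network: 2^(24-16) = 256.
echo $(( 1 << (24 - 16) ))

# Validate the config JSON before writing it to etcd
# (python3 -m json.tool exits non-zero on malformed JSON).
config='{"Network": "10.111.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}'
echo "$config" | python3 -m json.tool > /dev/null && echo "config JSON OK"
```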
# reconfigure the subnet on a worker (10.111.82.0/24 -> 10.111.42.0/24)
$ etcdctl get /coreos.com/network/subnets/10.111.82.0-24
{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}
$ etcdctl set /coreos.com/network/subnets/10.111.42.0-24 '{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}'
{"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}
$ etcdctl rm /coreos.com/network/subnets/10.111.82.0-24
PrevNode.Value: {"PublicIP":"10.2.8.123","BackendType":"vxlan","BackendData":{"VtepMAC":"56:25:26:42:5b:00"}}
# restart flannel on all nodes
$ juju run --application kubernetes-worker "sudo service flannel restart"
# check
$ juju run --application kubernetes-worker "sudo cat /var/run/flannel/subnet.env"
$ juju run --application kubernetes-worker "sudo ip -4 a | grep 111"
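For reference, /var/run/flannel/subnet.env is a shell-sourceable file. A sketch of its typical layout (the values below are illustrative, chosen to match the transcript above; your MTU and IPMASQ settings may differ):

```shell
# Illustrative subnet.env contents; FLANNEL_SUBNET must fall inside
# FLANNEL_NETWORK and match the node's lease registered in etcd.
cat > subnet.env <<'EOF'
FLANNEL_NETWORK=10.111.0.0/16
FLANNEL_SUBNET=10.111.42.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
EOF

# Being plain shell assignments, the file can simply be sourced:
. ./subnet.env
echo "node subnet: $FLANNEL_SUBNET (inside $FLANNEL_NETWORK)"
```

If a worker's FLANNEL_SUBNET does not match its lease under /coreos.com/network/subnets, the flannel restart did not pick up the new configuration.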