We have recently observed cases where some Kubernetes workers in the dse-k8s-eqiad cluster were put into service without functional BGP sessions.
As a result, pods were scheduled onto those workers but had no network connectivity, which caused errors.
We have some switch-side checks in place, such as those in team-wmcs/bgp.yaml, but nothing yet that monitors Calico from the host side.
We could perhaps parse the output of calicoctl node status and expose the result via a prometheus::node_textfile resource.
A working Calico node shows output similar to this:
btullis@dse-k8s-worker1018:~$ sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+--------------+---------------+-------+----------+-------------+
| 10.64.181.1  | node specific | up    | 12:26:02 | Established |
+--------------+---------------+-------+----------+-------------+

IPv6 BGP status
+-------------------+---------------+-------+----------+-------------+
|   PEER ADDRESS    |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+-------------------+---------------+-------+----------+-------------+
| 2620:0:861:13b::1 | node specific | up    | 12:26:02 | Established |
+-------------------+---------------+-------+----------+-------------+
When the switch port isn't fully configured and the BGP sessions are therefore not established, the output looks like this:
btullis@dse-k8s-worker1018:~$ sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+----------+--------------------------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |              INFO              |
+--------------+---------------+-------+----------+--------------------------------+
| 10.64.181.1  | node specific | start | 11:56:58 | Active Socket: Connection      |
|              |               |       |          | refused                        |
+--------------+---------------+-------+----------+--------------------------------+

IPv6 BGP status
+-------------------+---------------+-------+----------+--------------------------------+
|   PEER ADDRESS    |   PEER TYPE   | STATE |  SINCE   |              INFO              |
+-------------------+---------------+-------+----------+--------------------------------+
| 2620:0:861:13b::1 | node specific | start | 11:56:58 | Active Socket: Connection      |
|                   |               |       |          | refused                        |
+-------------------+---------------+-------+----------+--------------------------------+
This check could be useful for any Kubernetes worker node.
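As a rough illustration of the idea, here is a minimal Python sketch that parses the table output of calicoctl node status and renders it in the Prometheus text exposition format, suitable for a node-exporter textfile. The metric name calico_bgp_peer_up, the label names, and both function names are hypothetical and only chosen for this example. The sketch assumes the table layout shown above and does not handle the case where the Calico process is not running.

```python
import re

# Hypothetical metric name, chosen for illustration only.
METRIC = "calico_bgp_peer_up"


def parse_bgp_peers(output):
    """Parse the text of `calicoctl node status` into a list of peer dicts."""
    peers = []
    family = None
    for raw in output.splitlines():
        line = raw.strip()
        header = re.match(r"^(IPv[46]) BGP status$", line)
        if header:
            family = header.group(1).lower()
            continue
        # Only table rows (starting with "|") inside a known section matter.
        if family is None or not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "PEER ADDRESS":
            continue
        address, peer_type, state, since, info = cells
        if not address:
            # Continuation row: a long INFO value wrapped onto a second line.
            if peers:
                peers[-1]["info"] = (peers[-1]["info"] + " " + info).strip()
            continue
        peers.append({
            "family": family,
            "peer": address,
            "type": peer_type,
            "state": state,
            "since": since,
            "info": info,
        })
    return peers


def render_metrics(peers):
    """Render one gauge per peer: 1 if the session is Established, else 0."""
    lines = [
        "# HELP %s 1 if the BGP session to the peer is Established, else 0." % METRIC,
        "# TYPE %s gauge" % METRIC,
    ]
    for peer in peers:
        up = 1 if peer["state"] == "up" and peer["info"] == "Established" else 0
        lines.append('%s{family="%s",peer="%s"} %d'
                     % (METRIC, peer["family"], peer["peer"], up))
    return "\n".join(lines) + "\n"
```

For the healthy output above, render_metrics(parse_bgp_peers(text)) would include the line calico_bgp_peer_up{family="ipv4",peer="10.64.181.1"} 1, and for the failing output the same line with a value of 0. A periodic job could write that string to a .prom file in the textfile collector directory. Since calicoctl node status is run with sudo above, such a job would presumably need equivalent privileges.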