
Alert if calico BGP sessions are not established on any kubernetes worker
Closed, Duplicate
Public

Description

We have recently observed conditions where some kubernetes workers (in the dse-k8s-eqiad cluster) were put into service without functional BGP sessions.
As a result, pods were scheduled on these nodes but had no network connectivity, which caused errors.

We have some checks in place from the switch side, such as those in team-wmcs/bgp.yaml, but nothing yet which monitors calico from the host side.

We could perhaps use the output from calicoctl node status and make this available via a prometheus::node_textfile resource.

A working calico node shows output similar to this:

btullis@dse-k8s-worker1018:~$ sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+--------------+---------------+-------+----------+-------------+
| 10.64.181.1  | node specific | up    | 12:26:02 | Established |
+--------------+---------------+-------+----------+-------------+

IPv6 BGP status
+-------------------+---------------+-------+----------+-------------+
|   PEER ADDRESS    |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+-------------------+---------------+-------+----------+-------------+
| 2620:0:861:13b::1 | node specific | up    | 12:26:02 | Established |
+-------------------+---------------+-------+----------+-------------+

When the switch port isn't fully configured and the BGP sessions therefore cannot be established, the output looks like this:

btullis@dse-k8s-worker1018:~$ sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+----------+--------------------------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |              INFO              |
+--------------+---------------+-------+----------+--------------------------------+
| 10.64.181.1  | node specific | start | 11:56:58 | Active Socket: Connection      |
|              |               |       |          | refused                        |
+--------------+---------------+-------+----------+--------------------------------+

IPv6 BGP status
+-------------------+---------------+-------+----------+--------------------------------+
|   PEER ADDRESS    |   PEER TYPE   | STATE |  SINCE   |              INFO              |
+-------------------+---------------+-------+----------+--------------------------------+
| 2620:0:861:13b::1 | node specific | start | 11:56:58 | Active Socket: Connection      |
|                   |               |       |          | refused                        |
+-------------------+---------------+-------+----------+--------------------------------+

This check could be useful for any kubernetes worker node.
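
As a rough illustration of the idea above, the table printed by calicoctl node status could be parsed into one record per peer. This is only a sketch: the function name parse_node_status and the dictionary keys are hypothetical, and it assumes the output keeps the layout shown in the two examples, including the wrapped INFO column.

```python
import re


def parse_node_status(text):
    """Parse `calicoctl node status` output into a list of peer dicts.

    Assumes the table layout shown in this task; names are illustrative only.
    """
    peers = []
    family = None
    for line in text.splitlines():
        # Section headings tell us which address family the next table covers.
        heading = re.match(r"\s*(IPv[46]) BGP status", line)
        if heading:
            family = heading.group(1).lower()
            continue
        # Only rows starting with "|" carry data; "+---+" borders are skipped.
        if not line.lstrip().startswith("|"):
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "PEER ADDRESS":
            continue
        if not cells[0]:
            # Continuation row: the INFO text wrapped onto a second line.
            if peers:
                peers[-1]["info"] = (peers[-1]["info"] + " " + cells[4]).strip()
            continue
        peers.append({
            "family": family,
            "peer": cells[0],
            "type": cells[1],
            "state": cells[2],
            "since": cells[3],
            "info": cells[4],
        })
    for peer in peers:
        # Treat a session as healthy only when both columns agree.
        peer["established"] = peer["state"] == "up" and peer["info"] == "Established"
    return peers
```

The text passed in would come from running the command on the worker, for example via subprocess.run(["calicoctl", "node", "status"], capture_output=True, text=True). Requiring both STATE and INFO to match is a deliberately strict choice, so any unexpected value counts as not established.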

Event Timeline

Thanks for the task @BTullis

While we can take an approach similar to the WMCS team (alerting on the switch-side status), it occurs to me that a likely cause of the problem is that the switch isn't configured, in which case the series for that host won't be exported by the switch and we won't get an alert.

I did a quick check online, and it appears that the open-source version of calico-node doesn't provide any BGP stats out of the box. So we would probably need to somehow export the status from calicoctl, as you show.
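
One possible shape for that export, sketched under assumptions: each peer has already been reduced to a dict with the hypothetical keys peer, family and established, and the result is written as a .prom file for the node_exporter textfile collector to pick up. The metric name calico_bgp_peer_established and the helper names are made up for illustration, and the target path would be whatever the textfile directory is on the host.

```python
import os
import tempfile


def render_metrics(peers):
    """Render per-peer BGP state in the Prometheus text exposition format."""
    lines = [
        "# HELP calico_bgp_peer_established 1 if the BGP session is established, else 0.",
        "# TYPE calico_bgp_peer_established gauge",
    ]
    for peer in peers:
        value = 1 if peer["established"] else 0
        # Peer addresses contain no quotes or backslashes, so no escaping here.
        lines.append(
            'calico_bgp_peer_established{peer="%s",family="%s"} %d'
            % (peer["peer"], peer["family"], value)
        )
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates the file as 0600; make it readable by the exporter.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With a gauge like this, an alert could fire on any peer reporting 0, and a separate check on the absence of the series would cover hosts where the script never ran.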

@BTullis is this linked to T419457: dse-k8s control plane OOM? If yes, could you please add the corresponding incident to the description; if not, could you remove the Incident Followup tag?

This is to make sure we track follow-ups accurately.

This looks like a duplicate of T356877: Increase visibility of kubernetes network status (and subtasks such as T423851: Collect calico BGP metrics), but for a different team. I think you can apply the same kind of monitoring for your services.

@Blake @BTullis can we dedup those tasks (or create a dependency)?

Re: the Incident Follow-up tag, I wonder whether this is an immediate risk or more of a feature request. cc @JMeybohm in case this needs discussion in the Kubernetes WIG.