Increase visibility of kubernetes network status
Open, Medium, Public

Description

Context: Follow-up of incident from 2024-02-07 (newly added kubernetes nodes missing BGP configuration).

We do not have good visibility into the network of kubernetes clusters.

  • We already collect some metrics from Calico components; we should make them more visible by adding key metrics to dashboards and possibly alerts.
  • We do not have any metrics related to BGP sessions. These are not available in Calico "open core", so we probably want to run bird-exporter.
    • Specifically for situations like that incident, it would be useful to compare which nodes are pooled in pybal against their BGP status.

Event Timeline

lmata subscribed.

please let us know if we can assist :-)

jijiki changed the task status from Open to Stalled. Apr 2 2025, 11:46 AM
jijiki moved this task from Incoming 🐫 to ⎈Kubernetes on the serviceops-deprecated board.
JMeybohm changed the task status from Stalled to Open. Mar 31 2026, 12:04 PM
JMeybohm triaged this task as Medium priority.
JMeybohm edited projects, added ServiceOps new, netops; removed serviceops-deprecated.

We had some issues that this could have surfaced while populating racks that use the new Nokia switches (T417817). Netops invested in making the relevant information more accessible in T387287: Prometheus: attach host's BGP/interface remote side metrics. The dashboard is at: https://grafana.wikimedia.org/goto/afhobcaujherkc?orgId=1.

Not sure why this was stalled in the first place, but I think we should:

  • Add alerts if BGP sessions are not established (gnmi_bgp_neighbor_session_state{peer_descr=~"wikikube.*"} != 6?)
  • Add alerts if routes from k8s workers are rejected by switches (sum by (peer_descr) (gnmi_bgp_neighbor_prefixes_received_pre_policy{peer_descr=~"wikikube.*"} - gnmi_bgp_neighbor_prefixes_received{peer_descr=~"wikikube.*"}) > 0?)
  • Add the information from the BGP dashboard to the Kubernetes nodes dashboard for increased visibility.
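To make the two proposed alert conditions concrete, here is a minimal Python sketch of the logic they express, run over hand-written sample data rather than real Prometheus series. The host names, the address-family label, and the function names are hypothetical; the value 6 for an established session is taken from the proposal above and has not been verified here.

```python
from collections import defaultdict

# Value the proposal above treats as "established" (unverified assumption).
ESTABLISHED = 6


def sessions_not_established(session_state):
    """Return peers whose session state differs from ESTABLISHED.

    session_state maps peer_descr -> numeric state, mirroring
    gnmi_bgp_neighbor_session_state{...} != 6.
    """
    return sorted(p for p, s in session_state.items() if s != ESTABLISHED)


def rejected_prefixes(pre_policy, accepted):
    """Return {peer_descr: rejected count} for peers with a count above zero.

    Both inputs map (peer_descr, address_family) -> prefix count. Only keys
    present in both are subtracted, since PromQL pairs series by identical
    label sets; the result is then summed per peer_descr.
    """
    totals = defaultdict(int)
    for key, received in pre_policy.items():
        if key in accepted:
            totals[key[0]] += received - accepted[key]
    return {peer: n for peer, n in totals.items() if n > 0}


# Hypothetical sample data, not taken from any real cluster.
states = {"wikikube-worker1001": 6, "wikikube-worker1002": 1}
pre = {
    ("wikikube-worker1001", "ipv4"): 12,
    ("wikikube-worker1001", "ipv6"): 12,
    ("wikikube-worker1002", "ipv4"): 5,
}
acc = {
    ("wikikube-worker1001", "ipv4"): 12,
    ("wikikube-worker1001", "ipv6"): 10,
    ("wikikube-worker1002", "ipv4"): 5,
}

if __name__ == "__main__":
    print(sessions_not_established(states))
    print(rejected_prefixes(pre, acc))
```

Neither condition fires for a host whose series are missing entirely, so a switch that was never configured for a node would go unnoticed by these two checks alone.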

Change #1269983 had a related patch set uploaded (by Blake; author: Blake):

[operations/alerts@master] kubernetes-generic: Add alerts for BGP failure scenarios.

https://gerrit.wikimedia.org/r/1269983

Change #1269994 had a related patch set uploaded (by Blake; author: Blake):

[operations/alerts@master] kubernetes-generic: Add alerts for BGP failure scenarios.

https://gerrit.wikimedia.org/r/1269994

Change #1269983 abandoned by Blake:

[operations/alerts@master] kubernetes-generic: Add alerts for BGP failure scenarios.

https://gerrit.wikimedia.org/r/1269983

Broadly the patch submitted looked good to me, though I see it was abandoned.

As per the comment I left on it, we also need to alert when the session does not establish because the router/switch has not been configured for it. We could possibly detect that from the absence of switch-exposed metrics for the given host. If we can't alert on a missing series, we'd need to export the stats from Calico on the hosts instead. Unfortunately Calico doesn't do that as is, so we'd need to instrument it ourselves, but I don't think that's too hard.

cmooney@wikikube-worker1273:~$ sudo calicoctl node status | awk '/node specific/{print "calico_bgp_peer_status=peer="$2", status="$11}'
calico_bgp_peer_status=peer=10.64.177.1, status=Established
calico_bgp_peer_status=peer=2620:0:861:137::1, status=Established
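As a rough illustration of both ideas, the sketch below mirrors the awk one-liner in Python and adds a set-difference check for hosts that have no switch-side series at all. It is only a sketch under assumptions: the metric name calico_bgp_peer_established, the function names, the extra host name and the sample table rows are hypothetical. The only detail taken from the command above is that whitespace-separated fields 2 and 11 of a "node specific" line hold the peer address and the status.

```python
def parse_calico_peers(status_output):
    """Extract (peer, status) pairs from `calicoctl node status` output.

    Mirrors the awk one-liner: keep lines containing "node specific" and
    take whitespace-separated fields 2 and 11 (indexes 1 and 10 here).
    """
    peers = []
    for line in status_output.splitlines():
        if "node specific" not in line:
            continue
        fields = line.split()
        if len(fields) < 11:
            continue  # not a complete table row, skip it
        peers.append((fields[1], fields[10]))
    return peers


def to_metrics(peers, metric="calico_bgp_peer_established"):
    """Render peers in Prometheus text format: 1 if Established, else 0.

    The metric name is a placeholder, not an existing Calico metric.
    """
    lines = []
    for peer, status in peers:
        value = 1 if status == "Established" else 0
        lines.append(f'{metric}{{peer="{peer}"}} {value}')
    return lines


def missing_on_switch(expected_hosts, switch_peer_descrs):
    """Return expected hosts for which the switches expose no series."""
    return sorted(set(expected_hosts) - set(switch_peer_descrs))


# Hypothetical rows in the table layout implied by the awk field numbers.
SAMPLE = """\
| 10.64.177.1       | node specific | up    | 11:22:33 | Established |
| 2620:0:861:137::1 | node specific | start | 11:22:40 | Active      |
"""

if __name__ == "__main__":
    for line in to_metrics(parse_calico_peers(SAMPLE)):
        print(line)
    print(missing_on_switch(
        ["wikikube-worker1273", "wikikube-worker1274"],
        ["wikikube-worker1273"],
    ))
```

The rendered lines could be written to a file for the node exporter's textfile collector, though parsing CLI table output like this is brittle.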

Ah, thanks Cathal! The original patch was abandoned because I was struggling with git; the new patch is https://gerrit.wikimedia.org/r/c/operations/alerts/+/1269994. I'll update it in accordance with the comment on the previous patch.

We can also run the Prometheus bird exporter as a sidecar to calico-node (a container image needs to be built and some YAML needs to be edited), which should give us more standardized metrics. We could add that as a stretch goal for this task.

Yeah that's a better way to go if it works.

I gave the CR a quick review; at the very least you should use https://wikitech.wikimedia.org/wiki/Network_telemetry#remote_instance:gnmi_bgp_neighbor_session_state%7B%7D

If you also need the count of prefixes received by the switch, we should implement the same "remote_instance" mechanism for it. That way you could also compare it to what's being sent on the host's side (for example with prometheus-bird-exporter).

Regarding the count of "prefixes received by the switch but not accepted", I think as a first step it could be a global Netops alert (for our internal peers).
However if it gets triggered more often by server side issues, we should move it to the same "remote_instance" mechanism.

I've opened T423384: Investigate internal rejected prefixes for that.

Change #1269994 merged by jenkins-bot:

[operations/alerts@master] kubernetes-generic: Add alerts for BGP failure scenarios.

https://gerrit.wikimedia.org/r/1269994