Page MenuHomePhabricator

Prometheus: attach host's BGP/interface remote side metrics
Closed, ResolvedPublic

Description

Forking from one of the chosen path for T384731: Prevent BGP alerts triggering when K8s host maintenance is being done : being able to assign "remote" side metrics to devices (servers and potentially network devices as well).

Interesting metrics could for example be interface status, BGP status, learned/advertised/prefixes.

This would allow nice things like :

  • More accurate alerting (eg. notify the service owner)
  • Easier correlation (verify that local and remote side are similar)

@fgiunchedi provided guidelines on how to implement that in the parent task.

Experimenting with it raises some issues, nothing blocking so far.

First, if we want to do a label swap, instance <-> peer_descr, the query gets a bit complex :

sum without (instance_tmp) (
  label_replace(
    label_replace(
      label_replace(gnmi_bgp_neighbor_session_state{peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"},
        "instance_tmp", "$1:0", "peer_descr", "(.*)"
      ),
      "peer_descr", "$1:0", "instance", "(.*)"
    ),
    "instance", "$1", "instance_tmp", "(.*)"
  )
)

If we don't want to swap it, but instead create a new label with a more explicit name, it's a bit better :

sum without (peer_descr) (
  label_replace(
    label_replace(gnmi_bgp_neighbor_session_state{peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"},
      "remote_instance", "$1:0", "instance", "(.*)"
    ),
    "instance", "$1:0", "peer_descr", "(.*)"
  )
)

Here i used remote_instance to keep it generic enough. But it could be peer_instance, suggestions welcome.

Then comes the filtering, here for my testing I went for a denylist : peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"
But we could also add a ASN range allowlist, or more specific allow list, but it is getting quite large overall.

chatting with Filippo, there is not too much a risk of "polluting" metrics with invalid instance. But I think we should still aim at keeping it clean.

Next issue is the :0 in the instance name.

Servers have their instance set to for example to netflow1002:0, but network devices don't, for example: cr2-magru, so here if we apply the label_replace to network devices as well as servers, we end up with metrics with instances like cr2-magru:0 which defeat the purpose. One possible fix is to add the :0 label during ingestion : https://github.com/wikimedia/operations-puppet/blob/04ac4edb1f14f35ee7a3a693ca308b63110f94ba/modules/profile/manifests/prometheus/ops.pp#L1316

We can also decide to apply the label_replace only to servers for now. But it seems better to standardize the instance label sooner than later.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Another question is how to name those new metrics ?
One suggestion, to stay generic as well, is to do something like
gnmi_bgp_neighbor_session_state -> remote_bgp_neighbor_session_state (or peer_bgp_neighbor_session_state)
What do you think ?

Thank you for kickstarting this @ayounsi! I think I like remote_instance though don't feel strongly.

re: :0 in instance yes by default Prometheus attaches the port where it scraped data from (for example node-exporter on port 9100: node_boot_time_seconds{cluster="acmechief", instance="acmechief-test1001:9100", job="node"}) and as you said it'd be good to keep things consistent. The easiest is indeed to add the port at ingestion time, possibly with a value that's significant if that makes sense? e.g. gnmi port?

Another question is how to name those new metrics ?
One suggestion, to stay generic as well, is to do something like
gnmi_bgp_neighbor_session_state -> remote_bgp_neighbor_session_state (or peer_bgp_neighbor_session_state)
What do you think ?

Good question, we'll be using recording rules and for those we follow best practices outlined here: https://prometheus.io/docs/practices/rules/ so in this case I think something like below would make sense:

instance:gnmi_bgp_neighbor_session_state or remote_instance:gnmi_bgp_neighbor_session_state

Change #1122955 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add exporter port to gNMI metrics instance label

https://gerrit.wikimedia.org/r/1122955

Change #1122957 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Duplicate gNMI BGP session state to metric with peer_descr as instance

https://gerrit.wikimedia.org/r/1122957

Another question is how to name those new metrics ?
One suggestion, to stay generic as well, is to do something like
gnmi_bgp_neighbor_session_state -> remote_bgp_neighbor_session_state (or peer_bgp_neighbor_session_state)
What do you think ?

I like remote_bgp_neighbor_session_state but ultimately it's not too important.

We can also decide to apply the label_replace only to servers for now. But it seems better to standardize the instance label sooner than later.

I'm wondering what the benefit is to having the additional metrics saved for other network devices? As in what alterts will we have, and downtime configuration, where that will help us? Consider a link from cr1-codfw to cr2-drmrs. Our normal alerting will fire for both of these if the session goes down. Is there a way to reconfigure that so we still always get alerted when we need, but we can suppress any alerts if only one side is downtimed? Or is there other benefit of having them?

I'm wondering what the benefit is to having the additional metrics saved for other network devices? As in what alterts will we have, and downtime configuration, where that will help us? Consider a link from cr1-codfw to cr2-drmrs. Our normal alerting will fire for both of these if the session goes down. Is there a way to reconfigure that so we still always get alerted when we need, but we can suppress any alerts if only one side is downtimed? Or is there other benefit of having them?

No strong feelings, but a part of it is to treat network devices a bit like servers, having less exceptions.

The other one is to be able to check that the state or the number of prefixes for example is the same on both sides.

No strong feelings, but a part of it is to treat network devices a bit like servers, having less exceptions.

I'm kind of the same. But I think because we get these gnmi_ stats from both sides of the peering for network devices it will inevitably work differently than servers. No objection to us storing the additional metrics anyway.

No strong feelings, but a part of it is to treat network devices a bit like servers, having less exceptions.

I'm kind of the same. But I think because we get these gnmi_ stats from both sides of the peering for network devices it will inevitably work differently than servers. No objection to us storing the additional metrics anyway.

JFTR in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122957 we'll be going with remote_instance:gnmi_bgp_neighbor_session_state to stay consistent with recording rules best practices (i.e. : convention)

ayounsi triaged this task as Medium priority.

Change #1122955 merged by Ayounsi:

[operations/puppet@production] Add exporter port to gNMI metrics instance label

https://gerrit.wikimedia.org/r/1122955

Change #1122957 merged by Ayounsi:

[operations/puppet@production] Duplicate gNMI BGP session state to metric with peer_descr as instance

https://gerrit.wikimedia.org/r/1122957

Change #1124795 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state

https://gerrit.wikimedia.org/r/1124795

I also added that metric to this dashboard as an example visualization : https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1

Change #1124795 merged by Ayounsi:

[operations/puppet@production] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state

https://gerrit.wikimedia.org/r/1124795