Forking from one of the chosen path for T384731: Prevent BGP alerts triggering when K8s host maintenance is being done : being able to assign "remote" side metrics to devices (servers and potentially network devices as well).
Interesting metrics could for example be interface status, BGP status, learned/advertised/prefixes.
This would allow nice things like :
- More accurate alerting (eg. notify the service owner)
- Easier correlation (verify that local and remote side are similar)
@fgiunchedi provided guidelines on how to implement that in the parent task.
Experimenting with it raises some issues, nothing blocking so far.
First, if we want to do a label swap, instance <-> peer_descr, the query gets a bit complex :
sum without (instance_tmp) (
label_replace(
label_replace(
label_replace(gnmi_bgp_neighbor_session_state{peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"},
"instance_tmp", "$1:0", "peer_descr", "(.*)"
),
"peer_descr", "$1:0", "instance", "(.*)"
),
"instance", "$1", "instance_tmp", "(.*)"
)
)If we don't want to swap it, but instead create a new label with a more explicit name, it's a bit better :
sum without (peer_descr) (
label_replace(
label_replace(gnmi_bgp_neighbor_session_state{peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"},
"remote_instance", "$1:0", "instance", "(.*)"
),
"instance", "$1:0", "peer_descr", "(.*)"
)
)Here i used remote_instance to keep it generic enough. But it could be peer_instance, suggestions welcome.
Then comes the filtering, here for my testing I went for a denylist : peer_group!~"(IX|Transit|bgptools)[46]", peer_descr!="(null)"
But we could also add a ASN range allowlist, or more specific allow list, but it is getting quite large overall.
chatting with Filippo, there is not too much a risk of "polluting" metrics with invalid instance. But I think we should still aim at keeping it clean.
Next issue is the :0 in the instance name.
Servers have their instance set to for example to netflow1002:0, but network devices don't, for example: cr2-magru, so here if we apply the label_replace to network devices as well as servers, we end up with metrics with instances like cr2-magru:0 which defeat the purpose. One possible fix is to add the :0 label during ingestion : https://github.com/wikimedia/operations-puppet/blob/04ac4edb1f14f35ee7a3a693ca308b63110f94ba/modules/profile/manifests/prometheus/ops.pp#L1316
We can also decide to apply the label_replace only to servers for now. But it seems better to standardize the instance label sooner than later.
