
Nokia OSPF alerts not working
Closed, ResolvedPublic

Description

It seems the new alerts I added last week for OSPF status on our Nokia switches are not working as expected.

I tried to trigger a failure, but the alert did not behave as expected. The current metrics for ssw1-d8-eqiad are:

gnmi_nokia_ospf_oper_state{area_area_id="0.0.0.0", instance="ssw1-d8-eqiad:9804", instance_name="ospfv2", interface_interface_name="ethernet-1/11.0", job="gnmi", network_instance_name="default", prometheus="ops", site="eqiad"} 4
gnmi_nokia_ospf_neighbor_count{area_area_id="0.0.0.0", instance="ssw1-d8-eqiad:9804", instance_name="ospfv2", interface_interface_name="ethernet-1/11.0", job="gnmi", network_instance_name="default", prometheus="ops", site="eqiad"} 0

The state is 4, but neighbor count is 0, so it should alert, yet I don't think it fired.
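For readers following along, the condition described above could be expressed roughly as the rule below. This is only a sketch, not the rule in operations/alerts: it assumes an oper state of 4 is the value that marks the interface as ok, and the group name, `for` duration, and severity are placeholders. The `on (...)` list is the set of labels both series share in the samples above.

```yaml
groups:
  - name: ospf_sketch  # hypothetical group name
    rules:
      - alert: OspfAdjDown
        # Assumption: oper_state == 4 means the OSPF interface is seen as ok.
        # Fire only when the interface looks ok but has no neighbors.
        expr: |
          gnmi_nokia_ospf_neighbor_count == 0
          and on (instance, network_instance_name, instance_name, area_area_id, interface_interface_name)
          gnmi_nokia_ospf_oper_state == 4
        for: 5m  # assumed hold time
        labels:
          severity: warning  # assumed
        annotations:
          # Note: the instance label includes the exporter port (e.g. ":9804").
          summary: "OSPF Adjacency down on {{ $labels.instance }} interface {{ $labels.interface_interface_name }}"
```

With the two samples above (neighbor count 0, oper state 4, identical labels), this expression would return one series for ethernet-1/11.0 on ssw1-d8-eqiad.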

AlertLintProblem alerts are also firing from the Prometheus instances at our POPs. We have no Nokia switches there, so it makes sense that there are no matching series. I'm unsure how best to tackle that.

I'll probably need some help from the observability team on these.

Event Timeline

cmooney triaged this task as Medium priority.

Change #1198926 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/alerts@master] OSPF alert: fix error in grouping of labels

https://gerrit.wikimedia.org/r/1198926

Change #1198926 merged by jenkins-bot:

[operations/alerts@master] OSPF alert: fix error in grouping of labels

https://gerrit.wikimedia.org/r/1198926

Ok well I fixed the obvious error but the alerts still aren't firing :(

I saw the alerts on the ALERTS metric: https://w.wiki/FqSi .
I think there was a silence rule in place, so you didn't get any notifications.

About the AlertLintProblem, you could add the # pint disable promql/series annotation on the yaml file to avoid errors from datacenters without Nokia devices.

> I saw the alerts on the ALERTS metric: https://w.wiki/FqSi .

Ok thanks for that! That is a good way to check, and yes the times line up with when I simulated an issue.

> I think there was a silence rule in place, so you didn't get any notifications.

Could be, I did try to remove the downtime for the host before triggering, but I could have easily made a mistake there. I'm much more confident now things are working as they should thanks.

> About the AlertLintProblem, you could add the # pint disable promql/series annotation on the yaml file to avoid errors from datacenters without Nokia devices.

Ok cool I'll give that a try and submit a patch.
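For illustration, the suggested comment might sit in the rule file roughly as follows. This is a sketch only: the group name and expression are placeholders rather than the contents of the actual patch, and the placement of the comment directly above the rule is an assumption about how the file is laid out.

```yaml
groups:
  - name: ospf_sketch  # hypothetical group name
    rules:
      # Tell pint not to complain when the queried series do not exist
      # on a given Prometheus instance (e.g. sites without Nokia devices).
      # pint disable promql/series
      - alert: OspfAdjDown
        # Placeholder expression, not the real rule.
        expr: gnmi_nokia_ospf_neighbor_count == 0
```

The comment only silences the linter check for this rule; it does not change how Prometheus evaluates the expression.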

Change #1199332 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/alerts@master] team-netops: ospf alert: add pint disable promql/series

https://gerrit.wikimedia.org/r/1199332

Change #1199332 merged by jenkins-bot:

[operations/alerts@master] team-netops: ospf alert: add pint disable promql/series

https://gerrit.wikimedia.org/r/1199332

FWIW this will need further investigation. I've reset a bunch of these switches, which should produce the scenario in which the alerts fire, but I've not seen any Alertmanager alerts fire.

Small update: lsw1-d6-eqiad is broken right now, so this alert should be present for ssw1-d1-eqiad and ssw1-d8-eqiad.

In today's case, the alert criteria weren't met because the metrics went missing.

The only interface that could have met the neighbor count criteria was system0.0, and this interface's oper state remained 5 until the metric was no longer reported.

We'll likely need a different indicator of trouble to fire on an issue like today's.
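One possible indicator for this missing-metric case is a rule that notices when a device stops reporting the OSPF series at all. The following is a minimal sketch under stated assumptions: the alert name `OspfMetricsMissing` and the group name are hypothetical, and the one-hour lookback, `for` duration, and severity are arbitrary choices.

```yaml
groups:
  - name: ospf_missing_sketch  # hypothetical group name
    rules:
      # Hypothetical alert; not part of operations/alerts.
      - alert: OspfMetricsMissing
        # Instances that reported OSPF oper state one hour ago
        # but report no such series now.
        expr: |
          count by (instance) (gnmi_nokia_ospf_oper_state offset 1h)
          unless
          count by (instance) (gnmi_nokia_ospf_oper_state)
        for: 10m  # assumed hold time
        labels:
          severity: warning  # assumed
        annotations:
          summary: "No OSPF metrics from {{ $labels.instance }}"
```

A rule like this resolves on its own once the lookback window no longer contains data, so it signals that metrics recently disappeared rather than that a device has been absent for a long time.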

cmooney claimed this task.

> In today's case, the alert criteria weren't met because the metrics went missing.
>
> The only interface that could have met the neighbor count criteria was system0.0, and this interface's oper state remained 5 until the metric was no longer reported.
>
> We'll likely need a different indicator of trouble to fire on an issue like today's.

Cole, thank you. I think that makes sense, however we should have gotten an alert from the "other side". Basically, the OSPF adjacency exists on the switches on either side of a link. In this case lsw1-d6-eqiad died completely, so that metric went away (and indeed it would also disappear if it was removed from the config on that device or something).

We still have the metric for the other side, though, on ssw1-d1-eqiad and ssw1-d8-eqiad, and it seems the alert has fired for us today:

FIRING: OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown

I think what I was missing is that when the switch dies completely, the port goes down, and we get a port-down alert (those did fire last week when it died). I was forgetting that I also set up the OSPF alert not to fire in that scenario (it fires only if the interface status is seen as ok too). John cabled up a replacement device today, which brought the port physically up, but OSPF is still down as I've yet to configure the replacement device. In that scenario it fired, which is what we want.

So all seems ok! I think we need to make this paging but I will wait a week or two until we have confirmed everything is stable with these new devices before changing it. Thanks :)