We have detected a number of cases in which Prometheus metrics seem to disappear for no apparent reason, causing bogus alerts.
Some examples:
1) The `CephClusterInUnknown` alert uses this expression:
`expr: (ceph_health_status{job="ceph_eqiad"} or on() vector(-1)) == -1`
The alert is firing today; however, in Grafana we can see this:
Base metric:
{F57502361}
Alert expression: (I interpret 'no data' as meaning that no alert should fire)
{F57502365}
However, the alert is firing in Alertmanager:
{F57502431}
2) The `OpenstackAPIResponse` alert uses this expression:
`expr: avg_over_time((haproxy_server_total_time_average_seconds{job="cloudlb-haproxy", proxy!~"(mysql|wikireplica-db-(web|analytics)-s\\d)"} OR on() vector(100))[12h:]) > 99`
The alert is firing today; however, in Grafana we see this:
Base metric:
{F57502381}
Alert expression: (again, I interpret 'no data' as meaning that no alert should fire)
{F57502387}
However, the alert is showing in Alertmanager:
{F57502441}
3) The Neutron alerts reported in {T374513} and {T373878}
These appear to show the same behavior as the other two.
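
For reference, both expressions quoted above rely on the `or on() vector(N)` fallback pattern. Below is a rough Python model of how I understand that pattern to behave. It is a simplified sketch, not Prometheus code: the function names, the dict-based vectors, and the sample values are all made up for illustration. It assumes that `or on()` adds the fallback series only when the left-hand side is completely empty at a given evaluation step, and that `avg_over_time` over a subquery averages each resulting series separately.

```python
from typing import Dict, List

# Simplified instant vector: label-set string -> sample value (illustrative only).
Vector = Dict[str, float]


def or_on_empty(left: Vector, fallback: float) -> Vector:
    """Model `left or on() vector(fallback)`.

    With on() there are no matching labels, so any series in `left`
    suppresses the fallback; it appears only when `left` is empty.
    """
    if left:
        return dict(left)
    return {"{}": fallback}


def ceph_alert_fires(samples: Vector) -> bool:
    """Model `(metric or on() vector(-1)) == -1` at one evaluation step."""
    result = or_on_empty(samples, -1.0)
    return any(value == -1.0 for value in result.values())


def avg_over_steps(steps: List[Vector]) -> Vector:
    """Model `avg_over_time((...)[range:])`: average each series on its own,
    using only the steps at which that series has a sample."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for vec in steps:
        for labels, value in vec.items():
            sums[labels] = sums.get(labels, 0.0) + value
            counts[labels] = counts.get(labels, 0) + 1
    return {labels: sums[labels] / counts[labels] for labels in sums}


def api_alert_series(raw_steps: List[Vector]) -> Vector:
    """Model `avg_over_time((metric or on() vector(100))[range:]) > 99`."""
    steps = [or_on_empty(vec, 100.0) for vec in raw_steps]
    return {k: v for k, v in avg_over_steps(steps).items() if v > 99.0}


if __name__ == "__main__":
    # Metric present with a healthy value: no alert.
    print(ceph_alert_fires({'{job="ceph_eqiad"}': 0.0}))
    # Metric missing at evaluation time: the fallback makes the alert fire.
    print(ceph_alert_fires({}))
    # One empty step inside the window is enough to create a fallback series.
    print(api_alert_series([{"srv1": 0.2}, {}, {"srv1": 0.3}]))
```

If this model is right, either alert can fire whenever the metric is absent at rule evaluation time, and for the `avg_over_time` variant a single step with no series anywhere in the 12h window would be enough, because the fallback series is averaged only over the steps where it exists. This does not explain why Grafana shows 'no data' for the same expressions; it only narrows down what the rule evaluator would have to be seeing.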