webperf1001 alert "Service: too long since latest timing beacon" when switched over
Closed, Declined (Public)

Description

I presume this is because webperf2001 is currently the primary?

Event Timeline

Confirmed:

dpifke@webperf1001:~$ curl -s http://localhost:9230/metrics | grep latest
# HELP webperf_latest_handled_time_seconds UNIX timestamp of most recent message
# TYPE webperf_latest_handled_time_seconds gauge
webperf_latest_handled_time_seconds{schema="SaveTiming"} 1598968983.279599
webperf_latest_handled_time_seconds{schema="QuickSurveysResponses"} 1598968992.488641
webperf_latest_handled_time_seconds{schema="FirstInputTiming"} 1598968994.442974
webperf_latest_handled_time_seconds{schema="QuickSurveyInitiation"} 1598968995.142616
webperf_latest_handled_time_seconds{schema="NavigationTiming"} 1598968995.179199
webperf_latest_handled_time_seconds{schema="PaintTiming"} 1598968995.19738
dpifke@webperf1001:~$ date -d '@1598968995.179199'
Tue Sep  1 14:03:15 UTC 2020
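
For anyone repeating this check, here is a minimal sketch in Python that does the same thing as the curl/grep/date steps above: it parses the `webperf_latest_handled_time_seconds` lines and reports which schemas have gone quiet. The function names, the 600-second threshold, and the use of a pasted sample instead of a live fetch are all my assumptions for illustration, not how the actual alert is implemented.

```python
import re
import time
from datetime import datetime, timezone

METRIC = "webperf_latest_handled_time_seconds"
# Matches e.g.: webperf_latest_handled_time_seconds{schema="SaveTiming"} 1598968983.279599
LINE_RE = re.compile(r"^" + re.escape(METRIC) + r'\{schema="([^"]+)"\}\s+(\S+)$')

# Sample copied from the output above; in practice this text would come from
# http://localhost:9230/metrics.
SAMPLE = """\
# HELP webperf_latest_handled_time_seconds UNIX timestamp of most recent message
# TYPE webperf_latest_handled_time_seconds gauge
webperf_latest_handled_time_seconds{schema="SaveTiming"} 1598968983.279599
webperf_latest_handled_time_seconds{schema="QuickSurveysResponses"} 1598968992.488641
webperf_latest_handled_time_seconds{schema="FirstInputTiming"} 1598968994.442974
webperf_latest_handled_time_seconds{schema="QuickSurveyInitiation"} 1598968995.142616
webperf_latest_handled_time_seconds{schema="NavigationTiming"} 1598968995.179199
webperf_latest_handled_time_seconds{schema="PaintTiming"} 1598968995.19738
"""


def parse_latest(metrics_text):
    """Return {schema: unix_timestamp} for every matching metric line."""
    latest = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # comment lines (# HELP, # TYPE) and other metrics are skipped
            latest[match.group(1)] = float(match.group(2))
    return latest


def stale_schemas(latest, now, max_age_seconds):
    """Return the sorted schemas whose newest message is older than max_age_seconds."""
    return sorted(s for s, ts in latest.items() if now - ts > max_age_seconds)


if __name__ == "__main__":
    latest = parse_latest(SAMPLE)
    newest = max(latest.values())
    print("newest beacon:", datetime.fromtimestamp(newest, tz=timezone.utc).isoformat())
    # 600 seconds is a hypothetical threshold, not the one the real alert uses.
    print("stale schemas:", stale_schemas(latest, time.time(), 600))
```

On the inactive host every schema should show up as stale, which matches what the alert is reporting.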

This is arguably working as intended: webperf1001 stopped handling beacons once webperf2001 became the primary, so the correct action here is simply to silence the alert(s) as part of the switchover. I still need to think about whether there's a clean way to do that automatically.

It's not clear to me why this didn't trigger in codfw when the alert was first created.