Fix NavtimingStaleBeacon false alarms, add test


Fix NavtimingStaleBeacon false alarms, add test

The previous version of this alert was calculated using
min(webperf_latest_handled_time_seconds), since we apply a label for
each schema and we care about the oldest, i.e. we want the alert to
fire if *any* schema is not receiving events.

But we also apply a label for each site, and Navtiming only runs in a
single datacenter. We thus got a false alarm after the switchover
from eqiad to codfw a couple of days ago, because the {site="eqiad"}
values were too old.

When we used Icinga alerts, we tried to work around this with a hiera
flag in Puppet that would suppress the alert for the non-active
datacenter (Iaef79cf33d), but a) we no longer use Puppet to generate
the configuration, and b) we would have forgetten to flip the flag

The better fix for this is to group the alert by the schema label, and
use max() instead of min(). The alert should now only fire if any
schema is out of date in *both* sites.

I was able to create a unit test for this by taking advantage of
undocumented promtool behavior: time() doesn't actually use the
current time when running the tests, it instead returns the number of
iterations. (See the linked Github issue in webperf_test.yaml, which
implies this isn't a bug, and thus should be safe to depend on.)

I also took this opportunity to fix some templates which were using
"$site" instead of "{{ $labels.site }}", the former a static string
and the latter the correct template syntax.

Change-Id: I04ceb7c7939269060c5da0b210fddb2e7904010f


dpifkeAuthored on Jun 30 2021, 10:18 PM
fgiunchediCommitted on Aug 2 2021, 3:20 PM
rOALE7105a61bfb6f: Import alerts for Thanos rule
Reverted By
rOALE28b276069df9: Revert "Fix NavtimingStaleBeacon false alarms, add test"