
Aggregate check_mw_versions alerts for each individual app server
Open, Medium, Public

Description

Recently, an outstanding bug in Scap (T223287) caused a deployment to pause for longer than the configured deploy-time setting for check_mw_versions. This flooded IRC with alerts until the deployment was resumed.

This alert seems a good candidate for aggregation.

Event Timeline

While it's clear that 400 alerts flooding production IRC are not great, this check is important for every single machine. So we can aggregate the output, but we can't suppress it: we need to know *very clearly* if even a single machine is running an outdated version of MediaWiki.
So I second the aggregation, provided the Icinga alert clearly shows which machine (or machines, if the number of failing hosts is below, say, 90% of all mw servers) is failing the check.
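To make that concrete, here is a minimal sketch of the kind of aggregation described above, assuming a hypothetical script that already has per-host check_mw_versions results in hand; the function names, the 90% threshold constant, and the toy host list are illustrative, not the actual Icinga/NRPE configuration:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: roll per-host check_mw_versions results up into a
single Icinga-style check while keeping failing hostnames visible."""

import sys

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2

# If fewer than this fraction of the fleet is failing, list the hostnames
# explicitly (per the "say, 90%" suggestion above); beyond that, a count is
# more readable than hundreds of names.
LIST_HOSTS_BELOW = 0.90


def aggregate(results: dict[str, bool]) -> tuple[int, str]:
    """results maps hostname -> True if that host passes check_mw_versions."""
    failing = sorted(host for host, ok in results.items() if not ok)
    total = len(results)
    if not failing:
        return OK, f"OK: all {total} app servers run the expected MediaWiki versions"

    if len(failing) / total < LIST_HOSTS_BELOW:
        detail = ", ".join(failing)
    else:
        # Near-fleet-wide failure: individual hostnames add little signal.
        detail = f"{len(failing)}/{total} app servers"
    return CRITICAL, f"CRITICAL: outdated MediaWiki version on {detail}"


if __name__ == "__main__":
    # Toy input; in practice this would come from the per-host checks.
    status, message = aggregate({"mw1414": True, "mw1415": False, "mw1416": True})
    print(message)
    sys.exit(status)
```

The threshold only controls verbosity: failing hostnames stay front and center in the alert text unless the failure is effectively fleet-wide.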

It would all be simpler if we just had a single git repository representing the production MediaWiki code.

fgiunchedi renamed this task from "check_mw_versions alerts for each individual app server" to "Aggregate check_mw_versions alerts for each individual app server". Oct 22 2021, 1:30 PM

Ref T310225: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions).

Ref https://wikitech.wikimedia.org/w/index.php?title=Monitoring%2Fcheck_dsh_groups&diffonly=0&diff=1987914&oldid=1834094#Inactive_servers.

From today's adventure, I learned that there's an impression that this alert fires noisily at times and has a tendency of being ignored or ack'ed for a depooled/inactive server. That seems at odds with the rationale and background behind the alert, as also alluded to by @Joe above. I'll add a +1 here for not ignoring useful alerts, but also note that we need to fix the noise if that still happens, and that for the runbook to be actionable, the hostname very much needs to be front and center in the alert.
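For the depooled/inactive-server noise specifically, one possible shape for the fix, again as a hypothetical sketch layered on the one above (the host-to-state mapping and the "active" label are assumptions, not the real conftool or check_dsh_groups data model):

```python
# Hypothetical continuation of the sketch above: drop hosts that are not
# actually serving traffic before aggregating, so a depooled or otherwise
# inactive server cannot keep the aggregated check permanently noisy.

def filter_active(results: dict[str, bool], host_states: dict[str, str]) -> dict[str, bool]:
    """Keep only hosts whose state says they should be serving MediaWiki."""
    return {host: ok for host, ok in results.items() if host_states.get(host) == "active"}
```

Whether that exclusion belongs in the check itself or upstream in how the host list is generated is left open here; the sketch only shows where it would sit relative to the aggregation.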