
Alerts for drifts in /srv/mediawiki
Closed, Duplicate (Public)

Description

This morning, at 09:13 UTC, I performed this sync:

10:13 <+logmsgbot> !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab87780: frwiki: Fix tagline height and width (T272907) (duration: 00m 58s)

Several minutes after, icinga alerted:

09:16 <+icinga-wm> PROBLEM - Apache HTTP on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
09:16 <+icinga-wm> PROBLEM - PHP7 rendering on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
09:16 <+icinga-wm> PROBLEM - PHP7 rendering on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
09:17 <+icinga-wm> PROBLEM - Apache HTTP on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers
09:17 <+icinga-wm> PROBLEM - PHP7 rendering on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
09:17 <+icinga-wm> PROBLEM - Apache HTTP on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
09:18 <+icinga-wm> PROBLEM - Apache HTTP on mw2320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
09:19 <elukey> I see something like
09:19 <elukey> Jan 26 09:19:20 mw2320 php7.2-fpm: PHP Fatal error:  require(): Failed opening required '/srv/mediawiki/wmf-config/logos.php'

However, despite the timing, the icinga alerts were not actually caused by the content of my sync (any sync of IS.php would have triggered this issue). Thanks to the error @elukey pasted to IRC, it was obvious to me what had happened, since I was familiar with a previous change @Legoktm made.

Last night, @Legoktm performed the following syncs, deploying 59e2387 et seq.:

00:34 <legoktm@deploy1001> Synchronized wmf-config/CommonSettings.php: Invalidate configuration cache when logos.php is touched too (duration: 00m 56s) [production]
00:32 <legoktm@deploy1001> Synchronized wmf-config/logos.php: Add script to mostly automate logo management (duration: 00m 55s) [production]
00:16 <legoktm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 01m 00s) [production]
00:14 <legoktm@deploy1001> Synchronized wmf-config/logos.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 00m 56s) [production]

At around the same time, @Dzahn reimaged the affected MediaWiki servers (see T245757#6775483 et seq.). For a reason currently unknown to me, scap pull was not run before the servers were put back into service (that happened at 00:46 UTC, i.e. about 10 minutes after @Legoktm made the change), or it did not pick up @Legoktm's change.

Once I synced the IS.php change, the affected MW servers received @Legoktm's change as well, but without the newly introduced logos.php file it requires; hence the icinga alerts.

The issue I see is that the alert fired long after the root cause was introduced to the infrastructure (i.e. servers with a drifted /srv/mediawiki were pooled), and it correlated with an unrelated change. If this had happened with servers in the active DC, it would have caused a partial outage.

I think we should make icinga alert when /srv/mediawiki differs on any of the app servers, which would point to the issue much sooner. However, I'm not sure how to easily calculate the hash of that directory :/.
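One way to get such a fingerprint (a sketch of the idea, not what WMF monitoring actually does; it only assumes standard coreutils/findutils on the hosts) is to hash every file under the tree and then hash the sorted list of per-file hashes:

```shell
#!/bin/sh
# Sketch: compute a single fingerprint for a directory tree, so that
# copies on different app servers can be compared. The check itself
# (running this on each host and comparing the outputs) is hypothetical.
dir_hash() {
    # Relative paths (cd first) keep the fingerprint comparable across
    # hosts; NUL-delimited sort gives a stable file order.
    ( cd "$1" && find . -type f -print0 | sort -z \
        | xargs -0 sha256sum | sha256sum | cut -d' ' -f1 )
}

# e.g. run dir_hash /srv/mediawiki on every app server and compare
```

Two trees with identical contents produce identical fingerprints; any drifted file changes the hash, which is what an icinga check could compare across hosts.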

Event Timeline

It already does the things suggested by @Urbanecm; it's just that the Icinga check doesn't run every 5 minutes, but only every couple of hours, I think.

19:25 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
19:26 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw2321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
19:27 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers

fixed!

It was just unlucky timing. This would only happen if the servers were down right during a deployment. Sorry, I should have checked better by asking jouncebot, and run "scap pull" every time just in case. It wasn't the active DC though, so it's mostly just monitoring noise. I will do that from now on, so it shouldn't happen anymore.

re: monitoring: Already solved; it just doesn't run as often as maybe expected.

Boldly reopening this.

It already does the things suggested by @Urbanecm; it's just that the Icinga check doesn't run every 5 minutes, but only every couple of hours, I think.

That should give this alert more than enough time to fire. See a short timeline:

  • 00:16 UTC, @Legoktm syncs InitialiseSettings.php
  • 00:46 UTC, @Dzahn pools a server with an outdated /srv/mediawiki copy, which introduces the bug (alert should fire now)
  • 09:16 UTC, @Urbanecm syncs an IS.php change, which activates @Legoktm's change (the affected codfw MW servers get a broken version; Apache-level alerts fire)
  • 09:24 UTC, @Urbanecm fixes that by syncing logos.php, which brings it to the affected MW servers

So the first alert that pointed to the issue fired about 8 hours after @Dzahn pooled the servers. If the check runs every couple of hours, it should have run three to four times in that window. That did not happen.

19:25 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
19:26 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw2321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
19:27 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers

Those appear to be wikiversions.json alerts? wikiversions.json did not really play a role in this case :-). InitialiseSettings.php did. IMO, most of the files in /srv/mediawiki should be protected by an alert, as most of them can cause an issue like this. In this case it was only codfw, but if it happens in eqiad... we're partially down immediately.

[...]
re: monitoring: Already solved, just doesn't run as often as maybe expected.

See above; it's not really resolved.

Dzahn removed Dzahn as the assignee of this task. (Jan 26 2021, 8:22 PM)

Probably just that downtimes expired 8 hours later.

I think there is not much to fix here besides "remember to scap pull"; the other part, about MW versions, would be T218412.

Probably just that downtimes expired 8 hours later.

I think there is not much to fix here besides "remember to scap pull"; the other part, about MW versions, would be T272967.

Does "MW versions" include the status of the operations/mediawiki-config copy (i.e. /srv/mediawiki rather than /srv/mediawiki/php-XXX)? If so, we can close this as redundant to the task you linked in the previous comment. If it doesn't (and I normally wouldn't call that a MediaWiki version, as it's not MW-version related), then I think this should stay open, or the other task's scope should be broadened.

I don't know. I think the answer to that question would best be found on T218412.

Running scap pull after reimaging will definitely take care of the immediate problem.
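That step could also be wired into whatever repool procedure is used, so that pooling cannot happen without a successful sync. A minimal sketch of the idea; `pool` and the helper name are illustrative stand-ins here, not an existing WMF tool:

```shell
#!/bin/sh
# Sketch of a repool guard: only pool the server after `scap pull`
# has refreshed /srv/mediawiki successfully. `pool` stands in for
# whatever conftool-based repooling command is in use; both command
# names are assumptions, not a documented WMF workflow.
safe_repool() {
    # && guarantees ordering: a failed sync aborts before pooling.
    scap pull && pool
}
```

The point is simply that the sync and the repool become one atomic habit instead of two separately remembered steps.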

I read through T218412: Define a mediawiki "version", and it looks like that's mostly what's being asked for in this ticket: figuring out how to define the state of all deployed code as a single version so it can be compared across hosts (and then adding monitoring accordingly). So I think this is probably a dupe of that, but I'll leave that call to @Urbanecm.

running 'scap pull' on all hosts that are being re-pooled