This morning, at 09:13 UTC, I performed this sync:
10:13 <+logmsgbot> !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab87780: frwiki: Fix tagline height and width (T272907) (duration: 00m 58s)
Several minutes after, icinga alerted:
09:16 <+icinga-wm> PROBLEM - Apache HTTP on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers 09:16 <+icinga-wm> PROBLEM - PHP7 rendering on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 09:16 <+icinga-wm> PROBLEM - PHP7 rendering on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 09:17 <+icinga-wm> PROBLEM - Apache HTTP on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers 09:17 <+icinga-wm> PROBLEM - PHP7 rendering on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 09:17 <+icinga-wm> PROBLEM - Apache HTTP on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers 09:18 <+icinga-wm> PROBLEM - Apache HTTP on mw2320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers 09:19 <elukey> I see something like 09:19 <elukey> Jan 26 09:19:20 mw2320 php7.2-fpm: PHP Fatal error: require(): Failed opening required '/srv/mediawiki/wmf-config/logos.php'
However, despite the timing, the icinga alerts were not at all relevant to my sync (any sync of IS.php would make this issue appear). Thanks to the error @elukey pasted to IRC, it was obvious for me what happened, since I was familiar with a previous change @Legoktm did.
Last night, @Legoktm performed the following syncs, deploying
59e2387 et seq.:
00:34 <legoktm@deploy1001> Synchronized wmf-config/CommonSettings.php: Invalidate configuration cache when logos.php is touched too (duration: 00m 56s) [production]
00:32 <legoktm@deploy1001> Synchronized wmf-config/logos.php: Add script to mostly automate logo management (duration: 00m 55s) [production]
00:16 <legoktm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 01m 00s) [production]
00:14 <legoktm@deploy1001> Synchronized wmf-config/logos.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 00m 56s) [production]
At around the same time, @Dzahn reimagined the affected MediaWiki servers (see T245757#6775483 et seq.). For some reason currently unknown to me, scap pull was not ran before the servers were put back into service (happened 0:46 UTC, ie. 10 minutes after @Legoktm made the change), or did not pick lego's change.
Once I synced the IS.php change, the affected MW servers got lego's change as well, but without the required logos.php file (which was newly introduced by lego), thus, icinga alerted.
The issue I see is that the alert fired much later after the root cause was introduced to the infrastructure (ie. servers with drifted /srv/mediawiki pooled), and colerated with an unrelated change. If this happened with servers in the active DC, it would cause partial outage.
I think we should make icinga alert when /srv/mediawiki differs at any of the app servers, which would point to the issue much more sooner. However, I'm not sure how to easily calculate the hash of that directory :/.