Page MenuHomePhabricator

Post-mortem "MWException: No localisation cache found for English."
Closed, ResolvedPublic

Description

Between 2019-02-28T14:40 and 2019-02-28T14:45, this exception shows up in production for some load.php requests all HTTP requests on mw1272.

I've never seen this error before and it seems rather concerning this can happen.

Seems like the cache somehow disappeared on a pooled production app server. Or perhaps a new branch enabled without the localisation cache being created first (which is a fatal error, given we've configured MW to disallow computing/populating LCStore on-demand.)


Update: The time frame correlates with Server Admin Log entries:

  • 14:41 hashar@deploy1001: Synchronized php: group1 wikis to 1.33.0-wmf.19 (duration: 00m 53s).
  • 14:46 hashar@deploy1001: scap sync-l10n completed (1.33.0-wmf.19) (duration: 03m 33s).
  • 14:46 hashar: mw1272 had /srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
  • 14:47 hashar: mw1272 fixed by running "scap sync-l10n" from deploy host.

What led to this problem, and how can we prevent this in the future?

Event Timeline

Krinkle created this task.Mar 6 2019, 12:06 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 6 2019, 12:06 AM
Krinkle updated the task description. (Show Details)
Krinkle moved this task from On-going to Follow-up prevention on the Wikimedia-Incident board.
Krinkle moved this task from Untriaged to Resolved on the Wikimedia-production-error board.
Krinkle added a subscriber: hashar.

When deploying MW-1.33-notes (1.33.0-wmf.19; 2019-02-26) I have spotted the error via logstash (iirc). From the wikimedia-operations logs:

[14:45:46] <hashar>	 so somehow wmf.19 lost the l10n cache for english language
[14:46:02] <hashar>	 well at least for some code paths / wikis / server
[14:46:27] <hashar>	 that is just on mw1272
[14:46:28] <logmsgbot>	 !log hashar@deploy1001 scap sync-l10n completed (1.33.0-wmf.19) (duration: 03m 33s[14:46:44] <hashar>	 !log mw1272 had /srv/mediawiki/php-1.33.0-wmf.19/includes/cache/localisation/LocalisationCache.php:475) No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php.
[14:47:25] <hashar>	 !log mw1272 fixed by running "scap sync-l10n" from deploy host
[14:50:40] <_joe_>	 hashar: I think mw1272 was down/in manitenance for some time
[14:53:01] <hashar>	 _joe_: yeah I am assuming that as well. anyway syncing l10n did bring it up to date ;)

The reason is that mw1272.eqiad.wmnet has some hardware issue which is T211668. History from IRC:

20190222
[06:11:28] <icinga-wm>	 PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[06:51:38] <marostegui>	 !log Power cycle mw1272 as it crashed - T211668
[07:15:05] <_joe_>	 !log deactivating mw1272, memory problems
[07:23:38] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group
[09:32:49] <_joe_>	 !log set pooled=inactive on mw1272, T211668

The mediawiki-installation dsh group alarm get acknowledged later on:

20190225
[08:04:44] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group Muehlenhoff T211668

Then eventually:

20190228
[10:53:31] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK

Or in short, the server has hardware issue. It got depooled on Feb 22th at 09:32 UTC.

Eventually it got repooled ((somehow??) on Feb 28th around 10:53.

hashar added a comment.Mar 6 2019, 8:08 AM

From scap logs. The server had scap activity on 2019-02-28 12:18 which is the SWAT deploy:

12:18 	<zfilipin@deploy1001> 	Synchronized wmf-config/: SWAT: [[gerrit:491959|Show referencePreviews on group0 wikis as beta feature (T214905)]] (duration: 00m 56s)

I had the cdb cache updated manually:

2019-02-28T14:46 Updated 416 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.19/cache/l10n

And from logstash by filtering for _type:scap host:mw1272 channel:merge_cdb_updates, the past cache generations are:

2019-02-19T20:29:18merge_cdb_updatesUpdated 415 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.17/cache/l10n
2019-02-19T20:29:52merge_cdb_updatesUpdated 415 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.18/cache/l10n
2019-02-28T14:46:02merge_cdb_updatesUpdated 416 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.19/cache/l10n
2019-03-05T15:20:29merge_cdb_updatesUpdated 416 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.19/cache/l10n
2019-03-05T15:21:02merge_cdb_updatesUpdated 416 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.20/cache/l10n

MW-1.33-notes (1.33.0-wmf.19; 2019-02-26) has been first published on Feb 26th, when mw1272 was not pooled. Thus it the cache never got generated. When the server got later pulled back, I think we missed running a scap pull and the cache generation.

So that is a glitch in how we repool a MediaWiki server? I am not sure what the process is beside flagging the host as active in conftool, I guess some extra steps would be required.

hashar awarded a token.Mar 6 2019, 8:09 AM
hashar moved this task from To Triage to Localization on the Deployments board.
hashar moved this task from INBOX to Kanban on the Release-Engineering-Team board.
hashar moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

I don't know when or by who the server was repooled after hardware maintenance, I only acked the Icinga alert for the dsh group at some point

greg triaged this task as Medium priority.Mar 6 2019, 9:54 PM
greg added a subscriber: greg.Mar 13 2019, 9:27 PM

So, I guess the question is: What happens when a mw app server is repooled? Does the repooling automatically do a scap pull to get it in a consistent state with the rest of the fleet? I guess this is a question for @Joe .

hashar added a comment.EditedMar 25 2019, 4:03 PM

Took me a while to come back to this task.

So as I wrote in my previous comment, the issue is the server was not in the pool when we generated the l10n cache for the MediaWiki version to come. When the server got pooled again, it missed the l10n cache hence the error.

I don't think scap pull triggers a l10n cache build. So most probably we will want some new utility/command to orchestrate pooling a server (scap pull + l10ncache update + gitinfo update etc).

To clarify: I just witnessed the issue, I don't have plan to work on a solution.

Change 508476 had a related patch set uploaded (by Catrope; owner: Catrope):
[operations/mediawiki-config@master] Revert "Disable MessageBlobStore::clear() via hook"

https://gerrit.wikimedia.org/r/508476

Change 508488 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Catrope added a subscriber: Catrope.May 7 2019, 1:29 AM

Wrong bug, sorry. Neither of these patches addresses this bug, although 508488 is semi-related.

Change 508488 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Change 508488 merged by jenkins-bot:
[mediawiki/tools/scap@master] Clear MessageBlobStore after syncing i18n data

https://gerrit.wikimedia.org/r/508488

Krinkle closed this task as Resolved.May 31 2019, 7:45 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM