Page MenuHomePhabricator

Messages randomly missing in some languages when populating MessageBlobStore
Closed, ResolvedPublic40 Story Points

Description

This bug causes citoid to "disappear" from a wiki and for the default VisualEditor citation tools to appear instead.

For some reason, the message goes randomly missing for some languages; if this happens to be the site language, then this causes citoid to be missing for most users.

The site language has gone missing in the past for both en wiki and sv wiki.

This can be fixed for any given wiki by clearing the cached version of the message. Most logged-in users have the correct permissions to clear the cache. You can do this using the api by doing:

https://en.wikipedia.org/w/api.php?action=purge&titles=MediaWiki:Citoid-template-type-map.json

(Substitute your site's language prefix for 'en' here.) Please continue to report outages here as well even if you fix them, as we don't know what causes the issue and reporting outages can help with diagnostics.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Mvolz added a comment.Jul 25 2015, 3:23 PM

Another outage this morning was reported here by @Josve05a: T106938

Again it was the language the wiki was in, sv on sv.wikipedia.org.

If we can't figure out what's causing this is there some other hack we can use? Or as @Ainali notes on that ticket, some way of monitoring this so we are alerted to it?

Elitre added a comment.Aug 3 2015, 6:22 AM

Can someone please put the purge suggestion in the description if that works?

Mvolz updated the task description. (Show Details)Aug 3 2015, 7:21 AM
Mvolz added a comment.Aug 3 2015, 7:24 AM

Can someone please put the purge suggestion in the description if that works?

Done. Feel free to edit further for clarity. Also might be a good idea to link this on-wiki and mediawiki docs where ever relevant; so far we've caught citoid being broken on en wiki, where we have a lot of testers/bug reporter, and sv wiki, where we have one very active tester/bug reporter, which makes me think this might be happening under the radar elsewhere.

Yesterday evening (CET) we had a citoid outage at Polish Wikipedia. No idea what went wrong, citoid window was simply not there in VE when you clicked on "Przypis" (Polish for "Reference"). I reported it at https://www.mediawiki.org/wiki/Topic:Sgikbv81nxsv7oy6, but it seems the problems disappeared on their own this morning.

Elitre added a comment.EditedAug 3 2015, 10:51 AM

It is my understanding that we need specific details to be able to understand what happens; Marielle, James etc., please list them here so that we know what to ask users, thank you.

Mvolz added a comment.Aug 3 2015, 2:37 PM

It is my understanding that we need specific details to be able to understand what happens; Marielle, James etc., please list them here so that we know what to ask users, thank you.

@Krinkle would have the best idea of what information we need- but for now I'd say just which wiki, when it happened, and what language (i.e. site language or if they were using a different language in preferences).

Josve05a added a comment.EditedAug 6 2015, 10:32 AM

Happening on enwp right now for me. I haven't changed my lang.

Yup, just started happening on enwp for me, even in a private window.

Mvolz added a comment.Aug 6 2015, 10:43 AM

I purged it, but it's still happening for me in a private window unless I add debug=true as a parameter.

Now it works on enwp again.

CFCF added a subscriber: CFCF.Aug 6 2015, 3:13 PM
This comment was removed by Josve05a.

Currently Citoid is down for me at en-Wiki (FF 39.0, Windows XP, vector skin, desktop).
The main menu shows "Cite" with the usual small arrow to open the menu.
Inside the menu the list shows as a plain list with only 1 column: Website, Book, News, Journal, Basic, Re-use with their respective icons to the left. No "Citoid" function at all.

If you need further info, please let me know.

Quick update: purged per the advice on top of this thread (thanks) and Citoid and regular UI are available again.

Happening on live again

Now on sv.wp again

And again on enwiki just now.

If clicking on the link provided in the description is enough, that's what I did, but it took some mins for the change to be effective (maybe something we should add in the description).

Now reported on de.wp. I'm applying the workaround.

Mvolz added a comment.Aug 26 2015, 8:02 PM

@Krinkle have you had a chance to examine the bug in production yet? Do you need us to stop clearing the cache so you have a live example?

@Krinkle have you had a chance to examine the bug in production yet? Do you need us to stop clearing the cache so you have a live example?

Indeed. I understand why we always clear it (it's production after all) but since I've exhausted all obvious places to look based on static analysis of the code, I can't narrow it down further until we have live specimen to hunt.

However I'm not always around to give it a try, so unless I'm online and available (ping me on IRC), it's okay to try purging it in order to restore Citoid functionality.

Having to ping you on IRC was a step I was not aware of. We can all consider doing that the next time this happens.

I actually just found one, and Krinkle is looking into it now.

That was nl.wp, and Krinkle just fixed it.

Krinkle added a comment.EditedAug 26 2015, 9:35 PM

I confirmed that this bug is not caused by a race condition in Varnish during a deployment (e.g. the first request coming in when deployment sync is still busy). Even when bypassing Varnish with a cache-busting query parameter, the fresh response from MediaWiki directly has the broken message.

I did narrow it down to the MessageBlobStore database tables. When retrieving it from there directly, the cached blob consistently contains the broken <key> fallback value both for "nl" (the content language of nl.wikipedia.org) as well as for "en".

$ mwscript eval.php --wiki nlwiki
> $configFactory = ConfigFactory::getDefaultInstance();
> $rl = new ResourceLoader( $configFactory->makeConfig( 'main' ) );

> $m = MessageBlobStore::getInstance();
> require_once "$IP/tests/phpunit/includes/TestingAccessWrapper.php";
> $m = TestingAccessWrapper::newFromObject( $m );

> return $m->getFromDB( $rl, array( 'ext.citoid.visualEditor' ), 'nl' );
array(1) {
  ["ext.citoid.visualEditor"]=>
  string(1174) "{"citoid-520-error":"Het was niet mogelijk de gegevens van de opgegeven identifier op te halen.",
  .."citoid-template-type-map.json":"\u003Ccitoid-template-type-map.json\u003E",..}"
}
> return $m->getFromDB( $rl, array( 'ext.citoid.visualEditor' ), 'en' );
array(1) {
  ["ext.citoid.visualEditor"]=>
  string(1072) "{"citoid-520-error":"Unable to retrieve information from the provided identifier.",
  .."citoid-template-type-map.json":"\u003Ccitoid-template-type-map.json\u003E", ..}

Purging the MediaWiki namespace page on-wiki seems to fix it. Following that code path: ActionPurge -> doPurge::doPurge:

 MessageCache::singleton()->replace( $this->mTitle->getDBkey(), $text );
...
 new MessageBlobStore()
    ->updateMessage( $wgContLang->lcfirst( $msg ) );

Which replaces the database entry directly from the in-memory wikitext value during the purge. This then bumps the timestamp of the affected MessageBlobStore rows, which then propagates through the ResourceLoader startup module to actual users (within 5 minutes).

The problem must reside in the population of MessageBlobStore. The source for MessageBlobStore is wfMessage($key)->inLanguage($lang)->plain().

I suspect the following race condition exists

When the Citoid VE module is changed and this change is deployed (through the train or SWAT). For some reason, during this deployment there is a brief moment during which the localisation cache or message cache of one or more app servers is not working properly and misses this key. This then stays permanently in the MessageBlobStore until it has a reason to vacate the value (e.g. a purge or another deployment).

I also considered a race condition related to localisation cache being rebuild (during nightly LocalisationUpdate, or during the branch cut during first phase of the weekly train). However these values are not stored in LocalisationCache (it's a local wiki override, not from translatewiki). These are stored in MessageCache.

Elitre added a comment.Sep 1 2015, 2:13 PM

@Krinkle, does your explanation above also cover why only some languages are affected, while others aren't?

Mvolz added a comment.Sep 1 2015, 3:14 PM

@Krinkle, if it's a race condition on deploy, wouldn't it show up immediately after a deploy? And deploy on en wiki occurs on Thursdays right? We had a report of citoid missing on en wiki (T104944) on a Tuesday and it's highly unlikely it went 5 days without being reported... although possible I guess... I also remember one that was a Sat morning or something.

@Elitre, I think so, since each message is a separate row in the DB, it could miss some languages and not others.

The new branch from the MediaWiki deployment train reaches enwiki on Thursday indeed. On Tuesday, when the new branch is created, we use Scap to regenerate the localisation caches. I believe this also re-creates the localisation cache for the "older" branch. And there is also the nightly localisation update.

Change 236028 had a related patch set uploaded (by Esanders):
Provide alternative access to citoid-template-type-map.json

https://gerrit.wikimedia.org/r/236028

Change 236028 merged by jenkins-bot:
Provide alternative access to citoid-template-type-map.json

https://gerrit.wikimedia.org/r/236028

Change 236218 had a related patch set uploaded (by Jforrester):
Provide alternative access to citoid-template-type-map.json

https://gerrit.wikimedia.org/r/236218

Change 236221 had a related patch set uploaded (by Alex Monk):
Actually use new backup message

https://gerrit.wikimedia.org/r/236221

Change 236221 merged by jenkins-bot:
Actually use new backup message

https://gerrit.wikimedia.org/r/236221

Change 236218 merged by jenkins-bot:
Provide alternative access to citoid-template-type-map.json

https://gerrit.wikimedia.org/r/236218

Change 236222 had a related patch set uploaded (by Jforrester):
Actually use new backup message

https://gerrit.wikimedia.org/r/236222

Change 236222 merged by jenkins-bot:
Actually use new backup message

https://gerrit.wikimedia.org/r/236222

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High.Sep 4 2015, 10:54 PM

We just pushed out a patch which will hopefully cover over this issue whilst we find out the cause, so I'm reducing the priority of this down to merely "High".

Krinkle removed Krinkle as the assignee of this task.Sep 15 2015, 6:18 PM
Krinkle lowered the priority of this task from High to Low.Sep 18 2015, 8:27 PM
Krinkle removed a project: Patch-For-Review.

Change 252159 had a related patch set uploaded (by Krinkle):
resourceloader: Log if MessageBlobStore failed to fetch a message

https://gerrit.wikimedia.org/r/252159

hashar removed a subscriber: hashar.Nov 10 2015, 10:03 AM

Change 252159 merged by jenkins-bot:
resourceloader: Log if MessageBlobStore failed to fetch a message

https://gerrit.wikimedia.org/r/252159

Change 252393 had a related patch set uploaded (by Krinkle):
resourceloader: Log if MessageBlobStore failed to fetch a message

https://gerrit.wikimedia.org/r/252393

Change 252393 merged by jenkins-bot:
resourceloader: Log if MessageBlobStore failed to fetch a message

https://gerrit.wikimedia.org/r/252393

Mvolz added a comment.Sep 30 2016, 2:58 PM

Is this resolved? I notice the fall back message was removed from citoid, so I assume so?

Jdforrester-WMF closed this task as Resolved.Oct 3 2016, 9:29 PM
Jdforrester-WMF claimed this task.

Yes, I'm content declaring this fixed.