
MessageBlobStore::clear() causes scaling problems on multi-server setups with CDB l10ncache
Closed, Resolved (Public)

Description

We had to disable MessageBlobStore::clear() on WMF and replace it with a maintenance script that runs upon sync. On multi-server setups where l10ncache is in CDB, LocalisationCache::recache() is run once per server per language, so the MessageBlobStore (MBS) gets cleared many times over. This led to DB deadlocks and possibly to other performance issues.

I guess the least we can do is offer a $wg variable to disable clear(). A better solution, suggested by Tim, would be to add CacheDependency::getModifiedTime(), add a way to retrieve the maximum mtime from LocalisationCache, and use that in the startup module to conditionally call MessageBlobStore::clear() before retrieving any module timestamps. This would scale because the startup module is cached for 5 minutes.
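The conditional clear Tim suggested could look roughly like the sketch below. It is written in Python purely for illustration and is not MediaWiki code: `BlobStore`, `max_l10n_mtime` and `maybe_clear` are hypothetical names, and it assumes the maximum mtime can be read from the cache files on disk.

```python
import os
from typing import Iterable, Optional


class BlobStore:
    """Hypothetical stand-in for MessageBlobStore; not the real API."""

    def __init__(self) -> None:
        self.blobs = {}            # module name -> message blob
        self.last_cleared = 0.0    # l10n cache mtime seen at the last clear
        self.clear_count = 0

    def clear(self) -> None:
        self.blobs.clear()
        self.clear_count += 1


def max_l10n_mtime(paths: Iterable[str]) -> Optional[float]:
    """Return the newest modification time among the cache files, or None."""
    mtimes = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    return max(mtimes) if mtimes else None


def maybe_clear(store: BlobStore, paths: Iterable[str]) -> bool:
    """Clear the store only if the l10n cache is newer than the last clear."""
    newest = max_l10n_mtime(paths)
    if newest is None or newest <= store.last_cleared:
        return False  # nothing changed since the last clear
    store.clear()
    store.last_cleared = newest
    return True
```

With a check like this, repeated recaches that leave the newest mtime unchanged would trigger at most one clear, instead of one per server per language.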


Version: 1.18.x
Severity: critical

Details

Reference
bz27320

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 21 2014, 11:24 PM)
bzimport set Reference to bz27320.
Catrope created this task. (Feb 11 2011, 8:10 AM)

The maintenance script run upon sync (clearMessageBlobs.php) is regularly causing s1 master overload, for a few minutes after each execution. Increasing priority.

Related thread on ops mailing list from 05 Apr 2013:
"outage this evening - possible localization updates issue"

Roan: Do you plan to look into this, as it's assigned to you (probably by default at that time)? If not, wondering who could.

Tim/Roan, do we need to disable localisation updates til this is resolved?

Result of conversation on #wikimedia-tech: Brad is going to take a crack at fixing this tomorrow. In the meantime, we would like to disable updates so that we're not taking down the site daily.

Related URL: https://gerrit.wikimedia.org/r/58604 (Gerrit Change I7ed047a3802c7186eb0c040556022e58b266a2be)

Related URL: https://gerrit.wikimedia.org/r/58660 (Gerrit Change I50d366a03af649bc87158dde4516aae1a2c24924)

Related URL: https://gerrit.wikimedia.org/r/58909 (Gerrit Change I3b6ae12875f2f323210fdfba36c5c5d9183588e2)

Related URL: https://gerrit.wikimedia.org/r/58910 (Gerrit Change I72a1557f9c18b845c952dd2e2697d92e8eb71d93)

Related URL: https://gerrit.wikimedia.org/r/58911 (Gerrit Change Ic633a7fde8d4a1d9e9326aa5ae52bf1227e8d30f)

Brad has a proposed fix for this with Gerrit #58911. We plan to leave localization update disabled over the weekend, giving Tim time to review this on his Monday. If Tim thinks this is worth a shot and is up for deploying it, he can do that. Otherwise, we'll figure out what to do on Monday in the U.S.

There are three patches in Gerrit related to this.

Gerrit change 58909 adds a new script to the WikimediaMaintenance extension. This new script updates the RL message cache directly, instead of wiping it out and relying on client requests to repopulate it.

Gerrit change 58910 adjusts l10nupdate to preserve the timestamps on the l10n cdb files from LocalisationUpdate when copying them into position. This should improve the efficiency of the new script, since it will allow it to skip updating messages in languages that haven't actually changed.
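To illustrate what preserving timestamps during a copy means, here is a minimal Python sketch. It is not how l10nupdate does it (that mechanism is not shown here); `copy_preserving_mtime` is a hypothetical helper built on the standard library's `shutil.copy2`, which copies file metadata along with the contents.

```python
import os
import shutil


def copy_preserving_mtime(src: str, dst: str) -> None:
    """Copy src to dst, keeping the modification time (similar to `cp -p`)."""
    shutil.copy2(src, dst)
```

A plain copy would stamp every destination file with the time of the copy, so every language would look freshly modified on each run.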

Gerrit change 58911 changes l10nupdate to actually call the new script. Note that 58909 must be deployed to all wikis (so likely 1.22wmf1 and 1.22wmf2) before this is deployed.
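Taken together, the idea behind the first two changes (update in place, and skip languages whose files did not change) can be sketched as below. This is an illustrative Python sketch under assumptions, not the actual WikimediaMaintenance script: `refresh_blobs`, its parameters, and the per-language blob layout are all hypothetical.

```python
from typing import Callable, Dict, List


def refresh_blobs(
    blobs: Dict[str, str],
    cache_mtimes: Dict[str, float],
    last_run: float,
    rebuild: Callable[[str], str],
) -> List[str]:
    """Rebuild blobs in place for languages whose cache file changed.

    Entries for unchanged languages are left untouched, so the store is
    never wiped and client requests never have to repopulate it.
    """
    updated = []
    for lang, mtime in sorted(cache_mtimes.items()):
        if mtime <= last_run:
            continue  # cache file unchanged since the last run, skip it
        blobs[lang] = rebuild(lang)
        updated.append(lang)
    return updated
```

The skip only helps if the file timestamps reflect real changes, which is why preserving them when copying matters.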

https://gerrit.wikimedia.org/r/58909 (Gerrit Change I3b6ae12875f2f323210fdfba36c5c5d9183588e2) | change APPROVED and MERGED [by Aaron Schulz]

All three patches have been merged.

Now we're waiting to see how the new code does.

(In reply to comment #16)

anomie: So can we say already if it works (/close this ticket)? :)

(In reply to comment #17)

So far, so good. Since April 17, the #wikimedia-operations logs no longer show the icinga notifications about all of the apaches being down around the time of the update run. Those notifications were a hallmark of the problem before the RL cache purge was disabled on April 11.

I'm inclined to be cautious and wait until the 24th, making it a full week with no issues, but if someone wants to close before then I wouldn't complain.

/me takes on Brad's offer to close this. :-)