Page MenuHomePhabricator

Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear
Closed, ResolvedPublicPRODUCTION ERROR

Description

After fixing T102229: MessageGroupStats::clear performs the same deletion on all mediawiki servers, locking itself, there are still occasional spikes about "Lock wait timeout exceeded; try restarting transaction". It would be a good thing to isolate the remaining causes and fix them completely.

See also T70530: Many INSERT queries on some web requests

Event Timeline

From where are these are errors triggered now?

That's the thing that should be investigated. Logstash does not tell what is holding the lock. I see entries for both MessageGroupStats::clear and MessageGroupStats::queueUpdates and some slow query logs from hhvm (which might be slow because they were waiting for the lock?). What is clear is that they usually come in bursts.

Is it possible to paste the full details of some errors as on T102229? Knowing the url, sql1line and timestamp would be of great help.

Curiously enough, for the past week there is only one message. Maybe this has been solved after all?

2016-05-09T16:56:00.000Z
SlowTimer [11614ms] at runtime/ext_mysql: slow query: DELETE /* MessageGroupStats::clear <username> */ FROM translate_groupstats WHERE tgs_group = 'page-Wikidata:List of properties/Geographical feature' AND tgs_lang = 'hu'

Going further back in time, here are the contents for one group messages happening at the same time.

2016-05-01T13:17:38.615Z

{"mobileview":{"normalizedtitle":"Wikimedia Highlights, May 2014/pl","lastmodified":"2016-02-18T18:55:08Z","revision":15360487,"languagecount":0,"displaytitle":"Aktualności Wikimedia - Maj 2014","id":5004653,"image":{"file":"ZoomProof Wikisource tool concept.ogv","width":1112,"height":672},"thumb":{"url":"//upload.wikimedia.org/wikipedia/commons/thumb/8/83/ZoomProof_Wikisource_tool_concept.ogv/1024px--ZoomProof_Wikisource_tool_concept.ogv.jpg","width":1024,"height":619},"sections":[],"protection":[],"editable":true}}<!DOCTYPE html><html><head><title>Internal Server Error</title></head><body><h1>Internal Server Error</h1><p>A database query error has occurred.<br />
Function: MessageGroupStats::queueUpdates<br />
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.30)<br />
[snip more html]


MessageGroupStats::queueUpdates 10.64.16.30 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.30) INSERT IGNORE INTO translate_groupstats (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('page-Wikimedia Highlights, May 2014','ar','54','10','0','8')


{"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/Database.php","line":901,"function":"reportQueryError","class":"DatabaseBase","type":"->","args":["string","integer","string","string","boolean"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/Database.php","line":1464,"function":"query","class":"DatabaseBase","type":"->","args":["string","string"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/extensions/Translate/utils/MessageGroupStats.php","line":477,"function":"insert","class":"DatabaseBase","type":"->","args":["string","array","string","string"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/extensions/Translate/utils/MessageGroupStats.php","line":492,"function":"Closure$MessageGroupStats::queueUpdates","args":["DatabaseMysqli","string"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/Database.php","line":2482,"function":"Closure$MessageGroupStats::runWithLock","args":[]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/Database.php","line":2693,"function":"runOnTransactionIdleCallbacks","class":"DatabaseBase","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/loadbalancer/LoadBalancer.php","line":1055,"function":"commit","class":"DatabaseBase","type":"->","args":["string","string"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/loadbalancer/LBFactory.php","line":206,"function":"commitMasterChanges","class":"LoadBalancer","type":"->","args":["string"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/loadbalancer/LBFactoryMulti.php","line":411,"function":"Closure$LBFactory::forEachLBCallMethod","args":["LoadBalancer","string","array"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/loadbalancer/LBFactory.php","line":209,"function":"forEachLB","class":"LBFactoryMulti","type":"->","args":["Closure$LBFactory::forEachLBCallMethod;1204592053","array"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/db/loadbalancer/LBFactory.php","line":250,"function":"forEachLBCallMethod","class":"LBFactory","type":"->","args":["string","array"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/MediaWiki.php","line":561,"function":"commitMasterChanges","class":"LBFactory","type":"->","args":["string","array"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/api/ApiMain.php","line":478,"function":"preOutputCommit","class":"MediaWiki","type":"::","args":["DerivativeContext"]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/includes/api/ApiMain.php","line":434,"function":"executeActionWithErrorHandling","class":"ApiMain","type":"->","args":[]}, {"file":"/srv/mediawiki/php-1.27.0-wmf.22/api.php","line":83,"function":"execute","class":"ApiMain","type":"->","args":[]}, {"file":"/srv/mediawiki/w/api.php","line":3,"function":"include","args":["string"]}


SlowTimer [15888ms] at runtime/ext_mysql: slow query: INSERT /* MessageGroupStats::queueUpdates 10.64.0.16 */ IGNORE INTO translate_groupstats (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('page-Wikimedia Highlights, May 2014','ar','54','10','0','8')

There is group of 6 messages just 8 seconds before this one:

MessageGroupStats::clear 10.64.16.30 1205 Lock wait timeout exceeded; try restarting transaction (10.64.16.30) DELETE FROM translate_groupstats WHERE tgs_group = 'page-Wikimedia Highlights, May 2014' AND tgs_lang = 'ar'

This one is also via api.php but has a referer. Looks like someone was translating at the same time.

https://meta.wikimedia.org/w/index.php?title=Special:Translate/!recent&action=proofread&language=ar&filter=translated%7C%21reviewer%3A<uid>%7C%21last-translator%3A<uid>&group=agg-Wikimedia_Highlights

Krinkle subscribed.

WMF Logstash reported this fatal error 48 times in the last 30 days, including on versions 1.32.0-wmf.19 and 1.32.0-wmf.22, and on multiple wikis (wikidata.org, mediawiki.org, meta.wikimedia.org).

Recent sample:

A database query error has occurred. ###
Query: DELETE FROM `translate_groupstats` WHERE tgs_group = 'page-###||###' AND tgs_lang = 'hi'
Function: MessageGroupStats::clear
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.192.32.103)
trace
#3 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Translate/utils/MessageGroupStats.php(168): Wikimedia\Rdbms\Database->delete(string, string, string)
#4 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Translate/TranslateEditAddons.php(230): MessageGroupStats::clear(MessageHandle)
#5 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(174): TranslateEditAddons::onSave(WikiPage, User, WikitextContent, string, integer, NULL, NULL, integer, Revision, Status, boolean, integer)
#6 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#7 /srv/mediawiki/php-1.32.0-wmf.22/includes/Storage/PageUpdater.php(1209): Hooks::run(string, array)
#8 /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/Database.php(3746): Closure$MediaWiki\Storage\PageUpdater::getAtomicSectionUpdate(Wikimedia\Rdbms\DatabaseMysqli, string)
#9 /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/DBConnRef.php(49): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$MediaWiki\Storage\PageUpdater::getAtomicSectionUpdate;3070)
#10 /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/DBConnRef.php(529): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#11 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/AtomicSectionUpdate.php(35): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure$MediaWiki\Storage\PageUpdater::getAtomicSectionUpdate;3070)
#12 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(268): AtomicSectionUpdate->doUpdate()
#13 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(214): DeferredUpdates::runUpdate(AtomicSectionUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#14 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(130): DeferredUpdates::execute(array, string, integer)
#15 /srv/mediawiki/php-1.32.0-wmf.22/includes/MediaWiki.php(607): DeferredUpdates::doUpdates(string, integer)
#16 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(548): MediaWiki::preOutputCommit(DerivativeContext)
#17 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#18 /srv/mediawiki/php-1.32.0-wmf.22/api.php(87): ApiMain->execute()
Krinkle renamed this task from Avoid lock contention issues in MessageGroupStats to Avoid "Lock wait timeout exceeded" from MessageGroupStats::cleare.Sep 19 2018, 2:58 AM
Krinkle renamed this task from Avoid "Lock wait timeout exceeded" from MessageGroupStats::cleare to Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear.
Krinkle moved this task from Untriaged to Older on the Wikimedia-production-error board.
Nikerabbit raised the priority of this task from Low to High.EditedSep 19 2018, 8:42 AM

The eventual fix should take care of both translate_messagegroupstats and translate_reviews (See T195293: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) ).

Currently leading proposal is to use Pool Counter to avoid stampedes. There should also be another look at the code to see if there are any possible changes to avoid deadlocks or unnecessary work.

Change 468319 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Reduce contention on translate_messagegroupstats

https://gerrit.wikimedia.org/r/468319

Change 468319 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Reduce contention on translate_messagegroupstats

https://gerrit.wikimedia.org/r/468319

Change 470618 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Fix message group stats not upating

https://gerrit.wikimedia.org/r/470618

This is quite nasty issue to slip into an MLEB release... I'm wondering whether we should do a follow-up.

This patch has been cherry-picked to translatewiki.net a few days ago.

Change 470618 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Fix message group stats not updating

https://gerrit.wikimedia.org/r/470618

Change 474659 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Special:PageTranslation: Fix bunch of warnings with encourage&discourage

https://gerrit.wikimedia.org/r/474659

I found one more regression from Logstash :(

Logstash no longer shows results with "Lock wait timeout exceeded" AND MessageGroupStats so the original problem at least seems to be resolved.

Change 474659 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Special:PageTranslation: Fix bunch of warnings with encourage&discourage

https://gerrit.wikimedia.org/r/474659

@Nikerabbit I see numerous WARNINGS in logstash - https://logstash.wikimedia.org/app/kibana#/discover?_g=h@44136fa&_a=h@59a810c:

Expectation (masterConns <= 0) by MediaWiki::restInPeace not met (actual: 1):
[connect to 10.64.0.205 (mediawikiwiki)]
#0 /srv/mediawiki/php-1.33.0-wmf.6/includes/libs/rdbms/TransactionProfiler.php(165): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated()
#1 /srv/mediawiki/php-1.33.0-wmf.6/includes/libs/rdbms/loadbalancer/LoadBalancer.php(761): Wikimedia\Rdbms\TransactionProfiler->recordConnection()
#2 /srv/mediawiki/php-1.33.0-wmf.6/includes/libs/rdbms/database/DBConnRef.php(46): Wikimedia\Rdbms\LoadBalancer->getConnection()
#3 /srv/mediawiki/php-1.33.0-wmf.6/includes/libs/rdbms/database/DBConnRef.php(604): Wikimedia\Rdbms\DBConnRef->__call()
#4 /srv/mediawiki/php-1.33.0-wmf.6/extensions/Translate/utils/MessageGroupStats.php(553): Wikimedia\Rdbms\DBConnRef->lock()
#5 /srv/mediawiki/php-1.33.0-wmf.6/includes/deferred/MWCallableUpdate.php(34): Closure$MessageGroupStats::withLock()
#6 /srv/mediawiki/php-1.33.0-wmf.6/includes/deferred/DeferredUpdates.php(270): MWCallableUpdate->doUpdate()
#7 /srv/mediawiki/php-1.33.0-wmf.6/includes/deferred/DeferredUpdates.php(216): DeferredUpdates::runUpdate()
#8 /srv/mediawiki/php-1.33.0-wmf.6/includes/deferred/DeferredUpdates.php(140): DeferredUpdates::execute()
#9 /srv/mediawiki/php-1.33.0-wmf.6/includes/MediaWiki.php(905): DeferredUpdates::doUpdates()
#10 /srv/mediawiki/php-1.33.0-wmf.6/includes/MediaWiki.php(728): MediaWiki->restInPeace()
#11 (): Closure$MediaWiki::doPostOutputShutdown()
#12 {main}

Any concerns about such warnings? Are they related to the patches in this task?

It is not a new thing. It is related to T134841#4711772.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM