Page MenuHomePhabricator

WikiPage::updateCategoryCounts causing Lock wait timeout exceeded
Closed, ResolvedPublicPRODUCTION ERROR

Description

https://logstash.wikimedia.org/goto/a62df87d34aea105ba4da2178252b277

Wikimedia production logstash shows ~ 350 fatal errors (last 30 days) with this exception:

sample
A database query error has occurred.  [..]
Query: UPDATE  `category` SET cat_pages = cat_pages - 1,cat_files = cat_files - 1 WHERE cat_title IN  [..]
Function: WikiPage::updateCategoryCounts
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.48.23)

exception_id: W0iuOQpAEMUAAIr4Vr8AAAAN
wiki: commonswiki
exception.trace
#0 /srv/mediawiki/php-1.32.0-wmf.12/includes/libs/rdbms/database/Database.php(1413): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.12/includes/libs/rdbms/database/Database.php(1186): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.12/includes/libs/rdbms/database/Database.php(2043): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.12/includes/page/WikiPage.php(3339): Wikimedia\Rdbms\Database->update(string, array, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/LinksDeletionUpdate.php(91): WikiPage->updateCategoryCounts(array, array, integer)
#5 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(268): LinksDeletionUpdate->doUpdate()
#6 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(214): DeferredUpdates::runUpdate(LinksDeletionUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.12/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#8 /srv/mediawiki/php-1.32.0-wmf.12/includes/MediaWiki.php(913): DeferredUpdates::doUpdates(string)
#9 /srv/mediawiki/php-1.32.0-wmf.12/includes/MediaWiki.php(733): MediaWiki->restInPeace(string, boolean)
#10 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#11 {main}

See also:

Event Timeline

Krinkle added a subscriber: Anomie.

Tagging performance team given we recently touched this code as part of T163801, which seems like it should've avoided this sort of error.

I ran some checks across the log files in mwlog1001, which go back to 2018-04-17, looking at exception.log entries containing Lock wait timeout and `category`.

Looking at rates by wiki, only four have any entries logged: commonswiki had 2407, enwiki had 787, itwikisource had 643, and foundationwiki had 598. All the entries for itwikisource were on 2018-04-18 and for foundationwiki were on 2018-05-05, which makes me suspect something special on those days on those wikis rather than the systematic problem we're concerned about. Further analysis only considers enwiki and commonswiki.

Looking at rates by MW version I see low numbers for wmf.1 to wmf.7 (although there's a spike at wmf.3) then an increase starting in wmf.8, although that may be because those versions were having deadlocks instead (see T195397). 1.31.0-wmf.29 (almost the full week at the start of the data set) is also very low, though.

Looking at stack traces, I see:

  • 2506 via doPostOutputShutdown → deferred LinksDeletionUpdate → WikiPage::updateCategoryCounts()
  • 485 via doPostOutputShutdown → deferred LinksUpdate → WikiPage::updateCategoryCounts()
  • 179 via jobqueue RefreshLinksJob → LinksUpdate → WikiPage::updateCategoryCounts()
  • 12 via jobqueue → deferred LinksUpdate → WikiPage::updateCategoryCounts()
  • 12 via doPostOutputShutdown → WikiPage calling Category::refreshCounts()

I also note that the lock wait timeout being referred to seems to be 50 seconds.

wikiadmin@10.64.16.77(enwiki)> select @@innodb_lock_wait_timeout\G
*************************** 1. row ***************************
@@innodb_lock_wait_timeout: 50

Guesses:

  • The refreshCounts() calls (rate greatly increased in wmf.1 by de75c4e63) are slow enough to block concurrent LinksDeletionUpdates long enough to hit the lock timeouts. Although the low rate of refreshCounts() calls themselves failing may argue against this.
  • LinksDeletionUpdates are slow enough that stampedes on the same category can starve some of them, even though each one takes less than 50 seconds.
  • LinksDeletionUpdates are slow enough that one can last 50 seconds to block others.

We might try reducing the rate of those refreshCounts() calls, moving the category stuff to the end of LinksUpdate and LinksDeletionUpdate so category is locked for less of the transaction, or even farming the category updates out to a separate deferred update (and a separate transaction) as is being done for the refreshCategory() calls. @aaron, what do you think?

Imarlier moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
Imarlier subscribed.

Aaron is preparing a patch that may address this. If not, may need to reassign for further evaluation.

My first inclination is to try to reduce the refreshCounts() calls.

Change 446469 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Reduce the rate of calls to Category::refreshCounts

https://gerrit.wikimedia.org/r/446469

Change 446469 merged by jenkins-bot:
[mediawiki/core@master] Reduce the rate of calls to Category::refreshCounts

https://gerrit.wikimedia.org/r/446469

Awaiting confirmation once wmf.14 is everywhere.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM