Page MenuHomePhabricator

Database "Lock wait timeout exceeded" from HTMLCacheUpdateJob::invalidateTitles
Closed, DuplicatePublicPRODUCTION ERROR

Description

Error

Request ID: W6uv4ArAACkAAEOtbmAAAAAQ

message
A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: UPDATE  `page` SET page_touched = '20180926161220' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161219')
Function: HTMLCacheUpdateJob::invalidateTitles
Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.192.16.22)
stacktrace
#0 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(1428): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(1198): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.23/includes/libs/rdbms/database/Database.php(2074): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.23/includes/jobqueue/jobs/HTMLCacheUpdateJob.php(148): Wikimedia\Rdbms\Database->update(string, array, array, string)
#4 /srv/mediawiki/php-1.32.0-wmf.23/includes/jobqueue/jobs/HTMLCacheUpdateJob.php(97): HTMLCacheUpdateJob->invalidateTitles(array)
#5 /srv/mediawiki/php-1.32.0-wmf.23/extensions/EventBus/includes/JobExecutor.php(64): HTMLCacheUpdateJob->run()
#6 /srv/mediawiki/rpc/RunSingleJob.php(67): JobExecutor->execute(array)

Notes

The example in question is for two pages on commons.wikimedia.org that were edited/created together by the same user as part of a the "Nominate for deletion" feature.

At the same time as the above error, the were also a bunch of other jobs failing with the same page_in IN() condition, but with similar or slightly different touch-timestamps. Which would explain why they all dead locks because they're all waiting for each other in some way, I guess.

[W6uv9wrAAC0AACBKWiwAAABE]
Query: UPDATE  `page` SET page_touched = '20180926161305' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161244')


[W6uv4QrAEHUAAB1wFb8AAAAR]
Query: UPDATE  `page` SET page_touched = '20180926161305' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161242')

[W6uv@ArAIGYAAHvODEAAAABV
Query: UPDATE  `page` SET page_touched = '20180926161305' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161243')

[W6uv8ArAIF4AAK8uKoUAAAAX]
Query: UPDATE  `page` SET page_touched = '20180926161304' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161240')

[W6uv9ArAAC4AAJzZqBAAAAAK]
Query: UPDATE  `page` SET page_touched = '20180926161304' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161242')
...

[W6uv5wrAACgAAEOlUzMAAACP]
Query: UPDATE  `page` SET page_touched = '20180926161228' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161226')

[W6uv5QrAIFoAAKdmghAAAAAA]
Query: UPDATE  `page` SET page_touched = '20180926161227' WHERE page_id IN ('71237625','72854789')  AND (page_touched < '20180926161226')

All together 31 jobs were executing for the same set of page IDs all failing with deadlocks.

Looking at the past 30 days in WMF Logstash, the deadlock has happened before a few times (starting at August 25, which is today's log cut off so unrelated). However it has only happened for 1 or 2 jobs at most, never 31. That might mean the problem has gotten worse, or maybe it is just a coincidence.

In any event, deadlocks are bad.

Event Timeline

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM

This is effectively the same issue as T206283 and T202715. They are defered updates performing an operation that can be subject to lock contention when the same user or same page is part of several edits in a short time span. The current strategy for those is covered by T206283, which is basically to catch it, queue as job, and let it retry there asynchronously.