Page MenuHomePhabricator

Page Deletion Fatal exception of type "Wikimedia\Rdbms\DBQueryError" due to lock timeout
Open, MediumPublic5 Estimated Story PointsPRODUCTION ERROR

Description

Hi, I get a lot of error messages while deleting copyright violations on Commons. Example:

[0828bfa9-1b9e-42b7-92db-2026742d7774] 2024-12-23 11:16:06: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

while deleting https://commons.wikimedia.org/wiki/File:AIRIN_RACHMI_DIANY.jpg (it is now deleted, but I had to try 3 times).

Event Timeline

Dec 23, 2024 @ 13:41:37.245 ERROR exception mw-web.codfw.main-7f966856f8-4s67m commonswiki [2ab32948-3b20-4ed9-9eae-2c676457b561] /w/index.php?action=delete&title=File:Air-France_Madame._Invitation_au_voyage_num%C3%A9ro_39_-_D%C3%A9pliant_publicitaire_-_photographies,_Henri_Cartier-Bresson,_Burri,_A.F._-_btv1b10102329m_(08_of_13).jpg Wikimedia\Rdbms\DBQueryError: Error 1205: Lock wait timeout exceeded; try restarting transaction
Function: WikiPage::lockAndGetLatest

Query: SELECT  page_latest  FROM `page`    WHERE page_id = 154589251 AND page_namespace = 6 AND page_title = 'Air-France_Madame._Invitation_au_voyage_numéro_39_-_Dépliant_publicitaire_-_photographies,_Henri_Cartier-Bresson,_Burri,_A.F._-_btv1b10102329m_(08_of_13).jpg'  LIMIT 1   FOR UPDATE
from /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(1193)
#0 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(1177): Wikimedia\Rdbms\Database->getQueryException(string, int, string, string)
#1 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(1151): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, int, string, string)
#2 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(645): Wikimedia\Rdbms\Database->reportQueryError(string, int, string, string, bool)
#3 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(1340): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
#4 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/Database.php(1291): Wikimedia\Rdbms\Database->select(array, string, array, string, array, array)
#5 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/DBConnRef.php(127): Wikimedia\Rdbms\Database->selectField(array, string, array, string, array, array)
#6 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/database/DBConnRef.php(338): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.44.0-wmf.8/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(763): Wikimedia\Rdbms\DBConnRef->selectField(array, string, array, string, array, array)
#8 /srv/mediawiki/php-1.44.0-wmf.8/includes/page/WikiPage.php(2445): Wikimedia\Rdbms\SelectQueryBuilder->fetchField()
#9 /srv/mediawiki/php-1.44.0-wmf.8/includes/page/DeletePage.php(517): WikiPage->lockAndGetLatest()
#10 /srv/mediawiki/php-1.44.0-wmf.8/includes/page/DeletePage.php(434): MediaWiki\Page\DeletePage->deleteInternal(WikiFilePage, string, string)
#11 /srv/mediawiki/php-1.44.0-wmf.8/includes/page/DeletePage.php(336): MediaWiki\Page\DeletePage->deleteUnsafe(string)
#12 /srv/mediawiki/php-1.44.0-wmf.8/includes/page/File/FileDeleteForm.php(111): MediaWiki\Page\DeletePage->deleteIfAllowed(string)
#13 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/FileDeleteAction.php(131): MediaWiki\Page\File\FileDeleteForm::doDelete(MediaWiki\Title\Title, LocalFile, string, string, bool, MediaWiki\User\User, array, bool)
#14 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/FileDeleteAction.php(76): MediaWiki\Actions\FileDeleteAction->tempExecute(LocalFile)
#15 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/DeleteAction.php(140): MediaWiki\Actions\FileDeleteAction->tempDelete()
#16 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/ActionEntryPoint.php(733): DeleteAction->show()
#17 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/ActionEntryPoint.php(510): MediaWiki\Actions\ActionEntryPoint->performAction(ImagePage, MediaWiki\Title\Title)
#18 /srv/mediawiki/php-1.44.0-wmf.8/includes/actions/ActionEntryPoint.php(146): MediaWiki\Actions\ActionEntryPoint->performRequest()
#19 /srv/mediawiki/php-1.44.0-wmf.8/includes/MediaWikiEntryPoint.php(201): MediaWiki\Actions\ActionEntryPoint->execute()
#20 /srv/mediawiki/php-1.44.0-wmf.8/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#21 /srv/mediawiki/w/index.php(3): require(string)
#22 {main}
HCoplin-WMF renamed this task from Fatal exception of type "Wikimedia\Rdbms\DBQueryError" to Page Deletion Fatal exception of type "Wikimedia\Rdbms\DBQueryError".Jan 8 2025, 2:38 PM

@Ladsgroup -- Do you have thoughts about this? @aaron mentioned that it might be related to work you completed recently.

aaron renamed this task from Page Deletion Fatal exception of type "Wikimedia\Rdbms\DBQueryError" to Page Deletion Fatal exception of type "Wikimedia\Rdbms\DBQueryError" due to lock timeout.Jan 9 2025, 4:23 PM

@Ladsgroup -- Do you have thoughts about this? @aaron mentioned that it might be related to work you completed recently.

My changes hasn't touched production yet (they are behind a feature flag) so it really shouldn't cause this. Once the data has been migrated and we stopped writing to the image and oldimage tables, the writes should become simpler and less prone to deadlocks but that's at least a quarter away.

Okay, thanks for the update and clarification! Do you have capacity to take a look at this sooner, since it might be in the same areas of code you're already working on? If you don't have time for it, we can also dig in and include you on code reviews to keep you in the loop.

Unfortunately I'm swamped at the moment, I'd be happy to help in the review though.

HCoplin-WMF set the point value for this task to 5.Jan 14 2025, 4:31 PM
HCoplin-WMF added a subscriber: daniel.

Assigning to @aaron to take a look! We should have a reasonable idea about what's going on soon.

Also noting that we chatted with @daniel about it, and although there is also some Domain Events work happening around deletions, that code is not yet merged. Sooo, we will try to fix this before both those and @Ladsgroup's changes need to go in. Please include both of them on code reviews, as well :)

To delete a file, the whole file needs to be copied and deleted in Swift while the main transaction is open. If Swift is slow and there are concurrent requests to delete the same file, a lock wait timeout is expected.

A deletion for AIRIN RACHMI DIANY.jpg began at 11:14:20 with an API request recorded in xff.log

2024-12-23 11:14:20.361479 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 xff INFO: Mon, 23 Dec 2024 11:14:20 +0000     https://commons.wikimedia.org/w/api.php ...

The deletion continues in FileOperation.log:

2024-12-23 11:14:20.757111 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation DEBUG: HTTP start: HEAD https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg {"method":"HEAD","url":"https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg","headers":{"x-auth-token":"[redacted]","x-newest":"true","content-length":0}} 
2024-12-23 11:17:20.760173 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation DEBUG: HTTP complete: HEAD https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg code=200 size=0 total=0.098452 connect=0.000022 {"method":"HEAD","url":"https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg","response_code":200,"size":0.0,"total_time":"0.098452","connect_time":"0.000022"} 
2024-12-23 11:17:20.760455 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation ERROR: HTTP 504 (Gateway Timeout) in 'Wikimedia\FileBackend\SwiftFileBackend::doGetFileStatMulti' (given '{"srcs":["mwstore://local-swift-codfw/local-public/c/cd/AIRIN_RACHMI_DIANY.jpg","mwstore://local-swift-codfw/local-deleted/0/j/f/0jfe0ywrll1fj373txqevjssogc1faf.jpg"],"latest":true,"concurrency":50}') {"code":504,"desc":"Gateway Timeout","func":"Wikimedia\\FileBackend\\SwiftFileBackend::doGetFileStatMulti","req_params":"{\"srcs\":[\"mwstore://local-swift-codfw/local-public/c/cd/AIRIN_RACHMI_DIANY.jpg\",\"mwstore://local-swift-codfw/local-deleted/0/j/f/0jfe0ywrll1fj373txqevjssogc1faf.jpg\"],\"latest\":true,\"concurrency\":50}"}

total_time=0.098 but exactly three minutes separates the log entries? That field must be wrong. Note that the exception occurred during those three minutes with a different request ID. Despite the timeout for this HEAD request, it goes on to delete the file:

2024-12-23 11:17:21.078686 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation DEBUG: HTTP start: PUT https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-deleted.0j/0/j/f/0jfe0ywrll1fj373txqevjssogc1faf.jpg {"method":"PUT","url":"https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-deleted.0j/0/j/f/0jfe0ywrll1fj373txqevjssogc1faf.jpg","headers":{"x-auth-token":"[redacted]","x-copy-from":"/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg","content-length":0}} 
   ##### missing completion log entry
2024-12-23 11:17:21.427655 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation DEBUG: HTTP start: DELETE https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg {"method":"DELETE","url":"https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg","headers":{"x-auth-token":"[redacted]","content-length":0}} 
2024-12-23 11:17:21.473171 [d07d552e-2075-44df-9064-0757a0f673fa] mw-api-ext.codfw.main-644678588b-sbxnf commonswiki 1.44.0-wmf.8 FileOperation DEBUG: HTTP complete: DELETE https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg code=204 size=0 total=0.045375 connect=0.000023 {"method":"DELETE","url":"https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.cd/c/cd/AIRIN_RACHMI_DIANY.jpg","response_code":204,"size":0.0,"total_time":"0.045375","connect_time":"0.000023"}

The failed deletion can be seen in xff.log starting at 11:15:51:

2024-12-23 11:15:51.068998 [0828bfa9-1b9e-42b7-92db-2026742d7774] mw-web.codfw.main-7f966856f8-pqklw commonswiki 1.44.0-wmf.8 xff INFO: Mon, 23 Dec 2024 11:15:51 +0000https://commons.wikimedia.org/w/index.php?action=delete&title=File:AIRIN_RACHMI_DIANY.jpg ...

The commonswiki logging table shows that the file was deleted at 11:14:20, which is the start time of the API request.

In summary, Yann tried to delete the file at least twice. The first request, done with the API, showed a client-side timeout but continued executing. The second request, done with the web UI, showed a lock wait timeout due to the lock still being held by the first request.

@daniel and I have previously talked about how to deal with transactions and atomicity in file updates. My opinion is that an update should acquire an application-level lock. If another request tries to act on the same file, and the lock is not available, it should immediately fail with an error "somebody else is doing something to this file". There should be no server-side waiting.

Background rate of 504s at the time:

2024-12-23 rate of 504s.png (316×533 px, 23 KB)

@daniel and I have previously talked about how to deal with transactions and atomicity in file updates. My opinion is that an update should acquire an application-level lock. If another request tries to act on the same file, and the lock is not available, it should immediately fail with an error "somebody else is doing something to this file". There should be no server-side waiting.

Yes, using non-blocking outer advisory locks makes sense here.

Yes, using non-blocking outer advisory locks makes sense here.

Can we just use LockManager, or is something else needed here?

There's FileBackend::getScopedFileLocks() and FileBackendStore::getScopedLocksForOps()... looks like we are already using advisory locks, then? What's missing?

The current advisory locks are acquired after the DB locks. They'd have to be re-ordered at the least.

aaron lowered the priority of this task from High to Medium.Jan 28 2025, 3:37 PM