
Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172: Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while transaction status is ERROR
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.22

message
Wikibase\Repo\Content\DataUpdateAdapter::doUpdate: Commit failed on server(s) 10.64.48.172: Cannot execute query from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate while transaction status is ERROR

Impact

After promotion of 1.35.0-wmf.22 to group1, there were hundreds of database errors like this. I rolled back the train on group1. Until this is resolved or found to be unimportant, this blocks the train.

Notes

Details

Request ID
Xl@4DgpAEDMAAHnqJvAAAABL
Request URL
https://www.wikidata.org/w/api.php
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactory.php(243): Wikimedia\Rdbms\LoadBalancer->commitMasterChanges(string, integer)
#1 /srv/mediawiki/php-1.35.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(235): Wikimedia\Rdbms\LBFactory->Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#2 /srv/mediawiki/php-1.35.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactory.php(245): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#3 /srv/mediawiki/php-1.35.0-wmf.22/includes/libs/rdbms/lbfactory/LBFactory.php(308): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#4 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/DeferredUpdates.php(419): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string)
#5 /srv/mediawiki/php-1.35.0-wmf.22/includes/Storage/DerivedPageDataUpdater.php(1648): DeferredUpdates::attemptUpdate(Wikibase\Repo\Content\DataUpdateAdapter, Wikimedia\Rdbms\LBFactoryMulti)
#6 /srv/mediawiki/php-1.35.0-wmf.22/includes/Storage/DerivedPageDataUpdater.php(1458): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#7 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Storage\DerivedPageDataUpdater->MediaWiki\Storage\{closure}()
#8 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/DeferredUpdates.php(417): MWCallableUpdate->doUpdate()
#9 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/DeferredUpdates.php(296): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/DeferredUpdates.php(233): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#11 /srv/mediawiki/php-1.35.0-wmf.22/includes/deferred/DeferredUpdates.php(150): DeferredUpdates::handleUpdateQueue(array, string, integer)
#12 /srv/mediawiki/php-1.35.0-wmf.22/includes/MediaWiki.php(1058): DeferredUpdates::doUpdates(string)
#13 /srv/mediawiki/php-1.35.0-wmf.22/includes/MediaWiki.php(838): MediaWiki->restInPeace()
#14 /srv/mediawiki/php-1.35.0-wmf.22/includes/MediaWiki.php(847): MediaWiki->{closure}()
#15 /srv/mediawiki/php-1.35.0-wmf.22/api.php(119): MediaWiki->doPostOutputShutdown()
#16 /srv/mediawiki/w/api.php(3): require(string)
#17 {main}
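
For context, this fails inside the deferred secondary data update that Wikibase queues after an edit. The sketch below is illustrative only, not the actual MediaWiki code: DeferredUpdates::attemptUpdate() runs the update inside its own transaction round, and once a query inside doUpdate() hits a deadlock, the database rolls the transaction back and the handle's status becomes ERROR, so every later query, including the commitMasterChanges() call at the top of the trace, is refused with the message above.

```php
// Illustrative sketch only, not the actual MediaWiki implementation.
// DeferredUpdates::attemptUpdate() wraps the update in a transaction round:
$lbFactory->beginMasterChanges( __METHOD__ );
try {
	// Wikibase\Repo\Content\DataUpdateAdapter::doUpdate() runs the term store writes;
	// a deadlock in here rolls the transaction back and marks its status as ERROR.
	$update->doUpdate();
	// Refused with "Cannot execute query ... while transaction status is ERROR"
	// once the round has been poisoned by an earlier deadlock.
	$lbFactory->commitMasterChanges( __METHOD__ );
} catch ( Exception $e ) {
	$lbFactory->rollbackMasterChanges( __METHOD__ );
	throw $e;
}
```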

Event Timeline

Restricted Application added a subscriber: Aklapper.
Addshore lowered the priority of this task from Unbreak Now! to High. Mar 4 2020, 3:51 PM

No longer blocking the train.
I'll write some details here later, but these deadlocks actually seem to be caused by edit spikes on Wikidata, one of which just so happened to coincide with the train deployment.

The deadlocks still happen at a low rate due to the edit spikes.

image.png (312×1 px, 38 KB)

Change 576883 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576883

Change 576963 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.22] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576963

Change 576883 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576883

Change 576963 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.22] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576963

Mentioned in SAL (#wikimedia-operations) [2020-03-05T11:29:29Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.22/extensions/Wikibase: [[gerrit:576963|Schedule 1 CleanTermsIfUnusedJob per ID to clean (T244115 T246898)]] (duration: 01m 08s)

The merged patch did reduce the size and duration of the cleanup transactions, and thus shortened how long the locks there are held (while increasing the number of jobs).

image.png (323×614 px, 33 KB)

image.png (357×613 px, 29 KB)
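
For illustration, the change roughly amounts to pushing one small job per term-in-lang ID instead of one big job for all of them. This is a hedged sketch; the job name string and parameter key here are assumptions, not the real Wikibase job parameters:

```php
// Hedged sketch of the idea behind change 576883 (job name and parameter key are
// assumptions, not the real Wikibase job parameters): instead of one job cleaning
// every potentially-unused term-in-lang ID in a single long transaction, push one
// small job per ID so each cleanup transaction only holds its locks briefly.
$jobs = [];
foreach ( $potentiallyUnusedTermInLangIds as $termInLangId ) {
	$jobs[] = new JobSpecification(
		'CleanTermsIfUnusedJob',
		[ 'termInLangId' => $termInLangId ]
	);
}
JobQueueGroup::singleton()->lazyPush( $jobs );
```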

However, we still see a few deadlocks appearing in the logs.
I imagine these will go away once the migration is all done.

Looking at some of the items where the deadlocks are occurring, such as https://www.wikidata.org/w/index.php?title=Q76851570&action=history
which has two edits in quick succession:

image.png (400×1 px, 92 KB)

These could now mainly be two data updates for a not-yet-migrated item happening in quick succession, both trying to insert the same rows.

This pattern can be found in other deadlocks too: https://www.wikidata.org/w/index.php?title=Q86805935&action=history

For items that have not been migrated yet, the storeTerms call will end up writing all of the item's terms to the new storage, which is quite a hefty bit of work.
Post-migration this pattern will likely be mostly gone.

This is probably the same deadlock case as we had for T151789 with the old storage: data updates getting in the way of each other when multiple edits happen to the same item in quick succession.
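For illustration, the write pattern that both overlapping updates run looks roughly like this (a simplified sketch against the new wbt_* tables; not the actual DatabaseItemTermStoreWriter code, and the conditions and row building are placeholders):

```php
// Simplified sketch of the write pattern both overlapping data updates run for the
// same not-yet-migrated item (assumed shape, not the real DatabaseItemTermStoreWriter):
$dbw->startAtomic( __METHOD__ );
// Find which of the item's terms already exist in the new term store ...
$existingIds = $dbw->selectFieldValues(
	'wbt_term_in_lang',
	'wbtl_id',
	[ /* conditions selecting this item's terms */ ],
	__METHOD__
);
// ... figure out which term rows are still missing ...
$missingRows = []; // placeholder: rows for this item's terms not covered by $existingIds
// ... and insert them.
$dbw->insert( 'wbt_term_in_lang', $missingRows, __METHOD__, [ 'IGNORE' ] );
$dbw->endAtomic( __METHOD__ );
// When two such transactions interleave for the same item, both try to insert the same
// rows and take conflicting unique-index/gap locks; InnoDB aborts one as the deadlock
// victim, leaving that transaction in ERROR status so the surrounding commit fails.
```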

Seeing quite a bit of this, or what I think is this, in Logstash over the last several hours.

Change 580302 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Store item terms at late as possible to avoid deadlocks

https://gerrit.wikimedia.org/r/580302

Change 580328 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.23] Store item terms at late as possible to avoid deadlocks

https://gerrit.wikimedia.org/r/580328

Change 580302 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Store item terms at late as possible to avoid deadlocks

https://gerrit.wikimedia.org/r/580302

Change 580328 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.23] Store item terms at late as possible to avoid deadlocks

https://gerrit.wikimedia.org/r/580328

Mentioned in SAL (#wikimedia-operations) [2020-03-17T14:14:34Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580328|Store item terms at late as possible to avoid deadlocks (T247553 T246898)]] (duration: 01m 07s)

Change 580342 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580342

Change 580352 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.23] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580352

Change 580342 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580342

Change 580390 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.24] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580390

Change 580352 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.23] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580352

Mentioned in SAL (#wikimedia-operations) [2020-03-17T17:46:05Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.23/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580352|Do not lock rows when there's no term returned (T247553 T246898)]] (duration: 01m 07s)

Change 580390 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.24] Do not lock rows when there's no term returned

https://gerrit.wikimedia.org/r/580390

Mentioned in SAL (#wikimedia-operations) [2020-03-17T18:53:11Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.24/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseItemTermStoreWriter.php: [[gerrit:580390|Do not lock rows when there's no term returned (T247553 T246898)]], To catch the train (duration: 01m 08s)

Addshore added a subscriber: Ladsgroup.

So, these still happen, but at a much lower rate.
This is probably still something that we would like to iron out, and that we would probably tackle with some locks for updating entities higher up in the application (not in SQL).
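For example, something along these lines (a hedged sketch only; the lock key, timeouts, and placement are assumptions rather than an agreed design):

```php
use MediaWiki\MediaWikiServices;

// Hedged sketch of "locks higher up in the application": serialize term store writes
// per entity via a short-lived lock in the main object stash instead of relying on
// row locks in SQL. Key name, timeouts, and placement are assumptions, not Wikibase code.
$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$lockKey = $stash->makeKey( 'wikibase-term-store-write', $itemId->getSerialization() );
$scopedLock = $stash->getScopedLock( $lockKey, 5, 30 );
if ( !$scopedLock ) {
	// Another data update is already writing this item's terms; defer or retry
	// instead of contending on the same rows and deadlocking.
	return;
}
// ... perform the term store writes; the lock is released when $scopedLock goes out of scope.
```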

Sorry if I'm missing something obvious, but I don't see this happening anymore. I have looked at the numbers again: deferred updates failing for Wikidata in the past 24 hours amount to only six cases, none of which is related to the term store (one is Echo not working; several others are MediaWiki not being able to push jobs and fatalling with "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable"). Errors like this one are gone. The value in the graphs I posted is not down to zero because it's a graph of all errors in MediaWiki for wikidatawiki, not this particular error. Even a failure rate of six per day, which is 6 per 712K ≈ 1 per ~120K ≈ 8 ppm, or about 0.0008%, is pretty normal I think; edits fail to save at almost the same rate (5 every 24 hours).

Addshore claimed this task.

Okay, going to close this, as the issue with 1.35.0-wmf.22 is actually done.
The thing causing the errors still occurs, but very rarely.
Leaving the lock manager ticket open to track the possible technical solution, but it is super low priority.