
New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction
Closed, Resolved · Public

Authored by jcrespo, Oct 8 2019, 3:59 PM

Description

Before 2019-10-08T00:00:02 there were almost no Wikidata deadlocks; since then there have been 650-1250 per hour at a relatively consistent rate:

https://logstash.wikimedia.org/goto/654dbe1fd799f62f062798cb82e12e23

Happening when executing:

INSERT  INTO `wbt_item_terms` (wbit_item_id,wbit_term_in_lang_id) VALUES ('XXXX

There is no apparent version or configuration change at that time; it could have started at the same time as the dumps or some other automated process.

The rate is not of immediate concern, but I am reporting it because it could be a regression from the almost-zero rate we recently achieved.
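
For anyone who wants to look at the raw deadlock information, the most recent deadlock detected by InnoDB can be inspected directly on the affected database host (a minimal sketch; it requires the PROCESS privilege and the exact transactions shown will vary):

-- Print the InnoDB monitor output, which includes a
-- "LATEST DETECTED DEADLOCK" section listing the two transactions involved,
-- the statements they were running (here the INSERT INTO wbt_item_terms ...)
-- and the locks they were waiting on.
SHOW ENGINE INNODB STATUS\G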

Todo
As discussed in the Wikidata team, we will do the following (in order):

  1. Limit writing to both stores to items up to Q70m for the moment (the highest existing item ID is currently above 70,202,000). This stops writing to the new store for newly created pages, which generates the most traffic and probably contributes most to this increase in deadlocks.
  2. Read through https://www.mediawiki.org/wiki/Database_transactions to find out whether there are features we can use to improve the locking situation when we write to the new store (a rough sketch of the kind of change this points at follows this list).
  3. Prioritize resolving the tasks that block switching to reading from the new store (T232132: Add EntitiesWithoutTermFinder for the new store, T232040: Add label and description collision detectors for new terms store and T232393: Find out why rebuilding some items in new term store failed), then start writing to and reading from the new store for property terms and a chunk of item terms (up to Q10m, maybe).
  4. Check later whether we can optimize (reduce the number of) the writes we do to the new store.
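
The sketch referenced in item 2 (mine, not the actual Wikibase code): the current code does a SELECT ... FOR UPDATE before inserting (visible in the debug log later in this task), which takes gap locks on the shared indexes while other requests write. One lower-locking alternative, shown here with the wbt_text insert from that debug log and assuming (as I believe is the case) a unique index on wbx_text, is to insert with IGNORE first and read the IDs back afterwards with a plain, non-locking SELECT:

-- Rough sketch only, not the actual Wikibase implementation.
-- Relies on the unique index on wbx_text so that IGNORE silently skips
-- rows that already exist instead of inserting duplicates.
INSERT IGNORE INTO wbt_text (wbx_text) VALUES ('fa-value'),('fa-vaeeelue');

-- Read the ids back without FOR UPDATE, so no gap locks are held.
SELECT wbx_id, wbx_text
FROM wbt_text
WHERE wbx_text IN ('fa-value','fa-vaeeelue');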

Event Timeline

There are a very large number of changes, so older changes are hidden.

Could it be that we can actually live, until the migration is done, with the MostLinked special page on Wikidata not being that up-to-date for about a month?

@Lydia_Pintscher let's discuss this on Monday. I think the results on the first page of MostLinkedPages, which I guess readers/editors/others are most interested in, probably do not change all that often.

Until then, @jcrespo @Marostegui please feel free to kill that script and/or stop running it altogether (if possible) whenever necessary or convenient, and thanks a lot for your help figuring this out so far :)

We're talking about https://www.wikidata.org/wiki/Special:MostLinkedPages not being updated for about a month during the migration and then updating regularly again? Yeah that should be fine.

Thanks! If you could send a patch to get it disabled that'd be helpful!

@Lydia_Pintscher It's not just the special page; it also has an API counterpart.

To make matters even more interesting, we can either disable none or all of these for Wikidata: fewestrevisions, wantedpages, mostrevisions, mostlinked, deadendpages, ancientpages. If that's a no-go, I can make a Puppet patch, but it will be complex and big.

Ah. Do we have pageview/access stats for these?

Change 542915 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Disable query page updates for wikidatawiki

https://gerrit.wikimedia.org/r/542915

Another possibility would be to split modules/mediawiki/manifests/maintenance/updatequerypages/cronjob.pp into one cron per "feature"?

Yeah, I'm on it, but it won't be super simple. Let me give it a try.

Change 542956 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/puppet@production] mediawiki: Split cronjob for updatequerypages to multiple modules

https://gerrit.wikimedia.org/r/542956

With this patch, we only disable the MostLinked query page updates on Wikidata: https://puppet-compiler.wmflabs.org/compiler1001/18877/mwmaint1002.eqiad.wmnet/

Change 542956 merged by Marostegui:
[operations/puppet@production] mediawiki: Split cronjob for updatequerypages to multiple modules

https://gerrit.wikimedia.org/r/542956

I have merged your change and disabled the MostLinked cronjob entry.
Thank you!

It seems to have improved the situation a lot since yesterday https://logstash.wikimedia.org/app/kibana#/dashboard/87348b60-90dd-11e8-8687-73968bebd217?_g=h@af6022c&_a=h@d6ea75e

There are still a few spikes, and it seems that a deadlock is reached only when a very large number of terms needs to be written to the new store.

We will double-check whether there is any data loss due to these errors at the moment. If not, it is probably fine and we can live with it until the migration is done.

After the Puppet patch this is clean and nice: only 134 cases in the last 24 hours. I suggest we call this done, @Marostegui.

So we checked in the new store which terms exist for some of the items that had failing inserts due to the deadlocks. None of them, except one, has any terms stored. That means the transactions were not retried later, and we don't know whether they will fail again.

We don't know what will happen when the rebuild script reaches them either. I think we should treat this as possible data loss and invest more time in investigating why it is happening.
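
For reference, a query along these lines can be used to spot-check what the new store holds for a given item (a sketch only: the item ID is a placeholder and the join columns are written as I understand the new term store schema):

-- Which terms does the new store hold for item Q12345 (placeholder ID)?
SELECT wbxl_language, wby_name AS term_type, wbx_text
FROM wbt_item_terms
JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
JOIN wbt_type         ON wby_id  = wbtl_type_id
JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
JOIN wbt_text         ON wbx_id  = wbxl_text_id
WHERE wbit_item_id = 12345;
-- No rows back means the item has nothing in the new store yet.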

That's indeed not ideal (the fact that the transaction isn't retried), and it probably merits some investigation and potential mitigation, as deadlocks can happen at any time (ideally not often, but they are always a possibility in high-concurrency environments).

@Ladsgroup is the rebuild script close to Q40m? I'm thinking of: 1. lowering the write-both limit to Q40m, 2. stopping the rebuild script shortly before it reaches Q40m, and 3. investigating what is going on here. But if the rebuild script is still not that close, we can investigate a bit first.

It's at Q23m. Even though the script fails, it logs all failures on the mwmaint node; I can get a list of all of them and fix them via a script (Python to do a purge, mwscript, anything), so I don't think that would be a very big deal IMO. This is all secondary data.

So you suggest that we just continue, regardless of the deadlocks?

At this rate I think it's fine, but if we want to increase the speed or set write_both for all items, we should fix the lock time somehow (points 2, 3, and 4) before we move forward.

Change 542915 abandoned by Ladsgroup:
mediawiki: Disable query page updates for wikidatawiki

Reason:
Done in I7d0f973b4416

https://gerrit.wikimedia.org/r/542915

WikidataCon is happening right now; it's likely that one or more of the cool new tools people have unleashed for Wikidata's birthday are adding a lot of load, or just that the overall load is the problem.

Trying to address point number 4, I ran it locally and couldn't find anything we can batch:

amsa@amsa-Latitude-7480:/var/log/mediawiki$ tail debug-repo.log --lines 200 | grep DBQuery | grep -v LCStoreDB | grep wbt_ | grep -i insert
[DBQuery] Wikibase\Lib\Store\Sql\Terms\DatabasePropertyTermStore::acquireAndInsertTerms [0.001s] [Null]: SELECT  wbpt_term_in_lang_id AS `value`  FROM `wbt_property_terms`    WHERE wbpt_property_id = '21'   FOR UPDATE
[DBQuery] Wikimedia\Rdbms\Database::insert [0.001s] [Null]: INSERT  INTO `wbt_text` (wbx_text) VALUES ('fa-value'),('fa-vaeeelue')
[DBQuery] Wikimedia\Rdbms\Database::insert [0s] [Null]: INSERT  INTO `wbt_text_in_lang` (wbxl_text_id,wbxl_language) VALUES ('19','fa'),('18','fa')
[DBQuery] Wikimedia\Rdbms\Database::insert [0s] [Null]: INSERT  INTO `wbt_term_in_lang` (wbtl_text_in_lang_id,wbtl_type_id) VALUES ('19','1'),('18','2')
[DBQuery] Wikibase\Lib\Store\Sql\Terms\{closure} [0s] [Null]: INSERT  INTO `wbt_property_terms` (wbpt_property_id,wbpt_term_in_lang_id) VALUES ('21','18'),('21','19')

The transaction part seems more important.
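
To spell out why the transaction part matters: the SELECT ... FOR UPDATE in acquireAndInsertTerms takes gap locks on the index when no matching rows exist, gap locks held by different sessions do not block each other, but the subsequent inserts into that gap do. A schematic timeline (hypothetical sessions and values, assuming both property IDs fall into the same empty index gap; this is not a trace from production):

-- session 1:
BEGIN;
SELECT wbpt_term_in_lang_id FROM wbt_property_terms
  WHERE wbpt_property_id = 21 FOR UPDATE;   -- no rows yet: takes a gap lock

-- session 2, concurrently:
BEGIN;
SELECT wbpt_term_in_lang_id FROM wbt_property_terms
  WHERE wbpt_property_id = 22 FOR UPDATE;   -- same gap: gap locks are compatible,
                                            -- so this does not block

-- session 1:
INSERT INTO wbt_property_terms (wbpt_property_id,wbpt_term_in_lang_id)
  VALUES (21, 18);                          -- needs an insert-intention lock,
                                            -- blocked by session 2's gap lock

-- session 2:
INSERT INTO wbt_property_terms (wbpt_property_id,wbpt_term_in_lang_id)
  VALUES (22, 19);                          -- blocked by session 1's gap lock;
                                            -- InnoDB detects the cycle and rolls
                                            -- one transaction back with
                                            -- "Deadlock found when trying to get lock"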

I sampled and went through errors caused by the new term store. Here's the summary of one hour I sampled:

Change 546925 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/546925

Change 546925 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/546925

Change 547243 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.3] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/547243

Change 547244 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.4] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/547244

Change 547243 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.3] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/547243

Change 547244 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.4] Stop locking and use DISTINCT when finding used terms to delete

https://gerrit.wikimedia.org/r/547244

Mentioned in SAL (#wikimedia-operations) [2019-10-30T16:05:15Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, [[gerrit:547244|Stop locking and use DISTINCT when finding used terms to delete]] (T234948) (duration: 01m 04s)

Change 547262 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Add IGNORE on insert in ReplicaMasterAwareRecordIdsAcquirer

https://gerrit.wikimedia.org/r/547262

wikiadmin@10.64.32.198(wikidatawiki)> EXPLAIN SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */ DISTINCT wbit_term_in_lang_id AS `value` FROM `wbt_item_terms` WHERE wbit_term_in_lang_id IN ('828227', '828232', '828257', '828265', '828269', '828272', '828276', '828287', '828290', '828293', '828297', '828302', '828314', '828315', '828319', '828333', '828336', '828337', '828342', '828350', '828358', '828362', '828363', '828368', '828374', '828377', '828380', '828382', '828386', '828387', '828388', '828390', '828401', '828404', '828405', '828414', '828428', '828433', '828435', '828438', '828448', '828449', '828450', '828451', '870671', '870674', '870679');
+------+-------------+----------------+-------+----------------------------------------+----------------------------------------+---------+------+----------+--------------------------+
| id   | select_type | table          | type  | possible_keys                          | key                                    | key_len | ref  | rows     | Extra                    |
+------+-------------+----------------+-------+----------------------------------------+----------------------------------------+---------+------+----------+--------------------------+
|    1 | SIMPLE      | wbt_item_terms | range | wbt_item_terms_term_in_lang_id_item_id | wbt_item_terms_term_in_lang_id_item_id | 4       | NULL | 30481196 | Using where; Using index |
+------+-------------+----------------+-------+----------------------------------------+----------------------------------------+---------+------+----------+--------------------------+
1 row in set (0.16 sec)

😊🔫

It looks like SELECT DISTINCT … WHERE … IN (…, …, …) might not be the best query for “which of these IDs are still used?” when the IDs may be used millions of times. The “range” join type seems especially worrying, but even when it’s not used:

wikiadmin@10.64.0.96(wikidatawiki)> EXPLAIN SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */ DISTINCT wbit_term_in_lang_id AS value FROM wbt_item_terms WHERE wbit_term_in_lang_id IN ('828227');
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
| id   | select_type | table          | type | possible_keys                          | key                                    | key_len | ref   | rows    | Extra                                     |
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
|    1 | SIMPLE      | wbt_item_terms | ref  | wbt_item_terms_term_in_lang_id_item_id | wbt_item_terms_term_in_lang_id_item_id | 4       | const | 1330386 | Using where; Using index; Using temporary |
+------+-------------+----------------+------+----------------------------------------+----------------------------------------+---------+-------+---------+-------------------------------------------+
1 row in set (0.00 sec)

Is it actually reading 1330386 rows, to get the same constant value from all of them and then run it through DISTINCT? I sure hope not…

Since we’re really only interested in the existence of any row with this (non-PK) ID, I wondered if this could instead be written as some kind of SELECT value, EXISTS (SELECT wbit_term_in_lang_id …). This is possible by turning the list of potential IDs into a derived table (based on this SO answer):

SELECT /* Wikibase\Lib\Store\Sql\Terms\DatabaseItemTermStore::cleanTermsIfUnused */ `value`, EXISTS (SELECT * FROM `wbt_item_terms` WHERE `wbit_term_in_lang_id` = `value`) FROM (SELECT '828227' AS `value` UNION SELECT '828232' AS `value` UNION SELECT '828257' AS `value` UNION SELECT '828265' AS `value` UNION SELECT '828269' AS `value` UNION SELECT '828272' AS `value` UNION SELECT '828276' AS `value` UNION SELECT '828287' AS `value` UNION SELECT '828290' AS `value` UNION SELECT '828293' AS `value` UNION SELECT '828297' AS `value` UNION SELECT '828302' AS `value` UNION SELECT '828314' AS `value` UNION SELECT '828315' AS `value` UNION SELECT '828319' AS `value` UNION SELECT '828333' AS `value` UNION SELECT '828336' AS `value` UNION SELECT '828337' AS `value` UNION SELECT '828342' AS `value` UNION SELECT '828350' AS `value` UNION SELECT '828358' AS `value` UNION SELECT '828362' AS `value` UNION SELECT '828363' AS `value` UNION SELECT '828368' AS `value` UNION SELECT '828374' AS `value` UNION SELECT '828377' AS `value` UNION SELECT '828380' AS `value` UNION SELECT '828382' AS `value` UNION SELECT '828386' AS `value` UNION SELECT '828387' AS `value` UNION SELECT '828388' AS `value` UNION SELECT '828390' AS `value` UNION SELECT '828401' AS `value` UNION SELECT '828404' AS `value` UNION SELECT '828405' AS `value` UNION SELECT '828414' AS `value` UNION SELECT '828428' AS `value` UNION SELECT '828433' AS `value` UNION SELECT '828435' AS `value` UNION SELECT '828438' AS `value` UNION SELECT '828448' AS `value` UNION SELECT '828449' AS `value` UNION SELECT '828450' AS `value` UNION SELECT '828451' AS `value` UNION SELECT '870671' AS `value` UNION SELECT '870674' AS `value` UNION SELECT '870679' AS `value`) AS `values`;

It’s ugly, but the EXPLAIN shows a “ref” join type instead of “range”, doesn’t report millions of rows to be examined, and when you actually run the query it finishes in 0.00 sec instead of 14.36 sec. (Though the explain actually only reports 2 rows to be examined, which is definitely not enough, so I won’t pretend to completely understand this.) Is there a less silly way to arrive at a similar query plan that doesn’t scan millions of rows?

Going back and forth and trying things in MySQL in production suggests that if we stop using DISTINCT and stop batching (using LIMIT 1 instead), it would be way faster. I'll make a patch for that.
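
For the record, the shape of the query this points at (a sketch; 828227 is one of the IDs from the EXPLAIN above): check each term_in_lang ID on its own and stop at the first matching row instead of scanning and de-duplicating everything.

-- Unbatched existence check: one ID at a time, stop at the first match.
SELECT wbit_term_in_lang_id
FROM wbt_item_terms
WHERE wbit_term_in_lang_id = 828227
LIMIT 1;
-- A row back means the term_in_lang is still used and must be kept;
-- no row means it is a candidate for clean-up.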

Change 547535 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Revert "Revert "Stop locking and use DISTINCT when finding used terms to delete""

https://gerrit.wikimedia.org/r/547535

Change 548279 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.4] Retrying "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/548279

Change 547535 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Retrying "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/547535

Change 548279 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.4] Retrying "Stop locking and use DISTINCT when finding used terms to delete"

https://gerrit.wikimedia.org/r/548279

Unless there is a last-minute change, this issue is fixed for me. Obviously, please do any data checks/refills that you think are necessary (if any) to fix potential problems caused by this. Thanks for working on this.

Tracked in T234329: Investigate, fix and reimport failing item terms into new store

Change 547262 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add IGNORE on insert in ReplicaMasterAwareRecordIdsAcquirer

https://gerrit.wikimedia.org/r/547262

I'm seeing a handful of these in 1.35.0-wmf.8:

PHP Warning: [data-update-failed]: A data update callback triggered an exception (Fail-safe exception. Number of ids to be restored is not equal to the number of records that are about to be inserted into master. This should never happen, except for an edge-case that was not detected during development or due to a race-condition that is not covered by this implementation.) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /srv/mediawiki/php-1.35.0-wmf.8/extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 62]

Stack trace:

#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.35.0-wmf.8/includes/debug/MWDebug.php(333): trigger_error(string, integer)
#2 /srv/mediawiki/php-1.35.0-wmf.8/includes/debug/MWDebug.php(188): MWDebug::sendMessage(string, array, string, integer)
#3 /srv/mediawiki/php-1.35.0-wmf.8/includes/GlobalFunctions.php(1079): MWDebug::warning(string, integer, integer, string)
#4 /srv/mediawiki/php-1.35.0-wmf.8/extensions/Wikibase/lib/includes/Reporting/LogWarningExceptionHandler.php(28): wfLogWarning(string, integer)
#5 /srv/mediawiki/php-1.35.0-wmf.8/extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php(62): Wikibase\Lib\Reporting\LogWarningExceptionHandler->handleException(Exception, string, string)
#6 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/DeferredUpdates.php(416): Wikibase\Repo\Content\DataUpdateAdapter->doUpdate()
#7 /srv/mediawiki/php-1.35.0-wmf.8/includes/Storage/DerivedPageDataUpdater.php(1630): DeferredUpdates::attemptUpdate(Wikibase\Repo\Content\DataUpdateAdapter, Wikimedia\Rdbms\LBFactoryMulti)
#8 /srv/mediawiki/php-1.35.0-wmf.8/includes/Storage/DerivedPageDataUpdater.php(1446): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#9 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Storage\DerivedPageDataUpdater->MediaWiki\Storage\{closure}()
#10 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/DeferredUpdates.php(416): MWCallableUpdate->doUpdate()
#11 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/DeferredUpdates.php(296): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/DeferredUpdates.php(233): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#13 /srv/mediawiki/php-1.35.0-wmf.8/includes/deferred/DeferredUpdates.php(150): DeferredUpdates::handleUpdateQueue(array, string, integer)
#14 /srv/mediawiki/php-1.35.0-wmf.8/includes/MediaWiki.php(1053): DeferredUpdates::doUpdates(string)
#15 /srv/mediawiki/php-1.35.0-wmf.8/includes/MediaWiki.php(833): MediaWiki->restInPeace()
#16 /srv/mediawiki/php-1.35.0-wmf.8/includes/MediaWiki.php(842): MediaWiki->{closure}()
#17 /srv/mediawiki/php-1.35.0-wmf.8/api.php(118): MediaWiki->doPostOutputShutdown()
#18 /srv/mediawiki/w/api.php(3): require(string)
#19 {main}

(I think this is the right issue, but I may have gotten myself a bit lost in a maze of exception messages here.)

This is the right issue, but given the size of the migration, a little bit of deadlocking is still unavoidable. How big is the number?

What @brennen reported isn’t a deadlock, is it? It seems to be some race condition that we thought Should Never Happen™.

Yes, you're right. Sorry, I didn't check the stack trace fully; that's tracked in T233414: Merge triggered warning about unequal numbers of ids to be restored and records to be inserted into master