
DBError on Wikibase term store
Closed, ResolvedPublicPRODUCTION ERROR

Description

Currently, the top DB errors in production come from the Wikibase term store: https://logstash.wikimedia.org/goto/fc2df3f12166030c02ec92bba71d7924

It's full of things like:

Duplicate entry '\xE8\xAA\x9E\xE8\xA8\x80\xE7\x9A\x84\xE9\x9F\xB3\xE9\x9F\xBB\xE5' for key 'wbt_text_text' (10.64.48.26)	INSERT  INTO `wbt_text` (wbx_text) VALUES ('語言的音韻包括這種聲音')
Duplicate entry 'lingua di scrittura' for key 'wbt_text_text' (10.64.48.26)
Duplicate entry '178959-be-tarask' for key 'wbt_text_in_lang_text_id_text_id' (10.64.48.26)
Duplicate entry '246301-ml' for key 'wbt_text_in_lang_text_id_text_id' (10.64.48.26)

(it happened 20 times in the last four hours)

How to reproduce:

  • update some terms of a sandbox property (e.g. https://www.wikidata.org/wiki/Property:P370) in different languages at once (two languages, one term each, should be enough), making sure that the new values do not exist in the database yet.

What to expect as a fix:

Event Timeline

This is a blocker for moving on to items.

Should these not all be doing INSERT IGNOREs?
(perhaps I should go and read the code before saying that though)...

So, it looks like this is probably a race condition within ReplicaMasterAwareRecordIdsAcquirer.

It looks like generally the slave will be read (I didn't look into isIgnoringReplica much, but it looks like it is only used in restoreCleanedUpIds).
If the records are not in the slave, insertNonExistingRecordsIntoMaster will be called, which does a plain insert that does not account for the possible race condition.
An INSERT IGNORE here would probably solve this particular issue.

The alternative would be to check the slave, then check the master, and only then try the insert. But that is probably not the preferred solution at the scale we are talking about.
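
To make the flow concrete, here is a minimal sketch of the read-replica-then-insert-master sequence described above (a hedged reconstruction; apart from insertNonExistingRecordsIntoMaster and the wbt_text / wbx_text names visible in the logs, the function and variable names are illustrative):

```
use Wikimedia\Rdbms\IDatabase;

// Minimal sketch, not the actual ReplicaMasterAwareRecordIdsAcquirer code.
function acquireTextRecords( IDatabase $dbr, IDatabase $dbw, array $texts ) {
	// 1. Look the texts up on the replica (slave).
	$existing = $dbr->selectFieldValues(
		'wbt_text', 'wbx_text', [ 'wbx_text' => $texts ], __METHOD__
	);

	// 2. Whatever is missing on the replica is inserted into the master with a
	//    plain INSERT. Race: another request (or a row that simply hasn't
	//    replicated yet) may already have written the same text, so this can
	//    violate the wbt_text_text unique key and raise "Duplicate entry".
	$missing = array_diff( $texts, $existing );
	if ( $missing ) {
		$rows = [];
		foreach ( $missing as $text ) {
			$rows[] = [ 'wbx_text' => $text ];
		}
		$dbw->insert( 'wbt_text', $rows, __METHOD__ );
	}

	// 3. Ids are then read back and returned (omitted here).
}
```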

It could be a lag issue between the replica and the master. I also suspect another issue that we faced once: a record read from the replica didn't match the existing record because of silent truncation MariaDB was doing, resulting in an attempt to re-insert the same record into the master. At the time, I thought this could theoretically happen whenever the read doesn't match what already exists, perhaps because of an encoding-related issue or something of that sort.

So, it looks like this is probably a race condition within ReplicaMasterAwareRecordIdsAcquirer.

It looks like generally the slave will be read (I didn't look into isIgnoringReplica much, but it looks like it is only used in restoreCleanedUpIds).
If the records are not in the slave, insertNonExistingRecordsIntoMaster will be called, which does a plain insert that does not account for the possible race condition.
An INSERT IGNORE here would probably solve this particular issue.

We avoided INSERT IGNORE here as it increments the auto_increment counter even for rows it doesn't insert (so if it attempts to insert 4 rows, ignoring 2 because they already exist, the next auto-increment value will be increased by 4 and not 2). That is also not a very favorable side effect for scalability.
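
For illustration, a hedged sketch of what that INSERT IGNORE variant would look like through the rdbms layer's IGNORE option; the auto-increment gap is standard MariaDB behaviour for ignored duplicate rows, not something specific to this code:

```
// Hypothetical variant, not what the code does: MariaDB still reserves an
// auto-increment id for every row in the statement, so rows ignored as
// duplicates leave gaps in the id sequence.
$dbw->insert(
	'wbt_text',
	[
		[ 'wbx_text' => 'lingua di scrittura' ], // already exists -> ignored, id still burned
		[ 'wbx_text' => 'a brand new text' ],    // actually inserted
	],
	__METHOD__,
	[ 'IGNORE' ]
);
```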

What ReplicaMasterAwareRecordIdsAcquirer does in this case is read from the master, and potentially try to insert again if it still didn't get all the records it is looking for.

A solution that saves us from log spam while keeping this scalable approach is probably to catch those duplicate entry errors and somehow not write them to the log. Is that possible?

So the error appears because Wikimedia\Rdbms\Database::getQueryExceptionAndLog() logs those errors internally. There is a way to silence them, by passing the flag IDatabase::QUERY_SILENCE_ERRORS to IDatabase::query(). The issue is that IDatabase::insert does not seem to accept such flags, and it calls IDatabase::query without passing any flags (defaulting to logging the error here).

This means that if we want to solve it by silencing those errors (as they are expected due to lag, by design), we will have to patch core to allow passing flags to insert().
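
For reference, a hedged sketch of the kind of call this is about ($text is an illustrative variable; the actual changes are in the patches linked below):

```
// The per-query silencing mechanism: insert() builds and runs its SQL via
// query() internally without exposing this flag, which is why core would
// need a change before the term store inserts could use it.
$dbw->query(
	'INSERT INTO wbt_text (wbx_text) VALUES (' . $dbw->addQuotes( $text ) . ')',
	__METHOD__,
	IDatabase::QUERY_SILENCE_ERRORS // expected duplicates would not be logged as errors
);
```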

I wonder if waiting for replication might be another option here to avoid duplicate inserts caused by lag altogether?

So, checking one of the properties that caused one of these log entries, https://www.wikidata.org/wiki/Property:P4189: it has the value 西藏和喜馬拉雅圖書館藏文轉寫 twice, in Chinese and Traditional Chinese. The lag is probably hit because the current frontend sends separate requests per language. Effectively, in this case one request for Chinese containing this value inserts it into the master, and another request for Traditional Chinese tries to insert it again because the replica hasn't caught up with the insert from the previous request.

Waiting on replication here won't solve the problem then, as it would only wait for the current request's changes to replicate, as far as I understand the documentation.

Our best option here seems to be silencing those errors. That should be fine as they are not persistent. Once the replica catches up, they would not happen again unless a similar situation arises for a new value (used twice across languages) in the same edit, causing several requests with similar values.

Change 519287 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/core@master] Allow silencing errors on instance level in Database.

https://gerrit.wikimedia.org/r/519287

Change 519291 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/extensions/Wikibase@master] Silence ignorable errors when inserting into master.

https://gerrit.wikimedia.org/r/519291

I'm not super sure silencing errors is a good idea here. The number of edits on properties is really small (one every several minutes), yet we are getting errors at this rate. That doesn't seem right. Let me dig deeper into this tomorrow.

Change 519389 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Avoid inserting duplicates in one batch in term store

https://gerrit.wikimedia.org/r/519389
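
As I read the patch title, the idea is that within a single edit the same string can be requested for several languages, so collapsing the batch to unique texts before they reach the acquirer avoids inserting the same value twice in one go. A rough sketch (variable names are illustrative, not the actual change):

```
// Collapse the per-language term texts to a unique set before acquiring ids.
$textsToAcquire = array_values( array_unique( $allTermTexts ) );
```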

I wonder if waiting for replication might be another option here to avoid duplicate inserts caused by lag altogether?

We can definitely consider this, but only with a) a short timeout and b) not during web-requests (never let the user wait, but I think this isn't a problem, as we already do all of this deferred?).

Change 519291 abandoned by Alaa Sarhan:
Silence ignorable errors when inserting into master.

Reason:
the solution requiring this is abandoned

https://gerrit.wikimedia.org/r/519291

Change 519287 abandoned by Alaa Sarhan:
Allow silencing errors on instance level in Database.

Reason:
the solution requiring this is abandoned

https://gerrit.wikimedia.org/r/519287

I wonder if waiting for replication might be another option here to avoid duplicate inserts caused by lag altogether?

We can definitely consider this, but only with a) a short timeout and b) not during web-requests (never let the user wait, but I think this isn't a problem, as we already do all of this deferred?).

That sounds promising. We can try to wait for replication with a timeout of 2 seconds. Yes, we are doing this whole thing as a deferred update, so the user wouldn't wait.
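
A minimal sketch of that idea, assuming the standard ILBFactory service (details may differ from the eventual patch):

```
use MediaWiki\MediaWikiServices;

// Inside the deferred update: wait up to 2 seconds for the replicas to catch
// up with the master before reading them.
$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
$lbFactory->waitForReplication( [ 'timeout' => 2 ] );
```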

We have already tested (@Ladsgroup and @alaa_wmde) a few cases to confirm that it is happening due to a race condition between replication and the multiple requests sent for different languages by the old termbox.

Change 519389 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Avoid inserting duplicates in one batch in term store

https://gerrit.wikimedia.org/r/519389

Change 519542 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/extensions/Wikibase@master] Wait for replication in ReplicaMasterAwareRecordIdsAcquirer

https://gerrit.wikimedia.org/r/519542

ReplicaMasterAwareRecordIdsAcquirer requires things that look odd for its users to have and pass in; it should probably be refactored into a service instead of a utility class. Follow-up task: T226773.

Ready for review: restored the loop so it does not fail entirely on duplicate entry errors, which could still happen even after waiting for the replica.
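
Roughly, the restored loop treats a duplicate-entry failure as "another request inserted it first" and re-reads instead of failing the whole update. A hedged sketch (filterOutExistingRecords and the local variables are illustrative names; insertNonExistingRecordsIntoMaster is the real method):

```
use Wikimedia\Rdbms\DBQueryError;

while ( $missing ) {
	try {
		$this->insertNonExistingRecordsIntoMaster( $missing );
	} catch ( DBQueryError $e ) {
		if ( $e->errno != 1062 ) {
			throw $e; // only tolerate "Duplicate entry" (MySQL/MariaDB error 1062)
		}
		// The record was inserted concurrently; pick up its id on re-read.
	}
	// Re-read from the master and keep only records still without an id.
	$missing = $this->filterOutExistingRecords( $dbw, $missing );
}
```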

Change 519542 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Wait for replication in ReplicaMasterAwareRecordIdsAcquirer

https://gerrit.wikimedia.org/r/519542

I still get it on the beta cluster :(((

Wikimedia\Rdbms\Database::insert	172.16.5.170:3306	1062	Duplicate entry 'adsfdgfghjlhgfdgdsfdgfhjgkhlj' for key 'wbt_text_text' (172.16.5.170:3306)	INSERT  INTO `wbt_text` (wbx_text) VALUES ('adsfdgfghjlhgfdgdsfdgfhjgkhlj'),('sdrftgyuhyitr354657687654rwsfdfghjlio876resdxcbnvhkitrurycnv')

 	#0 /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php(1534): Wikimedia\Rdbms\Database->getQueryExceptionAndLog()
#1 /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php(1130): Wikimedia\Rdbms\Database->reportQueryError()
#2 /srv/mediawiki/php-master/includes/libs/rdbms/database/Database.php(2095): Wikimedia\Rdbms\Database->query()
#3 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/Util/ReplicaMasterAwareRecordIdsAcquirer.php(270): Wikimedia\Rdbms\Database->insert()
#4 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/Util/ReplicaMasterAwareRecordIdsAcquirer.php(148): Wikibase\Lib\Store\Sql\Terms\Util\ReplicaMasterAwareRecordIdsAcquirer->insertNonExistingRecordsIntoMaster()
#5 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseTermIdsAcquirer.php(166): Wikibase\Lib\Store\Sql\Terms\Util\ReplicaMasterAwareRecordIdsAcquirer->acquireIds()
#6 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseTermIdsAcquirer.php(132): Wikibase\Lib\Store\Sql\Terms\DatabaseTermIdsAcquirer->acquireTextIds()
#7 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseTermIdsAcquirer.php(412): Wikibase\Lib\Store\Sql\Terms\DatabaseTermIdsAcquirer->mapToTextIds()
#8 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseTermIdsAcquirer.php(60): Wikibase\Lib\Store\Sql\Terms\DatabaseTermIdsAcquirer->mapTermsArrayToTermIds()
#9 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabasePropertyTermStore.php(168): Wikibase\Lib\Store\Sql\Terms\DatabaseTermIdsAcquirer->acquireTermIds()
#10 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabasePropertyTermStore.php(105): Wikibase\Lib\Store\Sql\Terms\DatabasePropertyTermStore->acquireAndInsertTerms()
#11 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/MultiPropertyTermStore.php(47): Wikibase\Lib\Store\Sql\Terms\DatabasePropertyTermStore->storeTerms()
#12 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/DelegatingEntityTermStoreWriter.php(51): Wikibase\MultiPropertyTermStore->storeTerms()
#13 /srv/mediawiki/php-master/extensions/Wikibase/lib/includes/Store/DelegatingEntityTermStoreWriter.php(34): Wikibase\Lib\Store\DelegatingEntityTermStoreWriter->storeProperty()
#14 (): Wikibase\Lib\Store\DelegatingEntityTermStoreWriter->saveTermsOfEntity()
#15 /srv/mediawiki/php-master/extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php(62): call_user_func_array()
#16 /srv/mediawiki/php-master/includes/deferred/DeferredUpdates.php(309): Wikibase\Repo\Content\DataUpdateAdapter->doUpdate()
#17 /srv/mediawiki/php-master/includes/Storage/DerivedPageDataUpdater.php(1626): DeferredUpdates::attemptUpdate()
#18 /srv/mediawiki/php-master/includes/Storage/DerivedPageDataUpdater.php(1447): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates()
#19 /srv/mediawiki/php-master/includes/deferred/MWCallableUpdate.php(38): Closure$MediaWiki\Storage\DerivedPageDataUpdater::doUpdates#2()
#20 /srv/mediawiki/php-master/includes/deferred/DeferredUpdates.php(304): MWCallableUpdate->doUpdate()
#21 /srv/mediawiki/php-master/includes/deferred/DeferredUpdates.php(265): DeferredUpdates::attemptUpdate()
#22 /srv/mediawiki/php-master/includes/deferred/DeferredUpdates.php(217): DeferredUpdates::handleUpdate()
#23 /srv/mediawiki/php-master/includes/deferred/DeferredUpdates.php(144): DeferredUpdates::handleUpdateQueue()
#24 /srv/mediawiki/php-master/includes/MediaWiki.php(926): DeferredUpdates::doUpdates()
#25 /srv/mediawiki/php-master/includes/MediaWiki.php(750): MediaWiki->restInPeace()
#26 (): Closure$MediaWiki::doPostOutputShutdown()
#27 {main}

How did you see it on Beta, @Ladsgroup? I tried filters similar to the ones we had for production, but I always get no results.

So we have one of two options:

  1. use INSERT IGNORE (I'd keep that as a last resort)
  2. query the master for existing records before inserting. We didn't want to do this because it adds another query to the master, but after thinking about it again, we are actually adding just one more read query to the 2 queries (insert, then read) that we will do anyway. In most cases, records that do not exist in the replica do not exist in the master either, so we end up doing 2 reads and 1 insert on the master. The very sweet but very rare edge case (caused by replication lag) is that we do only 1 read from the master and get all the records we want (see the sketch below).

Now the question for option 2 is: how much worse is it to do 2 reads and 1 insert, compared to doing 1 read and 1 insert, given that the former avoids duplicate entry errors altogether?
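
A minimal sketch of option 2 (findMissingOnConnection is an illustrative helper name, not the actual method):

```
// Check the master too before inserting, so values written by a concurrent
// request, or not yet replicated, are found rather than re-inserted.
$missing = $this->findMissingOnConnection( $dbr, $texts );       // read 1: replica
if ( $missing ) {
	$missing = $this->findMissingOnConnection( $dbw, $missing ); // read 2: master
}
if ( $missing ) {
	$this->insertNonExistingRecordsIntoMaster( $missing );       // insert: master
}
```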

Change 520858 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/extensions/Wikibase@master] Query master before insert to lower the rate of duplicate entry errors

https://gerrit.wikimedia.org/r/520858

I put up a patch for the 2nd suggestion in T226639#5307797. Let me know what you think of it.

How did you see it on Beta, @Ladsgroup? I tried filters similar to the ones we had for production, but I always get no results.

It's in another instance residing in WMCS; you can access it at https://logstash-beta.wmflabs.org (tell me if you have trouble getting past LDAP).

How did you see it on Beta, @Ladsgroup? I tried filters similar to the ones we had for production, but I always get no results.

It's in another instance residing in WMCS; you can access it at https://logstash-beta.wmflabs.org (tell me if you have trouble getting past LDAP).

Aha, I see. Cool, thanks, I can access that Logstash too!

Change 520858 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Query master before insert to lower the rate of duplicate entry errors

https://gerrit.wikimedia.org/r/520858

It doesn't happen on the beta cluster anymore. I would wait until next week to make sure it doesn't blow anything up.

After the deployment to production, this doesn't happen anymore (the highlighted time is the rough time of the deployment)

image.png (192×1 px, 17 KB)

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM