Maniphest T205045

Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages)
Closed, ResolvedPublic3 Estimated Story PointsPRODUCTION ERROR
Actions

Assigned To

Authored By

	Krinkle
	Sep 20 2018, 9:59 PM

Description

Error

Request ID: W6P5eArAIEsAAAYwlAIAAABI (https://commons.wikimedia.org/w/index.php?title=File:Kuindzhi_Sunflowers_Crimea_1880s.jpg&action=edit)

DBQuery error

Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages	10.192.16.22	1213
Deadlock found when trying to get lock; try restarting transaction (10.192.16.22)

INSERT IGNORE INTO `wbc_entity_usage` (eu_page_id,eu_aspect,eu_entity_id) VALUES ('23650776','L.en','Q656'),('23650776','S','Q656'),('23650776','O','Q353628'),('23650776','L.en','Q353628'),('23650776','S','Q353628'),('23650776','S','Q54919'),('23650776','S','Q423048'),('23650776','S','Q2494649'),('23650776','S','Q13219454'),('23650776','S','Q131454'),('23650776','S','Q36578'),('23650776','S','Q47757534'),('23650776','S','Q2826570'),('23650776','S','Q188915'),('23650776','S','Q1201876'),('23650776','S','Q1526131'),('23650776','L.en','Q1028181'),('23650776','L.en','Q1622272'),('23650776','L.en','Q37133'),('23650776','S','Q37133'),('23650776','L.en','Q36524'),('23650776','S','Q36524'),('23650776','L.en','Q208826'),('23650776','O','Q174728'),('23650776','O','Q218593'),('23650776','L.en','Q35059'),('23650776','L.en','Q159'),('23650776','S','Q159'),('23650776','O','Q211043'),('23650776','L.en','Q211043'),('23650776','S','Q211043'),('23650776','S','Q19938912'),('23650776','S','Q193563'),('23650776','S','Q623578'),('23650776','S','Q13550863'),('23650776','S','Q1967876'),('23650776','S','Q50339681'),('23650776','L.en','Q36774')

exception

Function: Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.192.16.22)

#2 /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/Database.php(2030): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Usage/Sql/EntityUsageTable.php(184): Wikimedia\Rdbms\Database->insert(string, array, string, string)
#4 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Usage/Sql/SqlUsageTracker.php(212): Wikibase\Client\Usage\Sql\EntityUsageTable->addUsages(integer, array)
#5 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Store/UsageUpdater.php(107): Wikibase\Client\Usage\Sql\SqlUsageTracker->replaceUsedEntities(integer, array)
#6 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php(139): Wikibase\Client\Store\UsageUpdater->replaceUsagesForPage(integer, array)
#7 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php(63): Wikibase\Client\Hooks\DataUpdateHookHandlers->doLinksUpdateComplete(LinksUpdate)
#8 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(174): Wikibase\Client\Hooks\DataUpdateHookHandlers::onLinksUpdateComplete(LinksUpdate, integer)
#9 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#10 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/LinksUpdate.php(186): Hooks::run(string, array)
#11 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/AutoCommitUpdate.php(42): Closure$LinksUpdate::doUpdate(Wikimedia\Rdbms\DatabaseMysqli, string)
#12 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(268): AutoCommitUpdate->doUpdate()
#13 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(226): DeferredUpdates::runUpdate(AutoCommitUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#14 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#15 /srv/mediawiki/php-1.32.0-wmf.22/includes/MediaWiki.php(914): DeferredUpdates::doUpdates(string)
#16 /srv/mediawiki/php-1.32.0-wmf.22/includes/MediaWiki.php(734): MediaWiki->restInPeace(string, boolean)
#17 [internal function]: Closure$MediaWiki::doPostOutputShutdown()

Notes

Logstash has 835 records matching this error from the past 30 days (at least since 1.32.0-wmf.16).

Affects commons.wikimedia.org post-edit, ru.wikipedia.org post-edit, and various job executions.

Details

Subject	Repo	Branch	Lines +/-
Add only needed entity usages in AddUsagesForPageJob	mediawiki/extensions/Wikibase	wmf/1.34.0-wmf.16	+67 -5
Add only needed entity usages in AddUsagesForPageJob	mediawiki/extensions/Wikibase	master	+67 -5
Lower wikibase-addUsagesForPage concurrency.	mediawiki/services/change-propagation/jobqueue-deploy	master	+1 -1

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Open		None	T30599 Deadlock tracking bug (tracking)
Resolved	PRODUCTION ERROR	Ladsgroup	T205045 Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages)
Resolved		Ladsgroup	T213809 Make EntityUsageTable::addUsages batchSize configurable
Resolved		Ladsgroup	T215146 Decrease EntityUsageTable addUsage batch size to 300
Resolved		Michael	T225500 Decrease EntityUsageTable addUsage batch size to 100
Resolved		Ladsgroup	T226818 Diff when updating wbc_entity_usage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 20 2018, 9:59 PM

Krinkle moved this task from Untriaged to Older on the Wikimedia-production-error board.Sep 20 2018, 9:59 PM

Krinkle added a parent task: T30599: Deadlock tracking bug (tracking).Oct 5 2018, 3:09 AM

Marostegui subscribed.Nov 30 2018, 8:24 AM

Logstash has 835 records matching this error from the past 30 days (at least since 1.32.0-wmf.16).

According to my query this was 5,300 for the last 30 days now?
If my snapshot & query is correct https://logstash.wikimedia.org/goto/1638470f94494c4efb6557f6e550caae
It looks like this has been increasing.

It looks like the main increase was around 2018 1st / 2nd of November

addshore@mwlog1001:/srv/mw-log$ zgrep -c 'Deadlock.*EntityUsageTable::addUsages' archive/DBQuery.log-2018* | grep -v :0
archive/DBQuery.log-20181015.gz:42
archive/DBQuery.log-20181016.gz:22
archive/DBQuery.log-20181017.gz:30
archive/DBQuery.log-20181018.gz:28
archive/DBQuery.log-20181019.gz:26
archive/DBQuery.log-20181020.gz:35
archive/DBQuery.log-20181021.gz:33
archive/DBQuery.log-20181022.gz:32
archive/DBQuery.log-20181023.gz:30
archive/DBQuery.log-20181024.gz:25
archive/DBQuery.log-20181025.gz:33
archive/DBQuery.log-20181026.gz:32
archive/DBQuery.log-20181027.gz:33
archive/DBQuery.log-20181028.gz:34
archive/DBQuery.log-20181029.gz:31
archive/DBQuery.log-20181030.gz:38
archive/DBQuery.log-20181031.gz:18
archive/DBQuery.log-20181101.gz:64
archive/DBQuery.log-20181102.gz:141
archive/DBQuery.log-20181103.gz:243
archive/DBQuery.log-20181104.gz:381
archive/DBQuery.log-20181105.gz:112
archive/DBQuery.log-20181106.gz:134
archive/DBQuery.log-20181107.gz:183
archive/DBQuery.log-20181108.gz:188
archive/DBQuery.log-20181110.gz:346
archive/DBQuery.log-20181111.gz:163
archive/DBQuery.log-20181112.gz:250
archive/DBQuery.log-20181113.gz:181
archive/DBQuery.log-20181114.gz:205
archive/DBQuery.log-20181115.gz:161
archive/DBQuery.log-20181116.gz:185
archive/DBQuery.log-20181117.gz:206
archive/DBQuery.log-20181118.gz:172
archive/DBQuery.log-20181119.gz:152
archive/DBQuery.log-20181120.gz:393
archive/DBQuery.log-20181121.gz:134
archive/DBQuery.log-20181122.gz:167
archive/DBQuery.log-20181123.gz:143
archive/DBQuery.log-20181124.gz:153
archive/DBQuery.log-20181125.gz:211
archive/DBQuery.log-20181126.gz:786
archive/DBQuery.log-20181127.gz:83
archive/DBQuery.log-20181128.gz:65
archive/DBQuery.log-20181129.gz:132
archive/DBQuery.log-20181130.gz:157
archive/DBQuery.log-20181201.gz:164
archive/DBQuery.log-20181202.gz:149
archive/DBQuery.log-20181203.gz:117
archive/DBQuery.log-20181204.gz:108
archive/DBQuery.log-20181205.gz:114
archive/DBQuery.log-20181206.gz:113
archive/DBQuery.log-20181207.gz:127
archive/DBQuery.log-20181208.gz:482
archive/DBQuery.log-20181209.gz:568
archive/DBQuery.log-20181210.gz:170
archive/DBQuery.log-20181211.gz:121
archive/DBQuery.log-20181212.gz:207
archive/DBQuery.log-20181213.gz:152
archive/DBQuery.log-20181214.gz:280
archive/DBQuery.log-20181215.gz:306
archive/DBQuery.log-20181216.gz:266
archive/DBQuery.log-20181217.gz:203
archive/DBQuery.log-20181218.gz:176
archive/DBQuery.log-20181219.gz:160
archive/DBQuery.log-20181220.gz:120
archive/DBQuery.log-20181221.gz:111
archive/DBQuery.log-20181222.gz:117
archive/DBQuery.log-20181223.gz:249
archive/DBQuery.log-20181224.gz:224
archive/DBQuery.log-20181225.gz:116
archive/DBQuery.log-20181226.gz:141
archive/DBQuery.log-20181227.gz:122
archive/DBQuery.log-20181228.gz:187
archive/DBQuery.log-20181229.gz:320
archive/DBQuery.log-20181230.gz:213
archive/DBQuery.log-20181231.gz:124

The increase could be a further regression between 33.1 and 33.2

addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181030.gz |grep addUsages | grep wmf.1 | wc -l
38
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181030.gz |grep addUsages | grep wmf.2 | wc -l
0
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181101.gz |grep addUsages | grep wmf.1 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181101.gz |grep addUsages | grep wmf.2 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181102.gz |grep addUsages | grep wmf.1 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181102.gz |grep addUsages | grep wmf.2 | wc -l
109
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181103.gz |grep addUsages | grep wmf.1 | wc -l
18
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181103.gz |grep addUsages | grep wmf.2 | wc -l
225
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181104.gz |grep addUsages | grep wmf.1 | wc -l
41
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181104.gz |grep addUsages | grep wmf.2 | wc -l
340
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181105.gz |grep addUsages | grep wmf.1 | wc -l
22
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181105.gz |grep addUsages | grep wmf.2 | wc -l
90
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181106.gz |grep addUsages | grep wmf.1 | wc -l
12
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181106.gz |grep addUsages | grep wmf.2 | wc -l
122
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181107.gz |grep addUsages | grep wmf.1 | wc -l
0
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181107.gz |grep addUsages | grep wmf.2 | wc -l
183

Addshore moved this task from incoming to needs discussion or investigation on the Wikidata board.Jan 14 2019, 6:00 PM

Maybe EntityUsageTable::INSERT_BATCH_SIZE_FACTOR should be changed from 5 to 3/2.5/2?

Addshore triaged this task as Medium priority.Jan 14 2019, 10:36 PM

Addshore added a project: Wikidata-Campsite.

Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.

In T205045#4878367, @hoo wrote:

Maybe EntityUsageTable::INSERT_BATCH_SIZE_FACTOR should be changed from 5 to 3/2.5/2?

None of the queries failing with deadlocks seem to be anywhere near the 100*5=500 insert limit, but I guess it's not necessarily the inserts in the logs causing the problems, but the ones that are actually running at that time.

Addshore added a project: User-Addshore.Jan 28 2019, 4:22 PM

Addshore moved this task from Unsorted 💣 to Back Burner 🏛️ on the User-Addshore board.

Addshore closed subtask T213809: Make EntityUsageTable::addUsages batchSize configurable as Resolved.Feb 4 2019, 9:40 AM

Addshore mentioned this in T215146: Decrease EntityUsageTable addUsage batch size to 300.Feb 4 2019, 9:43 AM

Addshore moved this task from Needs Work to Blocked / Waiting / External on the Wikidata-Campsite board.Feb 4 2019, 11:15 AM

Addshore moved this task from needs discussion or investigation to in progress on the Wikidata board.Feb 7 2019, 3:53 PM

So we changed from 500 to 300, and it's pretty early but I can see that the deadlocks are still happening, perhaps at a lower frequency though?

Addshore moved this task from Blocked / Waiting / External to Needs Work on the Wikidata-Campsite board.Feb 8 2019, 9:26 AM

I think so too, in mediawiki, usually it's 100.

In T205045#4937780, @Ladsgroup wrote:

I think so too, in mediawiki, usually it's 100.

Is there a legacy reason for us to be using such a high batch size that you are aware of?

In T205045#4964764, @Addshore wrote:

In T205045#4937780, @Ladsgroup wrote:

I think so too, in mediawiki, usually it's 100.

Is there a legacy reason for us to be using such a high batch size that you are aware of?

I'm not aware of any reason whatsoever. It might be that it was lots of additions back then (before xkill) and we couldn't use any lower batch size but that's just speculation.

FYI, because in the past I have been told by people they were not aware of the impact. I am not too worried about this because I don't think it creates user impact (it impacts only the jobqueue), but this is probably at the moment the most frequent Wikibase database error, and one of the top ones on the wikimedia database infrastructure (probably the top except T221577): https://logstash.wikimedia.org/goto/a224c66bbf0cc9969cfc4e71bc3c2e7a

@jcrespo thank you! That's useful information.

@Ladsgroup @Addshore What's needed to get this ready for pickup in camp?

In T205045#5223716, @Lydia_Pintscher wrote:

@jcrespo thank you! That's useful information.

@Ladsgroup @Addshore What's needed to get this ready for pickup in camp?

It's actually just a config change but we need someone and DBAs to babysit the deployment for a couple of days to make sure the number is not too low.

and DBAs to babysit the deployment for a couple of days

I can do that if you tell me anyting specific -metrics- to check for (other than typical error count and general stability. Specially if initial deployment happens at the beginning of next week for example, near EU timezone.

In T205045#5225805, @jcrespo wrote:

and DBAs to babysit the deployment for a couple of days

I can do that if you tell me anyting specific -metrics- to check for (other than typical error count and general stability. Specially if initial deployment happens at the beginning of next week for example, near EU timezone.

It will increase number of write queries (and probably increase run time of inject rc jobs) in all wikis. Specially commons and ruwiki. Is it possible to monitor that? Also, generally it would be great to make sure nothing goes out of ordinary in all metrics.

Lydia_Pintscher moved this task from Needs Work to Ready to estimate on the Wikidata-Campsite board.Jun 1 2019, 6:21 PM

WMDE-leszek set the point value for this task to 3.Jun 4 2019, 12:49 PM

Michael subscribed.Jun 11 2019, 7:05 AM

Michael mentioned this in T225500: Decrease EntityUsageTable addUsage batch size to 100.Jun 11 2019, 11:48 AM

It has improved since the deployment (dropping it to 200):

Thanks, that looks good. I saw no significant impact impact on db infra, if it does more queries/transactions it had not noticeable issue. Did it impact lag/performance on your side?

I see a gap here last week- https://logstash.wikimedia.org/goto/b4170e5964398cfb6ca2769f59479faa issue or unrelated?

• jcrespo awarded a token.Jun 19 2019, 2:04 PM

In T205045#5268632, @jcrespo wrote:

Thanks, that looks good. I saw no significant impact impact on db infra, if it does more queries/transactions it had not noticeable issue. Did it impact lag/performance on your side?

Not anything I know so far.

I see a gap here last week- https://logstash.wikimedia.org/goto/b4170e5964398cfb6ca2769f59479faa issue or unrelated?

It seems unrelated. We haven't done anything on that day (specially it was a deployment freeze)

Apparently there are jobqueue issues, potentially impacting to this: T226109

Okay, after decreasing the batch size to 100, the errors have reduced but haven't eliminated, the number is sorta high but I don't know if DBAs think this is bearable or it should be reduced further.

In T205045#5288488, @Ladsgroup wrote:

Okay, after decreasing the batch size to 100, the errors have reduced but haven't eliminated, the number is sorta high but I don't know if DBAs think this is bearable or it should be reduced further.

Could we do something about the error itself? The error message says Deadlock found when trying to get lock; try restarting transaction so we could try to catch it, log it and then try again a second or so later? Otherwise, it seems that the data that it is trying to write to the DB is just lost?
(Not sure if that makes sense, DB is not my strong suit.)

In T205045#5288520, @Michael wrote:

Could we do something about the error itself? The error message says Deadlock found when trying to get lock; try restarting transaction so we could try to catch it, log it and then try again a second or so later? Otherwise, it seems that the data that it is trying to write to the DB is just lost?
(Not sure if that makes sense, DB is not my strong suit.)

Because it's done through a job, the error bubbles up and then the job gets retried so the issue is not a big deal here in matter of data loss.

Yes, we were worried about the frequency of it (it was the most common DB-related error for some time) because of the performance implications, not that it happens sometimes. I am guessing that this is due to multiple edits happening at the same time about the same subject/item/related item? Sadly there is limited capabilities to make the jobqueue smarter, I think. Maybe there could be strict partitioning on ranges to prevent collision? I don't know.

In T205045#5288620, @jcrespo wrote:

Yes, we were worried about the frequency of it (it was the most common DB-related error for some time) because of the performance implications, not that it happens sometimes.

I agree, but is the current state acceptable or do you think it should be less frequent?

I am guessing that this is due to multiple edits happening at the same time about the same subject/item/related item?

It's actually a site-wide issue. User X, edits article foo and make it to redo the usage tracking again so the job tries to update it while User Y edits article bar and make it to do the similar thing.

Sadly there is limited capabilities to make the jobqueue smarter, I think. Maybe there could be strict partitioning on ranges to prevent collision? I don't know.

Reiterating how the thing works actually gave me the idea that we can be a little bit smarter on what we inject to the database and avoid injecting things that hasn't changed. I need to double check the code but I think it removes all usages and injects them again.

There's even a TODO for it already: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/client/includes/Hooks/DataUpdateHookHandlers.php#L165

This needs to be cleaned up ASAP. I will take a look later today.

This is the jobqueue after a change that doesn't change the entity usage without my patch:

2019-06-27 13:33:04 ORESFetchScoreJob BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn revid=52 originalRequest={"ip":"127.0.0.1","userAgent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"} models=["damaging","goodfaith"] precache=1 namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=24,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 ORESFetchScoreJob BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn revid=52 originalRequest={"ip":"127.0.0.1","userAgent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"} models=["damaging","goodfaith"] precache=1 namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=24,timestamp=20190627133302) t=520 good
2019-06-27 13:33:05 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133302 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=26,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133302 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=26,timestamp=20190627133302) t=3 good
2019-06-27 13:33:05 wikibase-addUsagesForPage BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn pageId=17 usages={"Q153#S":{"entityId":"Q153","aspect":"S","modifier":null}} namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=25,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 wikibase-addUsagesForPage BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn pageId=17 usages={"Q153#S":{"entityId":"Q153","aspect":"S","modifier":null}} namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=25,timestamp=20190627133302) t=3 good

This is similar after my patch:

2019-06-27 13:33:44 recentChangesUpdate Special:RecentChanges type=cacheUpdate namespace=-1 title=RecentChanges requestId=14d17601a3a4448245792e46 (id=32,timestamp=20190627133337) STARTING
2019-06-27 13:33:44 recentChangesUpdate Special:RecentChanges type=cacheUpdate namespace=-1 title=RecentChanges requestId=14d17601a3a4448245792e46 (id=32,timestamp=20190627133337) t=7 good
2019-06-27 13:33:44 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133337 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=14d17601a3a4448245792e46 (id=30,timestamp=20190627133337) STARTING
2019-06-27 13:33:44 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133337 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=14d17601a3a4448245792e46 (id=30,timestamp=20190627133337) t=1 good

I will upload the patch soon, just need to write tests

Restricted Application added a project: User-Ladsgroup. · View Herald TranscriptJun 27 2019, 1:27 PM

Ladsgroup moved this task from To Do (prioritised from top to bottom) to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Jun 27 2019, 2:44 PM

I want to mention that the highest invoked jobs in the jobqueue currently is addUages job and by fixing this we drop 30% of the jobs in jobqueue

FYI to @Joe, @mobrovac, and @Pchelolo so if the job queue insertion rate drops by 30%, you should know that it's for good and you probably can repurpose some job runner nodes. (Out of excitement, I ping @Krinkle and Performance-Team) (Yay to saving polar bears)

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.13; 2019-07-09).Jun 27 2019, 6:00 PM

• alaa_wmde moved this task from Peer Review to Test (Verification) on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Jun 27 2019, 7:30 PM

Jobs inserted in the last hour:

Basically it dropped 91% of addEntityUsage jobs ^_^

OMG! 😘❤️🎉👍👌

Another graph:

In ChangeProp for job queue, we have a concept of concurrency - how many jobs of the same type can be running in parallel. wikibase-addUsagesForPage had a special rule with concurrency 30, so if this pattern holds we can probably ramp it down to 5. Having max concurrency close-enough to actually needed concurrency is good to avoid spikes in load. I will create a patch for that, add @Ladsgroup there, and wait for him to be sure this pattern will hold to +1 it.

Change 519518 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower wikibase-addUsagesForPage concurrency.

https://gerrit.wikimedia.org/r/519518

gerritbot added a project: Patch-For-Review.Jun 27 2019, 8:59 PM

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.11; 2019-06-26); removed MW-1.34-notes (1.34.0-wmf.13; 2019-07-09).Jun 27 2019, 9:00 PM

• jcrespo rescinded a token.Jun 28 2019, 7:58 AM

• jcrespo awarded a token.

In T205045#5288987, @Ladsgroup wrote:

I will upload the patch soon, just need to write tests

For the record, the patch is I45b699a8b8 (already backported to wmf.11) – @gerritbot didn’t pick it up because of the blank line between the “Bug:” and “Change-Id:” trailers.

I also think this improvement should’ve had a dedicated task, but it’s a bit late for that now.

Back to the original issue, the error got reduced even further and it's rare now but if it needs more work I can think of diffing between current entity usage and the new ones but it's some work. I made another task for that. T226818: Diff when updating wbc_entity_usage which hopefully will be picked up soon.

Change 519518 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower wikibase-addUsagesForPage concurrency.

https://gerrit.wikimedia.org/r/519518

• alaa_wmde moved this task from Test (Verification) to Done on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.Jul 26 2019, 1:22 PM

• alaa_wmde closed subtask T225500: Decrease EntityUsageTable addUsage batch size to 100 as Resolved.Jul 29 2019, 12:35 PM

While trying to do T226818: Diff when updating wbc_entity_usage I found a very very weird thing. The AddUsagesForPageJob, takes all entity usages of the given page and does some stuff plus adding all of the current entity usages to wbc_entity_usage table with INSERT IGNORE. There's no diffing here. It just adds all of the current ones through $this->usageUpdater->addUsagesForPage() that calls $this->usageTracker->addUsedEntities(). Then brings the question that how (or if) the removed entity usages actually gets removed from the table. I tested it locally and in production and I realized that it really removes the removed entity usages but then how? Looking at other usage of the UsageUpdater class it seems that DataUpdateHookHandlers calls $this->usageUpdater->replaceUsagesForPage() which does a pretty impressive diffing and updates the wbc_entity_usage into the proper state and then the AddUsagesForPageJob job comes in and do a INSERT IGNORE of the exact same rows that already exist into the database. It might the reason there is deadlock it's that both the hook handler (which I think is a deferred update) and the job, try to update the same rows at the same time.

One other important thing is that the way the job adds the data to the database is pretty inefficient. For example in one article that caused this issue in Ukrainian Wikipedia. We have this entity usages:

wikiadmin@10.64.48.153(ukwiki)> select * from wbc_entity_usage where eu_page_id = 1143221;
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|  67076587 | Q1026        | C.P1813   |    1143221 |
|  67076588 | Q1026        | C.P1448   |    1143221 |
|  67076589 | Q1026        | C.P1705   |    1143221 |

(Lots of rows)

+-----------+--------------+-----------+------------+
392 rows in set (0.00 sec)

But in most cases the entity usage change by one or two. Imagine I changed only one entity usage of this page from Q1026 C.P1813 to Q1026 C.P1814. The hook handler, removes Q1026 C.P1813 and adds Q1026 C.P1814. Then job comes in and happily insert ignores the whole 392 rows again, the job could completely be dropped and wbc_entity_usage would stay the same. And these types of mistakes used to happen 300 times a second and after T205045#5290536 it went down to 30 times per second.

I don't think we should completely drop the job, it does important things to the entity usage subscription that the hook handler seems to be not doing it (the job also prunes unused entity subscription but it's not related to this at all). We should just drop that part.

CC @Lydia_Pintscher and @hoo @Addshore @alaa_wmde @Lucas_Werkmeister_WMDE: Please double check everything and tell me I'm not crazy. I checked everything twice though.

One rather easy way to test my hypotheses is that you can avoid running jobs (by not running runJobs.php) in your localhost and then changing entity usages in pages and seeing wbc_entity_usage gets updated accordingly. I did it and it worked confirming my hypotheses

Yeah, this is on purpose… this is rather complex (unfortunately). The thing is that we make sure that usage are "correct" by: Immediately purging no-longer accurate usages on page touches/ re-parses on edit (it might be that we also add new usages here, but that's not technically needed), then we later on add all usage to the table that come up when parsing the page in different ways (different language, mostly, maybe exclusively).

AddUsagesForPageJob can't remove any usages as its invocation only ever represents one parse of a page (which has maybe already valid usage for another parser run of the same revision). UsageUpdater can remove usages, as it is invoked right away on the initial parse.

Possible improvements:

Either (I prefer this):

Only remove usages outdated usages in UsageUpdater (synchronous), don't add any new ones
Only INSERT actually new usages in AddUsagesForPageJob

Or:

Update usages in UsageUpdater (synchronous, on page update), like it is done now
Only inject AddUsagesForPageJob if that hasn't happened
Only INSERT actually new usages in AddUsagesForPageJob

In T205045#5382778, @hoo wrote:

Yeah, this is on purpose… this is rather complex (unfortunately). The thing is that we make sure that usage are "correct" by: Immediately purging no-longer accurate usages on page touches/ re-parses on edit (it might be that we also add new usages here, but that's not technically needed), then we later on add all usage to the table that come up when parsing the page in different ways (different language, mostly, maybe exclusively).

Thanks for explaining, Since it's intentional, I don't replace the thing but I have struggle understanding what is the usecase. Can you give me an example of a case that the hook fails to add the new usages but the job can do?

Possible improvements:

Either (I prefer this):

Only remove usages outdated usages in UsageUpdater (synchronous), don't add any new ones

Only INSERT actually new usages in AddUsagesForPageJob

Or:

Update usages in UsageUpdater (synchronous, on page update), like it is done now

Only inject AddUsagesForPageJob if that hasn't happened

Only INSERT actually new usages in AddUsagesForPageJob

The latter is sorta already implemented. The second bullet point is already live.

Change 527131 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

Change 527131 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.17; 2019-08-06); removed MW-1.34-notes (1.34.0-wmf.11; 2019-06-26).Aug 2 2019, 4:00 PM

Change 528072 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Change 528072 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Mentioned in SAL (#wikimedia-operations) [2019-08-05T11:29:41Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/Wikibase: SWAT: rEWBA3ecaa57561e8: Add only needed entity usages in AddUsagesForPageJob (T226818, T205045) (duration: 01m 12s)

ReleaseTaggerBot edited projects, added MW-1.34-notes (1.34.0-wmf.16; 2019-07-30); removed MW-1.34-notes (1.34.0-wmf.17; 2019-08-06).Aug 5 2019, 12:00 PM

In T205045#5290536, @Ladsgroup wrote:

Jobs inserted in the last hour:

Basically it dropped 91% of addEntityUsage jobs ^_^

While measuring stuff for the new changes I realized this change on June 27th actually reduced 4% of traffic incoming to MySQL nodes (or 3MB/s)

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM

• alaa_wmde closed subtask T226818: Diff when updating wbc_entity_usage as Resolved.Oct 1 2019, 2:49 PM

• alaa_wmde closed this task as Resolved.Oct 1 2019, 2:52 PM

	F29658175: image.png
	Jun 27 2019, 8:45 PM

	F29658135: image.png
	Jun 27 2019, 8:43 PM

	F29655363: image.png
	Jun 27 2019, 2:55 PM

	F29604901: image.png
	Jun 19 2019, 12:11 PM

Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages)Closed, ResolvedPublic3 Estimated Story PointsPRODUCTION ERRORActions