
Exception from LinksUpdate: Deadlock found in database query (from Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages)
Closed, Resolved · Public · 3 Estimated Story Points · Production Error

Description

Error

Request ID: W6P5eArAIEsAAAYwlAIAAABI (https://commons.wikimedia.org/w/index.php?title=File:Kuindzhi_Sunflowers_Crimea_1880s.jpg&action=edit)

DBQuery error
Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages	10.192.16.22	1213
Deadlock found when trying to get lock; try restarting transaction (10.192.16.22)

INSERT IGNORE INTO `wbc_entity_usage` (eu_page_id,eu_aspect,eu_entity_id) VALUES ('23650776','L.en','Q656'),('23650776','S','Q656'),('23650776','O','Q353628'),('23650776','L.en','Q353628'),('23650776','S','Q353628'),('23650776','S','Q54919'),('23650776','S','Q423048'),('23650776','S','Q2494649'),('23650776','S','Q13219454'),('23650776','S','Q131454'),('23650776','S','Q36578'),('23650776','S','Q47757534'),('23650776','S','Q2826570'),('23650776','S','Q188915'),('23650776','S','Q1201876'),('23650776','S','Q1526131'),('23650776','L.en','Q1028181'),('23650776','L.en','Q1622272'),('23650776','L.en','Q37133'),('23650776','S','Q37133'),('23650776','L.en','Q36524'),('23650776','S','Q36524'),('23650776','L.en','Q208826'),('23650776','O','Q174728'),('23650776','O','Q218593'),('23650776','L.en','Q35059'),('23650776','L.en','Q159'),('23650776','S','Q159'),('23650776','O','Q211043'),('23650776','L.en','Q211043'),('23650776','S','Q211043'),('23650776','S','Q19938912'),('23650776','S','Q193563'),('23650776','S','Q623578'),('23650776','S','Q13550863'),('23650776','S','Q1967876'),('23650776','S','Q50339681'),('23650776','L.en','Q36774')
exception
Function: Wikibase\Client\Usage\Sql\EntityUsageTable::addUsages
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.192.16.22)

#2 /srv/mediawiki/php-1.32.0-wmf.22/includes/libs/rdbms/database/Database.php(2030): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Usage/Sql/EntityUsageTable.php(184): Wikimedia\Rdbms\Database->insert(string, array, string, string)
#4 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Usage/Sql/SqlUsageTracker.php(212): Wikibase\Client\Usage\Sql\EntityUsageTable->addUsages(integer, array)
#5 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Store/UsageUpdater.php(107): Wikibase\Client\Usage\Sql\SqlUsageTracker->replaceUsedEntities(integer, array)
#6 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php(139): Wikibase\Client\Store\UsageUpdater->replaceUsagesForPage(integer, array)
#7 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php(63): Wikibase\Client\Hooks\DataUpdateHookHandlers->doLinksUpdateComplete(LinksUpdate)
#8 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(174): Wikibase\Client\Hooks\DataUpdateHookHandlers::onLinksUpdateComplete(LinksUpdate, integer)
#9 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#10 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/LinksUpdate.php(186): Hooks::run(string, array)
#11 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/AutoCommitUpdate.php(42): Closure$LinksUpdate::doUpdate(Wikimedia\Rdbms\DatabaseMysqli, string)
#12 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(268): AutoCommitUpdate->doUpdate()
#13 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(226): DeferredUpdates::runUpdate(AutoCommitUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#14 /srv/mediawiki/php-1.32.0-wmf.22/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#15 /srv/mediawiki/php-1.32.0-wmf.22/includes/MediaWiki.php(914): DeferredUpdates::doUpdates(string)
#16 /srv/mediawiki/php-1.32.0-wmf.22/includes/MediaWiki.php(734): MediaWiki->restInPeace(string, boolean)
#17 [internal function]: Closure$MediaWiki::doPostOutputShutdown()

Notes

Logstash has 835 records matching this error from the past 30 days (at least since 1.32.0-wmf.16).

Affects commons.wikimedia.org post-edit, ru.wikipedia.org post-edit, and various job executions.


Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 20 2018, 9:59 PM

Logstash has 835 records matching this error from the past 30 days (at least since 1.32.0-wmf.16).

According to my query this is now around 5,300 for the last 30 days?
If my snapshot & query are correct: https://logstash.wikimedia.org/goto/1638470f94494c4efb6557f6e550caae
It looks like this has been increasing.

It looks like the main increase was around the 1st/2nd of November 2018.

addshore@mwlog1001:/srv/mw-log$ zgrep -c 'Deadlock.*EntityUsageTable::addUsages' archive/DBQuery.log-2018* | grep -v :0
archive/DBQuery.log-20181015.gz:42
archive/DBQuery.log-20181016.gz:22
archive/DBQuery.log-20181017.gz:30
archive/DBQuery.log-20181018.gz:28
archive/DBQuery.log-20181019.gz:26
archive/DBQuery.log-20181020.gz:35
archive/DBQuery.log-20181021.gz:33
archive/DBQuery.log-20181022.gz:32
archive/DBQuery.log-20181023.gz:30
archive/DBQuery.log-20181024.gz:25
archive/DBQuery.log-20181025.gz:33
archive/DBQuery.log-20181026.gz:32
archive/DBQuery.log-20181027.gz:33
archive/DBQuery.log-20181028.gz:34
archive/DBQuery.log-20181029.gz:31
archive/DBQuery.log-20181030.gz:38
archive/DBQuery.log-20181031.gz:18
archive/DBQuery.log-20181101.gz:64
archive/DBQuery.log-20181102.gz:141
archive/DBQuery.log-20181103.gz:243
archive/DBQuery.log-20181104.gz:381
archive/DBQuery.log-20181105.gz:112
archive/DBQuery.log-20181106.gz:134
archive/DBQuery.log-20181107.gz:183
archive/DBQuery.log-20181108.gz:188
archive/DBQuery.log-20181110.gz:346
archive/DBQuery.log-20181111.gz:163
archive/DBQuery.log-20181112.gz:250
archive/DBQuery.log-20181113.gz:181
archive/DBQuery.log-20181114.gz:205
archive/DBQuery.log-20181115.gz:161
archive/DBQuery.log-20181116.gz:185
archive/DBQuery.log-20181117.gz:206
archive/DBQuery.log-20181118.gz:172
archive/DBQuery.log-20181119.gz:152
archive/DBQuery.log-20181120.gz:393
archive/DBQuery.log-20181121.gz:134
archive/DBQuery.log-20181122.gz:167
archive/DBQuery.log-20181123.gz:143
archive/DBQuery.log-20181124.gz:153
archive/DBQuery.log-20181125.gz:211
archive/DBQuery.log-20181126.gz:786
archive/DBQuery.log-20181127.gz:83
archive/DBQuery.log-20181128.gz:65
archive/DBQuery.log-20181129.gz:132
archive/DBQuery.log-20181130.gz:157
archive/DBQuery.log-20181201.gz:164
archive/DBQuery.log-20181202.gz:149
archive/DBQuery.log-20181203.gz:117
archive/DBQuery.log-20181204.gz:108
archive/DBQuery.log-20181205.gz:114
archive/DBQuery.log-20181206.gz:113
archive/DBQuery.log-20181207.gz:127
archive/DBQuery.log-20181208.gz:482
archive/DBQuery.log-20181209.gz:568
archive/DBQuery.log-20181210.gz:170
archive/DBQuery.log-20181211.gz:121
archive/DBQuery.log-20181212.gz:207
archive/DBQuery.log-20181213.gz:152
archive/DBQuery.log-20181214.gz:280
archive/DBQuery.log-20181215.gz:306
archive/DBQuery.log-20181216.gz:266
archive/DBQuery.log-20181217.gz:203
archive/DBQuery.log-20181218.gz:176
archive/DBQuery.log-20181219.gz:160
archive/DBQuery.log-20181220.gz:120
archive/DBQuery.log-20181221.gz:111
archive/DBQuery.log-20181222.gz:117
archive/DBQuery.log-20181223.gz:249
archive/DBQuery.log-20181224.gz:224
archive/DBQuery.log-20181225.gz:116
archive/DBQuery.log-20181226.gz:141
archive/DBQuery.log-20181227.gz:122
archive/DBQuery.log-20181228.gz:187
archive/DBQuery.log-20181229.gz:320
archive/DBQuery.log-20181230.gz:213
archive/DBQuery.log-20181231.gz:124

The increase could be a further regression between 1.33.0-wmf.1 and 1.33.0-wmf.2:

addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181030.gz |grep addUsages | grep wmf.1 | wc -l
38
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181030.gz |grep addUsages | grep wmf.2 | wc -l
0
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181101.gz |grep addUsages | grep wmf.1 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181101.gz |grep addUsages | grep wmf.2 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181102.gz |grep addUsages | grep wmf.1 | wc -l
32
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181102.gz |grep addUsages | grep wmf.2 | wc -l
109
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181103.gz |grep addUsages | grep wmf.1 | wc -l
18
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181103.gz |grep addUsages | grep wmf.2 | wc -l
225
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181104.gz |grep addUsages | grep wmf.1 | wc -l
41
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181104.gz |grep addUsages | grep wmf.2 | wc -l
340
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181105.gz |grep addUsages | grep wmf.1 | wc -l
22
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181105.gz |grep addUsages | grep wmf.2 | wc -l
90
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181106.gz |grep addUsages | grep wmf.1 | wc -l
12
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181106.gz |grep addUsages | grep wmf.2 | wc -l
122
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181107.gz |grep addUsages | grep wmf.1 | wc -l
0
addshore@mwlog1001:/srv/mw-log$ zcat archive/DBQuery.log-20181107.gz |grep addUsages | grep wmf.2 | wc -l
183

Maybe EntityUsageTable::INSERT_BATCH_SIZE_FACTOR should be changed from 5 to 3/2.5/2?

Addshore triaged this task as Medium priority. · Jan 14 2019, 10:36 PM
Addshore added a project: Wikidata-Campsite.
Addshore moved this task from Incoming to Needs Work on the Wikidata-Campsite board.

Maybe EntityUsageTable::INSERT_BATCH_SIZE_FACTOR should be changed from 5 to 3/2.5/2?

None of the queries failing with deadlocks seem to be anywhere near the 100*5=500 insert limit, but I guess it's not necessarily the inserts in the logs causing the problems, but the ones that are actually running at that time.
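
For context, a minimal sketch of how that 100*5=500 limit comes about, assuming a base batch size of 100 and simple chunking; the constant name is taken from the discussion above, everything else is illustrative and not the actual Wikibase code.

<?php
// Illustrative sketch only: how a batch-size factor turns a list of usage rows
// into a number of INSERT statements. The base batch size of 100 and the
// class/method names here are assumptions, not the real Wikibase code.
class EntityUsageBatchSketch {
    // Lowering the factor (5 -> 3 -> 1) shrinks each INSERT, so locks are held
    // for a shorter time, at the cost of more round trips.
    private const INSERT_BATCH_SIZE_FACTOR = 5;
    private const BASE_BATCH_SIZE = 100;

    /**
     * @param array[] $rows rows like [ 'eu_page_id' => 23650776, 'eu_aspect' => 'S', 'eu_entity_id' => 'Q656' ]
     * @return array[][] chunks of at most 500 rows each (with factor 5)
     */
    public function chunkRows( array $rows ): array {
        $batchSize = self::BASE_BATCH_SIZE * self::INSERT_BATCH_SIZE_FACTOR;
        // Each chunk would then become one "INSERT IGNORE INTO wbc_entity_usage ..."
        // statement like the one shown in the task description.
        return array_chunk( $rows, $batchSize );
    }
}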

So we changed from 500 to 300. It's still early, but I can see that the deadlocks are still happening, though perhaps at a lower frequency?

I think so too; in MediaWiki it's usually 100.

I think so too; in MediaWiki it's usually 100.

Is there a legacy reason for us to be using such a high batch size that you are aware of?

I think so too; in MediaWiki it's usually 100.

Is there a legacy reason for us to be using such a high batch size that you are aware of?

I'm not aware of any reason whatsoever. It might be that there were lots of additions back then (before xkill) and we couldn't use a lower batch size, but that's just speculation.

FYI, because in the past people have told me they were not aware of the impact: I am not too worried about this because I don't think it creates user impact (it only affects the jobqueue), but this is currently probably the most frequent Wikibase database error, and one of the top ones on the Wikimedia database infrastructure (probably the top one except for T221577): https://logstash.wikimedia.org/goto/a224c66bbf0cc9969cfc4e71bc3c2e7a

@jcrespo thank you! That's useful information.

@Ladsgroup @Addshore What's needed to get this ready for pickup in camp?

@jcrespo thank you! That's useful information.

@Ladsgroup @Addshore What's needed to get this ready for pickup in camp?

It's actually just a config change but we need someone and DBAs to babysit the deployment for a couple of days to make sure the number is not too low.

and DBAs to babysit the deployment for a couple of days

I can do that if you tell me anything specific (metrics) to check for, other than the typical error count and general stability. Especially if the initial deployment happens at the beginning of next week, for example, near the EU timezone.

and DBAs to babysit the deployment for a couple of days

I can do that if you tell me anything specific (metrics) to check for, other than the typical error count and general stability. Especially if the initial deployment happens at the beginning of next week, for example, near the EU timezone.

It will increase the number of write queries (and probably increase the run time of inject-RC jobs) on all wikis, especially Commons and ruwiki. Is it possible to monitor that? Also, it would generally be great to make sure nothing goes out of the ordinary in any metrics.

WMDE-leszek set the point value for this task to 3. · Jun 4 2019, 12:49 PM

It has improved since the deployment (dropping it to 200):

image.png (297×1 px, 42 KB)

Thanks, that looks good. I saw no significant impact on DB infra; if it does more queries/transactions, there was no noticeable issue. Did it impact lag/performance on your side?

I see a gap here last week (https://logstash.wikimedia.org/goto/b4170e5964398cfb6ca2769f59479faa): an issue, or unrelated?

Thanks, that looks good. I saw no significant impact on DB infra; if it does more queries/transactions, there was no noticeable issue. Did it impact lag/performance on your side?

Not anything I know so far.

I see a gap here last week (https://logstash.wikimedia.org/goto/b4170e5964398cfb6ca2769f59479faa): an issue, or unrelated?

It seems unrelated. We didn't do anything on that day (especially since it was a deployment freeze).

Apparently there are jobqueue issues, potentially contributing to this: T226109

Okay, after decreasing the batch size to 100, the errors have been reduced but not eliminated. The number is still somewhat high, but I don't know whether the DBAs think this is bearable or whether it should be reduced further.

Okay, after decreasing the batch size to 100, the errors have been reduced but not eliminated. The number is still somewhat high, but I don't know whether the DBAs think this is bearable or whether it should be reduced further.

Could we do something about the error itself? The error message says Deadlock found when trying to get lock; try restarting transaction so we could try to catch it, log it and then try again a second or so later? Otherwise, it seems that the data that it is trying to write to the DB is just lost?
(Not sure if that makes sense, DB is not my strong suit.)

Could we do something about the error itself? The error message says Deadlock found when trying to get lock; try restarting transaction so we could try to catch it, log it and then try again a second or so later? Otherwise, it seems that the data that it is trying to write to the DB is just lost?
(Not sure if that makes sense, DB is not my strong suit.)

Because it's done through a job, the error bubbles up and the job gets retried, so data loss is not a big concern here.
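
Purely to illustrate the catch-and-retry idea from the comment above (the job-queue retry described here is what actually happens), a sketch could look like the following; the exception class and its errno field are assumptions about the rdbms library, and none of this is taken from the Wikibase code.

<?php
// Sketch of "catch the deadlock, wait briefly, retry once"; see assumptions above.
use Wikimedia\Rdbms\DBQueryError;

function insertWithRetry( callable $doInsert, int $maxAttempts = 2 ): void {
    for ( $attempt = 1; $attempt <= $maxAttempts; $attempt++ ) {
        try {
            $doInsert();
            return;
        } catch ( DBQueryError $e ) {
            // 1213 = "Deadlock found when trying to get lock" (MySQL/MariaDB).
            // Assumption: the error number is exposed as $e->errno.
            if ( $e->errno !== 1213 || $attempt === $maxAttempts ) {
                throw $e;
            }
            // Back off a little so the competing transaction can finish.
            usleep( 500000 * $attempt );
        }
    }
}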

Yes, we were worried about the frequency of it (it was the most common DB-related error for some time) because of the performance implications, not that it happens sometimes. I am guessing that this is due to multiple edits happening at the same time about the same subject/item/related item? Sadly there are limited capabilities to make the jobqueue smarter, I think. Maybe there could be strict partitioning on ranges to prevent collisions? I don't know.

Yes, we were worried about the frequency of it (it was the most common DB-related error for some time) because of the performance implications, not that it happens sometimes.

I agree, but is the current state acceptable or do you think it should be less frequent?

I am guessing that this is due to multiple edits happening at the same time about the same subject/item/related item?

It's actually a site-wide issue. User X edits article foo, which triggers the usage tracking to be redone, so the job tries to update it; meanwhile User Y edits article bar and triggers the same kind of update.

Sadly there are limited capabilities to make the jobqueue smarter, I think. Maybe there could be strict partitioning on ranges to prevent collisions? I don't know.

Reiterating how this works actually gave me the idea that we can be a little smarter about what we inject into the database and avoid injecting things that haven't changed. I need to double-check the code, but I think it removes all usages and injects them again.
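
A rough sketch of that "only inject what changed" idea, keying usages by an entity-id/aspect string like the Q153#S keys visible in the job parameters further down; this is illustrative pseudocode, not the actual Wikibase diffing logic.

<?php
// Sketch: diff old vs. new usages so only real changes touch the database.
// The "Q153#S"-style keys mirror the job parameters shown in the logs below;
// everything else is an assumption for illustration.
function diffUsages( array $oldUsages, array $newUsages ): array {
    // $oldUsages / $newUsages: maps like [ 'Q153#S' => ..., 'Q1026#C.P1813' => ... ]
    return [
        'toAdd'    => array_diff_key( $newUsages, $oldUsages ), // insert only these
        'toRemove' => array_diff_key( $oldUsages, $newUsages ), // delete only these
        // Rows in the intersection are left alone, so no lock is taken on them.
    ];
}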

This is the jobqueue, without my patch, after an edit that doesn't change the entity usage:

2019-06-27 13:33:04 ORESFetchScoreJob BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn revid=52 originalRequest={"ip":"127.0.0.1","userAgent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"} models=["damaging","goodfaith"] precache=1 namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=24,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 ORESFetchScoreJob BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn revid=52 originalRequest={"ip":"127.0.0.1","userAgent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"} models=["damaging","goodfaith"] precache=1 namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=24,timestamp=20190627133302) t=520 good
2019-06-27 13:33:05 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133302 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=26,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133302 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=26,timestamp=20190627133302) t=3 good
2019-06-27 13:33:05 wikibase-addUsagesForPage BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn pageId=17 usages={"Q153#S":{"entityId":"Q153","aspect":"S","modifier":null}} namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=25,timestamp=20190627133302) STARTING
2019-06-27 13:33:05 wikibase-addUsagesForPage BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn pageId=17 usages={"Q153#S":{"entityId":"Q153","aspect":"S","modifier":null}} namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=195f60141e8534971e7f0ce0 (id=25,timestamp=20190627133302) t=3 good

This is the same scenario after my patch:

2019-06-27 13:33:44 recentChangesUpdate Special:RecentChanges type=cacheUpdate namespace=-1 title=RecentChanges requestId=14d17601a3a4448245792e46 (id=32,timestamp=20190627133337) STARTING
2019-06-27 13:33:44 recentChangesUpdate Special:RecentChanges type=cacheUpdate namespace=-1 title=RecentChanges requestId=14d17601a3a4448245792e46 (id=32,timestamp=20190627133337) t=7 good
2019-06-27 13:33:44 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133337 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=14d17601a3a4448245792e46 (id=30,timestamp=20190627133337) STARTING
2019-06-27 13:33:44 htmlCacheUpdate BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn table=templatelinks recursive=1 rootJobIsSelf=1 rootJobSignature=f08fcf50a2e424957ae78cccadc76ef8a2f9a62f rootJobTimestamp=20190627133337 causeAction=page-edit causeAgent=unknown namespace=0 title=BeforeEach-name-0.01342378469370753-Iñtërnâtiônàlizætiøn requestId=14d17601a3a4448245792e46 (id=30,timestamp=20190627133337) t=1 good

I will upload the patch soon, just need to write tests

I want to mention that the most frequently invoked job in the jobqueue currently is the addUsages job, and by fixing this we drop 30% of the jobs in the jobqueue.

image.png (665×1 px, 254 KB)

FYI to @Joe, @mobrovac, and @Pchelolo: if the job queue insertion rate drops by 30%, you should know that it's intentional, and you can probably repurpose some job runner nodes. (Out of excitement, I ping @Krinkle and Performance-Team.) (Yay for saving polar bears.)

Jobs inserted in the last hour:

image.png (572×1 px, 54 KB)

Basically it dropped 91% of addEntityUsage jobs ^_^

OMG! 😘❤️🎉👍👌

In ChangeProp for the job queue we have a concept of concurrency: how many jobs of the same type can be running in parallel. wikibase-addUsagesForPage had a special rule with concurrency 30, so if this pattern holds we can probably ramp it down to 5. Having the max concurrency close enough to the actually needed concurrency is good to avoid spikes in load. I will create a patch for that, add @Ladsgroup there, and wait for him to +1 it once he is sure this pattern will hold.

Change 519518 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower wikibase-addUsagesForPage concurrency.

https://gerrit.wikimedia.org/r/519518

I will upload the patch soon, just need to write tests

For the record, the patch is I45b699a8b8 (already backported to wmf.11) – @gerritbot didn't pick it up because of the blank line between the "Bug:" and "Change-Id:" trailers.

I also think this improvement should've had a dedicated task, but it's a bit late for that now.

Back to the original issue: the error has been reduced even further and is rare now. If it needs more work, I can think of diffing between the current entity usages and the new ones, but that's some work. I made another task for that, T226818: Diff when updating wbc_entity_usage, which will hopefully be picked up soon.

Change 519518 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Lower wikibase-addUsagesForPage concurrency.

https://gerrit.wikimedia.org/r/519518

While trying to do T226818: Diff when updating wbc_entity_usage I found a very weird thing. The AddUsagesForPageJob takes all entity usages of the given page, does some stuff, and then adds all of the current entity usages to the wbc_entity_usage table with INSERT IGNORE. There's no diffing here; it just adds all of the current ones through $this->usageUpdater->addUsagesForPage(), which calls $this->usageTracker->addUsedEntities(). That raises the question of how (or whether) removed entity usages actually get removed from the table. I tested it locally and in production and saw that removed entity usages really do get removed, so how? Looking at the other usage of the UsageUpdater class, it seems that DataUpdateHookHandlers calls $this->usageUpdater->replaceUsagesForPage(), which does a pretty impressive diff and brings wbc_entity_usage into the proper state, and then the AddUsagesForPageJob comes in and does an INSERT IGNORE of the exact same rows that already exist in the database. The reason for the deadlock might be that both the hook handler (which I think is a deferred update) and the job try to update the same rows at the same time.

One other important thing: the way the job adds the data to the database is pretty inefficient. For example, for one article that caused this issue on Ukrainian Wikipedia, we have these entity usages:

wikiadmin@10.64.48.153(ukwiki)> select * from wbc_entity_usage where eu_page_id = 1143221;
+-----------+--------------+-----------+------------+
| eu_row_id | eu_entity_id | eu_aspect | eu_page_id |
+-----------+--------------+-----------+------------+
|  67076587 | Q1026        | C.P1813   |    1143221 |
|  67076588 | Q1026        | C.P1448   |    1143221 |
|  67076589 | Q1026        | C.P1705   |    1143221 |

(Lots of rows)

+-----------+--------------+-----------+------------+
392 rows in set (0.00 sec)

But in most cases the entity usages change by only one or two. Imagine I changed just one entity usage of this page, from Q1026 C.P1813 to Q1026 C.P1814. The hook handler removes Q1026 C.P1813 and adds Q1026 C.P1814. Then the job comes in and happily INSERT IGNOREs the whole 392 rows again; the job could be dropped completely and wbc_entity_usage would stay the same. These kinds of redundant writes used to happen 300 times a second, and after T205045#5290536 that went down to 30 times per second.

I don't think we should completely drop the job; it does important things for the entity usage subscriptions that the hook handler doesn't seem to do (the job also prunes unused entity subscriptions, but that's not related to this at all). We should just drop that part.
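
As an illustration of "just drop that part": a sketch of a job that reads the page's existing rows first and only INSERTs the ones that are actually missing. Table and column names come from the query above; the IDatabase calls follow the rdbms interface seen in the stack trace, but the function itself is an assumption, not the real patch.

<?php
// Sketch: insert only usages that are not already in wbc_entity_usage,
// instead of INSERT IGNORE-ing all ~392 rows on every run.
use Wikimedia\Rdbms\IDatabase;

function addMissingUsages( IDatabase $dbw, int $pageId, array $newRows ): void {
    $res = $dbw->select(
        'wbc_entity_usage',
        [ 'eu_entity_id', 'eu_aspect' ],
        [ 'eu_page_id' => $pageId ],
        __METHOD__
    );

    $existing = [];
    foreach ( $res as $row ) {
        $existing[ $row->eu_entity_id . '#' . $row->eu_aspect ] = true;
    }

    // Keep only rows whose entity/aspect pair is not stored yet.
    $missing = array_filter( $newRows, static function ( array $row ) use ( $existing ) {
        return !isset( $existing[ $row['eu_entity_id'] . '#' . $row['eu_aspect'] ] );
    } );

    if ( $missing ) {
        // IGNORE still guards against a race with a concurrent writer.
        $dbw->insert( 'wbc_entity_usage', array_values( $missing ), __METHOD__, [ 'IGNORE' ] );
    }
}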

CC @Lydia_Pintscher and @hoo @Addshore @alaa_wmde @Lucas_Werkmeister_WMDE: Please double check everything and tell me I'm not crazy. I checked everything twice though.

One rather easy way to test my hypothesis is to avoid running jobs (by not running runJobs.php) on your localhost, then change entity usages on pages and see whether wbc_entity_usage gets updated accordingly. I did that and it worked, confirming my hypothesis.

Yeah, this is on purpose… this is rather complex (unfortunately). The thing is that we make sure usages are "correct" by immediately purging no-longer-accurate usages on page touches / re-parses on edit (it might be that we also add new usages here, but that's not technically needed), and then later adding to the table all usages that come up when parsing the page in different ways (different languages, mostly, maybe exclusively).

AddUsagesForPageJob can't remove any usages, as its invocation only ever represents one parse of a page (which may already have valid usages from another parser run of the same revision). UsageUpdater can remove usages, as it is invoked right away on the initial parse.

Possible improvements:

Either (I prefer this):

  • Only remove outdated usages in UsageUpdater (synchronous), don't add any new ones
  • Only INSERT actually new usages in AddUsagesForPageJob

Or:

  • Update usages in UsageUpdater (synchronous, on page update), like it is done now
  • Only inject AddUsagesForPageJob if that hasn't happened
  • Only INSERT actually new usages in AddUsagesForPageJob

Yeah, this is on purpose… this is rather complex (unfortunately). The thing is that we make sure usages are "correct" by immediately purging no-longer-accurate usages on page touches / re-parses on edit (it might be that we also add new usages here, but that's not technically needed), and then later adding to the table all usages that come up when parsing the page in different ways (different languages, mostly, maybe exclusively).

Thanks for explaining. Since it's intentional, I won't replace it, but I struggle to understand what the use case is. Can you give me an example of a case where the hook fails to add the new usages but the job can?

Possible improvements:

Either (I prefer this):

  • Only remove outdated usages in UsageUpdater (synchronous), don't add any new ones
  • Only INSERT actually new usages in AddUsagesForPageJob

Or:

  • Update usages in UsageUpdater (synchronous, on page update), like it is done now
  • Only inject AddUsagesForPageJob if that hasn't happened
  • Only INSERT actually new usages in AddUsagesForPageJob

The latter is sort of already implemented; the second bullet point is already live.

Change 527131 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

Change 527131 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

Change 528072 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Change 528072 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Mentioned in SAL (#wikimedia-operations) [2019-08-05T11:29:41Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/Wikibase: SWAT: rEWBA3ecaa57561e8: Add only needed entity usages in AddUsagesForPageJob (T226818, T205045) (duration: 01m 12s)

Jobs inserted in the last hour:

image.png (572×1 px, 54 KB)

Basically it dropped 91% of addEntityUsage jobs ^_^

While measuring things for the new changes, I realized the change from June 27th actually reduced incoming traffic to the MySQL nodes by 4% (or 3 MB/s).

• mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:08 PM