
Diff when updating wbc_entity_usage
Closed, Resolved · Public · 8 Estimated Story Points

Description

Motivation
There are lots of queries happening when updating the entity usage table (around 30 per second) in the addEntityUsage job. These queries are big and thus hold database locks for too long, causing lots of issues, including long lock times and database errors due to their size. And since this job gets triggered ~30 times every second, any improvement to its performance greatly affects the database and the job queue.

Problem
Instead of diffing the new and old entity usages of a page, the job removes all entity usages of the given page and inserts the new ones into the database.
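
For illustration, here is a minimal sketch of that pre-fix behaviour, using MediaWiki's IDatabase interface. The function name and row-building are hypothetical; the real logic lives in Wikibase's entity usage tracking code.

```php
<?php
use Wikimedia\Rdbms\IDatabase;

// Hypothetical sketch of the old behaviour: wipe and rewrite everything.
function replaceAllUsagesForPage( IDatabase $db, int $pageId, array $newUsages ): void {
	// Delete every usage row for the page, even the ones that did not change...
	$db->delete( 'wbc_entity_usage', [ 'eu_page_id' => $pageId ], __METHOD__ );

	// ...then re-insert the full set. With many usages per page and ~30 job
	// runs per second, this holds write locks far longer than necessary.
	$rows = [];
	foreach ( $newUsages as $usage ) {
		$rows[] = [
			'eu_page_id' => $pageId,
			'eu_entity_id' => $usage['entityId'],
			'eu_aspect' => $usage['aspect'],
		];
	}
	$db->insert( 'wbc_entity_usage', $rows, __METHOD__ );
}
```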

Suggested Solution / Technical Details
Diff against the current entity usage values: insert only the usages that were added and delete only the ones that were removed. Most entity usages of a page are unchanged, and there is no need to delete and re-insert all of them every time. A sketch of this approach follows below.
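
A minimal sketch of the suggested diffing approach, with hypothetical helper logic and key format; the merged change (Gerrit 527131) implements this inside AddUsagesForPageJob.

```php
<?php
use Wikimedia\Rdbms\IDatabase;

// Hypothetical sketch of the fix: only touch rows that actually changed.
function updateUsagesForPage( IDatabase $db, int $pageId, array $newUsages ): void {
	// Read the usages currently stored for this page, keyed e.g. as "Q64#L.de".
	$res = $db->select(
		'wbc_entity_usage',
		[ 'eu_entity_id', 'eu_aspect' ],
		[ 'eu_page_id' => $pageId ],
		__METHOD__
	);
	$current = [];
	foreach ( $res as $row ) {
		$current[] = $row->eu_entity_id . '#' . $row->eu_aspect;
	}

	$wanted = array_map(
		static function ( array $usage ): string {
			return $usage['entityId'] . '#' . $usage['aspect'];
		},
		$newUsages
	);

	$toAdd = array_diff( $wanted, $current );    // usages to insert
	$toRemove = array_diff( $current, $wanted ); // stale usages to delete

	foreach ( $toRemove as $key ) {
		[ $entityId, $aspect ] = explode( '#', $key, 2 );
		$db->delete(
			'wbc_entity_usage',
			[ 'eu_page_id' => $pageId, 'eu_entity_id' => $entityId, 'eu_aspect' => $aspect ],
			__METHOD__
		);
	}
	foreach ( $toAdd as $key ) {
		[ $entityId, $aspect ] = explode( '#', $key, 2 );
		$db->insert(
			'wbc_entity_usage',
			[ 'eu_page_id' => $pageId, 'eu_entity_id' => $entityId, 'eu_aspect' => $aspect ],
			__METHOD__
		);
	}
	// The unchanged majority of rows is never deleted or re-inserted.
}
```

The per-row delete/insert loops above are for clarity only; a real implementation would batch the conditions into single queries.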

Event Timeline

alaa_wmde subscribed.

@Ladsgroup I'd love to experiment with something here on this task with you.
That is, to find the right shape for our tasks and the right amount of info that needs to go into them to be "Ready to Estimate/Pick-up".
Want to try preparing this task for that and see how it goes in the next story time?

I'd suggest the following sections (but up to you of course how you end up structuring it):


Motivation
Why should we do this task? What's the value? Numbers and links to dashboards are great here.

Problem
A concrete description of the root cause of the problem we need to fix. Explaining the current setup of things might be part of this section, if some devs are expected to not know much about the setup in this area.

Suggested Solution / Technical Details
A simple outline of a proposed solution, or a simple list of technical notes/knowledge to get other devs closer to knowing what they will need to do or deal with.


@Ladsgroup thanks for updating the description, looks fancy and hopefully useful :) let's see in story time

Maybe we move this to trailblazing exploration if we don't know (or if it isn't clear) whether the technical solution is obvious? @Ladsgroup did you put any thought into whether diffing here is a relatively simple thing to do?

> Maybe we move this to trailblazing exploration if we don't know (or if it isn't clear) whether the technical solution is obvious? @Ladsgroup did you put any thought into whether diffing here is a relatively simple thing to do?

I think it would be rather simple; it's definitely simpler than most of the tasks we put up for the campsite.

Change 527131 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

Change 527131 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/527131

Change 528072 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Change 528072 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.34.0-wmf.16] Add only needed entity usages in AddUsagesForPageJob

https://gerrit.wikimedia.org/r/528072

Mentioned in SAL (#wikimedia-operations) [2019-08-05T11:29:41Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/Wikibase: SWAT: rEWBA3ecaa57561e8: Add only needed entity usages in AddUsagesForPageJob (T226818, T205045) (duration: 01m 12s)

@Ladsgroup is there a way to test this locally and/or on beta?

Also, out of curiosity: this seems to have been patched and merged on the same day, and then backported. Yet I can't find any info regarding the urgency in the task or the commit messages... did I miss something?

> @Ladsgroup is there a way to test this locally and/or on beta?

You can probably look at the logstash job specifications and compare them with their current values, but it's really hard to test, I admit.

> Also, out of curiosity: this seems to have been patched and merged on the same day, and then backported. Yet I can't find any info regarding the urgency in the task or the commit messages... did I miss something?

The patch was made on Thursday, got merged on Friday, and was backported on Monday; those are three different days. But as a rule of thumb, anything that reduces production errors can get backported. It also makes checking the impact on fatals/errors easier (and not alongside twenty other Wikibase patches).

> The patch was made on Thursday, got merged on Friday, and was backported on Monday; those are three different days. But as a rule of thumb, anything that reduces production errors can get backported. It also makes checking the impact on fatals/errors easier (and not alongside twenty other Wikibase patches).

Sorry, I wasn't explicit enough: I was referring to the backport patch itself. I was wondering why we didn't wait for the next train, which you also answered in your reply... all good!

> You can probably look at the logstash job specifications and compare them with their current values, but it's really hard to test, I admit.

Doesn't seem like something I would want to do. Is there any graph/metric that should change noticeably due to this change, to prove to us it is working? That would be enough too. Moving it to Done anyway for now, as nothing seems to have broken due to this, and any fixes are likely to be covered by a new task.