Page MenuHomePhabricator

Deduplicate based on entity id
Open, Needs TriagePublic

Description

It's currently doesn't de-duplicate based on entity id.

Event Timeline

To be honest, I'm not sure why this isn't working yet. DispatchChangesJob has both $this->removeDuplicates = true; in the constructor and overrides getDeduplicationInfo:

	public function getDeduplicationInfo(): array {
		return [
			'entityId' => $this->entityIdSerialization,
		];
	}

What have I missed?

not necessarily anything. if we can somehow confirm it, it would be great. In production we have metrics for deduplication, we can deploy it to test wikidata and then see if it actually deduplicates or not.

Assuming I found the right dashboard+panel, I’m not seeing any DispatchChanges deduplication

Maybe we’re not overriding getDeduplicationInfo() correctly? I looked through some of the other implementations, and they all seem to call parent::getDeduplicationInfo() and then unsetting parts of it; we’re returning a completely different array (and notably, an array where the job parameter – the entity ID – is a top-level key, instead of being beneath a 'params' key next to a 'type' key).

Also… is it actually correct to deduplicate based only on the entity ID? If there are two DispatchChanges jobs for the same entity ID, one with changeId 10 and the other with changeId 20, and the first job gets discarded as a duplicate of the second, won’t that mean that the changes with IDs 10-19 might not get processed?

Also… is it actually correct to deduplicate based only on the entity ID? If there are two DispatchChanges jobs for the same entity ID, one with changeId 10 and the other with changeId 20, and the first job gets discarded as a duplicate of the second, won’t that mean that the changes with IDs 10-19 might not get processed?

My understanding was that deduplication works the other way around: The second job getting discarded as a duplicate of the first, and the first job handling all changes for its Entity id that are currently in wb_changes.

So far I haven’t seen any guarantee on which change gets discarded, so I wouldn’t feel comfortable relying on that at the moment.

So far I haven’t seen any guarantee on which change gets discarded, so I wouldn’t feel comfortable relying on that at the moment.

Ok, in that case, maybe we should deduplication of that job completely. Also, we kind of doing implicit deduplication anyway, as can be seen on this panel: https://grafana.wikimedia.org/d/hGFN2TH7z/edit-dispatching-via-jobs?viewPanel=4&orgId=1&from=now-2d&to=now

image.png (313×785 px, 56 KB)

Jobs dealing with multiple changes at a time means that the following jobs queued for that will find no changes in wb_changes for them to dispatch and thus exit early.

So maybe we should change the deduplication property to be explicitly false?

I think relying on false being the default should be enough, so maybe we can just remove the deduplication code.

Unless we don’t actually need the change ID parameter for anything, and can remove it? I’m not sure what it’s used for. (If we really need it, then IMHO that filtering by change ID should also happen in the database, not in PHP after we’ve already loaded all the changes for that entity.)

Ah, gotcha. Now I remember! (My mind still catching up from vacation :p)

The changeID parameter was only used so that during the transitional period we wouldn't redispatch old changes towards recent changes. But now that DispatchChangesJob is cleaning up after itself, we can remove the usage of that parameter completely. (Which actually makes deduplication viable again, right?)

Yeah, sounds like it to me too. Though I guess that might depend on T291162: Create Migration path for 3rd party Wikibase users.

One option would be to remove the deduplication code now, and then consider implementing it again after that migration path is finished? (And possibly leaving the job undeduplicated, and with a changeId parameter, for one release?)

Also… is it actually correct to deduplicate based only on the entity ID? If there are two DispatchChanges jobs for the same entity ID, one with changeId 10 and the other with changeId 20, and the first job gets discarded as a duplicate of the second, won’t that mean that the changes with IDs 10-19 might not get processed?

the deduplication is based on entity id, so if two edits happens back to back on item Q1 (which would be the case when using the old termbox for example) and they get injected into wb_changes with ids 10 and 20, the first job handles both 10 and 20 (because it looks them up from the database) and then the second job if doesn't get deduplicated, still won't have anything to do.

There is a mechanism to ensure not any job get deduplicated after the first job is started running (using locks IIRC) so if the first job doesn't manage to read both changes from wb_changes in time, the second job won't get deduplciated. If this mechanism is somehow broken, I suggest fixing it since a lot of parts mediawiki rely on this but I think it's reliable enough.

HTH

oh and you shouldn't return false, otherwise the job will be retried (you should return false if the job gets an exception for example). I know it's counter-intuitive but I didn't come up with it.

Change 730531 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/Wikibase@master] Make deduplication actually work for DispatchChangesJob

https://gerrit.wikimedia.org/r/730531

Change 731239 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Michael Große):

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.4] Make deduplication actually work for DispatchChangesJob

https://gerrit.wikimedia.org/r/731239

Change 730531 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make deduplication actually work for DispatchChangesJob

https://gerrit.wikimedia.org/r/730531

Mentioned in SAL (#wikimedia-operations) [2021-10-18T11:09:17Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/ChangeModification/DispatchChangesJob.php: Backport: [[gerrit:731238|Create DispatchChangesJob without change id (T291118)]] (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2021-10-18T11:10:33Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/Hooks/RecentChangeSaveHookHandler.php: Backport: [[gerrit:731238|Create DispatchChangesJob without change id (T291118)]] (2/2) (duration: 00m 56s)

Change 731239 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.38.0-wmf.4] Make deduplication actually work for DispatchChangesJob

https://gerrit.wikimedia.org/r/731239

Mentioned in SAL (#wikimedia-operations) [2021-10-18T11:37:00Z] <lucaswerkmeister-wmde@deploy1002> Synchronized php-1.38.0-wmf.4/extensions/Wikibase/repo/includes/ChangeModification/DispatchChangesJob.php: Backport: [[gerrit:731239|Make deduplication actually work for DispatchChangesJob (T291118)]] (duration: 00m 55s)

Hm, I’m still not seeing any deduplication in Grafana. Let’s wait a bit more I guess?

Still not seeing anything. This is strange 🤔