Page MenuHomePhabricator

Generalise deduplication in ChangeProp
Closed, ResolvedPublic

Description

JobQueue has 2 methods of dedupication: individual job deduplication and root-job deduplication. Here’s how both work:

  • Individual job deduplication: each job has an sha1 associated with it and a boolean flag whether to deduplicate. On each job push it checks with Redis whether a job with the same identifier was already pushed, if yes - it rejects the job.
  • Root job deduplication. Each job has identifiers of it root job - the one that’s created it, and a timestamp of the root job. After a newer root job is pushed, the latest timestamp is updated in Redis. When the leaf job is pulled, it’s root job timestamp is checked against latest and it’s potentially deduplicated.

We can create an absolutely matching algorithm in ChangeProp by adding some new fields to our events.

This is only possible after we add some persistency discussed in T157089

Event Timeline

Pchelolo created this task.Feb 3 2017, 12:18 AM
Restricted Application added a project: Analytics. · View Herald TranscriptFeb 3 2017, 12:18 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Pchelolo updated the task description. (Show Details)
Pchelolo removed a subscriber: Aklapper.
Restricted Application added a project: Analytics. · View Herald TranscriptFeb 3 2017, 12:19 AM
GWicke added a subscriber: GWicke.EditedFeb 3 2017, 7:58 PM

Most (if not all) use cases of individual job deduplication can alternatively be solved by making updates idempotent. This is what we have done for many RESTBase updates, using the If-Unmodified-Since header. The advantages of pushing idempotence and quick abort with if-unmodified-since downstream are

a) ease of scaling the storage of job progress, and
b) covering other means of triggering the same jobs, such as a manual purge for the same page, or another template edit (different template) triggering a re-render of the same page.

We should also check how frequently and how single-job deduplication is actually used in the JobQueue. I vaguely recall that it was deemed too expensive for the larger jobs like refreshlinks, so there might not be that many actual users of this functionality.

Most (if not all) use cases of individual job deduplication can alternatively be solved by making updates idempotent.

For some of the jobs that's impossible, for example email sending.

Anyway if we want feature parity between JobQueue and the new solution, we need to have this feature.

Most (if not all) use cases of individual job deduplication can alternatively be solved by making updates idempotent.

For some of the jobs that's impossible, for example email sending.

I wouldn't say it's impossible (you can always keep track of having actually sent a certain mail further downstream), but you are right that it's not as easy as piggy-backing on existing storage in all cases.

Anyway if we want feature parity between JobQueue and the new solution, we need to have this feature.

In the short term, this might be true. In the long term, I'm not completely convinced that this kind of functionality necessarily belongs in the job queue / change processing infrastructure.

I would propose to compile a list of current users of individual job deduplication. This should give us a better understanding of the situation, and lets us draw up a solid plan for the short & long term.

In the short term, this might be true. In the long term, I'm not completely convinced that this kind of functionality necessarily belongs in the job queue / change processing infrastructure.

I think it actually does. The role of change prop here is to do job management - managing retries, making sure jobs get executed and nothing is missed. Deduplication naturally fits into the 'job management' category.

I would propose to compile a list of current users of individual job deduplication. This should give us a better understanding of the situation, and lets us draw up a solid plan for the short & long term.

Most of the jobs existing in core use it except CdnPurge and range jobs for HTMLCacheUpdate and RefreshLinks, so I think we need to implement it.

Deduplication naturally fits into the 'job management' category.

Sure, but as I hinted at earlier it is often in a bad position to do so well, as it can only deduplicate the job being triggered through the job queue, but not other sources of triggering the same thing.

Pchelolo added a comment.EditedFeb 3 2017, 8:24 PM

Sure, but as I hinted at earlier it is often in a bad position to do so well, as it can only deduplicate the job being triggered through the job queue, but not other sources of triggering the same thing.

Why not? That's the point of the 'generalization' mechanism. If we gave a general contract for deduplication we can use the same contract for current Change-Prop use-cases. Maybe the one used in the JobQueue is not ideal, but we have to stick to it because we want to stay compatible with the current implementation. In the long term we might want to change that, but it's out of the scope of this particular project.

I agree that we should try to make updates idempotent, but it's a way bigger project then creating a new backed for the JobQueue.

GWicke added a comment.EditedFeb 3 2017, 8:43 PM

Sure, but as I hinted at earlier it is often in a bad position to do so well, as it can only deduplicate the job being triggered through the job queue, but not other sources of triggering the same thing.

Why not? That's the point of the 'generalization' mechanism.

@Pchelolo: The example I mentioned earlier is a page being (re-)rendered in response to a user request, while a job to do the same is being processed later. With idempotence & quick abort at the lowest level, you end up only (re-)rendering the page once, no matter which request came first. With deduplication only in the job queue, you end up doing duplicate work if the user request was processed first, as that request never goes through the job queue.

Nuria moved this task from Incoming to Radar on the Analytics board.Feb 9 2017, 4:49 PM
elukey added a subscriber: elukey.Jul 5 2017, 7:57 AM
Pchelolo closed this task as Resolved.Aug 30 2017, 10:42 PM

We have redid-based reduplication running in production now that's exactly matching the algorithm used in the JobQueue. There's a parent task to generalize it even more, but this one can be closed now.