Page MenuHomePhabricator

Cleanup interaction between per-job and root-job deduplication logic so they mix properly
Open, Needs TriagePublic

Description

If a job class uses removeDuplicates=true and root job A1 exists and A2 is enqueued, then A2 enqueue becomes a no-op. At the same time, the A1 job is considered as "superceded by a new job" and is no-oped upon being run. A similar problem can happen with the sub-jobs from these root jobs if they also use removeDuplicates=true.

Maybe jobs with root timestamps should flip the "oldest job wins" logic to "newest job wins" for per-job deduplication. A simpler option is to just not update the "latest root job timestamp" registry if no job was actually inserted (due to removeDuplicates). The later option retains the normal removeDuplicates behavior.

Event Timeline

Change 794098 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] jobqueue: do not bump the root job timestamp for hash-deduplicated jobs

https://gerrit.wikimedia.org/r/794098

@aaron wrote:

If a job class uses removeDuplicates=true and root job A1 exists and A2 is enqueued, then A2 enqueue becomes a no-op. At the same time, the A1 job is considered as "superceded by a new job" and is no-oped upon being run. […]

Is there something racey or further specific about this? If not, then it sounds like this means that basically as a general rule that when a popular template is edited twice in a row, that most of the backlinks will not be refreshed and both jobs die out early. Is that the case today?

If not, can you be more specific about what job in production would be affected by this issue and how it could happen in practice?

Krinkle added a project: Performance-Team.

Tracking and assigning for visibility, to reflect patch in progress.

If the root job A1 is popped before A2 is queued, the bug is largely avoided, which is what will usually happen.

Maybe jobs with root timestamps should flip the "oldest job wins" logic to "newest job wins" for per-job deduplication.

I think this would lead to starvation when new jobs keep coming fast (e.g. a lot of edits on a page).

@WDoranWMF I think task would be a good oppertunity for training and familiarisation of jobqueue code ("knowledge sharing") for anyone on the team besides Daniel, both a bit of coverage over the core API as well as the EventBus-ChangeProp pipeline that we plug in in production. Aaron, Daniel or myself could pair with one or two people on that.

The summary here is that Aaron found a race condition that is affecting the core API (fairly certain about that) and has prepared a patch for that. To be done here is to 1) understand the relevant core APIs and use the patch as guidance to understand and then review/improve the patch as needed, and 2) understand portions of the EventBus-ChangeProp pipeline that WMF uses in production to see if the same bug is present there and if so to similarly patch it there.

Per Timo's comment above and following Will's suggestion, I'm adding this to the Foundational Technology Requests board.

aaron removed aaron as the assignee of this task.Aug 16 2022, 10:04 PM