Cleanup interaction between per-job and root-job deduplication logic so they mix properly
Open, Needs TriagePublic
Actions

Assigned To

None

Authored By

	aaron
	May 19 2022, 8:26 AM

Description

If a job class uses removeDuplicates=true and root job A1 exists and A2 is enqueued, then A2 enqueue becomes a no-op. At the same time, the A1 job is considered as "superceded by a new job" and is no-oped upon being run. A similar problem can happen with the sub-jobs from these root jobs if they also use removeDuplicates=true.

Maybe jobs with root timestamps should flip the "oldest job wins" logic to "newest job wins" for per-job deduplication. A simpler option is to just not update the "latest root job timestamp" registry if no job was actually inserted (due to removeDuplicates). The later option retains the normal removeDuplicates behavior.

Details

	Subject	Repo	Branch	Lines +/-
	jobqueue: do not bump the root job timestamp for hash-deduplicated jobs	mediawiki/core	master	+73 -26

Customize query in gerrit

Event Timeline

aaron created this task.May 19 2022, 8:26 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 19 2022, 8:26 AM

aaron updated the task description. (Show Details)May 21 2022, 12:03 AM

Change 794098 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] jobqueue: do not bump the root job timestamp for hash-deduplicated jobs

https://gerrit.wikimedia.org/r/794098

gerritbot added a project: Patch-For-Review.May 21 2022, 1:59 AM

@aaron wrote:

If a job class uses removeDuplicates=true and root job A1 exists and A2 is enqueued, then A2 enqueue becomes a no-op. At the same time, the A1 job is considered as "superceded by a new job" and is no-oped upon being run. […]

Is there something racey or further specific about this? If not, then it sounds like this means that basically as a general rule that when a popular template is edited twice in a row, that most of the backlinks will not be refreshed and both jobs die out early. Is that the case today?

If not, can you be more specific about what job in production would be affected by this issue and how it could happen in practice?

Krinkle removed a project: WMF-JobQueue.May 24 2022, 10:42 PM

Tracking and assigning for visibility, to reflect patch in progress.

If the root job A1 is popped before A2 is queued, the bug is largely avoided, which is what will usually happen.

Krinkle moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board.May 31 2022, 6:46 PM

Krinkle added a project: Platform Engineering.May 31 2022, 6:48 PM

Maybe jobs with root timestamps should flip the "oldest job wins" logic to "newest job wins" for per-job deduplication.

I think this would lead to starvation when new jobs keep coming fast (e.g. a lot of edits on a page).

@WDoranWMF I think task would be a good oppertunity for training and familiarisation of jobqueue code ("knowledge sharing") for anyone on the team besides Daniel, both a bit of coverage over the core API as well as the EventBus-ChangeProp pipeline that we plug in in production. Aaron, Daniel or myself could pair with one or two people on that.

The summary here is that Aaron found a race condition that is affecting the core API (fairly certain about that) and has prepared a patch for that. To be done here is to 1) understand the relevant core APIs and use the patch as guidance to understand and then review/improve the patch as needed, and 2) understand portions of the EventBus-ChangeProp pipeline that WMF uses in production to see if the same bug is present there and if so to similarly patch it there.

Per Timo's comment above and following Will's suggestion, I'm adding this to the Foundational Technology Requests board.

Ladsgroup subscribed.Jul 7 2022, 6:51 PM

aaron removed aaron as the assignee of this task.Aug 16 2022, 10:04 PM

Krinkle edited projects, added Wikimedia-Performance-recommendation; removed Performance-Team.Mar 10 2024, 2:31 AM

Cleanup interaction between per-job and root-job deduplication logic so they mix properlyOpen, Needs TriagePublicActions

Description

Details

Event Timeline

Cleanup interaction between per-job and root-job deduplication logic so they mix properly
Open, Needs TriagePublic
Actions