
[Story] Dispatching via job queue (instead of cron script)
Open, Medium, Public

Description

Current state: The "dispatch" process for Wikidata / Wikibase is kicked off by a cron job running a maintenance script.

Problem
Wikibase users (including Wikibase developers) do not want to have to run extra maintenance scripts.
WMF SREs do not want to manage extra cron jobs (they cause complexity in cross-data-centre work).

Docs
The Repository dispatching (script, db tables, jobs)
The Client receiving its notification event / job

Rough Idea:

  • Every edit schedules a DispatchTriggerJob that is totally generic.
    • The job holds no parameters at all, so all DispatchTriggerJobs are identical. This means a new job can be ignored (de-duplicated) if an older one is already waiting for execution.
    • We may want a configurable way to schedule fewer of these than one per edit; one per 100 edits on Wikidata production, for example, would likely be fine. There are examples of this in core.
  • DispatchTriggerJob looks for wikis that meet our dispatch criteria (using the wb_changes_dispatch table as the current maintenance script does, regarding max interval etc.) and that are not locked, scheduling one DispatchClientJob per wiki
  • DispatchClientJob would perform a "pass" for the wiki, as the existing maintenance script does, then unlock the client wiki.
  • Everything from this point on would remain the same
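The flow above could be sketched roughly like this. This is an illustrative Python sketch only; the class names come from the description, but all method and setting names (`wikis_needing_dispatch`, `try_lock`, `DISPATCH_SAMPLE_RATE`, etc.) are hypothetical, not the actual Wikibase implementation:

```python
import random

# Hypothetical setting: schedule one trigger job per N edits.
DISPATCH_SAMPLE_RATE = 100

class DispatchTriggerJob:
    """Generic, parameterless job: all instances are identical, so the
    job queue can de-duplicate them (a new job is ignored while an
    older identical one is still waiting for execution)."""

    def run(self, repo):
        # Find client wikis that meet the dispatch criteria (max
        # interval exceeded, pending changes, ...) and are not locked,
        # and schedule one per-wiki job for each.
        for wiki in repo.wikis_needing_dispatch():
            if repo.try_lock(wiki):
                repo.schedule(DispatchClientJob(wiki))

class DispatchClientJob:
    """Performs one dispatch "pass" for a single client wiki, then
    releases the lock, mirroring what the maintenance script does."""

    def __init__(self, wiki):
        self.wiki = wiki

    def run(self, repo):
        try:
            repo.dispatch_pass(self.wiki)  # send a batch of changes
        finally:
            repo.unlock(self.wiki)

def on_edit(repo):
    """Called for every edit; samples so only ~1 in N edits actually
    schedules the (de-duplicated) trigger job."""
    if random.randrange(DISPATCH_SAMPLE_RATE) == 0:
        repo.schedule(DispatchTriggerJob())
```

The key design point is that DispatchTriggerJob carries no state, so queue-level de-duplication and edit sampling can both thin it out without losing any changes: the actual work is always re-derived from the dispatch tables.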

This solution means that we meet the main goal of this work, which is no longer using a maintenance script, while also not having to rewrite the entire dispatching system.

For sites that only have a single client site (such as a local client setup) we could consider directly scheduling DispatchClientJobs, skipping the indirection of the DispatchTriggerJob.

In production (wikidata.org) this solution will likely need some tuning to get the desired behaviour:

  • Adequate / desired % of edits triggering the initial job
  • Adequate waiting between batches of changes sent to individual clients
  • Settings limiting how much work a "pass" for a wiki can do (as we may now be able to run more passes for each wiki in general)
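As a sketch, those tuning knobs might look like the following. All names here are hypothetical placeholders, not the real Wikibase configuration keys:

```python
# Hypothetical tuning settings (names illustrative only):
dispatch_tuning = {
    "triggerSampleRate": 100,   # schedule a DispatchTriggerJob for 1 in N edits
    "minDispatchInterval": 60,  # seconds to wait between passes for one client
    "maxChangesPerPass": 1000,  # cap on changes sent to a client in one pass
}

def should_dispatch(seconds_since_last_pass, tuning):
    """A new pass for a client is only scheduled once the minimum
    interval between batches of changes has elapsed."""
    return seconds_since_last_pass >= tuning["minDispatchInterval"]
```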

Acceptance criteria🏕️🌟:

  • No maintenance script / cron job needs to be run for the dispatching process to work
  • De-duplication should be used where possible and needed
  • Documentation should be updated (in Wikibase.git & architecture docs)
  • Grafana monitoring of the dispatch process remains useful for the new solution

Notes:
When this task is tackled it should be borne in mind that some refactoring will likely make sense, such as T256208: Consolidate places that read/write 'wb_changes' table (but this is also tracked and prioritized separately).

This should be gradually deployed, and this could possibly be done in a couple of different ways:

  • Per environment: beta, test, production
  • Per client wiki (or group of wikis) within each environment: group1, group2, (everything except enwiki), enwiki, commonswiki

Overall performance of these jobs will be dictated by job queue processing, which is controlled by WMF SREs (Service Ops).
We have a general performance requirement of "The dispatching process for Wikidata should not be slower than it currently is".
The code to be deployed from this ticket likely won't have a big impact on performance, though the configuration of the processing of jobs may, and this would need to be figured out with serviceops.

In Wikidata production these cron jobs can be seen at https://github.com/wikimedia/puppet/blob/e1e13a59de3021afaa43c31745abbe348a93017d/modules/profile/manifests/mediawiki/maintenance/wikidata.pp

The current process is monitored on Grafana and also has alarms.

Related Objects

Event Timeline

Addshore lowered the priority of this task from High to Low.Jan 23 2019, 1:14 PM
Addshore added a subscriber: Addshore.

Change dispatching is currently very fast

I sorta copy what I said in T193733#5276659 on reasons

  • It's a SPOF: if the mwmaint1002 node goes down for HW issues, we can't dispatch at all. If there's a need to restart the node, dispatching has to stop until it's done.
  • "Noisy neighbor" effect: people run maintenance scripts on the mwmaint node, so dispatching can be choked by other scripts, and buggy scripts that eat all of the resources can make running maintenance scripts impossible.
  • The distributed system we designed for this (three cron jobs pulling the wikis, dispatching to basically random plus the most-stalled ones) could instead use the great job queue infrastructure we have.
  • Cron jobs are hard to debug; moving them to the job queue makes them easier to debug in Logstash.

Reducing the number of edits happening on Wikidata (using one big wbeditentity API call instead of several when a termbox v1 edit happens) can help, but there might be better ways to do it. @Joe has lots of good insight in this regard.


This sounds like a nice idea.

In order to be able to order this in proper priority, do we have any measurements/hypotheses on what we would solve/gain with this change? Or is it just anticipating future risks and trying to proactively do something?


On gains, this is more about resilience IMO. We have two documented incidents regarding dispatching but I think there are more. There have been scheduled downtimes due to reboots of mwmaint nodes, etc. All of them would be avoided. I'd say this might give us another 9 in dispatching uptime. Would that work for you?


Yeap that's more info to include here, thanks!

We would like to understand the amount of effort needed to do this.

@hoo, you would be able to provide more information from your own experience with this, so that we can have a better estimation for it.

The estimation comes down to, do we:

  1. Just move the dispatching mechanism over to jobs, using the same or very similar logic to what we currently have.
    • Probably not the most efficient dispatching logic
    • Least effort to get there
    • Solves most of the issues outlined in T48643#5336132
  2. Change the way dispatching works while we move over to jobs
    • More work & time
    • Probably makes dispatching faster, more efficient, more reliable and easier to understand
    • Best use of job queue for dispatching
    • Also solves the issues outlined in T48643#5336132

If number 2 then we also have to decide how exactly we will be changing the logic.

In T48643#1522606, @hoo wrote:

Do we want one job per edit or how exactly is this supposed to look? Wrapping the current dispatching mechanism in jobs doesn't really sound like a good idea to me.

The "Idea" in the description of this ticket currently talks about number 1 and keeping the same logic but making the job queue trigger it.
I'm on @hoo's side here and don't think this would be the best thing to do.

Just a note from a few weeks ago.
This is really needed to make cross DC work easier in the SRE world.

We had a prioritisation call today where T256208: Consolidate places that read/write 'wb_changes' table came up as something that would likely make sense to do when this ticket is tackled. (rather than doing it before this ticket and thus touching the code twice)
cc @Tarrow

Addshore renamed this task from [Story] Dispatching via delayed jobs (instead of cron script) to [Story] Dispatching via job queue (instead of cron script).May 18 2021, 12:35 PM
Addshore raised the priority of this task from Low to Medium.
Addshore updated the task description. (Show Details)

This was looked at in story time today, and was estimated with an AC which would read "& deploy the thing to wikidatawiki".
This got quite high estimates with that AC of 13->20+ and was determined to be "too big for campsite".

This probably now needs to follow the same process as the recent service migration ticket, which would in the current status be a bonfire & hike forming around this task.

Addshore lowered the priority of this task from Medium to Low.Jul 14 2021, 11:13 AM
Addshore raised the priority of this task from Low to Medium.Jul 16 2021, 8:46 AM
Addshore moved this task from Triaged Low (0-50) to Triaged Big on the wdwb-tech board.

Change 724230 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable dispatching via jobs in testwikidatawiki

https://gerrit.wikimedia.org/r/724230

Change 724230 abandoned by Ladsgroup:

[operations/mediawiki-config@master] Enable dispatching via jobs in testwikidatawiki

Reason:

Done via Idff50d75af4

https://gerrit.wikimedia.org/r/724230

Change 724765 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable change dispatching via jobs in wikidatawiki

https://gerrit.wikimedia.org/r/724765

Change 724765 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable change dispatching via jobs in wikidatawiki

https://gerrit.wikimedia.org/r/724765

Mentioned in SAL (#wikimedia-operations) [2021-09-29T15:44:54Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:724765|Enable change dispatching via jobs in wikidatawiki (T48643)]] (duration: 01m 08s)

Change 725287 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/deployment-charts@master] Make two new jobs of Wikidata dispatcher high priority

https://gerrit.wikimedia.org/r/725287

Edit: Moved CirrusSearch queue report to T292291

We looked at it and it doesn't seem to be related.

Change 725287 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: Make new jobs of Wikidata dispatcher high priority

https://gerrit.wikimedia.org/r/725287

Change 725502 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable dispatching via jobs everywhere

https://gerrit.wikimedia.org/r/725502

Change 725673 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mediawiki: Stop wikidata dispatching via systemd timers

https://gerrit.wikimedia.org/r/725673

Change 725705 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Disable dispatch lag part of maxlag

https://gerrit.wikimedia.org/r/725705

Change 725502 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable dispatching via jobs everywhere

https://gerrit.wikimedia.org/r/725502

Mentioned in SAL (#wikimedia-operations) [2021-10-04T14:01:44Z] <ladsgroup@deploy1002> Synchronized wmf-config: Config: [[gerrit:725502|Enable dispatching via jobs everywhere (T48643)]] (duration: 01m 00s)

Change 725705 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable dispatch lag part of maxlag

https://gerrit.wikimedia.org/r/725705

Change 725905 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Explicitly enable dispatching and pruning for wikidata

https://gerrit.wikimedia.org/r/725905

Change 725905 merged by jenkins-bot:

[operations/mediawiki-config@master] Explicitly enable dispatching and pruning for wikidata

https://gerrit.wikimedia.org/r/725905

Mentioned in SAL (#wikimedia-operations) [2021-10-04T14:13:14Z] <ladsgroup@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:725905|Explicitly enable dispatching and pruning for wikidata (T48643)]] (duration: 00m 58s)

Change 725927 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/deployment-charts@master] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7

https://gerrit.wikimedia.org/r/725927

Change 725927 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: Increase concurrancy of DispatchChanges to 7

https://gerrit.wikimedia.org/r/725927

Change 725673 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki: Stop wikidata dispatching via systemd timers

https://gerrit.wikimedia.org/r/725673