
Jobs for otrs-wiki are slower than expected to process
Closed, Invalid · Public

Description

I have the impression that otrs-wiki.wikimedia.org builds up a large job queue backlog from time to time, which causes some trouble in the workflow.
As reader responsiveness is not that important on a private wiki, please increase $wgJobRunRate to 1 or higher, and please let me know the current value.
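
For reference, on a self-hosted MediaWiki this setting lives in LocalSettings.php. A minimal sketch (the value 5 is purely illustrative; as noted below, WMF wikis do not use this setting at all):

// LocalSettings.php
// $wgJobRunRate controls how many queued jobs are executed per web request.
// The MediaWiki default is 1; setting it to 0 disables job execution during
// page views entirely, which is typical for sites with dedicated job runners.
$wgJobRunRate = 5; // illustrative only: run up to five jobs per request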

Event Timeline

Not sure if this falls under Wikimedia-Site-requests as I don't know if there is any public per-wiki setting to tweak.

Reedy changed the task status from Open to Stalled. Dec 1 2018, 7:53 AM
Reedy subscribed.

We don't use $wgJobRunRate on WMF wikis...

reedy@deploy1001:/srv/mediawiki-staging$ mwscript showJobs.php otrs_wikiwiki
0
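
For a per-type breakdown rather than a single total, showJobs.php also accepts a --group option (assuming a reasonably current MediaWiki checkout); a spot check per job type would look like:

mwscript showJobs.php otrs_wikiwiki --group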

What are you trying to achieve?

I notice that pages require a manual purge after template changes. Not a big deal if it cannot be resolved; feel free to close the task. Thanks.

As with all wikis, things take a little time to be processed by the job runners. Changes won't appear immediately, but if they're still not appearing automatically after a reasonable amount of time, there's potentially an issue to look at.
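
As an interim workaround sketch (not an official recommendation), a single page can be purged with a forced link update through the Action API, which updates its rendered output and links tables right away instead of waiting for the queued refreshLinks job. The page title below is a placeholder, and on a private wiki the request additionally needs an authenticated session:

<?php
// Force a purge plus link update for one page via the Action API.
// The title is a placeholder; authentication (required on a private wiki)
// is omitted for brevity.
$api = 'https://otrs-wiki.wikimedia.org/w/api.php';
$fields = http_build_query( [
	'action' => 'purge',
	'titles' => 'Some Template-using Page', // placeholder
	'forcelinkupdate' => 1,
	'format' => 'json',
] );
$ch = curl_init( $api );
curl_setopt_array( $ch, [
	CURLOPT_POST => true,           // action=purge must be sent as POST
	CURLOPT_POSTFIELDS => $fields,
	CURLOPT_RETURNTRANSFER => true,
] );
echo curl_exec( $ch ), "\n";
curl_close( $ch );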

What is a reasonable time? For the OTRS wiki the delay is sometimes around "some hours", which is quite annoying. I also noticed delays at arbcom-de.wikipedia.org, but at a lower level, around several minutes.

Legoktm renamed this task from increase wgJobRunRate for otrs-wiki to Jobs for otrs-wiki are slower than expected to process. Dec 1 2018, 8:21 AM
Legoktm added a project: WMF-JobQueue.
Legoktm added a subscriber: Pchelolo.
Krinkle changed the task status from Stalled to Open. Dec 20 2018, 8:12 PM
Krinkle added a project: Platform Engineering.
Krinkle subscribed.

This might have to do with the scheduling mechanism no longer being partitioned by wiki (which the old queue was).

Based on T210910#4790996, this issue appears to be about refreshLinks and/or htmlCacheUpdate jobs. It'd be good to spot check the latency for such jobs on a few of the mentioned smaller wikis, to either confirm the behaviour or verify that it was a temporary issue.

As I understand it, the JobQueue in WMF production is now entirely out of the hands of the Performance Team.

The queuing implementation is in EventBus, which was developed by Services (now CPT). The queue is stored in Kafka (mainly looked after by Analytics), and the scheduling and execution of jobs happens through ChangeProp, which was developed and is maintained by CPT (formerly Services).

I'd like us to help where we can, but we have little to nothing to offer I'm afraid. The only part of this we're experienced with is the abstract queueing logic in core, which this doesn't appear to be an issue with.

We're having a hard time validating this issue. We have analytics by job type but not by wiki (see https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 ). If anyone has some data on this, it'd be helpful!

We have analytics by job type but not by wiki

Per-wiki metrics would require us to switch to Prometheus, since the metric cardinality would be too high for statsd.
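
A rough back-of-the-envelope illustration of that cardinality point (all of the numbers below are assumptions for the sake of the example, not measured values):

// statsd encodes every dimension in the metric name, so each combination
// becomes a separate time series.
$wikis        = 900; // rough count of production wikis (assumption)
$jobTypes     = 100; // rough count of distinct job types (assumption)
$measurements = 3;   // e.g. insertion rate, processing rate, delay

echo $jobTypes * $measurements, " series with per-type metrics only\n";     // 300
echo $wikis * $jobTypes * $measurements, " series with a wiki dimension\n"; // 270000

// Prometheus would instead attach wiki and job type as labels on a handful
// of metrics, and its storage model is designed for that kind of labelled
// cardinality, hence the suggestion to switch.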

The Kafka job queue doesn't give preference to big wikis over small wikis, so the metrics you see on the dashboard estimate the hard upper limit of the delay for any wiki.

At the time of filing there was indeed an increase in the delay for refreshLinks of up to one day, probably because of some template change. However, I am not sure this could affect the OTRS wiki: refreshLinks is partitioned according to the MySQL sharding, so the OTRS wiki shares a partition with the rest of the smaller wikis. Right now there seems to be no delay, so I believe we can close the task and reopen it if the problem occurs again and we get more information.

kchapman subscribed.

Is this still a problem? Please reopen if so.