
Jobs for otrs-wiki are slower than expected to process
Closed, Invalid · Public

Description

I have the impression that otrs-wiki.wikimedia.org builds up a large job queue backlog from time to time, which causes some trouble in the workflow.
As reader responsiveness is not that important on a private wiki, please increase $wgJobRunRate to 1 or higher, and please let me know the current value.
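
For reference, on a self-hosted MediaWiki this setting lives in LocalSettings.php. A minimal sketch (the value 5 is purely illustrative; as noted below, WMF wikis do not use this setting at all):

// LocalSettings.php
// $wgJobRunRate controls how many queued jobs are executed per web request.
// The MediaWiki default is 1; setting it to 0 disables job execution during
// page views entirely, which is typical for sites with dedicated job runners.
$wgJobRunRate = 5; // illustrative only: run up to five jobs per request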

Event Timeline

Not sure if this falls under Wikimedia-Site-requests as I don't know if there is any public per-wiki setting to tweak.

Reedy changed the task status from Open to Stalled. Dec 1 2018, 7:53 AM
Reedy subscribed.

We don't use $wgJobRunRate on WMF wikis...

reedy@deploy1001:/srv/mediawiki-staging$ mwscript showJobs.php otrs_wikiwiki
0
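
For a per-type breakdown rather than a single total, showJobs.php also accepts a --group option (assuming a reasonably current MediaWiki checkout); a spot check per job type would look like:

mwscript showJobs.php otrs_wikiwiki --group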

What are you trying to achieve?

I notice that pages require a manual purge after template changes. Not a big deal if it cannot be resolved; feel free to close the task. Thanks.

As with all wikis, things take a little time to be processed by the job runners. Changes won't appear immediately, but if they're still not appearing automatically after a reasonable amount of time, there's potentially an issue to look at.
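
As an interim workaround sketch (not an official recommendation), a single page can be purged with a forced link update through the Action API, which updates its rendered output and links tables right away instead of waiting for the queued refreshLinks job. The page title below is a placeholder, and on a private wiki the request additionally needs an authenticated session:

<?php
// Force a purge plus link update for one page via the Action API.
// The title is a placeholder; authentication (required on a private wiki)
// is omitted for brevity.
$api = 'https://otrs-wiki.wikimedia.org/w/api.php';
$fields = http_build_query( [
	'action' => 'purge',
	'titles' => 'Some Template-using Page', // placeholder
	'forcelinkupdate' => 1,
	'format' => 'json',
] );
$ch = curl_init( $api );
curl_setopt_array( $ch, [
	CURLOPT_POST => true,           // action=purge must be sent as POST
	CURLOPT_POSTFIELDS => $fields,
	CURLOPT_RETURNTRANSFER => true,
] );
echo curl_exec( $ch ), "\n";
curl_close( $ch );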

What is a reasonable time? For the OTRS wiki the delay is sometimes around "some hours", which is quite annoying. I also noticed delays at arbcom-de.wikipedia.org, but at a lower level, around several minutes.

Legoktm renamed this task from increase wgJobRunRate for otrs-wiki to Jobs for otrs-wiki are slower than expected to process. Dec 1 2018, 8:21 AM
Legoktm added a project: WMF-JobQueue.
Legoktm added a subscriber: Pchelolo.
Krinkle changed the task status from Stalled to Open. Dec 20 2018, 8:12 PM
Krinkle added a project: Platform Engineering.
Krinkle subscribed.

This might have to do with the scheduling mechanism no longer being partitioned by wiki (which the old queue was).

Based on T210910#4790996, this issue appears to be about refreshLinks and/or htmlCacheUpdate jobs. It'd be good to spot check the latency for such jobs on a few of the mentioned smaller wikis, to either confirm the behaviour or verify that it was a temporary issue.

As I understand it, the JobQueue in WMF production is now entirely out of the hands of the Performance Team.

The queuing implementation is in EventBus, which was developed by Services (now CPT). The queue is stored in Kafka (mainly looked after by Analytics), and the scheduling and execution of jobs happens through ChangeProp, which was developed and is maintained by CPT (formerly Services).

I'd like us to help where we can, but we have little to nothing to offer I'm afraid. The only part of this we're experienced with is the abstract queueing logic in core, which this doesn't appear to be an issue with.

We're having a hard time validating this issue. We have analytics by job type but not by wiki (see https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 ). If anyone has some data on this, it'd be helpful!

We have analytics by job type but not by wiki

Per-wiki metrics would require us to switch to Prometheus, since the metric cardinality would be too high for statsd.
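
A rough back-of-the-envelope illustration of that cardinality point (all of the numbers below are assumptions for the sake of the example, not measured values):

// statsd encodes every dimension in the metric name, so each combination
// becomes a separate time series.
$wikis        = 900; // rough count of production wikis (assumption)
$jobTypes     = 100; // rough count of distinct job types (assumption)
$measurements = 3;   // e.g. insertion rate, processing rate, delay

echo $jobTypes * $measurements, " series with per-type metrics only\n";     // 300
echo $wikis * $jobTypes * $measurements, " series with a wiki dimension\n"; // 270000

// Prometheus would instead attach wiki and job type as labels on a handful
// of metrics, and its storage model is designed for that kind of labelled
// cardinality, hence the suggestion to switch.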

The Kafka job queue doesn't give preference to big wikis over small wikis, so the metrics you see on the dashboard estimate the hard upper limit of the delay for any wiki.

At the time of filing there was indeed an increase in the delay for refreshLinks of up to one day, probably because of some template change. However, I am not sure this could affect the OTRS wiki: refreshLinks is partitioned according to the MySQL sharding, so the OTRS wiki shares a partition with the rest of the smaller wikis. Right now there seems to be no delay, so I believe we can close the task and reopen it if the problem occurs again and we get more information.

kchapman subscribed.

Is this still a problem? Please reopen if so.