
Beta cluster job queue is unmonitored / potentially not running all jobs
Closed, Invalid · Public

Description

The beta cluster job queue is unmonitored. In statsd we are missing the equivalent of MediaWiki.jobqueue.size; it should be BetaMediaWiki.jobqueue.size, but it does not show up.

We have two job runners:

deployment-jobrunner01
deployment-tmh01

The latter is solely for video transcoding, afaik.

Looking at the labs monitoring dashboard https://grafana.wikimedia.org/dashboard/db/labs-project-board, the instance can probably receive more load:

Capture d’écran 2016-02-29 à 09.46.32.png (550×860 px, 74 KB)

AFTER T130184 got fixed

Capture d’écran 2016-03-17 à 11.55.42.png (285×334 px, 34 KB)

Or we can spawn a second instance.

Event Timeline

The graph is the one from Beta. I could not find the MediaWiki.jobqueue.size metric on the labs graphite though :(
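
One way to check whether the metric exists at all is to query the Graphite find API directly. A minimal sketch, assuming the labs Graphite instance answers at graphite.wmflabs.org (both the hostname and the BetaMediaWiki prefix are assumptions):

$ curl -s 'https://graphite.wmflabs.org/metrics/find?query=BetaMediaWiki.jobqueue.*'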

commonswiki shows deleteLinks and cirrusSearchDeletePages are not processed :(

$ mwscript showJobs.php --wiki=commonswiki --group|column -t
deleteLinks:                  1019  queued;  1  claimed  (1  active,  0  abandoned);  0  delayed
enotifNotify:                 0     queued;  2  claimed  (2  active,  0  abandoned);  0  delayed
AssembleUploadChunks:         0     queued;  6  claimed  (0  active,  6  abandoned);  0  delayed
refreshLinksDynamic:          0     queued;  1  claimed  (0  active,  1  abandoned);  0  delayed
cirrusSearchDeletePages:      2124  queued;  1  claimed  (1  active,  0  abandoned);  0  delayed
gwtoolsetUploadMediafileJob:  0     queued;  4  claimed  (0  active,  4  abandoned);  0  delayed
EchoNotificationJob:          0     queued;  4  claimed  (4  active,  0  abandoned);  0  delayed
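
If the runners really are not picking those types up, the backlog could presumably be drained by hand with the standard runJobs.php maintenance script. A sketch (job types taken from the output above; the --maxjobs value is arbitrary):

$ mwscript maintenance/runJobs.php --wiki=commonswiki --type=deleteLinks --maxjobs=500
$ mwscript maintenance/runJobs.php --wiki=commonswiki --type=cirrusSearchDeletePages --maxjobs=500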

If we don't actually have a ton of video transcodes I'd just go with 2 general purpose job runners and no dedicated ones.

jobrunner01 was kept busy by errant labswiki jobs (that wiki was renamed to deploymentwiki a while ago). At least that one is fixed now (was T130184).

hashar renamed this task from Beta cluster job queue is quite crowed (~200 k jobs) to Beta cluster job queue is unmonitored / potentially not running all jobs.Mar 17 2016, 10:42 AM
hashar updated the task description. (Show Details)

I originally filed the task stating the beta cluster job queue had 200k jobs. I made a mistake and was looking at the production queue :-(

I am thus repurposing this task to get some basic metrics for the beta cluster job queue. Namely:

In statsd we are missing the equivalent of MediaWiki.jobqueue.size; it should be BetaMediaWiki.jobqueue.size, but it does not show up.

Asked Giuseppe about it. In production the job queue statistics are generated by a maintenance script on terbium. It runs every minute and is provisioned by the puppet class mediawiki::maintenance::jobqueue_stats. Puppet manifest:

$ cat modules/mediawiki/manifests/maintenance/jobqueue_stats.pp 
# == Class: mediawiki::maintenance::jobqueue_stats
#
# Provisions a cron job which runs every minute and which reports the
# total size of the job queue to StatsD.
#
class mediawiki::maintenance::jobqueue_stats( $ensure = present ) {
    include ::mediawiki::users

    cron { 'jobqueue_stats_reporter':
        ensure  => $ensure,
        command => '/usr/local/bin/mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php --report 2>/dev/null >/dev/null',
        user    => $::mediawiki::users::web,
        minute  => '*',
    }
}
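
In other words, the class only installs a one-minute cron running getJobQueueLengths.php --report as the web user. A sketch of the crontab line it would install (assuming $::mediawiki::users::web resolves to the usual web user):

# In the web user's crontab: runs every minute and pushes the queue size to StatsD.
* * * * * /usr/local/bin/mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php --report 2>/dev/null >/dev/null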

Assuming it is working properly:

$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php 
Total 0
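
The cron above adds --report, which per the class documentation is what pushes the numbers to StatsD (the plain invocation above only prints them). So once a terbium-like host exists on beta, the missing metric should presumably come from something like:

$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php --report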

Having a terbium equivalent / a running cron script on the beta cluster is tracked as T125976

hashar changed the task status from Open to Stalled.Mar 17 2016, 10:56 AM
hashar triaged this task as Low priority.
hashar updated the task description. (Show Details)

CPU is no longer a concern:

Capture d’écran 2016-03-17 à 11.55.42.png (285×334 px, 34 KB)

Stalled until job queue metrics are reported on the beta cluster, which is T125976

Can someone take a look at the job queue? For example, a spam page I deleted on dewiki beta (http://de.wikipedia.beta.wmflabs.org/wiki/Spezial:Letzte_%C3%84nderungen) is still visible in recent changes (the creation of that page), but the deletion was more than 12 hours ago, so I guess the queue stopped working.

krenair@deployment-tin:~$ foreachwiki showJobs.php | grep :
aawiki:  0
arwiki:  0
cawiki:  0
commonswiki:  0
deploymentwiki:  0
dewiki:  0
en_rtlwiki:  0
enwiki:  0
enwikibooks:  0
enwikinews:  0
enwikiquote:  0
enwikisource:  0
enwikiversity:  0
enwikivoyage:  0
enwiktionary:  0
eowiki:  0
eswiki:  0
fawiki:  0
hewiki:  0
hiwiki:  0
jawiki:  0
kowiki:  0
loginwiki:  0
metawiki:  0
nlwiki:  0
ruwiki:  0
simplewiki:  0
sqwiki:  0
testwiki:  0
ukwiki:  0
wikidatawiki:  0
zerowiki:  0
zhwiki:  0

Strange. The deleted page is still visible in recent changes. Maybe this is the "potentially not running all jobs" part?
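
For a closer look, the grouped view used earlier on commonswiki may be more telling than the plain per-wiki totals, since it also lists claimed and abandoned jobs, e.g.:

$ mwscript showJobs.php --wiki=dewiki --group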

That is no longer accurate. Background job processing has been overhauled in the meantime.