Page MenuHomePhabricator

beta cluster job runner keep running some periodic tasks
Closed, ResolvedPublic

Description

The deployment-jobrunner01 instance is quite busy. Looking at /data/project/logs/runJobs.log it keeps running 31 periodic queue tasks every few seconds for the simplewiki:

2014-04-08 15:44:51 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:44:55 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:03 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:07 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:10 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:16 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:19 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).

I have no clue how to list the tasks. maintenance/showJobs.php does not give any clue nor getJobQueueLenghts :-/

Looking at the cli.log there is a bunch of:

simplewiki-c91d295c: 2.0489  22.5M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
simplewiki-c91d295c: 2.0498  22.5M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
simplewiki-c91d295c: 2.0507  22.5M  [memcached] result: NOT STORED

Ccing Aaron who probably knows how to fix that :-]


Version: unspecified
Severity: normal

Details

Reference
bz63681

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:24 AM
bzimport set Reference to bz63681.
bzimport added a subscriber: Unknown Object (MLST).

Do the cas operations succeed? If they do, maybe memcached has a high eviction rate or something?

I don't see this probably locally or on production, so it seems specific to beta.

Here is the debug log for session simplewiki-6a538816 from cli.log:

1.7059 14.2M [caches] main: MemcachedPeclBagOStuff, message: MemcachedPeclBagOStuff, parser: MemcachedPhpBagOStuff
2.1357 18.2M [caches] LocalisationCache: using store LCStoreCDB
2.9236 22.0M Fully initialised
3.0402 22.0M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1017 22.5M Creating a new RedisConnectionPool instance with id 95797b002a3bb33394c101391fbb456c43b076e3.
3.1666 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1680 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1688 22.8M [memcached] result: NOT STORED
3.1690 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1696 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1703 22.8M [memcached] result: NOT STORED
3.1705 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1712 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1720 22.8M [memcached] result: NOT STORED
3.1722 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1730 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1744 22.8M [memcached] result: NOT STORED
3.1748 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1754 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1762 22.8M [memcached] result: NOT STORED
3.1763 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1769 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1776 22.8M [memcached] result: NOT STORED
3.1777 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1783 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1790 22.8M [memcached] result: NOT STORED
3.1791 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1797 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1809 22.8M [memcached] result: NOT STORED
3.1811 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1822 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1829 22.8M [memcached] result: NOT STORED
3.1831 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1838 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1848 22.8M [memcached] result: NOT STORED

Looking at the memcached key simplewiki:jobqueuegroup:taskruns:v1:

$ mwscript mcc.php
> get simplewiki:jobqueuegroup:taskruns:v1
Getting simplewiki:jobqueuegroup:taskruns:v1[]
array(31) {

["refreshLinks"]=>
array(1) {
  ["recyclePruneAndUndelayJobs"]=>
  int(1396479307)
}

<snip>

I get a 31 elements array, each element having a timestamp of 1396479307 which is Wed, 02 Apr 2014 22:55:07 GMT.

Seems to me there is some logical error in JobQueueGroup::executeReadyPeriodicTasks() which never clear out the previous runs :-/

I am not deleting the key so we can keep investigating.

If the add() gave NOT STORED then you should have seen cas(). That sounds like some problem happens with memcached in BagOStuff::mergeViaCas().

Do we have any unit test covering that code or anyway to potentially reproduce the issue? I am not familiar at all with the cas stuff, much less with the job queue.

(still seeing those log entries)

greg added subscribers: bzimport, aaron.

aschulz4587 wrote:

If the add() gave NOT STORED then you should have seen cas(). That sounds like some problem happens with memcached in BagOStuff::mergeViaCas().

@aaron: anything we should do here?

greg raised the priority of this task from Medium to Needs Triage.Nov 25 2014, 9:51 PM
greg set Security to None.

+ MediaWiki-Core-Team

The spam of messages still occurs in /data/project/logs/runJobs.log

An example of a spike for commonswiki:

2014-11-25 21:16:22 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:22 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).

For enwiki they seem to come in blocks of 5 messages roughly every 5 minutes, though that depends on the time of the day.

Example:

2014-11-25 21:50:33 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:36 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:05 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:05 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:06 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:06 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:07 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).

Count per minutes:

$ fgrep periodic /data/project/logs/runJobs.log |fgrep enwiki:|cut -d\: -f1-2|uniq -c
...
      5 2014-11-25 21:27
      5 2014-11-25 21:33
     10 2014-11-25 21:38
     10 2014-11-25 21:45
      5 2014-11-25 21:50
      5 2014-11-25 21:56
$
greg triaged this task as Medium priority.Nov 26 2014, 5:10 PM
aaron claimed this task.

Labs now uses the new job runner loop and the periodic tasks stuff in MW uses redis, not memcached (anymore). With the old loop, it was expected for batches of sub-loops to do the same tasks sometimes (the later times just not doing anything). This happens since several processes see that a task should be done and decide to do them (racing on updating the last-completion time). It's minor, and indeed happens every ~5min as expected.

The new loop means the MediaWiki periodic tasks methods can be disabled however, being redundant (the background runner takes care of them). I posted https://gerrit.wikimedia.org/r/#/c/177275/ to do this.