beta cluster job runner keep running some periodic tasks
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	hashar
	Apr 8 2014, 3:49 PM

Description

The deployment-jobrunner01 instance is quite busy. Looking at /data/project/logs/runJobs.log it keeps running 31 periodic queue tasks every few seconds for the simplewiki:

2014-04-08 15:44:51 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:44:55 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:03 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:07 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:10 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:16 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).
2014-04-08 15:45:19 deployment-jobrunner01 simplewiki: Executed 31 periodic queue task(s).

I have no clue how to list the tasks. maintenance/showJobs.php does not give any clue nor getJobQueueLenghts :-/

Looking at the cli.log there is a bunch of:

simplewiki-c91d295c: 2.0489  22.5M  [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
simplewiki-c91d295c: 2.0498  22.5M  [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
simplewiki-c91d295c: 2.0507  22.5M  [memcached] result: NOT STORED

Ccing Aaron who probably knows how to fix that :-]

Version: unspecified
Severity: normal

Details

Reference: bz63681

Event Timeline

• bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:24 AM

• bzimport added a project: Beta-Cluster-Infrastructure.

• bzimport set Reference to bz63681.

• bzimport added a subscriber: Unknown Object (MLST).

hashar created this task.Apr 8 2014, 3:49 PM

Do the cas operations succeed? If they do, maybe memcached has a high eviction rate or something?

I don't see this probably locally or on production, so it seems specific to beta.

Here is the debug log for session simplewiki-6a538816 from cli.log:

1.7059 14.2M [caches] main: MemcachedPeclBagOStuff, message: MemcachedPeclBagOStuff, parser: MemcachedPhpBagOStuff
2.1357 18.2M [caches] LocalisationCache: using store LCStoreCDB
2.9236 22.0M Fully initialised
3.0402 22.0M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1017 22.5M Creating a new RedisConnectionPool instance with id 95797b002a3bb33394c101391fbb456c43b076e3.
3.1666 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1680 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1688 22.8M [memcached] result: NOT STORED
3.1690 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1696 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1703 22.8M [memcached] result: NOT STORED
3.1705 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1712 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1720 22.8M [memcached] result: NOT STORED
3.1722 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1730 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1744 22.8M [memcached] result: NOT STORED
3.1748 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1754 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1762 22.8M [memcached] result: NOT STORED
3.1763 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1769 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1776 22.8M [memcached] result: NOT STORED
3.1777 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1783 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1790 22.8M [memcached] result: NOT STORED
3.1791 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1797 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1809 22.8M [memcached] result: NOT STORED
3.1811 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1822 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1829 22.8M [memcached] result: NOT STORED
3.1831 22.8M [memcached] get(simplewiki:jobqueuegroup:taskruns:v1)
3.1838 22.8M [memcached] add(simplewiki:jobqueuegroup:taskruns:v1)
3.1848 22.8M [memcached] result: NOT STORED

Looking at the memcached key simplewiki:jobqueuegroup:taskruns:v1:

$ mwscript mcc.php
> get simplewiki:jobqueuegroup:taskruns:v1
Getting simplewiki:jobqueuegroup:taskruns:v1[]
array(31) {

["refreshLinks"]=>
array(1) {
  ["recyclePruneAndUndelayJobs"]=>
  int(1396479307)
}

<snip>

I get a 31 elements array, each element having a timestamp of 1396479307 which is Wed, 02 Apr 2014 22:55:07 GMT.

Seems to me there is some logical error in JobQueueGroup::executeReadyPeriodicTasks() which never clear out the previous runs :-/

I am not deleting the key so we can keep investigating.

If the add() gave NOT STORED then you should have seen cas(). That sounds like some problem happens with memcached in BagOStuff::mergeViaCas().

Do we have any unit test covering that code or anyway to potentially reproduce the issue? I am not familiar at all with the cas stuff, much less with the job queue.

(still seeing those log entries)

In T65681#710874, @bzimport wrote:

aschulz4587 wrote:

If the add() gave NOT STORED then you should have seen cas(). That sounds like some problem happens with memcached in BagOStuff::mergeViaCas().

@aaron: anything we should do here?

greg raised the priority of this task from Medium to Needs Triage.Nov 25 2014, 9:51 PM

greg set Security to None.

+ MediaWiki-Core-Team

The spam of messages still occurs in /data/project/logs/runJobs.log

An example of a spike for commonswiki:

2014-11-25 21:16:22 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:22 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).
2014-11-25 21:16:24 deployment-jobrunner01 commonswiki: Executed 52 periodic queue task(s).

For enwiki they seem to come in blocks of 5 messages roughly every 5 minutes, though that depends on the time of the day.

Example:

2014-11-25 21:50:33 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:35 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:50:36 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:05 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).

2014-11-25 21:56:05 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:06 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:06 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).
2014-11-25 21:56:07 deployment-jobrunner01 enwiki: Executed 40 periodic queue task(s).

Count per minutes:

$ fgrep periodic /data/project/logs/runJobs.log |fgrep enwiki:|cut -d\: -f1-2|uniq -c
...
      5 2014-11-25 21:27
      5 2014-11-25 21:33
     10 2014-11-25 21:38
     10 2014-11-25 21:45
      5 2014-11-25 21:50
      5 2014-11-25 21:56
$

hashar updated the task description. (Show Details)Nov 25 2014, 10:01 PM

hashar moved this task from To Triage to Externally Blocked on the Beta-Cluster-Infrastructure board.Nov 25 2014, 10:26 PM

Petrb unsubscribed.Nov 26 2014, 8:39 AM

greg triaged this task as Medium priority.Nov 26 2014, 5:10 PM

greg added a project: Scrum-of-Scrums.Nov 26 2014, 5:15 PM

Labs now uses the new job runner loop and the periodic tasks stuff in MW uses redis, not memcached (anymore). With the old loop, it was expected for batches of sub-loops to do the same tasks sometimes (the later times just not doing anything). This happens since several processes see that a task should be done and decide to do them (racing on updating the last-completion time). It's minor, and indeed happens every ~5min as expected.

The new loop means the MediaWiki periodic tasks methods can be disabled however, being redundant (the background runner takes care of them). I posted https://gerrit.wikimedia.org/r/#/c/177275/ to do this.

bd808 moved this task from Backlog to Done on the MediaWiki-Core-Team board.Dec 3 2014, 7:20 PM

greg moved this task from Externally Blocked to Done on the Beta-Cluster-Infrastructure board.Dec 3 2014, 9:32 PM

bd808 moved this task from Done to Archive on the MediaWiki-Core-Team board.Dec 8 2014, 10:54 PM

bd808 moved this task from Scheduled to Done on the Scrum-of-Scrums board.Dec 9 2014, 10:42 PM

beta cluster job runner keep running some periodic tasksClosed, ResolvedPublicActions

Description

Details

Event Timeline

beta cluster job runner keep running some periodic tasks
Closed, ResolvedPublic
Actions