Bad LocalRenameUserJob stuck in jobrunner for vewikimedia
Closed, Resolved (Public)

Description

The following exception is being logged in exception.log 900 to 1000 times per second:

2015-01-22 06:14:03 mw1001 vewikimedia: [db53b619] /rpc/RunJobs.php?wiki=vewikimedia&type=LocalRenameUserJob&maxtime=60&maxmem=300M MWException from line 366 of /srv/mediawiki/php-1.25wmf14/includes/jobqueue/JobQueue.php: Unrecognized job type 'LocalRenameUserJob'.
#0 /srv/mediawiki/php-1.25wmf14/includes/jobqueue/JobQueueGroup.php(155): JobQueue->pop()
#1 /srv/mediawiki/php-1.25wmf14/includes/jobqueue/JobRunner.php(112): JobQueueGroup->pop()
#2 /srv/mediawiki/rpc/RunJobs.php(42): JobRunner->run()
#3 {main}
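
A rough way to check the per-second rate is to bucket matching log lines by their timestamp. This is only a sketch: `count_per_second` is a hypothetical helper name, it reads the log on stdin (no log path is assumed), and it relies on each entry starting with a `YYYY-MM-DD HH:MM:SS` timestamp as in the line above.

```shell
#!/bin/sh
# Hypothetical helper: count "Unrecognized job type" exceptions per second.
# Assumes the first two whitespace-separated fields of each matching line
# are the date and the time, as in the exception.log entry quoted above.
count_per_second() {
    grep -F "Unrecognized job type" |
        awk '{ n[$1 " " $2]++ } END { for (t in n) print t, n[t] }' |
        sort
}
```

Feeding it the log, for example `count_per_second < exception.log`, prints one line per second with the number of matching entries.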

Event Timeline

bd808 raised the priority of this task to Unbreak Now!.
bd808 updated the task description.
bd808 subscribed.

[21:35] < legoktm> so, the CentralAuth db had references to a "vewikimedia", which at one point was connected to SUL. It was recently re-opened as a fishbowl, meaning it's not SUL and CA isn't installed. But a user who the db thought existed on vewikimedia was global renamed, and it queued a job on vewikimedia except the job class doesn't exist there

[21:37] < legoktm> legoktm@terbium:~$ mwscript showJobs.php --wiki=vewikimedia --group <-- outputs nothing

The crazy volume of this exception event may be what is killing logstash as well.

I tried this:

$ mwscript eval.php --wiki=vewikimedia
> print_r( JobQueueGroup::singleton()->get('LocalRenameUserJob')->getSize() );
1
> print_r( JobQueueGroup::singleton()->get('LocalRenameUserJob')->delete() );

> print_r( JobQueueGroup::singleton()->get('LocalRenameUserJob')->getSize() );
1

@Tgr found a way to kill jobs from redis for T87040#984282 (path refers to tin); perhaps the same approach could be used here.

Although getSize now returns 0 for me, the exception log is still getting flooded.

Would just deleting vewikimedia solve this problem? It's currently not in use, and they requested that it redirect to their current wiki instead. See T57737 and linked bugs.

Trying to purge directly via redis commands as used in T87040#984282. The contents of redis-vewikimedia-clear.txt, followed by the run and its output:

LPOP vewikimedia:jobqueue:LocalRenameUserJob:l-unclaimed
ZREMRANGEBYRANK vewikimedia:jobqueue:LocalRenameUserJob:z-claimed 0 10
ZREMRANGEBYRANK vewikimedia:jobqueue:LocalRenameUserJob:z-abandoned 0 10
ZREMRANGEBYRANK vewikimedia:jobqueue:LocalRenameUserJob:z-delayed 0 10
$ redis-cli -a $PASSWORD -h rdb1003 < redis-vewikimedia-clear.txt
(nil)
(integer) 0
(integer) 0
(integer) 0

Log continues to flood.
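
The (nil) and 0 replies above suggest the per-wiki keys were already empty. A sketch for confirming that before purging is to ask redis for the size of each key. `queue_size_commands` is a hypothetical helper; the key layout is copied from the commands in this task rather than checked against the job queue source, and the key types (list for l-unclaimed, sorted sets for z-*) are inferred from the LPOP and ZREMRANGEBYRANK calls used above.

```shell
#!/bin/sh
# Hypothetical helper: print redis commands that report the size of each
# per-wiki queue key for one job type.
# Usage: queue_size_commands <wiki> <job type>
queue_size_commands() {
    wiki=$1
    job_type=$2
    prefix="${wiki}:jobqueue:${job_type}"
    # l-unclaimed is popped with LPOP above, so treat it as a list.
    printf 'LLEN %s:l-unclaimed\n' "$prefix"
    # The z-* keys are trimmed with ZREMRANGEBYRANK above, so treat them
    # as sorted sets.
    for set in z-claimed z-abandoned z-delayed; do
        printf 'ZCARD %s:%s\n' "$prefix" "$set"
    done
}
```

The output can be piped to redis-cli the same way as the purge file, for example `queue_size_commands vewikimedia LocalRenameUserJob | redis-cli -a $PASSWORD -h rdb1003`. If every reply is 0, the flood is probably not coming from leftover jobs in these keys.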

bd808 claimed this task.
bd808 added a subscriber: ori.

@ori was able to help me fix this yesterday. The redis purges I had done removed the job, but the jobrunner instances were still trying to poll the "LocalRenameUserJob/vewikimedia" job queue. The fix for this was to remove the queue from the list of ready queues:

redis-cli -h rdb1001.eqiad.wmnet -a $PASSWORD hdel jobqueue:aggregator:h-ready-queues:v2 LocalRenameUserJob/vewikimedia
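
For next time, the same removal can be written for any job type and wiki. This is a sketch, not a tested procedure: `ready_queue_commands` is a hypothetical helper, the hash name and the "type/wiki" field format are taken from the command above, and the HEXISTS check is an added assumption about what is worth verifying before deleting.

```shell
#!/bin/sh
# Hypothetical helper: print redis commands that check for, then remove,
# one entry in the aggregator's hash of ready queues.
# Usage: ready_queue_commands <job type> <wiki>
ready_queue_commands() {
    job_type=$1
    wiki=$2
    hash='jobqueue:aggregator:h-ready-queues:v2'
    field="${job_type}/${wiki}"
    # HEXISTS replies 1 if the queue is still listed as ready, 0 otherwise.
    printf 'HEXISTS %s %s\n' "$hash" "$field"
    # HDEL replies with the number of fields actually removed.
    printf 'HDEL %s %s\n' "$hash" "$field"
}
```

Usage would look like `ready_queue_commands LocalRenameUserJob vewikimedia | redis-cli -h rdb1001.eqiad.wmnet -a $PASSWORD`. A reply of 1 followed by 1 would mean the entry was present and has been removed.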

Trying to make the answers here easier to find the next time I'm looking for them by adding the job queue and runner projects.