
jobrunner trapped in a loop because of webVideoTranscode job
Closed, ResolvedPublic

Description

The job-loop triggers calls to the MediaWiki maintenance/runJobs.php script. For some reason, the processes never end and eat up all the CPU.

The processes look like this:

mwscript runJobs.php --wiki=commonswiki --procs=5 &

That is, no job type is specified.

The commonswiki job table had two job requests for webVideoTranscode:

(mw@deployment-sql) [commonswiki]> select * from job \G

*************************** 1. row ***************************
       job_id: 1917
      job_cmd: webVideoTranscode
job_namespace: 6
    job_title: Mayday2012-edit-1.ogv
job_timestamp: 20120523195317
   job_params: a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:8:"160p.ogv";}
*************************** 2. row ***************************
       job_id: 1918
      job_cmd: webVideoTranscode
job_namespace: 6
    job_title: Mayday2012-edit-1.ogv
job_timestamp: 20120523195317
   job_params: a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:9:"480p.webm";}

2 rows in set (0.00 sec)
(mw@deployment-sql) [commonswiki]>
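
The job_params values are PHP-serialized arrays. As a reading aid, here is a minimal Python sketch that decodes just the subset of that format seen in these two rows (arrays, strings, integers). The function name parse_php_value is made up for this illustration and is not part of MediaWiki; it also assumes ASCII content, since PHP string lengths count bytes rather than characters.

```python
def parse_php_value(data, pos=0):
    """Parse one PHP-serialized value starting at data[pos].

    Hypothetical helper for illustration only. Supports i:, s: and a:
    and assumes ASCII text. Returns (value, position after the value).
    """
    kind = data[pos]
    if kind == "i":  # i:<int>;
        end = data.index(";", pos)
        return int(data[pos + 2:end]), end + 1
    if kind == "s":  # s:<length>:"<text>";
        colon = data.index(":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2  # skip the colon and the opening quote
        # skip the closing quote and the semicolon after the text
        return data[start:start + length], start + length + 2
    if kind == "a":  # a:<count>:{<key><value>...}
        colon = data.index(":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2  # skip the colon and the opening brace
        result = {}
        for _ in range(count):
            key, pos = parse_php_value(data, pos)
            value, pos = parse_php_value(data, pos)
            result[key] = value
        return result, pos + 1  # skip the closing brace
    raise ValueError(f"unsupported type marker {kind!r} at offset {pos}")


row_1917 = 'a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:8:"160p.ogv";}'
row_1918 = 'a:2:{s:13:"transcodeMode";s:10:"derivative";s:12:"transcodeKey";s:9:"480p.webm";}'

params_1917, _ = parse_php_value(row_1917)
params_1918, _ = parse_php_value(row_1918)
print(params_1917)
print(params_1918)
```

Decoded this way, the two jobs differ only in transcodeKey: one requests the 160p.ogv derivative and the other the 480p.webm derivative of the same file.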

So it seems the runJobs.php script keeps looping forever trying to complete the jobs.

Deleting the jobs solves the looping issue:

(mw@deployment-sql) [commonswiki]> delete from job;
Query OK, 2 rows affected (0.38 sec)


Version: unspecified
Severity: normal

Details

Reference
bz37072

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 12:30 AM
bzimport set Reference to bz37072.
bzimport added a subscriber: Unknown Object (MLST).

Find out:

  • why the triggered jobs never end up running
  • why, despite there being only 2 jobs, there are several forked processes
  • why the jobs stick in the queue

Looks like job::pop() fails to delete the jobs from the database :-(

I found the root cause while sleeping this weekend.

The cause is that transcode jobs are excluded from processing by runJobs.php (through $wgJobTypesExcludedFromDefaultQueue), whereas nextJobDB.php still considers those jobs as needing processing. The end result is an infinite loop: the jobs are never processed, so the queue never empties and runJobs.php keeps being launched for it.

Hence the addition of $wgJobTypesExcludedFromDefaultQueue, by commit 45f9da8ad7, needs to be enhanced.
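
To make the mismatch concrete, here is a small Python model of the loop. It is only a sketch under stated assumptions, not MediaWiki code: EXCLUDED_TYPES stands in for $wgJobTypesExcludedFromDefaultQueue, wiki_has_pending_jobs plays the role of nextJobDB.php, run_default_jobs plays the role of an untyped runJobs.php pass, and all function names and the max_iterations guard are invented for the illustration.

```python
# Stand-in for $wgJobTypesExcludedFromDefaultQueue (assumed value).
EXCLUDED_TYPES = {"webVideoTranscode"}


def wiki_has_pending_jobs(queue, honor_exclusions):
    """Model of the selector step (the role nextJobDB.php plays).

    With honor_exclusions=False it counts every queued job, which is
    the behaviour described in this report.
    """
    if honor_exclusions:
        return any(job["cmd"] not in EXCLUDED_TYPES for job in queue)
    return len(queue) > 0


def run_default_jobs(queue):
    """Model of an untyped runner pass: excluded job types are skipped."""
    runnable = [job for job in queue if job["cmd"] not in EXCLUDED_TYPES]
    for job in runnable:
        queue.remove(job)  # "running" a job removes it from the queue
    return len(runnable)


def job_loop(queue, honor_exclusions, max_iterations=1000):
    """Return how many runner launches happen before the loop goes idle,
    or None if it is still spinning after max_iterations launches."""
    launches = 0
    while wiki_has_pending_jobs(queue, honor_exclusions):
        if launches >= max_iterations:
            return None  # guard so the demonstration itself terminates
        run_default_jobs(queue)
        launches += 1
    return launches


transcodes = [
    {"id": 1917, "cmd": "webVideoTranscode"},
    {"id": 1918, "cmd": "webVideoTranscode"},
]

# Selector ignores the exclusion list: the loop never goes idle.
print(job_loop(list(transcodes), honor_exclusions=False))
# Selector applies the same exclusion list as the runner: nothing to do.
print(job_loop(list(transcodes), honor_exclusions=True))
```

In this model the loop only terminates when the selector and the runner agree on which job types count, which is the property the enhancement needs to restore.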

Raising priority as a reminder to get that reviewed ASAP. It causes disruptions on deployment-prep.

Patch to MW Core:
https://gerrit.wikimedia.org/r/9116

Gerrit change 9116, which fixed nextJobDB.php, has been merged. A similar issue occurs with runJobs.php, which can also lead to an infinite loop. The proposed change is:

https://gerrit.wikimedia.org/r/10692

Both patches are merged. I have applied them to the beta cluster and the infinite loop no longer occurs.