Page MenuHomePhabricator

RunSingleJob.php timeout too low at 180 seconds
Closed, ResolvedPublic

Description

I noticed some fatal errors from the job queue in Logstash for execution timeouts. This seems surprising as jobs are typically (for better or worse) the place that we defer slower and more costly computations to.

A timeout of 180s seems too low for that. Not in the least because even regular POST web requests are given 200 seconds.

When parsing a normal edit from an end-user, we allow 200s of processing time. But when a job is processing a batch of muliple cascading updates for edits, we abort after 180s? That seems broken.

In wmf-config/set-time-limit.php it says:

		case 'jobrunner.svc.eqiad.wmnet':
		case 'jobrunner.svc.codfw.wmnet':
		case 'jobrunner.discovery.wmnet':
			$limit = 1200;
			break;

		default:
			if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
				$limit = 200;
			} else {
				$limit = 60;
			}

So the fact that Logstash is showing fatal execution timeouts at 180s suggests strongly something has recently regressed here.

 Maximum execution time of 180 seconds exceeded

…
#25 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute()


…
host:  jobrunner.discovery.wmnet

Event Timeline

Krinkle triaged this task as Unbreak Now! priority.Mar 13 2020, 5:15 PM
Krinkle updated the task description. (Show Details)

Nothing regressed, we just never had jobs that executed php code for more than 180 seconds. max_execution_time in fact doesn't count time spent in syscalls, stream calls or c functions from extensions, so we never hit that limit before.

It could be important to inspect what job has gotten so slow as to need this much execution time in pure PHP.

Joe lowered the priority of this task from Unbreak Now! to High.Mar 13 2020, 5:43 PM

While this needs fixing quickly, it's not a regression and we can wait monday to manage the rollout of the fix and any unintended consequence.

Change 579961 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::jobrunner: raise max_execution_time

https://gerrit.wikimedia.org/r/579961

Change 579961 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::jobrunner: raise max_execution_time

https://gerrit.wikimedia.org/r/579961

The max execution time has been elevated to 1200 seconds on the jobrunner. The errors should disappear from now on.

Mentioned in SAL (#wikimedia-operations) [2020-03-17T10:07:49Z] <_joe_> sudo cumin -b2 -s 50 'A:mw-jobrunner' 'restart-php7.2-fpm' T247622