RunSingleJob.php timeout too low at 180 seconds
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Krinkle
	Mar 13 2020, 5:14 PM

Description

I noticed some fatal errors from the job queue in Logstash for execution timeouts. This seems surprising as jobs are typically (for better or worse) the place that we defer slower and more costly computations to.

A timeout of 180s seems too low for that. Not in the least because even regular POST web requests are given 200 seconds.

When parsing a normal edit from an end-user, we allow 200s of processing time. But when a job is processing a batch of muliple cascading updates for edits, we abort after 180s? That seems broken.

In wmf-config/set-time-limit.php it says:

		case 'jobrunner.svc.eqiad.wmnet':
		case 'jobrunner.svc.codfw.wmnet':
		case 'jobrunner.discovery.wmnet':
			$limit = 1200;
			break;

		default:
			if ( $_SERVER['REQUEST_METHOD'] === 'POST' ) {
				$limit = 200;
			} else {
				$limit = 60;
			}

So the fact that Logstash is showing fatal execution timeouts at 180s suggests strongly something has recently regressed here.

 Maximum execution time of 180 seconds exceeded

…
#25 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute()


…
host:  jobrunner.discovery.wmnet

Details

	Subject	Repo	Branch	Lines +/-
	mediawiki::jobrunner: raise max_execution_time	operations/puppet	production	+1 -0

Customize query in gerrit

Related Objects

Mentioned In: T204761: Failed executing job: categoryMembershipChange
T247621: Fatal errors caused in ParserFileProcessingHookHandlers for certain images during RunSingleJob.php

Event Timeline

Krinkle created this task.Mar 13 2020, 5:14 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 13 2020, 5:14 PM

Krinkle triaged this task as Unbreak Now! priority.Mar 13 2020, 5:15 PM

Krinkle updated the task description. (Show Details)

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptMar 13 2020, 5:15 PM

Krinkle mentioned this in T247621: Fatal errors caused in ParserFileProcessingHookHandlers for certain images during RunSingleJob.php.Mar 13 2020, 5:17 PM

Nothing regressed, we just never had jobs that executed php code for more than 180 seconds. max_execution_time in fact doesn't count time spent in syscalls, stream calls or c functions from extensions, so we never hit that limit before.

It could be important to inspect what job has gotten so slow as to need this much execution time in pure PHP.

While this needs fixing quickly, it's not a regression and we can wait monday to manage the rollout of the fix and any unintended consequence.

Joe claimed this task.Mar 16 2020, 2:47 PM

Change 579961 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::jobrunner: raise max_execution_time

https://gerrit.wikimedia.org/r/579961

gerritbot added a project: Patch-For-Review.Mar 16 2020, 2:51 PM

Change 579961 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::jobrunner: raise max_execution_time

https://gerrit.wikimedia.org/r/579961

Maintenance_bot removed a project: Patch-For-Review.Mar 16 2020, 3:10 PM