
AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes
Closed, Resolved (Public)

Description

[Description edited to replace previous misunderstandings]:
Right now, AssembleUploadChunks jobs fail after about 3 minutes and then get retried 1 minute later, despite retries being disabled for this job type. The same happens for PublishStashedFile jobs.

These jobs need a fair amount of time: in the worst case they may be calculating an SHA1 hash of a 5GB file and potentially moving the file to/from Swift. 3 minutes is way too short; 10-20 minutes would be a more reasonable timeout.

I think the retry happens because the "no retry" code lives in EventBus\JobExecutor, which is run from rpc/RunSingleJob.php. It works by reporting failed jobs that cannot be retried as successes. If the HTTP connection dies due to a timeout, that is never communicated to the change-propagation service, so it sees the request as a plain failure and retries the job.
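
For context, here is a rough sketch of that pattern as I understand it (the function and structure are illustrative only, not the actual EventBus\JobExecutor code; Job::run(), Job::allowRetries() and Job::getLastError() are real MediaWiki methods):

<?php
// Illustrative sketch only -- not the real EventBus\JobExecutor implementation.
// The point: non-retryable failures are reported as HTTP success, so retry
// suppression only works if MediaWiki actually gets to send a response.
function executeSingleJob( Job $job ): array {
    $ok = $job->run();
    if ( !$ok && !$job->allowRetries() ) {
        // Pretend the job succeeded so change-propagation does not re-enqueue it.
        // If the connection is killed by a timeout before this response is sent,
        // change-propagation only sees a failed request and retries anyway.
        return [ 'status' => 200, 'body' => [ 'ok' => true ] ];
    }
    return $ok
        ? [ 'status' => 200, 'body' => [ 'ok' => true ] ]
        : [ 'status' => 500, 'body' => [ 'ok' => false, 'error' => $job->getLastError() ] ];
}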

I believe this points to the HTTP request being killed after 3 minutes. MediaWiki on the job queue servers is configured with a $wgRequestTimeLimit of 1200 seconds, so we should really have the HTTP timeout be higher than that, so that MW can properly communicate errors.
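
For reference, the budget I mean looks roughly like this (a simplified sketch, not the literal wmf-config code; the server-group check is a stand-in):

<?php
// Simplified sketch, not literal wmf-config. On the jobrunner server group the
// request time limit is 1200s, so every proxy and php-fpm timeout in front of
// it has to be at least that long for MW to be the one that reports timeout errors.
if ( $serverGroup === 'jobrunner' ) { // stand-in for however wmf-config detects the group
    $wgRequestTimeLimit = 1200;
}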

It seems like k8s has the service mesh configured with a default timeout of 180 seconds for MediaWiki. change-propagation has a default timeout of 7 minutes.

An example of such a case is request id of 9b4b9d1e-6a32-4591-8f72-89af061843b6

Event Timeline

Reading the config files, it seems like the timeout should already be about 7 minutes (change-propagation DEFAULT_REQUEST_TIMEOUT = 7 minutes; job runner $wgRequestTimeout = 20 minutes).

The retry happens after exactly 4 minutes. Since it looks like change-prop waits 1 minute after a failure, this suggests a timeout is being reached after 3 minutes. Assuming all of this is right, it would have to be something affecting normal job queues and not videoscalers, since the videoscalers clearly have their timeout set correctly.

180 seconds is what the ATS & haproxy timeouts are set to, but I would assume that internal requests to https://mw-jobrunner.discovery.wmnet:4448/rpc/RunSingleJob.php would not go through those.

However, there is a difference: AssembleUploadChunks is using Kubernetes whereas webVideoTranscode is using the old system. So if I were to guess, it is maybe something related to k8s.

Oh, I see that deployment-charts/charts/mediawiki/values.yaml has an upstream_timeout set to 180.0 seconds. I believe the mw-jobrunners will inherit that, so presumably helmfile.d/services/mw-jobrunner/values.yaml should set upstream_timeout under mesh: to 1203 seconds (one more second than the Apache timeout), or at least something closer to that. I'm unclear whether it is expected that upstream_timeout and the change-propagation timeout should be above MW's $wgRequestTimeout in order to capture all errors, or whether the HTTP connection should close before that point in order to trigger a client abort on the PHP side.

@hnowlan Can you confirm if this reasoning is correct? I'm not very familiar with how the k8s stuff works.

Bawolff renamed this task from Increase timeout on AssembleUploadChunksJob to AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes. (Mar 2 2024, 1:59 AM)
Bawolff updated the task description.

Change 1007996 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[operations/deployment-charts@master] Increase tls timeout for mw-jobrunners to fix upload jobs

https://gerrit.wikimedia.org/r/1007996

Change 1007997 had a related patch set uploaded (by Brian Wolff; author: Brian Wolff):

[operations/deployment-charts@master] Increase timeout on misc jobs to 19 min for upload jobs

https://gerrit.wikimedia.org/r/1007997

It's really only uploads > 1GB that have this issue, which is a very small portion of uploads. Maybe it would make sense to split the AssembleUploadChunks job into a big and a small version depending on how big the file is (rough sketch below). The big version could be handled more like webVideoTranscode jobs, and the small version could be handled like a normal job. (We would possibly have to do the same thing for PublishStashedFile, although much of the work in that job is re-verifying the file, which isn't strictly needed since it has already been done; we could possibly just make it not do that, which might make it much faster for large files.)
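
Roughly what I have in mind at enqueue time (pure sketch; the 'AssembleUploadChunksLarge' job type and the 1 GiB threshold are made up for illustration, and the real enqueue site is in the chunked-upload code path):

<?php
use MediaWiki\MediaWikiServices;

// Sketch of the split-by-size idea; the large-job type name and the threshold
// are hypothetical. $fileSize, $filename and $fileKey would come from the
// upload request in the real code.
$threshold = 1024 * 1024 * 1024; // 1 GiB
$jobType = ( $fileSize > $threshold )
    ? 'AssembleUploadChunksLarge' // would be routed like webVideoTranscode (long-running pool)
    : 'AssembleUploadChunks';     // stays on the normal job queue
MediaWikiServices::getInstance()->getJobQueueGroup()->push(
    new JobSpecification( $jobType, [ 'filename' => $filename, 'filekey' => $fileKey ] )
);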

Change #1007996 merged by jenkins-bot:

[operations/deployment-charts@master] Increase tls timeout for mw-jobrunners to fix upload jobs

https://gerrit.wikimedia.org/r/1007996

I've deployed the change to increase the envoy timeout on the jobrunners (thanks for the patches!). I'll continue to monitor before we make any further changes. Are these failures graphed anywhere in Grafana? AIUI Changeprop doesn't really have enough awareness of a job to get proper insight into these kinds of failures.

Mostly I was looking at how many uploads get to this stage but don't get past it. I was looking at https://logstash.wikimedia.org/goto/f299fd6b01197908797af218435bd927 and comparing "received last chunk" vs "chunk recombined" messages.

On an individual level, you can also check how many AssembleUploadChunks jobs get run multiple times without being marked as an error.

I also told one of the people on Commons who was complaining to try uploading their file again, to see if it works.

Looking at a recent example, it seems like the change didn't take effect (or I was wrong about the cause).

As an example: https://logstash.wikimedia.org/goto/8edbd98fddc2cf0ff00f426aa9b1d7c3
We have the first job queue log entry for this job at Apr 2, 2024 @ 19:58:35.209, then a second attempt at Apr 2, 2024 @ 20:02:57.114, which is about 4 minutes 22 seconds later. Similarly, for https://logstash.wikimedia.org/goto/91377bd2f59ec43c3bd5ca0561ce54c3 we have the first log entry at Apr 3, 2024 @ 04:42:47 and the failed retry at Apr 3, 2024 @ 04:47:09.397, 4 minutes 28 seconds later. This does seem slightly different, as earlier they seemed to be exactly 4 minutes apart, but maybe the job queue is just a bit busier at the moment, so there was a bit of delay on the retry.

There is also this example https://logstash.wikimedia.org/goto/d11a5d3dbab594c19e486ddd5b707e2f where the retry happened after only 1.5 minutes, which I don't really understand unless the php process died or something.

There is sort of a graph of these types of failures at https://logstash.wikimedia.org/goto/654092fb1ca62897a96d96ec4d5d094c

Having looked at this a little further, I have a strong suspicion the timeout is coming from somewhere other than the config we currently have defined (perhaps a default somewhere that isn't explicit). I'm going to keep looking into it; there's a chance this timeout affects other mw-on-k8s charts but just hasn't bitten us because they don't use the longer timeouts the jobrunners do.

Change #1017819 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-jobrunner: bump max_execution_time

https://gerrit.wikimedia.org/r/1017819

Change #1017819 merged by jenkins-bot:

[operations/deployment-charts@master] mw-jobrunner: bump max_execution_time

https://gerrit.wikimedia.org/r/1017819

Kubernetes jobrunners were using the default max_execution_time of 180s, which might be the culprit for these behaviours. I've bumped it to match the metal jobrunners.

Hmm, that might have also caused problems for web requests: in the current configuration, MediaWiki assumes that web POST requests can last more than 200 seconds, and there is code that tries to ensure that a request taking too much time is cleaned up gracefully rather than dying inside a critical section.

As an aside, the jobs do seem to be failing after about 200 seconds, which is the timeout for a non-jobrunner POST request. However, I read through the code and don't see how that limit could be applied here, since it's based on SERVERGROUP and the log entries from the jobs have the correct SERVERGROUP.

Change #1018231 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-jobrunner: use same request_terminate_timeout as metal

https://gerrit.wikimedia.org/r/1018231

Change #1018231 merged by jenkins-bot:

[operations/deployment-charts@master] mw-jobrunner: use same request_terminate_timeout as metal

https://gerrit.wikimedia.org/r/1018231

Kubernetes jobrunners were using the default max_execution_time of 180s, which might be the culprit for these behaviours. I've bumped it to match the metal jobrunners.

Sorry, I misspoke - in the mediawiki chart we define a default max_execution_time of 210 seconds. There were other default timeouts in the jobrunner config that appear to have been missed; those are now fixed.

Looking at recent logs, it does seem like some of the failures might be caused by things other than a timeout (there was one recently that failed after 17 seconds). However, I still think the larger files are failing due to some sort of timeout, as they tend to fail at around the 202-second mark, which is pretty suspicious. The sample size is pretty low, since users know large uploads don't work, so they don't try to upload large files.

Change #1019823 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-jobrunner: set more php-specific settings to match metal instances

https://gerrit.wikimedia.org/r/1019823

Change #1019823 merged by jenkins-bot:

[operations/deployment-charts@master] mw-jobrunner: set more php-specific settings to match metal instances

https://gerrit.wikimedia.org/r/1019823

hnowlan triaged this task as High priority. (Tue, Apr 16, 3:05 PM)

We can exclude a bad setting of the async trait for mw-jobrunner.
From a pod in production, via shell.php:

> use Wikimedia\MWConfig\ClusterConfig;
> ClusterConfig::getInstance()->isK8s()
= true

> ClusterConfig::getInstance()->isAsync()
= true

[...] However, I still think the larger files are failing due to some sort of timeout, as they tend to fail at around the 202-second mark, which is pretty suspicious.

202 in this case is very suspicious, as this is our current base timeout in apache.conf. However, it appears very unlikely that this is being hit in the jobrunners themselves.

How can I get into a pod in the jobrunner namespace via shell.php? I want to try some stuff.

OK, I did some more digging in Logstash.

For req-id: a0d2bc1e-203c-4a20-a753-3da12c41fc38 (upload attempt of File:Overwinter.webm):

1st attempt at Apr 16, 2024 @ 05:30:09.184
2nd attempt at 05:34:31.377.
Since there is a 1 minute delay before a job retry, the first job should have died at around 05:33. I looked for other log entries on that pod at 05:33, and I found an entry that I think is for that job (there is no request ID associated with the log entry, so I can't be entirely sure):

_id=zWJl5Y4B1Aouzw__aYGJ

[16-Apr-2024 05:33:31] WARNING: [pool www] child 12, script '/srv/mediawiki/rpc/RunSingleJob.php' (request: "POST /rpc/RunSingleJob.php") execution timed out (202.155670 sec), terminating
[16-Apr-2024 05:33:31] WARNING: [pool www] child 12 exited on signal 15 (SIGTERM) after 6534.727666 seconds from start

For req-id: 50e342b4-966e-4e7b-943f-20be2bf741c9
1st attempt at Apr 16, 2024 @ 20:47:10.014
2nd attempt at Apr 16, 2024 @ 20:48:23.358
Probable related logs for the same host at around 20:47:23:

time="2024-04-16T20:47:30Z" level=info msg="Delete pod: mw-jobrunner/mw-jobrunner.eqiad.main-b6844d65c-7hqqr"
I0416 20:47:23.649334     1 Server-inl.h:192] Shutting down

(Logstash _id is qzKq6I4B1Aouzw__FqjP and Evmq6I4BeyUZP0OuL5sd)


So it seems like two separate issues.

I guess sometimes the jobrunner pod gets terminated in the middle of a job. That would be fine if something like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008403 got merged.

The other case seems like a 202-second timeout. I guess that is the php-fpm log, although I don't know whether an Apache timeout would also show up there; either way, it sounds like it is either the php-fpm or the Apache timeout.

How can I get into a pod in the jobrunner namespace via shell.php? I want to try some stuff.

Follow the instructions on getting a shell on a production pod for the mw-jobrunner namespace, then run php /srv/mediawiki/multiversion/MWScript.php "shell.php" --wiki whateverwiki.

I think I found it:

www-data@mw-jobrunner:/$ rgrep request_terminate_timeout /etc/php/
/etc/php/7.4/fpm/pool.d/FCGI_UNIX.conf:request_terminate_timeout = 201
/etc/php/7.4/fpm/pool.d/FCGI_TCP.conf:request_terminate_timeout = 201

Change #1021415 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/docker-images/production-images@master] php7.4-fpm: Actually use FPM__request_terminate_timeout

https://gerrit.wikimedia.org/r/1021415

Change #1021418 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Fix php.timeout

https://gerrit.wikimedia.org/r/1021418

Change #1021415 merged by Clément Goubert:

[operations/docker-images/production-images@master] php7.4-fpm: Actually use FPM__request_terminate_timeout

https://gerrit.wikimedia.org/r/1021415

Mentioned in SAL (#wikimedia-operations) [2024-04-18T10:45:15Z] <claime> Rebuild php7.4-fpm production images - T358308

Change #1021418 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Fix php.timeout

https://gerrit.wikimedia.org/r/1021418

Mentioned in SAL (#wikimedia-operations) [2024-04-18T10:52:15Z] <cgoubert@deploy1002> Started scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308

Change #1021427 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: fix php.timeout

https://gerrit.wikimedia.org/r/1021427

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:29:19Z] <cgoubert@deploy1002> Finished scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308 (duration: 37m 04s)

Change #1021427 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: fix php.timeout

https://gerrit.wikimedia.org/r/1021427

request_terminate_timeout for mw-jobrunner should now be set to 86400, as it was on bare metal.

[...]
So it seems like two separate issues.

I guess sometimes the jobrunner pod gets terminated in the middle of a job. That would be fine if something like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008403 got merged.

Yes, that will happen on every mw-on-k8s deployment and is sort of unavoidable for now given how we do deployments.

The other case seems like a 202-second timeout. I guess that is the php-fpm log, although I don't know whether an Apache timeout would also show up there; either way, it sounds like it is either the php-fpm or the Apache timeout.

[Attached graph: image.png]

It looks like setting request_terminate_timeout correctly worked. I'll keep an eye on it.

Seems like this is fixed! We just had File:OBR_Hafignnover_6-2024.webm (4.88 GB) successfully uploaded (req-id: c1a5aab7-6e6b-4ef5-b20d-f9ddb095577f). The job took 5 minutes 20 seconds to complete, so it went well beyond the previous 202-second limit.

Yes, that will happen on every mw-on-k8s deployment and is sort of unavoidable for now given how we do deployments.

Alright. It should be possible to work around on the MediaWiki side, so that's not as big a deal.


Thank you to everyone who put effort into bringing this to a resolution.

Clement_Goubert claimed this task.

Marking this resolved as you just confirmed a big file upload going through correctly. Thanks for your help in debugging this!