Reconcile MediaWiki POST timeout and Varnish/ATS timeouts
Open, MediumPublic
Actions

Assigned To

None

Authored By

	Legoktm
	Nov 1 2021, 11:01 PM

Description

Per https://wikitech.wikimedia.org/wiki/HTTP_timeouts, varnish/ATS has a 180 second timeout, but MediaWiki has $wgRequestTimeLimit set to 200 seconds on POST requests, leading to the situation where some POST requests will result in the user seeing an error screen from Varnish, rather than the one generated by MediaWiki.

Either the Varnish/ATS timeout should be raised to match MW or the MW should be lowered to be slightly under Varnish/ATS, per the parent task.

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Open		None	T250251 Audit and harmonize timeouts across the stack
		Open		None	T294800 Reconcile MediaWiki POST timeout and Varnish/ATS timeouts

Event Timeline

Legoktm created this task.Nov 1 2021, 11:01 PM

Legoktm mentioned this in T292954: [epic] large file uploads to commons.Nov 1 2021, 11:03 PM

Krinkle added a project: Performance-Team (Radar).Nov 2 2021, 4:02 AM

If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous requests to something nearer to the GET timeout of 60 seconds.

Any operation that takes longer than that (or even around that time, but I digress) should be broken down in an async process and executed by our battery of long-running jobs executors (the jobqueue).

With MediaWiki on kubernetes we will separate workers further between configurations, and we will be able to keep a longer timeout for internal operations, and a shorter one for public-facing ones.

Yann subscribed.Nov 2 2021, 10:42 PM

In T294800#7473846, @Joe wrote:

If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous requests to something nearer to the GET timeout of 60 seconds.

Any operation that takes longer than that (or even around that time, but I digress) should be broken down in an async process and executed by our battery of long-running jobs executors (the jobqueue).

In general I agree with all of this, but I'd like to be able to just solve the current problem of the timeouts being out of sync. Currently the problem is that on uploads-by-url, they hit the 180s download timeout (and also most likely hit the 200s MW timeout) but by that time, Varnish has returned an error page to the user, so the user has no clue why their upload failed.

I haven't yet looked into why MW is currently set to 200s, I would be OK with lowering the upload-by-url timeout to 170s, the MediaWiki POST timeout to 175s and leaving Varnish at 180s - with the thumbs up from Platform.

Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST requests currently take more than 175s but finish before 200s, so we can gauge the impact of this. Or whether we need MW to log that data for us.

In T294800#7480542, @Legoktm wrote:

Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST requests currently take more than 175s but finish before 200s, so we can gauge the impact of this. Or whether we need MW to log that data for us.

MediaWiki, or rather, apache httpd does log that information for us:

awk 'BEGIN {c=0}{if ($2 > 175000000 && $6 == "POST" && $2 < 200000000) {c+=1 }} END {print c}' /var/log/apache2/other_vhosts_access.log

will give you the count on a single appserver of all requests taking more than 175 seconds.

Using cumin, I find 4 occurences in today's appservers logs and 8 in yesterday's.

cumin1001:~$ sudo cumin 'A:mw-eqiad' "awk 'BEGIN {c=0}{if (\$2 > 175000000 && \$6 == \"POST\" && \$2 < 200000000 ) {c+=1 }} END {print c}' /var/log/apache2/other_vhosts_access.log.1"
73 hosts will be targeted:
mw[1319-1333,1349-1355,1364-1373,1384-1385,1387,1389,1391,1393,1395,1397,1399,1401,1403,1405,1407,1409,1411,1413-1420,1429-1436,1441-1442,1451-1456].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Ok to proceed on 73 hosts? Enter the number of affected hosts to confirm or "q" to quit 73
===== NODE GROUP =====                                                                                                                                                                                             
(8) mw[1323-1324,1332,1353,1364,1366,1407,1411].eqiad.wmnet                                                                                                                                                        
----- OUTPUT of 'awk 'BEGIN {c=0}...sts_access.log.1' -----                                                                                                                                                        
1                                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(65) mw[1319-1322,1325-1331,1333,1349-1352,1354-1355,1365,1367-1373,1384-1385,1387,1389,1391,1393,1395,1397,1399,1401,1403,1405,1409,1413-1420,1429-1436,1441-1442,1451-1456].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet                                                                                                                                                                                                              
----- OUTPUT of 'awk 'BEGIN {c=0}...sts_access.log.1' -----                                                                                                                                                        
0

At 7k requests/second, that's really marginal. But more to my point: what should change is the fact that upload-by-url can take up to 200 seconds to complete a single HTTP request.
HTTP requests should terminate reasonably quickly and the work should be done in the background via the jobqueue, and the client should have a way of polling it.

It's a bit hard to justify asking Traffic to bump the HTTP timeout by 8% because of a software limitation no one will fix in the forseeable future.

Let me add another data point: Of those 8 requests over 175 seconds, only 2 were to POSTs to Special:Upload.

Legoktm mentioned this in T295007: Upload by URL should use the job queue, possibly chunked with range requests.Nov 4 2021, 7:45 AM

colewhite triaged this task as Medium priority.Nov 8 2021, 10:34 PM

Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.Nov 22 2021, 7:46 PM

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊Backlog on the serviceops board.Sep 28 2022, 2:20 PM

Krinkle moved this task from Perf recommendation to Watching on the Performance-Team (Radar) board.Oct 4 2022, 1:17 PM

Krinkle removed a project: Performance-Team (Radar).Aug 6 2023, 10:37 PM

Reconcile MediaWiki POST timeout and Varnish/ATS timeoutsOpen, MediumPublicActions

Description

Related ObjectsSearch...

Event Timeline

Reconcile MediaWiki POST timeout and Varnish/ATS timeouts
Open, MediumPublic
Actions

Related Objects
Search...