Page MenuHomePhabricator

Reconcile MediaWiki POST timeout and Varnish/ATS timeouts
Open, MediumPublic

Description

Per https://wikitech.wikimedia.org/wiki/HTTP_timeouts, varnish/ATS has a 180 second timeout, but MediaWiki has $wgRequestTimeLimit set to 200 seconds on POST requests, leading to the situation where some POST requests will result in the user seeing an error screen from Varnish, rather than the one generated by MediaWiki.

Either the Varnish/ATS timeout should be raised to match MW or the MW should be lowered to be slightly under Varnish/ATS, per the parent task.

Event Timeline

If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous requests to something nearer to the GET timeout of 60 seconds.

Any operation that takes longer than that (or even around that time, but I digress) should be broken down in an async process and executed by our battery of long-running jobs executors (the jobqueue).

With MediaWiki on kubernetes we will separate workers further between configurations, and we will be able to keep a longer timeout for internal operations, and a shorter one for public-facing ones.

If anything, I think we should go in the other direction, and progressively and drastically reduce our timeouts for any synchronous requests to something nearer to the GET timeout of 60 seconds.

Any operation that takes longer than that (or even around that time, but I digress) should be broken down in an async process and executed by our battery of long-running jobs executors (the jobqueue).

In general I agree with all of this, but I'd like to be able to just solve the current problem of the timeouts being out of sync. Currently the problem is that on uploads-by-url, they hit the 180s download timeout (and also most likely hit the 200s MW timeout) but by that time, Varnish has returned an error page to the user, so the user has no clue why their upload failed.

I haven't yet looked into why MW is currently set to 200s, I would be OK with lowering the upload-by-url timeout to 170s, the MediaWiki POST timeout to 175s and leaving Varnish at 180s - with the thumbs up from Platform.

Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST requests currently take more than 175s but finish before 200s, so we can gauge the impact of this. Or whether we need MW to log that data for us.

Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST requests currently take more than 175s but finish before 200s, so we can gauge the impact of this. Or whether we need MW to log that data for us.

MediaWiki, or rather, apache httpd does log that information for us:

awk 'BEGIN {c=0}{if ($2 > 175000000 && $6 == "POST" && $2 < 200000000) {c+=1 }} END {print c}' /var/log/apache2/other_vhosts_access.log

will give you the count on a single appserver of all requests taking more than 175 seconds.

Using cumin, I find 4 occurences in today's appservers logs and 8 in yesterday's.

cumin1001:~$ sudo cumin 'A:mw-eqiad' "awk 'BEGIN {c=0}{if (\$2 > 175000000 && \$6 == \"POST\" && \$2 < 200000000 ) {c+=1 }} END {print c}' /var/log/apache2/other_vhosts_access.log.1"
73 hosts will be targeted:
mw[1319-1333,1349-1355,1364-1373,1384-1385,1387,1389,1391,1393,1395,1397,1399,1401,1403,1405,1407,1409,1411,1413-1420,1429-1436,1441-1442,1451-1456].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Ok to proceed on 73 hosts? Enter the number of affected hosts to confirm or "q" to quit 73
===== NODE GROUP =====                                                                                                                                                                                             
(8) mw[1323-1324,1332,1353,1364,1366,1407,1411].eqiad.wmnet                                                                                                                                                        
----- OUTPUT of 'awk 'BEGIN {c=0}...sts_access.log.1' -----                                                                                                                                                        
1                                                                                                                                                                                                                  
===== NODE GROUP =====                                                                                                                                                                                             
(65) mw[1319-1322,1325-1331,1333,1349-1352,1354-1355,1365,1367-1373,1384-1385,1387,1389,1391,1393,1395,1397,1399,1401,1403,1405,1409,1413-1420,1429-1436,1441-1442,1451-1456].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet                                                                                                                                                                                                              
----- OUTPUT of 'awk 'BEGIN {c=0}...sts_access.log.1' -----                                                                                                                                                        
0

At 7k requests/second, that's really marginal. But more to my point: what should change is the fact that upload-by-url can take up to 200 seconds to complete a single HTTP request.
HTTP requests should terminate reasonably quickly and the work should be done in the background via the jobqueue, and the client should have a way of polling it.

It's a bit hard to justify asking Traffic to bump the HTTP timeout by 8% because of a software limitation no one will fix in the forseeable future.

Let me add another data point: Of those 8 requests over 175 seconds, only 2 were to POSTs to Special:Upload.

colewhite triaged this task as Medium priority.Nov 8 2021, 10:34 PM