
Failure to save recent changes
Closed, Resolved (Public)

Description

For the last four hours, each time I try to save recent changes to a resource on Wikiversity I get the same error message:

Request from 205.232.250.92 via cp1067 frontend, Varnish XID 322046810
Error: 503, Backend fetch failed at Fri, 11 Nov 2016 00:21:40 GMT.

This message is from the most recent attempt.

Event Timeline

Another user was able to save "Evidence based assessment/Instruments/Navbox" at 01:35, 11 November 2016 on Wikiversity.

I was able to save geophysics quiz (Q27826294) at 00:15 on Wikidata.

Peachey88 triaged this task as Unbreak Now! priority. Nov 11 2016, 7:33 AM
Peachey88 edited projects, added SRE; removed WMF-General-or-Unknown.

Log captured with varnishlog -n frontend -g request -q 'RespStatus eq 503':

*   << Request  >> 629660955 
-   Begin          req 629660954 rxreq
-   Timestamp      Start: 1478859911.369584 0.000000 0.000000
-   Timestamp      Req: 1478859911.369584 0.000000 0.000000
-   ReqStart       10.20.0.178 40579
-   ReqMethod      POST
-   ReqURL         /w/api.php
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: fr.wikipedia.org
-   ReqHeader      X-Client-IP: XXX.XXX.XXX.XXX
-   ReqHeader      X-Forwarded-For: XXX.XXX.XXX.XXX
-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      X-Connection-Properties: H2=1; SSR=1; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-GCM-SHA256; EC=prime256v1;
-   ReqHeader      Connection: close
-   ReqHeader      Content-Length: 1688
-   ReqHeader      user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40
-   ReqHeader      accept: application/json, text/javascript, */*; q=0.01
-   ReqHeader      accept-language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3
-   ReqHeader      accept-encoding: gzip, deflate
-   ReqHeader      dnt: 1
-   ReqHeader      content-type: application/x-www-form-urlencoded; charset=UTF-8
-   ReqHeader      x-requested-with: XMLHttpRequest
-   ReqHeader      referer: https://fr.wikipedia.org/w/index.php?title=Discussion:REDACTED
-   ReqHeader      cookie: WMF-Last-Access=11-Nov-2016; REDACTED
-   ReqUnset       X-Forwarded-For: XXX.XXX.XXX.XXX
-   ReqHeader      X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   VCL_call       RECV
-   ReqUnset       Host: fr.wikipedia.org
-   ReqHeader      Host: fr.wikipedia.org
-   VCL_acl        MATCH wikimedia_trust "10.0.0.0"/8
-   VCL_acl        MATCH local_host "10.20.0.178"
-   ReqUnset       X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqHeader      X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqUnset       X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqHeader      X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqUnset       X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqHeader      X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
-   ReqHeader      via-nginx: 1
-   ReqHeader      X-Trusted-Proxy: 
-   ReqUnset       X-Trusted-Proxy: 
-   ReqHeader      X-Carrier: 
-   ReqUnset       X-Carrier: 
-   Debug          "VSLP picked preferred backend  3 for key 513ef9da"
-   ReqHeader      X-WMF-LastStamp: 11-Nov-2016
-   VCL_return     pass
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       PASS
-   ReqHeader      X-CDIS: pass
-   VCL_return     fetch
-   Link           bereq 629660956 pass
-   FetchError     straight insufficient bytes
-   Timestamp      ReqBody: 1478859913.298887 1.929302 1.929302
-   Timestamp      Fetch: 1478859918.301781 6.932197 5.002895
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Backend fetch failed
-   RespHeader     Date: Fri, 11 Nov 2016 10:25:18 GMT
-   RespHeader     Server: Varnish
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     X-Varnish: 629660955
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish-v4
-   VCL_call       DELIVER
-   RespHeader     X-Cache-Int: cp3043 pass
-   RespHeader     X-Cache: cp3043 pass
-   RespUnset      X-Cache-Int: cp3043 pass
-   RespHeader     X-Cache-Status: bug
-   RespHeader     Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
-   RespHeader     X-Analytics: 
-   ReqHeader      X-NowDay: 11-Nov-2016
-   RespUnset      X-Analytics: 
-   RespHeader     X-Analytics: ;WMF-Last-Access=11-Nov-2016
-   RespUnset      X-Analytics: ;WMF-Last-Access=11-Nov-2016
-   RespHeader     X-Analytics: ;WMF-Last-Access=11-Nov-2016;https=1
-   RespUnset      X-Analytics: ;WMF-Last-Access=11-Nov-2016;https=1
-   RespHeader     X-Analytics: WMF-Last-Access=11-Nov-2016;https=1
-   RespHeader     X-Client-IP: XXX.XXX.XXX.XXX
-   VCL_return     deliver
-   Timestamp      Process: 1478859918.301840 6.932256 0.000058
-   RespHeader     Content-Length: 1556
-   Debug          "RES_MODE 2"
-   RespHeader     Connection: close
-   Timestamp      Resp: 1478859918.301864 6.932280 0.000024
-   ReqAcct        1719 0 1719 417 1556 1973
-   End            
**  << BeReq    >> 629660956 
--  Begin          bereq 629660955 pass
--  Timestamp      Start: 1478859911.369661 0.000000 0.000000
--  BereqMethod    POST
--  BereqURL       /w/api.php
--  BereqProtocol  HTTP/1.1
--  BereqHeader    X-Client-IP: XXX.XXX.XXX.XXX
--  BereqHeader    X-Forwarded-Proto: https
--  BereqHeader    X-Connection-Properties: H2=1; SSR=1; SSL=TLSv1.2; C=ECDHE-ECDSA-AES128-GCM-SHA256; EC=prime256v1;
--  BereqHeader    Content-Length: 1688
--  BereqHeader    user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40
--  BereqHeader    accept: application/json, text/javascript, */*; q=0.01
--  BereqHeader    accept-language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3
--  BereqHeader    accept-encoding: gzip, deflate
--  BereqHeader    dnt: 1
--  BereqHeader    content-type: application/x-www-form-urlencoded; charset=UTF-8
--  BereqHeader    x-requested-with: XMLHttpRequest
--  BereqHeader    referer: https://fr.wikipedia.org/w/index.php?title=Discussion:REDACTED
--  BereqHeader    cookie: WMF-Last-Access=11-Nov-2016; REDACTED
--  BereqHeader    Host: fr.wikipedia.org
--  BereqHeader    X-Forwarded-For: XXX.XXX.XXX.XXX, 10.20.0.178
--  BereqHeader    via-nginx: 1
--  BereqHeader    X-WMF-LastStamp: 11-Nov-2016
--  BereqHeader    X-CDIS: pass
--  BereqHeader    X-Varnish: 629660956
--  VCL_call       BACKEND_FETCH
--  VCL_return     fetch
--  BackendOpen    666 root:5955c50c-10b1-4d65-9e87-fb4a9c468ebf.be_cp3031_esams_wmnet 10.20.0.166 3128 10.20.0.178 30430
--  BackendStart   10.20.0.166 3128
--  FetchError     req.body read error: 0 (Success)
--  FetchError     backend write error: 0 (Success)
--  Timestamp      Bereq: 1478859913.298904 1.929243 1.929243
--  BackendClose   666 root:5955c50c-10b1-4d65-9e87-fb4a9c468ebf.be_cp3031_esams_wmnet
--  Timestamp      Beresp: 1478859918.301706 6.932045 5.002802
--  Timestamp      Error: 1478859918.301711 6.932050 0.000005
--  BerespProtocol HTTP/1.1
--  BerespStatus   503
--  BerespReason   Service Unavailable
--  BerespReason   Backend fetch failed
--  BerespHeader   Date: Fri, 11 Nov 2016 10:25:18 GMT
--  BerespHeader   Server: Varnish
--  VCL_call       BACKEND_ERROR
--  BerespHeader   Content-Type: text/html; charset=utf-8
--  VCL_return     deliver
--  Storage        malloc Transient
--  ObjProtocol    HTTP/1.1
--  ObjStatus      503
--  ObjReason      Backend fetch failed
--  ObjHeader      Date: Fri, 11 Nov 2016 10:25:18 GMT
--  ObjHeader      Server: Varnish
--  ObjHeader      Content-Type: text/html; charset=utf-8
--  Length         1556
--  BereqAcct      1793 0 1793 0 0 0
--  End
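For reference, Varnish Timestamp records follow the format "label: absolute-time since-request-start since-last-step", so the per-step timings in captures like the one above can be pulled out mechanically. A rough sketch (the input filename is made up for illustration):

#!/usr/bin/env python3
# Extract per-step timings from varnishlog Timestamp records.
import re
import sys

TS = re.compile(r'Timestamp\s+(\w+): ([\d.]+) ([\d.]+) ([\d.]+)')

for line in open(sys.argv[1]):   # e.g. varnishlog-503.txt
    m = TS.search(line)
    if m:
        label, _, since_start, since_last = m.groups()
        print("%-8s  +%ss since start, %ss in this step" % (label, since_start, since_last))

Run against the capture above, this shows the frontend spending ~1.93s in ReqBody and then ~5.00s in Fetch before giving up with the 503.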

Mentioned in SAL (#wikimedia-operations) [2016-11-11T11:10:15Z] <ema> cp3043 repooled with gethdr_extrachance=100 (T150503)

From the httpd point of view:

There are a lot of 503s logged for GET requests for /w/api.php like the following:

2016-11-11T12:07:44	59999926	10.64.0.103	proxy-server/503	50065	GET	http://en.wikipedia.org/w/api.php [..] text/html https://en.wikipedia.org/w/index.php?title=[REDACTED]&action=submit

With a corresponding error log:

[proxy_fcgi:error]  (70008)Partial results are valid but processing is incomplete: AH01075: Error dispatching request to : (reading input brigade), referer: https://en.wikipedia.org/w/api.php?title=[REDACTED]action=submit

Some remarks:

  • 59999926 is the time taken for the request in microseconds, which is almost always ~60s.
  • 50065 is the length of the 503 page.
  • The field after text/html is the HTTP Referer.
  • If I grep for the same combination of URI, Referer, etc., I get only HTTP 200 responses related to POSTs, so Apache might fail to log the proper HTTP status code when handling this error (see the sketch below).
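As a quick way to gauge how many of these there are, a sketch that scans an apache2log-style access log for the pattern above (it assumes tab-separated fields laid out as in the sample line; the filename is illustrative):

#!/usr/bin/env python3
# Count long-running proxy-server/503 GETs to /w/api.php.
# Field layout per the sample line above:
# 0=timestamp 1=duration(us) 2=client IP 3=handler/status 4=size 5=method 6=URL ...
import sys

hits = 0
for line in open(sys.argv[1]):   # e.g. apache2.log
    f = line.rstrip('\n').split('\t')
    if (len(f) > 6 and f[3] == 'proxy-server/503' and f[5] == 'GET'
            and '/w/api.php' in f[6] and int(f[1]) > 55000000):  # ~60s in us
        hits += 1

print(hits, "suspicious 503s")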

From httpd's internal point of view:

  • Partial results are valid but processing is incomplete corresponds to APR_INCOMPLETE, and it seems to be returned by the dispatch function of mod_proxy_fcgi, precisely this bit:
rv = ap_get_brigade(r->input_filters, ib,
                    AP_MODE_READBYTES, APR_BLOCK_READ,
                    iobuf_size);
if (rv != APR_SUCCESS) {
    /* e.g. APR_INCOMPLETE while reading the request body */
    *err = "reading input brigade";
    *bad_request = 1;
    break;
}

Note that reading input brigade appears in the error log's output. This points at the httpd input filter chain, namely the stream of request data coming from Varnish for the request being handled.

We still don't know what causes it; a repro would definitely help :)

To get a precise idea of the impact: apache2log

We've been able to reproduce the bug on pinkunicorn by closing the connection before sending Content-Length bytes as follows:

#!/usr/bin/env python3

# Advertise a Content-Length larger than the body we actually send,
# then half-close the connection.

import socket

REPRO = True  # set to False to send the same request without the bogus Content-Length

dest_name = "127.0.0.1"
port = 80

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((dest_name, port))

if REPRO:
    # claim 256 body bytes; the body below is only ~30
    cl = "Content-Length: 256\r\n"
else:
    cl = ""

request = ("POST /w/api.php HTTP/1.1\r\n"
           "Host: en.wikipedia.org\r\n"
           "X-Forwarded-Proto: https\r\n"
           "X-Wikimedia-Debug: backend=mw1099.eqiad.wmnet\r\n"
           "%s"
           "\r\n"
           "apples=oranges&oranges=apples\r\n") % cl

s.send(request.encode("ascii"))
data = s.recv(1024)
print(data)

s.shutdown(socket.SHUT_WR)
s.close()
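With REPRO = True, the recv() should eventually come back with a 503: per the Varnish timestamps above that takes about 5 seconds through the cache layer, and around 60 seconds when hitting apache directly (see the timeout discussion below). With REPRO = False, the server should answer right away, since no body bytes are outstanding.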

Even simpler:

curl -d "Hola!" --header "Content-Length: 120" --header "Host: en.wikipedia.org" localhost/w/api.php
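For completeness, the same mismatch from Python's standard library (a sketch against the same local endpoint; the 70-second timeout is just a guess chosen to outlast the ~60s apache timeout discussed below):

#!/usr/bin/env python3
# Claim 120 body bytes but send only 5, like the curl one-liner above.
import http.client

conn = http.client.HTTPConnection("localhost", 80, timeout=70)
conn.request("POST", "/w/api.php", body="Hola!",
             headers={"Host": "en.wikipedia.org",
                      "Content-Length": "120"})   # deliberately wrong
resp = conn.getresponse()
print(resp.status, resp.reason)   # expect a 503 once the server gives up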

I checked the httpd trunk code with a quick PHP script to test; I am able to repro, but there the request ends up as an HTTP 400, logged correctly as a POST.

So, to be a bit more precise about what happens in apache:

mod_proxy_fcgi reads the request body in a loop; when it reaches the end of input according to the Content-Length (and that really is the end of input), it sets a flag to stop reading from stdin:

if (APR_BUCKET_IS_EOS(APR_BRIGADE_LAST(ib))) {
    last_stdin = 1;   /* end-of-stream bucket seen: the whole body was read */
}
...

if (last_stdin) {
    pfd.reqevents = APR_POLLIN; /* Done with input data */
...

If the last (EOS) bucket is never reached, it keeps trying to read from an input that has already finished, hence the error @elukey was referring to.

This only happens if the POST body is somehow shorter than the Content-Length. If it's longer, we probably won't see the same type of error (or any error at all).

Still, what is puzzling about apache is:

  1. It returns a 503 and not a 400, as I would expect per the HTTP standard.
  2. It logs a GET instead of a POST in the access log, after ~60 seconds (one of our timeouts in the apache config), while according to Varnish the issue shows up as soon as Varnish is done sending the body and hits its own timeout (the response time according to Varnish is usually ~5s).

We've set nginx's proxy_request_buffering back on: https://gerrit.wikimedia.org/r/#/c/321000/ and that seems to help. With request buffering on, nginx reads the whole client request body before proxying it, so a body that comes up short never reaches Varnish/apache mid-request.

@Joe I used git bisect and found that http://svn.apache.org/r1682544 fixed the issue (the fix is contained in 2.4.16).

Just FYI:
Changes smaller than about 6kB will be saved, but larger ones trip the error message every time.

@Marshallsumter we've deployed a configuration change on Friday evening that should have fixed the problem. Can you please confirm whether that is the case? Thanks!

Marostegui changed the task status from Open to Stalled. Nov 16 2016, 7:09 AM
Marostegui lowered the priority of this task from Unbreak Now! to High.

I've made several attempts to save changes well above 6kB and all have worked!!!

Another awesome job by the phabricator volunteers!!!

elukey claimed this task.

So we found several things to follow up on, probably not in this task:

  1. nginx proxy_request_buffering (already deployed via https://gerrit.wikimedia.org/r/#/c/321000/) - this effectively solved the problem, since nginx now buffers the whole request before proxying it.
  2. Varnish bug - https://github.com/varnishcache/varnish-cache/issues/2126
  3. Our version of httpd is missing http://svn.apache.org/r1682544, but since we have 1) I don't believe we need to patch it (even if it would be nice to have).

No other user report has been filed, so I believe we can close :)