Description

$ curl --verbose http://en.wikipedia.beta.wmflabs.org
* Rebuilt URL to: http://en.wikipedia.beta.wmflabs.org/
*   Trying 208.80.155.135...
* Connected to en.wikipedia.beta.wmflabs.org (208.80.155.135) port 80 (#0)
> GET / HTTP/1.1
> Host: en.wikipedia.beta.wmflabs.org
> User-Agent: curl/7.43.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host en.wikipedia.beta.wmflabs.org left intact
curl: (52) Empty reply from server

Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Beta: RESTBase: Contact the MW API directly | operations/puppet | production | +1 -1
Event Timeline
The CLOSE_WAIT connections are mostly:

deployment-cache-upload-04:80 -> deployment-restbase02:X
127.0.0.1:X -> 127.0.0.1:3128
Might just be a consequence.
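To see at a glance which peers the stuck sockets point at, something like the following works (a sketch; assumes net-tools' netstat is available on the cache host):

# count CLOSE_WAIT sockets per remote peer, busiest first
netstat -tn | awk '$6 == "CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn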
full lsof dump on deployment-cache-text04:/home/dcausse/lsof_all_T134346.gz (37M compressed)
Random varnishlog for the backend:
varnishlog -n deployment-cache-text04

    0 Debug        - "VCL_error(200, OK)"
   17 SessionOpen  c 127.0.0.1 8402 :3128
   17 ReqStart     c 127.0.0.1 8402 592352331
   17 RxRequest    c GET
   17 RxURL        c /check
   17 RxProtocol   c HTTP/1.1
   17 RxHeader     c Host: varnishcheck
   17 RxHeader     c User-agent: Varnish backend check
   17 RxHeader     c Connection: close
   17 VCL_call     c recv
   17 VCL_acl      c MATCH wikimedia_nets 127.0.0.0/8
   17 VCL_return   c error
   17 VCL_call     c hash
   17 Hash         c /check
   17 Hash         c varnishcheck
   17 VCL_return   c hash
   17 VCL_call     c error deliver
   17 VCL_call     c deliver deliver
   17 TxProtocol   c HTTP/1.1
   17 TxStatus     c 200
   17 TxResponse   c OK
   17 TxHeader     c Server: Varnish
   17 TxHeader     c Accept-Ranges: bytes
   17 TxHeader     c Date: Wed, 04 May 2016 08:44:55 GMT
   17 TxHeader     c X-Varnish: 592352331
   17 TxHeader     c Age: 0
   17 TxHeader     c Via: 1.1 varnish
   17 TxHeader     c Connection: close
   17 TxHeader     c X-Cache: deployment-cache-text04 int(0)
   17 Length       c 0
   17 ReqEnd       c 592352331 1462351495.851732254 1462351495.851871252 0.000077009 0.000089645 0.000049353
   17 SessionClose c error
   17 StatSess     c 127.0.0.1 8402 0 1 1 0 0 0 203 0
# varnishlog -n frontend

 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000848 0.000601 HTTP/1.1 200 OK
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000623 0.000607 HTTP/1.1 200 OK
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000419 0.000560 HTTP/1.1 200 OK
 2975 SessionClose - dropped
That is just the varnish backend health check, though.
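To cut that health-check noise and only watch real client traffic, varnishlog can match on a tag (a sketch; -c and -m tag:regex are standard varnishlog 3.x options, and the URL pattern is an assumption about what the interesting traffic looks like here):

# client-side records only, restricted to requests hitting the MW API path
varnishlog -n frontend -c -m 'RxURL:^/w/api.php'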
The frontend process has a lot of threads (?):
# ps -p 26670 H | wc -l
1049
Which look like:
26670 ?        Sl     0:00 /usr/sbin/varnishd -P /run/varnish-frontend.pid -a :80 \
    -T 127.0.0.1:6082 -f /etc/varnish/wikimedia_text-frontend.vcl \
    -w 250,500,120 -p shm_reclen=2048 -p user=varnish \
    -p shm_workspace=16384 -p thread_pool_add_delay=1 \
    -S /etc/varnish/secret -s malloc,1G \
    -p thread_pool_stack=131072 -p listen_depth=4096 \
    -p session_max=200000 -p vcc_err_unref=off \
    -p nuke_limit=300 -p http_req_size=24576 \
    -p default_ttl=2592000 -n frontend \
    -p cc_command=exec cc -fpic -shared -Wl,-x -L/usr/local/lib/ -o %o %s -lGeoIP
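For context, -w 250,500,120 sets the min and max worker threads per pool plus the idle timeout, and varnish defaults to two thread pools, so roughly 1000 worker threads means both pools are pinned at their 500-thread maximum. That can be cross-checked against varnish's own counters (a sketch; counter names as in varnish 3.x):

# dump the worker-thread counters once; n_wrk is the live thread count,
# n_wrk_queued indicates requests waiting for a free thread
varnishstat -n frontend -1 | egrep 'n_wrk|thread'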
The flow of traffic comes from deployment-changeprop, which hits restbase02, which in turn hits the text cache frontend.
Marko has stopped the change prop process. The varnish frontend is still stuck, though, with 3941 CLOSE_WAIT connections and 1049 threads.
Mentioned in SAL [2016-05-04T09:27:06Z] <hashar> deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
So the root cause is apparently change prop sending so many updates that it overloads the varnish text frontend, which then ends up with too many worker threads, all stuck somehow.
Leaving the task open in case @BBlack has any interest in investigating, since this might hit production as well?
The problem here is that change prop calls RESTBase a lot, which in turn contacts the MW API, and RESTBase goes through Varnish for that. I will change the config so that it hits one of the two MW servers in Beta directly.
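Before switching the config over, a direct hit can be sanity-checked with curl; the app server hostname below is a placeholder, not taken from this task:

# bypass the text cache and talk to a Beta MW app server directly
curl -s -H 'Host: en.wikipedia.beta.wmflabs.org' \
  'http://<mw-app-server>/w/api.php?action=query&meta=siteinfo&format=json'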
Change 286803 had a related patch set uploaded (by Mobrovac):
Beta: RESTBase: Contact the MW API directly
The frontend varnish on deployment-cache-text04 has ~500 threads, but at least there are no more leaked file descriptors. It seems to be running just fine since RESTBase/changeprop on Beta was made to reach the app servers directly.
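A quick way to confirm the leak stays gone (a sketch; the pid file path is the one from the varnishd command line above):

# open descriptors held by the frontend varnishd
lsof -p "$(cat /run/varnish-frontend.pid)" | wc -l
# stuck half-closed sockets, should stay near zero
netstat -tn | awk '$6 == "CLOSE_WAIT"' | wc -l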