
varnish text on beta is unreachable / stuck
Closed, Resolved · Public

Description

$ curl --verbose  http://en.wikipedia.beta.wmflabs.org
* Rebuilt URL to: http://en.wikipedia.beta.wmflabs.org/
*   Trying 208.80.155.135...
* Connected to en.wikipedia.beta.wmflabs.org (208.80.155.135) port 80 (#0)
> GET / HTTP/1.1
> Host: en.wikipedia.beta.wmflabs.org
> User-Agent: curl/7.43.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host en.wikipedia.beta.wmflabs.org left intact
curl: (52) Empty reply from server

Event Timeline

hashar triaged this task as Unbreak Now! priority. May 4 2016, 8:38 AM
hashar added subscribers: Gehel, dcausse.

@dcausse @Gehel looking at it as well.

The varnish frontend shows high CPU usage. David captured lsof output (the full dump is linked below).

It shows a lot of connections stuck in CLOSE_WAIT.

The CLOSE_WAIT connections are mostly:

deployment-cache-upload-04:80 -> deployment-restbase02:X
127.0.0.1:X->127.0.0.1:3128

Might just be a consequence.

Full lsof dump: deployment-cache-text04:/home/dcausse/lsof_all_T134346.gz (37 MB compressed).
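
For reference, a rough way to count and bucket those CLOSE_WAIT sockets without pulling a full lsof dump (a sketch; assumes net-tools and lsof are available on the host):

$ netstat -tn | awk '$6 == "CLOSE_WAIT"' | wc -l                      # total CLOSE_WAIT sockets
$ netstat -tn | awk '$6 == "CLOSE_WAIT" {print $4}' | sort | uniq -c  # bucketed by local address:port
$ lsof -nP -iTCP -sTCP:CLOSE_WAIT                                     # same view via lsof, with owning processes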

A random varnishlog sample from the backend:

varnishlog -n deployment-cache-text04 
    0 Debug        - "VCL_error(200, OK)"
   17 SessionOpen  c 127.0.0.1 8402 :3128
   17 ReqStart     c 127.0.0.1 8402 592352331
   17 RxRequest    c GET
   17 RxURL        c /check
   17 RxProtocol   c HTTP/1.1
   17 RxHeader     c Host: varnishcheck
   17 RxHeader     c User-agent: Varnish backend check
   17 RxHeader     c Connection: close
   17 VCL_call     c recv
   17 VCL_acl      c MATCH wikimedia_nets 127.0.0.0/8
   17 VCL_return   c error
   17 VCL_call     c hash
   17 Hash         c /check
   17 Hash         c varnishcheck
   17 VCL_return   c hash
   17 VCL_call     c error deliver
   17 VCL_call     c deliver deliver
   17 TxProtocol   c HTTP/1.1
   17 TxStatus     c 200
   17 TxResponse   c OK
   17 TxHeader     c Server: Varnish
   17 TxHeader     c Accept-Ranges: bytes
   17 TxHeader     c Date: Wed, 04 May 2016 08:44:55 GMT
   17 TxHeader     c X-Varnish: 592352331
   17 TxHeader     c Age: 0
   17 TxHeader     c Via: 1.1 varnish
   17 TxHeader     c Connection: close
   17 TxHeader     c X-Cache: deployment-cache-text04 int(0)
   17 Length       c 0
   17 ReqEnd       c 592352331 1462351495.851732254 1462351495.851871252 0.000077009 0.000089645 0.000049353
   17 SessionClose c error
   17 StatSess     c 127.0.0.1 8402 0 1 1 0 0 0 203 0
# varnishlog -n frontend
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000848 0.000601 HTTP/1.1 200 OK
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000623 0.000607 HTTP/1.1 200 OK
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
 2975 SessionClose - dropped
 2975 StatSess     - - - 1462351596 0 0 0 0 0 0 0
    0 Backend_health - ipv4_127_0_0_1 Still healthy 4--X-RH 3 2 3 0.000419 0.000560 HTTP/1.1 200 OK
 2975 SessionClose - dropped

That is the varnish backend check though.
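
The same check can be reproduced by hand to confirm that the backend on :3128 still answers while the frontend on :80 does not (a sketch, to be run on deployment-cache-text04):

$ curl -v -H 'Host: varnishcheck' http://127.0.0.1:3128/check   # backend health URL from the log above, expect 200 OK
$ curl -v -m 5 http://127.0.0.1:80/                              # frontend, hangs or returns an empty reply while stuck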

The frontend process has a lot of threads (?):

# ps -p 26670 H|wc -l
1049

Each of which looks like:

26670 ?        Sl     0:00 /usr/sbin/varnishd -P /run/varnish-frontend.pid -a :80 \
 -T 127.0.0.1:6082 -f /etc/varnish/wikimedia_text-frontend.vcl \
-w 250,500,120 -p shm_reclen=2048 -p user=varnish \
-p shm_workspace=16384 -p thread_pool_add_delay=1 \
-S /etc/varnish/secret -s malloc,1G \
-p thread_pool_stack=131072 -p listen_depth=4096 \
-p session_max=200000 -p vcc_err_unref=off \
-p nuke_limit=300 -p http_req_size=24576 \
-p default_ttl=2592000 -n frontend \
-p cc_command=exec cc -fpic -shared -Wl,-x -L/usr/local/lib/ -o %o %s -lGeoIP
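
For context, the `-w 250,500,120` argument sets thread_pool_min, thread_pool_max and thread_pool_timeout per pool; with the default two thread pools that caps worker threads at roughly 1000, which lines up with the ~1049 threads seen above once non-worker threads are counted. When all workers are busy and the session queue overflows, new sessions get dropped, which would explain the `SessionClose - dropped` lines in the frontend log. A quick way to watch those counters (a sketch, assuming Varnish 3 counter names):

$ varnishstat -n frontend -1 -f n_wrk,n_wrk_queued,n_wrk_drop   # worker threads, queued work, dropped work requests
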
hashar renamed this task from "whole beta cluster is unreachable" to "varnish text on beta is unreachable / stuck". May 4 2016, 8:56 AM
hashar added projects: Traffic, Varnish.

The flow of traffic comes from deployment-changeprop, which hits restbase02 and then the text cache frontend.

Marko has stopped the change prop process. The varnish frontend is still stuck though, with 3941 CLOSE_WAIT connections and 1049 threads.

Mentioned in SAL [2016-05-04T09:27:06Z] <hashar> deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
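
For the record, the recovery amounts to restarting the frontend instance and checking that it accepts requests again (a sketch):

$ sudo systemctl stop varnish-frontend.service
$ sudo systemctl start varnish-frontend.service
$ curl -s -o /dev/null -w '%{http_code}\n' http://en.wikipedia.beta.wmflabs.org/   # expect a 200/301 instead of an empty reply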

hashar lowered the priority of this task from Unbreak Now! to Medium. May 4 2016, 9:30 AM
hashar added a subscriber: BBlack.

So the root cause is apparently change prop sending too many updates, which ends up overloading the varnish text frontend. At that point it ends up with too many subprocesses, all stuck somehow.

Leaving the task open in case @BBlack has any interest in investigating, since this might hit production.

> So the root cause is apparently change prop sending too many updates, which ends up overloading the varnish text frontend. At that point it ends up with too many subprocesses, all stuck somehow.

The problem is that change prop calls RESTBase a lot, which in turn contacts the MW API, and RESTBase goes through Varnish for that. I will change the config so that it hits one of the two MW servers in Beta directly.
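
To illustrate what that change amounts to: instead of going through the text cache, RESTBase would talk straight to an app server, keeping the Host header so MediaWiki routes the request to the right wiki. The app server hostname below is a placeholder; the actual target is set in the patch that follows.

# current path: restbase02 -> varnish text -> MW API
$ curl -s 'http://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&format=json'
# direct path, bypassing varnish (deployment-mediawiki01 is a placeholder hostname)
$ curl -s -H 'Host: en.wikipedia.beta.wmflabs.org' 'http://deployment-mediawiki01/w/api.php?action=query&meta=siteinfo&format=json'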

Change 286803 had a related patch set uploaded (by Mobrovac):
Beta: RESTBase: Contact the MW API directly

https://gerrit.wikimedia.org/r/286803

Change 286803 merged by Jcrespo:
Beta: RESTBase: Contact the MW API directly

https://gerrit.wikimedia.org/r/286803

hashar claimed this task.

The frontend varnish on deployment-cache-text04 has ~500 threads, but at least there are no more leaked file descriptors. It seems to be running just fine since RESTBase/changeprop on beta has been made to reach the app servers directly.
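
A quick way to keep an eye on it, using the pid file from the command line above (a sketch; thread and socket counts are what was abnormal during the incident):

$ ps -p "$(cat /run/varnish-frontend.pid)" H | wc -l              # frontend thread count (~500 now vs ~1049 when stuck)
$ netstat -tn | awk '$6 == "CLOSE_WAIT" && $4 ~ /:80$/' | wc -l   # leaked client sockets on :80, should stay near zero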