Page MenuHomePhabricator

Investigate esams text varnish backend fetch failures
Open, NormalPublic

Description

There has been an increase in varnish-be fetch failures in esams text lately, correlating timewise with T226048. FetchError monitoring, recently introduced, might help diagnosing the issue: T224994.

See the kibana dashboard https://logstash.wikimedia.org/goto/caa7c533400fa99e5a4b0516ee0298a6 as well as the fetcherror pie graph for a breakdown of the different errors: https://logstash.wikimedia.org/goto/f14da21012a25060b1f63685028ecc7b

Last 24h at the time of this writing:

reasoncountpercentage
Resource temporarily unavailable - straight insufficient bytes1327931.4%
http format error1306230.9%
Could not get storage1034124.5%
HTC status 331137.4%
chunked read err24765.9%

"Resource temporarily unavailable" is EAGAIN, and it might be caused by between_bytes_timeout or similar.

$ sudo varnishtest -v bin/varnishtest/tests/b00022.vtc 2>&1 | grep FetchError
**** v1    1.5 vsl|       1002 FetchError      b Resource temporarily unavailable
**** v1    1.5 vsl|       1002 FetchError      b eof socket fail

"HTC status 3" is probably due to https://github.com/varnishcache/varnish-cache/issues/1772. Once we patch Varnish and upgrade all hosts to the new version, this error should become "Timed out reusing backend connection".

"http format error" is interesting, it looks like some garbage (sic, that's what varnish calls it) is occasionally returned by the appservers. See https://logstash.wikimedia.org/goto/7338143bb141cf85845385c77b52a944 and https://logstash.wikimedia.org/goto/4bfb870a2c82886a21ababfb898459b5

Event Timeline

ema created this task.Jun 24 2019, 9:49 AM
Restricted Application added a project: Operations. · View Herald TranscriptJun 24 2019, 9:49 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
This comment was removed by Gilles.
ema moved this task from Triage to Caching on the Traffic board.Jun 24 2019, 12:49 PM
ema updated the task description. (Show Details)Jun 24 2019, 1:03 PM
CDanis added a subscriber: CDanis.Jun 24 2019, 1:04 PM
ema updated the task description. (Show Details)Jun 24 2019, 1:20 PM

Change 518064 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

ema updated the task description. (Show Details)Jun 24 2019, 1:27 PM
ema updated the task description. (Show Details)Jun 24 2019, 1:36 PM

Change 518064 merged by Ema:
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

ema updated the task description. (Show Details)Jun 24 2019, 2:01 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-24T14:01:57Z] <ema> cp3032: upgrade varnish to 5.1.3-1wm11 T226375

Change 519056 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519056 merged by Ema:
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519189 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Change 519189 merged by Ema:
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp3043.esams.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906261419_ema_252538.log.

Completed auto-reimage of hosts:

['cp3043.esams.wmnet']

and were ALL successful.

Zache added a subscriber: Zache.Jun 27 2019, 3:45 AM