There has been an increase in varnish-be fetch failures in esams text lately, correlating in time with T226048. The recently introduced FetchError monitoring (T224994) might help diagnose the issue.
See the fetcherror pie graph on logstash for a breakdown of the different errors: https://logstash.wikimedia.org/goto/f14da21012a25060b1f63685028ecc7b
{F29629980}
Breakdown for the last 24 hours, at the time of writing:
| **reason** | **count** | **percentage** |
| Resource temporarily unavailable - straight insufficient bytes | 13279 | 31.4% |
| http format error | 13062 | 30.9% |
| Could not get storage | 10341 | 24.5% |
| HTC status 3 | 3113 | 7.4% |
| chunked read err | 2476 | 5.9% |
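
For reference, the percentages in the table match each reason's share of the five listed counts (42271 in total), rounded to one decimal place. A minimal sketch to recompute them from the counts; the variable names are illustrative and not taken from the logstash dashboard:

```python
# Counts copied from the table above (last 24h at the time of writing).
counts = {
    "Resource temporarily unavailable - straight insufficient bytes": 13279,
    "http format error": 13062,
    "Could not get storage": 10341,
    "HTC status 3": 3113,
    "chunked read err": 2476,
}

# Total of the five listed reasons only.
total = sum(counts.values())

# Share of each reason, as a percentage rounded to one decimal place.
shares = {reason: round(100 * n / total, 1) for reason, n in counts.items()}

for reason, pct in shares.items():
    print(f"{reason}: {counts[reason]} ({pct}%)")
```

Because each share is rounded independently, the rounded values do not have to sum to exactly 100%.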
"Resource temporarily unavailable" is EAGAIN, and it's supposed to happen when a read() is attempted on a non-blocking socket. Further investigation needed on this one.
"HTC status 3" is probably due to https://github.com/varnishcache/varnish-cache/issues/1772. Once we patch Varnish and upgrade all hosts to the new version, this error should become "Timed out reusing backend connection".
"http format error" is interesting, it looks like some garbage (sic, that's how varnish calls it) is occasionally returned by the appservers. See https://logstash.wikimedia.org/goto/7338143bb141cf85845385c77b52a944 and https://logstash.wikimedia.org/goto/4bfb870a2c82886a21ababfb898459b5