
Recurring varnish-be fetch failures in codfw
Closed, ResolvedPublic

Description

All varnish-be instances in codfw are recurrently having trouble fetching from their eqiad backends. From the graph it's clear that the issue affects all hosts on all clusters at the same time. This causes brief 503 spikes in codfw/ulsfo.

Last week @ayounsi checked whether he could find any signs of codfw<->eqiad network issues at the time of the spikes and he couldn't.

Event Timeline

ema created this task.Jul 10 2017, 10:54 AM
Restricted Application added a project: Operations.Jul 10 2017, 10:54 AM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority.Jul 10 2017, 10:55 AM
elukey added a subscriber: elukey.Jul 10 2017, 11:34 AM

Some of those spikes seem to match with OSPF flaps triggered by BFD.
The link between cr2-codfw and cr2-eqiad stays up with no packet loss, but for some reason BFD occasionally can't talk to the other side.
I'll monitor it a bit more to see if it matches and will open a ticket with the provider if necessary.

Mentioned in SAL (#wikimedia-operations) [2017-07-11T10:36:49Z] <XioNoX> bump BFD timer from 300 to 600 on the eqiad-codfw link for T170131

The BFD timer change didn't improve anything.
Got emails from Zayo saying they fixed an issue on the eqiad-codfw link (finished at around 02:14Z today). Will monitor for improvements today.

ema moved this task from Triage to Caching on the Traffic board.Jul 12 2017, 9:39 AM
ema added a comment.Jul 12 2017, 10:03 AM

The fetch errors last precisely 240 seconds on each machine. A quick look at our varnish-be settings seems to open two routes for investigation:

  • thread_pool_timeout * thread_pools
  • max_retries * some_60s_timeout

60s timeouts possibly involved:

between_bytes_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        We only wait for this many seconds between bytes received from
        the backend before giving up the fetch.
        A value of zero means never give up.
        VCL values, per backend or per backend request take precedence.
        This parameter does not apply to pipe'ed requests.
first_byte_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        Default timeout for receiving first byte from backend. We only
        wait for this many seconds for the first byte before giving up.
        A value of 0 means it will never time out. VCL can override
        this default value for each backend and backend request. This
        parameter does not apply to pipe.
idle_send_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        Time to wait with no data sent. If no data has been transmitted
        in this many seconds the session is closed.
        See setsockopt(2) under SO_SNDTIMEO for more information.

        NB: This parameter may take quite some time to take (full)
        effect.
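
The arithmetic behind the two candidate routes above can be sketched as follows. The factor values are illustrative assumptions (not confirmed settings from the codfw caches); the point is only that either product can plausibly yield the observed 240 seconds:

```python
import math

def matching_hypotheses(observed, hypotheses):
    """Return the names of the hypotheses whose factor product equals observed."""
    return [name for name, factors in hypotheses.items()
            if math.prod(factors) == observed]

# Factor values below are assumptions for illustration only.
observed = 240  # seconds each fetch-error episode lasts per host
hypotheses = {
    "thread_pool_timeout * thread_pools": (120, 2),  # assumed values
    "max_retries * some_60s_timeout": (4, 60),       # assumed max_retries of 4
}

print(matching_hypotheses(observed, hypotheses))
```

Either hypothesis would then need to be confirmed against the actual parameter values on the cp hosts (e.g. via varnishadm).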

@ayounsi mentioned that the network issues on the codfw-eqiad link usually last only a few seconds.

ayounsi moved this task from Backlog to Troubleshooting on the netops board.Jul 12 2017, 7:18 PM
This comment was removed by elukey.

Re-adding the comment after checking that all the cp hosts were pushing metrics to graphite (on some of them logster was failing with a lock-not-acquired error).

From https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-3h&to=now it seems that the varnishkafka producers are not affected by any error spikes, which might be worth noting.

Another note: some of those spikes don't match the OSPF flaps.

BBlack added a subscriber: BBlack.Oct 23 2017, 2:59 PM

Are these fetch-failure spikes still happening?

ema closed this task as Resolved.Oct 23 2017, 3:10 PM
ema claimed this task.