
Recurring varnish-be fetch failures in codfw
Closed, ResolvedPublic

Description

All varnish-be instances in codfw are recurrently having trouble fetching from their eqiad backends. From the graph it's clear that the issue affects all hosts on all clusters at the same time. This causes brief 503 spikes in codfw/ulsfo.

Last week @ayounsi checked whether he could find any signs of codfw<->eqiad network issues at the time of the spikes and he couldn't.

Event Timeline

ema created this task.Jul 10 2017, 10:54 AM
Restricted Application added a project: Operations.Jul 10 2017, 10:54 AM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority.Jul 10 2017, 10:55 AM
elukey added a subscriber: elukey.Jul 10 2017, 11:34 AM

Some of those spikes seem to match with OSPF flaps triggered by BFD.
The link between cr2-codfw and cr2-eqiad stays up with no packet loss, but for some reason BFD occasionally can't talk to the other side.
I'll monitor it a bit more to see if it matches and will open a ticket with the provider if necessary.

Mentioned in SAL (#wikimedia-operations) [2017-07-11T10:36:49Z] <XioNoX> bump BFD timer from 300 to 600 on the eqiad-codfw link for T170131

The BFD timer change didn't improve anything.
Got emails from Zayo saying they fixed an issue on the eqiad-codfw link (finished at around 02:14Z today). Will monitor for improvements today.

ema moved this task from Triage to Caching on the Traffic board.Jul 12 2017, 9:39 AM
ema added a comment.Jul 12 2017, 10:03 AM

The fetch errors last precisely 240 seconds on each machine. A quick look at our varnish-be settings seems to open two routes for investigation:

  • thread_pool_timeout * thread_pools
  • max_retries * some_60s_timeout

60s timeouts possibly involved:

between_bytes_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        We only wait for this many seconds between bytes received from
        the backend before giving up the fetch.
        A value of zero means never give up.
        VCL values, per backend or per backend request take precedence.
        This parameter does not apply to pipe'ed requests.
first_byte_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        Default timeout for receiving first byte from backend. We only
        wait for this many seconds for the first byte before giving up.
        A value of 0 means it will never time out. VCL can override
        this default value for each backend and backend request. This
        parameter does not apply to pipe.
idle_send_timeout
        Value is: 60.000 [seconds] (default)
        Minimum is: 0.000

        Time to wait with no data sent. If no data has been transmitted
        in this many seconds the session is closed.
        See setsockopt(2) under SO_SNDTIMEO for more information.

        NB: This parameter may take quite some time to take (full)
        effect.
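
The arithmetic behind the two candidate routes above can be sketched as follows. The factor values are illustrative assumptions (not confirmed settings from the codfw caches); the point is only that either product can plausibly yield the observed 240 seconds:

```python
import math

def matching_hypotheses(observed, hypotheses):
    """Return the names of the hypotheses whose factor product equals observed."""
    return [name for name, factors in hypotheses.items()
            if math.prod(factors) == observed]

# Factor values below are assumptions for illustration only.
observed = 240  # seconds each fetch-error episode lasts per host
hypotheses = {
    "thread_pool_timeout * thread_pools": (120, 2),  # assumed values
    "max_retries * some_60s_timeout": (4, 60),       # assumed max_retries of 4
}

print(matching_hypotheses(observed, hypotheses))
```

Either hypothesis would then need to be confirmed against the actual parameter values on the cp hosts (e.g. via varnishadm).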

@ayounsi mentioned that the network issues on the codfw-eqiad link usually last only a few seconds.

ayounsi moved this task from Backlog to Troubleshooting on the netops board.Jul 12 2017, 7:18 PM
This comment was removed by elukey.

Re-adding the comment after checking that all the cp hosts were pushing metrics to graphite (on some of them logster was failing with a lock-not-acquired error).

From https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-3h&to=now it seems that the varnishkafka producers are not affected by any error spikes, which might be worth noting.

Another note: some of those spikes don't match the OSPF flaps.

BBlack added a subscriber: BBlack.Oct 23 2017, 2:59 PM

Are these fetch-failure spikes still happening?

ema closed this task as Resolved.Oct 23 2017, 3:10 PM
ema claimed this task.