Maniphest T191940

Investigate 2018-04-10 global traffic drop
Closed, Resolved (Public)

Assigned To

Authored By

	Krinkle
	Apr 10 2018, 11:13 PM

Description

Update: Incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing

For about 54min from 22:46 - 23:40 on Tue 10 Apr, a significant amount of global traffic was unable to reach our data centres.

Varnish traffic (Grafana)

From 22:53 - 23:03 (10min), "Transmission cp10xx (eqiad)" was down 70% (dropped from 10 GBit/s to 3 GBit/s). The bottom lasted for about 5min (22:55 - 23:00).

From 22:46 - 23:24 (40min), "Transmission cp50x (eqsin)" was down 90% (dropped from 1.6 Gbit/s to 0.12 Gbit/s). The bottom lasted about 30min (22:50 - 23:20).

Edit count (Grafana)

Edit count (global) dropped from 800/min to 350/min (down 56%)

Varnish http (Grafana)

Requests (total) dropped from 11M/min to 8M/min (down 30%)

Asia page views (Grafana)


Page views (1:100 samples) dropped from 170/min to <10/min (down 90%). This is based on client-side Geo and indicates that traffic was really down (as opposed to re-routed).
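
The percentages and durations quoted above can be re-derived from the before/after readings. The following is only a minimal sketch for checking the arithmetic; the helper names `percent_drop` and `minutes_between` are made up for illustration and are not part of any Grafana or Wikimedia tooling. The input numbers are the ones stated in this description.

```python
from datetime import datetime


def percent_drop(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# (label, before, after), copied from the figures in the description
readings = [
    ("eqiad transmission (Gbit/s)", 10.0, 3.0),
    ("eqsin transmission (Gbit/s)", 1.6, 0.12),
    ("edits per minute", 800, 350),
    ("Varnish requests (M/min)", 11, 8),
]

for label, before, after in readings:
    print(f"{label}: down {percent_drop(before, after):.1f}%")

print("overall window:", minutes_between("22:46", "23:40"), "min")
```

Computed this way, the drops come out at roughly 70%, 92%, 56% and 27%, so the percentages in the description are rounded approximations.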

Details

	Subject	Repo	Branch	Lines +/-
	Revert "Depolling eqsin due to router issue"	operations/dns	master	+0 -2

Related Objects

Mentioned Here: T191667: Juniper HA audit

Event Timeline

Krinkle created this task. Apr 10 2018, 11:13 PM

Restricted Application added a subscriber: Aklapper. Apr 10 2018, 11:13 PM

Krinkle added a project: Wikimedia-Incident. Apr 10 2018, 11:13 PM

I guess this is why en.wikipedia.org and phabricator.wikimedia.org would not load for me? (though gerrit.wikimedia.org loaded for me)

They still don't load for me. I think this is about April 10, not April 11.

Well, phabricator is fine.

Krinkle renamed this task from Investigate 2018-04-11 global traffic drop to Investigate 2018-04-10 global traffic drop. Apr 10 2018, 11:19 PM

Krinkle updated the task description.

Just started working again.

EddieGP subscribed. Apr 10 2018, 11:22 PM

And now dead again. Affects www.wikipedia.org, commons, wikidata, wiktionary.

Krinkle updated the task description. Apr 10 2018, 11:42 PM

Krinkle added a project: Performance-Team.

This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad and cr1-eqsin.
I applied the change to cr1-eqiad (redundant) and cr2-esams (middle of the night), which went fine, then to cr1-eqsin, then to cr2-eqiad.
First observations seem to indicate that it triggered a bug on cr2-eqiad and at least withdrew the prefixes advertised from eqiad.
The change was rolled back 4min later.

cr1-eqsin had a different kind of failure, a partial routing daemon failure: rpd was at 100% CPU and the router was dropping most of the traffic (alerting was flapping) while still advertising its prefixes.
Rolling back the change did not fix the issue, and neither did running a commit full (at least not promptly). The site has been depooled.
The router seems to be back to a healthy state now.

I will follow up with the vendor.

Marostegui subscribed. Apr 11 2018, 6:31 AM

MoritzMuehlenhoff subscribed. Apr 11 2018, 6:59 AM

Mholloway subscribed. Apr 11 2018, 7:04 AM

Ladsgroup subscribed. Apr 11 2018, 9:44 AM

Imarlier subscribed. Apr 11 2018, 12:28 PM

ema triaged this task as High priority. Apr 11 2018, 12:43 PM

ema moved this task from Backlog to Network on the Traffic board.

Change in observed performance due to depooling of Singapore:

Krinkle updated the task description. Apr 11 2018, 2:54 PM

Change 425552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

gerritbot added a project: Patch-For-Review. Apr 11 2018, 4:05 PM

Change 425552 merged by BBlack:
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing

Do you want to rephrase this task's description to be about the incident? As a "let's investigate what happened" task as it is now, it should be closed, because we now know what happened.

Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, if any is needed.

Krinkle moved this task from Active investigation to Active Situation on the Wikimedia-Incident board. Jul 18 2018, 7:20 PM

Krinkle closed this task as Resolved. Jul 19 2018, 7:00 AM

Krinkle updated the task description.

Krinkle removed a project: Patch-For-Review.

BBlack moved this task from Network to Done on the Traffic board. Oct 8 2021, 6:01 PM

	F16916408: Screen Shot 2018-04-11 at 03.08.35.png
	Apr 11 2018, 2:54 PM

	F16916406: Screen Shot 2018-04-11 at 03.08.20.png
	Apr 11 2018, 2:54 PM

	F16905407: Screen Shot 2018-04-11 at 00.38.53.png
	Apr 10 2018, 11:42 PM

	F16905333: Screen Shot 2018-04-11 at 00.12.57.png
	Apr 10 2018, 11:13 PM

	F16905322: Screen Shot 2018-04-11 at 00.07.31.png
	Apr 10 2018, 11:13 PM

	F16905327: Screen Shot 2018-04-11 at 00.10.04.png
	Apr 10 2018, 11:13 PM
