Investigate 2018-04-10 global traffic drop
Closed, Resolved, Public

Description

Update: Incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Routing


For about 54min from 22:46 - 23:40 on Tue 10 Apr, a significant amount of global traffic was unable to reach our data centres.

Varnish traffic (Grafana)

From 22:53 - 23:03 (10min), "Transmission cp10xx (eqiad)" was down 70% (dropped from 10 Gbit/s to 3 Gbit/s). The bottom lasted for about 5min (22:55 - 23:00).

From 22:46 - 23:24 (40min), "Transmission cp50x (eqsin)" was down 90% (dropped from 1.6 Gbit/s to 0.12 Gbit/s). The bottom lasted about 30min (22:50 - 23:20).

Edit count (Grafana)
Edit count (global) dropped from 800/min to 350/min (down 56%)
Varnish http (Grafana)
Requests (total) dropped from 11M/min to 8M/min (down 30%)
Asia page views (Grafana)
Page views (1:100 samples) dropped from 170/min to <10/min (down 90%). This is based on client-side Geo and indicates that traffic was really down (as opposed to re-routed).
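
As a rough cross-check of the figures above, the percentage drops and the outage duration can be recomputed from the quoted values. This is only a sketch: the helper names `percent_drop` and `minutes_between` are made up for illustration, and the inputs are the rounded numbers read off the Grafana graphs, so the results can differ from the quoted percentages by a few points.

```python
from datetime import datetime


def percent_drop(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# (before, after) pairs as quoted in the task description; rounded graph readings.
observations = {
    "cp10xx (eqiad) transmission, Gbit/s": (10, 3),
    "cp50x (eqsin) transmission, Gbit/s": (1.6, 0.12),
    "Edit count, edits/min": (800, 350),
    "Varnish requests, M/min": (11, 8),
}

for label, (before, after) in observations.items():
    print(f"{label}: down {percent_drop(before, after):.1f}%")

# Overall incident window quoted at the top of the description.
print("Overall window:", minutes_between("22:46", "23:40"), "min")
```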

Event Timeline

I guess this is why en.wikipedia.org and phabricator.wikimedia.org would not load for me? (though gerrit.wikimedia.org loaded for me)

They still don't load for me. I think this is about April 10, not April 11.

Krinkle renamed this task from Investigate 2018-04-11 global traffic drop to Investigate 2018-04-10 global traffic drop. Apr 10 2018, 11:19 PM
Krinkle updated the task description.

And now dead again. Affects www.wikipedia.org, commons, wikidata, wiktionary.

This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad and cr1-eqsin.
I applied the change to cr1-eqiad (redundant) and cr2-esams (middle of the night), both of which went fine, then to cr1-eqsin, then to cr2-eqiad.
First observations seem to indicate that it triggered a bug on cr2-eqiad, which at least withdrew the prefixes advertised from eqiad.
The change was rolled back 4min later.

cr1-eqsin had a different kind of problem: a partial routing daemon failure, where rpd was at 100% CPU and the router was dropping most of the traffic (alerting was flapping) while still advertising its prefixes.
Rolling back the change did not fix the issue, and neither did running a commit full (at least not promptly). The site has been depooled.
The router seems to be back in a healthy state now.

I will follow up with the vendor.

ema moved this task from Backlog to Network on the Traffic board.

Change in observed performance due to depooling of Singapore:

Synthetic tests (from AWS Mumbai): https://grafana.wikimedia.org/dashboard/db/mumbai-webpagetest?orgId=1&from=1523361600000&to=1523448000000
Real-user data from India: https://graphite.wikimedia.org/render/?width=2154&height=1308&_salt=1523451743.114&target=removeAboveValue(frontend.navtiming_oversample.responseStart.by_country.IN.median%2C5000)&target=removeAboveValue(frontend.navtiming_oversample.domComplete.by_country.IN.median%2C5000)&target=frontend.navtiming_oversample.mediaWikiLoadComplete.by_country.IN.median&from=12%3A00_20180410&until=12%3A00_20180411
Real user data from Singapore: https://graphite.wikimedia.org/render/?width=2154&height=1308&_salt=1523451810.779&from=12%3A00_20180410&until=12%3A00_20180411&target=removeAboveValue(frontend.navtiming_oversample.responseStart.by_country.SG.median%2C5000)&target=removeAboveValue(frontend.navtiming_oversample.domComplete.by_country.SG.median%2C5000)&target=removeAboveValue(frontend.navtiming_oversample.mediaWikiLoadComplete.by_country.SG.median%2C5000)
Real user data from Japan: https://graphite.wikimedia.org/render/?width=2154&height=1308&_salt=1523451810.779&from=12%3A00_20180410&until=12%3A00_20180411&target=removeAboveValue(frontend.navtiming_oversample.responseStart.by_country.JP.median%2C5000)&target=removeAboveValue(frontend.navtiming_oversample.domComplete.by_country.JP.median%2C5000)&target=removeAboveValue(frontend.navtiming_oversample.mediaWikiLoadComplete.by_country.JP.median%2C5000)
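
For anyone wanting to pull the same real-user graphs for another country, the Graphite render URLs above follow a regular pattern. A minimal sketch, assuming the `frontend.navtiming_oversample.<metric>.by_country.<CC>.median` naming holds for other country codes (not verified here). `navtiming_url` is a hypothetical helper, and it wraps all three metrics in `removeAboveValue` as the Singapore and Japan links do (the India link leaves `mediaWikiLoadComplete` uncapped). The `width`, `height` and `_salt` parameters are left out.

```python
from urllib.parse import urlencode

GRAPHITE_RENDER = "https://graphite.wikimedia.org/render/"

# Metric names taken from the links above.
METRICS = ["responseStart", "domComplete", "mediaWikiLoadComplete"]


def navtiming_url(country: str, start: str, end: str, cap_ms: int = 5000) -> str:
    """Build a render URL for the median navtiming metrics of one country.

    `country` is a two-letter code as used in the links above (IN, SG, JP);
    `start` and `end` use Graphite's HH:MM_YYYYMMDD time format.
    """
    targets = [
        "removeAboveValue("
        f"frontend.navtiming_oversample.{metric}.by_country.{country}.median,"
        f"{cap_ms})"
        for metric in METRICS
    ]
    # A list of pairs lets the `target` key repeat once per series.
    params = [("target", t) for t in targets]
    params += [("from", start), ("until", end)]
    return f"{GRAPHITE_RENDER}?{urlencode(params)}"


# Same 24h window as the links above.
print(navtiming_url("SG", "12:00_20180410", "12:00_20180411"))
```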

Change 425552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

Change 425552 merged by BBlack:
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

Do you want to rephrase this task's description to be about the incident? As a "let's investigate what happened" task as it is now, it should be closed, because we now know what happened.

Imarlier removed a project: Performance-Team.

Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, if any is needed.

Krinkle updated the task description.
Krinkle removed a project: Patch-For-Review.