
Investigate 2018-04-10 global traffic drop
Closed, Resolved (Public)


Update: Incident report at

For about 54min from 22:46 - 23:40 on Tue 10 Apr, a significant amount of global traffic was unable to reach our data centres.

Varnish traffic (Grafana)

From 22:53 - 23:03 (10min), "Transmission cp10xx (eqiad)" was down 70% (dropped from 10 GBit/s to 3 GBit/s). The bottom lasted for about 5min (22:55 - 23:00).

From 22:46 - 23:24 (about 40min), "Transmission cp50x (eqsin)" was down 90% (dropped from 1.6 GBit/s to 0.12 GBit/s). The bottom lasted about 30min (22:50 - 23:20).
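The window lengths quoted above can be double-checked with a few lines of Python. This is only a sketch: `minutes_between` is a hypothetical helper name, and it assumes that all times are on the same clock and that an end time earlier than the start time means the window crossed midnight.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Return the number of minutes from start to end, both given as HH:MM.

    If end is earlier than start, the window is assumed to cross midnight.
    """
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


# Windows as quoted in the description.
windows = {
    "overall outage": ("22:46", "23:40"),
    "cp10xx (eqiad) drop": ("22:53", "23:03"),
    "cp10xx (eqiad) bottom": ("22:55", "23:00"),
    "cp50x (eqsin) drop": ("22:46", "23:24"),
}

for name, (start, end) in windows.items():
    print(f"{name}: {minutes_between(start, end)} min")
```

The eqsin window works out to 38 minutes, which the description rounds to 40min.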

Edit count (Grafana)
Edit count (global) dropped from 800/min to 350/min (down 56%)
Varnish http (Grafana)
Requests (total) dropped from 11M/min to 8M/min (down 30%)
Asia page views (Grafana)
Page views (1:100 samples) dropped from 170/min to <10/min (down 90%). This is based on client-side Geo and indicates that traffic was really down (as opposed to re-routed).
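As a rough cross-check of the percentages quoted above, the relative drop can be recomputed from the before/after readings. This is a minimal sketch: the helper name `percent_drop` is made up for illustration, and the inputs are the approximate values read off the graphs, so the results are only as precise as those readings.

```python
def percent_drop(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Approximate before/after values as quoted in the task description.
readings = {
    "cp10xx (eqiad) transmission, GBit/s": (10, 3),
    "cp50x (eqsin) transmission, GBit/s": (1.6, 0.12),
    "Edit count (global), per min": (800, 350),
    "Varnish requests (total), M/min": (11, 8),
}

for name, (before, after) in readings.items():
    print(f"{name}: down {percent_drop(before, after):.1f}%")
```

The page view figure is left out because its "after" value is only given as an upper bound (<10/min).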


Event Timeline

I guess this is why and would not load for me? (though loaded for me)

They still don't load for me. I think this is about April 10, not April 11.

Krinkle renamed this task from Investigate 2018-04-11 global traffic drop to Investigate 2018-04-10 global traffic drop. Apr 10 2018, 11:19 PM
Krinkle updated the task description.

And now dead again. Affects Commons, Wikidata, Wiktionary.

This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad and cr1-eqsin.
I applied the change to cr1-eqiad (redundant) and cr2-esams (middle of the night), which went fine, then to cr1-eqsin, then cr2-eqiad.
First observations seem to indicate that it triggered a bug on cr2-eqiad and at least withdrew the prefixes advertised from eqiad.
The change was rolled back 4min later.

cr1-eqsin was a different kind of partial routing daemon failure, where rpd was at 100% CPU and the router was dropping most of the traffic (alerting was flapping), but still advertising its prefixes.
Rolling back the change did not fix the issue, and neither did running a commit full (at least not promptly). The site has been depooled.
The router seems to be back to a healthy state now.

I will follow up with the vendor.

ema moved this task from Backlog to Network on the Traffic board.

Change in observed performance due to depooling of Singapore:

Synthetic tests (from AWS Mumbai):
Real-user data from India:
Real user data from Singapore:
Real user data from Japan:

Change 425552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert "Depolling eqsin due to router issue"

Change 425552 merged by BBlack:
[operations/dns@master] Revert "Depolling eqsin due to router issue"

Do you want to rephrase this task's description to be about the incident? As a "let's investigate what happened" task as it is now, it should be closed, because we now know what happened.

Imarlier removed a project: Performance-Team.

Brandon - No further action for Performance on this. I'm assigning to you to close out or for further investigation, if any is needed.

Krinkle updated the task description.
Krinkle removed a project: Patch-For-Review.