Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users
Closed, Resolved (Public)

Description

On 2021-10-07, from approximately 16:00 UTC to 17:00 UTC, internal issues with one of our network providers made our sites inaccessible to many users. Traceroutes from the affected users' perspectives showed a routing loop within the provider's network.
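For readers unfamiliar with how a routing loop shows up in a traceroute, here is a minimal sketch, not the tooling used during this incident. It assumes the traceroute output has already been parsed into an ordered list of responding hop addresses, with `None` for hops that did not answer. The function name `find_routing_loop` and the addresses (taken from documentation-only ranges) are hypothetical.

```python
from typing import List, Optional, Tuple


def find_routing_loop(hops: List[Optional[str]]) -> Optional[Tuple[int, int]]:
    """Return (first_index, repeat_index) for the first hop address that
    reappears after a different hop, or None if no such repeat exists.

    `hops` is the ordered list of responding addresses, one per TTL;
    None marks a hop that did not respond (shown as `*` by traceroute).
    """
    first_seen = {}   # address -> index where it was first observed
    previous = None   # last responding address, to ignore adjacent repeats
    for index, address in enumerate(hops):
        if address is None:
            continue
        # The same router answering again after a different hop means the
        # packet was sent back along the path: a loop.
        if address in first_seen and address != previous:
            return first_seen[address], index
        first_seen.setdefault(address, index)
        previous = address
    return None


# Hypothetical path: packets bounce between two routers inside one network.
looping_path = [
    "192.0.2.1",
    "198.51.100.1",
    "198.51.100.2",
    "198.51.100.1",
    "198.51.100.2",
]
healthy_path = ["192.0.2.1", None, "198.51.100.1", "203.0.113.5"]

loop = find_routing_loop(looping_path)
no_loop = find_routing_loop(healthy_path)
```

In a real loop the pattern typically repeats until the TTL limit is reached, so the destination never appears in the output.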

After detection and diagnosis, at 16:19 UTC we stopped advertising our BGP routes to that provider for one of our edge sites (eqiad), which restored connectivity for affected users of eqiad (mostly in the Americas) a few minutes later.

Unfortunately, due to an interesting confluence of issues, we did not notice that some esams users (mostly in RU and KZ) were also having trouble until after the provider had seemingly resolved the issue themselves around 17:00 UTC.

Event Timeline

Change 727594 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

https://gerrit.wikimedia.org/r/727594

Change 727594 merged by CDanis:

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

https://gerrit.wikimedia.org/r/727594

colewhite triaged this task as Medium priority. Nov 8 2021, 10:36 PM
lmata renamed this task from "2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users" to "Incident: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users". Apr 28 2022, 6:55 PM
lmata moved this task from In Progress to Scorecard Done on the SRE-OnFire (FY2021/2022-Q2) board.
akosiaris updated the task description.
akosiaris subscribed.

@CDanis, no other actionables showed up, and alerting has arguably been improved by the change above. I am going to tentatively resolve this, but feel free to reopen.