
2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users
Open, Medium, Public


On 2021-10-07, from approximately 16:00 UTC to 17:00 UTC, internal issues with one of our network providers caused many users to see our sites as inaccessible. Traceroutes from the affected users' perspectives showed a routing loop within the provider's network.
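For readers unfamiliar with what a routing loop looks like in traceroute output, here is a minimal sketch of how one might be spotted programmatically. It is illustrative only and is not the tooling used during this incident: the function name `find_routing_loop` is hypothetical, the hop list is assumed to be already parsed from traceroute output (with `None` for hops that timed out), and the addresses are documentation-range placeholders rather than real data from the event.

```python
from typing import Dict, List, Optional, Sequence


def find_routing_loop(hops: Sequence[Optional[str]]) -> Optional[List[str]]:
    """Return the repeating hop cycle if the path revisits a router, else None.

    `hops` is the ordered list of responding addresses from a traceroute,
    with None for hops that did not answer. An address that reappears with
    at least one other hop in between is treated as evidence of a loop.
    """
    last_seen: Dict[str, int] = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue  # timeouts carry no information about the path
        if hop in last_seen and index - last_seen[hop] > 1:
            start = last_seen[hop]
            # The cycle is everything between the two sightings of `hop`.
            return [h for h in hops[start:index] if h is not None]
        last_seen[hop] = index
    return None


# Hypothetical example: packets bounce between two routers inside one network.
looping_path = [
    "192.0.2.1",
    "198.51.100.1",
    "198.51.100.2",
    "198.51.100.1",
    "198.51.100.2",
]
healthy_path = ["192.0.2.1", "198.51.100.1", None, "203.0.113.5"]

loop = find_routing_loop(looping_path)
no_loop = find_routing_loop(healthy_path)
```

The same address answering two consecutive hops is deliberately not counted as a loop, since that pattern can appear for other reasons; only a revisit after an intervening hop is flagged.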

After detection and diagnosis, at 16:19 UTC we stopped advertising our BGP routes to that provider for one of our edge sites (eqiad), which restored connectivity for affected users of eqiad (mostly in the Americas) a few minutes later.

Unfortunately, due to an interesting confluence of issues, we did not notice that some esams users (mostly in RU and KZ) were also having trouble until after the provider had seemingly resolved the issue themselves around 17:00 UTC.

Event Timeline

Change 727594 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

Change 727594 merged by CDanis:

[operations/puppet@production] NEL alert is empirically high-signal & should page SRE

colewhite triaged this task as Medium priority. Nov 8 2021, 10:36 PM