On 2021-10-07, from approximately 16:00 to 17:00 UTC, internal issues with one of our network providers made our sites inaccessible to many users. Traceroutes from the affected users' perspectives showed a routing loop within the provider's network.
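As an illustration of what a routing loop looks like in traceroute output, the following is a minimal sketch, not the tooling used during the incident. The function name `find_routing_loop` and the input format (an ordered list of responding hop addresses, with `None` for timed-out hops) are assumptions made for this example, and the addresses are reserved documentation ranges rather than real hops.

```python
from typing import Optional, Sequence


def find_routing_loop(hops: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first address that reappears after a different hop, else None.

    Hypothetical helper: `hops` is the ordered list of responding router
    addresses from one traceroute, with None for hops that timed out.
    """
    seen = set()
    previous = None
    for address in hops:
        # Skip timeouts, and skip a router answering for consecutive TTLs,
        # since neither indicates that the packet went backwards.
        if address is None or address == previous:
            continue
        if address in seen:
            # The packet returned to a router it had already passed through.
            return address
        seen.add(address)
        previous = address
    return None


# Illustrative path that bounces between two routers inside one network.
looping_path = [
    "192.0.2.1",
    "198.51.100.7",
    "198.51.100.9",
    "198.51.100.7",
    "198.51.100.9",
]
print(find_routing_loop(looping_path))
```

A path where the same address only repeats on adjacent hops, or around a timeout, is not reported as a loop.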
After detection and diagnosis, at 16:19 UTC we stopped advertising our BGP routes to that provider for one of our edge sites (eqiad), which restored connectivity for affected users of eqiad (mostly in the Americas) a few minutes later.
Unfortunately, due to an interesting confluence of issues, we did not notice that some esams users (mostly in RU and KZ) were also having trouble until after the provider had seemingly resolved the issue themselves around 17:00 UTC.
- write a short-form incident report on wikitech - https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-08_network_provider
- improve alerting
- placeholder to address any other actionables