How’d we do in our pursuit of operational excellence last month? Read on to find out!
There were 4 documented incidents last month. This is about average compared to the past five years (per Incident graphs).
2021-10-08 network provider
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. It was caused by a routing problem with one of several redundant network providers.
2021-10-22 eqiad networking
Impact: For ~40 minutes, clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in ~10 minutes.
2021-10-25 s3 db replica
Impact: For ~30 minutes, MediaWiki backends were slower than usual. For ~12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.
Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some of it was recovered from the redundant server, but the rest was lost because the redundant server had also been upgraded in the meantime and lost data in a similar fashion.
Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are the preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.
In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.
Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.
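The running total above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this report; the variable names are illustrative and do not correspond to any field in Phabricator or the spreadsheet.

```python
# Figures quoted in this report (variable names are illustrative only).
open_before = 298    # unresolved reports on the workboard before October
resolved_old = 20    # older reports resolved during October
new_reported = 49    # new production errors reported in October
new_resolved = 26    # new reports resolved within October

# New reports that remain open and carry forward to the next month.
carried_over = new_reported - new_resolved

# Unresolved total now on the workboard.
open_after = open_before - resolved_old + carried_over

print(f"Carried over from October: {carried_over}")  # 23
print(f"Unresolved on the board:   {open_after}")    # 301
```

The net change for the month is +3: 23 reports were added to the backlog and 20 were cleared from it.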
For the month-over-month numbers, refer to the spreadsheet data.
Take a look at the workboard and look for tasks that could use your help.
Issues carried over from recent months:
| Month | Status |
|---|---|
| Apr 2021 | 9 of 42 issues left. |
| May 2021 | 16 of 54 issues left. |
| Jun 2021 | 9 of 26 issues left. |
| Jul 2021 | 12 of 31 issues left. |
| Aug 2021 | 12 of 46 issues left. |
| Sep 2021 | 11 of 24 issues left. |
| Oct 2021 | 23 of 49 new issues are carried forward. |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof