How’d we do in our striving for operational excellence last month? Read on to find out!
Last month we experienced 2 (public) incidents. This is below the three-year median of 3 incidents a month (Incident graphs).
2022-04-06 esams network
Impact: For 30 minutes, wikis were slow or unreachable for a portion of the clients served by the Esams data center. Esams is one of two DCs primarily serving Europe, the Middle East, and Africa.
2022-04-26 cr2-eqord down
Impact: No external impact. Internally, for 2 hours we were unable to access our Eqord routers by any means. This was due to a fiber cut on a redundant link to Eqiad, which then coincided with planned vendor maintenance on the links to Ulsfo and Eqiad. See also Network design.
Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
Recently resolved incident follow-up:
During the month of April we reported 27 new production errors. Of these new errors, we resolved 14, and the remaining 13 are still open and have carried over to May.
Last month, the workboard totalled 298 unresolved error reports. Of these older reports that carried over from previous months, 16 were resolved. Most of these were reports from before 2019.
The new total, including some tasks for the current month of May, is 292. A slight decrease! (spreadsheet).
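As a rough sanity check, the tally above can be reproduced in a few lines of Python. This is only a sketch: the variable and function names are made up for illustration, the figures are copied from the text, and the workboard itself is not queried. The naive result comes out slightly higher than the reported total. The report does not itemize the difference, so other workboard changes (for example tasks being merged or closed elsewhere, which is an assumption) presumably account for it.

```python
# Figures copied from the report text for April 2022.
new_reports = 27        # new production errors reported in April
new_resolved = 14       # of those, resolved within April
previous_backlog = 298  # unresolved reports on the workboard last month
older_resolved = 16     # older reports resolved during April
reported_total = 292    # new total stated in the report


def carried_over(new: int, resolved: int) -> int:
    """New reports still open at the end of the month."""
    return new - resolved


def naive_backlog(previous: int, resolved_older: int, carried: int) -> int:
    """Backlog if only the changes named in the report are applied."""
    return previous - resolved_older + carried


carried = carried_over(new_reports, new_resolved)
expected = naive_backlog(previous_backlog, older_resolved, carried)

print(carried)                    # 13, matches the report
print(expected)                   # 295
print(expected - reported_total)  # 3 reports not itemized in the text
```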
Take a look at the workboard and look for tasks that could use your help.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof