Production Excellence #43: April 2022
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

Last month we experienced 2 (public) incidents. This is below the three-year median of 3 incidents a month (Incident graphs).

2022-04-06 esams network
Impact: For 30 minutes, wikis were slow or unreachable for a portion of clients routed to the Esams data center. Esams is one of two DCs primarily serving Europe, the Middle East, and Africa.

2022-04-26 cr2-eqord down
Impact: No external impact. Internally, for 2 hours we were unable to access our Eqord routers by any means. This was due to a fiber cut on a redundant link to Eqiad, which then coincided with planned vendor maintenance on the links to Ulsfo and Eqiad. See also Network design.

[Image: proderr-incidents 2022-04.png]


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.

Recently resolved incident follow-up:

Reduce mysql grants for wikiadmin scripts
Filed in 2020 after the wikidata drop-table incident (details). Carried out over the last six months by Amir @Ladsgroup (SRE Data Persistence).

Improve reliability of Toolforge k8s cron jobs and Re-enable CronJobControllerV2
Filed earlier this week after a Toolforge incident and carried out by Taavi @Majavah.


Trends

During the month of April we reported 27 new production errors. Of these new errors, we resolved 14, and the remaining 13 are still open and have carried over to May.

Last month, the workboard totalled 298 unresolved error reports. Of these older reports that carried over from previous months, 16 were resolved. Most of these were reports from before 2019.

The new total, including some tasks for the current month of May, is 292. A slight decrease! (spreadsheet).

Take a look at the workboard and look for tasks that could use your help.

View Workboard

[Image: proderr-unified 2022-04.png]


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

In a fair fight, I'd kill you!
— Well, that's not much incentive for me to fight fair then, is it?

Written by Krinkle on May 12 2022, 9:00 PM.
Principal Engineer (Wikimedia Performance)