How’d we do in our pursuit of operational excellence last month? Read on to find out!
One documented incident last month:
Impact: A portion of outgoing email from wikimedia.org was delivered with a delay of up to 24 hours. This affected staff Gmail as well as Znuny and Phabricator notifications. No mail was lost; all of it was eventually delivered.
Image from Incident graphs.
Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident. Read about past incidents at Incident status on Wikitech.
Recently resolved incident follow-up:
Create paging alert for high MX queues.
Filed in December after the mail delivery incident, resolved later that month by Keith (Herron).
Limit db execution time of expensive MW special pages.
Filed in December after various incidents due to high DB/appserver load, carried out by Amir (Ladsgroup).
In December we reported 22 new errors, of which 5 have since been resolved and 17 remain open and carry over to January. Of the 298 issues previously carried over, we also resolved 17, so the workboard still adds up to 298 in total.
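For readers who want to check the workboard arithmetic, here is a minimal sketch. The numbers are taken from the paragraph above; the variable names are illustrative assumptions and do not correspond to any Wikimedia tool or query.

```python
# Figures from this edition; names are hypothetical, chosen for illustration.
carried_over_before = 298  # open issues carried over into December
new_reported = 22          # new errors reported in December
new_resolved = 5           # new errors that have since been resolved
old_resolved = 17          # previously carried-over issues that were resolved

# New errors that remain open and carry over to January.
new_still_open = new_reported - new_resolved

# Open issues on the workboard after December.
open_after = carried_over_before - old_resolved + new_still_open

print(new_still_open)  # 17
print(open_after)      # 298
```

The 17 older issues that were closed are offset by the 17 new ones that remain open, which is why the total does not move.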
In previous editions, we sometimes looked at the breakdown of tasks that remained unresolved. This time, I'd like to draw attention to the throughput and distribution of tasks that did get resolved.
Production errors resolved in the month of December, by team and component (query):
- Community-Tech (2): GlobalPreferences (1), CodeMirror (1).
- DBA: DjVuHandler (1).
- Editing-team: DiscussionTools (1).
- Fundraising Tech: CentralNotice (1).
- Growth-Team (8): GrowthExperiments (6), Image-Suggestions (1), StructuredDiscussions (1).
- Language-Team: UniversalLanguageSelector (1).
- Parsoid (1).
- Product-Infrastructure: TemplateStyles (1).
- Readers-Web (2).
- Structured-Data (2).
- Wikidata team: Wikidata-Page-Banner (1).
- Missing steward (1): MediaWiki-Logevents (T289806: Thanks @Umherirrender!).
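As a quick cross-check, the per-team counts in the list above can be tallied in a few lines. This is only a sketch: the dictionary is transcribed by hand from the list, and the variable names are assumptions for illustration.

```python
# Resolved production errors in December, per team, transcribed from the list above.
resolved_by_team = {
    "Community-Tech": 2,
    "DBA": 1,
    "Editing-team": 1,
    "Fundraising Tech": 1,
    "Growth-Team": 8,
    "Language-Team": 1,
    "Parsoid": 1,
    "Product-Infrastructure": 1,
    "Readers-Web": 2,
    "Structured-Data": 2,
    "Wikidata team": 1,
    "Missing steward": 1,
}

total_resolved = sum(resolved_by_team.values())
top_team = max(resolved_by_team, key=resolved_by_team.get)

print(total_resolved)  # 22
print(top_team)        # Growth-Team
```

The total of 22 matches the 5 new and 17 older errors reported as resolved in this edition.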
For the month-over-month numbers, refer to the spreadsheet data.
Take a look at the workboard and look for tasks that could use your help.
Oldest unresolved errors:
- (June 2020) WikibaseClient: RuntimeException in wblistentityusage API. T254334
- (June 2020) WikibaseClient: Deadlock in EntityUsageTable::addUsages method. T255706
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof