Production Excellence #39: December 2021
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

One documented incident last month:

2021-12-03 mx
Impact: A portion of outgoing email from wikimedia.org was delivered with a delay of up to 24 hours. This affected staff Gmail and Znuny/Phabricator notifications. No mail was lost; all of it was eventually delivered.

[Image: proderr-incidents 2021-12.png, from Incident graphs.]


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident. Read about past incidents at Incident status on Wikitech.

Recently resolved incident follow-up:

Create paging alert for high MX queues.
Filed in December after the mail delivery incident, resolved later that month by Keith (Herron).

Limit db execution time of expensive MW special pages.
Filed in December after various incidents due to high DB/appserver load, carried out by Amir (Ladsgroup).


Trends

In December we reported 22 new errors, of which 5 have since been resolved, while 17 remain open and have carried over to January. Of the 298 issues previously carried over, we also resolved 17, so the workboard still adds up to 298 in total.
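As a quick sanity check, the workboard arithmetic can be sketched as follows. The figures are the ones quoted in the paragraph above; the variable names are illustrative assumptions and do not come from any Wikimedia tooling.

```python
# Figures quoted in the Trends paragraph; variable names are illustrative.
carried_over_before = 298  # open issues carried into December
new_reported = 22          # new errors reported in December
new_resolved = 5           # of those new errors, resolved since
old_resolved = 17          # previously carried-over issues that were resolved

# New errors that remain open and carry over to January.
new_still_open = new_reported - new_resolved

# Old issues leave the board when resolved; still-open new ones join it.
workboard_total = carried_over_before - old_resolved + new_still_open

print(new_still_open)   # 17
print(workboard_total)  # 298
```

The total is unchanged because the 17 older issues that were closed are exactly offset by the 17 new ones that remain open.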

In previous editions, we sometimes looked at the breakdown of tasks that remained unresolved. This time, I'd like to draw attention to the throughput and distribution of tasks that did get resolved.

Production errors resolved in the month of December, by team and component (query):

  • Community-Tech (2): GlobalPreferences (1), CodeMirror (1).
  • DBA: DjVuHandler (1).
  • Editing-team: DiscussionTools (1).
  • Fundraising Tech: CentralNotice (1).
  • Growth-Team (8): GrowthExperiments (6), Image-Suggestions (1), StructuredDiscussions (1).
  • Language-Team: UniversalLanguageSelector (1).
  • Parsoid (1).
  • Product-Infrastructure: TemplateStyles (1).
  • Readers-Web (2).
  • Structured-Data (2).
  • Wikidata team: Wikidata-Page-Banner (1).
  • Missing steward (1): MediaWiki-Logevents (T289806: Thanks @Umherirrender!).
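The per-team counts in the list above can be tallied with a short sketch. The dictionary is transcribed by hand from the list (it is not the output of the linked query), and the names used are illustrative assumptions.

```python
# Per-team resolved counts, transcribed by hand from the list above.
resolved_by_team = {
    "Community-Tech": 2,
    "DBA": 1,
    "Editing-team": 1,
    "Fundraising Tech": 1,
    "Growth-Team": 8,
    "Language-Team": 1,
    "Parsoid": 1,
    "Product-Infrastructure": 1,
    "Readers-Web": 2,
    "Structured-Data": 2,
    "Wikidata team": 1,
    "Missing steward": 1,
}

total_resolved = sum(resolved_by_team.values())
top_team = max(resolved_by_team, key=resolved_by_team.get)

print(total_resolved)  # 22
print(top_team)        # Growth-Team
```

The total of 22 is consistent with the 5 new and 17 older issues reported as resolved, assuming all of them were closed in December.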

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
View Workboard

Oldest unresolved errors:

  • (June 2020) WikibaseClient: RuntimeException in wblistentityusage API. T254334
  • (June 2020) WikibaseClient: Deadlock in EntityUsageTable::addUsages method. T255706

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

💡 Did you know:

To find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

Written by Krinkle on Jan 17 2022, 10:16 PM.
Principal Engineer (Wikimedia Performance)