Production Excellence #12: June 2019
Monthly update on our pursuit of operational excellence.

How did we do in our pursuit of operational excellence last month? Read on to find out!

📊 Month in numbers
  • 11 documented incidents. ⚠️ [1]
  • 39 new Wikimedia-prod-error reports. [2]
  • 25 Wikimedia-prod-error reports closed. [3]

The number of incidents in June was high compared to previous years. At 11 incidents, it is higher than this year’s median (5), the 2018 median (4), and the 2017 median (5). It is also higher than any June in the last four years. More data at CodePen.

To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.

There are currently 204 open Wikimedia-prod-error reports (up from 186 in April, and 201 in May). [4]

📖 [Op-ed] Integrated maintenance cost

A shoutout to the Wikidata and Core Platform teams, at WMDE and WMF respectively. Both recently established a rotating subteam that focuses on incidental work, such as maintenance and other work that might otherwise hinder feature development.

I expect this to improve efficiency by avoiding context switches between feature and incidental work. The rotational aspect should distribute the work more evenly among team members, which helps avoid burnout. It may also increase exposure to other teams and to lesser-known areas of our code, which provides opportunities for personal growth and helps retain institutional knowledge.

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the month in which they were first observed.


Or help someone who already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • November: 1 issue got fixed! (1 issue left).
  • December: 3 issues left (unchanged). ⚠️
  • January: 1 issue left (unchanged). ⚠️
  • February: 2 issues left (unchanged). ⚠️
  • March: 4 issues left (unchanged). ⚠️
  • April: 2 issues got fixed! (10 of the 14 issues that survived April remain open). ❇️
  • May: 4 issues got fixed! (6 of the 10 issues that survived May remain open). ❇️
  • June: 11 new issues from last month remain unresolved.

By steward and software component, the unresolved issues that survived June:

  • CPT / MW Auth (PHP fatal): T228717
  • CPT / MW Actor (DB contention): T227739
  • CPT or Multimedia / Thumb handler (MultiCurl error): T225197
  • Multimedia / File metadata (PHP error): T226751
  • Wikidata / Commons page view (PHP fatal): T227360
  • Wikidata / Jobrunner (PHP memory fatal): T227450
  • Wikidata / Jobrunner (Trx error): T225098
  • Product-Infra / ReadingList API (PHP fatal): T226593
  • (Unknown?) / Special:ConfirmEmail (PHP fatal): T226337
  • (Unknown?) / Page renaming (DB timeout): T226898
  • (Unknown?) / Page renaming (Bad revision fatal): T225366
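
For a quick per-steward tally of the list above, a minimal Python sketch follows. The tuples are copied by hand from the list. The `UNRESOLVED` name and the `count_by_steward` helper are illustrative assumptions, not part of any Wikimedia tooling. The shared "CPT or Multimedia" entry is kept as its own bucket rather than guessed into either team.

```python
from collections import Counter

# Unresolved issues that survived June, copied from the list above as
# (steward, component, task) tuples. Hypothetical structure for illustration.
UNRESOLVED = [
    ("CPT", "MW Auth", "T228717"),
    ("CPT", "MW Actor", "T227739"),
    ("CPT or Multimedia", "Thumb handler", "T225197"),
    ("Multimedia", "File metadata", "T226751"),
    ("Wikidata", "Commons page view", "T227360"),
    ("Wikidata", "Jobrunner", "T227450"),
    ("Wikidata", "Jobrunner", "T225098"),
    ("Product-Infra", "ReadingList API", "T226593"),
    ("(Unknown?)", "Special:ConfirmEmail", "T226337"),
    ("(Unknown?)", "Page renaming", "T226898"),
    ("(Unknown?)", "Page renaming", "T225366"),
]


def count_by_steward(issues):
    """Return a Counter mapping each steward label to its number of open issues."""
    return Counter(steward for steward, _component, _task in issues)


if __name__ == "__main__":
    # Print stewards from most to fewest unresolved issues.
    for steward, count in count_by_steward(UNRESOLVED).most_common():
        print(f"{steward}: {count}")
```
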

💡 Ideas: To suggest something to investigate or highlight in a future edition, contact me by e-mail or private IRC message.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production, including: @Anomie, @brion, @Catrope, @cscott, @daniel, @dcausse, @DerFussi, @Ebe123, @fgiunchedi, @Jdforrester-WMF, @kostajh, @Legoktm, @Lucas_Werkmeister_WMDE, @matmarex, @matthiasmullie, @Michael, @Nikerabbit, @SBisson, @Smalyshev, @Tchanders, @Tgr, @Tpt, @Umherirrender, and @Urbanecm.


Until next time,

– Timo Tijhof

🔮 “These are his marbles...” “Ha! He really did lose his marbles, didn't he?” “Yeah, he lost them good.”


  1. Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex…
  2. Tasks created. – phabricator.wikimedia.org/maniphest/query…
  3. Tasks closed. – phabricator.wikimedia.org/maniphest/query…
  4. Open tasks. – phabricator.wikimedia.org/maniphest/query…
Written by Krinkle on Jul 31 2019, 6:44 PM.
Principal Engineer (Wikimedia Performance)