Production Excellence #28: January 2021
Monthly update on our striving for operational excellence.

How’d we do in our striving for operational excellence last month? Read on to find out!

📈 Incidents

1 documented incident last month. That's the third month in a row that we are at or near zero major incidents – not bad! [1] [2]

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.

💡 Did you know: Our Incident status page shows a green or yellow status reflecting the past ten days, with a link to the most recent incident doc if there was one during that time.

📊 Trends

This January saw a small recovery in our otherwise negative upward trend. For the first time in twelve months, more reports were closed than the number of new reports that outlived the previous month without resolution. What happened twelve months ago? In January 2020, we also saw a small recovery amid the otherwise upward trend before and after it.

Perhaps it's something about the post-December holidays that temporarily improves the quality, and/or reduces the quantity, of code changes. Only time will tell if this is the start of a new positive trend, or merely a post-holiday break. [3]

Unresolved error reports stacked by recent month

While our month-to-month trend might not (yet) be improving, we do see persistent improvements in our overall backlog of pre-2019 reports. This is in part because we generally don't file new reports there, so it makes sense that it doesn't go back up. It's still good to see downward progress every month, unlike reports from more recent months, which often see no change month-to-month (see "Outstanding errors" below, for example).

This positive trend on our "Old" backlog started in October 2020 and has consistently progressed every month since then (refer to the "Old" numbers in red on the chart below, or the same column in the spreadsheet). [3][4]

Total open production error tasks, by month

📖 Outstanding errors

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ✅ September 2019 (0 of 12 issues): Last two tasks were resolved (-2).
  • ⚠️ October 2019 (4 of 12 issues): One task resolved (-1).
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (2 of 9 issues): Two tasks resolved (-2).
  • ⚠️ January 2020 (2 of 7 issues): no change.
  • ⚠️ February 2020 (1 of 7 issues left): One task resolved (-1).
  • March 2020 (2 of 2 issues left): no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (6 of 14 issues left): One task resolved (-1).
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (22 of 53 new issues): One task resolved (-1).
  • September 2020 (13 of 33 new issues): One task resolved (-1).
  • October 2020 (31 of 69 new issues): Four tasks fixed (-4).
  • November 2020 (14 of 38 new issues): no change.
  • December 2020 (19 of 33 new issues): Three tasks resolved (-3).
  • January 2021: 7 of 50 new issues survived the month and remained unresolved (+50; -43).

Recent tally

  160 issues open, as of Excellence #27 (4 Feb 2021).
  -15 issues closed since, of the previous 160 open issues.
   +7 new issues that survived January 2021.
  152 issues open, as of today (16 Feb 2021).
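
The recent tally is simple bookkeeping. As a minimal sketch (the function and variable names here are illustrative and not taken from the Production Excellence spreadsheet), the open count and January's within-month resolution rate can be reproduced from the figures reported in this post:

```python
def open_issue_tally(previous_open: int, closed_since: int, new_survivors: int) -> int:
    """Open issues now = previously open - closed since + new issues still unresolved."""
    return previous_open - closed_since + new_survivors


# Figures as reported in the recent tally.
previous_open = 160   # open as of Excellence #27
closed_since = 15     # closed out of those 160
new_survivors = 7     # January 2021 reports still unresolved

current_open = open_issue_tally(previous_open, closed_since, new_survivors)
print(current_open)  # 152

# January 2021: 50 new reports, of which 7 survived the month.
reported_in_january = 50
resolved_within_month = reported_in_january - new_survivors
print(f"{resolved_within_month} of {reported_in_january} resolved "
      f"({resolved_within_month / reported_in_january:.0%})")  # 43 of 50 resolved (86%)
```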

January saw 50 new production errors reported, which is an unfortunate all-time high for a single month. However, we also did remarkably well by addressing 43 of them within the month, while the potential root cause and diagnostics data were still fresh in our minds. Well done!

For the ongoing month of February, there have been 16 new issues reported so far.

Take a look at the workboard and look for tasks that could use your help!

View Workboard

🎉 Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


[1] Incident status, Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.

Written by Krinkle on Feb 19 2021, 6:45 AM.
Principal Engineer (Wikimedia Performance)
