Production Excellence #14: August 2019
Monthly update on our pursuit of operational excellence.

How did we do in our pursuit of operational excellence in August? Read on to find out!

📊 Month in numbers
  • 3 documented incidents. [1]
  • 42 new Wikimedia-prod-error reports. [2]
  • 31 Wikimedia-prod-error reports closed. [3]
  • 210 currently open Wikimedia-prod-error reports in total. [4]

The number of recorded incidents in August, at three, was below average for the year so far. However, in previous years (2017-2018), August also had 2-3 incidents. – Explore this data.

To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.

*️⃣ When you have eliminated the impossible...

Reports from Logstash indicated that some user requests were aborted by a fatal PHP error from the MessageCache class. The user would be shown a generic system error page. The affected requests didn’t seem to have anything obvious in common, however. This made it difficult to diagnose.

MessageCache is responsible for fetching interface messages, such as the localised word “Edit” on the edit button. It calls a “load()” function and then tries to access the loaded information. However, sometimes the load function would claim to have finished its work, yet the information was not there.

When the load function initialises all the messages for a particular language, it keeps track of this, so as to not do the same work a second time. From whichever angle I looked at this code, no obvious mistake stood out. A deeper investigation revealed that two unrelated changes, made more than a year apart, each broke one assumption that was safe to break on its own. Put together, however, they produced this seemingly impossible problem. Check out T208897#5373846 for the details of the investigation.
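
As a rough illustration of how two individually harmless behaviours can combine into a failure like this, here is a minimal Python sketch. It is not the MediaWiki code, and it does not claim to reproduce the actual changes described in T208897. The class `ToyMessageCache`, its `max_languages` limit, and the eviction rule are all hypothetical. The sketch keeps a "loaded" marker per language, and separately bounds how many languages are kept in memory. Either behaviour is fine alone, but together they let load() report success for data that is no longer there.

```python
class ToyMessageCache:
    """Hypothetical stand-in, not the real MediaWiki MessageCache."""

    def __init__(self, max_languages=2):
        self.max_languages = max_languages
        self._messages = {}   # language code -> {message key: text}
        self._loaded = set()  # languages that load() believes are ready

    def _fetch(self, lang):
        # Stand-in for the expensive lookup of all messages for a language.
        return {"edit": f"Edit ({lang})"}

    def load(self, lang):
        # Behaviour 1: skip repeated work for an already initialised language.
        if lang in self._loaded:
            return True
        # Behaviour 2: bound memory use by dropping the oldest language.
        if len(self._messages) >= self.max_languages:
            oldest = next(iter(self._messages))  # dicts keep insertion order
            del self._messages[oldest]           # note: _loaded is not updated
        self._messages[lang] = self._fetch(lang)
        self._loaded.add(lang)
        return True

    def get(self, lang, key):
        if not self.load(lang):
            raise RuntimeError("load failed")
        # Assumes that a successful load() means the data is present.
        return self._messages[lang][key]


if __name__ == "__main__":
    cache = ToyMessageCache(max_languages=2)
    for lang in ("en", "nl", "de"):  # loading "de" evicts "en"
        cache.get(lang, "edit")
    try:
        cache.get("en", "edit")
    except KeyError:
        print("load() reported success, but the messages for 'en' were gone")
```

In this toy version the failure only shows up for requests that touch more languages than the limit allows and then return to an earlier one, which is the kind of condition that leaves affected requests with nothing obvious in common.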

📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.


Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • January: 1 report left (unchanged). ⚠️
  • February: 2 reports left (unchanged). ⚠️
  • March: 4 reports left (unchanged). ⚠️
  • April: 2 reports got fixed! (8 of 14 reports left). ❇️
  • May: 4 of 10 reports left (unchanged).
  • June: 1 report got fixed! (8 of 11 reports left). ❇️
  • July: 2 reports got fixed (17 of 18 reports left).
  • August: 14 new reports remain unsolved.
  • September: 11 new reports remain unsolved.

🎉 Thanks!

Thank you to @aaron, @Catrope, @Daimona, @dbarratt, @Jdforrester-WMF, @kostajh, @pmiazga, @Tarrow, @zeljkofilipin, and everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

🎭 “I think you should call it Seb's because no one will come to a place called Chicken on a Stick.”


[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…

[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Oct 3 2019, 4:27 AM.
Principal Engineer (Wikimedia Performance)