Production Excellence #8: February 2019
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence? Read on to find out!

📊 Month in numbers
  • 7 documented incidents. [1]
  • 30 new Wikimedia-prod-error tasks created. [2] (17 new in Jan, and 18 in Dec.)
  • 27 Wikimedia-prod-error tasks closed. [3] (16 closed in Jan, and 20 in Dec.)

In total, there are 177 open Wikimedia-prod-error tasks today. (188 in Feb, 172 in Jan, and 165 in Dec.)

📉 Current problems

The number of application errors reported each week has increased. We’ve mostly managed to keep up with them, which is great!

However, most weeks we still accumulated one or two unresolved errors, and that is starting to add up. I believe this is mainly because they were reported a day after the branch went out. That is, had the same issues been reported 24 hours earlier in a given week, they might’ve blocked the train as a regression.


Below is a breakdown of unresolved prod errors since last quarter. (I’ve omitted the last three weeks.)

By month:

  • February: 5 reports (1.33-wmf.16, 1.33-wmf.17, 1.33-wmf.18).
  • January: 3 reports (1.33-wmf.13, 1.33-wmf.14).
  • December 2018: 5 reports (1.33-wmf.9).
  • November 2018: 3 reports (1.33-wmf.2).
  • October 2018: 1 report (1.32-wmf.26).
  • September 2018: 2 reports (1.32-wmf.20).

By steward and software component:

📖 Fixed exposed fatal error on Special:Contributions

Previously, a link to Special:Contributions could pass invalid options to a part of MediaWiki that doesn’t allow invalid options. Why would anything allow invalid options? Let’s find out.

Think about software as an onion. Software tends to have an outer layer where everything is allowed. If this layer finds illegal user input, it has to respond somehow. For example, by informing the user. In this outer layer, illegal input is not a problem in the software. It is a normal thing to see as we interact with the user. This outer layer responds directly to a user, is translated, and can do things like “view recent changes”, “view user contributions” or “rename a page”.

Internally, such an action is divided into many smaller tasks (or functions). For example, a function might be “get talk namespace for given subject namespace”. For “(Article)” it would answer “Talk:”, and for “Wikipedia:” it would answer “Wikipedia_talk:”. When searching for edits on My Contributions with “Associated namespaces” ticked, this function is used. It is also used by Move Page when renaming a page together with its talk page. And it’s used on Recent Changes and View History, for all those little “talk” links next to each page title and username.

If one of your edits is for a page that has no discussion namespace, what should MediaWiki do? Show no edits? Skip that edit and tell the user “1 edit was hidden”? Show normally, but without a talk link? That decision is made by the outer layer for a feature, when it catches the internal exception. Alternatively, it can sometimes avoid an exception by asking a different question first – a question that cannot fail. Such as “Does namespace X have a talk space?”, instead of “What is the talk space for X?”.
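To make the two options concrete, here is a minimal sketch in Python. Every name in it (get_talk_namespace, has_talk_namespace, NoTalkNamespaceError, the two row functions) and the small namespace table are made up for illustration; this is not MediaWiki’s actual code or API. It shows an inner function that refuses invalid input, and an outer layer that either catches the resulting exception or asks the question that cannot fail first.

```python
# Illustrative sketch only: every name here is hypothetical, not MediaWiki's API.


class NoTalkNamespaceError(Exception):
    """Raised by the inner layer when a namespace has no talk namespace."""


# Assumed sample data: subject namespace -> talk namespace.
# "" stands for the main (article) namespace. "Special" is deliberately absent.
TALK_NAMESPACES = {"": "Talk", "Wikipedia": "Wikipedia_talk"}


def has_talk_namespace(subject: str) -> bool:
    """The question that cannot fail: it always answers True or False."""
    return subject in TALK_NAMESPACES


def get_talk_namespace(subject: str) -> str:
    """Inner layer: rejects invalid input instead of guessing an answer."""
    if subject not in TALK_NAMESPACES:
        raise NoTalkNamespaceError(f"no talk namespace for {subject!r}")
    return TALK_NAMESPACES[subject]


def page_name(namespace: str, title: str) -> str:
    """Format a full page name; the main namespace has no prefix."""
    return f"{namespace}:{title}" if namespace else title


def row_catching(namespace: str, title: str) -> str:
    """Outer layer, option 1: catch the exception and decide what to show."""
    page = page_name(namespace, title)
    try:
        talk = get_talk_namespace(namespace)
    except NoTalkNamespaceError:
        # The feature's decision: show the edit normally, without a talk link.
        return page
    return f"{page} (talk: {talk}:{title})"


def row_asking_first(namespace: str, title: str) -> str:
    """Outer layer, option 2: ask the safe question before the risky one."""
    page = page_name(namespace, title)
    if not has_talk_namespace(namespace):
        return page
    return f"{page} (talk: {get_talk_namespace(namespace)}:{title})"
```

Both row functions produce the same output for the same edit; they differ only in where the “no talk namespace” case is handled. A caller that does neither, and calls get_talk_namespace("Special") directly, lets the exception escape.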

When a program doesn’t catch or avoid an exception, a fatal error occurs. Thanks to @D3r1ck01 for fixing this fatal error. – T150324

💡 ProTip: If your Jenkins build is failing and you suspect it’s unrelated to the project itself, be sure to report it to Phabricator under “Shared Build Failure”.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @aaron, @Addshore, @alaa_wmde, @Amorymeltzer, @Anomie, @D3r1ck01, @Daimona, @daniel, @hashar, @hoo, @jcrespo, @KaMan, @Mainframe98, @Marostegui, @matej_suchanek, @Ottomata, @Pchelolo, @Reedy, @revi, @Smalyshev, @Tarrow, @Tgr, @thcipriani, @Umherirrender, and @Volker_E.


Until next time,

– Timo Tijhof


[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…

[2] Tasks created. — phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. — phabricator.wikimedia.org/maniphest/query…

🍏 He got me invested in some kind of.. fruit company.

Written by Krinkle on Mar 21 2019, 7:11 PM.
Principal Engineer (Wikimedia Performance)

Event Timeline

I really appreciate that you take the time to compile this data into such a nice summary. Thank you!