Production Excellence #7: January 2019
Monthly update on our drive for operational excellence.

How’d we do in our drive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 4 documented incidents in January 2019. [1]
  • 16 Wikimedia-prod-error tasks closed. [2]
  • 17 Wikimedia-prod-error tasks created. [3]

📖 Unable to move certain file pages

Xiplus reported that renaming a File page on zh.wikipedia.org led to a fatal database exception. Andre Klapper identified the stack trace from the logs, and Brad (@Anomie) investigated.

The rename failed because the File page did not have a media file associated with it (MediaWiki does not currently allow such a move). But while handling this error, the code caused a different error. As a result, the user was not told why the move failed. Instead, they received a generic error page about a fatal database exception.

@Tgr fixed the code a few hours later, and it was deployed by Roan later that same day. Thanks! — T213168

📖 DBPerformance regression detected and fixed

During a routine audit of Logstash dashboards, I found a DBPerformance warning. The warning indicated that the limit of 0 for “master connections” was violated. That's a cryptic way of saying it found code in MediaWiki that uses a database master connection on a regular page view.

MediaWiki can have many replica database servers, but there can be only one master database at any given moment. To reduce the chance of overload, delayed edits, or network congestion, we use replicas whenever possible. We usually involve the master only when source data is being changed or is about to be changed, for example when a user edits a page or saves changes.
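
As a rough illustration of the kind of check behind such a warning, here is a minimal Python sketch. It is not MediaWiki's actual code (which is PHP); the names `limit_for`, `ConnectionTracker`, and `get_connection`, the limit of 1 for POST requests, and the wording of the logged message are all assumptions made up for this example.

```python
import logging

logger = logging.getLogger("DBPerformance")


def limit_for(http_method):
    # Assumption: page views (GET) get a limit of 0 master connections,
    # while requests that save changes (POST) may open one.
    return 0 if http_method == "GET" else 1


class ConnectionTracker:
    """Hypothetical per-request tracker; not MediaWiki's actual API."""

    def __init__(self, max_master_connections):
        self.max_master_connections = max_master_connections
        self.master_connections = 0
        self.violations = []

    def get_connection(self, for_write=False):
        # Reads are served by a replica and never count against the limit.
        if not for_write:
            return "replica"
        self.master_connections += 1
        if self.master_connections > self.max_master_connections:
            # Report the violation but still hand out the connection,
            # so the request keeps working while the problem gets logged.
            message = (
                f"master connections: limit {self.max_master_connections}, "
                f"actual {self.master_connections}"
            )
            self.violations.append(message)
            logger.warning(message)
        return "master"


page_view = ConnectionTracker(limit_for("GET"))
page_view.get_connection()                # served by a replica
page_view.get_connection(for_write=True)  # violates the limit of 0
print(page_view.violations)
```

In this sketch the violation is only logged and the connection is still handed out, which is why a problem like this can go unnoticed until someone looks at the dashboards.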

As the vast majority of traffic is page views, we hold page views to stricter limits on latency and dependencies. In particular, page views may (in the future) be routed to secondary data centres that don’t even have a master DB.

@Tchanders from the Anti-Harassment team investigated the issue, found the culprit, and fixed it in time for the next MediaWiki train. Thanks! — T214735

📖 TemplateData missing in action

@Tacsipacsi and @Evad37 both independently reported the same TemplateData issue. TemplateData powers the template insertion dialog in VisualEditor. It wasn't working for some templates after we deployed the 1.33-wmf.13 branch.

The error was “Argument 1 passed to ApiResult::setIndexedTagName() must be an instance of array, null given”. This means there was code that called a function with the wrong argument. For example, the variable name may have been misspelled, or it may have been the wrong variable, or (in this case) the variable didn't exist. In such a case, PHP implicitly assumes “null”.

Bartosz (@matmarex) found the culprit. The week before, I made a change to TemplateData that changed the “template parameter order” feature to be optional. This allows users to decide whether VisualEditor should force an order for the parameters in the wikitext. It turned out I forgot to update one of the references to this variable, which still assumed it was always present.
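
The same class of bug can be sketched in Python. This is an analogue only, not the TemplateData code and not the actual fix: the function names and the `params` and `paramOrder` keys are assumptions chosen for illustration, and `dict.get()` returning `None` stands in for PHP treating a missing variable as null.

```python
def set_indexed_tag_name(values, tag):
    """Stand-in for a strictly typed function that accepts only a list."""
    if not isinstance(values, list):
        raise TypeError(f"values must be a list, {type(values).__name__} given")
    return {"tag": tag, "items": values}


def format_template_buggy(data):
    # Bug: assumes "paramOrder" is always present. When it is absent,
    # .get() quietly returns None, much like an undefined PHP variable.
    return {
        "params": data["params"],
        "paramOrder": set_indexed_tag_name(data.get("paramOrder"), "p"),
    }


def format_template_fixed(data):
    result = {"params": data["params"]}
    # Only format the order when the template actually declares one.
    if "paramOrder" in data:
        result["paramOrder"] = set_indexed_tag_name(data["paramOrder"], "p")
    return result


template_without_order = {"params": {"name": {}, "date": {}}}

try:
    format_template_buggy(template_without_order)
except TypeError as error:
    print(f"buggy version failed: {error}")

print(format_template_fixed(template_without_order))
```

The point of the sketch is that making a field optional requires every reader of that field to handle its absence, not just the code that writes it.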

Brad (@Anomie) fixed it later that week, and it was deployed the next day. Thanks! — T213953

📈 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.


As of 12 February 2019, there are 188 open Wikimedia-prod-error tasks. (We’ve had a slight increase since November: 165 in December, 172 in January.)

For this month’s edition, I’d like to draw attention to a few older issues that are still reproducible:

  • [2013; Collection extension] Special:Book fatal error for blocked users. T56179
  • [2013; CentralNotice] Fatal error when placeholder key contains a space. T58105
  • [2014; LQT] Fatal error when attempting to view certain threads. T61791
  • [2015; MassMessage] Warning about Invalid message parameters. T93110
  • [2015; Wikibase] Warning “UnresolvedRedirectException” for some pages on Wikidata (and Commons). T93273

💡 Terminology:

A “Fatal error” (or uncaught exception) prevents a user action. For example, a page might display “MWException: Unknown class NotificationCount.” instead of the article content.
A “Warning” (or non-fatal, or PHP error) lets the program continue and display most of the page regardless. This may cause corrupt, incorrect, or incomplete information to be shown. For example, a user may receive a notification that says “You have (null) new messages”.
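
To make the distinction concrete, here is a small Python sketch of the two behaviours. It is an analogy rather than MediaWiki code: `render_article`, `render_notice`, and the logger name are hypothetical, and the messages are borrowed from the examples above.

```python
import logging

logger = logging.getLogger("prod-error")


def render_article(title, known_classes):
    # Fatal: nothing inside this function catches the exception,
    # so the article content is never produced.
    if "NotificationCount" not in known_classes:
        raise RuntimeError("Unknown class NotificationCount.")
    return f"Article: {title}"


def render_notice(message_count):
    # Warning: the problem is logged, and rendering continues with bad data.
    if message_count is None:
        logger.warning("message count is undefined")
        message_count = "(null)"
    return f"You have {message_count} new messages"


print(render_notice(None))

try:
    render_article("Example", known_classes=set())
except RuntimeError as error:
    # Caught here only so the sketch runs to completion.
    print(f"Fatal error page: {error}")
```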

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production, including: A2093064, @Anomie, @Daimona, @Gilles, @He7d3r, @Jdforrester-WMF, @matmarex, @mmodell, @Nikerabbit, @Catrope, @Tchanders, @Tgr, and @thiemowmde.


Until next time,

— Timo Tijhof

👢There's a snake in my boot. Reach for the sky!


[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…

[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query…

[3] Tasks created. — phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Feb 13 2019, 3:53 AM.
Principal Engineer (Performance)

Event Timeline

Thanks for the excellence in summary, @Krinkle! And indeed thanks to everyone who helps out with production errors, it's often thankless and rarely glamorous but nonetheless very important work done by everyone involved. Respect! :)