Production Excellence #6: December 2018
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

  • Month in numbers.
  • Lightning round.
  • Current problems.

📊 Month in numbers

  • 4 documented incidents. [1]
  • 20 Wikimedia-prod-error tasks closed. [2]
  • 18 Wikimedia-prod-error tasks created. [3]
  • 172 currently open Wikimedia-prod-error tasks (as of 16 January 2019).

Terminology:

  • An Exception (or fatal) prevents a user action. For example, a page would display “Exception: Unable to render page” instead of the article content.
  • An Error (or non-fatal, warning) lets the page load without signalling a problem, but the page may show corrupt, incorrect, or incomplete information. For example, a user may receive a notification that says “You have (null) new messages”.

For December, I haven’t prepared any stories or taken interviews. Instead, I’ve got a lightning round of errors in various areas that were found and fixed this past month.

⚡️ Contributions view fixed

MarcoAurelio reported that Special:Contributions failed to load for certain user names on meta.wikimedia.org (PHP Fatal error, due to a faulty database record). Brad Jorsch investigated and found a relation to database maintenance from March 2018. He corrected the faulty records, which resolved the problem. Thanks! — T210985

⚡️ Undefined talk space now defined

The newly created Cantonese Wiktionary (yue.wiktionary.org) was encountering errors from the Siteinfo API. We found this was due to invalid site configuration. Urbanecm patched the issue, and also created a new unit test for wmf-config that will prevent this issue from happening on other wikis in the future. Thanks! — T211529

⚡️ The undefined error status... error

After deploying the 1.33.0-wmf.8 train to all wikis, we found a regression in the HTTP library for MediaWiki. When MediaWiki requested an HTTP resource from another service, and this resource was unavailable, then MediaWiki failed to correctly determine the HTTP status code of that error. Which then caused another error! This happened, for example, when Special:Collection was unable to reach the PediaPress.com backend in some cases. Patched by Bill Pirkle. Thanks! — T212005
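
To illustrate the failure mode described above, here is a minimal sketch in Python of the defensive pattern. It is not the actual MediaWiki patch: the function names `parse_status_code` and `describe_failure` are hypothetical. The idea is that when a backend is unreachable there may be no status line at all, so the caller should treat the status code as "unknown" rather than assume one exists.

```python
from typing import Optional


def parse_status_code(status_line: Optional[str]) -> Optional[int]:
    """Return the HTTP status code from a status line, or None if unavailable.

    Hypothetical helper for illustration; not part of MediaWiki.
    """
    # An unreachable backend may yield no status line at all.
    if not status_line:
        return None
    parts = status_line.split()
    # A well-formed status line looks like "HTTP/1.1 503 Service Unavailable".
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return None
    if len(parts[1]) != 3 or not parts[1].isdigit():
        return None
    return int(parts[1])


def describe_failure(status_line: Optional[str]) -> str:
    """Build an error message without assuming a status code exists."""
    code = parse_status_code(status_line)
    if code is None:
        return "request failed: no HTTP status available"
    return f"request failed: HTTP {code}"
```

With this shape, reporting the first error (the unavailable resource) cannot itself raise a second error just because the status is missing.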

⚡️ Fatal error: Call to undefined function in Kartographer API

When the 1.33.0-wmf.9 train reached the canary phase on Tue 18 December (aka, group0 [4]), Željko spotted a new fatal error in the logs. The fatal originated in the Kartographer extension and would have affected various users of the MediaWiki API. Patched the same day by Michael Holloway, reviewed by James Forrester, and deployed by Željko. Thanks! — T212218

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

→ https://phabricator.wikimedia.org/tag/wikimedia-production-error

November's theme will continue for now, as I imagine lots of you were on vacation during that time! I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (e.g. don’t need elevated user rights) and emit an HTTP 500 error code.

  1. Wikibase: Clicking “undo” for certain revisions fatals with a PatcherException. — T97146
  2. Flow: Unable to view certain talk pages due to workflow InvalidDataException. — T70526
  3. Translate: Certain Special:Translate urls fatal. — T204833
  4. MediaWiki (Special-pages): SpecialDoubleRedirects unavailable on tt.wikipedia.org. — T204800
  5. MediaWiki (Parser): Parse API exposes fatal content model error. — T206253
  6. CentralNotice: Certain SpecialCentralNoticeBanners urls fatal. — T149240
  7. PageViewInfo: Certain “mostviewed” API queries fail. — T208691

Public user requests resulting in fatals can cause (and have caused) alerts to fire that notify SRE of wikis potentially being less available or down.

💡 ProTip:

Use “Report Error” on https://phabricator.wikimedia.org/tag/wikimedia-production-error/ to create a task with a helpful template. This template is also available as “Report Application Error”, from the “Create Task” dropdown menu, on any task creation form.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production, including @MarcoAurelio, @Anomie, @Urbanecm, @BPirkle, @zeljkofilipin, @Mholloway, @Esanders, @Jdforrester-WMF, and @hashar.

Until next time,

— Timo Tijhof


Footnotes:

[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages...

[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query...

[3] Tasks opened. — phabricator.wikimedia.org/maniphest/query...

[4] What is group0? — wikitech.wikimedia.org/wiki/Deployments/One_week#Three_groups

Written by Krinkle on Jan 22 2019, 2:54 AM.
Principal Engineer (Wikimedia Performance)