Production Excellence #5: November 2018
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

  • Month in numbers.
  • Highlighted stories.
  • Current problems.

📊 Month in numbers

  • 4 documented incidents in November 2018. [1]
  • 42 Wikimedia-prod-error tasks closed in November 2018. [2]
  • 36 Wikimedia-prod-error tasks created in November 2018. [3]
  • 165 currently open Wikimedia-prod-error tasks (as of 12 December 2018).

Terminology:

  • An Exception (or fatal) causes user actions to be prevented. For example, a page would display "Exception: Unable to render page" instead of the article content.
  • An Error (or non-fatal, or warning) can produce page views that do not signal a problem, but may show corrupt, incorrect, or incomplete information. Examples: an article displays the code word “null” instead of the actual content; a user looking for Vegetables is taken to an article about Vegetarians; a user receives a notification that says “You have (null) new messages.”
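
To make the distinction concrete, here is a minimal Python sketch. It is not MediaWiki code: `RenderError`, `render_page`, and `render_notification` are hypothetical names, and serialising the count through JSON is only an assumption chosen to reproduce the “null” symptom.

```python
import json


class RenderError(Exception):
    """Hypothetical fatal error: the page cannot be rendered at all."""


def render_page(content):
    # Fatal path: the user action is prevented and an exception surfaces.
    if content is None:
        raise RenderError("Unable to render page")
    return f"<article>{content}</article>"


def render_notification(count):
    # Non-fatal path: nothing is raised, but a missing value leaks into
    # the output because None serialises to the JSON literal "null".
    return f"You have ({json.dumps(count)}) new messages."
```

Calling `render_page(None)` stops the request with an exception, whereas `render_notification(None)` succeeds and quietly returns incorrect text.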

With that behind us... Let’s celebrate this month’s highlights!

*️⃣ Fatal DB exception at wikitech.wikimedia.org

Quiddity reported that he was unable to disable a spam account, due to a fatal exception. Andre Klapper used the Exception ID to find the stack trace in the logs. The trace revealed that a table was missing in Wikitech’s database.

The MediaWiki software was recently expanded with a “Partial blocking” ability. [4] This involved introducing a new database table that stores block metadata differently. This software update was deployed to Wikitech, but this new table was not created.

@Marostegui (Database administrator) quickly applied the schema patches that create the missing table. Thanks Manuel, Andre, and Quiddity; Teamwork!

T209674

*️⃣ Big-page Deletion Unleashed!

It had been known for years [5] that users were unable to delete or restore pages with more than a few hundred revisions. Attempts to do so could fail with a fatal “DBTransactionSizeError” exception. This error indicates that the change is too big or too slow. Such changes risk replication lag, and may impact the stability of the infrastructure.

The database structure used by MediaWiki for page archives dates back to 2003 (over 15 years ago). I'll spare you the details, but it depends on database interactions that are inherently slow when applied to systems as big as Wikipedia! RFC T20493 intends to modernise this structure for the long term.

Then along came @BPirkle. Bill joined the WMF Core platform team earlier this year. He took on the challenge of making page deletion work for any size page, today.

Previously, page deletion happened in a single step. This simple approach had the benefit of either succeeding in its entirety, or safely rolling back like nothing happened. It also meant that the database protected us against conflicting changes. In August, Bill started a two-month effort that carefully split the logic for “delete a page” into smaller steps that are each safe and quick. It now uses our JobQueue to schedule and run these steps, without the user having to wait for them.
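
The general idea of splitting one large change into small queued steps can be sketched as follows. This is an illustration only, not the MediaWiki implementation: the dictionaries stand in for database tables, the `deque` stands in for the JobQueue, and `BATCH_SIZE`, `delete_batch`, and `run_queue` are hypothetical names.

```python
from collections import deque

BATCH_SIZE = 100  # assumed limit, chosen only for illustration


def delete_batch(live, archive, page_id, batch_size=BATCH_SIZE):
    """Move at most batch_size revisions of a page from live to archive.

    Returns True when revisions remain, so the caller can schedule
    another step.
    """
    revisions = live.get(page_id, [])
    batch, rest = revisions[:batch_size], revisions[batch_size:]
    archive.setdefault(page_id, []).extend(batch)
    if rest:
        live[page_id] = rest
        return True
    live.pop(page_id, None)
    return False


def run_queue(queue, live, archive):
    """Run queued deletion steps until the queue is empty; return steps run."""
    steps = 0
    while queue:
        page_id = queue.popleft()
        steps += 1
        if delete_batch(live, archive, page_id):
            queue.append(page_id)  # re-enqueue: more revisions remain
    return steps


# In-memory stand-ins for the database tables and the job queue.
live = {"Big_page": list(range(250))}
archive = {}
queue = deque(["Big_page"])
steps = run_queue(queue, live, archive)
```

With 250 revisions and a batch size of 100, the queue runs three steps, each touching at most 100 revisions. The trade-off is the one described above: the all-or-nothing guarantee of a single transaction is lost, so each step has to be safe on its own.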

T198176

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

→ https://phabricator.wikimedia.org/tag/wikimedia-production-error

I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (i.e. don’t require elevated user rights) and return an HTTP 500 status code.

  1. CentralNotice: Some Special:CentralNoticeBanners urls fatal. – T149240
  2. Flow: Unable to view certain talk pages due to workflow InvalidDataException. – T70526
  3. JsonConfig: Unable to diff certain “.map” pages on Commons. – T203063
  4. MediaWiki (Parser): Parse API exposes fatal content model error. – T206253
  5. MediaWiki (Special-pages): Special:DoubleRedirects unavailable on ttwiki. – T204800
  6. MobileFrontend: Some Special:MobileDiff urls fatal. – T156293
  7. ProofreadPage: Unable to edit certain pages on Wikisource. – T176196
  8. Translate: Some Special:Translate urls fatal. – T204833
  9. Wikibase: Clicking “undo” for some revisions fatals with a PatcherException. – T97146

Public user requests resulting in fatals can cause, and have caused, alerts to fire that notify SRE of wikis potentially being less available or down.

💡 ProTip:

Cross-reference one workboard with another via Open Tasks → Advanced Filter and enter Tag(s) to apply as a filter.

🎉 Thank you

Thank you to everyone who helped by reporting or investigating problems in Wikimedia production, and by implementing or reviewing their solutions, including: @tstarling, @thiemowmde, @thcipriani, @Tgr, @Steinsplitter, @Quiddity, @pmiazga, @Nikerabbit, @Mvolz, @Lucas_Werkmeister_WMDE, @kostajh, @jrbs, @JJMC89, @Jdforrester-WMF, @hashar, @Gilles, @Daimona, @Ciencia_Al_Poder, @Catrope, @BPirkle, @Barkeep49, @Anomie, and @Aklapper.

Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. – phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. – phabricator.wikimedia.org/maniphest/query...
[4] Partial blocks. – meta.wikimedia.org/wiki/Community_health_initiative
[5] Bug report about page deletion, 2007. – T13402

Written by Krinkle on Dec 12 2018, 4:40 AM.
Principal Engineer (Wikimedia Performance)
