Production Excellence: September 2018
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Month in numbers

  • 1 documented incident since August 9. [1]
  • 113 Wikimedia-prod-error tasks closed since August 9. [2]
  • 99 Wikimedia-prod-error tasks created since August 9. [3]

Current problems

Frequent:

  • [MediaWiki-Logging] Exception from Special:Log (public GET). – T201411
  • [Graph] Warning "data error" from ApiGraph in gzdecode. – T184128
  • [RemexHtml] Exception "backtrack_limit exhausted" from search index jobs. – T201184

Other:

  • [MediaWiki-Redirects] Exception from NS_MEDIA redirect (public GET). – T203942

This is an oldie: (Well..., it's an oldie where I come from... 🎸)

  • [FlaggedRevs] Exception from Special:ProblemChanges (since 2011). – T176232

Terminology:

  • An Exception (or fatal error) causes user actions to be aborted. For example, a page would display "Exception: Unable to render page" instead of the article content.
  • A Warning (or non-fatal error) lets a page view finish without the code noticing the problem, but the page may show corrupt or incomplete information. For example, an article would display the word "null" instead of the actual content. Or, a user may be told "You have null new messages."

The combined volume of infrequent non-fatal errors is high. This limits our ability to automatically detect whether a deployment caused problems. The “public GET” risks in particular can cause (and have caused) alerts to fire that notify Operations of wikis potentially being down. Such exceptions must not be publicly exposed.

With that behind us... Let’s celebrate this month’s highlights!

*️⃣ Quiz defect – "0" is not nothing!

Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity uses Quiz for interactive learning. [4] Editors define quizzes in the source text (wikitext). The Quiz program processes this text, creates checkboxes with labels, and sends them to a user. When the sending part failed, "Error: Undefined index" appeared in the logs. @Umherirrender investigated.

A line in the source text can define a question, an answer, or nothing at all. The code that creates checkboxes needs to decide between "something" and "nothing". It used the PHP "if" statement for this, which treats a value as either True or False. Because the answers to a quiz can be any text, PHP first converts the text to one of True or False. In doing so, values like "0" became False. This meant the code thought "0" was not an answer, and created no checkbox for it. The code responsible for sending checkboxes did not have this problem: it did treat "0" as an answer. When it tried to access the checkbox to send, the checkbox did not exist. Hence, "Error: Undefined index".

Umherirrender fixed the problem by using a strict comparison. A strict comparison doesn't transform a value first; it only compares.
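To make the difference concrete, here is a rough sketch in Python. It is not the Quiz extension's code. Python does not treat "0" as false, so the helper `php_truthy` is a hypothetical stand-in that imitates how PHP converts a string to a boolean (the empty string and "0" both become False). The function names and sample lines are made up for illustration.

```python
def php_truthy(value: str) -> bool:
    """Imitate PHP's string-to-boolean conversion: "" and "0" are false."""
    return value not in ("", "0")


def collect_answers_loose(lines):
    # Buggy variant: a plain truthiness check silently drops the answer "0".
    return [text for text in lines if php_truthy(text)]


def collect_answers_strict(lines):
    # Fixed variant: compare strictly against "nothing" (the empty string).
    return [text for text in lines if text != ""]


# Hypothetical quiz lines: three answers and one empty line.
lines = ["42", "0", "", "7"]
loose = collect_answers_loose(lines)
strict = collect_answers_strict(lines)

print(loose)   # ['42', '7']
print(strict)  # ['42', '0', '7']
```

The loose variant ends up with one checkbox fewer than the code that sends them expects, which is the kind of mismatch that surfaces as "Undefined index".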

T196684

*️⃣ PageTriage enters JobQueue for better performance

Kosta Harlan (from Audiences's Growth team) investigated a warning for PageTriage. This extension provides the New Pages Feed tool on the English Wikipedia. Each page in the feed has metadata, usually calculated when an editor creates a page. Sometimes, this is not available. Then, it must be calculated on-demand, when a user triages pages. So far, so good. The information was then saved to the database for re-use by other triagers. This last part caused the serious performance warning: "Unexpected database writes".

Database changes must not happen on page views. The database has many replicas for reading, but only one "master" for all writing. We avoid using the master during page views to make our systems independent. This is a key design principle for MediaWiki performance. [5] It lets a secondary data centre build pages without connecting to the primary (which can be far away).

Kosta addressed the warning by improving the code that saves the calculated information. Instead of saving it immediately, an instruction is now sent via a job queue, after the page view is ready. This job queue then calculates and saves the information to the master database. The master synchronises it to replicas, and then page views can use it.
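The pattern can be sketched roughly as follows. This is a simplified Python illustration, not PageTriage's code. The in-memory `job_queue`, `master_db`, and `replica_db` objects and the `compute_metadata` function are hypothetical stand-ins for the real job queue, databases, and metadata calculation.

```python
from collections import deque

job_queue = deque()  # stand-in for a real job queue
master_db = {}       # stand-in for the single writable master database


def compute_metadata(page_id):
    # Hypothetical placeholder for the on-demand metadata calculation.
    return {"page_id": page_id, "score": page_id * 2}


def handle_page_view(page_id, replica_db):
    """Serve a page view using replicas only; never write to the master here."""
    metadata = replica_db.get(page_id)
    if metadata is None:
        metadata = compute_metadata(page_id)
        # Defer the write: queue an instruction instead of saving right away.
        job_queue.append(page_id)
    return metadata


def run_jobs():
    """Runs outside the page view: calculates and saves to the master."""
    while job_queue:
        page_id = job_queue.popleft()
        master_db[page_id] = compute_metadata(page_id)


result = handle_page_view(5, replica_db={})
writes_during_view = len(master_db)  # 0: the page view did not touch the master
run_jobs()
print(result, writes_during_view, master_db)
```

The page view still gets its metadata immediately, and the write to the master happens later, outside the request.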

T199699 / https://gerrit.wikimedia.org/r/455870

*️⃣ Tomorrow may be sooner than you think

After developers submit code to Gerrit, they eagerly await the result from Jenkins, an automated test runner. It sometimes incorrectly reported a problem with the MergeHistory feature. The code assumed that the tests would finish by "tomorrow".

It might be safe to assume our tests will not take one day to finish. Unfortunately, the programming utility "strtotime" does not interpret "tomorrow" as "this time tomorrow". Instead, it means "the start of tomorrow". In other words, the next strike of midnight! The tests use UTC as the neutral timezone.

Every day in the 15 minutes before 5 PM in San Francisco (which is midnight UTC), code submitted to Code Review could have mysteriously failing tests.
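The gap between the two readings of "tomorrow" is easy to see with a small calculation. The sketch below uses Python's datetime module rather than PHP's strtotime. The helper names and the sample time of 23:50 UTC are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def start_of_tomorrow(now):
    # "tomorrow" as described above: the next strike of midnight.
    next_day = (now + timedelta(days=1)).date()
    return datetime(next_day.year, next_day.month, next_day.day, tzinfo=now.tzinfo)


def this_time_tomorrow(now):
    # What the tests implicitly assumed: a full 24 hours from now.
    return now + timedelta(days=1)


# Hypothetical test run starting ten minutes before midnight UTC.
now = datetime(2018, 9, 1, 23, 50, tzinfo=timezone.utc)

print(start_of_tomorrow(now) - now)   # 0:10:00
print(this_time_tomorrow(now) - now)  # 1 day, 0:00:00
```

A test run that starts late in the UTC day has only minutes, not a day, before "tomorrow" arrives.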

– Continue at https://gerrit.wikimedia.org/r/452873

*️⃣ Continuous Whac-A-Mole

In August, developers started to notice rare and mysterious failures from Jenkins. No obvious cause or solution was known at that time.

Later that month, Dan Duvall (Release Engineering team) started exploring ways to run our tests faster. Before, we had many small virtual servers, where each server runs only one test at a time. The idea: Have a smaller group of much larger virtual servers where each server could run many tests at the same time. We hope that during busier times this will better share the resources between tests. And, during less busy times, allow a single test to use more resources.

As implementation of this idea began, the mysterious test failures became commonplace. "No space left on device" was a common error. The test servers' hard disks were full. This was surprising. The new (larger) servers seemed to have enough space to accommodate the number of tests they ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan identified and resolved two problems:

  1. Some automated tests did not clean up after themselves.
  2. The test-templates were stored on the "root disk" (the hard drive for the operating system), instead of the hard drive with space reserved for tests. This root disk is quite small, and is the same size on small servers and large servers.

T202160 / T202457

🎉 Thanks!

Thank you to everyone who has helped report, investigate, or resolve production errors this past month, including:

Tpt
Ankry
Daimona
Legoktm
Volker_E
Pchelolo
Dan Duvall
Gilles Dubuc
Daniel Kinzler
Umherirrender
Greg Grossmeier
Gergő Tisza (Tgr)
Sam Reed (Reedy)
Giuseppe Lavagetto
Brad Jorsch (Anomie)
Tim Starling (tstarling)
Kosta Harlan (kostajh)
Jaime Crespo (jcrespo)
Antoine Musso (hashar)
Roan Kattouw (Catrope)
Adam WMDE (Addshore)
Stephane Bisson (SBisson)
Niklas Laxström (Nikerabbit)
Thiemo Kreuz (thiemowmde)
Subramanya Sastry (ssastry)
This, that and the other (TTO)
Manuel Aróstegui (Marostegui)
Bartosz Dziewoński (matmarex)
James D. Forrester (Jdforrester-WMF)

Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incidents. – https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
[2] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
[3] Tasks opened. – https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
[4] Quiz on Wikiversity. – https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
[5] Operate multiple datacenters. – https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki

Written by Krinkle on Sep 25 2018, 6:41 PM.
Principal Engineer (Performance)