Production Excellence #18: January 2020
Monthly update on our striving for operational excellence.

How’d we do in striving for operational excellence last month? Read on to find out!

📊  Month in numbers
  • 3 documented incidents. [1]
  • 26 new Wikimedia-prod-error reports. [2]
  • 26 Wikimedia-prod-error reports closed. [3]
  • 198 currently open Wikimedia-prod-error reports in total. [4]

To read more about these incidents and pending actionables, check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).

📖  Paradoxical array key

During the upgrade from HHVM to PHP 7.2, Wikimedia encountered several Zend engine bugs that could corrupt a PHP program at run-time. (Some of these bugs are still being worked on.) One of the bugs we fixed last month was particularly mysterious. The investigation was led by @hashar and @tstarling.

MediaWiki would create an array in PHP and add a key-value pair to it. We could iterate this array, and see that our key was there. Moments later, if we tried to retrieve the key from that same array, sometimes the key would no longer exist!

After many ad-hoc debug logs, core dumps, and GDB sessions, the problem was tracked down to the string interning system of Zend PHP. String interning is a memory reduction technique. It means we only store one copy of a character sequence in RAM, even if many parts of the code use the same character sequence. For example, the words “user” and “edit” are frequently used in the MediaWiki codebase. One of those sequences is the empty string (“”), which is also used a lot in our code. This is the string we found disappearing most often from our PHP arrays. This bug affected several components, including Wikibase, the wikimedia/rdbms library, and ResourceLoader.
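To make the idea of interning concrete, here is a minimal sketch in Python. It is a toy model and not how Zend PHP implements it; the `InternTable` class and its `intern` method are names made up for this illustration.

```python
class InternTable:
    """Toy intern table: keeps one shared copy of each distinct string."""

    def __init__(self):
        self._table = {}

    def intern(self, text):
        # Return the stored copy if this character sequence was seen before;
        # otherwise store this copy and return it.
        return self._table.setdefault(text, text)

    def __len__(self):
        return len(self._table)


table = InternTable()

# Build two equal strings at run-time so they start out as separate objects.
a = table.intern("".join(["us", "er"]))
b = table.intern("".join(["u", "ser"]))

print(a is b)      # both names now refer to the same stored copy
print(len(table))  # number of distinct strings kept in the table
```

Every caller that asks for “user” gets a reference to the same copy, which is why it matters that no single caller treats that copy as its own.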

Tim used a hardware watchpoint in GDB, and traced the root cause to the Memcached client for PHP. The php-memcached client would “free” a string directly from the internal memory manager after doing some work. It did this even for “interned” strings that other parts of the program may still be depending on.
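The effect of freeing a shared string can be illustrated with a small Python sketch. This is a simplified model under loud assumptions: `ToyHeap`, `intern`, and `lookup` are hypothetical names, and neither Zend's memory manager nor its arrays work this way internally. It only shows why a key can still be present in an array while no longer being found.

```python
class ToyHeap:
    """Toy memory manager: numbered slots that are reused after being freed."""

    def __init__(self):
        self.slots = []
        self.free_list = []

    def alloc(self, value):
        if self.free_list:
            addr = self.free_list.pop()
            self.slots[addr] = value
        else:
            addr = len(self.slots)
            self.slots.append(value)
        return addr

    def free(self, addr):
        # No ownership check: the slot is recycled even if others still use it.
        self.free_list.append(addr)


heap = ToyHeap()
interned = {}  # character sequence -> address of the single shared copy


def intern(text):
    if text not in interned:
        interned[text] = heap.alloc(text)
    return interned[text]


def lookup(array, text):
    # Find the entry whose key address currently holds the requested string.
    for addr, value in array.items():
        if heap.slots[addr] == text:
            return value
    return None


# The application stores a value under the interned empty string.
key = intern("")
array = {key: "some value"}  # the toy array is keyed by address
before = lookup(array, "")

# A hypothetical client borrows the same interned string, then frees it
# directly from the memory manager once it is done.
borrowed = intern("")
heap.free(borrowed)

# An unrelated allocation reuses the freed slot.
heap.alloc("unrelated data")

after = lookup(array, "")
print(before)      # value found while the shared string was intact
print(len(array))  # the entry is still in the array
print(after)       # but looking it up by the empty string now fails
```

In this model the array never changed. Only the memory behind its key did, which is what made the symptom so hard to trace from the PHP side.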

@jijiki and @Joe backported the upstream fix to our php-memcached package and deployed it to production. Thanks! — T232613

📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.


Breakdown of recent months (past two weeks not included):

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: Two reports closed, 4 of 14 left.
  • May: (All clear!)
  • June: Two reports closed, 4 of 11 left.
  • July: Four reports closed, 8 of 18 left.
  • August: 4 of 14 reports left (unchanged).
  • September: One report closed, 8 of 12 left.
  • October: 8 of 12 reports left (unchanged).
  • November: 5 of 5 reports left (unchanged).
  • December: Three reports closed, 6 of 9 left.
  • January: 7 new reports survived the month.

There are a total of 57 reports filed in recent months that remain open. This is down from 62 last month.

🎉  Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Feb 28 2020, 7:39 PM.
Principal Engineer (WMF Performance Team)
