Production Excellence #19: February 2020
Monthly update on our drive for operational excellence.

How’d we do in our drive for operational excellence last month? Read on to find out!

📊  Month in numbers
  • 8 documented incidents. [1]
  • 27 new Wikimedia-prod-error reports. [2]
  • 26 Wikimedia-prod-error reports closed. [3]
  • 199 currently open Wikimedia-prod-error reports in total. [4]

With a median of 4–5 documented incidents per month over the last three years, this past month’s 8 incidents were well above the norm.

To read more about these incidents and pending actionables, check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).

📖  Unset vs array splice

Our error monitor (Logstash) received numerous reports about an “Undefined offset” error from the OATHAuth extension. This extension powers the Two-factor auth (2FA) login interface on Wikipedia.

@ItSpiderman and @Reedy investigated the problem. The error message:

PHP Notice: Undefined offset: 8
at /srv/mediawiki/extensions/OATHAuth/src/Key/TOTPKey.php:188

This error means that the code was accessing item number 8 from a list (an array), but the item does not exist. Normally, when a “2FA scratch token” is used, we remove it from a list, and save the remaining list for next time.

The code used the count() function to compute the length of the list, and used a for-loop to iterate through the list. When the code found the user’s token, it used the unset( $list[$num] ) operation to remove token $num from the list, and then saved $list for next time.

The problem with removing a list item in this way is that it leaves a “gap”. Imagine a list with 4 items, like [ 1: …, 2: …, 3: …, 4: … ]. If we unset item 2, the remaining list is [ 1: …, 3: …, 4: … ]. The next time we check this list, its length is 3 (so far so good!), but the for-loop will access items 1, 2, and 3. The code does not know that 3 comes after 1, so it hits an error because item 2 does not exist. And it never even looks at item 4!

When a user used their first ever scratch token, everything worked fine. But from their second token onwards, the tokens could be rejected as “wrong” because the code was not able to find them.
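
The failure mode described above can be reproduced in a few lines of PHP. This is a minimal sketch, not the actual OATHAuth code: the token values and the helper name useTokenWithUnset are made up for illustration, the keys start at 0 as PHP lists normally do, and an isset() guard is added so the sketch runs without emitting the notice.

```php
<?php
// Hypothetical stand-in for a user's stored scratch tokens.
$tokens = [ 'aaa', 'bbb', 'ccc', 'ddd' ];

// Hypothetical helper that mimics the buggy pattern:
// count() for the length, a for-loop over 0..length-1, and unset() to remove.
function useTokenWithUnset( array &$tokens, string $input ): bool {
    $length = count( $tokens );
    for ( $i = 0; $i < $length; $i++ ) {
        // isset() is only here to keep the sketch free of notices.
        if ( isset( $tokens[$i] ) && $tokens[$i] === $input ) {
            unset( $tokens[$i] ); // leaves a gap; keys are not renumbered
            return true;
        }
    }
    return false;
}

$first = useTokenWithUnset( $tokens, 'bbb' );  // true: found at key 1
$keysAfterFirst = array_keys( $tokens );       // [ 0, 2, 3 ]
$second = useTokenWithUnset( $tokens, 'ddd' ); // false: loop only checks keys 0, 1, 2

var_dump( $first, $keysAfterFirst, $second );
```

The second call returns false even though 'ddd' is still stored, because the list now has 3 items and the loop stops before reaching key 3.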

To avoid this bug, we changed the code to use array_splice( $list, $num, 1 ) instead of unset( $list[$num] ). The important thing about array_splice is that it renumbers the items in the list, leaving no gaps.
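
To make the difference concrete, here is a small side-by-side comparison. It is an illustrative sketch with made-up token values, not the code from the actual patch.

```php
<?php
// Hypothetical token list for illustration.
$tokens = [ 'aaa', 'bbb', 'ccc', 'ddd' ];

// Removing with unset(): the remaining keys keep their old numbers.
$viaUnset = $tokens;
unset( $viaUnset[1] );
// keys are now 0, 2, 3

// Removing with array_splice(): the remaining items are renumbered.
$viaSplice = $tokens;
array_splice( $viaSplice, 1, 1 );
// keys are now 0, 1, 2

var_dump( array_keys( $viaUnset ), array_keys( $viaSplice ) );
```

One detail worth keeping in mind: the second argument of array_splice is a position in the list rather than a key. For a list without gaps the two are the same, which is the case once every removal goes through array_splice.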

T244308 / https://gerrit.wikimedia.org/r/570253

📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.


Breakdown of recent months:

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: 4 of 14 left (unchanged).
  • May: (All clear!)
  • June: 4 of 11 left (unchanged).
  • July: 8 of 18 left (unchanged).
  • August: Two reports closed! 2 of 14 reports left.
  • September: One report closed, 7 of 12 left.
  • October: Two reports closed, 6 of 12 left.
  • November: 5 of 5 left (unchanged).
  • December: 6 of 9 left (unchanged).
  • January: One report closed, 6 of 7 reports left.
  • February: 7 new reports survived the month of February.

Last month, the total across recent months was 57 open reports. Of those, 6 were closed, but with 7 new reports from February still open, the total is now up to 58 open reports.

🎉  Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Together, we’re getting there!

Until next time,

– Timo Tijhof

[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Mar 24 2020, 9:40 PM.
Principal Engineer (Performance)