Production Excellence #16: October 2019
Monthly update on our strive for operational excellence.

How’d we do in our strive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 3 documented incidents. [1]
  • 33 new Wikimedia-prod-error reports. [2]
  • 30 Wikimedia-prod-error reports closed. [3]
  • 207 currently open Wikimedia-prod-error reports in total. [4]

There were three recorded incidents last month, which is slightly below our median of the past two years (Explore this data). To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.

📖 To Log or not To Log

MediaWiki uses the PSR-3 compliant Monolog library to send messages to Logstash (via rsyslog and Kafka). These messages are used to automatically detect (by quantity) when the production cluster is in an unstable state. For example, due to an increase in application errors when deploying code, or if a backend system is failing. Two distinct issues hampered the storing of these messages this month, and both affected us simultaneously.

Elasticsearch mapping limit

The Elasticsearch storage behind Logstash optimises responses to Logstash queries with an index. This index has an upper limit to how many distinct fields (or columns) it can have. When reached, messages with fields not yet in the index are discarded. Our Logstash indexes are sharded by date and source (one for “mediawiki”, one for “syslog”, and one for everthing else).

This meant that error messages were only stored if they only contained fields used before, by other errors stored that day. Which in turn would only succeed if that day’s columns weren’t already fully taken. A seemingly random subset of error messages was then rejected for a full day. Each day it got a new chance at reserving its columns, so long as the specific kind of error is triggered early enough.

To unblock deployment automation and monitoring of MediaWiki, an interim solution was devised. The subset of messages from “mediawiki” that deal with application errors now have their own index shard. These error reports follow a consistent structure, and contain no free-form context fields. As such, this index (hopefully) can’t reach its mapping limit or suffer message loss.

The general index mapping limit was also raised from 1000 to 2000. For now that means we’re not dropping any non-critical/debug messages. More information about the incident at T234564. The general issue with accommodating debug messages in Logstash long-term, is tracked at T180051. Thanks @matmarex, @hashar, and @herron.

Crash handling

Wikimedia’s PHP configuration has a “crash handler” that kicks in if everything else fails. For example, when the memory limit or execution timeout is reached, or if some crucial part of MediaWiki fails very early on. In that case our crash handler renders a Wikimedia-branded system error page (separate from MediaWiki and its skins). It also increments a counter metric for monitoring purposes, and sends a detailed report to Logstash. In migrating the crash handler from HHVM to PHP7, one part of the puzzle was forgotten. Namely the Logstash configuration that forwards these reports from php-fpm’s syslog channel to the one for mediawiki.

As such, our deployment automation and several Logstash dashboards were blind to a subset of potential fatal errors for a few days. Regressions during that week were instead found by manually digging through the raw feed of the php-fpm channel instead. As a temporary measure, Scap was updated to consider the php-fpm’s channel as well in its automation that decides whether a deployment is “green”.

We’ve created new Logstash configurations that forward PHP7 crashes in a similar way as we did for HHVM in the past. Bookmarked MW dashboards/queries you have for Logstash now provide a complete picture once again. Thanks @jijiki and @colewhite! – T234283

📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.


Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • March: 1 report fixed. (3 of 10 reports left).
  • April: 8 of 14 reports left (unchanged). ⚠️
  • May: (All clear!)
  • June: 9 of 11 reports left (unchanged). ⚠️
  • July: 13 of 18 reports left (unchanged).
  • August: 2 reports were fixed! (6 of 14 reports left).
  • September: 2 reports were fixed! (10 of 12 new reports left).
  • October: 12 new reports survived the month of October.

🎉 Thanks!

Thank you, to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

🌴“Gotta love crab. In time, too. I couldn't take much more of those coconuts. Coconut milk is a natural laxative. That's something Gilligan never told us.

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Nov 8 2019, 5:57 AM.
Principal Engineer (WMF Performance Team)
hashar, matmarex, herron and 2 others

Event Timeline