
Evaluate the consequences of the parsercache being empty post-switchover
Closed, Resolved · Public

Assigned To
Authored By
Joe
Oct 12 2018, 9:05 AM

Description

As explained in T206740, we found out that at the time of a MediaWiki switchover to eqiad, the eqiad parsercaches contained only stale data.

This ticket is to assess which consequences on the whole cluster were caused directly by this event.

Some of the issues observed:

  • MCS (and RESTBase as a consequence) timing out on specific endpoints in the Swagger checks
  • Semi-sporadic 503 spikes
  • A shower of memcached errors
  • Very high CPU usage on the appservers
  • Queued requests in HHVM
  • Sporadic 500s with timeouts ("entire web request took longer than 60 seconds and timed out") in /srv/mediawiki/php-1.32.0-wmf.24/includes/parser/Preprocessor_Hash.php

We need to analyze all of those and see what can be explained by the parsercache fiasco and what cannot.

Event Timeline

First, the timeline:

  • Internal traffic starts flowing through eqiad in the interval 14:14:44 - 14:15:03
  • External traffic starts flowing through eqiad between 14:15:09 and 14:17:29, when it's migrated completely.
  • First alert for MCS timeouts notifies on IRC at 14:17:49. It is conceivable the issue started 3 minutes earlier, at the exact moment of the internal switchover.
  • Some degradation of service for users starts immediately: availability for cache text fires in multiple datacenters from 14:19:20 onwards.
  • First alert for high CPU usage on an API appserver (the only ones with this check) happens at 14:24:39.
  • Alert for MediaWiki memcached error rate fires at 14:23:10
  • Alert for MediaWiki fatals appears at 14:25:50 - this check is known to lag a bit.
  • Intermittent problems continue, in the midst of some action by the SRE team (mostly rebalancing the load for the API and application servers at the LVS level).
  • Most things settle around 14:35, with the last true recovery being the Logstash count of MediaWiki fatals at 14:59:40.
Joe renamed this task from Evaluate the consequences of the parsercache fiaso post-switchover to Evaluate the consequences of the parsercache fiasco post-switchover.Oct 12 2018, 9:48 AM
Joe renamed this task from Evaluate the consequences of the parsercache fiasco post-switchover to Evaluate the consequences of the parsercache being empty post-switchover.Oct 12 2018, 9:51 AM

So first, what I think might be the full root cause of everything: when we switched from codfw to eqiad, the parser cache hit ratio dropped suddenly to 6% in a matter of minutes. At 15:00, when we started a full recovery, the hit rate was still only 18%, but growing steadily.

parsercache.png (367×1 px, 31 KB)
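
To make the mechanism explicit: the parsercache behaves as a read-through cache, so a miss in memcached falls back to the parsercache DB tables, and a miss there forces a full parse on an appserver. A minimal sketch of that pattern (hypothetical names and dict stand-ins, not MediaWiki's actual ParserCache code):

```python
import time

# Dict stand-ins for the two tiers; not MediaWiki's actual ParserCache code.
memcached_tier = {}   # fast, volatile tier (effectively empty right after switchover)
db_tier = {}          # persistent parsercache tables (stale/empty in eqiad)

def expensive_full_parse(page):
    """Stand-in for a full wikitext parse: the CPU-heavy path."""
    time.sleep(0.01)  # pretend this burns appserver CPU for a while
    return f"<html>rendered {page}</html>"

def get_parsed(page):
    # 1. Try the fast tier first.
    html = memcached_tier.get(page)
    if html is not None:
        return html
    # 2. Fall back to the persistent tier.
    html = db_tier.get(page)
    if html is None:
        # 3. Both tiers missed: do the full parse and store it.
        html = expensive_full_parse(page)
        db_tier[page] = html
    # Repopulate the fast tier so the next request hits it.
    memcached_tier[page] = html
    return html

get_parsed("Main_Page")   # cold caches: full parse
get_parsed("Main_Page")   # warm: served straight from the fast tier
```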

With memcached wiped clean, and the parsercache databases basically void of useful content, almost all requests needed to be fully parsed by our application servers. This had a series of consequences:

First of all, the CPU usage of the API cluster

api-cpu.png (255×547 px, 9 KB)

spiked to upwards of 50% on average, which is more than double the normal CPU usage.
A similar effect happened on the appserver cluster

appserver-cpu.png (267×542 px, 9 KB)

This should all be due to the number of CPU-intensive parse jobs we had to run. On some of the more underpowered machines we easily topped 75% CPU usage, as our LVS weights are not based solely on CPU power (it's not the only resource that can be starved).
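
As a back-of-the-envelope illustration of why the more underpowered machines run hotter once the workload turns CPU-bound (the weights, core counts and aggregate CPU demand below are made-up numbers, not the real eqiad fleet):

```python
servers = [
    # (name, lvs_weight, cpu_cores) -- hypothetical values
    ("mw-old", 10, 16),   # older box: fewer cores, similar weight
    ("mw-new", 12, 32),   # newer box: double the cores
]
total_weight = sum(w for _, w, _ in servers)
cluster_cpu_demand = 30.0  # total CPU-seconds of parsing work per second (assumed)

for name, weight, cores in servers:
    share = weight / total_weight                      # fraction of requests LVS sends here
    utilization = (share * cluster_cpu_demand) / cores
    print(f"{name}: {utilization:.0%} CPU")            # the older box saturates first
```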

At the same time, a higher processing time per request meant that, even with a substantially constant request flow,

requests_s.png (244×635 px, 11 KB)

we had higher concurrency on the servers; see the load (that is, the total number of workers currently processing a request) and the queued requests (that is, requests waiting in a queue in HHVM to be picked up by the worker threads) for just the first few minutes of the switchover. As you can see, the situation converged back to more normal numbers quite quickly and we stopped having queued requests:

hhvm_load_queued_api.png (250×1 px, 28 KB)
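
For intuition, here is a rough Little's-law sketch of why a roughly constant request rate combined with a longer per-request service time still pushes concurrency past the worker pool; the request rate, worker count and service times below are assumptions, not measured values:

```python
arrival_rate = 2000        # requests/s across the cluster (assumed)
workers = 48 * 60          # HHVM worker threads: 48 per server x 60 servers (assumed)

for label, service_time in [("warm parsercache", 0.2), ("cold parsercache", 1.5)]:
    in_flight = arrival_rate * service_time   # Little's law: L = lambda * W
    queued = max(0.0, in_flight - workers)    # the excess waits in the HHVM queue
    print(f"{label}: ~{in_flight:.0f} busy workers, ~{queued:.0f} queued")
```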

As far as MediaWiki fatals go, we had far fewer issues than one would expect given the graphs above: only ~1000 fatals due to requests exceeding 60 seconds, and a handful more for other somewhat related causes

fatals.png (177×1 px, 13 KB)

but we had a very high number of memcached errors

memcached_fatals_eqiad.png (196×1 px, 11 KB)

In particular, comparing this with the number of errors we had during the switchover to codfw is worrisome

memcached_fatals_codfw.png (197×1 px, 9 KB)

a difference which cannot be explained by the higher memcached SET traffic caused by so many parsercache values being missing.

Interestingly, a lot of the long queue of errors in that graph is indeed related to parsercache memcached items; I suspect, though, that the higher number of errors is only partially caused by the higher traffic from the parsercache issue, and mostly due to us having fewer servers in eqiad, thus triggering T203786 more severely.
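
To quantify the SET amplification that does come from the parsercache misses (the lookup rate and the "normal" hit ratio below are assumptions; the 6% figure is from the graph above): every miss ends with a write-back, so memcached SET traffic scales with the miss rate rather than with the overall request rate.

```python
lookup_rate = 2000  # parsercache lookups per second (assumed)

for label, hit_ratio in [("normal (assumed ~95%)", 0.95), ("post-switchover (~6%)", 0.06)]:
    set_rate = lookup_rate * (1 - hit_ratio)  # every miss ends in a SET to repopulate
    print(f"{label}: ~{set_rate:.0f} SETs/s")
```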

Overall, the absence of any valid parsercache entries can explain all of the effects we've seen, except, at least partially, the very high volume of memcached errors.

All this seems pretty correct to me and does explain what we've experienced pretty well.

Joe triaged this task as High priority.