
Separate clusters for asynchronous processing from the ones for public consumption
Closed, Resolved · Public

Description

Since the introduction of RESTBase, we've been processing asynchronous events (like storing a new revision in RESTBase after an edit) by querying Parsoid and the MediaWiki API on the same cluster that serves the public, i.e. the one that backs, among other things, all our anti-vandalism tools and VisualEditor.

This has caused several incidents, minor and major, either by triggering obscure bugs or simply by overwhelming the cluster.

While perfect separation of concerns is impossible (after all, everything operates on the same database), a decent amount of it is absolutely mandatory:

a failure/overload of the async processing pattern should not impact the user experience.

The current plan includes the following steps:

  • Enable TLS termination on all the MediaWiki clusters, so that cross-DC queries can be encrypted
  • Create new cluster load balancers for the API HTTPS cluster
  • Configure ChangePropagation to call RESTBase in the inactive datacenter
  • Check that RESTBase in the inactive DC calls Parsoid in the same DC
  • Make Parsoid call the MW API via HTTPS from the inactive DC
  • Configure load balancers so that the main API LB has all the appservers as backends, while the async-API LB only includes roughly 50% of them. This means the non-shared servers will be able to serve live traffic even if the remaining 50% become unable to due to overload. This number can be adjusted further in the future (see the sketch below).
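To make that last point concrete, here is a minimal sketch of the intended split (hypothetical host names and counts; the real pool membership lives in conftool/puppet, not in code like this):

# Hypothetical appserver names, for illustration only.
appservers = ["mw%d.eqiad.wmnet" % n for n in range(1200, 1232)]  # 32 hosts

pools = {
    # the public API load balancer can send traffic to every appserver
    "api-https": set(appservers),
    # the async load balancer only gets roughly 50% of them
    "api-async": set(appservers[:len(appservers) // 2]),
}

# If the shared half is overloaded by async processing, the other half
# still serves live traffic untouched.
live_only = pools["api-https"] - pools["api-async"]
print("%d appservers keep serving users even if the async half is overloaded" % len(live_only))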

Event Timeline

mobrovac subscribed.

On the services side, we could probably move the update flow (RB, Parsoid, MobileApps et al.) to the non-active DC, so that the active DC only serves live traffic (Cassandra allows us to do that). However, the case of MW is trickier since it might suffer from replication lag.

There are pros & cons for dividing the API cluster in multiple sub-clusters. The big advantage is indeed increased isolation between jobs of different urgency. Downsides are increased complexity, and less peak capacity for either kind of job. If we divided clusters by average load, a smaller burst in public API requests could impact public API availability, while a unified cluster would have had enough spare capacity to handle such a burst.

@mobrovac's proposal of using the inactive DC avoids some of the downsides, but we would need to improve our handling of replication lag, as the average replication lag would be higher. Higher replication lag would also increase the chance of users requesting new content before a changeprop update finished, which would lower performance for users. We have several clients following RCStream and requesting new revisions soon after each edit, and human users frequently re-open VisualEditor right after saving an edit.

In any case, we need to make our public APIs more robust by enforcing sensible request rate and resource consumption limits (ex: T97192). This reduces the risk of issues from both external & internal requests, without incurring some of the issues of separate clusters.
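As an illustration of the kind of limit meant here, a minimal token-bucket sketch (this is not the actual limiter discussed in T97192, and the rates are made up):

import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts of up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=10, burst=20)  # e.g. 10 req/s per client, bursts of 20
if not limiter.allow():
    pass  # reject with HTTP 429 instead of doing the expensive work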

There are pros & cons for dividing the API cluster in multiple sub-clusters. The big advantage is indeed increased isolation between jobs of different urgency. Downsides are increased complexity, and less peak capacity for either kind of job. If we divided clusters by average load, a smaller burst in public API requests could impact public API availability, while a unified cluster would have had enough spare capacity to handle such a burst.

You miss the point: I'm OK with having an underpowered cluster for async processing and forcing applications to throttle their async processing rate to match the smaller scale of that cluster. Also, I'd make it easy to move a server from one cluster to the other, or to have it in both, in case of need.

Also, the downside you present is mostly theoretical (unless we're referring to deliberate attacks, and there I doubt this would make a big difference), while the outages caused by async processing hammering the API have been all too real over the last couple of years.

But even leaving aside the practicalities of our current cluster: separating async processing from real-time processing is the right thing to do in general, even though we're aware that the DB layer will still be shared.

@mobrovac's proposal of using the inactive DC avoids some of the downsides, but we would need to improve our handling of replication lag, as the average replication lag would be higher. Higher replication lag would also increase the chance of users requesting new content before a changeprop update finished, which would lower performance for users. We have several clients following RCStream and requesting new revisions soon after each edit, and human users frequently re-open VisualEditor right after saving an edit.

Using the inactive DC for batch processing can be a good idea for the purely stateless stratum, i.e. RESTBase and Parsoid, provided we can enable TLS to MediaWiki, which needs to be called in the active DC.

In any case, we need to make our public APIs more robust by enforcing sensible request rate and resource consumption limits (ex: T97192). This reduces the risk of issues from both external & internal requests, without incurring some of the issues of separate clusters.

I think this is a misrepresentation of the issue in T151702; while I agree with the broad statement that MediaWiki can do better in that respect, the issue we're seeing here is excessive request rate from Change Propagation and probably some bug in its concurrency control.

So, assuming parsoid can do TLS to its backend (I'll check that), my proposed plan would be:

  1. Create a new load-balanced cluster, "batch-api", for MediaWiki; possibly include the jobrunners in this pool with lower weight. This pool should include about 25% of the current API cluster and the servers should be easily pooled/depooled from one cluster to the other
  2. I'm not sure what the Kafka replication status to codfw is; if it's set up, make ChangeProp work from codfw, call RESTBase there, call Parsoid there, and have Parsoid call the batch API cluster in eqiad via TLS (a rough sketch of the consumer side follows).
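For the second point, a rough sketch of what consuming the mirrored events from codfw could look like (illustrative only: ChangeProp itself is a Node.js service configured in YAML, and the broker and topic names below are hypothetical):

from kafka import KafkaConsumer  # kafka-python
import json

consumer = KafkaConsumer(
    "eqiad.mediawiki.revision-create",                  # hypothetical mirrored topic
    bootstrap_servers=["kafka2001.codfw.wmnet:9092"],   # hypothetical codfw broker
    group_id="changeprop-batch",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # here ChangeProp would call RESTBase in codfw, which calls Parsoid in codfw,
    # which in turn calls the batch API cluster in eqiad over TLS
    print(event.get("meta", {}).get("uri"))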

This is quite a lot of work, but I think it's the correct way to tackle this without having to split up all the clusters. Some attention should be put into puppetizing everything in a way that allows us to manage the DC switchover easily.

It turns out Parsoid's way of contacting its backends wasn't able to support TLS termination, but it's easy enough to add (see https://gerrit.wikimedia.org/r/#/c/325550/).

So, it would be possible to have ChangeProp consume events from the codfw Kafka cluster, right?

Change 325591 had a related patch set uploaded (by Giuseppe Lavagetto):
role::mediawiki::webserver: add TLS local proxy

https://gerrit.wikimedia.org/r/325591

the issue we're seeing here is excessive request rate from Change Propagation and probably some bug in its concurrency control.

Could you describe how you came to this conclusion in T151702? The data I have seen (summarized in T151702#2841851) leads me to believe that concurrency limiting worked, and request rates during the outage were low.

the issue we're seeing here is excessive request rate from Change Propagation and probably some bug in its concurrency control.

Could you describe how you came to this conclusion in T151702? The data I have seen (summarized in T151702#2841851) leads me to believe that concurrency limiting worked, and request rates during the outage were low.

I thought my preceding comments in T151702 made that pretty self-evident; specifically, you might have missed https://phabricator.wikimedia.org/T151702#2841177, where I grepped the access logs on all servers to be sure we weren't missing anything: it shows pretty clearly that that specific transclusion originated a spike of 23k requests in a single minute coming from change propagation, with an initial rate of more than 1k/s.

This either means that the concurrency control has a bug, or that it's dramatically misconfigured, or both.

@Joe, the data you collected in T151702#2841177 was about requests from Parsoid to the MW API, which were originally triggered by ChangeProp requests. This distinction is important for figuring out where the bursts came from.

Here is the request rate from RESTBase to Parsoid during the second outage:

pasted_file (1×1 px, 282 KB)

And here is the request rate from ChangeProp to RESTBase:

pasted_file (1×1 px, 284 KB)

Finally, the incoming request rate in Parsoid:

pasted_file (1×1 px, 169 KB)

I think this data does not support the interpretation that there was a spike of requests from ChangeProp itself, or that concurrency limiting did not work. It does suggest that retries between Parsoid and MediaWiki played a role. Parsoid also logged API request timeouts in the original outage, which means that it retried those requests.

As for the overall rates: At a configured concurrency limit of 400, and typical processing times of 0.5 to 1.5s, request rates between ~250 and ~800 requests per second are normal. Longer term averages should be around 300-400/s.

Please also have a look at T151702#2841851, which shows request rates and latencies during the actual outage. Request rates dropped to ~10-30/s, in line with the increase in latency at constant concurrency.
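(For reference, the arithmetic here is just Little's law: throughput ≈ concurrency / per-request latency. A quick sketch using the concurrency limit of 400 from above and assumed latencies; nothing below is measured data.)

concurrency = 400                       # configured ChangeProp concurrency limit
for latency in (0.5, 1.5, 15.0, 40.0):  # assumed seconds per request
    print("%5.1f s/request -> ~%.0f req/s" % (latency, concurrency / latency))
# 0.5s and 1.5s per request bracket the "normal" ~250-800 req/s range;
# 15-40s per request gives roughly the ~10-30 req/s seen during the outage.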

  1. Both are "actual outages".
  2. Is it possible that Grafana only records successful requests to Parsoid? There doesn't seem to be any failure recorded in those graphs, which is very strange for a period when Parsoid was basically not responding.

I've extracted the data about requests to Parsoid from its logs as follows:

  • Messages for user-agent ChangePropagation/WMF
  • For wiki euwiki
  • with longMsg started parsing
  • with reqId 3ff62f51-cd11-4b44-98e4-6a6aa608b600

and that shows ~6K hits/minute in the Parsoid logs. Aren't those log entries 1:1 with requests coming from ChangeProp?

See an example:

$ sudo salt --output=raw 'wtp1*' cmd.run 'zgrep "2016-12-02T07:51" /srv/log/parsoid/main.log.4.gz | jq ". | select(.userAgent == \"ChangePropagation/WMF\") | select(.wiki == \"euwiki\") | select(.longMsg == \"started parsing\") | select(.reqId == \"3ff62f51-cd11-4b44-98e4-6a6aa608b600\") | .reqId " | wc -l ' | cut -f 4 -d \' | awk '{sum+=$1} END { print sum }'
7519

The same selection yields 6,472 requests at 7:50, so this says, if I'm not misinterpreting the logs, that Parsoid received ~14K requests from ChangePropagation for that specific request ID in the span of 2 minutes. Given that all those requests are very expensive to parse, it seems unlikely to me that this is due to retries.

Do you see any flaw in that line of reasoning @ssastry @GWicke?

EDIT: the preceding grep was flawed; I've fixed the values. The data still shows a huge number of requests, which is only moderately multiplied towards MediaWiki.

This comment was removed by Joe.

@Joe, for every Parsoid parse request, Parsoid could make multiple MW API requests, so there is that to factor in. But, separately, given your back and forth, I got curious and did some log digging on my own. I looked at euwiki 'started parsing' requests on wtp1001 for the time period in question: 7:45 - 8:50 UTC, Dec 2, 2016. I am going to upload that in a bit.

This is just from one server, wtp1001, so multiply by ~24 to estimate how many requests Parsoid received across the entire cluster.

<#reqs, time>
ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | sed 's/.*"time":"2016-12-02T0//g;s/Z",.*$//g;' | cut -f1-2 -d":" | uniq -c 
 ... removed a bunch of small requests from earlier ...   
      1 7:43
    273 7:50
    308 7:51
    175 7:52
     61 7:53
     36 7:54
     41 7:55
     42 7:56
     35 7:57
     37 7:58
     34 7:59
     34 8:00
     42 8:01
     38 8:02
     36 8:03
     38 8:04
     31 8:05
     25 8:06
     32 8:07
     25 8:08
     25 8:09
     28 8:10
     23 8:11
     33 8:12
     31 8:13
     44 8:14
     25 8:15
     27 8:16
     25 8:17
     29 8:18
     22 8:19
     27 8:20
     29 8:21
     25 8:22
     27 8:23
     34 8:24
     24 8:25
     29 8:26
     28 8:27
     27 8:28
     35 8:29
     24 8:30
     14 8:31
     30 8:32
     19 8:33
     34 8:34
     27 8:35
     27 8:36
     25 8:37
     22 8:38
     33 8:39
     40 8:40
     37 8:41
     44 8:42
    164 8:43
    244 8:44
    329 8:45
    407 8:46
    338 8:47
    445 8:48
    448 8:49
    318 8:50
      4 8:51
      1 8:59
... removed a bunch of small requests from later ...

<title,oldid> from 7:50 -- 7:59
ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | grep "T07:5" | cut -d"," -f7,8 | sort | uniq -c | wc
   1041    2082   49501

<title,oldid> from 8:00 -- 8:59
ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | grep "T08:" | cut -d"," -f7,8 | sort | uniq -c | wc
   3967    7934  202966

<user-agent> from 7:50 -- 7:59
ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | grep "T07:5" | cut -d"," -f10 | sort | uniq -c 
   1041 "userAgent":"ChangePropagation/WMF"
      1 "userAgent":"WMF Mobile Content Service"

<user-agent> from 8:00 -- 8:59
ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | grep "T08:" | cut -d"," -f10 | sort | uniq -c 
   3962 "userAgent":"ChangePropagation/WMF"
      6 "userAgent":"http-kit/2.0"
      1 "userAgent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML
      1 "userAgent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML
      1 "userAgent":"mw-ocg-bundler/1.3.1-git/unknown"
      1 "userAgent":"REST-API-Crawler-Google/1.0 (+wikidata-external@google.com)"

Oops, looks like @Joe and I were cranking logs at the same time. :) But we have two independent sets of numbers to cross-check, so not a bad thing. :)

ssastry@wtp1001:/srv/log/parsoid$ gunzip < main.log.4.gz | grep euwiki | grep 'started parsing' | grep "T07:5" | cut -d"," -f7,8 | sort | uniq -c | sort -nr | head -2
      2 "title":"Le_Puy_(Doubs)","oldId":5114860
      1 "title":"Zundert","oldId":5220183

But one thing my analysis shows is that there was just 1 retry from RB to Parsoid on wtp1001 itself; to get a truer picture, you would have to collate the logs from all 24 servers and look at that collated log.

@ssastry's data seems to roughly concur with grafana's. ~30 req/minute per node works out to about 12 req/s across the 24 nodes of the cluster, and the log data is filtered for euwiki only (a subset of all requests).

@GWicke it's more like 10 times that number (300 req/min/node) and about 120 reqs/s/node. And @ssastry's numbers coincide with mine now.

@GWicke to be very clear, it's the shower of requests during those few minutes that caused the issue on the API cluster, so the concurrency of requests during that initial spike is the issue. The next hour or so is mostly irrelevant to understanding whether the concurrency limits are effective in protecting us from an auto-DDoS.

I think we are all on the same page now. I agree that it is that initial burst of template reparse requests that is the issue. And, that is the reason for my request in T151702#2841660 which Services is working on in T152229: Interleave processing of backlink jobs.

I think an effective mitigation would be for ChangePropagation to spread out the requests coming from a single transclusion, i.e. rate-limit them, so that if rendering a specific template is very expensive, the expensive requests to the API are spread over time.
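A minimal sketch of what that could look like (not ChangeProp's actual implementation; T152229 tracks the real work, and the names below are made up):

import time

def spread_reparses(pages, max_per_second, reparse):
    """Trigger reparse(page) for every page transcluding an edited template,
    pacing the calls so the expensive renders don't all hit the API at once."""
    interval = 1.0 / max_per_second
    for page in pages:
        reparse(page)
        time.sleep(interval)

# e.g. 40,000 transclusions at 20/s arrive over ~33 minutes instead of as one burst:
# spread_reparses(backlinks_of("Template:Expensive"), max_per_second=20, reparse=enqueue_rerender)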

But on a more general note: we're wildly off-topic here. I don't see how separating the processing of such async requests from the API cluster serving user-generated traffic can be a bad idea, given this is not the first time async processing has caused more general issues. And this ticket is about separating concerns between the two classes of traffic.

Also, note that we could even do as follows: leave all servers in the API cluster active in both pools, and just depool a bunch from the one used by async processing if we see increased pressure. This would let the load balancers route the active users' traffic to the servers no longer receiving the async traffic.

Also, note that we could even do as follows: leave all servers in the API cluster active in both pools, and just depool a bunch from the one used by async processing if we see increased pressure. This would let the load balancers route the active users' traffic to the servers no longer receiving the async traffic.

I don't have a strong opinion about this ticket since I don't yet understand all the implications, but this proposal seems reasonable to me.

@GWicke it's more like 10 times that number (300 req/min/node) and about 120 reqs/s/node. And @ssastry's numbers coincide with mine now.

300 * 24 / 60 = 120 req/s across the Parsoid cluster, before the slow-down during the outage

Both @ssastry's and @Joe's log data shows lower request rates than grafana, but the logs are filtered for euwiki only. The overall steady state request rate to Parsoid is higher, around 400 req/s. Grafana shows higher values throughout the period (both before the slow-down, and during it), which is why I said the numbers are consistent with the log data. If you look at all logs, I expect the rates to be the same.

Could you point out where either the logs or grafana show a spike in request rates from ChangeProp or RESTBase to Parsoid?

Could you point out where either the logs or grafana show a spike in request rates from ChangeProp or RESTBase to Parsoid?

@GWicke You can see the rate of CP originated requests on this graph: https://grafana.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen&from=now-7d&to=now

Choose the on-transclusion-update_exec topic - each message processed in that topic results in a Parsoid reparse request. It is a bit bursty, but let's see if implementation of T152229 would make it any better.

The CP processing graph for the second outage referenced by @Joe agrees with the RB and Parsoid graphs:

pasted_file (1×1 px, 315 KB)

Retry processing:

pasted_file (1×1 px, 340 KB)

No sign of bursts.

@GWicke, at this point I think you and @Joe are using the word "burst" a bit differently. In any case, I think T152074#2854369 captures my understanding of what you two are talking about. Joe's observation is that the "burst of expensive parse requests" (though within the normal processing and concurrency rate, as you observe, hence not a burst in the number of parse requests) in that 3-minute period is the cause of the incident. You are both right for your individual definitions of "burst". In any case, it seems clear to me that a large batch of expensive parse requests within a short period of time caused the Dec 2 incident.

@ssastry, I was responding to comments like this one:

@GWicke to be very clear, it's the shower of requests during those few minutes that caused the issue on the API cluster, so the concurrency of requests during that initial spike is the issue. The next hour or so is mostly irrelevant to understanding whether the concurrency limits are effective in protecting us from an auto-DDoS.

To expand on previous comments, my interpretation of the data is as follows:

  • Requests from ChangeProp to RB, and from RB to Parsoid slowed down proportional to individual request latencies. Request rates between ChangeProp / RESTBase, and RESTBase / Parsoid matched expectations. Logs concur with grafana request rates. Overall, this suggests that concurrency limits in ChangeProp kept the number of concurrent requests from ChangeProp through RB to Parsoid constant.
  • Retries in ChangeProp or RESTBase were rare before / during the outages.
  • Something caused the PHP API to experience especially high load from specific requests.

Assuming timeouts work as expected, more expensive requests at constant concurrency should not cause significantly higher load on the PHP API. So, clearly something else is going on here. We should figure out what that is, and fix it.

I offered some possibilities:

  • If requests in the PHP API did not actually terminate after the 60 second response time limit, then this would cause requests in HHVM to pile up.
    • T97192 discusses issues with how time limits are enforced in different versions of HHVM. @Anomie's last comment in T97192#2678349 indicates that request time limits are still not working in Vagrant.
    • Parsoid's request timeout matched HHVM's response time limit, so the fact that it saw a client timeout does not tell us much about whether the HHVM processing was in fact aborted a split second later.
    • It seems that HHVM nodes were swamped & didn't recover by themselves. This might point towards time limits not working as expected.
  • Based on the code, the previous timeouts configured in Parsoid would have had Parsoid retry timed-out expensive MW API requests once. If time limits worked in the PHP API, then this would not have made a difference to the PHP API cluster load. However, if PHP time limits were not working promptly, then this would have contributed to requests piling up in HHVM.

We increased the Parsoid -> MW API timeout to 65s as part of T152073. If MW API timeouts do indeed fire at 60s, then this should eliminate any client timeouts (and related retries) for requests from Parsoid to the MW API. This is something we can easily check for in the logs, so we should soon know whether MediaWiki API response time limits do indeed work in production.
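To illustrate why the relative ordering of the two timeouts matters, a toy model (only the 60s and 65s limits come from the discussion above; the rest is hypothetical, and this is not Parsoid's actual client code):

SERVER_LIMIT = 60.0  # HHVM response time limit, seconds

def client_outcome(client_timeout, server_limit_works, request_cost):
    """What the Parsoid-side HTTP client observes for one expensive API request."""
    if server_limit_works and request_cost > SERVER_LIMIT and SERVER_LIMIT < client_timeout:
        return "5xx from HHVM after %ds (no client timeout, no retry)" % SERVER_LIMIT
    if request_cost > client_timeout:
        return "etimedout after %ds (client may retry while the server keeps working)" % client_timeout
    return "200 after %ds" % request_cost

print(client_outcome(60, True, 90))   # old config: client gives up first -> retries possible
print(client_outcome(65, True, 90))   # new config: HHVM errors out first -> no retry
print(client_outcome(65, False, 90))  # if etimedout persists at 65s, the 60s limit isn't firing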

Edit: A Parsoid dashboard showing API request client timeouts is available at https://logstash.wikimedia.org/goto/23508ee86e2ba94dd24f2e55f7655628. Once the Parsoid config change to increase the API request timeout to 65s is deployed, those timeouts should basically disappear. Instead, HHVM is expected to return an error such as 500 or 503, which would no longer be logged as etimedout in Parsoid. @ssastry @Arlolra, could you update this task once that config change has been deployed?

According to https://www.mediawiki.org/wiki/Parsoid/Deployments#Wednesday.2C_December_7.2C_2016_around_1:15pm_PT:_Y.C2.A0Deployed_3cf19c6b it seems that the MediaWiki API timeout change was indeed deployed Wednesday, December 7, 2016 around 1:15pm PT.

The rate of Parsoid to MediaWiki API request client timeouts does not seem to have changed with the deploy. Average rate is around 330 timeouts per hour:

pasted_file (2×1 px, 1 MB)

Unless there is some other reason for these timeouts (such as a broken API server that is not depooled), this suggests that HHVM response timeouts do not trigger reliably after 60s, and before Parsoid times out the request at 65s.

See also: Code emitting the etimedout log line

Change 325591 merged by Giuseppe Lavagetto:
role::mediawiki::webserver: add TLS local proxy

https://gerrit.wikimedia.org/r/325591

Joe updated the task description. (Show Details)

Thanks to the invaluable help from @mobrovac, we're now serving the traffic for CP in eqiad via the chain of calls we wanted.

Tomorrow I'll start reserving some of the API servers just for live traffic.

Mentioned in SAL (#wikimedia-operations) [2017-01-18T07:32:56Z] <_joe_> depooling mw1226-mw1235 from the https pool in eqiad, T152074