
[spec] Active-active setup for ORES across datacenters (eqiad, codfw)
Closed, Resolved (Public)

Description

We've been discussing splitting traffic across datacenters in order to have more capacity for ORES.

This task is done when we've identified the strategy we'd like to follow and created the tasks in order to implement it.

Lots of relevant discussion has already taken place at T139372: Set up oresrdb redis node in codfw

@Joe suggested:

So I thought a bit about it and came up with the following alternative solution:

  1. the celery side of the redis for ores MUST NOT be replicated between datacenters
  2. We can set up nutcracker to act as a proxy between ores and the cache redis instances in both datacenters, so that all ores nodes share all the caches, and
  3. set up ipsec between where nutcracker is and the oresrdbXXXX that actually stores the data.

but honestly for now duplicating the precaching jobs to both DCs is ok.

Details

Related Gerrit Patches:
[operations/puppet@production] changeprop: Remove the ores_uri parameter
[mediawiki/services/change-propagation/deploy@master] Use ores_uris instead of ores_uri
[mediawiki/services/change-propagation/deploy@master] Config: Do precaching in both datacenters.
[operations/puppet@production] changeprop: Add an ores_uris parameter

Event Timeline

Halfak created this task. Mar 4 2017, 6:35 PM

FWIW, ORES is one of those services that could potentially work active/active in both DCs, at the cost of an extra RTT latency penalty (around 40ms between codfw and eqiad) on a local cache miss in the DC that is not the active one MediaWiki-wise.

Wrapping up the discussion from T139372, important bullet points seem to be:

  • We MUST NOT replicate the celery database redis part between DCs as this is local to the DC by definition
  • We SHOULD have the local ores cache warmed up/prepopulated before the switchover
  • The cache is possibly sufficiently warmed up by precaching requests generated by changeprop (numbers on this one rely on T159502)
  • We COULD try to split the traffic between the DCs using say twemproxy/nutcracker so that the cache is global
  • We COULD just duplicate the precaching jobs to both DCs.

@mobrovac, is the last point possible? That is, changeprop sending the exact same request to both DCs? It looks like the fastest way forward for the switchover and possibly overall (depends on the numbers from T159502, I suppose).

How about replicating the precaching redis instance across DCs? Would that be feasible? It seems slightly less spaghetti-like than sending double requests in CP, which would be feasible, but ugly as we would need to replicate the current ORES portion of the config for the second DC, i.e. create a new rule for it. Adding to that the fact that we would also need to introduce another Puppet var to account for ORES in the second DC, I would honestly prefer either (a) replicate the pre-caching DB between DCs (note that here strong consistency is not an effective requirement if we know that only one DC will ever produce the requests); or (b) having a kind of switch that redirects requests from CP based on the current active DC.

More generally, making ORES effectively active-active could be easily achievable by integrating it in the REST API and putting it behind RESTBase, since then we could leverage Cassandra's multi-DC support. Just food for thought.

> How about replicating the precaching redis instance across DCs? Would that be feasible? It seems slightly less spaghetti-like than sending double requests in CP, which would be feasible, but ugly as we would need to replicate the current ORES portion of the config for the second DC, i.e. create a new rule for it.

With mirror-maker in place, we can configure the ORES rule in ChangeProp to listen to both datacenter-prefixed topics and set the actual ORES URI based on the datacenter that the change-prop instance is located in.

> Having the mirror-maker we can configure the ORES rule in ChangeProp to listen to both datacenter-prefixed topics and configure the actual ORES URI based on the datacenter the change-prop is located at.

Hm, that would activate CP for the whole ensemble of messages for the other DC. Alternatively, we could set up an ORES topic and send messages via mirror maker, but that still seems like a work-around. And I think we are looking for a proper solution here, not just something for the next DC switch. Correct?

> Hm, that would activate CP for the whole ensemble of messages for the other DC.

We can make the consume_dc config property support per-rule overrides, it's pretty easy to do.
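A rough sketch of what a per-rule override could look like, written in Python purely for illustration (ChangeProp itself is not Python). The config layout, the `consume_dc` list shape, the `<dc>.<topic>` naming, and the topic name are all assumptions made for this example, not the actual ChangeProp schema:

```python
from typing import Dict, List

# Hypothetical global setting: by default each rule consumes only local-DC topics.
GLOBAL_CONFIG: Dict[str, List[str]] = {"consume_dc": ["eqiad"]}


def topics_for_rule(rule: dict, global_config: dict) -> List[str]:
    """Return the DC-prefixed topics a rule should subscribe to.

    A rule-level "consume_dc" wins over the global one, so only the ORES
    rule needs to listen to both datacenters.
    """
    dcs = rule.get("consume_dc", global_config["consume_dc"])
    return [f"{dc}.{rule['topic']}" for dc in dcs]


# Placeholder rules; the topic name is made up for the example.
ores_rule = {"topic": "example.revision-topic", "consume_dc": ["eqiad", "codfw"]}
other_rule = {"topic": "example.revision-topic"}

ores_topics = topics_for_rule(ores_rule, GLOBAL_CONFIG)
other_topics = topics_for_rule(other_rule, GLOBAL_CONFIG)
```

With an override like this, every other rule keeps consuming only its local DC's messages, which is the concern raised above.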

> How about replicating the precaching redis instance across DCs? Would that be feasible?

It's actually more of a mess than it seems. It's been discussed already and rejected. See T139372#3064972 for an explanation (TL;DR it's unsustainable). Hence the need for alternative solutions.

One thing that has been discussed is splitting the cache globally by using nutcracker, but this entails interesting caveats, like the extra latency (~40ms) to fetch a cached object (or even to get the cache miss) from the "other" DC, which makes it less appealing. Not to mention that it is more fragile due to increased susceptibility to network interruptions.
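To put a rough number on that caveat, here is a back-of-the-envelope sketch in Python. It assumes keys are spread evenly over the two DCs by a plain hash-mod scheme (a stand-in, not nutcracker's actual key distribution) and that every lookup landing in the other DC costs the ~40ms round trip. The key format is made up for the example:

```python
import hashlib

DATACENTERS = ["eqiad", "codfw"]
CROSS_DC_RTT_MS = 40.0  # approximate figure quoted in this thread


def owning_dc(key: str) -> str:
    """Pick the DC whose cache holds this key (plain hash-mod, illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return DATACENTERS[int.from_bytes(digest[:4], "big") % len(DATACENTERS)]


def mean_added_latency_ms(keys, local_dc: str) -> float:
    """Average extra latency per cache lookup for an ORES node in local_dc."""
    remote = sum(1 for key in keys if owning_dc(key) != local_dc)
    return CROSS_DC_RTT_MS * remote / len(keys)


# Hypothetical cache keys; hits and misses pay the same penalty.
keys = [f"enwiki:damaging:{rev_id}" for rev_id in range(10_000)]
penalty = mean_added_latency_ms(keys, "eqiad")
```

Under these assumptions about half of all lookups cross the DC boundary, so the average penalty comes out at roughly half the round trip, on every lookup and from every node.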

Not pre-warming the cache is not desired either, per T139372, as it would put ORES into an overloaded state for a prolonged period of time (research has run enough tests about it). We would survive (as in, it would not cause a widespread outage).

> It seems slightly less spaghetti-like than sending double requests in CP, which would be feasible, but ugly as we would need to replicate the current ORES portion of the config for the second DC, i.e. create a new rule for it. Adding to that the fact that we would also need to introduce another Puppet var to account for ORES in the second DC, I would honestly prefer either (a) replicate the pre-caching DB between DCs (note that here strong consistency is not an effective requirement if we know that only one DC will ever produce the requests); or (b) having a kind of switch that redirects requests from CP based on the current active DC.

Yeah, I can understand that feeling. Maybe a relatively elegant way of solving it would be to make [uri] an array and pass both values? Then the Puppet var could also become an array, propagating all the way down to the config. But I have no idea what that entails and whether it is in any way saner than duplicating the config.

That being said, (a) can't happen, as outlined above, and what would (b) accomplish, given that the goal is to have both DC caches warmed up via the precaching mechanism?
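For illustration, the "URI as an array" idea could look roughly like the Python sketch below. It is not ChangeProp code: the function name, the `send` callable, and the URIs are hypothetical, and the only point is that the same precache event goes to every configured ORES endpoint, with a failure in one DC not blocking the other:

```python
from typing import Callable, Dict, Iterable


def precache_everywhere(
    event: dict,
    ores_uris: Iterable[str],
    send: Callable[[str, dict], None],
) -> Dict[str, bool]:
    """Send the same precache event to each ORES URI.

    Returns a map of URI -> whether the send succeeded, so that one DC
    being unreachable does not prevent the other from being warmed up.
    """
    results: Dict[str, bool] = {}
    for uri in ores_uris:
        try:
            send(uri, event)
            results[uri] = True
        except Exception:
            results[uri] = False
    return results


# Placeholder URIs standing in for the per-DC ORES endpoints.
ORES_URIS = ["http://ores.eqiad.example", "http://ores.codfw.example"]
```

The trade-off is the one discussed in this task: both caches stay warm, but every score is computed once per DC.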

mobrovac added a comment. Edited Mar 21 2017, 10:00 PM

What about setting ORES up behind RESTBase and using Cassandra? In that case it wouldn't even matter where the results have been generated, as they will be available in both DCs automagically. @Halfak @akosiaris what do you think? If need be, we can also redirect ores.wm.org to RESTBase in Varnish so that it's transparent for clients.

Halfak added a comment. Edited Mar 22 2017, 1:12 AM

@mobrovac, I can't see how this would solve any of the problems we've been discussing. Can you clarify what, exactly, would be done magically?

Ok, so the focus of this ticket seems to be on how to have both Redis instances warmed up with the same content all the time. If we were to put ORES behind RESTBase, the results could be stored in Cassandra (with a TTL, most likely) and would be available in both DCs. Varnish would redirect the request to RESTBase. If the response is already in Cassandra, it would be returned right away regardless of the DC. If the response is not in storage, the request would be directed to ORES, and the response would be stored in Cassandra (for future requests) and returned to the client.
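The flow described here is essentially a read-through cache with a TTL. A minimal Python sketch of that pattern follows; an in-memory dict stands in for Cassandra, `compute` stands in for the call to ORES, and the class and parameter names are made up for the example:

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughStore:
    """Return a stored response if it is still fresh, else compute and store it."""

    def __init__(
        self,
        compute: Callable[[str], object],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute  # stand-in for "forward the request to ORES"
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # already in storage: return right away
        value = self._compute(key)  # not in storage: ask the backend
        self._entries[key] = (now + self._ttl, value)  # keep for future requests
        return value
```

Whether ORES responses can be stored this way at all is a separate question, and it is the one raised in the reply below.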

Sorry for not answering sooner on this.

@mobrovac That's an architectural discussion that, while useful to have (and I would be glad to have it), should be postponed until after the switchover, I think. IMHO it's more prudent to avoid such changes before the switchover. We've already discussed this for a few other things (like switch OS upgrades in codfw) and decided to postpone them just to be on the safe side.

For what it's worth, it would not work, however. Have a look at T148999 for why. Note that most of the discussion, while pertaining to Varnish, is valid for RESTBase as well. TL;DR is that ORES is not yet cache-able. The issue has been revisited in T137962#3023643, btw (but only for static assets).

Change 345825 had a related patch set uploaded (by Alexandros Kosiaris):
[mediawiki/services/change-propagation/deploy@master] Use ores_uris instead of ores_uri

https://gerrit.wikimedia.org/r/345825

Change 345826 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] changeprop: Add an ores_uris parameter

https://gerrit.wikimedia.org/r/345826

Change 345827 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] changeprop: Remove the ores_uri parameter

https://gerrit.wikimedia.org/r/345827

Halfak added a comment. Apr 7 2017, 2:12 PM

Hey folks. I just realized that I had the following message sitting in phab waiting for me to send. It's not as relevant now that @akosiaris has picked up the work. But I figured it's better to send now than to just have not responded to @mobrovac at all.


@mobrovac, one of the key things we have ORES do is dedupe requests. So if a request comes in while a score is being generated, we'll give it a task ID rather than starting to generate the score again. This works really well because most of our requests come in based on RCStream or other sources of approximate realtime-ness.
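To make the dedupe idea concrete, here is a deliberately simplified, single-process Python sketch. It is not the ORES implementation; the class, the key shape, and the task ID format are all made up for illustration:

```python
import itertools
from typing import Dict, Tuple

ScoreKey = Tuple[str, str, int]  # (wiki, model, rev_id)


class InFlightScores:
    """Hand out one task ID per score that is currently being generated."""

    def __init__(self) -> None:
        self._task_ids: Dict[ScoreKey, str] = {}
        self._counter = itertools.count(1)

    def submit(self, wiki: str, model: str, rev_id: int) -> Tuple[str, bool]:
        """Return (task_id, started_new_work)."""
        key = (wiki, model, rev_id)
        if key in self._task_ids:
            # Score is already being generated: reuse the existing task.
            return self._task_ids[key], False
        task_id = f"task-{next(self._counter)}"
        self._task_ids[key] = task_id
        return task_id, True

    def finish(self, wiki: str, model: str, rev_id: int) -> None:
        """Forget the task once its score has been generated."""
        self._task_ids.pop((wiki, model, rev_id), None)
```

A second request for the same revision and model while the first is still running gets the same task ID back and starts no new work.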

One of the biggest reasons we decided against RESTBase a while back was because it constrained the access pattern to ORES so that we couldn't batch IO. E.g. ores.wikimedia.org/v2/scores/enwiki/damaging?revids=1|2|3|4|5|6|... would do IO for all of these revisions in batch. It gives about a 5x speedup over requesting individual scores. In the case of requests like this, we pull a subset of scores from our internal cache.
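Roughly, the batch path can be pictured as the Python sketch below. This is an illustration rather than the actual ORES code: a plain dict stands in for the internal cache, and `score_many` is a hypothetical callable that does the IO and scoring for all missing revisions in one go:

```python
from typing import Callable, Dict, List, Sequence


def score_batch(
    rev_ids: Sequence[int],
    cache: Dict[int, object],
    score_many: Callable[[List[int]], Dict[int, object]],
) -> List[object]:
    """Return scores in request order, batching work only for cache misses."""
    # Pull the subset of scores that is already cached.
    scores = {rev_id: cache[rev_id] for rev_id in rev_ids if rev_id in cache}
    # Everything else is fetched and scored in a single batched call.
    missing = list(dict.fromkeys(r for r in rev_ids if r not in scores))
    if missing:
        fresh = score_many(missing)
        cache.update(fresh)
        scores.update(fresh)
    return [scores[rev_id] for rev_id in rev_ids]
```

Splitting such a request into one call per revision would give up the batched IO, which is the constraint described above.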

In the case of ChangeProp sending requests to both data centers, we'd be able to dedupe within datacenter, but we'd be inherently duplicating our baseline work.

What is the status of T148714: Create generalized "precache" endpoint for ORES ? Has it been deployed and tested? If so, we can switch CP to use that and remove the config from there - T158437: Change ORES rules to send all events to new "/precache" endpoint. That would be useful in this transition to sending requests to both DCs.

> @mobrovac, one of the key things we have ORES do is dedupe requests. So if a request comes in while a score is being generated, we'll give it a task ID rather than starting to generate the score again. This works really well because most of our requests come in based on RCStream or other sources of approximate realtime-ness.

That's good. That means that by the time the external request comes in, you already have the score (if it hasn't fallen out of cache) since CP is likely to beat any external request.

> One of the biggest reasons we decided against RESTBase a while back was because it constrained the access pattern to ORES so that we couldn't batch IO. E.g. ores.wikimedia.org/v2/scores/enwiki/damaging?revids=1|2|3|4|5|6|... would do IO for all of these revisions in batch. It gives about a 5x speedup over requesting individual scores. In the case of requests like this, we pull a subset of scores from our internal cache.

This is an interesting approach. Do you know what portion of requests request batch results? Is it a significant portion? Are these requests usually for latest revisions? This ticket is probably not the venue to discuss this, but IMHO we should revisit this decision. ORES' functionality maps really well to a REST API layout ;)

Change 347017 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/change-propagation/deploy@master] Config: Do precaching in both datacenters.

https://gerrit.wikimedia.org/r/347017

Change 345826 merged by Alexandros Kosiaris:
[operations/puppet@production] changeprop: Add an ores_uris parameter

https://gerrit.wikimedia.org/r/345826

Change 347017 merged by Ppchelko:
[mediawiki/services/change-propagation/deploy@master] Config: Do precaching in both datacenters.

https://gerrit.wikimedia.org/r/347017

Mentioned in SAL (#wikimedia-operations) [2017-04-12T18:30:55Z] <ppchelko@tin> Started deploy [changeprop/deploy@0a9a008]: Config: Send ORES precache requests to both DCs. T159615

Mentioned in SAL (#wikimedia-operations) [2017-04-12T18:37:48Z] <ppchelko@tin> Finished deploy [changeprop/deploy@0a9a008]: Config: Send ORES precache requests to both DCs. T159615 (duration: 06m 53s)

Mentioned in SAL (#wikimedia-operations) [2017-04-12T18:43:01Z] <ppchelko@tin> Started deploy [changeprop/deploy@e403f56]: Config: Send ORES precache requests to both DCs. Attempt #2. T159615

Mentioned in SAL (#wikimedia-operations) [2017-04-12T18:44:16Z] <ppchelko@tin> Finished deploy [changeprop/deploy@e403f56]: Config: Send ORES precache requests to both DCs. Attempt #2. T159615 (duration: 01m 15s)

Pchelolo closed this task as Resolved. Apr 12 2017, 7:08 PM
Pchelolo claimed this task.

The precaching rule is now updating ORES in both datacenters. Although there's still room for improvement, this issue can be resolved now.

Change 345825 abandoned by Alexandros Kosiaris:
Use ores_uris instead of ores_uri

Reason:
Already done in https://gerrit.wikimedia.org/r/347017

https://gerrit.wikimedia.org/r/345825

Change 345827 merged by Alexandros Kosiaris:
[operations/puppet@production] changeprop: Remove the ores_uri parameter

https://gerrit.wikimedia.org/r/345827