
Implement sentinel for ORES production Redis
Open, LowPublic

Description

See https://redis.io/topics/sentinel

Currently redis is a SPOF. This proxy strategy can help us make sure that one redis going down doesn't take down the whole service.

Event Timeline

Halfak created this task.Dec 30 2015, 9:13 PM
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description. (Show Details)
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 30 2015, 9:13 PM
Halfak set Security to None.
Halfak added a project: ORES.
yuvipanda removed yuvipanda as the assignee of this task.Jan 20 2016, 6:06 PM
Sabya renamed this task from Implement tewmproxy for ORES in production to Implement twemproxy for ORES in production.Feb 24 2016, 2:48 AM

@akosiaris, do you think we should still pursue something like this? We haven't had big stability problems with redis (that I'm aware of), but it could be good to have this redundancy anyway. The big downside is that we'd need another redis node.

Halfak triaged this task as Low priority.Jul 28 2016, 2:41 PM

Change 350421 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] ores: Add twemproxy support

https://gerrit.wikimedia.org/r/350421

So, a case of needing maintenance on the (primary/master) redis host has arrived. We will need to perform some cabling changes to the redis database host (T148506). Not only that, but at some point we are going to reboot it in order to apply kernel updates, and this trend is only going to continue. So, getting a redundant redis installation that allows easy maintenance tasks to be performed is definitely something worth pursuing. We already have a slave getting all updates, but making the switch between master and slave is currently a manual process. For T148506 this was handled in T163326 already, so there is no urgency, but it is definitely worthwhile to move this forward.

In that spirit, I've uploaded https://gerrit.wikimedia.org/r/350421 today. Twemproxy seems fine as a solution for now, especially given the nature of the redis datastores (jobqueue and cache).

The proposed architecture in that patch is the one we use for Redis and MediaWiki: twemproxy (aka nutcracker) running on every scb node, listening on a unix socket per ORES redis database and forwarding requests to the 2 servers based on the ketama consistent hashing algorithm.
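
For illustration only, a minimal nutcracker pool definition along those lines might look like the sketch below. The pool names, socket paths, and server addresses are assumptions, not the actual values from the patch:

```
ores_queue:
  listen: /var/run/nutcracker/ores_queue.sock 0660
  redis: true
  redis_auth: <password>
  hash: fnv1a_64
  distribution: ketama
  servers:
    - 10.64.0.10:6379:1   # oresrdb1001 (assumed address)
    - 10.64.0.11:6379:1   # oresrdb1002 (assumed address)

ores_cache:
  listen: /var/run/nutcracker/ores_cache.sock 0660
  redis: true
  redis_auth: <password>
  hash: fnv1a_64
  distribution: ketama
  servers:
    - 10.64.0.10:6380:1
    - 10.64.0.11:6380:1
```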

There are a number of open questions, namely:

  • Can ORES (uwsgi+celery) use the unix socket instead of a TCP connection?
  • Is it actually better to do that, or should we stick to TCP connections to 127.0.0.1:6379 and 127.0.0.1:6380 respectively?

I've looked a bit at how security is handled in the unix socket case and I think it's quite fine. A password is still required, and permissions can be set on the unix socket to enhance security.
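
A rough sketch of what that could look like on the redis side and from a redis-py client; the socket path and password are placeholders, not the production values:

```
# redis.conf (sketch): serve a unix socket with restricted permissions plus auth
unixsocket /var/run/redis/ores.sock
unixsocketperm 770
requirepass <password>
```

```
# Python client side (redis-py), assuming the socket path above
import redis

r = redis.StrictRedis(unix_socket_path='/var/run/redis/ores.sock', password='<password>')
r.ping()
```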

@Halfak, @Ladsgroup, still interested in this? Because ops definitely is. Note, by the way, that the approach described above will effectively double the available cache size.

Halfak added a comment.Jun 6 2017, 2:45 PM

Yes. Definitely still interested in this. What's the next step?

Testing if ORES can use the unix socket or if it requires TCP. After that we deploy nutcracker (actually it's already on scb for other reasons now, so it's just a change in the config), change the ORES config to use nutcracker, and we are done.

Change 350421 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Add twemproxy support

https://gerrit.wikimedia.org/r/350421

Change 363340 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Use nutcracker for redis

https://gerrit.wikimedia.org/r/363340

The change above has been merged; all that's left is to turn the knob and enable it. That would be https://gerrit.wikimedia.org/r/363340. Since it has the potential for an outage, I would like a review before deploying. @Halfak, @Ladsgroup?

Halfak added a comment.Jul 5 2017, 2:49 PM

Ok. This *looks* simple.

Is twemproxy known to be running on "localhost" and in a good state? When we flip the switch, will we be finding out for the first time?

Do you want to try this in one datacenter first?

Ok. This *looks* simple.
Is twemproxy known to be running on "localhost" and in a good state? When we flip the switch, will we be finding out for the first time?

Yes and no respectively.

Do you want to try this in one datacenter first?

Yeah, good idea. I'll do codfw first.

Change 363340 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Use nutcracker for redis

https://gerrit.wikimedia.org/r/363340

codfw has been migrated to use nutcracker and then reverted. This backfired majestically. The reason being:

[2017-07-06 13:59:15.068] nc_redis.c:1092 parsed unsupported command 'MULTI'

Per https://github.com/twitter/twemproxy/blob/master/notes/redis.md#transactions this is not supported. We need to figure out why that command is being used.
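
For context, MULTI is what a transactional redis-py pipeline emits around its commands. The sketch below illustrates where such a command comes from in general; the hostname and keys are placeholders, and this is not the exact call path kombu/celery takes:

```
import redis

r = redis.StrictRedis(host='localhost', port=6379)

# transaction=True (the default) wraps the queued commands in MULTI ... EXEC,
# which twemproxy rejects with "parsed unsupported command 'MULTI'".
pipe = r.pipeline(transaction=True)
pipe.lpush('celery', '{"task": "..."}')   # placeholder payload
pipe.publish('celery', 'new task')
pipe.execute()

# transaction=False sends the same commands without MULTI/EXEC and would
# pass through twemproxy, if the client library allowed choosing that.
```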

awight raised the priority of this task from Low to Normal.Jun 11 2018, 11:01 AM
awight added a subscriber: awight.

Increasing priority, this is at least normal if not high. I'll create a subtask for the transaction investigation.

awight renamed this task from Implement twemproxy for ORES in production to Implement twemproxy for ORES production Redis.Jun 18 2018, 10:03 PM

codfw has been migrated to use nutcracker and then reverted. This backfired majestically. The reason being:

[2017-07-06 13:59:15.068] nc_redis.c:1092 parsed unsupported command 'MULTI'

Per https://github.com/twitter/twemproxy/blob/master/notes/redis.md#transactions this is not supported. We need to figure out why that command is being used.

Even after upgrading to celery4 this still happens.

For what it's worth, the upstream task is https://github.com/celery/celery/issues/3500. Apparently closed WONTFIX.

Which now brings us to the question of what's the next step? Should we reinvent the wheel and make a basic nutcracker that is able to handle transactions? Should we change to use RabbitMQ for the broker and redis for other stuff? (In that case we can put the redis behind the nutcracker and I have no idea how to handle RabbitMQ). Fork celery and drop the transaction part? (Nooo)

Which now brings us to the question of what's the next step?

Should we reinvent the wheel and make a basic nutcracker that is able to handle transactions?

I'd avoid that pretty much at all costs. It's up to the scoring team, but expect a sizable amount of resources required to implement and support this, with the caveat that it will probably not be used anywhere else. I also don't see how a basic nutcracker would be able to support transactions in the redis protocol (it's actually a difficult problem, otherwise it would have already been done), but I haven't studied the protocol that much.

Should we change to use RabbitMQ for the broker and redis for other stuff? (In that case we can put the redis behind the nutcracker and I have no idea how to handle RabbitMQ).

Note SRE doesn't have much experience with RabbitMQ currently, as the only existing installation is in the WMCS realm. We would have to gain knowledge and build/learn tooling to support it. Personally I would rather avoid it, as AMQP + RabbitMQ have been a major rabbit hole in the past for multiple people and organizations, especially in the HA setup we'd require here.

Fork celery and drop the transaction part? (Nooo)

I +1 your no. The MULTI command is issued by kombu IIRC (https://github.com/celery/kombu), the messaging transport library that was split out of celery early on and is distributed independently. But I am not sure how celery uses it, or how it could be patched to avoid the transaction (and whether it would even be prudent to do so).

I see 2 more alternatives:

  • We can use twemproxy for the cache part, as there is no MULTI command there (at least that's my understanding). That would allow us to have a degraded service for some time in the case of a redis server failure. That being said, without the capability to submit new jobs, that window will not be very long. To fill in that gap we could use keepalived (or maybe even pybal) to fail over the IP of the jobqueue redis and recover (see the keepalived sketch below). We would have lost the entire job set, but we should recover since most clients will retry.
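
A rough keepalived sketch for that failover idea; the interface name, virtual IP, router id, and check command are assumptions rather than actual production values:

```
# /etc/keepalived/keepalived.conf on the primary redis host (sketch)
vrrp_script chk_redis {
    script "/usr/bin/redis-cli -p 6379 ping"   # assumes PING works unauthenticated; adjust if requirepass is set
    interval 2
    fall 3
}

vrrp_instance ores_jobqueue {
    state MASTER            # BACKUP with a lower priority on the other redis host
    interface eth0          # assumed interface name
    virtual_router_id 61    # arbitrary example id
    priority 150
    virtual_ipaddress {
        10.64.0.100/32      # hypothetical floating IP the ORES config would point at
    }
    track_script {
        chk_redis
    }
}
```
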
Ladsgroup added a comment.EditedNov 15 2018, 11:40 AM

Celery4 supports sentinel: http://docs.celeryproject.org/en/v4.1.0/whatsnew-4.0.html#redis-support-for-sentinel

I love that we upgraded to celery4 ^^
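
For reference, the sentinel transport in Celery 4 is configured roughly like this; the sentinel host names and the master name are placeholders:

```
# Celery 4 broker pointed at redis via sentinel (sketch, placeholder hosts)
from celery import Celery

app = Celery('ores')
app.conf.broker_url = (
    'sentinel://oresrdb1001:26379;'
    'sentinel://oresrdb1002:26379;'
    'sentinel://ores1001:26379'
)
app.conf.broker_transport_options = {'master_name': 'oresrdb'}
```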

I looked at sentinel. It's a little bit complex but easily doable. We probably need to install redis on the ores nodes and make them run sentinel (Example 3 in https://redis.io/topics/sentinel). Setting up the configuration would be a little bit hard, and basically all of it needs to be done via puppet. One other concern is dockerizing ores; that's going to be even harder with docker.
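
A minimal sketch of what the sentinel configuration on each ores node could look like under that topology; the master name, master IP, quorum, and timeouts are placeholders:

```
# /etc/redis/sentinel.conf on each ores node (sketch)
port 26379
sentinel monitor oresrdb 10.64.0.10 6379 2      # name, assumed master IP, port, quorum
sentinel auth-pass oresrdb <password>
sentinel down-after-milliseconds oresrdb 5000
sentinel failover-timeout oresrdb 60000
sentinel parallel-syncs oresrdb 1
```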

Ladsgroup renamed this task from Implement twemproxy for ORES production Redis to Implement sentinel for ORES production Redis.Nov 15 2018, 6:21 PM
Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description. (Show Details)
Ladsgroup raised the priority of this task from Normal to High.Nov 15 2018, 6:25 PM
Restricted Application added a project: User-Ladsgroup. · View Herald TranscriptNov 28 2018, 1:51 PM
Joe added a subscriber: Joe.Dec 4 2018, 6:29 AM

I looked at sentinel. It's a little bit complex but easily doable. We probably need to install redis on the ores nodes and make them run sentinel (Example 3 in https://redis.io/topics/sentinel). Setting up the configuration would be a little bit hard, and basically all of it needs to be done via puppet. One other concern is dockerizing ores; that's going to be even harder with docker.

Let's pause for a sec and let's consider whether we prefer to use sentinel or redis-cluster.

What kind of availability guarantees do we need for redis? Do we need them for all the instances or just for the ones used by celery?

Depending on the answers, redis-cluster is a way better solution than sentinel, or even a complementary one.

Also, and sorry for bringing this up so late, did anyone test netflix's dynomite for redis proxying? We've packaged it in the past and while it didn't fit our needs specifically in that case, it might fit our use-case for ORES.

Hey,

  • @akosiaris tested twemproxy in prod and it fails because celery issues redis transactions, which twemproxy doesn't support. The same goes for dynomite.
  • redis for celery runs on port 6379 on the ores redis nodes and the cache on port 6380 (oresrdb1001 is the master, oresrdb1002 is the replica)
  • We need redis for celery to be HA. The cache for ores doesn't need to be HA (it's nice to have, but it isn't worth building extra infrastructure just to make the cache HA).
  • One other thing is that with redis sentinel we can distribute reads; given ores's cache hit rate of 80%, that would make ores more scalable (see the sketch below).
  • Celery4 has native support for sentinel.

Hope these help.
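
The read-distribution point above could look roughly like this with redis-py's sentinel support; the host names, master name, and key layout are placeholders:

```
from redis.sentinel import Sentinel

# Placeholder sentinel endpoints; in practice these would be the nodes running sentinel.
sentinel = Sentinel([('oresrdb1001', 26379), ('oresrdb1002', 26379)], socket_timeout=0.5)

master = sentinel.master_for('oresrdb', socket_timeout=0.5)   # writes go to the current master
replica = sentinel.slave_for('oresrdb', socket_timeout=0.5)   # cache reads could go to a replica

master.set('ores:enwiki:damaging:12345:0.5.0', '{"score": 0.93}')
print(replica.get('ores:enwiki:damaging:12345:0.5.0'))
```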

Joe added a comment.EditedDec 4 2018, 10:51 AM

So, I would separate the needs of the cache (where I guess we can use twemproxy) from celery (where we can't use it).

Redis Sentinel is effectively a new storage solution we never tested, monitored or instrumented before, hence my resistance. But:

  • twemproxy AND dynomite don't support MULTI
  • Celery doesn't seem to support redis-cluster (that issue has been open since 2016)
  • the other well-supported backend for celery 4 is rabbitmq, with which I have had past experiences that weren't so nice; it would also mean we need to learn much more about our new datastore than we would about redis-sentinel.

So we're kinda out of options if we don't want to rearchitect ORES completely to avoid the use of celery in favour of submitting jobs from changeprop to an uWSGI endpoint directly - which I would recommend but is out of the scope of this ticket.

Before we can go on and use sentinel, though, we will need the SRE team to get a feeling of how it works and its failure scenarios, and how to have alerting and monitoring on it.

Also, we will have to agree on what we mean by "availability":

  • What does it mean, from your point of view, for redis to be "available"? E.g. what error rate (across what time interval) would you consider acceptable in normal operating conditions? What error rate would then be considered an outage?
  • Once we've properly defined that, what is the target availability we want to reach for the ores/celery system? And for the underlying redis service? 100% is of course unrealistic, before you ask :)

And once we've reached an agreement on that, I'd like also to agree that if we can't reach that objective with the current architecture, we will take steps together to improve the situation - which could mean we need to go back and revisit the use of celery completely.

Hey,

  • @akosiaris tested twemproxy in prod and it fails because celery issues redis transactions, which twemproxy doesn't support. The same goes for dynomite.

Yes, that's true.

  • redis for celery runs on port 6379 on the ores redis nodes and the cache on port 6380 (oresrdb1001 is the master, oresrdb1002 is the replica)

Just for completeness and expanding a bit on the above (which is correct), the celery broker AND the celery result store are on port 6379.

  • We need redis for celery to be HA. The cache for ores doesn't need to be HA (it's nice to have, but it isn't worth building extra infrastructure just to make the cache HA).

Just as a note, the celery queues are transient, not persisted, per T210584#4780485. This is a bit weird, as having HA for a data store that is not persisted and will lose data on a restart sounds counter-intuitive. But this is about the jobs continuing to be submitted and executed, not about losing the results of those jobs.

Just for completeness and expanding a bit on the above (which is correct), the celery broker AND the celery result store are on port 6379.

True, but there are two caches in ORES: the celery result cache (celery result backend), which is on 6379, and the ores score cache (built-in cache), which resides on 6380 and has a higher TTL (15 years I think, plus key eviction).
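
To make the two-instance split concrete, a hedged sketch from a client's point of view; the host, password, key layout, and TTL are illustrative assumptions:

```
import redis

# Instance on 6379: celery broker + celery result backend.
celery_rdb = redis.StrictRedis(host='oresrdb1001', port=6379, password='<password>')

# Instance on 6380: the built-in ORES score cache, long TTL plus key eviction.
score_cache = redis.StrictRedis(host='oresrdb1001', port=6380, password='<password>')
score_cache.setex('ores:enwiki:damaging:12345:0.5.0', 15 * 365 * 24 * 3600, '{"score": 0.93}')
```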

Ladsgroup removed Ladsgroup as the assignee of this task.Jan 25 2019, 12:31 PM
Ladsgroup removed a project: User-Ladsgroup.

Hey @Joe and @akosiaris. What do you think of the next steps here? Can we move forward or we need to discuss more our options? If you need more hands to help on anything, just let me know.

Joe added a comment.Feb 22 2019, 6:04 AM

Personally speaking, I think the real blocker is proper resourcing for doing this work, which isn't something we can do as a side job. Using redis-sentinel while maintaining our operational standards means we have to run tests, set up appropriate monitoring, learn how to recover from the inevitable failure scenarios.

Also, I'd like someone to answer the questions I asked in https://phabricator.wikimedia.org/T122676#4796953

But more importantly, I'm not sure this work is entirely justifiable compared to e.g. switching to changeprop and a separate uwsgi endpoint for submitting score calculations, which would make ores behave more like the rest of our services. Uniformity is important.

submitting jobs from changeprop to an uWSGI endpoint directly

Aren't we doing this already?

rearchitect ORES completely to avoid the use of celery

I'm not sure I understand the details of this proposal. I imagine that the idea is to drop batch processing (which is something the layers of uwsgi and celery allow us to do). 20-25% of our scores are delivered via batch requests[1] and a batch request is generally ~3x faster than requesting single scores. I expect our batch request use-case to increase in frequency over time.

That said, I appreciate the idea that making the infrastructure easier to maintain might be worth the hit. After all, if we could allow people to send 3x as many single-score requests, then the tradeoff would cancel out on our end. I don't think we could triple our workers if we dropped celery but we could potentially increase them by 20% or so with the memory we'd make available.

I created T216838: [Discuss] ORES without celery with some notes from this discussion so we can continue it there.

Also, I'd like someone to answer the questions I asked in https://phabricator.wikimedia.org/T122676#4796953

What does it mean, from your point of view, for redis to be "available"? E.g. what error rate (across what time interval) would you consider acceptable in normal operating conditions? What error rate would then be considered an outage?
Once we've properly defined that, what is the target availability we want to reach for the ores/celery system? And for the underlying redis service? 100% is of course unrealistic, before you ask :)

Fair questions. I'd been thinking about this in terms of avoiding a SPOF, but your availability % question is reasonable. If our redis backend were to become unavailable, we'd not be able to process scoring requests. If the whole redis host were to become unavailable, we would also lose our score cache. That means we'd be 100% unavailable in one datacenter. I believe that we'd then auto-fail to the other datacenter after a sustained period of returning 500s (@akosiaris, correct me if I'm wrong here). So we'd experience a brief period of downtime during the failover (not sure how long we expect that switch to take) and we'd be running in a compromised state while a live human pushed a configuration change to use the backup redis host and re-enabled the original datacenter. We'd then be in a semi-compromised state until we could bring the host or a new host back online.

The potential impact of a better strategy for managing redis would be in reducing the downtime experienced during the switchover to the other datacenter. If that is automatic (as I understand) that impact might be minimal. If it is manual, we probably ought to make it automatic, but sentinel would then have a higher impact on our uptime.

None of this directly answers @Joe's questions about uptime expectations, but I think it may help us think through what we hope to gain in the context of a failure event.

Ladsgroup added a comment.EditedFeb 25 2019, 1:12 PM

Fair questions. I'd been thinking about this in terms of avoiding a SPOF, but your availability % question is reasonable. If our redis backend were to become unavailable, we'd not be able to process scoring requests. If the whole redis host were to become unavailable, we would also lose our score cache. That means we'd be 100% unavailable in one datacenter. I believe that we'd then auto-fail to the other datacenter after a sustained period of returning 500s (@akosiaris, correct me if I'm wrong here). So we'd experience a brief period of downtime during the failover (not sure how long we expect that switch to take) and we'd be running in a compromised state while a live human pushed a configuration change to use the backup redis host and re-enabled the original datacenter. We'd then be in a semi-compromised state until we could bring the host or a new host back online.

I'm not sure if I understood you correctly here. Sentinel prevents the whole ores cluster in one datacenter from going down when the main redis node goes down. Let me give concrete examples:

  • Status quo: If oresrdb1001 goes down, all of ores goes down until someone manually changes the redis backend to oresrdb1002.
  • Status quo: If both redis hosts go down, ores goes down until someone manually switches traffic to codfw.
  • With sentinel: If oresrdb1001 goes down, sentinel automatically promotes a new master and points ores at it. We'd have at most one minute of downtime. <- This is what sentinel is trying to address.
  • With sentinel: If all redis nodes go down, ores goes down until someone manually switches traffic to codfw.

@Halfak: I had the feeling you're suggesting that ores use the redis in the other datacenter until the redis comes back online in the first DC; given the 36ms latency for any connection between the two datacenters, I don't think that's feasible. Ignore my comment if I misunderstood you.

Ladsgroup added a comment.EditedFeb 25 2019, 1:25 PM

Personally speaking, I think the real blocker is proper resourcing for doing this work, which isn't something we can do as a side job. Using redis-sentinel while maintaining our operational standards means we have to run tests, set up appropriate monitoring, learn how to recover from the inevitable failure scenarios.

This is out of the scoring platform team's control. I know you have lots of things to do (running Wikipedia + Wikidata with such a small team is a miracle), so I want to make it clear that I'm fine with putting the sentinel project on hold until SRE feels comfortable maintaining a sentinel cluster. oresrdb[12]001 is a SPOF, but they have been cooperating since their inception.

Also, I'd like someone to answer the questions I asked in https://phabricator.wikimedia.org/T122676#4796953

What does it mean, from your point of view, for redis to be "available"? E.g. what error rate (across what time interval) would you consider acceptable in normal operating conditions? What error rate would then be considered an outage?
Once we've properly defined that, what is the target availability we want to reach for the ores/celery system? And for the underlying redis service? 100% is of course unrealistic, before you ask :)

Fair questions. I'd been thinking about this in terms of avoiding a SPOF, but your availability % question is reasonable. If our redis backend were to become unavailable, we'd not be able to process scoring requests. If the whole redis host were to become unavailable, we would also lose our score cache. That means we'd be 100% unavailable in one datacenter.

I believe that we'd then auto-fail to the other datacenter after a sustained period of returning 500s (@akosiaris, correct me if I'm wrong here).

No, that's not true. The failover to the other datacenter would happen after a manual intervention by an SRE, the reasoning being that it can be really hard to deduce automatically that this is the best course of action.

So we'd experience a brief period of downtime during the failover (not sure how long we expect that switch to happen) and we'd be running in a compromised state while a live human pushed a configuration change to use the backup redis host and re-enabled the original datacenter.

Assuming I understood the scenario well enough (that is, the ORES redis host failing), it would be faster to switch to the backup. It's still a manual action unfortunately, but one that, unlike the datacenter switchover above, can be more automated, assuming we understand the requirements and limitations correctly.

We'd then be in a semi-compromised state until we could bring the host or a new host back online.

Nitpick: State without redundancy :-)

The potential impact of a better strategy for managing redis would be in reducing the downtime experienced during the switchover to the other datacenter. If that is automatic (as I understand) that impact might be minimal. If it is manual, we probably ought to make it automatic, but sentinel would then have a higher impact on our uptime.

With the correction of switching to the backup redis host (and not to the other datacenter), overall agreed.

None of this directly answers @Joe's questions about uptime expectations, but I think it may help us think through what we hope to gain in the context of a failure event.

It definitely does. It also helps in formulating what we want to happen (and what we expect to happen).

Fair questions. I'd been thinking about this in terms of avoiding a SPOF, but your availability % question is reasonable. If our redis backend were to become unavailable, we'd not be able to process scoring requests. If the whole redis host were to become unavailable, we would also lose our score cache. That means we'd be 100% unavailable in one datacenter. I believe that we'd then auto-fail to the other datacenter after a sustained period of returning 500s (@akosiaris, correct me if I'm wrong here). So we'd experience a brief period of downtime during the failover (not sure how long we expect that switch to take) and we'd be running in a compromised state while a live human pushed a configuration change to use the backup redis host and re-enabled the original datacenter. We'd then be in a semi-compromised state until we could bring the host or a new host back online.

I'm not sure if I understood you correctly here. Sentinel prevents the whole ores cluster in one datacenter from going down when the main redis node goes down. Let me give concrete examples:

  • Status quo: If oresrdb1001 goes down, all of ores goes down until someone manually changes the redis backend to oresrdb1002.
  • Status quo: If both redis hosts go down, ores goes down until someone manually switches traffic to codfw.
  • With sentinel: If oresrdb1001 goes down, sentinel automatically promotes a new master and points ores at it. We'd have at most one minute of downtime. <- This is what sentinel is trying to address.
  • With sentinel: If all redis nodes go down, ores goes down until someone manually switches traffic to codfw.

For what it's worth, all of the above is correct. There are, however, some extra requirements that make this more interesting.

Per T210584#4780485, the queue is transient. So losing all of its data in the case of a catastrophic redis host failure, while not ideal, is acceptable. That means that at least for the celery job queue it is fine to switch to a fresh redis instance, and in fact it's acceptable to do so rather quickly (the exact number is to be determined, but I expect it in the low tens of seconds). The same holds more or less true for the celery result store, as it's tightly bound to the celery job queue (correct me if I am wrong).

The scoring cache datastore (still on the same redis host currently, just in another redis instance listening on a different port) is a different matter. Still from the PoV of data loss, a loss of the scoring cache (i.e. pointing the ORES service to a fresh redis instance) will result in increased CPU usage and increased latency for results. That has the potential to escalate into a partial (but perhaps extended) outage by triggering the protection mechanisms of ORES, which limit the amount of concurrent scorings. However, that datastore is also twemproxy compatible, which means we can decrease its reliance on a single host, increasing overall availability and removing the SPOF.
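
A sketch of what putting only the scoring cache behind twemproxy could look like, with host ejection so a failed backend stops being a SPOF; the addresses and thresholds are assumptions:

```
ores_cache:
  listen: /var/run/nutcracker/ores_cache.sock 0660
  redis: true
  redis_auth: <password>
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true        # drop a dead backend from the ring after repeated failures
  server_retry_timeout: 30000   # ms before retrying an ejected backend
  server_failure_limit: 3
  servers:
    - 10.64.0.10:6380:1
    - 10.64.0.11:6380:1
```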

None of this by the way answers the questions in T122676#4796953, just adding some more information.

@Halfak: I had the feeling you're suggesting that ores use the redis in the other datacenter until the redis comes back online in the first DC; given the 36ms latency for any connection between the two datacenters, I don't think that's feasible. Ignore my comment if I misunderstood you.

I understood it as a switchover of the entire ORES service, not just the redis host (probably because that is indeed infeasible).

awight removed a subscriber: awight.Mar 21 2019, 4:00 PM

@akosiaris, I'm checking in on this task. Have we come to any conclusions about investing in Sentinel or not from the ops side of things? I.e., should we pick up this task or decline it?

@akosiaris, I'm checking in on this task. Have we come to any conclusions about investing in Sentinel or not from the ops side of things? I.e., should we pick up this task or decline it?

No, not yet. SRE hasn't really had time to familiarize itself with sentinel. In the meantime some interesting discussions have come up. Namely:

  • There is the potential for reusing the misc cluster of redises for ORES. This would allow us to fold the ORES redis into the cluster of other redises. This would lead to more available resources (essentially more memory) and better operational response times and monitoring (i.e. more eyes looking at the service).
  • There have been some discussions as to whether it's better to invest in the technologies we are already very familiar with (fault-aware load balancers like pybal/haproxy and twemproxy/nutcracker) to provide the same levels of availability. Discussions are still ongoing.

I'd stall this pending input from SRE, if you don't mind.

Sounds reasonable. Thank you for the update!

MoritzMuehlenhoff changed the status of subtask T210582: New node request: oresrdb[12]003 from Open to Stalled.Jul 2 2019, 11:09 AM
Halfak lowered the priority of this task from High to Low.Wed, Sep 11, 9:21 PM