
Make switching Redis server simpler
Closed, ResolvedPublic

Description

When the Redis server should be switched between tools-redis-1001 and tools-redis-1002, https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Redis says:

Redis runs on two instances - tools-redis-01 and -02, and the currently active master is set via hiera on toollabs::active_redis (defaults to tools-redis-01). The other is set to be a slave of the master. Switching over can be done by:

  1. Switch over in hiera: set toollabs::active_redis to the hostname (not FQDN) of the host that is up
  2. Force a puppet run on the redis hosts
  3. Restart redis on the redis hosts; this resets current connections and makes the master and the slave see themselves as such
  4. Set the IP address for 'tools-redis.tools.eqiad.wmflabs' and 'tools-redis.eqiad.wmflabs' in hieradata/common/dnsrecursor/labsaliaser.yaml to point to the IP of the new master. This needs a puppet merge + run on the DNS hosts (labservices1001 and holmium as of now). Eventually we'd like to move this step to Horizon...
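
For illustration, the full switchover might look roughly like this (hostnames from the description; a sketch of the procedure above, not a documented script):

  # Step 1: on https://wikitech.wikimedia.org/wiki/Hiera:Tools, set:
  #   toollabs::active_redis: tools-redis-1002
  # Steps 2-3: force a puppet run and restart Redis on both hosts:
  ssh tools-redis-1001 'sudo puppet agent -t; sudo service redis-server restart'
  ssh tools-redis-1002 'sudo puppet agent -t; sudo service redis-server restart'
  # Step 4: point both tools-redis aliases at the new master's IP in
  # hieradata/common/dnsrecursor/labsaliaser.yaml, then merge and run
  # puppet on the DNS hosts (labservices1001 and holmium).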

So there are two sources of truth for "the active Redis server": one in toollabs::active_redis (set at https://wikitech.wikimedia.org/wiki/Hiera:Tools; used for /etc/hosts and for setting up replication between the Redis servers), and one in hieradata/common/dnsrecursor/labsaliaser.yaml (used for DNS). In addition, the documentation says that the Redis services need to be restarted (I'm not sure that this is actually necessary at the moment).

Instead, the single source of truth for "the active Redis server" should live in hieradata/common/dnsrecursor/labsaliaser.yaml alone. Each Redis server should check whether its IP address matches that of tools-redis and consider itself master or slave accordingly, and it should restart automatically whenever the replication direction changes.
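
A minimal sketch of that check, written as plain shell for illustration (in practice this would be templated by puppet; the service name is taken from the documentation quoted above):

  # Resolve the service alias and compare it with this host's own address.
  SERVICE_IP=$(dig +short tools-redis.tools.eqiad.wmflabs | tail -n 1)
  MY_IP=$(hostname -I | awk '{print $1}')
  if [ "$MY_IP" = "$SERVICE_IP" ]; then
      redis-cli SLAVEOF NO ONE                # active host: stop replicating
  else
      redis-cli SLAVEOF "$SERVICE_IP" 6379    # standby: replicate from the active host
  fi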

While this means switching the Redis server requires someone with +2 on operations/puppet, such switches are rare and are typically performed by administrators who have +2 anyway.

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 raised the priority of this task from Low to Medium. · Jun 16 2020, 5:00 PM
bd808 subscribed.

This could be implemented with a service IP that is managed via keepalived, thanks to more modern tooling.
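
A rough sketch of what such a keepalived setup could look like; every value here is a placeholder, not the actual Toolforge configuration:

  # /etc/keepalived/keepalived.conf (sketch): priority would be set higher on
  # the preferred node, and check-redis-master is a hypothetical helper that
  # exits 0 only on the current Redis master.
  vrrp_script check_redis_master {
      script "/usr/local/bin/check-redis-master"
      interval 2
  }
  vrrp_instance redis_vip {
      state BACKUP
      interface eth0
      virtual_router_id 51
      priority 100
      virtual_ipaddress {
          172.16.3.26
      }
      track_script {
          check_redis_master
      }
  }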

Mentioned in SAL (#wikimedia-cloud) [2021-05-13T08:07:16Z] <Majavah> creating toolsbeta-redis-[1-3] as g3.cores1.ram2.disk20 to experiment with redis-sentinel / T153810

taavi added a subscriber: aborrero.

Assigning to @aborrero to create a virtual IP address that toolsbeta-redis-[1-3] can use with keepalived.

Change 690528 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] toolforge: Add separate role for Redis Sentinel

https://gerrit.wikimedia.org/r/690528

Mentioned in SAL (#wikimedia-cloud) [2021-05-14T11:16:45Z] <arturo> aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack --os-project-id=toolsbeta port create --network lan-flat-cloudinstances2b toolsbeta-redis-vip (T153810)

Mentioned in SAL (#wikimedia-cloud) [2021-05-14T11:21:58Z] <arturo> allowed VIP address from the new port 172.16.3.26 into the ports of toolsbeta-redis-[1-3] (T153810)
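
That second step presumably used Neutron's allowed-address-pairs; with the OpenStack CLI it would look something like this (the port identifiers here are assumptions):

  # Permit the VIP on each backing instance's Neutron port:
  for port in toolsbeta-redis-1-port toolsbeta-redis-2-port toolsbeta-redis-3-port; do
      sudo wmcs-openstack --os-project-id=toolsbeta port set \
          --allowed-address ip-address=172.16.3.26 "$port"
  done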

I think you are all set to continue!

The automatic failover with Sentinel works. Next up, https://gerrit.wikimedia.org/r/c/operations/puppet/+/690528 needs to be reviewed, and then we can create a migration plan. I'm not yet sure whether the switch from the current cluster to the new one can be done without downtime.

I'd also like to turn the tools-redis name into a CNAME to something in svc.tools.eqiad1.wikimedia.cloud.

side note: The automatic Sentinel failover might take up to 15 seconds from the original master going down (five for Sentinel to notice it's down and ten for Keepalived to move over the VIP). I think that's acceptable, and much better than the current manual system, which requires +2 in operations/puppet.
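
A minimal sketch of the Sentinel side of that timing; the master name, quorum, and address are assumptions, and only the 5000 ms detection window comes from the numbers above:

  # /etc/redis/sentinel.conf (sketch): a quorum of 2 of the 3 Sentinels must agree
  sentinel monitor toolsredis 172.16.3.21 6379 2
  # declare the master down after 5 seconds without a valid reply
  sentinel down-after-milliseconds toolsredis 5000
  sentinel failover-timeout toolsredis 60000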

In T153810#7090003, @Majavah wrote:

I'd also like to turn the tools-redis name into a CNAME to something in svc.tools.eqiad1.wikimedia.cloud.

Currently tools-redis.tools.eqiad.wmflabs is a CNAME for tools-redis.svc.eqiad.wmflabs. At the time tools-redis.svc.eqiad.wmflabs was made the "canonical" service name, I don't think we had yet decided that having DNS zones for the various projects was easy to deal with.

The tools-redis.svc.eqiad.wmflabs A record is managed by the wmcs-wikireplica-dns.py script, which is provisioned by puppet. Its config is in ops/puppet.git:modules/openstack/files/util/wikireplica_dns.yaml. This might not be the right place to manage a new record if it is in a zone owned by the tools project. It can probably just be a manually created, Horizon-managed record pointing to the VIP that will float across the backing instances. The current tools-redis.svc.eqiad.wmflabs should also be made a CNAME to whatever record is considered canonical, possibly with a deprecation announcement so we don't have to keep that CNAME around forever.
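
Under that plan the naming chain would end up looking something like this (the canonical name is an assumption, following the svc.tools.eqiad1.wikimedia.cloud suggestion above):

  # Hypothetical end state of the records:
  #   tools-redis.tools.eqiad.wmflabs  CNAME  tools-redis.svc.tools.eqiad1.wikimedia.cloud
  #   tools-redis.svc.eqiad.wmflabs    CNAME  tools-redis.svc.tools.eqiad1.wikimedia.cloud
  #   tools-redis.svc.tools.eqiad1.wikimedia.cloud  A  (the floating VIP)
  # The chain can be verified with:
  dig +short tools-redis.tools.eqiad.wmflabs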

side note: The automatic Sentinel failover might take up to 15 seconds from the original master going down (five for Sentinel to notice it's down and ten for Keepalived to move over the VIP). I think that's acceptable, and much better than the current manual system, which requires +2 in operations/puppet.

15 seconds is an amazing reduction in time vs the current process! Seriously thank you very, very much for this work.

For anyone wondering, the current process would look something like this:

  • someone or something notices redis outage and yells
    • maybe toolschecker, but it only checks once per minute (or is it 5 minutes?)
  • someone who can do something about it learns that the service is down
  • they get to their computer
  • they verify the outage
  • they find the runbook for manual failover
  • they do the failover
    • hiera change
    • forced puppet runs
    • ops/puppet.git change
    • ops/puppet.git merge
    • forced puppet run
    • run of wmcs-wikireplica-dns.py

I would be astounded if, even in the most ideal case, this took less than 10 minutes of wall clock time today. That would make 15 s a 97.5% decrease, or, stated another way, 40 times faster (600 s / 15 s) than the recovery we would expect today.

Change 690528 merged by Bstorm:

[operations/puppet@production] toolforge: Add separate role for Redis Sentinel

https://gerrit.wikimedia.org/r/690528

Closing this since the Puppet patch was merged. The parent task will be used to switch from the current cluster to the new one.

Change 758090 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::redis_sentinel: fix hardcoded interface

https://gerrit.wikimedia.org/r/758090

Change 758090 merged by Andrew Bogott:

[operations/puppet@production] P:toolforge::redis_sentinel: fix hardcoded interface

https://gerrit.wikimedia.org/r/758090