
Make switching Redis server simpler
Closed, ResolvedPublic

Description

When the Redis server should be switched between tools-redis-1001 and tools-redis-1002, https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Redis says:

Redis runs on two instances - tools-redis-01 and -02, and the currently active master is set via hiera on toollabs::active_redis (defaults to tools-redis-01). The other is set to be a slave of the master. Switching over can be done by:

  1. Switch over in hiera: set toollabs::active_redis to the hostname (not FQDN) of the host that is up
  2. Force a puppet run on the redis hosts
  3. Restart redis on the redis hosts; this resets current connections and makes the master and the slave see themselves as such
  4. Set the IP address for 'tools-redis.tools.eqiad.wmflabs' and 'tools-redis.eqiad.wmflabs' in hieradata/common/dnsrecursor/labsaliaser.yaml to point to the IP of the new master. This needs a puppet merge + run on the DNS hosts (labservices1001 and holmium as of now). Eventually we'd like to move this step to Horizon...
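
For illustration, the full switchover might look roughly like this (hostnames from the description; a sketch of the procedure above, not a documented script):

  # Step 1: on https://wikitech.wikimedia.org/wiki/Hiera:Tools, set:
  #   toollabs::active_redis: tools-redis-1002
  # Steps 2-3: force a puppet run and restart Redis on both hosts:
  ssh tools-redis-1001 'sudo puppet agent -t; sudo service redis-server restart'
  ssh tools-redis-1002 'sudo puppet agent -t; sudo service redis-server restart'
  # Step 4: point both tools-redis aliases at the new master's IP in
  # hieradata/common/dnsrecursor/labsaliaser.yaml, then merge and run
  # puppet on the DNS hosts (labservices1001 and holmium).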

So there are two sources of truth for "the active Redis server": one in toollabs::active_redis (set at https://wikitech.wikimedia.org/wiki/Hiera:Tools; used for /etc/hosts and for setting up replication between the Redis servers), and one in hieradata/common/dnsrecursor/labsaliaser.yaml (used for DNS). In addition, the documentation says that the Redis services need to be restarted (I'm not sure that this is actually necessary at the moment).

Instead, the single source of truth for "the active Redis server" should live in hieradata/common/dnsrecursor/labsaliaser.yaml alone. Each Redis server should check whether its IP address matches that of tools-redis and consider itself master or slave accordingly, and it should restart automatically whenever the replication direction changes.
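
A minimal sketch of that check, written as plain shell for illustration (in practice this would be templated by puppet; the service name is taken from the documentation quoted above):

  # Resolve the service alias and compare it with this host's own address.
  SERVICE_IP=$(dig +short tools-redis.tools.eqiad.wmflabs | tail -n 1)
  MY_IP=$(hostname -I | awk '{print $1}')
  if [ "$MY_IP" = "$SERVICE_IP" ]; then
      redis-cli SLAVEOF NO ONE                # active host: stop replicating
  else
      redis-cli SLAVEOF "$SERVICE_IP" 6379    # standby: replicate from the active host
  fi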

While this means switching the Redis server requires someone with +2 on operations/puppet, such switches are rare and are typically performed by administrators who have +2 anyway.

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 raised the priority of this task from Low to Medium. · Jun 16 2020, 5:00 PM
bd808 subscribed.

This could be implemented with a service IP that is managed via keepalived, thanks to more modern tooling.
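
A rough sketch of what such a keepalived setup could look like; every value here is a placeholder, not the actual Toolforge configuration:

  # /etc/keepalived/keepalived.conf (sketch): priority would be set higher on
  # the preferred node, and check-redis-master is a hypothetical helper that
  # exits 0 only on the current Redis master.
  vrrp_script check_redis_master {
      script "/usr/local/bin/check-redis-master"
      interval 2
  }
  vrrp_instance redis_vip {
      state BACKUP
      interface eth0
      virtual_router_id 51
      priority 100
      virtual_ipaddress {
          172.16.3.26
      }
      track_script {
          check_redis_master
      }
  }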

Mentioned in SAL (#wikimedia-cloud) [2021-05-13T08:07:16Z] <Majavah> creating toolsbeta-redis-[1-3] as g3.cores1.ram2.disk20 to experiment with redis-sentinel / T153810

taavi added a subscriber: aborrero.

Assigning to @aborrero to create a virtual IP address that toolsbeta-redis-[1-3] can use with keepalived.

Change 690528 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] toolforge: Add separate role for Redis Sentinel

https://gerrit.wikimedia.org/r/690528

Mentioned in SAL (#wikimedia-cloud) [2021-05-14T11:16:45Z] <arturo> aborrero@cloudcontrol1005:~ $ sudo wmcs-openstack --os-project-id=toolsbeta port create --network lan-flat-cloudinstances2b toolsbeta-redis-vip (T153810)

Mentioned in SAL (#wikimedia-cloud) [2021-05-14T11:21:58Z] <arturo> allowed VIP address from the new port 172.16.3.26 into the ports of toolsbeta-redis-[1-3] (T153810)
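
That second step presumably used Neutron's allowed-address-pairs; with the OpenStack CLI it would look something like this (the port identifiers here are assumptions):

  # Permit the VIP on each backing instance's Neutron port:
  for port in toolsbeta-redis-1-port toolsbeta-redis-2-port toolsbeta-redis-3-port; do
      sudo wmcs-openstack --os-project-id=toolsbeta port set \
          --allowed-address ip-address=172.16.3.26 "$port"
  done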

I think you are all set to continue!

The automatic failover with Sentinel works. Next up, https://gerrit.wikimedia.org/r/c/operations/puppet/+/690528 needs to be reviewed, and then we can create a migration plan. I'm not yet sure whether the switch from the current cluster to the new one can be done without downtime.

I'd also like to turn the tools-redis name into a CNAME to something in svc.tools.eqiad1.wikimedia.cloud.

side note: The automatic Sentinel failover might take up to 15 seconds from the original master going down (five for Sentinel to notice it's down and ten for Keepalived to move over the VIP). I think that's acceptable, and much better than the current manual system, which requires +2 in operations/puppet.
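
A minimal sketch of the Sentinel side of that timing; the master name, quorum, and address are assumptions, and only the 5000 ms detection window comes from the numbers above:

  # /etc/redis/sentinel.conf (sketch): a quorum of 2 of the 3 Sentinels must agree
  sentinel monitor toolsredis 172.16.3.21 6379 2
  # declare the master down after 5 seconds without a valid reply
  sentinel down-after-milliseconds toolsredis 5000
  sentinel failover-timeout toolsredis 60000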

In T153810#7090003, @Majavah wrote:

I'd also like to turn the tools-redis name into a CNAME to something in svc.tools.eqiad1.wikimedia.cloud.

Currently tools-redis.tools.eqiad.wmflabs is a CNAME for tools-redis.svc.eqiad.wmflabs. At the time tools-redis.svc.eqiad.wmflabs was made the "canonical" service name, I don't think we had yet decided that having DNS zones for the various projects was easy to deal with.

The tools-redis.svc.eqiad.wmflabs A record is managed by the wmcs-wikireplica-dns.py script, which is provisioned by puppet. Its config is in ops/puppet.git:modules/openstack/files/util/wikireplica_dns.yaml. This might not be the right place to manage a new record if it is in a zone owned by the tools project. It can probably just be a manually created, Horizon-managed record pointing to the VIP that will float across the backing instances. The current tools-redis.svc.eqiad.wmflabs should also be made a CNAME to whatever record is considered canonical, possibly with a deprecation announcement so we don't have to keep that CNAME around forever.
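
Under that plan the naming chain would end up looking something like this (the canonical name is an assumption, following the svc.tools.eqiad1.wikimedia.cloud suggestion above):

  # Hypothetical end state of the records:
  #   tools-redis.tools.eqiad.wmflabs  CNAME  tools-redis.svc.tools.eqiad1.wikimedia.cloud
  #   tools-redis.svc.eqiad.wmflabs    CNAME  tools-redis.svc.tools.eqiad1.wikimedia.cloud
  #   tools-redis.svc.tools.eqiad1.wikimedia.cloud  A  (the floating VIP)
  # The chain can be verified with:
  dig +short tools-redis.tools.eqiad.wmflabs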

side note: The automatic Sentinel failover might take up to 15 seconds from the original master going down (five for Sentinel to notice it's down and ten for Keepalived to move over the VIP). I think that's acceptable, and much better than the current manual system, which requires +2 in operations/puppet.

15 seconds is an amazing reduction in time vs the current process! Seriously thank you very, very much for this work.

For anyone wondering, the current process would look something like this:

  • someone or something notices redis outage and yells
    • maybe toolschecker, but it only checks once per minute (or is it 5 minutes?)
  • someone who can do something about it learns that the service is down
  • they get to their computer
  • they verify the outage
  • they find the runbook for manual failover
  • they do the failover
    • hiera change
    • forced puppet runs
    • ops/puppet.git change
    • ops/puppet.git merge
    • forced puppet run
    • run of wmcs-wikireplica-dns.py

I would be astounded if, even in the most ideal case, this took less than 10 minutes of wall clock time today. That would make 15 s a 97.5% decrease, or, stated another way, 40 times faster (600 s / 15 s) than the recovery we would expect today.

Change 690528 merged by Bstorm:

[operations/puppet@production] toolforge: Add separate role for Redis Sentinel

https://gerrit.wikimedia.org/r/690528

Closing this since the Puppet patch was merged. The parent task will be used to switch from the current cluster to the new one.

Change 758090 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::redis_sentinel: fix hardcoded interface

https://gerrit.wikimedia.org/r/758090

Change 758090 merged by Andrew Bogott:

[operations/puppet@production] P:toolforge::redis_sentinel: fix hardcoded interface

https://gerrit.wikimedia.org/r/758090