
Switch etcd back to eqiad, document switchover procedure
Closed, Resolved · Public

Description

We want to make the eqiad etcd cluster the master again, partly in order to refine a switchover procedure.

At the moment, this procedure is very manual and involves several commits to puppet and the DNS. That's almost unavoidable if we want to avoid storing data about the etcd master *inside etcd*. I'd think twice before doing that.

The switchover is tentatively scheduled for May 31st at 09:00Z.

We expect etcd to remain fully available for reads during the switchover. There will be a short period in which neither cluster accepts writes; write attempts will fail with an "EtcdRootReadOnly" error.
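
For clients that write to etcd, the read-only window could be handled with a small retry loop. The following is only a sketch, not something the switchover itself relies on: `write_with_retry`, `ReadOnlyError` and their parameters are hypothetical names, and the exception class to catch is an assumption (with python-etcd it would presumably be the library's read-only exception, passed in as `read_only_error`).

```python
import time


class ReadOnlyError(Exception):
    """Hypothetical stand-in for the client's read-only error."""


def write_with_retry(write, retries=5, delay=1.0,
                     read_only_error=ReadOnlyError, sleep=time.sleep):
    """Call write() and retry while the cluster rejects writes.

    `write` is any zero-argument callable performing the write.
    Re-raises the read-only error once `retries` retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return write()
        except read_only_error:
            if attempt == retries:
                raise
            # The cluster is still read-only: wait, then try again.
            sleep(delay)
```

Reads need no such handling, since both clusters keep serving them throughout.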

The procedure will be as follows:

  • Reduce the TTL of the .conftool SRV records (the ones used by confctl, currently pointing to codfw)
  • Set both etcd clusters into read-only mode (eqiad currently already is)
  • Wait for replication to catch up (should be more or less instantaneous)
  • Stop replica in eqiad via puppet
  • Switch the SRV records for conftool to point to the eqiad cluster
  • Set the replication index in codfw to the current eqiad etcd index, then start replica in codfw via puppet. This procedure should be scripted.
  • Set the eqiad cluster in read-write mode
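
The "wait for replication to catch up" step could be checked mechanically rather than by eye. A minimal sketch, under assumptions: `wait_for_catch_up` is a hypothetical helper, and how the two indices are fetched is left to the caller (for example the source cluster's etcd index and the index recorded by the replica).

```python
import time


def wait_for_catch_up(get_source_index, get_replica_index,
                      timeout=30.0, interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the replica index reaches the source index.

    Both getters are zero-argument callables returning integers.
    Returns the replica index once it has caught up, or raises
    RuntimeError if that does not happen within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        source = get_source_index()
        replica = get_replica_index()
        if replica >= source:
            return replica
        if clock() >= deadline:
            raise RuntimeError(
                "replica at index {0}, source at {1}".format(replica, source))
        sleep(interval)
```

With both clusters read-only, the source index should stop moving, so the loop is expected to return almost immediately.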

Since we still don't have a generic spin-off of switchdc, I will just prepare a simple list of commands for every step.

Event Timeline

Joe created this task. · May 30 2017, 11:50 AM
Restricted Application added a subscriber: Aklapper. · May 30 2017, 11:50 AM
Joe claimed this task. · May 30 2017, 11:51 AM
Joe triaged this task as Medium priority.
Joe added a project: User-Joe.
Joe updated the task description. · May 31 2017, 8:25 AM
Joe added a comment. · May 31 2017, 8:28 AM

The simple script to set the replication index in codfw before starting replication:

import etcd
import sys

if len(sys.argv) < 3:
    raise RuntimeError("Usage: {prog} [REMOTE_HOST] [PREFIX]".format(prog=sys.argv[0]))

# Local connection without RW limitations
local = etcd.Client(port=2378)
# Remote connection to the cluster we're connecting to
remote = etcd.Client(host=sys.argv[1], port=2379, protocol='https')
# Let's get the remote etcd index
d = remote.read('/')
replica_index = d.etcd_index
replica_key = '/__replication/{prefix}'.format(prefix=sys.argv[2])
print("Writing index {idx} to key {key}".format(idx=replica_index, key=replica_key))
local.write(replica_key, replica_index)

Joe updated the task description. · May 31 2017, 8:42 AM
Joe added a comment. · Edited · May 31 2017, 9:00 AM

Play-by-play:

  1. Merge https://gerrit.wikimedia.org/r/356138
  2. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent' (begins read-only)
  3. sudo cumin 'R:class = role::configcluster' 'disable-puppet "etcd replication switchover"'
  4. Merge https://gerrit.wikimedia.org/r/#/c/356139
  5. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (stops replica in eqiad)
  6. Merge https://gerrit.wikimedia.org/r/#/c/356136/ and update DNS
  7. sudo cumin 'conf2002.codfw.wmnet' 'python /home/oblivian/switch_replica.py conf1001.eqiad.wmnet conftool' (sets the replication index in codfw)
  8. sudo cumin 'R:class = role::configcluster and *.codfw.wmnet' 'run-puppet-agent -e "etcd replication switchover"' (starts replica in codfw)
  9. Merge https://gerrit.wikimedia.org/r/356341
  10. sudo cumin 'R:class = role::configcluster and *.eqiad.wmnet' 'run-puppet-agent' (ends read-only)
  11. Merge and deploy https://gerrit.wikimedia.org/r/#/c/356137/
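
A list like the one above could be kept as data and walked through by a tiny runner, pending a generic switchdc-like tool. This is a sketch only: `STEPS` and `run_steps` are hypothetical names, just the first three steps are transcribed, manual steps (the Gerrit merges) are represented by `None` and merely prompt for confirmation, and the runner defaults to a dry run that prints without executing anything.

```python
import shlex
import subprocess

# (label, argv) pairs; argv is None for steps done by hand.
STEPS = [
    ("Merge https://gerrit.wikimedia.org/r/356138", None),
    ("Begin read-only in codfw",
     ["sudo", "cumin", "R:class = role::configcluster and *.codfw.wmnet",
      "run-puppet-agent"]),
    ("Disable puppet on the config cluster",
     ["sudo", "cumin", "R:class = role::configcluster",
      'disable-puppet "etcd replication switchover"']),
]


def run_steps(steps, dry_run=True, runner=subprocess.check_call,
              confirm=input, out=print):
    """Print each step in order; execute commands unless dry_run is set."""
    for number, (label, argv) in enumerate(steps, start=1):
        if argv is None:
            out("{0}. MANUAL: {1}".format(number, label))
            if not dry_run:
                confirm("Press Enter once done: ")
            continue
        command = " ".join(shlex.quote(arg) for arg in argv)
        out("{0}. {1}: {2}".format(number, label, command))
        if not dry_run:
            # check_call raises if the command fails, stopping the run.
            runner(argv)
```

Calling `run_steps(STEPS)` only prints the plan; `run_steps(STEPS, dry_run=False)` would execute the commands and pause at each manual step.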

Joe added a comment. · May 31 2017, 9:42 AM

All done; the play-by-play above is how I executed the switchover. I'll write up some more documentation and close the ticket as resolved.

Joe closed this task as Resolved. · May 31 2017, 12:57 PM