
etcd switchover/enhancements
Open, MediumPublic

Description

To bring etcd more in line with our current goals, we want to do the following:

  • Set up a MirrorMaker-like replica
  • Switchover to codfw
  • Re-configure the eqiad cluster to use the TLS proxy
  • Allow reading from the nearest datacenter (optional)

Here are my ideas for this:

MirrorMaker-like replica

Temporary "emergency" switchover

We need to switch over to the codfw cluster because of time-sensitive maintenance in eqiad, so for now we need to do the following:

  • Verify the replica from eqiad to codfw is currently set up correctly for '/conftool'
  • Add a second SRV record for etcd.rw or something similar that can be used by conftool, so that all writes can be managed that way
  • Reconfigure conftool to use it
  • Reduce the TTL of all SRV records
  • Switch the configurations of a few pybal hosts to use codfw, possibly just the backups; verify that the data after the restart matches the other member of each pair
  • Switch the other clients (pybals included) by changing the SRV records for everything but conftool; verify this actively removes the connections from those hosts to eqiad
  • Switch the record for conftool too
  • Stop the etcd replication eqiad => codfw
  • Re-raise the TTL of all SRV records
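The two "verify the data matches" steps above amount to diffing the same key tree on both clusters. A minimal sketch of that comparison (the function and sample trees are illustrative, not an existing conftool/etcd tool):

```python
# Sketch: verify that two etcd trees hold identical data under a prefix.
# The dicts stand in for real etcd client reads of the /conftool tree.

def diff_trees(source: dict, replica: dict) -> dict:
    """Return keys whose values differ (or are missing) between trees."""
    mismatches = {}
    for key in set(source) | set(replica):
        if source.get(key) != replica.get(key):
            mismatches[key] = (source.get(key), replica.get(key))
    return mismatches

eqiad = {"/conftool/v1/pools/a": "pooled=yes", "/conftool/v1/pools/b": "pooled=no"}
codfw = {"/conftool/v1/pools/a": "pooled=yes", "/conftool/v1/pools/b": "pooled=no"}
assert diff_trees(eqiad, codfw) == {}  # replica is consistent
```

An empty result means the replica is safe to cut over to; any mismatched key would need investigation before the switch.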

A longer-term plan

At the moment, we're interested only in replicating conftool data. I have given this some thought and came to the conclusion that the best course of action is the following:

  1. Copy the data currently in /conftool in eqiad under /eqiad.wmnet/conftool by starting a replica
  2. Add an etcd_masterdc variable to puppet, and have the conftool_prefix hiera variable depend on it. This will make conftool/confd read from and write to this new directory.
  3. Once puppet has run everywhere, all reads/writes will go to the new tree
  4. Replicate this tree to codfw 1:1
  5. On codfw, create /codfw.wmnet/conftool and replicate it 1:1 to eqiad

For now, we will configure everything to read from and write to eqiad only. Given that conftool is not able to write to multiple clusters, this may not seem very useful, but it helps with our next goals, as we shall see. Also, if we decide to change the way conftool works and allow multi-DC writes, we can benefit from this.
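Step 1 above is essentially a key rewrite from the flat /conftool prefix into a per-DC tree. A sketch of that mapping (the function name and layout are assumptions, not etcdmirror's actual interface):

```python
# Sketch: map keys from the flat /conftool prefix to a per-DC tree,
# as in step 1 of the plan. Names are illustrative.

def rewrite_key(key: str, dc: str = "eqiad.wmnet") -> str:
    """Prefix a /conftool key with the datacenter tree."""
    assert key.startswith("/conftool")
    return "/" + dc + key

print(rewrite_key("/conftool/v1/pools/appserver"))
# /eqiad.wmnet/conftool/v1/pools/appserver
```

The codfw mirror in step 5 is the same transformation with dc="codfw.wmnet", applied to its own local tree.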

Allow reading from the nearest datacenter

Once we've defined etcd_masterdc in puppet, we will be able to make servers in the various DCs read from the nearest available datacenter under the correct index, simply by changing the SRV records in the DNS. It would be wise to introduce different DNS records for reads and writes, so that conftool will always connect to the master. I think this could even come from discovery, but the level of etcdinception would make me uncomfortable. So, manual records for now!
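The read/write split described above could look like this in client endpoint selection (the SRV record names here are hypothetical; the real ones would live in the operations/dns repo):

```python
# Sketch: pick the etcd SRV record for reads vs writes.
# Record names are made up for illustration.

def etcd_srv_record(op: str, local_dc: str, master_dc: str) -> str:
    if op == "write":
        # conftool must always talk to the master DC
        return f"_etcd-rw._tcp.{master_dc}"
    # reads can go to the nearest DC
    return f"_etcd-ro._tcp.{local_dc}"

assert etcd_srv_record("write", "codfw.wmnet", "eqiad.wmnet") == "_etcd-rw._tcp.eqiad.wmnet"
assert etcd_srv_record("read", "codfw.wmnet", "eqiad.wmnet") == "_etcd-ro._tcp.codfw.wmnet"
```

Keeping the write record pinned to the master DC is what lets reads fan out safely while conftool stays consistent.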

Switchover to codfw

Whenever we want to switch over, the steps will be:

  1. Set up a second, temporary local replication in codfw from /eqiad.wmnet/ to /codfw.wmnet/
  2. Change the etcd_masterdc variable and run puppet everywhere it matters
  3. Stop the temporary replication
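The three switchover steps can be sketched as a small orchestration function, with the real operations injected as callables (all names here are stand-ins, not actual tooling):

```python
# Sketch of the switchover sequence, with the actual operations
# injected as callables. These are hypothetical stand-ins.

def switch_master(start_replication, set_masterdc, run_puppet, stop_replication):
    start_replication(src="/eqiad.wmnet/conftool", dst="/codfw.wmnet/conftool")
    set_masterdc("codfw")        # flip etcd_masterdc in hiera
    run_puppet()                 # clients now read/write /codfw.wmnet/conftool
    stop_replication()           # the temporary local replica is no longer needed

log = []
switch_master(
    start_replication=lambda **kw: log.append(("replicate", kw["src"], kw["dst"])),
    set_masterdc=lambda dc: log.append(("masterdc", dc)),
    run_puppet=lambda: log.append("puppet"),
    stop_replication=lambda: log.append("stop"),
)
```

The ordering matters: the temporary replication must be running before clients start reading the codfw tree, and is only stopped once puppet has moved everything over.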

Re-configure the eqiad cluster to use TLS proxy

We will need to do the following:

  1. Ensure nothing reads from eqiad by changing DNS/other configs
  2. Prepare the new ECDSA certs, and commit them to puppet
  3. Disable auth (can be done via etcd-manage)
  4. Disable puppet across the config cluster in eqiad
  5. Switch the conf1* servers to use role::configcluster
  6. One machine at a time, stop etcd, run puppet, verify the server has reconnected to its current cluster.
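Step 6 is a standard rolling restart; a sketch with the per-host operations injected (hypothetical stand-ins for the real commands, not existing tooling):

```python
# Sketch of step 6: one machine at a time, stop etcd, run puppet,
# and verify the node has rejoined before moving on.

def rolling_reconfigure(hosts, stop_etcd, run_puppet, is_healthy):
    """Reconfigure hosts serially, aborting on the first failure so
    the cluster never loses more than one member at a time."""
    for host in hosts:
        stop_etcd(host)
        run_puppet(host)  # brings etcd back up behind the TLS proxy
        if not is_healthy(host):
            raise RuntimeError(f"{host} did not rejoin the cluster; aborting")

calls = []
rolling_reconfigure(
    ["conf1001", "conf1002"],
    stop_etcd=lambda h: calls.append(("stop", h)),
    run_puppet=lambda h: calls.append(("puppet", h)),
    is_healthy=lambda h: True,
)
```

Aborting on the first unhealthy host is the important property: it keeps quorum intact if a reconfiguration goes wrong mid-rollout.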

Event Timeline

Joe created this task. Mar 6 2017, 10:38 AM
Restricted Application added a subscriber: Aklapper. Mar 6 2017, 10:38 AM
Joe added a comment. Mar 6 2017, 10:52 AM

Just to give some context: it might be possible to try to have a true multi-dc cluster for etcd, but that will need:

  • N machines in eqiad
  • N machines in codfw
  • 1 or 2 tiebreakers, probably in ULSFO, to account for inter-DC network partitions

It will also need some fine-tuning and extensive testing, because I suspect raft over high-latency links can be pretty demanding in terms of write latency too.

I am willing to give it a chance, but it will take time and effort to test; and a move to etcd v3 would probably be a good idea in that case. I have more short-term goals in mind at the moment, like a relatively easy-to-achieve active-active read, active-passive write configuration.
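For reference, the majority math behind the node counts above (plain raft quorum; the 3+3+1 layout is just an example, not a committed design):

```python
# Sketch: raft quorum sizes for the multi-DC layout described above
# (N nodes per main DC plus tiebreakers). Plain majority math.

def quorum(total_nodes: int) -> int:
    """Minimum votes needed for a raft majority."""
    return total_nodes // 2 + 1

# e.g. 3 eqiad + 3 codfw + 1 ulsfo tiebreaker = 7 nodes, quorum 4:
assert quorum(3 + 3 + 1) == 4
# losing a whole main DC (3 nodes) still leaves 4 votes, so writes continue
```

This is why the tiebreaker in a third site matters: with only two equal DCs, losing either one (or the link between them) leaves at most half the votes on each side, and neither side can reach quorum.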

Joe triaged this task as Medium priority. Mar 6 2017, 11:50 AM
Joe added a project: User-Joe.
Joe moved this task from Backlog to Doing on the User-Joe board. Mar 8 2017, 3:13 PM

Change 341989 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet] conftool: switch prefix to /eqiad.wmnet/conftool

https://gerrit.wikimedia.org/r/341989

Joe moved this task from Doing to Backlog on the User-Joe board. Apr 3 2017, 6:42 AM
Joe updated the task description. Apr 20 2017, 10:47 AM
Joe moved this task from Backlog to Doing on the User-Joe board. Apr 20 2017, 10:50 AM
Volans added a subscriber: Volans. Apr 20 2017, 11:54 AM

Change 349380 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/dns@master] Add separated SRV records for etcd to consume for conftool

https://gerrit.wikimedia.org/r/349380

Change 349385 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] role::configcluster: reconfigure etcd replication

https://gerrit.wikimedia.org/r/349385

Change 349386 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] etcd: make our rw clients use the new SRV record

https://gerrit.wikimedia.org/r/349386

Change 349385 merged by Giuseppe Lavagetto:
[operations/puppet@production] role::configcluster: reconfigure etcd replication

https://gerrit.wikimedia.org/r/349385

Joe updated the task description. Apr 21 2017, 6:18 PM
Joe added a comment. Apr 21 2017, 6:20 PM

I've set up the replica and prepared changes for most next steps. When I'm back on Wednesday morning, we can decide if we want to failover to the new cluster directly or just do it in case something bad happens with the network maintenance and the eqiad cluster, and perform the switchover at a later date.

This is still needed to move the eqiad cluster away from its current setup, where auth is enabled at the etcd layer; that is expensive and we want to avoid it.

Change 349380 merged by Alexandros Kosiaris:
[operations/dns@master] Add separated SRV records for etcd to consume for conftool

https://gerrit.wikimedia.org/r/349380

Change 349386 merged by Alexandros Kosiaris:
[operations/puppet@production] etcd: make our rw clients use the new SRV record

https://gerrit.wikimedia.org/r/349386

akosiaris updated the task description. Apr 25 2017, 1:30 PM

Change 350204 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Use conf2001 for secondary eqiad LVS's pybal

https://gerrit.wikimedia.org/r/350204

Change 350204 merged by Alexandros Kosiaris:
[operations/puppet@production] Use conf2001 for secondary eqiad LVS's pybal

https://gerrit.wikimedia.org/r/350204

lvs1004, lvs1005 and lvs1006 now successfully use conf2001, per the patch above. Proceeding with the rest of the plan.

akosiaris updated the task description. Apr 25 2017, 1:57 PM

Change 350212 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns@master] Lower TTL for etcd client records

https://gerrit.wikimedia.org/r/350212

Change 350216 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns@master] Switch conftool etcd records to codfw

https://gerrit.wikimedia.org/r/350216

Change 350223 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Switch all pybals to using codfw etcd cluster

https://gerrit.wikimedia.org/r/350223

Change 350225 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns@master] Increase TTL for etcd client records

https://gerrit.wikimedia.org/r/350225

Change 350212 merged by Alexandros Kosiaris:
[operations/dns@master] Lower TTL for etcd client records

https://gerrit.wikimedia.org/r/350212

Change 350223 merged by Alexandros Kosiaris:
[operations/puppet@production] Switch all pybals to using codfw etcd cluster

https://gerrit.wikimedia.org/r/350223

Change 350214 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns@master] Swap etcd client records to point to codfw

https://gerrit.wikimedia.org/r/350214

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:33:15Z] <akosiaris> restart pybal on lvs[2004-2006].codfw.wmnet,lvs3004.esams.wmnet,lvs4004.ulsfo.wmnet,lvs[1004-1006].wikimedia.org T159687

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:47:34Z] <akosiaris> restart pybal on lvs2003.codfw.wmnet,lvs3003.esams.wmnet,lvs4003.ulsfo.wmnet,lvs1003.wikimedia.org T159687

Mentioned in SAL (#wikimedia-operations) [2017-04-25T15:59:30Z] <akosiaris> restart pybal on lvs[2001-2002].codfw.wmnet,lvs[3001-3002].esams.wmnet,lvs[4001-4002].ulsfo.wmnet,lvs[1001-1002].wikimedia.org T159687

Change 350214 merged by Alexandros Kosiaris:
[operations/dns@master] Swap etcd client records to point to codfw

https://gerrit.wikimedia.org/r/350214

akosiaris updated the task description. Apr 25 2017, 4:46 PM

I've restarted confd across the fleet after merging the DNS change above, in order for it to be picked up by the daemons (5 minutes had passed and I saw no difference in the number of ESTABLISHED connections in lsof output).

A quick test with mw2255 pooling and depooling verified that everything continues to work fine.

I've left changing the conftool DNS records and stopping the replication for tomorrow morning.

Change 350365 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/dns@master] Reset TTL on etcd RO client record, lower it on RW ones

https://gerrit.wikimedia.org/r/350365

Change 350366 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/dns@master] Switch etcd records to codfw

https://gerrit.wikimedia.org/r/350366

Change 350367 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/dns@master] Restore TTL for RW etcd records

https://gerrit.wikimedia.org/r/350367

Change 350365 merged by Giuseppe Lavagetto:
[operations/dns@master] Reset TTL on etcd RO client record, lower it on RW ones

https://gerrit.wikimedia.org/r/350365

Change 350368 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] role::configcluster: stop replicating to codfw for etcd

https://gerrit.wikimedia.org/r/350368

Change 350366 merged by Giuseppe Lavagetto:
[operations/dns@master] Switch etcd records to codfw

https://gerrit.wikimedia.org/r/350366

Joe updated the task description. Apr 26 2017, 6:35 AM

Change 350368 merged by Giuseppe Lavagetto:
[operations/puppet@production] role::configcluster: stop replicating to codfw for etcd

https://gerrit.wikimedia.org/r/350368

Change 350367 merged by Giuseppe Lavagetto:
[operations/dns@master] Restore TTL for RW etcd records

https://gerrit.wikimedia.org/r/350367

Joe updated the task description. Apr 26 2017, 6:58 AM
Joe added a comment. Apr 26 2017, 7:00 AM

All clients have been successfully switched to codfw, and replication has been stopped; I tested depooling and repooling a client (to test again that nginx-based auth works) and everything seems to be working flawlessly so far.

I'll start working ASAP on moving conf1001-1003 to role::configcluster and drop the builtin auth module of etcd.

Mentioned in SAL (#wikimedia-operations) [2017-05-02T06:46:29Z] <_joe_> disabling etcd auth on conf1*, converting to use nginx for TLS/auth T159687

Joe added a comment. May 2 2017, 7:58 AM

I converted the etcd cluster in eqiad to use nginx for auth/TLS, moved to ecdsa certs with the correct SANs, and started replication codfw => eqiad.

I might start making clients read from eqiad shortly. That would basically resolve the initial purpose of this ticket, but it's still a bit short of an ideal, or even good, situation.

Specifically I want to work on etcdmirror so that it can do the following:

  • Have a mode in which, if no replication data is available, it can start fresh automatically (for cluster bootstrap)
  • Allow reading a defaults file where reload commands etc. can be added manually
  • Make etcdmirror read from the SOURCE cluster only if that cluster is active for replication; if not, just do nothing.
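The first bullet could be as simple as falling back to the source's current index when no replication state exists yet; a sketch (etcdmirror's real internals differ, and these names are illustrative):

```python
# Sketch of the bootstrap mode: resume replication from the stored
# index, or start fresh from the source's current index when nothing
# was stored (a brand-new destination cluster).

def replication_start_index(stored_index, src_current_index):
    """Pick the index replication should start from."""
    if stored_index is None:
        return src_current_index  # bootstrap: full initial copy, then follow
    return stored_index

assert replication_start_index(None, 1500) == 1500
assert replication_start_index(1200, 1500) == 1200
```

The point is removing the manual step of seeding a fresh cluster before replication can start.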

Change 351257 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::etcd::tlsproxy: turn off proxy buffering

https://gerrit.wikimedia.org/r/351257

Change 351257 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::etcd::tlsproxy: turn off proxy buffering

https://gerrit.wikimedia.org/r/351257

Change 353231 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::etcd::tlsproxy: allow read-only mode

https://gerrit.wikimedia.org/r/353231

Joe moved this task from Doing to Backlog on the User-Joe board. May 15 2017, 9:49 AM

Change 353231 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::etcd::tlsproxy: allow read-only mode

https://gerrit.wikimedia.org/r/353231

Change 350225 abandoned by Alexandros Kosiaris:
Increase TTL for etcd client records

Reason:
No longer relevant

https://gerrit.wikimedia.org/r/350225

Change 350216 abandoned by Alexandros Kosiaris:
Switch conftool etcd records to codfw

Reason:
No longer relevant

https://gerrit.wikimedia.org/r/350216

Change 341989 abandoned by Giuseppe Lavagetto:
conftool: switch prefix to /eqiad.wmnet/conftool

Reason:
we went in another direction.

https://gerrit.wikimedia.org/r/341989