Create an etcd cluster in codfw
Closed, Resolved (Public)

Description

We need to set up a second etcd cluster in codfw.

I would adopt the following tactics:

  • Create a 3-node cluster on the conf2xxx servers already used for ZK
  • Use an nginx proxy instead of the builtin etcd auth for additional scalability
  • Replicate from the eqiad cluster with a process we can start/stop depending on which of the two clusters is the master.

Unfortunately, replication would only be this easy to accomplish with etcd 3, where etcdctl has a "make-mirror" command that does just that.

I'd stop short of suggesting we upgrade to etcd 3 at the moment (although it might be a good idea at a later point).

Joe created this task. Jan 23 2017, 3:29 PM
Joe added a project: User-Joe.
Joe moved this task from Backlog to Doing on the User-Joe board. Jan 24 2017, 8:34 AM
Joe claimed this task.

Change 334123 had a related patch set uploaded (by Giuseppe Lavagetto):
etcd: add ability to use a TLS/auth proxy

https://gerrit.wikimedia.org/r/334123

Change 334124 had a related patch set uploaded (by Giuseppe Lavagetto):
role::etcd::common: move to profile, refactor

https://gerrit.wikimedia.org/r/334124

Change 334126 had a related patch set uploaded (by Giuseppe Lavagetto):
profile::etcd::tlsproxy: nginx auth proxy for etcd

https://gerrit.wikimedia.org/r/334126

Change 334127 had a related patch set uploaded (by Giuseppe Lavagetto):
conf2xx: install etcd cluster

https://gerrit.wikimedia.org/r/334127

Change 334123 merged by Giuseppe Lavagetto:
etcd: add ability to use a TLS/auth proxy

https://gerrit.wikimedia.org/r/334123

Change 334124 merged by Giuseppe Lavagetto:
role::etcd::common: move to profile, refactor

https://gerrit.wikimedia.org/r/334124

Change 334126 merged by Giuseppe Lavagetto:
profile::etcd::tlsproxy: nginx auth proxy for etcd

https://gerrit.wikimedia.org/r/334126

Change 334312 had a related patch set uploaded (by Giuseppe Lavagetto):
Add SRV records for etcd peer discovery in codfw

https://gerrit.wikimedia.org/r/334312

Change 334312 merged by Giuseppe Lavagetto:
Add SRV records for etcd peer discovery in codfw

https://gerrit.wikimedia.org/r/334312

Change 334127 merged by Giuseppe Lavagetto:
conf2xx: install etcd cluster

https://gerrit.wikimedia.org/r/334127

Change 334426 had a related patch set uploaded (by Volans):
etcd: add missing group definition for codfw cluster

https://gerrit.wikimedia.org/r/334426

Change 334426 merged by Volans:
etcd: add missing group definition for codfw cluster

https://gerrit.wikimedia.org/r/334426

Joe added a comment. Jan 30 2017, 11:25 AM

The cluster in codfw is installed and has been tested to work correctly with conftool. The performance of the cluster with nginx as a TLS/auth proxy seems to be much better too.

For now, we are able to dump the content of the eqiad cluster and load it into the codfw cluster.
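A minimal sketch of what such a dump and load could look like against the etcd v2 keys API, not the procedure actually used here. The function names and the `base_url`/`prefix` arguments are assumptions for illustration; the only etcd behaviour relied on is the recursive `GET` and the form-encoded `PUT` of the v2 API.

```python
import json
from urllib import parse, request


def flatten(node):
    """Yield (key, value) for every leaf below an etcd v2 node."""
    if node.get("dir"):
        # Empty directories carry no "nodes" entry and yield nothing.
        for child in node.get("nodes", []):
            yield from flatten(child)
    elif "key" in node:
        yield node["key"], node.get("value", "")


def dump(base_url, prefix):
    """Fetch the whole tree below `prefix` with one recursive GET."""
    url = f"{base_url}/v2/keys{parse.quote(prefix)}?recursive=true"
    with request.urlopen(url) as resp:
        return dict(flatten(json.load(resp)["node"]))


def load(base_url, data):
    """Write every dumped key to the target cluster with a v2 PUT."""
    for key, value in data.items():
        body = parse.urlencode({"value": value}).encode()
        url = f"{base_url}/v2/keys{parse.quote(key)}"
        req = request.Request(url, data=body, method="PUT")
        request.urlopen(req).close()
```

Note that this sketch drops empty directories and TTLs, and gives no consistency guarantee if the source changes during the dump.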

I am looking at ways to do replication, but apart from people asking for it, I found very little about this for etcd version 2, while etcd version 3 has an official tool for it within etcdctl, make-mirror.

I would say our best bet is to port make-mirror to version 2 of the API, where its logic should be even simpler, or to just reproduce the functionality. In fact, I wrote a small replication system yesterday as a proof of concept; depending on how confident I feel about it, I might propose using it.
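To illustrate why the logic is simple on the v2 API, here is a hedged sketch of the core of such a replicator: applying one v2 watch event to the destination under a rewritten key prefix. This is not the proof of concept mentioned above. `store` is a plain dict standing in for the destination cluster, and the function and parameter names are hypothetical.

```python
# Actions reported by the etcd v2 watch API, grouped by their effect.
WRITE_ACTIONS = {"set", "create", "update", "compareAndSwap"}
DELETE_ACTIONS = {"delete", "expire", "compareAndDelete"}


def rewrite_key(key, src_prefix, dst_prefix):
    """Map a source key onto the destination prefix, or None if out of scope."""
    src = src_prefix.rstrip("/")
    if key != src and not key.startswith(src + "/"):
        return None
    return dst_prefix.rstrip("/") + key[len(src):]


def apply_event(store, event, src_prefix, dst_prefix):
    """Apply one watch event to `store`; return the next waitIndex to use."""
    node = event["node"]
    key = rewrite_key(node["key"], src_prefix, dst_prefix)
    if key is not None:
        if event["action"] in WRITE_ACTIONS and not node.get("dir"):
            store[key] = node.get("value", "")
        elif event["action"] in DELETE_ACTIONS:
            # Deleting a directory also removes everything below it.
            for k in [k for k in store if k == key or k.startswith(key + "/")]:
                del store[k]
    return node["modifiedIndex"] + 1
```

A real replicator would loop on `GET /v2/keys/<prefix>?wait=true&recursive=true&waitIndex=<n>`, feed each response to a function like this, and fall back to a full resync when the source no longer retains the requested index.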

Mentioned in SAL (#wikimedia-operations) [2017-02-07T10:02:48Z] <_joe_> uploaded etcd-mirror 0.0.1 to jessie-wikimedia (T156009)

Change 336596 had a related patch set uploaded (by Giuseppe Lavagetto):
role::configcluster: enable replication from eqiad to codfw

https://gerrit.wikimedia.org/r/336596

Change 336596 merged by Giuseppe Lavagetto:
role::configcluster: enable replication from eqiad to codfw

https://gerrit.wikimedia.org/r/336596

Joe added a comment. Feb 9 2017, 7:36 AM

The codfw cluster is getting replicated data from eqiad under /eqiad.wmnet/conftool.

What remains to be done:

  • Monitor the replication process (it exposes both its health and Prometheus-style metrics)
  • Deploy etcd-mirror 0.0.2, which has an important fix
  • Change puppet so that it's easy for us to change both the replication direction and which cluster we're using.
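
For the monitoring item, a check could scrape the Prometheus-style metrics and alert on a lag threshold. The sketch below parses the Prometheus text exposition format and tests one sample. The metric name `example_replication_lag` is made up for illustration and is not a name etcd-mirror is known to export. The parser also assumes samples carry no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {sample_name: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def replication_is_healthy(samples, lag_metric, max_lag):
    """True only when the lag sample exists and does not exceed max_lag."""
    lag = samples.get(lag_metric)
    return lag is not None and lag <= max_lag


if __name__ == "__main__":
    scraped = "# TYPE example_replication_lag gauge\nexample_replication_lag 3\n"
    print(replication_is_healthy(parse_metrics(scraped), "example_replication_lag", 5))
```

Treating a missing sample as unhealthy means a dead replication process alerts instead of passing silently.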

Change 336850 had a related patch set uploaded (by Giuseppe Lavagetto):
profile::etcd::replication: refactor to make failover easier

https://gerrit.wikimedia.org/r/336850

Change 336850 merged by Giuseppe Lavagetto:
profile::etcd::replication: refactor to make failover easier

https://gerrit.wikimedia.org/r/336850

Joe closed this task as "Resolved". Feb 24 2017, 5:27 PM