
Create etcd VMs for use with ML platform
Closed, Resolved (Public)

Description

We need three VMs each in codfw and eqiad.

Specs: 1 CPU, 3G of RAM, 20G of disk for the root fs

Event Timeline

Hostnames: ml-etcd100x.eqiad and ml-etcd200x.codfw
For networking, we want row diversity, which should be easy enough for VMs this tiny.
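
For reference, the naming scheme above can be expanded into full hostnames with a short sketch like the one below. The `.wmnet` suffix and the 1001-1003 / 2001-2003 numbering match the hosts named in the cookbook entries on this task. The helper names (`ml_etcd_hosts`, `rows_are_diverse`) and the row labels in the usage note are hypothetical and only illustrate what "row diversity" means here: no two VMs of one cluster in the same row.

```python
# Sketch only: helper names and row labels are illustrative assumptions.
SITES = {"eqiad": 1, "codfw": 2}  # leading digit of the host number per site


def ml_etcd_hosts(site, count=3):
    """Return FQDNs such as ml-etcd1001.eqiad.wmnet for one site."""
    digit = SITES[site]
    return [f"ml-etcd{digit}{n:03d}.{site}.wmnet" for n in range(1, count + 1)]


def rows_are_diverse(placement):
    """True if every host in the mapping host -> row sits in a different row."""
    return len(set(placement.values())) == len(placement)


ALL_HOSTS = {site: ml_etcd_hosts(site) for site in SITES}
```

For example, `rows_are_diverse(dict(zip(ALL_HOSTS["eqiad"], ["A", "B", "C"])))` is true, while a placement that puts two of the three VMs in row "A" would fail the check.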

All machines now have their base install (Puppet runs completed with the insetup role).

Change 663200 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] Add etcd role for ML Team's new clusters

https://gerrit.wikimedia.org/r/663200

Change 663836 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/dns@master] dns: Add SRV records for ml-etcd clusters

https://gerrit.wikimedia.org/r/663836

Change 663836 merged by Alexandros Kosiaris:
[operations/dns@master] dns: Add SRV records for ml-etcd clusters

https://gerrit.wikimedia.org/r/663836
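
As a rough illustration of what SRV records for etcd peer discovery generally look like, here is a minimal sketch. It assumes etcd's documented DNS discovery convention (`_etcd-server-ssl._tcp.<domain>` for TLS peers) and the default peer port 2380. The discovery domain, TTL, priority and weight below are placeholders; the actual records are in the change linked above and may differ.

```python
# Sketch only: discovery domain, TTL, priority and weight are assumptions,
# not values taken from the merged DNS change.
def etcd_srv_records(discovery_domain, hosts, port=2380, ttl=300):
    """Render one zone-file style SRV line per etcd peer."""
    name = f"_etcd-server-ssl._tcp.{discovery_domain}."
    return [f"{name} {ttl} IN SRV 0 1 {port} {host}." for host in hosts]


EXAMPLE_RECORDS = etcd_srv_records(
    "discovery.example",  # placeholder domain
    [f"ml-etcd100{n}.eqiad.wmnet" for n in (1, 2, 3)],
)
```

Each line maps the shared service name to one cluster member, which is how an etcd node started with DNS discovery can locate its peers.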

Change 664568 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] secrets: Add dummy keys for ml_etcd clusters

https://gerrit.wikimedia.org/r/664568

Change 664568 merged by Klausman:
[labs/private@master] secrets: Add dummy keys for ml_etcd clusters

https://gerrit.wikimedia.org/r/664568

Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:44:03Z] <klausman@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071

Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:44:08Z] <klausman@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071

Change 663200 merged by Klausman:
[operations/puppet@production] Add etcd role for ML Team's new clusters

https://gerrit.wikimedia.org/r/663200

Change 664587 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] files/ssl: Fix broken name of ML etcd SSL certs

https://gerrit.wikimedia.org/r/664587

Change 664587 merged by Klausman:
[operations/puppet@production] files/ssl: Fix broken name of ML etcd SSL certs

https://gerrit.wikimedia.org/r/664587

root@ml-etcd1001:~# etcdctl -C https://ml-etcd1001.eqiad.wmnet:2379 cluster-health
member 27250fb9655951c0 is healthy: got healthy result from https://ml-etcd1003.eqiad.wmnet:2379
member bec8796f64226950 is healthy: got healthy result from https://ml-etcd1002.eqiad.wmnet:2379
member ec678e26e1c1f07a is healthy: got healthy result from https://ml-etcd1001.eqiad.wmnet:2379
cluster is healthy
root@ml-etcd1001:~#

Mentioned in SAL (#wikimedia-operations) [2021-02-16T16:25:19Z] <klausman@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071

Mentioned in SAL (#wikimedia-operations) [2021-02-16T16:25:24Z] <klausman@cumin2001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071

Change 664595 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] site: move mml-etcd in codfw from insetup to etcd role

https://gerrit.wikimedia.org/r/664595

Change 664595 merged by Klausman:
[operations/puppet@production] site: move mml-etcd in codfw from insetup to etcd role

https://gerrit.wikimedia.org/r/664595

root@ml-etcd2001:~# etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
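
If this check needs to be repeated, the `cluster-health` output shown above could be parsed with a small sketch like the following. It relies only on the line format visible in the transcripts on this task; the function name `parse_cluster_health` is hypothetical, and any member state other than "healthy" is simply treated as not healthy.

```python
import re

# Matches lines such as "member 27250fb9655951c0 is healthy: got healthy ..."
MEMBER_RE = re.compile(r"^member (?P<id>[0-9a-f]+) is (?P<state>\w+)")

SAMPLE = """\
member 27250fb9655951c0 is healthy: got healthy result from https://ml-etcd1003.eqiad.wmnet:2379
member bec8796f64226950 is healthy: got healthy result from https://ml-etcd1002.eqiad.wmnet:2379
member ec678e26e1c1f07a is healthy: got healthy result from https://ml-etcd1001.eqiad.wmnet:2379
cluster is healthy
"""


def parse_cluster_health(output):
    """Return ({member_id: is_healthy}, cluster_is_healthy) from etcdctl output."""
    members = {}
    cluster_ok = False
    for line in output.splitlines():
        line = line.strip()
        match = MEMBER_RE.match(line)
        if match:
            members[match.group("id")] = match.group("state") == "healthy"
        elif line == "cluster is healthy":
            cluster_ok = True
    return members, cluster_ok
```

Calling `parse_cluster_health(SAMPLE)` yields three members, all marked healthy, and a true cluster flag; the same function can be fed the captured output from the codfw cluster.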