We need 3 VMs each in codfw and eqiad
Specs: 1 CPU, 3G of RAM, 20G of disk for root fs
Status | Assigned | Task
---|---|---
Resolved | None | T272917 Lift Wing proof of concept
Resolved | klausman | T272918 Create ml-serve k8s cluster
Resolved | klausman | T273071 Create etcd VMs for use with ML platform
Hostnames: ml-etcd100x.eqiad and ml-etcd200x.codfw
For networking, we want row diversity, which should be easy enough for VMs this tiny.
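As a rough illustration of the naming scheme, here is a minimal Python sketch that expands the `ml-etcd100x` / `ml-etcd200x` pattern into three hosts per site. The function name `etcd_hostnames`, the `SITE_BASE` mapping, and the default count are assumptions made for this example. The `.wmnet` suffix is taken from the fully qualified names used in the SAL entries on this task.

```python
# Hypothetical helper: expands the per-site numbering pattern into hostnames.
# SITE_BASE and etcd_hostnames are illustrative names, not part of any WMF tool.
SITE_BASE = {"eqiad": 1000, "codfw": 2000}


def etcd_hostnames(site, count=3):
    """Return the FQDNs ml-etcd<base+1>..<base+count> for the given site."""
    base = SITE_BASE[site]
    return [f"ml-etcd{base + i}.{site}.wmnet" for i in range(1, count + 1)]


if __name__ == "__main__":
    for site in SITE_BASE:
        print(site, etcd_hostnames(site))
```

The sketch covers naming only. Row placement is decided when the VMs are created and is not modelled here.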
Change 663200 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] Add etcd role for ML Team's new clusters
Change 663836 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/dns@master] dns: Add SRV records for ml-etcd clusters
Change 663836 merged by Alexandros Kosiaris:
[operations/dns@master] dns: Add SRV records for ml-etcd clusters
Change 664568 had a related patch set uploaded (by Klausman; owner: Klausman):
[labs/private@master] secrets: Add dummy keys for ml_etcd clusters
Change 664568 merged by Klausman:
[labs/private@master] secrets: Add dummy keys for ml_etcd clusters
Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:44:03Z] <klausman@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071
Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:44:08Z] <klausman@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[1001-1003].eqiad.wmnet with reason: klausman: Pushing new etcd changes from T273071
Change 663200 merged by Klausman:
[operations/puppet@production] Add etcd role for ML Team's new clusters
Change 664587 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] files/ssl: Fix broken name of ML etcd SSL certs
Change 664587 merged by Klausman:
[operations/puppet@production] files/ssl: Fix broken name of ML etcd SSL certs
root@ml-etcd1001:~# etcdctl -C https://ml-etcd1001.eqiad.wmnet:2379 cluster-health
member 27250fb9655951c0 is healthy: got healthy result from https://ml-etcd1003.eqiad.wmnet:2379
member bec8796f64226950 is healthy: got healthy result from https://ml-etcd1002.eqiad.wmnet:2379
member ec678e26e1c1f07a is healthy: got healthy result from https://ml-etcd1001.eqiad.wmnet:2379
cluster is healthy
root@ml-etcd1001:~#
Mentioned in SAL (#wikimedia-operations) [2021-02-16T16:25:19Z] <klausman@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071
Mentioned in SAL (#wikimedia-operations) [2021-02-16T16:25:24Z] <klausman@cumin2001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd[2001-2003].codfw.wmnet with reason: klausman: Pushing new etcd changes from T273071
Change 664595 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] site: move mml-etcd in codfw from insetup to etcd role
Change 664595 merged by Klausman:
[operations/puppet@production] site: move mml-etcd in codfw from insetup to etcd role
root@ml-etcd2001:~# etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
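If this check needs to be repeated or scripted, the `cluster-health` output can be summarised with a few lines of Python. This is only a sketch: `parse_cluster_health` is a hypothetical helper, and it assumes the output keeps the `member <id> is <state>: ...` and `cluster is healthy` wording seen in the codfw output pasted in this task.

```python
import re

# Matches lines such as "member 367f7076aea55538 is healthy: got healthy result from ..."
MEMBER_RE = re.compile(r"member ([0-9a-f]+) is (\w+):")


def parse_cluster_health(output):
    """Return ({member_id: state}, cluster_ok) from etcdctl cluster-health text.

    Assumes the wording shown in this task; other etcdctl versions may differ.
    """
    members = {m.group(1): m.group(2) for m in MEMBER_RE.finditer(output)}
    cluster_ok = "cluster is healthy" in output
    return members, cluster_ok


# Sample text copied from the codfw check in this task.
SAMPLE = """\
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
"""

if __name__ == "__main__":
    members, ok = parse_cluster_health(SAMPLE)
    print(len(members), "members, cluster healthy:", ok)
```

The parser works on the captured text only; it does not contact the cluster itself.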