With T301272 resolved, the conf100[789] hosts are ready to be put into production and take over the role of conf100[456].
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Cmjohnson | T311408 Decomission conf100[456]
Resolved | Cmjohnson | T301272 Q3:(Need By: TBD) rack/setup/install conf100[789]
Resolved | akosiaris | T310062 Update conf1* servers
Restricted Task | |
Resolved | akosiaris | T311407 Put conf100[789] in production
Event Timeline
Change 811728 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/dns@master] Add conf100[789] in DNS SRV records
Change 811729 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] Assign conf100[789] roles and add them to the cluster
Change 811885 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/dns@master] Add client side conf100[789] in DNS SRV records
Change 811728 merged by Alexandros Kosiaris:
[operations/dns@master] Add conf100[789] in DNS SRV records
Change 811729 merged by Alexandros Kosiaris:
[operations/puppet@production] Assign conf100[789] roles and add them to the cluster
Change 812018 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt
Change 812018 merged by Alexandros Kosiaris:
[operations/puppet@production] Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt
Change 812035 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert
Change 812035 merged by Alexandros Kosiaris:
[operations/puppet@production] Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert
Change 812084 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/dns@master] Add conf1008 in DNS SRV records
Change 812084 merged by Alexandros Kosiaris:
[operations/dns@master] Add conf1008 in DNS SRV records
Change 812088 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/dns@master] Add conf1009 in DNS SRV records
Change 812088 merged by Alexandros Kosiaris:
[operations/dns@master] Add conf1009 in DNS SRV records
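For readers less familiar with etcd DNS discovery, the DNS changes above add one SRV record per host under a shared service name. The following is a minimal sketch of what such records look like in zone-file form. It is not taken from the operations/dns repository: the helper name `srv_records`, the port 2380 (the upstream etcd default peer port), and the priority, weight and TTL values are all assumptions for illustration. Only the service name and the host names come from this task.

```python
from typing import Iterable, List


def srv_records(
    service: str,
    hosts: Iterable[str],
    port: int,
    priority: int = 0,   # assumed value, not from the task
    weight: int = 1,     # assumed value, not from the task
    ttl: int = 300,      # assumed value, not from the task
) -> List[str]:
    """Render one zone-file style SRV line per host.

    Format: <service>. <ttl> IN SRV <priority> <weight> <port> <target>.
    """
    return [
        f"{service}. {ttl} IN SRV {priority} {weight} {port} {host}."
        for host in hosts
    ]


# Service name as it appears in the certificate changes above.
SERVICE = "_etcd-server-ssl._tcp.v3.eqiad.wmnet"
NEW_HOSTS = [
    "conf1007.eqiad.wmnet",
    "conf1008.eqiad.wmnet",
    "conf1009.eqiad.wmnet",
]

# 2380 is the upstream etcd default peer port; the real port may differ.
RECORDS = srv_records(SERVICE, NEW_HOSTS, port=2380)
```

Adding the hosts one at a time, as the separate conf1008 and conf1009 changes suggest, would correspond to calling the helper with a growing host list.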
Took a while but:
etcdctl --endpoints https://conf1004.eqiad.wmnet:2379 cluster-health
member 4cdd4cdde64b18d3 is healthy: got healthy result from https://conf1004.eqiad.wmnet:4001
member 815826b71cbad9ea is healthy: got healthy result from https://conf1006.eqiad.wmnet:4001
member 94055724277c08c8 is healthy: got healthy result from https://conf1007.eqiad.wmnet:4001
member 9b6588f020ad0f66 is healthy: got healthy result from https://conf1005.eqiad.wmnet:4001
member e95b40037bb59612 is healthy: got healthy result from https://conf1008.eqiad.wmnet:4001
member ebd9a0d3f013ed0f is healthy: got healthy result from https://conf1009.eqiad.wmnet:4001
cluster is healthy
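If this check needs to be repeated while members are being added or removed, the output above is easy to verify mechanically. The following is a rough sketch, assuming the `cluster-health` output keeps the `member <id> is <state>` and `cluster is healthy` wording shown above. The function name `parse_cluster_health` is made up for this example, and the `unhealthy` and `unreachable` states are assumptions about what a failing member would print.

```python
import re
from typing import Dict, Tuple

# Matches e.g. "member 4cdd4cdde64b18d3 is healthy".
# The non-healthy states are assumed; only "healthy" appears in this task.
MEMBER_RE = re.compile(
    r"member (?P<id>[0-9a-f]+) is (?P<state>healthy|unhealthy|unreachable)"
)


def parse_cluster_health(output: str) -> Tuple[Dict[str, str], bool]:
    """Return ({member_id: state}, cluster_ok) from cluster-health output.

    Works on both multi-line output and output flattened onto one line.
    """
    members = {m.group("id"): m.group("state") for m in MEMBER_RE.finditer(output)}
    cluster_ok = "cluster is healthy" in output
    return members, cluster_ok


def all_members_healthy(output: str, expected_members: int) -> bool:
    """True if the cluster reports healthy and every expected member does too."""
    members, cluster_ok = parse_cluster_health(output)
    return (
        cluster_ok
        and len(members) == expected_members
        and all(state == "healthy" for state in members.values())
    )
```

For the output above, `all_members_healthy(output, expected_members=6)` would be the check of interest, since the three old and three new hosts are all members at this point.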
So, etcd-wise, the cluster is ready. We still have 2 remaining actionables:
- T312539 to fix the zookeeper versioning issue and thus initialize zookeeper on those hosts.
- Get etcd-mirror packaged for bullseye
The "Get etcd-mirror packaged for bullseye" item just got fixed:
https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812306 and https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812241/16
Built and uploaded to apt.wikimedia.org and tested on conf1007. The only thing left appears to be T312539 after which we should be able to decommission the old hosts.
Change 816181 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] zookeeper: Disable notification on conf1007,conf1008,conf1009
Change 816181 merged by Alexandros Kosiaris:
[operations/puppet@production] zookeeper: Disable notifications on conf1007,conf1008,conf1009
Change 816149 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] zookeeper: Reenable notifications for conf1007,conf1008,conf1009
^Please remember to reenable alerting just before closing this task! :-) Creating the revert will make sure the gerrit bot keeps the patch-for-review tag.
Change 811885 merged by Alexandros Kosiaris:
[operations/dns@master] Add client side conf100[789] in DNS SRV records
Change 816149 merged by Alexandros Kosiaris:
[operations/puppet@production] zookeeper: Reenable notifications for conf1007,conf1008,conf1009
conf100[789] are now proper members of both etcd and zookeeper clusters.
I've got a slow cumin-based rolling restart of all kafka-related components (kafka, kafka-mirror-maker, burrow) running to pick up the list of new hosts, and I've just merged the client-side DNS RR for etcd. The hosts can be considered fully in production now.
Change 817806 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] conf100[456]: Disable notifications
Change 817806 merged by Alexandros Kosiaris:
[operations/puppet@production] conf100[456]: Disable notifications
Mentioned in SAL (#wikimedia-operations) [2022-07-28T11:41:22Z] <akosiaris> slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. T311407
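The slow rolling restarts mentioned in this task (kafka-related components, then pybals at a 10 minute interval) follow a simple pattern: restart one host, wait, move on to the next. The following is a minimal sketch of that pattern only. It is not how cumin implements it, and the names `rolling_restart`, `restart` and `sleep` are hypothetical. The restart action is passed in as a callable, so nothing here assumes a particular service manager.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], None],
    interval_seconds: float = 600.0,  # 10 minutes, as in the SAL entry
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts one at a time, pausing between consecutive hosts.

    Returns the hosts restarted, in order. If `restart` raises, the loop
    stops, so the remaining hosts keep running with their old config.
    """
    pending = list(hosts)
    done: List[str] = []
    for index, host in enumerate(pending):
        restart(host)
        done.append(host)
        # No need to wait after the final host.
        if index < len(pending) - 1:
            sleep(interval_seconds)
    return done
```

Stopping on the first failure is a deliberate choice in this sketch: with a long interval between hosts, a bad restart affects a single host and leaves time to notice before the next one is touched.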