Put conf100[789] in production
Closed, Resolved (Public)

Description

With T301272 resolved, the conf100[789] hosts are ready to be put into production and take over the role of conf100[456].

Event Timeline

Change 811728 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add conf100[789] in DNS SRV records

https://gerrit.wikimedia.org/r/811728

akosiaris added a subscriber: Ottomata.

Adding @Ottomata too since conf100* hosts also run zookeeper

Change 811729 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Assign conf100[789] roles and add them to the cluster

https://gerrit.wikimedia.org/r/811729

Change 811885 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add client side conf100[789] in DNS SRV records

https://gerrit.wikimedia.org/r/811885

Change 811728 merged by Alexandros Kosiaris:

[operations/dns@master] Add conf100[789] in DNS SRV records

https://gerrit.wikimedia.org/r/811728

Change 811729 merged by Alexandros Kosiaris:

[operations/puppet@production] Assign conf100[789] roles and add them to the cluster

https://gerrit.wikimedia.org/r/811729

Change 812018 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/812018

Change 812018 merged by Alexandros Kosiaris:

[operations/puppet@production] Update _etcd-server-ssl._tcp.v3.eqiad.wmnet.crt

https://gerrit.wikimedia.org/r/812018

Change 812035 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert

https://gerrit.wikimedia.org/r/812035

Change 812035 merged by Alexandros Kosiaris:

[operations/puppet@production] Add v3.eqiad.wmnet to _etcd-server-ssl._tcp.v3.eqiad.wmnet cert

https://gerrit.wikimedia.org/r/812035

Change 812084 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add conf1008 in DNS SRV records

https://gerrit.wikimedia.org/r/812084

Change 812084 merged by Alexandros Kosiaris:

[operations/dns@master] Add conf1008 in DNS SRV records

https://gerrit.wikimedia.org/r/812084

Change 812088 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/dns@master] Add conf1009 in DNS SRV records

https://gerrit.wikimedia.org/r/812088

Change 812088 merged by Alexandros Kosiaris:

[operations/dns@master] Add conf1009 in DNS SRV records

https://gerrit.wikimedia.org/r/812088

Took a while but:

etcdctl --endpoints https://conf1004.eqiad.wmnet:2379 cluster-health
member 4cdd4cdde64b18d3 is healthy: got healthy result from https://conf1004.eqiad.wmnet:4001
member 815826b71cbad9ea is healthy: got healthy result from https://conf1006.eqiad.wmnet:4001
member 94055724277c08c8 is healthy: got healthy result from https://conf1007.eqiad.wmnet:4001
member 9b6588f020ad0f66 is healthy: got healthy result from https://conf1005.eqiad.wmnet:4001
member e95b40037bb59612 is healthy: got healthy result from https://conf1008.eqiad.wmnet:4001
member ebd9a0d3f013ed0f is healthy: got healthy result from https://conf1009.eqiad.wmnet:4001
cluster is healthy
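
For anyone who wants to check output like the above from a script rather than by eye, here is a minimal sketch. It assumes only the line format visible in the output above (`member <id> is <state>: ...` plus a final `cluster is healthy` line). The function name `summarize_cluster_health` and the `SAMPLE` constant are made up for illustration and are not part of etcdctl or any Wikimedia tooling.

```python
import re

# Matches lines such as:
#   member 4cdd4cdde64b18d3 is healthy: got healthy result from https://conf1004.eqiad.wmnet:4001
# Only the format seen in the pasted output is assumed here.
MEMBER_RE = re.compile(r"^member (?P<id>[0-9a-f]+) is (?P<state>\w+)")


def summarize_cluster_health(output):
    """Return (healthy_ids, other_ids, cluster_ok) for cluster-health text."""
    healthy, other = [], []
    cluster_ok = False
    for raw_line in output.splitlines():
        line = raw_line.strip()
        match = MEMBER_RE.match(line)
        if match:
            # Anything not reported as "healthy" is collected separately.
            target = healthy if match.group("state") == "healthy" else other
            target.append(match.group("id"))
        elif line == "cluster is healthy":
            cluster_ok = True
    return healthy, other, cluster_ok


# The output pasted above, used as sample input.
SAMPLE = """\
member 4cdd4cdde64b18d3 is healthy: got healthy result from https://conf1004.eqiad.wmnet:4001
member 815826b71cbad9ea is healthy: got healthy result from https://conf1006.eqiad.wmnet:4001
member 94055724277c08c8 is healthy: got healthy result from https://conf1007.eqiad.wmnet:4001
member 9b6588f020ad0f66 is healthy: got healthy result from https://conf1005.eqiad.wmnet:4001
member e95b40037bb59612 is healthy: got healthy result from https://conf1008.eqiad.wmnet:4001
member ebd9a0d3f013ed0f is healthy: got healthy result from https://conf1009.eqiad.wmnet:4001
cluster is healthy
"""

if __name__ == "__main__":
    healthy, other, ok = summarize_cluster_health(SAMPLE)
    print(len(healthy), "healthy,", len(other), "not healthy, cluster ok:", ok)
```

Lines that do not match the assumed format are ignored, so this should be treated as a convenience check, not a replacement for reading the real output.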

So, etcd-wise, the cluster is ready. We still have 2 remaining actionables:

  • T312539 to fix the zookeeper versioning issue and thus initialize zookeeper on those hosts.
  • Get etcd-mirror packaged for bullseye

The "Get etcd-mirror packaged for bullseye" actionable just got fixed.

https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812306 and https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812241/16

Built and uploaded to apt.wikimedia.org and tested on conf1007. The only thing left appears to be T312539, after which we should be able to decommission the old hosts.

Change 816181 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] zookeeper: Disable notification on conf1007,conf1008,conf1009

https://gerrit.wikimedia.org/r/816181

Change 816181 merged by Alexandros Kosiaris:

[operations/puppet@production] zookeeper: Disable notifications on conf1007,conf1008,conf1009

https://gerrit.wikimedia.org/r/816181

Change 816149 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] zookeeper: Reenable notifications for conf1007,conf1008,conf1009

https://gerrit.wikimedia.org/r/816149

^Please remember to reenable alerting just before closing this task! :-) Creating the revert will make sure the gerrit bot keeps the patch-for-review tag.

Change 811885 merged by Alexandros Kosiaris:

[operations/dns@master] Add client side conf100[789] in DNS SRV records

https://gerrit.wikimedia.org/r/811885

Change 816149 merged by Alexandros Kosiaris:

[operations/puppet@production] zookeeper: Reenable notifications for conf1007,conf1008,conf1009

https://gerrit.wikimedia.org/r/816149

conf100[789] are now proper members of both etcd and zookeeper clusters.

I've got a slow cumin-based rolling restart of all kafka-related components (kafka, kafka-mirror-maker, burrow) running to pick up the list of new hosts, and I've just merged the client-side DNS RR for etcd. The hosts can be considered fully in production now.

akosiaris claimed this task.

Gonna resolve this. Work to remove the old hosts is tracked in T311408.

Change 817806 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] conf100[456]: Disable notifications

https://gerrit.wikimedia.org/r/817806

Change 817806 merged by Alexandros Kosiaris:

[operations/puppet@production] conf100[456]: Disable notifications

https://gerrit.wikimedia.org/r/817806

Mentioned in SAL (#wikimedia-operations) [2022-07-28T11:41:22Z] <akosiaris> slow (10minutes interval) rolling restart of all pybals to pick up new conf hosts config. T311407
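
As a rough illustration of what a slow rolling restart with a fixed interval amounts to, here is a minimal sketch. It is not the cumin invocation that was actually used. The `rolling_restart` function, its parameters, and the host names in the example are all hypothetical, and the `restart` callable stands in for whatever really restarts a service on a host.

```python
import time


def rolling_restart(hosts, restart, interval_seconds=600, sleep=time.sleep):
    """Restart hosts one at a time, pausing between consecutive restarts.

    `restart` is a caller-supplied callable taking a host name; it is a
    placeholder for the real restart mechanism (an assumption of this sketch).
    The default of 600 seconds mirrors the 10-minute interval from the SAL entry.
    """
    done = []
    for index, host in enumerate(hosts):
        if index > 0:
            # Wait before every restart except the first one, so that only
            # one host is being restarted at any given time.
            sleep(interval_seconds)
        restart(host)
        done.append(host)
    return done


if __name__ == "__main__":
    # Dry run with made-up host names and no real waiting.
    rolling_restart(
        ["host-a.example", "host-b.example"],
        restart=lambda host: print("would restart", host),
        sleep=lambda seconds: print("would wait", seconds, "seconds"),
    )
```

Passing `sleep` and `restart` in as arguments keeps the loop easy to dry-run without touching any real host.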