
upgrade conf2* servers to stretch
Closed, ResolvedPublic

Description

Currently the conf* (configcluster) servers in eqiad are on stretch but the ones in codfw are still on jessie.

There are 2 separate roles, configcluster and configcluster_stretch applied to them.

We should reimage the jessie machines with stretch and make them use the same role, so that we get rid of jessie and the servers are the same across both DCs.

Noticed this when confirming an unrelated change as a noop and seeing this kind of Puppet error on conf2001:

Error: /Stage[main]/Apt/File[/usr/local/share/apt/base_packages.txt]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/apt/base_packages.jessie

APT base package lists are no longer generated for jessie, so this is starting to cause other issues. The Puppet run does still finish, though.

Details

Project                          Branch      Lines +/-
operations/puppet                production  +0 -9
operations/puppet                production  +0 -16
operations/puppet                production  +0 -403
operations/puppet                production  +46 -113
operations/puppet                production  +23 -67
operations/puppet                production  +16 -0
operations/puppet                production  +3 -27
operations/puppet                production  +10 -7
operations/dns                   master      +9 -9
operations/puppet                production  +0 -6
operations/puppet                production  +0 -5
operations/puppet                production  +3 -6
operations/puppet                production  +2 -2
operations/puppet                production  +2 -1
operations/puppet                production  +2 -2
operations/puppet                production  +2 -0
operations/puppet                production  +29 -0
labs/private                     master      +3 -0
operations/puppet                production  +6 -0
operations/software/etcd-mirror  debian      +7 -0
operations/puppet                production  +45 -6
operations/dns                   master      +6 -2
labs/private                     master      +2 -2
labs/private                     master      +0 -0

Event Timeline


Seems good! For Zookeeper it should be a matter of reimaging one node at a time; conf200x are already running the backported stretch version, so in theory there shouldn't be any upgrade to do. Data for zookeeper is re-created when a node joins the ensemble, so there are no constraints on that side either.

Also in the procedure we should remember to stop and downtime etcdmirror on conf2002 when we reimage the node.

Sorry to bother you, but do you happen to know when this upgrade will happen? I am not blocked by it (so please don't change any plans because of my question), but depending on when it will probably happen (e.g. whether it is scheduled soon or will take a bit longer) I would solve the jessie backups blocker in different ways: T273182

I just realized, after closer inspection, that the blocker is indeed real, and we need these in stretch or higher to revert T273182. Is there something I can do to help?

@jcrespo Giuseppe and I are discussing the problem, so your pings are not unseen, but the problem is complex since it requires a lot of clients to move to eqiad first (pybals, etcd DNS configs, etc.). The main question mark is around the shape of the etcd cluster, but we should have some answers during the next days. The ETA for the upgrade is on the order of a few weeks, I fear (to write a plan, find the time, execute it, etc.). How does it look from your side? If it is too impacting we could try to figure out a workaround for these nodes :(

If it is too impacting we could try to figure out a workaround for these nodes :(

How bad would it be to disable backup monitoring for these hosts (and let their backups fail), while keeping the backups of conf1*, until they are upgraded? The problem with backing up jessie hosts is that they force all hosts to downgrade to TLS 1.0, which is not great, especially cross-DC.
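The TLS 1.0 concern above is about the backup client having to lower its protocol floor for jessie-era peers, which weakens every connection negotiated through the same stack. Bacula's actual TLS settings live in its own configuration; the Python snippet below is only an illustrative sketch of the same trade-off using the standard library's `ssl` module:

```python
import ssl

# Modern floor: refuse anything below TLS 1.2.
modern = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
modern.minimum_version = ssl.TLSVersion.TLSv1_2

# Jessie-compatibility floor: accepting TLS 1.0 weakens every connection
# negotiated through this context, not just the ones to the old hosts --
# which is the cross-DC concern above. Illustrative only.
legacy = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
legacy.minimum_version = ssl.TLSVersion.TLSv1

print(modern.minimum_version >= ssl.TLSVersion.TLSv1_2)  # True
```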

If we keep the backups for conf1* it should be fine; conf2* replicates via etcdmirror from conf100*, so if this unblocks you it should be doable. Let's check with @Joe to be sure :)

Let's make sure this is true, and that the only canonical data will be on conf100*, before moving ahead with that; otherwise I prefer to stay blocked and speed up the upgrade.

Change 664313 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] configcluster: Enable etcd v3 backups for stretch hosts

https://gerrit.wikimedia.org/r/664313

I've sent:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/664313

Independently of the pace of the upgrade, we should give some priority to generating fresh backups from the (more?) active cluster.

@wkandek Hi! Do you think that we could find somebody in your team to work with me on this task? It seems very important and potentially blocking others (also, the hosts are still running jessie, sigh).

Hi, @JMeybohm and I will work on this with you. We had a knowledge-dump meeting this morning, and while doing so we came up with a better plan than the one I outlined first:

  1. Wait until we have procured the new servers T271346
  2. Add a DNS SRV entry for cluster discovery for those new machines for etcd
  3. Install them with buster and apply the configcluster_stretch (yeah I know...) role, where we'll add a switch to avoid starting zookeeper on these machines at first
  4. Check that the etcd cluster works as designed, add replication
  5. Switch read traffic for etcd from the old to the new cluster
  6. Proceed with the migration of zookeeper one server at a time

How does that sound?
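Step 2 relies on etcd's DNS-based discovery, which looks up `_etcd-server-ssl._tcp.<domain>` SRV records to find the cluster peers. A minimal sketch of what such records could look like for the new codfw machines (the `v3.codfw.wmnet` discovery domain matches the names appearing later in this task; the TTL, priority, weight, and the standard etcd peer port 2380 are illustrative assumptions, not the real zone contents):

```
; Hypothetical sketch of etcd SRV discovery records -- values illustrative.
_etcd-server-ssl._tcp.v3.codfw.wmnet. 300 IN SRV 0 1 2380 conf2004.codfw.wmnet.
_etcd-server-ssl._tcp.v3.codfw.wmnet. 300 IN SRV 0 1 2380 conf2005.codfw.wmnet.
_etcd-server-ssl._tcp.v3.codfw.wmnet. 300 IN SRV 0 1 2380 conf2006.codfw.wmnet.
```

Each member resolves the SRV name at startup to learn the full peer set, which is why the discovery name also has to appear in the peer certificates.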

Change 679727 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Setup new etcd3 cluster on conf200[456] in codfw

https://gerrit.wikimedia.org/r/679727

Change 679731 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add SRV records for new etcd3 cluster in codfw

https://gerrit.wikimedia.org/r/679731

Change 679734 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add key for _etcd-server-ssl._tcp.v3.codfw.wmnet.key

https://gerrit.wikimedia.org/r/679734

Change 679734 merged by JMeybohm:

[labs/private@master] Add key for _etcd-server-ssl._tcp.v3.codfw.wmnet.key

https://gerrit.wikimedia.org/r/679734

Change 679748 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] htpasswd(): salt must be 8 characters

https://gerrit.wikimedia.org/r/679748

Change 679748 merged by JMeybohm:

[labs/private@master] htpasswd(): salt must be 8 characters

https://gerrit.wikimedia.org/r/679748

Change 679731 merged by JMeybohm:

[operations/dns@master] Add SRV records for new etcd3 cluster in codfw

https://gerrit.wikimedia.org/r/679731

Change 679727 merged by JMeybohm:

[operations/puppet@production] Setup new etcd3 cluster on conf200[456] in codfw

https://gerrit.wikimedia.org/r/679727

Change 679770 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/etcd-mirror@debian] Repackaging for buster

https://gerrit.wikimedia.org/r/679770

Change 679770 abandoned by JMeybohm:

[operations/software/etcd-mirror@debian] Repackaging for buster

Reason:

https://gerrit.wikimedia.org/r/679770

The etcd cluster is now set up on conf200[4,5,6], although I had some trouble setting it up and do not yet know why:

After the initial Puppet runs, the etcds rejected each other's certificates for weird reasons:

Apr 16 08:28:57 conf2004 etcd[20404]: rejected connection from "10.192.48.59:49538" (error "remote error: tls: bad certificate", ServerName "v3.codfw.wmnet")
Apr 16 08:28:57 conf2004 etcd[20404]: rejected connection from "10.192.48.59:49536" (error "remote error: tls: bad certificate", ServerName "v3.codfw.wmnet")

Where 10.192.48.59 is conf2006.

Apr 16 08:28:38 conf2006 etcd[13177]: health check for peer d421c441462980ef could not connect: x509: certificate is valid for conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, _etcd-server-ssl._tcp.v3.codfw.wmnet, not v3.codfw.wmnet (prober "ROUND_TRIPPER_SNAPSHOT")
Apr 16 08:28:38 conf2006 etcd[13177]: health check for peer d421c441462980ef could not connect: x509: certificate is valid for conf2004.codfw.wmnet, conf2005.codfw.wmnet, conf2006.codfw.wmnet, _etcd-server-ssl._tcp.v3.codfw.wmnet, not v3.codfw.wmnet (prober "ROUND_TRIPPER_RAFT_MESSAGE")

v3.codfw.wmnet is clearly not in the certificates (and it should not be), but _etcd-server-ssl._tcp.v3.codfw.wmnet is. According to s_client, the correct certs were used when connecting (DNS:conf2004.codfw.wmnet, DNS:conf2005.codfw.wmnet, DNS:conf2006.codfw.wmnet, DNS:_etcd-server-ssl._tcp.v3.codfw.wmnet).

To enable debug logging I restarted etcd, and all of a sudden the nodes were happily joining...
If someone has a clue, please point me in the right direction.
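The rejection above is plain hostname verification failing: per the log lines, etcd expected the ServerName `v3.codfw.wmnet` (apparently derived from the SRV discovery domain), while the certificate only lists the three hosts and the full `_etcd-server-ssl._tcp.` name. A simplified stand-in for that SAN match (not Go's actual x509 logic) makes the mismatch easy to see:

```python
# Simplified illustration of hostname verification against a certificate's
# subjectAltName list -- a stand-in for Go's x509 matching, not the real code.

def san_matches(hostname: str, sans: list[str]) -> bool:
    """Return True if hostname matches any SAN (exact or one-label wildcard)."""
    for san in sans:
        if san.lower() == hostname.lower():
            return True
        # A '*.example.org' wildcard covers exactly one leading label.
        if san.startswith("*.") and "." in hostname:
            if hostname.split(".", 1)[1].lower() == san[2:].lower():
                return True
    return False

# SANs from the conf200[4-6] peer certificate, per the log lines above:
sans = [
    "conf2004.codfw.wmnet",
    "conf2005.codfw.wmnet",
    "conf2006.codfw.wmnet",
    "_etcd-server-ssl._tcp.v3.codfw.wmnet",
]

# The expected ServerName "v3.codfw.wmnet" is not in the SAN list,
# hence "bad certificate"; the literal names do match:
print(san_matches("v3.codfw.wmnet", sans))                        # False
print(san_matches("conf2004.codfw.wmnet", sans))                  # True
print(san_matches("_etcd-server-ssl._tcp.v3.codfw.wmnet", sans))  # True
```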

Do you think, with the work done, we could drop support for jessie bacula backups (only the etcd cluster was still pending on jessie)?

Do you think, with the work done, we could drop support for jessie bacula backups (only the etcd cluster was still pending on jessie)?

If those are the last jessie nodes, I guess so. But the migration is not done yet (we just have a new, empty etcd cluster). We'll let you know when we've taken the jessie nodes out of service.

The tlsproxy currently serves a certificate not valid for conf200[4,5,6] (Prometheus errors with: Get https://conf2004:4001/metrics: x509: certificate is valid for conf2001.codfw.wmnet, conf2002.codfw.wmnet, conf2003.codfw.wmnet, conf2001, conf2002, conf2003, etcd.codfw.wmnet, not conf2004)

This is because I overlooked the fact that I need to re-create the etcd certificate:

# This cert is generated using puppet-ecdsacert, and includes
# all the hostnames for the etcd machines in the SANs
# Will need to be regenerated if we add servers to the cluster.
profile::etcd::tlsproxy::cert_name: "etcd.%{::domain}"

Change 680874 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add conf200[4-6] IPs to zookeeper's main firewall config

https://gerrit.wikimedia.org/r/680874

Change 680875 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Swap zookeeper from conf2001 to conf2004

https://gerrit.wikimedia.org/r/680875

@JMeybohm I created the first two patches to swap the zookeeper servers; in theory it should work fine. The delicate step is the rolling restart of the daemons after the second one, but it should be OK. If you want, we can review them together and decide how to proceed :)

Change 680874 merged by Elukey:

[operations/puppet@production] Add conf200[4-6] IPs to zookeeper's main firewall config

https://gerrit.wikimedia.org/r/680874

Change 682403 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add new tlsproxy cert for configcluster etcd

https://gerrit.wikimedia.org/r/682403

Change 682493 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] configcluster: Add new tlsproxy certificate

https://gerrit.wikimedia.org/r/682493

Change 682403 merged by JMeybohm:

[labs/private@master] Add new tlsproxy cert for configcluster etcd

https://gerrit.wikimedia.org/r/682403

Change 682493 merged by JMeybohm:

[operations/puppet@production] configcluster: Add new tlsproxy certificate

https://gerrit.wikimedia.org/r/682493

Change 682497 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] configcluster: Enable replication on conf2005

https://gerrit.wikimedia.org/r/682497

Icinga downtime set by jayme@cumin1001 for 1:00:00 1 host(s) and their services with reason: for initial etcd replication

conf2005.codfw.wmnet

Change 682497 merged by JMeybohm:

[operations/puppet@production] configcluster: Enable replication on conf2005

https://gerrit.wikimedia.org/r/682497

Icinga downtime set by jayme@cumin1001 for 2:00:00 6 host(s) and their services with reason: for zookeeper migration

conf[2001-2006].codfw.wmnet

Change 680875 merged by JMeybohm:

[operations/puppet@production] Swap zookeeper from conf2001 to conf2004

https://gerrit.wikimedia.org/r/680875

Change 682655 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] configcluster: Update zookeeper version to debian upstream

https://gerrit.wikimedia.org/r/682655

Change 682655 merged by JMeybohm:

[operations/puppet@production] configcluster: Update zookeeper version to debian upstream

https://gerrit.wikimedia.org/r/682655

Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: for zookeeper migration

conf2001.codfw.wmnet

Change 682666 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Swap zookeeper from conf2002 to conf2005

https://gerrit.wikimedia.org/r/682666

Change 682667 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Swap zookeeper from conf2003 to conf2006

https://gerrit.wikimedia.org/r/682667

Switched zookeeper from conf2001 to conf2004.
We decided to leave it like this for today and see if anything comes up.

Change 682669 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] configcluster: No longer include zookeeper in old configcluster role

https://gerrit.wikimedia.org/r/682669

Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 2 host(s) and their services with reason: for zookeeper migration

conf[2002-2003].codfw.wmnet

Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration

conf[2004-2006].codfw.wmnet

Change 682666 merged by JMeybohm:

[operations/puppet@production] Swap zookeeper from conf2002 to conf2005

https://gerrit.wikimedia.org/r/682666

Icinga downtime set by jayme@cumin1001 for 2:00:00 3 host(s) and their services with reason: for zookeeper migration

conf[2004-2006].codfw.wmnet

Change 682667 merged by JMeybohm:

[operations/puppet@production] Swap zookeeper from conf2003 to conf2006

https://gerrit.wikimedia.org/r/682667

Zookeeper has completely moved from conf200[1-3] to conf200[4-6]; kafka-main, mirror-maker, and kafka-logging in codfw have been restarted to catch up with that as well.

Change 682669 merged by JMeybohm:

[operations/puppet@production] configcluster: No longer include zookeeper in old configcluster role

https://gerrit.wikimedia.org/r/682669

Change 683244 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Move codfw etcd clients to new cluster

https://gerrit.wikimedia.org/r/683244

Change 683246 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] common: Remove old zookeeper hosts

https://gerrit.wikimedia.org/r/683246

Change 683248 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Switch conf200[1-3] to conf200[4-6]

https://gerrit.wikimedia.org/r/683248

Change 683246 merged by JMeybohm:

[operations/puppet@production] common: Remove old zookeeper hosts

https://gerrit.wikimedia.org/r/683246

Change 683244 merged by JMeybohm:

[operations/dns@master] Move codfw etcd clients to new cluster

https://gerrit.wikimedia.org/r/683244

Mentioned in SAL (#wikimedia-operations) [2021-04-28T12:39:21Z] <jayme> restarting pybal on lvs2010 - T271573

Mentioned in SAL (#wikimedia-operations) [2021-04-28T12:42:03Z] <jayme> restarting pybal on lvs5003,lvs4007 - T271573

Mentioned in SAL (#wikimedia-operations) [2021-04-28T12:48:07Z] <jayme> restarting pybal on lvs2009 - T271573

Change 683248 merged by JMeybohm:

[operations/puppet@production] Switch conf200[1-3] to conf200[4-6]

https://gerrit.wikimedia.org/r/683248

Mentioned in SAL (#wikimedia-operations) [2021-04-28T13:10:42Z] <jayme> restarting pybal on lvs5002,lvs4006,lvs2008 - T271573

Mentioned in SAL (#wikimedia-operations) [2021-04-28T13:15:00Z] <jayme> restarting pybal on lvs5001,lvs4005,lvs2007 - T271573

The DNS SRV records, pybals, and confd instances in codfw, eqsin, and ulsfo have been moved to the new cluster. navtiming.service on webperf needed a restart as well.

Change 683358 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove references to conf200[1-3] after decom

https://gerrit.wikimedia.org/r/683358

Change 683358 merged by JMeybohm:

[operations/puppet@production] Remove references to conf200[1-3] after decom

https://gerrit.wikimedia.org/r/683358

Change 664313 merged by JMeybohm:

[operations/puppet@production] configcluster: Enable etcd v2 backups

https://gerrit.wikimedia.org/r/664313

Change 683551 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename role configcluster_stretch to configcluster

https://gerrit.wikimedia.org/r/683551

Change 684315 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Rename configcluster_stretch to configcluster in hiera

https://gerrit.wikimedia.org/r/684315

Change 684316 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove unused profile::etcd and related classes

https://gerrit.wikimedia.org/r/684316

Change 684315 abandoned by JMeybohm:

[operations/puppet@production] Rename configcluster_stretch to configcluster in hiera

Reason:

Squashed into 683551

https://gerrit.wikimedia.org/r/684315

Change 683551 merged by JMeybohm:

[operations/puppet@production] Rename role configcluster_stretch to configcluster

https://gerrit.wikimedia.org/r/683551

Change 684316 merged by JMeybohm:

[operations/puppet@production] Remove unused profile::etcd and related classes

https://gerrit.wikimedia.org/r/684316

Change 684801 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus: Clean up absent file resource

https://gerrit.wikimedia.org/r/684801

Change 684848 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] puppet_compiler: Remove etcd and conftool::client

https://gerrit.wikimedia.org/r/684848

Change 684848 merged by JMeybohm:

[operations/puppet@production] puppet_compiler: Remove etcd and conftool::client

https://gerrit.wikimedia.org/r/684848