
Upgrading Wikidough and durum VMs to bullseye
Closed, Resolved (Public)

Description

The Wikidough and durum hosts (12 each, for a total of 24 hosts) are Ganeti VMs currently running buster (10.11). We should upgrade them to bullseye, and this task discusses the best path forward for undertaking that upgrade.

We have already backported the dnsdist and pdns-recursor patches from unstable, building them for bullseye. For the actual upgrade of the VMs, we have two main options:
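
For reference, a typical Debian backport flow looks roughly like the sketch below. This is a hedged illustration, not necessarily how the packages were built here; the chroot/distribution name and the version suffix are assumptions:

apt-get source dnsdist/unstable    # fetch the newer packaging from unstable
cd dnsdist-1.7.1
dch --local wm1 --distribution bullseye-wikimedia "Rebuild for bullseye"
sbuild -d bullseye-wikimedia       # build in a bullseye chroot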

  1. Perform an in-place upgrade, using apt full-upgrade. This helps preserve the IP addresses (reminder: these services are anycasted and changing an IP requires additional updates) and might be the least stressful path forward.
    1. The downside of this approach is that it's not as clean as spinning up a new host and doesn't help us catch corner cases when initializing a new Wikidough or durum host with bullseye (such as the Puppetization). We can, however, always perform the in-place upgrade and spin up a test VM on the side to test a clean install.
    2. Steps: depool the host, run full-upgrade, reboot, repool the host. (A rough sketch follows this list.)
  2. Decommission the existing hosts and run makevm to spin up new VMs with bullseye. The downsides are that the IPs of the hosts will change, so we would have to update them in homer for the anycast configuration (not a big deal?), and that it might result in some weird state issue elsewhere (such as in Netbox).
    1. This approach helps us test setting up new instances of Wikidough and durum with bullseye and ensures a clean install. There is no state on the Wikidough hosts (no logs or other state data that we have to preserve), so decommissioning and creating new hosts should not be a problem.
    2. Steps: decommission the host, set up a new VM, update the existing hostnames and IP addresses in homer and other places. This needs to be done for each host individually.
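
A minimal sketch of option 1 on a single host. The depool/pool commands are assumptions standing in for whatever mechanism withdraws the anycast service, and the sed glosses over files under /etc/apt/sources.list.d/ and the renamed security suite (buster/updates became bullseye-security):

sudo depool                          # assumption: withdraw the host from service
sudo sed -i 's/buster/bullseye/g' /etc/apt/sources.list
sudo apt update
sudo apt full-upgrade
sudo reboot
# once the host is back up and verified healthy:
sudo pool                            # assumption: return the host to service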

Both of these approaches are manual, as there is no cookbook (as of now) for performing these upgrades. The main intention of this task is to discuss the best path forward for upgrading Ganeti VMs to bullseye, specifically for the Wikidough and durum hosts.


I do notice that some of the Ganeti VMs (just codfw, as an example) have already been upgraded to bullseye:

===== NODE GROUP =====
(11) failoid2002.codfw.wmnet,kubernetes[2005-2006,2015-2016].codfw.wmnet,ldap-replica[2005-2006].wikimedia.org,ml-staging-ctrl[2001-2002].codfw.wmnet,netflow2002.codfw.wmnet,rpki2002.codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.2
===== NODE GROUP =====
(5) build2001.codfw.wmnet,kubestagemaster2001.codfw.wmnet,mx2001.wikimedia.org,people2002.codfw.wmnet,puppetboard2002.codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.3

While the majority of them have not been:

===== NODE GROUP =====
(7) debmonitor2002.codfw.wmnet,deneb.codfw.wmnet,grafana2001.codfw.wmnet,miscweb2002.codfw.wmnet,pybal-test2001.codfw.wmnet,urldownloader2001.wikimedia.org,xhgui2001.codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
10.12
===== NODE GROUP =====
(62) acmechief2001.codfw.wmnet,acmechief-test2001.codfw.wmnet,apifeatureusage2001.codfw.wmnet,apt2001.wikimedia.org,chartmuseum2001.codfw.wmnet,doc2001.codfw.wmnet,doh[2001-2002].wikimedia.org,dragonfly-supernode2001.codfw.wmnet,durum[2001-2002].codfw.wmnet,gitlab2001.wikimedia.org,gitlab-runner2001.codfw.wmnet,idp2001.wikimedia.org,idp-test2001.wikimedia.org,install2003.wikimedia.org,irc2001.wikimedia.org,kafkamon2002.codfw.wmnet,kubemaster[2001-2002].codfw.wmnet,kubestagetcd[2001-2003].codfw.wmnet,kubetcd[2004-2006].codfw.wmnet,ldap-corp2001.wikimedia.org,logstash[2023-2025,2030-2031].codfw.wmnet,ml-etcd[2001-2003].codfw.wmnet,ml-serve-ctrl[2001-2002].codfw.wmnet,ml-staging-etcd[2001-2003].codfw.wmnet,mwdebug[2001-2002].codfw.wmnet,ncredir[2001-2002].codfw.wmnet,netbox2001.wikimedia.org,netbox-dev2001.wikimedia.org,netboxdb2001.codfw.wmnet,orespoolcounter[2003-2004].codfw.wmnet,ping2002.codfw.wmnet,planet2002.codfw.wmnet,poolcounter[2003-2004].codfw.wmnet,puppetdb2002.codfw.wmnet,registry[2003-2004].codfw.wmnet,releases2002.codfw.wmnet,schema[2003-2004].codfw.wmnet,search-loader2001.codfw.wmnet,serpens.wikimedia.org,urldownloader2002.wikimedia.org
----- OUTPUT of 'cat /etc/debian_version' -----
10.11
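
(For reference, output like the above comes from running the check across the Ganeti VMs with Cumin; the host selector below is an assumption:

sudo cumin 'A:ganeti-vm and *.codfw.wmnet' 'cat /etc/debian_version'

)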

If we do decide that a cookbook is the path forward, then it might be worth considering whether it will benefit just the 24 Wikidough and durum hosts or the other VMs above as well, to make the effort worthwhile.

Event Timeline

My 2 cents:

A cookbook is not worth it in this case: it is likely more work to create and debug it than the actual time saved on installs, because this will just happen once every 2 years or less, and maybe by then it will be on k8s, or who knows.

An in-place upgrade is not common and I would expect some unexpected fallout; I would only do one if there is a really good reason to deviate from the standard workflow.

Creating new VMs has the advantage that the system can be tested on the new distro without any time pressure, and then you can decom the older machines whenever you like.

Regarding the downsides you list for that:

"update IPs in homer for the anycast configuration (not a big deal?)": yes, not a big deal.

"and that it might result in some weird state issue elsewhere (such as Netbox)": why would it? You would give them new names; just don't reuse the same name. If they are 1001 now, make them 1002, etc.

> My 2 cents:

Thanks for the feedback!

> A cookbook is not worth it in this case: it is likely more work to create and debug it than the actual time saved on installs, because this will just happen once every 2 years or less, and maybe by then it will be on k8s, or who knows.

Yeah, I am a bit split about the cookbook as well and had a brief discussion about it with Riccardo on IRC. I think the cookbook makes sense -- and I say this while recognizing that I don't want to add to the work of the IF team -- if there is a need for it for the other VMs as well and it can be used for those. Otherwise, the work involved in a cookbook for 24 hosts (we may add more later, but that's beside the point) might be more than the work required for the manual steps. Or, if we do decide to automate this, maybe we can build upon some existing work, but I am not sure about that.

> An in-place upgrade is not common and I would expect some unexpected fallout; I would only do one if there is a really good reason to deviate from the standard workflow.

> Creating new VMs has the advantage that the system can be tested on the new distro without any time pressure, and then you can decom the older machines whenever you like.

> Regarding the downsides you list for that:

> "update IPs in homer for the anycast configuration (not a big deal?)": yes, not a big deal.

> "and that it might result in some weird state issue elsewhere (such as Netbox)": why would it? You would give them new names; just don't reuse the same name. If they are 1001 now, make them 1002, etc.

Yeah, that's one way: if we move to new hostnames, then there should not be any problem. There is more manual work involved in this than just running apt full-upgrade, but it definitely is the cleaner approach. It then involves: decommission a host, spin up a new VM, update homer, repeat. I don't see any concerns with it, but I wanted to make sure that I am not missing anything!

Thanks @ssingh for kickstarting the discussion!

My two cents as an owner (with o11y) of some VMs that will need upgrading (grafana, logstash, etc.): I think our strategy for large host counts in the upgrade to Bullseye will be to:

  • At a high level, test provisioning a Bullseye host from scratch in Pontoon, to validate that e.g. packages are all present, there is no obvious breakage, etc.
  • Provision a Bullseye VM in production from scratch and put it in service, to validate that all is working well
  • Progressively upgrade the remaining VMs to Bullseye in place

AIUI the decom cookbook doesn't support VMs yet (?), though it is definitely something we want. There isn't a cookbook for in-place upgrades yet either, though of course the more we do them, the more appealing it becomes to have something like that. I haven't done an in-place upgrade to Bullseye myself yet, but I think we'll do at least some of them this quarter for sure.

> AIUI the decom cookbook doesn't support VMs yet (?)

That's not actually correct: the decommission cookbook has supported VMs since the start.
What is missing is that the makevm cookbook doesn't yet complete the installation with the Puppet runs and such.
My idea would be to complete that part so that makevm can automate the whole process of creating a new VM.
At that point there will be two options (a rough sketch follows the list):

  1. decommission + makevm with the same hostname (and we could make adjustments to keep the same IP too). That will be logically equivalent to a physical hardware reimage, including the downtime of the host.
  2. makevm with a new name and IPs + decommission of the old VM. That will be logically equivalent to a physical hardware refresh (getting new hardware), including the possibility of bringing the new host up before decommissioning the old one.
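
A hedged sketch of what those two flows could look like once makevm is completed; the cookbook names are real, but the flags are elided and the hostnames below are illustrative:

# Option 1: reimage-style (same hostname; the host is down in between)
sudo cookbook sre.hosts.decommission doh2001.wikimedia.org -t T305589
sudo cookbook sre.ganeti.makevm ... doh2001    # makevm flags elided; see the cookbook's --help

# Option 2: refresh-style (new hostname; the old host stays up meanwhile)
sudo cookbook sre.ganeti.makevm ... doh2003    # hypothetical new name
# pool the new host, verify it, then:
sudo cookbook sre.hosts.decommission doh2001.wikimedia.org -t T305589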

>> AIUI the decom cookbook doesn't support VMs yet (?)
>
> That's not actually correct: the decommission cookbook has supported VMs since the start.
> What is missing is that the makevm cookbook doesn't yet complete the installation with the Puppet runs and such.
> My idea would be to complete that part so that makevm can automate the whole process of creating a new VM.
> At that point there will be two options:
>
>   1. decommission + makevm with the same hostname (and we could make adjustments to keep the same IP too). That will be logically equivalent to a physical hardware reimage, including the downtime of the host.
>   2. makevm with a new name and IPs + decommission of the old VM. That will be logically equivalent to a physical hardware refresh (getting new hardware), including the possibility of bringing the new host up before decommissioning the old one.

My apologies for the misinformation! Looking forward to being able to do decom + makevm = reimages for VMs (i.e., option 1).

Thanks for the feedback @fgiunchedi and @Volans!

>> AIUI the decom cookbook doesn't support VMs yet (?)
>
> That's not actually correct: the decommission cookbook has supported VMs since the start.
> What is missing is that the makevm cookbook doesn't yet complete the installation with the Puppet runs and such.
> My idea would be to complete that part so that makevm can automate the whole process of creating a new VM.
> At that point there will be two options:
>
>   1. decommission + makevm with the same hostname (and we could make adjustments to keep the same IP too). That will be logically equivalent to a physical hardware reimage, including the downtime of the host.
>   2. makevm with a new name and IPs + decommission of the old VM. That will be logically equivalent to a physical hardware refresh (getting new hardware), including the possibility of bringing the new host up before decommissioning the old one.

I think 1) will be a huge help, specifically the "keep the same IP too" part. I am being a bit selfish here, of course, but that avoids the problem of having to update the IP address in other places and also helps you hold on to an IP: when I was trying to set up a Wikidough host in ulsfo, there were no public /32s available and I had to decommission a bastion host and reclaim the IP from there! So one of my fears (unfounded?) is losing the IP(s) we already have to some other host during this process of decommissioning and commissioning.

I guess my main concern, though, is whether the amount of work involved in the cookbook is worth the effort, i.e., whether there will be a use for it for the other VMs as well. And I say this while being aware of the fact that this work mostly falls on @Volans and IF, and I am happy to help in any way I can. To be clear, the intention of this ticket is not to say "please cook the cookbook" but to actually discuss the best path forward, for Wikidough and durum of course, but keeping the other VMs in mind. (This is mostly a continuation of the conversation Riccardo and I had on IRC.)

Change 779531 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: make doh and durum hosts use the bullseye installer

https://gerrit.wikimedia.org/r/779531

Change 779936 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] dnsrecursor: refactor module (see detailed commit message)

https://gerrit.wikimedia.org/r/779936

ssingh triaged this task as Medium priority. Apr 21 2022, 12:05 PM

Change 779531 abandoned by Dzahn:

[operations/puppet@production] DHCP: make doh and durum hosts use the bullseye installer

Reason:

https://gerrit.wikimedia.org/r/779531

Change 779936 merged by Ssingh:

[operations/puppet@production] dnsrecursor: refactor module (see detailed commit message)

https://gerrit.wikimedia.org/r/779936

Change 779531 restored by Ssingh:

[operations/puppet@production] DHCP: make doh and durum hosts use the bullseye installer

https://gerrit.wikimedia.org/r/779531

Change 793495 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] aptrepo: add a component for dnsdist/pdns-recursor for bullseye

https://gerrit.wikimedia.org/r/793495

Change 793495 merged by Ssingh:

[operations/puppet@production] aptrepo: add a component for dnsdist/pdns-recursor for bullseye

https://gerrit.wikimedia.org/r/793495

Mentioned in SAL (#wikimedia-operations) [2022-05-30T13:03:38Z] <sukhe> upload pdns-recursor_4.6.2-1wm1 to apt.wm.o (bullseye) - T305589

Mentioned in SAL (#wikimedia-operations) [2022-05-30T13:14:28Z] <sukhe> upload dnsdist_1.7.1-1wm1 to apt.wm.o (bullseye) - T305589

Change 779531 merged by Ssingh:

[operations/puppet@production] DHCP: make doh and durum hosts use the bullseye installer

https://gerrit.wikimedia.org/r/779531

Mentioned in SAL (#wikimedia-operations) [2022-07-13T17:54:30Z] <sukhe> upload dnsdist_1.7.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589

Mentioned in SAL (#wikimedia-operations) [2022-07-13T18:20:32Z] <sukhe> upload pdns-recursor_4.6.2-1+wmf11u1 to apt.wm.org (bullseye) - T305589

Mentioned in SAL (#wikimedia-operations) [2022-11-21T18:44:14Z] <sukhe> reprepro -C component/dnsdist include bullseye-wikimedia dnsdist_1.7.2-1+wmf11u1_amd64.changes: T305589

ssingh claimed this task.

Closing this in favour of T321309, where this work is now being tracked, and also given that the Ganeti reimaging cookbook now exists, which was the primary motivation behind this task.